Showing 1–7 of 7 results for author: Bonato, T

Searching in archive cs.
  1. arXiv:2507.04786

    cs.DC

    Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms

    Authors: Zhiyi Hu, Siyuan Shen, Tommaso Bonato, Sylvain Jeaugey, Cedell Alexander, Eric Spada, Jeff Hammond, Torsten Hoefler

    Abstract: The NVIDIA Collective Communication Library (NCCL) is a critical software layer enabling high-performance collectives on large-scale GPU clusters. Despite being open source with a documented API, its internal design remains largely opaque. The orchestration of communication channels, selection of protocols, and handling of memory movement across devices and nodes are not well understood, making it…

    Submitted 7 July, 2025; originally announced July 2025.

    ACM Class: C.2

  2. arXiv:2506.21406

    cs.NI

    Flowcut Switching: High-Performance Adaptive Routing with In-Order Delivery Guarantees

    Authors: Tommaso Bonato, Daniele De Sensi, Salvatore Di Girolamo, Abdulla Bataineh, David Hewson, Duncan Roweth, Torsten Hoefler

    Abstract: Network latency severely impacts the performance of applications running on supercomputers. Adaptive routing algorithms route packets over different available paths to reduce latency and improve network utilization. However, if a switch routes packets belonging to the same network flow on different paths, they might arrive at the destination out-of-order due to differences in the latency of these…

    Submitted 26 June, 2025; originally announced June 2025.

  3. arXiv:2505.08936

    cs.DC

    ATLAHS: An Application-centric Network Simulator Toolchain for AI, HPC, and Distributed Storage

    Authors: Siyuan Shen, Tommaso Bonato, Zhiyi Hu, Pasquale Jordan, Tiancheng Chen, Torsten Hoefler

    Abstract: Network simulators play a crucial role in evaluating the performance of large-scale systems. However, existing simulators rely heavily on synthetic microbenchmarks or narrowly focus on specific domains, limiting their ability to provide comprehensive performance insights. In this work, we introduce ATLAHS, a flexible, extensible, and open-source toolchain designed to trace real-world applications…

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: 13 pages

    ACM Class: I.6.3

  4. arXiv:2407.21625

    cs.NI

    ARCANE: Adaptive Routing with Caching and Aware Network Exploration

    Authors: Tommaso Bonato, Abdul Kabbani, Ahmad Ghalayini, Michael Papamichael, Mohammad Dohadwala, Lukas Gianinazzi, Mikhail Khalilov, Elias Achermann, Daniele De Sensi, Torsten Hoefler

    Abstract: Next-generation datacenters require highly efficient network load balancing to manage the growing scale of artificial intelligence (AI) training and general datacenter traffic. Existing solutions designed for Ethernet, such as Equal Cost Multi-Path (ECMP) and oblivious packet spraying (OPS), struggle to maintain high network utilizations as datacenter topologies (and network failures as a conseque…

    Submitted 23 May, 2025; v1 submitted 31 July, 2024; originally announced July 2024.

  5. arXiv:2404.01630

    cs.NI

    FASTFLOW: Flexible Adaptive Congestion Control for High-Performance Datacenters

    Authors: Tommaso Bonato, Abdul Kabbani, Daniele De Sensi, Rong Pan, Yanfang Le, Costin Raiciu, Mark Handley, Timo Schneider, Nils Blach, Ahmad Ghalayini, Daniel Alves, Michael Papamichael, Adrian Caulfield, Torsten Hoefler

    Abstract: The increasing demand of machine learning (ML) workloads in datacenters places significant stress on current congestion control (CC) algorithms, many of which struggle to maintain performance at scale. These workloads generate bursty, synchronized traffic that requires both rapid response and fairness across flows. Unfortunately, existing CC algorithms that rely heavily on delay as a primary conge…

    Submitted 20 September, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

  6. arXiv:2401.09356

    cs.DC cs.LG cs.NI cs.PF

    Swing: Short-cutting Rings for Higher Bandwidth Allreduce

    Authors: Daniele De Sensi, Tommaso Bonato, David Saam, Torsten Hoefler

    Abstract: The allreduce collective operation accounts for a significant fraction of the runtime of workloads running on distributed systems. One factor determining its performance is the distance between communicating nodes, especially on networks like torus, where a higher distance implies multiple messages being forwarded on the same link, thus reducing the allreduce bandwidth. Torus networks are widely u…

    Submitted 4 March, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

    ACM Class: C.2.4; C.2.2

    Journal ref: NSDI 2024

  7. arXiv:2209.01346

    cs.DC cs.AI cs.AR cs.NI cs.PF

    HammingMesh: A Network Topology for Large-Scale Deep Learning

    Authors: Torsten Hoefler, Tommaso Bonato, Daniele De Sensi, Salvatore Di Girolamo, Shigang Li, Marco Heddes, Jon Belk, Deepak Goel, Miguel Castro, Steve Scott

    Abstract: Numerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks that in turn fueled the AI revolution. With the exhaustion of such optimizations, the growth of modern AI is now gated by the performance of training systems, especially their data movement. Instead of focusing on single accelerators, we investigate data-movement characteristics of large-scale t…

    Submitted 21 October, 2022; v1 submitted 3 September, 2022; originally announced September 2022.

    Comments: Published at ACM/IEEE Supercomputing (SC22)