SDR-RDMA: Software-Defined Reliability Architecture for Planetary Scale RDMA Communication
Authors:
Mikhail Khalilov,
Siyuan Shen,
Marcin Chrapek,
Tiancheng Chen,
Kenji Nakano,
Peter-Jan Gootzen,
Salvatore Di Girolamo,
Rami Nudelman,
Gil Bloch,
Sreevatsa Anantharamu,
Mahmoud Elhaddad,
Jithin Jose,
Abdul Kabbani,
Scott Moe,
Konstantin Taranov,
Zhuolong Yu,
Jie Zhang,
Nicola Mazzoletti,
Torsten Hoefler
Abstract:
RDMA is vital for efficient distributed training across datacenters, but millisecond-scale latencies complicate the design of its reliability layer. We show that depending on long-haul link characteristics, such as drop rate, distance, and bandwidth, the widely used Selective Repeat algorithm can be inefficient, warranting alternatives like Erasure Coding. To enable such alternatives on existing hardware, we propose SDR-RDMA, a software-defined reliability stack for RDMA. Its core is a lightweight SDR SDK that extends standard point-to-point RDMA semantics -- fundamental to AI networking stacks -- with a receive buffer bitmap. The SDR bitmap enables partial message completion, letting applications implement custom reliability schemes tailored to specific deployments while preserving zero-copy RDMA benefits. By offloading the SDR backend to NVIDIA's Data Path Accelerator (DPA), we achieve line-rate performance, enabling efficient inter-datacenter communication and advancing reliability innovation for inter-datacenter training.
Submitted 10 May, 2025; v1 submitted 8 May, 2025;
originally announced May 2025.
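The receive buffer bitmap mentioned in the SDR-RDMA abstract can be pictured with a small sketch. This is not the SDR SDK API, which the abstract does not describe. The class `ReceiveBitmap`, its methods, and the fixed chunk count are hypothetical names chosen for illustration. The idea shown is only that a per-message bitmap lets an application see which chunks have arrived and decide for itself what to do about the rest, for example retransmit them selectively.

```python
from dataclasses import dataclass


@dataclass
class ReceiveBitmap:
    """Hypothetical per-message bitmap: bit i is set once chunk i has arrived."""

    num_chunks: int
    bits: int = 0

    def mark_received(self, chunk: int) -> None:
        # Reject indices outside the message so stray writes are not recorded.
        if not 0 <= chunk < self.num_chunks:
            raise IndexError(f"chunk {chunk} out of range")
        self.bits |= 1 << chunk

    def missing(self) -> list[int]:
        # Chunks whose bit is still clear; a reliability scheme could
        # request exactly these again, or recover them from parity data.
        return [i for i in range(self.num_chunks) if not (self.bits >> i) & 1]

    def is_complete(self) -> bool:
        return self.bits == (1 << self.num_chunks) - 1


# Assumed scenario: an 8-chunk message where chunks 3 and 6 were dropped.
bitmap = ReceiveBitmap(num_chunks=8)
for chunk in (0, 1, 2, 4, 5, 7):
    bitmap.mark_received(chunk)

missing_chunks = bitmap.missing()
```

In this sketch the application, not the transport, inspects `missing_chunks` and chooses the recovery policy, which is the kind of flexibility the abstract attributes to partial message completion.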
Bento and the Art of Repeated Research
Authors:
Peter-Jan Gootzen,
Animesh Trivedi
Abstract:
Bento provides a new approach to developing file systems, with safety and high-velocity development in mind. This is achieved by using Rust, a modern and memory-safe systems programming language, and by providing a framework to run a single file system implementation in kernel space with the VFS or in user space with FUSE. In this paper, the benchmarking experiments from the Bento paper are repeated. We fail to exactly reproduce the results of the Bento paper, but find broadly the same patterns, albeit with more outlying results. Additionally, we unsuccessfully run a standardized test suite, and expand the set of experiments with latency benchmarks and throughput benchmarks using a RAM block device. The latency benchmarks show that ext4 with journaling consistently outperforms Bento-fs, and the RAM throughput benchmarks show no additional consistent performance pattern. During this experimentation, a set of 12 bugs was encountered and analyzed. We find that the ratio of memory-related bugs is lower than in other systems programming projects that use C as opposed to Rust, thus supporting the claims of the Bento framework.
Submitted 13 December, 2021;
originally announced December 2021.