An Extensible Software Transport Layer for GPU Networking
Authors:
Yang Zhou,
Zhongjie Chen,
Ziming Mao,
ChonLam Lao,
Shuo Yang,
Pravein Govindan Kannan,
Jiaqi Gao,
Yilong Zhao,
Yongji Wu,
Kaichao You,
Fengyuan Ren,
Zhiying Xu,
Costin Raiciu,
Ion Stoica
Abstract:
Fast-evolving machine learning (ML) workloads have increasing requirements for networking. However, host network transport on RDMA NICs is hard to evolve, causing problems for ML workloads. For example, single-path RDMA traffic is prone to flow collisions that severely degrade collective communication performance. We present UCCL, an extensible software transport layer to evolve GPU networking. UCCL decouples the data path and control path of existing RDMA NICs and efficiently runs the control-path transport on host CPUs. This software extensibility brings in transport innovations that cannot be achieved in hardware for ML workloads, e.g., a multipath transport to resolve flow collisions. ML collectives atop UCCL achieve up to 3.3x higher performance compared to an industry solution.
Submitted 24 April, 2025;
originally announced April 2025.
Benchmarking tunnel and encryption methodologies in cloud environments
Authors:
Pravein Govindan Kannan,
Brent Salisbury,
Palanivel Kodeswaran,
Sayandeep Sen
Abstract:
The recent past has seen the adoption of multi-cloud deployments by enterprises due to availability, features, and regulatory requirements. A typical deployment involves parts of an application/workloads running inside a private cloud with the other parts spread across multiple on-prem/public clouds. Typical cluster-to-cluster networking in such deployments involves the establishment of site-to-site encrypted tunnels to connect the workloads.
In this report, we benchmark the performance of various tunneling and encryption technologies to provide directions on their use in multi-cloud deployments. Based on the various experiments conducted on three different testbeds, we present quantifiable data which can be leveraged by operators and cloud providers tasked with design and development decisions of multi-cloud network connectivity and orchestration.
Submitted 4 March, 2022;
originally announced March 2022.