-
FlowTracer: A Tool for Uncovering Network Path Usage Imbalance in AI Training Clusters
Authors:
Hasibul Jamil,
Abdul Alim,
Laurent Schares,
Pavlos Maniotis,
Liran Schour,
Ali Sydney,
Abdullah Kayi,
Tevfik Kosar,
Bengi Karacali
Abstract:
The increasing complexity of AI workloads, especially distributed Large Language Model (LLM) training, places significant strain on the networking infrastructure of parallel data centers and supercomputing systems. While Equal-Cost Multi-Path (ECMP) routing distributes traffic over parallel paths, hash collisions often lead to imbalanced network resource utilization and performance bottlenecks. This paper presents FlowTracer, a tool designed to analyze network path utilization and evaluate different routing strategies. FlowTracer aids in debugging network inefficiencies by providing detailed visibility into traffic distribution and helping to identify the root causes of performance degradation, such as issues caused by hash collisions. By offering flow-level insights, FlowTracer enables system operators to optimize routing, reduce congestion, and improve the performance of distributed AI workloads. We use a RoCEv2-enabled cluster with a leaf-spine network and sixteen 400-Gbps nodes to demonstrate how FlowTracer can be used to compare the flow imbalances of ECMP routing against a statically configured network. The example showcases a 30% reduction in imbalance, as measured by a new metric we introduce.
Submitted 24 October, 2024; v1 submitted 22 October, 2024;
originally announced October 2024.
-
The infrastructure powering IBM's Gen AI model development
Authors:
Talia Gershon,
Seetharami Seelam,
Brian Belgodere,
Milton Bonilla,
Lan Hoang,
Danny Barnett,
I-Hsin Chung,
Apoorve Mohan,
Ming-Hung Chen,
Lixiang Luo,
Robert Walkup,
Constantinos Evangelinos,
Shweta Salaria,
Marc Dombrowa,
Yoonho Park,
Apo Kayi,
Liran Schour,
Alim Alim,
Ali Sydney,
Pavlos Maniotis,
Laurent Schares,
Bernard Metzler,
Bengi Karacali-Akyamac,
Sophia Wen,
Tatsuhiro Chiba,
et al. (122 additional authors not shown)
Abstract:
AI Infrastructure plays a key role in the speed and cost-competitiveness of developing and deploying advanced AI models. The current demand for powerful AI infrastructure for model training is driven by the emergence of generative AI and foundational models, where on occasion thousands of GPUs must cooperate on a single training job for the model to be trained in a reasonable time. Delivering efficient and high-performing AI training requires an end-to-end solution that combines hardware, software and holistic telemetry to cater for multiple types of AI workloads. In this report, we describe IBM's hybrid cloud infrastructure that powers our generative AI model development. This infrastructure includes (1) Vela: an AI-optimized supercomputing capability directly integrated into the IBM Cloud, delivering scalable, dynamic, multi-tenant and geographically distributed infrastructure for large-scale model training and other AI workflow steps and (2) Blue Vela: a large-scale, purpose-built, on-premises hosting environment that is optimized to support our largest and most ambitious AI model training tasks. Vela provides IBM with the dual benefit of high performance for internal use along with the flexibility to adapt to an evolving commercial landscape. Blue Vela provides us with the benefits of rapid development of our largest and most ambitious models, as well as future-proofing against the evolving model landscape in the industry. Taken together, they provide IBM with the ability to rapidly innovate in the development of both AI models and commercial offerings.
Submitted 13 January, 2025; v1 submitted 7 July, 2024;
originally announced July 2024.
-
C-Share: Optical Circuits Sharing for Software-Defined Data-Centers
Authors:
Yaniv Ben-Itzhak,
Cosmin Caba,
Liran Schour,
Shay Vargaftik
Abstract:
Integrating optical circuit switches in data-centers is an ongoing research challenge. In recent years, state-of-the-art solutions have introduced hybrid packet/circuit architectures for different optical circuit switch technologies, control techniques, and traffic rerouting methods. These solutions are based on separated packet and circuit planes, which cannot utilize an optical circuit for flows that neither arrive from nor are delivered to switches directly connected to the circuit's end-points. Moreover, current SDN-based elephant flow rerouting methods require a forwarding rule for each flow, which raises scalability issues. In this paper, we present C-Share -- a practical, scalable SDN-based circuit sharing solution for data center networks. C-Share inherently enables elephant flows to share optical circuits by exploiting a flat upper tier network topology. C-Share is based on a scalable and decoupled SDN-based elephant flow rerouting method comprising elephant flow detection, tagging and identification, which relies on a prevalent network sampling method (e.g., sFlow). C-Share requires only a single OpenFlow rule for each optical circuit, and therefore significantly reduces the required OpenFlow rule entry footprint and setup rule rate. It also mitigates the OpenFlow outbound latency for subsequent elephant flows. We implement a proof-of-concept system for C-Share based on Mininet, and test the scalability of C-Share by using an event-driven simulation. Our results show a consistent increase in the mice/elephant flow separation in the network which, in turn, improves both network throughput and flow completion time.
Submitted 15 September, 2016;
originally announced September 2016.