Showing 1–3 of 3 results for author: Schour, L

  1. arXiv:2410.17078

    cs.NI cs.DC

    FlowTracer: A Tool for Uncovering Network Path Usage Imbalance in AI Training Clusters

    Authors: Hasibul Jamil, Abdul Alim, Laurent Schares, Pavlos Maniotis, Liran Schour, Ali Sydney, Abdullah Kayi, Tevfik Kosar, Bengi Karacali

    Abstract: The increasing complexity of AI workloads, especially distributed Large Language Model (LLM) training, places significant strain on the networking infrastructure of parallel data centers and supercomputing systems. While Equal-Cost Multi-Path (ECMP) routing distributes traffic over parallel paths, hash collisions often lead to imbalanced network resource utilization and performance bottlenecks. T…

    Submitted 24 October, 2024; v1 submitted 22 October, 2024; originally announced October 2024.

    Comments: Submitted for peer reviewing in IEEE ICC 2025

  2. arXiv:2407.05467

    cs.DC cs.AI

    The infrastructure powering IBM's Gen AI model development

    Authors: Talia Gershon, Seetharami Seelam, Brian Belgodere, Milton Bonilla, Lan Hoang, Danny Barnett, I-Hsin Chung, Apoorve Mohan, Ming-Hung Chen, Lixiang Luo, Robert Walkup, Constantinos Evangelinos, Shweta Salaria, Marc Dombrowa, Yoonho Park, Apo Kayi, Liran Schour, Alim Alim, Ali Sydney, Pavlos Maniotis, Laurent Schares, Bernard Metzler, Bengi Karacali-Akyamac, Sophia Wen, Tatsuhiro Chiba, et al. (122 additional authors not shown)

    Abstract: AI Infrastructure plays a key role in the speed and cost-competitiveness of developing and deploying advanced AI models. The current demand for powerful AI infrastructure for model training is driven by the emergence of generative AI and foundational models, where on occasion thousands of GPUs must cooperate on a single training job for the model to be trained in a reasonable time. Delivering effi…

    Submitted 13 January, 2025; v1 submitted 7 July, 2024; originally announced July 2024.

    Comments: Corresponding Authors: Talia Gershon, Seetharami Seelam, Brian Belgodere, Milton Bonilla

  3. C-Share: Optical Circuits Sharing for Software-Defined Data-Centers

    Authors: Yaniv Ben-Itzhak, Cosmin Caba, Liran Schour, Shay Vargaftik

    Abstract: Integrating optical circuit switches in data-centers is an ongoing research challenge. In recent years, state-of-the-art solutions introduce hybrid packet/circuit architectures for different optical circuit switch technologies, control techniques, and traffic rerouting methods. These solutions are based on separated packet and circuit planes which do not have the ability to utilize an optical cir…

    Submitted 15 September, 2016; originally announced September 2016.