Skip to main content

Showing 1–5 of 5 results for author: Hosmer, B

Searching in archive cs. Search in all archives.
.
  1. arXiv:2410.00215  [pdf, other

    cs.LG

    Characterizing and Efficiently Accelerating Multimodal Generation Model Inference

    Authors: Yejin Lee, Anna Sun, Basil Hosmer, Bilge Acun, Can Balioglu, Changhan Wang, Charles David Hernandez, Christian Puhrsch, Daniel Haziza, Driss Guessous, Francisco Massa, Jacob Kahn, Jeffrey Wan, Jeremy Reizenstein, Jiaqi Zhai, Joe Isaacson, Joel Schlosser, Juan Pino, Kaushik Ram Sadagopan, Leonid Shamis, Linjian Ma, Min-Jae Hwang, Mingda Chen, Mostafa Elhoushi, Pedro Rodriguez , et al. (5 additional authors not shown)

    Abstract: Generative artificial intelligence (AI) technology is revolutionizing the computing industry. Not only its applications have broadened to various sectors but also poses new system design and optimization opportunities. The technology is capable of understanding and responding in multiple modalities. However, the advanced capability currently comes with significant system resource demands. To susta… ▽ More

    Submitted 9 May, 2025; v1 submitted 30 September, 2024; originally announced October 2024.

    Comments: 13 pages including references. 8 Figures. Under review to HPCA 2025 Industry Track

  2. arXiv:2405.02803  [pdf, other

    cs.LG cs.DC

    Is Flash Attention Stable?

    Authors: Alicia Golden, Samuel Hsia, Fei Sun, Bilge Acun, Basil Hosmer, Yejin Lee, Zachary DeVito, Jeff Johnson, Gu-Yeon Wei, David Brooks, Carole-Jean Wu

    Abstract: Training large-scale machine learning models poses distinct system challenges, given both the size and complexity of today's workloads. Recently, many organizations training state-of-the-art Generative AI models have reported cases of instability during training, often taking the form of loss spikes. Numeric deviation has emerged as a potential cause of this training instability, although quantify… ▽ More

    Submitted 4 May, 2024; originally announced May 2024.

  3. LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding

    Authors: Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu

    Abstract: We present LayerSkip, an end-to-end solution to speed-up inference of large language models (LLMs). First, during training we apply layer dropout, with low dropout rates for earlier layers and higher dropout rates for later layers, and an early exit loss where all transformer layers share the same exit. Second, during inference, we show that this training recipe increases the accuracy of early exi… ▽ More

    Submitted 18 October, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

    Comments: ACL 2024

  4. arXiv:2403.08058  [pdf, other

    cs.LG cs.CL

    CHAI: Clustered Head Attention for Efficient LLM Inference

    Authors: Saurabh Agarwal, Bilge Acun, Basil Hosmer, Mostafa Elhoushi, Yejin Lee, Shivaram Venkataraman, Dimitris Papailiopoulos, Carole-Jean Wu

    Abstract: Large Language Models (LLMs) with hundreds of billions of parameters have transformed the field of machine learning. However, serving these models at inference time is both compute and memory intensive, where a single request can require multiple GPUs and tens of Gigabytes of memory. Multi-Head Attention is one of the key components of LLMs, which can account for over 50% of LLMs memory and comput… ▽ More

    Submitted 27 April, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

  5. arXiv:2312.14385  [pdf, other

    cs.DC cs.LG cs.MM

    Generative AI Beyond LLMs: System Implications of Multi-Modal Generation

    Authors: Alicia Golden, Samuel Hsia, Fei Sun, Bilge Acun, Basil Hosmer, Yejin Lee, Zachary DeVito, Jeff Johnson, Gu-Yeon Wei, David Brooks, Carole-Jean Wu

    Abstract: As the development of large-scale Generative AI models evolve beyond text (1D) generation to include image (2D) and video (3D) generation, processing spatial and temporal information presents unique challenges to quality, performance, and efficiency. We present the first work towards understanding this new system design space for multi-modal text-to-image (TTI) and text-to-video (TTV) generation m… ▽ More

    Submitted 5 May, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

    Comments: Published at 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)