Showing 1–18 of 18 results for author: Georganas, E

  1. arXiv:2505.19349  [pdf, ps, other]

    cs.AR

    DECA: A Near-Core LLM Decompression Accelerator Supporting Out-of-Order Invocation

    Authors: Gerasimos Gerogiannis, Stijn Eyerman, Evangelos Georganas, Wim Heirman, Josep Torrellas

    Abstract: To alleviate the memory bandwidth bottleneck in Large Language Model (LLM) inference workloads, weight matrices are stored in memory in quantized and sparsified formats. Hence, before tiles of these matrices can be processed by in-core generalized matrix multiplication (GeMM) hardware engines, they need to be dequantized and de-sparsified. This is currently performed in software with vector operat…

    Submitted 25 May, 2025; originally announced May 2025.

  2. arXiv:2503.13565  [pdf, other]

    cs.CL cs.AI cs.LG

    ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts

    Authors: Evangelos Georganas, Dhiraj Kalamkar, Alexander Kozlov, Alexander Heinecke

    Abstract: Speculative decoding (SD) has emerged as a method to accelerate LLM inference without sacrificing any accuracy over the 16-bit model inference. In a typical SD setup, the idea is to use a full-precision, small, fast model as "draft" to generate the next few tokens and use the "target" large model to verify the draft-generated tokens. The efficacy of this method heavily relies on the acceptance rat…

    Submitted 17 March, 2025; originally announced March 2025.

  3. arXiv:2404.15204  [pdf, other]

    cs.PL cs.AI cs.AR cs.DC cs.LG

    Towards a high-performance AI compiler with upstream MLIR

    Authors: Renato Golin, Lorenzo Chelini, Adam Siemieniuk, Kavitha Madhu, Niranjan Hasabnis, Hans Pabst, Evangelos Georganas, Alexander Heinecke

    Abstract: This work proposes a compilation flow using open-source compiler passes to build a framework to achieve ninja performance from a generic linear algebra high-level abstraction. We demonstrate this flow with a proof-of-concept MLIR project that uses input IR in Linalg-on-Tensor from TensorFlow and PyTorch, performs cache-level optimizations and lowering to micro-kernels for efficient vectorization,…

    Submitted 15 April, 2024; originally announced April 2024.

    Comments: 13 pages, 8 figures, presented at CGO C4ML 2024 & MLIR Workshop EuroLLVM 2024

  4. arXiv:2304.12576  [pdf, other]

    cs.DC cs.AI

    Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures

    Authors: Evangelos Georganas, Dhiraj Kalamkar, Kirill Voronin, Abhisek Kundu, Antonio Noack, Hans Pabst, Alexander Breuer, Alexander Heinecke

    Abstract: During the past decade, Deep Learning (DL) algorithms, programming systems and hardware have converged with the High Performance Computing (HPC) counterparts. Nevertheless, the programming methodology of DL and HPC systems is stagnant, relying on highly-optimized, yet platform-specific and inflexible vendor-optimized libraries. Such libraries provide close-to-peak performance on specific platforms…

    Submitted 15 March, 2024; v1 submitted 25 April, 2023; originally announced April 2023.

  5. arXiv:2204.10943  [pdf, other]

    cs.DC cs.AI

    FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems

    Authors: Rui Ma, Evangelos Georganas, Alexander Heinecke, Andrew Boutros, Eriko Nurvitadhi

    Abstract: Rapid advances in artificial intelligence (AI) technology have led to significant accuracy improvements in a myriad of application domains at the cost of larger and more compute-intensive models. Training such models on massive amounts of data typically requires scaling to many compute nodes and relies heavily on collective communication algorithms, such as all-reduce, to exchange the weight gradi…

    Submitted 22 April, 2022; originally announced April 2022.

    Comments: 5 pages, 4 figures

  6. arXiv:2104.08002  [pdf, other]

    cs.LG cs.AI cs.DC

    Efficient and Generic 1D Dilated Convolution Layer for Deep Learning

    Authors: Narendra Chaudhary, Sanchit Misra, Dhiraj Kalamkar, Alexander Heinecke, Evangelos Georganas, Barukh Ziv, Menachem Adelman, Bharat Kaul

    Abstract: Convolutional neural networks (CNNs) have found many applications in tasks involving two-dimensional (2D) data, such as image classification and image processing. Therefore, 2D convolution layers have been heavily optimized on CPUs and GPUs. However, in many applications, for example genomics and speech recognition, the data can be one-dimensional (1D). Such applications can benefit from optimize…

    Submitted 16 April, 2021; originally announced April 2021.

  7. arXiv:2104.06700  [pdf, other]

    cs.LG cs.DC

    DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks

    Authors: Vasimuddin Md, Sanchit Misra, Guixiang Ma, Ramanarayan Mohanty, Evangelos Georganas, Alexander Heinecke, Dhiraj Kalamkar, Nesreen K. Ahmed, Sasikanth Avancha

    Abstract: Full-batch training on Graph Neural Networks (GNN) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible. It is challenging due to large memory capacity and bandwidth requirements on a single compute node and high communication volumes across multiple nodes. In this paper, we present DistGNN that optimizes the well-known Deep G…

    Submitted 16 April, 2021; v1 submitted 14 April, 2021; originally announced April 2021.

  8. Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning & HPC Workloads

    Authors: Evangelos Georganas, Dhiraj Kalamkar, Sasikanth Avancha, Menachem Adelman, Deepti Aggarwal, Cristina Anderson, Alexander Breuer, Jeremy Bruestle, Narendra Chaudhary, Abhisek Kundu, Denise Kutnick, Frank Laub, Vasimuddin Md, Sanchit Misra, Ramanarayan Mohanty, Hans Pabst, Brian Retford, Barukh Ziv, Alexander Heinecke

    Abstract: During the past decade, novel Deep Learning (DL) algorithms, workloads and hardware have been developed to tackle a wide range of problems. Despite the advances in workload and hardware ecosystems, the programming methodology of DL systems is stagnant. DL workloads leverage either highly-optimized, yet platform-specific and inflexible kernels from DL libraries, or in the case of novel operators, r…

    Submitted 30 November, 2021; v1 submitted 12 April, 2021; originally announced April 2021.

  9. arXiv:2005.04680  [pdf, other]

    cs.DC cs.IR cs.LG cs.PF

    Optimizing Deep Learning Recommender Systems' Training On CPU Cluster Architectures

    Authors: Dhiraj Kalamkar, Evangelos Georganas, Sudarshan Srinivasan, Jianping Chen, Mikhail Shiryaev, Alexander Heinecke

    Abstract: During the last two years, the goal of many researchers has been to squeeze the last bit of performance out of HPC systems for AI tasks. Often this discussion is held in the context of how fast ResNet50 can be trained. Unfortunately, ResNet50 is no longer a representative workload in 2020. Thus, we focus on Recommender Systems which account for most of the AI cycles in cloud computing centers. More…

    Submitted 10 May, 2020; originally announced May 2020.

  10. The Parallelism Motifs of Genomic Data Analysis

    Authors: Katherine Yelick, Aydin Buluc, Muaaz Awan, Ariful Azad, Benjamin Brock, Rob Egan, Saliya Ekanayake, Marquita Ellis, Evangelos Georganas, Giulia Guidi, Steven Hofmeyr, Oguz Selvitopi, Cristina Teodoropol, Leonid Oliker

    Abstract: Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share this data with the research community, but some of these genomic data analysis problems require large scale computational platforms to meet both the memory and computational requirements. These applications differ from…

    Submitted 20 January, 2020; originally announced January 2020.

  11. arXiv:1906.06440  [pdf, other]

    cs.LG cs.DC stat.ML

    High-Performance Deep Learning via a Single Building Block

    Authors: Evangelos Georganas, Kunal Banerjee, Dhiraj Kalamkar, Sasikanth Avancha, Anand Venkat, Michael Anderson, Greg Henry, Hans Pabst, Alexander Heinecke

    Abstract: Deep learning (DL) is one of the most prominent branches of machine learning. Due to the immense computational cost of DL workloads, industry and academia have developed DL libraries with highly-specialized kernels for each workload/architecture, leading to numerous, complex code-bases that strive for performance, yet they are hard to maintain and do not generalize. In this work, we introduce the…

    Submitted 17 June, 2019; v1 submitted 14 June, 2019; originally announced June 2019.

  12. arXiv:1905.12322  [pdf, other]

    cs.LG stat.ML

    A Study of BFLOAT16 for Deep Learning Training

    Authors: Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, Jiyan Yang, Jongsoo Park, Alexander Heinecke, Evangelos Georganas, Sudarshan Srinivasan, Abhisek Kundu, Misha Smelyanskiy, Bharat Kaul, Pradeep Dubey

    Abstract: This paper presents the first comprehensive empirical study demonstrating the efficacy of the Brain Floating Point (BFLOAT16) half-precision format for Deep Learning training across image classification, speech recognition, language modeling, generative networks and industrial recommendation systems. BFLOAT16 is attractive for Deep Learning training for two reasons: the range of values it can repr…

    Submitted 13 June, 2019; v1 submitted 29 May, 2019; originally announced May 2019.

  13. arXiv:1810.09958  [pdf, other]

    cs.DC cs.LG

    ISA Mapper: A Compute and Hardware Agnostic Deep Learning Compiler

    Authors: Matthew Sotoudeh, Anand Venkat, Michael Anderson, Evangelos Georganas, Alexander Heinecke, Jason Knight

    Abstract: Domain specific accelerators present new challenges and opportunities for code generation onto novel instruction sets, communication fabrics, and memory architectures. In this paper we introduce an intermediate representation (IR) which enables both deep learning computational kernels and hardware capabilities to be described in the same IR. We then formulate and apply instruction mapping to det…

    Submitted 12 October, 2018; originally announced October 2018.

  14. arXiv:1809.07014  [pdf, other]

    cs.DC q-bio.GN

    Extreme Scale De Novo Metagenome Assembly

    Authors: Evangelos Georganas, Rob Egan, Steven Hofmeyr, Eugene Goltsman, Bill Arndt, Andrew Tritt, Aydin Buluc, Leonid Oliker, Katherine Yelick

    Abstract: Metagenome assembly is the process of transforming a set of short, overlapping, and potentially erroneous DNA segments from environmental samples into the accurate representation of the underlying microbiome's genomes. State-of-the-art tools require big shared memory machines and cannot handle contemporary metagenome datasets that exceed Terabytes in size. In this paper, we introduce the MetaHipM…

    Submitted 19 September, 2018; originally announced September 2018.

    Comments: Accepted to SC18

  15. arXiv:1808.05567  [pdf, other]

    cs.DC

    Anatomy Of High-Performance Deep Learning Convolutions On SIMD Architectures

    Authors: Evangelos Georganas, Sasikanth Avancha, Kunal Banerjee, Dhiraj Kalamkar, Greg Henry, Hans Pabst, Alexander Heinecke

    Abstract: Convolution layers are prevalent in many classes of deep neural networks, including Convolutional Neural Networks (CNNs) which provide state-of-the-art results for tasks like image recognition, neural machine translation and speech recognition. The computationally expensive nature of a convolution operation has led to the proliferation of implementations including matrix-matrix multiplication form…

    Submitted 20 August, 2018; v1 submitted 16 August, 2018; originally announced August 2018.

    Comments: Accepted to SC18

  16. arXiv:1802.00930  [pdf, other]

    cs.NE cs.LG math.NA

    Mixed Precision Training of Convolutional Neural Networks using Integer Operations

    Authors: Dipankar Das, Naveen Mellempudi, Dheevatsa Mudigere, Dhiraj Kalamkar, Sasikanth Avancha, Kunal Banerjee, Srinivas Sridharan, Karthik Vaidyanathan, Bharat Kaul, Evangelos Georganas, Alexander Heinecke, Pradeep Dubey, Jesus Corbal, Nikita Shustrov, Roma Dubtsov, Evarist Fomenko, Vadim Pirogov

    Abstract: The state-of-the-art (SOTA) for mixed precision training is dominated by variants of low precision floating point operations, and in particular, FP16 accumulating into FP32 (Micikevicius et al., 2017). On the other hand, while a lot of research has also happened in the domain of low and mixed-precision Integer training, these works either present results for non-SOTA networks (for instance only Ale…

    Submitted 23 February, 2018; v1 submitted 3 February, 2018; originally announced February 2018.

    Comments: Published as a conference paper at ICLR 2018

  17. arXiv:1705.11147  [pdf, other]

    cs.DC

    Extreme-Scale De Novo Genome Assembly

    Authors: Evangelos Georganas, Steven Hofmeyr, Rob Egan, Aydin Buluc, Leonid Oliker, Daniel Rokhsar, Katherine Yelick

    Abstract: De novo whole genome assembly reconstructs genomic sequence from short, overlapping, and potentially erroneous DNA segments and is one of the most important computations in modern genomics. This work presents HipMER, a high-quality end-to-end de novo assembler designed for extreme scale analysis, via efficient parallelization of the Meraculous code. Genome assembly software has many components, ea…

    Submitted 31 May, 2017; originally announced May 2017.

    Comments: To appear as a chapter in Exascale Scientific Applications: Programming Approaches for Scalability, Performance, and Portability, Straatsma, Antypas, Williams (editors), CRC Press, 2017

  18. arXiv:1402.1285  [pdf, other]

    cs.DC cs.MS cs.PF

    Constructing Performance Models for Dense Linear Algebra Algorithms on Cray XE Systems

    Authors: Jorge González-Domínguez, Evangelos Georganas, Yili Zheng, María J. Martín

    Abstract: Hiding or minimizing the communication cost is key in order to obtain good performance on large-scale systems. While communication overlapping attempts to hide communications cost, 2.5D communication avoiding algorithms improve performance scalability by reducing the volume of data transfers at the cost of extra memory usage. Both approaches can be used together or separately and the best choice d…

    Submitted 6 February, 2014; originally announced February 2014.