Showing 1–24 of 24 results for author: Giannoula, C

  1. Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization

    Authors: Zhanda Zhu, Christina Giannoula, Muralidhar Andoorveedu, Qidong Su, Karttikeya Mangalam, Bojian Zheng, Gennady Pekhimenko

    Abstract: Various parallelism, such as data, tensor, and pipeline parallelism, along with memory optimizations like activation checkpointing, redundancy elimination, and offloading, have been proposed to accelerate distributed training for Large Language Models. To find the best combination of these techniques, automatic distributed training systems are proposed. However, existing systems only tune a subset…

    Submitted 24 March, 2025; originally announced March 2025.

    Comments: Accepted by EuroSys 2025

  2. arXiv:2503.06433  [pdf, other]

    cs.DC cs.AI

    Seesaw: High-throughput LLM Inference via Model Re-sharding

    Authors: Qidong Su, Wei Zhao, Xin Li, Muralidhar Andoorveedu, Chenhao Jiang, Zhanda Zhu, Kevin Song, Christina Giannoula, Gennady Pekhimenko

    Abstract: To improve the efficiency of distributed large language model (LLM) inference, various parallelization strategies, such as tensor and pipeline parallelism, have been proposed. However, the distinct computational characteristics inherent in the two stages of LLM inference (prefilling and decoding) render a single static parallelization strategy insufficient for the effective optimization of both stag…

    Submitted 8 March, 2025; originally announced March 2025.

  3. arXiv:2502.15470  [pdf, other]

    cs.AR cs.AI cs.DC cs.LG

    PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System

    Authors: Yintao He, Haiyu Mao, Christina Giannoula, Mohammad Sadrosadati, Juan Gómez-Luna, Huawei Li, Xiaowei Li, Ying Wang, Onur Mutlu

    Abstract: Large language models (LLMs) are widely used for natural language understanding and text generation. An LLM model relies on a time-consuming step called LLM decoding to generate output tokens. Several prior works focus on improving the performance of LLM decoding using parallelism techniques, such as batching and speculative decoding. State-of-the-art LLM decoding has both compute-bound and memory…

    Submitted 27 February, 2025; v1 submitted 21 February, 2025; originally announced February 2025.

    Comments: To appear in ASPLOS 2025

  4. arXiv:2408.06995  [pdf, other]

    cs.CV

    Low-Bitwidth Floating Point Quantization for Efficient High-Quality Diffusion Models

    Authors: Cheng Chen, Christina Giannoula, Andreas Moshovos

    Abstract: Diffusion models are emerging models that generate images by iteratively denoising random Gaussian noise using deep neural networks. These models typically exhibit high computational and memory demands, necessitating effective post-training quantization for high-performance inference. Recent works propose low-bitwidth (e.g., 8-bit or 4-bit) quantization for diffusion models, however 4-bit integer…

    Submitted 13 August, 2024; originally announced August 2024.

  5. arXiv:2408.05810  [pdf, other]

    cs.AR

    Evaluating the Effectiveness of Microarchitectural Hardware Fault Detection for Application-Specific Requirements

    Authors: Konstantinos-Nikolaos Papadopoulos, Christina Giannoula, Nikolaos-Charalampos Papadopoulos, Nektarios Koziris, José M. G. Merayo, Dionisios N. Pnevmatikatos

    Abstract: Reliability is necessary in safety-critical applications spanning numerous domains. Conventional hardware-based fault tolerance techniques, such as component redundancy, ensure reliability, typically at the expense of significantly increased power consumption, and almost double (or more) hardware area. To mitigate these costs, microarchitectural fault tolerance methods try to lower overheads by le…

    Submitted 11 August, 2024; originally announced August 2024.

  6. arXiv:2407.00867  [other]

    cs.DC

    Proceedings of 3rd Workshop on Heterogeneous Composable and Disaggregated Systems

    Authors: Christian Pinto, Dong Li, Thaleia Dimitra Doudali, Christina Giannoula, Jie Ren

    Abstract: The future of computing systems is inevitably embracing a disaggregated and composable pattern: from clusters of computers to pools of resources that can be dynamically combined together and tailored around applications requirements. Transitioning to this new paradigm requires ground-breaking research, ranging from new hardware architectures up to new models and abstractions at all levels of the s…

    Submitted 22 April, 2024; originally announced July 2024.

    Comments: Proceedings of 3rd Workshop on Heterogeneous Composable and Disaggregated Systems

  7. arXiv:2406.06900  [pdf, other]

    cs.DC

    SmartPQ: An Adaptive Concurrent Priority Queue for NUMA Architectures

    Authors: Christina Giannoula, Foteini Strati, Dimitrios Siakavaras, Georgios Goumas, Nectarios Koziris

    Abstract: Concurrent priority queues are widely used in important workloads, such as graph applications and discrete event simulations. However, designing scalable concurrent priority queues for NUMA architectures is challenging. Even though several NUMA-oblivious implementations can scale up to a high number of threads, exploiting the potential parallelism of insert operation, NUMA-oblivious implementation…

    Submitted 10 June, 2024; originally announced June 2024.

  8. arXiv:2404.12512  [pdf, other]

    cs.CR cs.LG

    Proteus: Preserving Model Confidentiality during Graph Optimizations

    Authors: Yubo Gao, Maryam Haghifam, Christina Giannoula, Renbo Tu, Gennady Pekhimenko, Nandita Vijaykumar

    Abstract: Deep learning (DL) models have revolutionized numerous domains, yet optimizing them for computational efficiency remains a challenging endeavor. Development of new DL models typically involves two parties: the model developers and performance optimizers. The collaboration between the parties often necessitates the model developers exposing the model architecture and computational graph to the opti…

    Submitted 18 April, 2024; originally announced April 2024.

  9. arXiv:2402.16731  [pdf, other]

    cs.AR cs.DC cs.LG cs.PF

    PyGim: An Efficient Graph Neural Network Library for Real Processing-In-Memory Architectures

    Authors: Christina Giannoula, Peiming Yang, Ivan Fernandez, Jiacheng Yang, Sankeerth Durvasula, Yu Xin Li, Mohammad Sadrosadati, Juan Gomez Luna, Onur Mutlu, Gennady Pekhimenko

    Abstract: Graph Neural Networks (GNNs) are emerging ML models to analyze graph-structure data. Graph Neural Network (GNN) execution involves both compute-intensive and memory-intensive kernels, the latter dominates the total time, being significantly bottlenecked by data movement between memory and processors. Processing-In-Memory (PIM) systems can alleviate this data movement bottleneck by placing simple p…

    Submitted 6 April, 2025; v1 submitted 26 February, 2024; originally announced February 2024.

  10. arXiv:2401.06145  [pdf, other]

    cs.DC cs.CV cs.LG cs.PF

    Minuet: Accelerating 3D Sparse Convolutions on GPUs

    Authors: Jiacheng Yang, Christina Giannoula, Jun Wu, Mostafa Elhoushi, James Gleeson, Gennady Pekhimenko

    Abstract: Sparse Convolution (SC) is widely used for processing 3D point clouds that are inherently sparse. Different from dense convolution, SC preserves the sparsity of the input point cloud by only allowing outputs to specific locations. To efficiently compute SC, prior SC engines first use hash tables to build a kernel map that stores the necessary General Matrix Multiplication (GEMM) operations to be e…

    Submitted 1 December, 2023; originally announced January 2024.

  11. arXiv:2310.18813  [pdf, other]

    cs.LG cs.DC

    The Synergy of Speculative Decoding and Batching in Serving Large Language Models

    Authors: Qidong Su, Christina Giannoula, Gennady Pekhimenko

    Abstract: Large Language Models (LLMs) like GPT are state-of-the-art text generation models that provide significant assistance in daily routines. However, LLM execution is inherently sequential, since they only produce one token at a time, thus incurring low hardware utilization on modern GPUs. Batching and speculative decoding are two techniques to improve GPU hardware utilization in LLM inference. To stu…

    Submitted 28 October, 2023; originally announced October 2023.

  12. arXiv:2301.09674  [pdf, other]

    cs.AR cs.DC cs.PF

    Architectural Support for Efficient Data Movement in Disaggregated Systems

    Authors: Christina Giannoula, Kailong Huang, Jonathan Tang, Nectarios Koziris, Georgios Goumas, Zeshan Chishti, Nandita Vijaykumar

    Abstract: Resource disaggregation offers a cost effective solution to resource scaling, utilization, and failure-handling in data centers by physically separating hardware devices in a server. Servers are architected as pools of processor, memory, and storage devices, organized as independent failure-isolated components interconnected by a high-bandwidth network. A critical challenge, however, is the high p…

    Submitted 23 January, 2023; originally announced January 2023.

    Comments: To appear in the Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS) 2023 and the ACM SIGMETRICS 2023 conference. arXiv admin note: text overlap with arXiv:2301.00414

  13. arXiv:2301.00414  [pdf, other]

    cs.AR cs.DC cs.PF

    DaeMon: Architectural Support for Efficient Data Movement in Disaggregated Systems

    Authors: Christina Giannoula, Kailong Huang, Jonathan Tang, Nectarios Koziris, Georgios Goumas, Zeshan Chishti, Nandita Vijaykumar

    Abstract: Resource disaggregation offers a cost effective solution to resource scaling, utilization, and failure-handling in data centers by physically separating hardware devices in a server. Servers are architected as pools of processor, memory, and storage devices, organized as independent failure-isolated components interconnected by a high-bandwidth network. A critical challenge, however, is the high p…

    Submitted 18 January, 2023; v1 submitted 1 January, 2023; originally announced January 2023.

    Comments: To appear in the Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS) 2023 and the ACM SIGMETRICS 2023 conference

  14. arXiv:2211.05908  [pdf, other]

    cs.AR cs.DC cs.DS cs.PF cs.SE

    Accelerating Irregular Applications via Efficient Synchronization and Data Access Techniques

    Authors: Christina Giannoula

    Abstract: Irregular applications comprise an increasingly important workload domain for many fields, including bioinformatics, chemistry, physics, social sciences and machine learning. Therefore, achieving high performance and energy efficiency in the execution of emerging irregular applications is of vital importance. This dissertation studies the root causes of inefficiency of irregular applications in mo…

    Submitted 14 November, 2022; v1 submitted 10 November, 2022; originally announced November 2022.

    Comments: PhD Thesis

  15. Accelerating Time Series Analysis via Processing using Non-Volatile Memories

    Authors: Ivan Fernandez, Christina Giannoula, Aditya Manglik, Ricardo Quislant, Nika Mansouri Ghiasi, Juan Gómez-Luna, Eladio Gutierrez, Oscar Plata, Onur Mutlu

    Abstract: Time Series Analysis (TSA) is a critical workload to extract valuable information from collections of sequential data, e.g., detecting anomalies in electrocardiograms. Subsequence Dynamic Time Warping (sDTW) is the state-of-the-art algorithm for high-accuracy TSA. We find that the performance and energy efficiency of sDTW on conventional CPU and GPU platforms are heavily burdened by the latency an…

    Submitted 12 July, 2024; v1 submitted 8 November, 2022; originally announced November 2022.

    Comments: Published in IEEE Access, 2024, volume 12

    Journal ref: IEEE Access, vol. 12, pp. 36727-36742, 2024

  16. arXiv:2210.08508  [pdf, ps, other]

    cs.AR cs.DC

    RevaMp3D: Architecting the Processor Core and Cache Hierarchy for Systems with Monolithically-Integrated Logic and Memory

    Authors: Nika Mansouri Ghiasi, Mohammad Sadrosadati, Geraldo F. Oliveira, Konstantinos Kanellopoulos, Rachata Ausavarungnirun, Juan Gómez Luna, João Ferreira, Jeremie S. Kim, Christina Giannoula, Nandita Vijaykumar, Jisung Park, Onur Mutlu

    Abstract: Recent nano-technological advances enable the Monolithic 3D (M3D) integration of multiple memory and logic layers in a single chip, allowing for fine-grained connections between layers and significantly alleviating main memory bottlenecks. We show for a variety of workloads, on a state-of-the-art M3D-based system, that the performance and energy bottlenecks shift from main memory to the processor…

    Submitted 8 June, 2025; v1 submitted 16 October, 2022; originally announced October 2022.

  17. arXiv:2206.00938  [pdf, other]

    cs.AR

    Exploiting Near-Data Processing to Accelerate Time Series Analysis

    Authors: Ivan Fernandez, Ricardo Quislant, Christina Giannoula, Mohammed Alser, Juan Gómez-Luna, Eladio Gutiérrez, Oscar Plata, Onur Mutlu

    Abstract: Time series analysis is a key technique for extracting and predicting events in domains as diverse as epidemiology, genomics, neuroscience, environmental sciences, economics, and more. Matrix profile, the state-of-the-art algorithm to perform time series analysis, computes the most similar subsequence for a given query subsequence within a sliced time series. Matrix profile has low arithmetic inte…

    Submitted 2 June, 2022; originally announced June 2022.

    Comments: To appear in ISVLSI 2022 Special Session on Processing in Memory. arXiv admin note: text overlap with arXiv:2010.02079

  18. arXiv:2204.00900  [pdf, ps, other]

    cs.AR cs.DC cs.PF

    Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Systems

    Authors: Christina Giannoula, Ivan Fernandez, Juan Gómez-Luna, Nectarios Koziris, Georgios Goumas, Onur Mutlu

    Abstract: Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures. Near-bank PIM architectures place simple cores close to DRAM banks and can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems can provide high levels of parallelism, large aggregate memory bandwidth and low me…

    Submitted 2 April, 2022; originally announced April 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:2201.05072

  19. arXiv:2201.05072  [pdf, other]

    cs.AR cs.DC cs.PF

    SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Systems

    Authors: Christina Giannoula, Ivan Fernandez, Juan Gómez-Luna, Nectarios Koziris, Georgios Goumas, Onur Mutlu

    Abstract: Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures. Near-bank PIM architectures place simple cores close to DRAM banks and can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems can provide high levels of parallelism, large aggregate memory bandwidth and low me…

    Submitted 23 May, 2022; v1 submitted 13 January, 2022; originally announced January 2022.

    Comments: To appear in the Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS) 2022 and the ACM SIGMETRICS 2022 conference

  20. arXiv:2110.01709  [pdf, other]

    cs.AR cs.DC cs.PF

    Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-in-Memory Hardware

    Authors: Juan Gómez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, Onur Mutlu

    Abstract: Many modern workloads such as neural network inference and graph processing are fundamentally memory-bound. For such workloads, data movement between memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads…

    Submitted 3 April, 2023; v1 submitted 4 October, 2021; originally announced October 2021.

    Comments: Invited paper to appear at Workshop on Computing with Unconventional Technologies (CUT) 2021 https://sites.google.com/umn.edu/cut-2021/home. arXiv admin note: substantial text overlap with arXiv:2105.03814

  21. arXiv:2105.03814  [pdf, other]

    cs.AR cs.DC cs.PF

    Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

    Authors: Juan Gómez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, Onur Mutlu

    Abstract: Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bo…

    Submitted 4 May, 2022; v1 submitted 8 May, 2021; originally announced May 2021.

    Comments: Our open source software is available at https://github.com/CMU-SAFARI/prim-benchmarks

  22. arXiv:2101.07557  [pdf, other]

    cs.AR cs.DC

    SynCron: Efficient Synchronization Support for Near-Data-Processing Architectures

    Authors: Christina Giannoula, Nandita Vijaykumar, Nikela Papadopoulou, Vasileios Karakostas, Ivan Fernandez, Juan Gómez-Luna, Lois Orosa, Nectarios Koziris, Georgios Goumas, Onur Mutlu

    Abstract: Near-Data-Processing (NDP) architectures present a promising way to alleviate data movement costs and can provide significant performance and energy benefits to parallel applications. Typically, NDP architectures support several NDP units, each including multiple simple cores placed close to memory. To fully leverage the benefits of NDP and achieve high performance for parallel workloads, efficien…

    Submitted 13 February, 2021; v1 submitted 19 January, 2021; originally announced January 2021.

    Comments: To appear in the 27th IEEE International Symposium on High-Performance Computer Architecture (HPCA-27)

  23. arXiv:2010.02079  [pdf, other]

    cs.AR

    NATSA: A Near-Data Processing Accelerator for Time Series Analysis

    Authors: Ivan Fernandez, Ricardo Quislant, Christina Giannoula, Mohammed Alser, Juan Gómez-Luna, Eladio Gutiérrez, Oscar Plata, Onur Mutlu

    Abstract: Time series analysis is a key technique for extracting and predicting events in domains as diverse as epidemiology, genomics, neuroscience, environmental sciences, economics, and more. Matrix profile, the state-of-the-art algorithm to perform time series analysis, computes the most similar subsequence for a given query subsequence within a sliced time series. Matrix profile has low arithmetic inte…

    Submitted 6 October, 2020; v1 submitted 5 October, 2020; originally announced October 2020.

    Comments: To appear in the 38th IEEE International Conference on Computer Design (ICCD 2020)

  24. SMASH: Co-designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations

    Authors: Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina Giannoula, Roknoddin Azizi, Skanda Koppula, Nika Mansouri Ghiasi, Taha Shahroodi, Juan Gomez Luna, Onur Mutlu

    Abstract: Important workloads, such as machine learning and graph analytics applications, heavily involve sparse linear algebra operations. These operations use sparse matrix compression as an effective means to avoid storing zeros and performing unnecessary computation on zero elements. However, compression techniques like Compressed Sparse Row (CSR) that are widely used today introduce significant instruc…

    Submitted 23 October, 2019; originally announced October 2019.