Showing 1–50 of 55 results for author: Ben-Nun, T

Searching in archive cs.
  1. arXiv:2507.11467

    cs.AI cs.SE

    Modeling Code: Is Text All You Need?

    Authors: Daniel Nichols, Konstantinos Parasyris, Harshitha Menon, Brian R. Bartoldson, Giorgis Georgakoudis, Tal Ben-Nun, Abhinav Bhatele

    Abstract: Code LLMs have become extremely popular recently for modeling source code across a variety of tasks, such as generation, translation, and summarization. However, transformer-based models are limited in their capabilities to reason through structured, analytical properties of code, such as control and data flow. Previous work has explored the modeling of these properties with structured data and gr…

    Submitted 15 July, 2025; originally announced July 2025.

  2. arXiv:2505.08135

    cs.SE cs.AI cs.DC cs.PF

    Leveraging AI for Productive and Trustworthy HPC Software: Challenges and Research Directions

    Authors: Keita Teranishi, Harshitha Menon, William F. Godoy, Prasanna Balaprakash, David Bau, Tal Ben-Nun, Abhinav Bhatele, Franz Franchetti, Michael Franusich, Todd Gamblin, Giorgis Georgakoudis, Tom Goldstein, Arjun Guha, Steven Hahn, Costin Iancu, Zheming Jin, Terry Jones, Tze Meng Low, Het Mankad, Narasinga Rao Miniskar, Mohammad Alaul Haque Monil, Daniel Nichols, Konstantinos Parasyris, Swaroop Pophale, Pedro Valero-Lara, et al. (3 additional authors not shown)

    Abstract: We discuss the challenges and propose research directions for using AI to revolutionize the development of high-performance computing (HPC) software. AI technologies, in particular large language models, have transformed every aspect of software development. For its part, HPC software is recognized as a highly specialized scientific field of its own. We discuss the challenges associated with lever…

    Submitted 12 May, 2025; originally announced May 2025.

    Comments: 12 pages, 1 Figure, Accepted at "The 1st International Workshop on Foundational Large Language Models Advances for HPC" LLM4HPC to be held in conjunction with ISC High Performance 2025

  3. A Unifying Framework to Enable Artificial Intelligence in High Performance Computing Workflows

    Authors: Jens Domke, Mohamed Wahib, Anshu Dubey, Tal Ben-Nun, Erik W. Draeger

    Abstract: Current trends point to a future where large-scale scientific applications are tightly-coupled HPC/AI hybrids. Hence, we urgently need to invest in creating a seamless, scalable framework where HPC and AI/ML can efficiently work together and adapt to novel hardware and vendor libraries without starting from scratch every few years. The current ecosystem and sparsely-connected community are not suf…

    Submitted 5 May, 2025; originally announced May 2025.

    Comments: Article is still in press; a DOI has already been assigned by the publisher; publication will appear in Computing in Science & Engineering (CiSE) https://www.computer.org/csdl/magazine/cs

  4. arXiv:2505.01912

    cs.LG cond-mat.mtrl-sci cs.AI

    BOOM: Benchmarking Out-Of-distribution Molecular Property Predictions of Machine Learning Models

    Authors: Evan R. Antoniuk, Shehtab Zaman, Tal Ben-Nun, Peggy Li, James Diffenderfer, Busra Demirci, Obadiah Smolenski, Tim Hsu, Anna M. Hiszpanski, Kenneth Chiu, Bhavya Kailkhura, Brian Van Essen

    Abstract: Advances in deep learning and generative modeling have driven interest in data-driven molecule discovery pipelines, whereby machine learning (ML) models are used to filter and design novel molecules without requiring prohibitively expensive first-principles simulations. Although the discovery of novel molecules that extend the boundaries of known chemistry requires accurate out-of-distribution (OO…

    Submitted 3 May, 2025; originally announced May 2025.

  5. arXiv:2503.18929

    cs.LG

    Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training

    Authors: Brian R. Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, Bhavya Kailkhura

    Abstract: Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. However, existing on-policy algorithms used for post-training are inherently incompatible with the use of experience replay buffers, which can be populated scalably by distributed off-policy actors to enhance exploration as compute increases. We propose efficiently obtaining this benefit of replay buff…

    Submitted 24 March, 2025; originally announced March 2025.

  6. arXiv:2411.16462

    cs.LG cs.DC

    Lion Cub: Minimizing Communication Overhead in Distributed Lion

    Authors: Satoki Ishikawa, Tal Ben-Nun, Brian Van Essen, Rio Yokota, Nikoli Dryden

    Abstract: Communication overhead is a key challenge in distributed deep learning, especially on slower Ethernet interconnects, and given current hardware trends, communication is likely to become a major bottleneck. While gradient compression techniques have been explored for SGD and Adam, the Lion optimizer has the distinct advantage that its update vectors are the output of a sign operation, enabling stra…

    Submitted 4 July, 2025; v1 submitted 25 November, 2024; originally announced November 2024.

    Comments: ICML 2025 Workshop "Tiny Titans: The Next Wave of On-Device Learning for Foundational Models (TTODLer-FM)"

  7. arXiv:2408.05636

    cs.CL cs.LG

    Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion

    Authors: Jacob K Christopher, Brian R Bartoldson, Tal Ben-Nun, Michael Cardei, Bhavya Kailkhura, Ferdinando Fioretto

    Abstract: Speculative decoding has emerged as a widely adopted method to accelerate large language model inference without sacrificing the quality of the model outputs. While this technique has facilitated notable speed improvements by enabling parallel sequence verification, its efficiency remains inherently limited by the reliance on incremental token generation in existing draft models. To overcome this…

    Submitted 10 February, 2025; v1 submitted 10 August, 2024; originally announced August 2024.

    Comments: Published at the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025)

  8. Low-Depth Spatial Tree Algorithms

    Authors: Yves Baumann, Tal Ben-Nun, Maciej Besta, Lukas Gianinazzi, Torsten Hoefler, Piotr Luczynski

    Abstract: Contemporary accelerator designs exhibit a high degree of spatial localization, wherein two-dimensional physical distance determines communication costs between processing elements. This situation presents considerable algorithmic challenges, particularly when managing sparse data, a pivotal component in progressing data science. The spatial computer model quantifies communication locality by weig…

    Submitted 8 August, 2024; v1 submitted 19 April, 2024; originally announced April 2024.

    ACM Class: F.2.2

    Journal ref: IEEE International Parallel and Distributed Processing Symposium, IPDPS 2024, San Francisco, CA, USA, May 27-31 (2024) 180-192

  9. Arrow Matrix Decomposition: A Novel Approach for Communication-Efficient Sparse Matrix Multiplication

    Authors: Lukas Gianinazzi, Alexandros Nikolaos Ziogas, Langwen Huang, Piotr Luczynski, Saleh Ashkboos, Florian Scheidl, Armon Carigiet, Chio Ge, Nabil Abubaker, Maciej Besta, Tal Ben-Nun, Torsten Hoefler

    Abstract: We propose a novel approach to iterated sparse matrix dense matrix multiplication, a fundamental computational kernel in scientific computing and graph neural network training. In cases where matrix sizes exceed the memory of a single compute node, data transfer becomes a bottleneck. An approach based on dense matrix multiplication algorithms leads to suboptimal scalability and fails to exploit th…

    Submitted 20 March, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

    ACM Class: F.2.1

    Journal ref: PPoPP'24: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (2024) 404-416

  10. arXiv:2310.02065

    cs.DC cs.LG

    VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores

    Authors: Roberto L. Castro, Andrei Ivanov, Diego Andrade, Tal Ben-Nun, Basilio B. Fraguela, Torsten Hoefler

    Abstract: The increasing success and scaling of Deep Learning models demands higher computational efficiency and power. Sparsification can lead to both smaller models as well as higher compute efficiency, and accelerated hardware is becoming available. However, exploiting it efficiently requires kernel implementations, pruning algorithms, and storage formats, to utilize hardware support of specialized spars…

    Submitted 3 October, 2023; originally announced October 2023.

    Comments: Accepted by 2023 International Conference on High Performance Computing, Networking, Storage and Analysis, 2023 (SC'23)

  11. arXiv:2309.15432

    cs.PL

    ComPile: A Large IR Dataset from Production Sources

    Authors: Aiden Grossman, Ludger Paehler, Konstantinos Parasyris, Tal Ben-Nun, Jacob Hegna, William Moses, Jose M Monsalve Diaz, Mircea Trofin, Johannes Doerfert

    Abstract: Code is increasingly becoming a core data modality of modern machine learning research impacting not only the way we write code with conversational agents like OpenAI's ChatGPT, Google's Bard, or Anthropic's Claude, the way we translate code from one language into another, but also the compiler infrastructure underlying the language. While modeling approaches may vary and representations differ, t…

    Submitted 30 April, 2024; v1 submitted 27 September, 2023; originally announced September 2023.

  12. arXiv:2308.12093

    cs.LG cs.PF

    Cached Operator Reordering: A Unified View for Fast GNN Training

    Authors: Julia Bazinska, Andrei Ivanov, Tal Ben-Nun, Nikoli Dryden, Maciej Besta, Siyuan Shen, Torsten Hoefler

    Abstract: Graph Neural Networks (GNNs) are a powerful tool for handling structured graph data and addressing tasks such as node classification, graph classification, and clustering. However, the sparse nature of GNN computation poses new challenges for performance optimization compared to traditional deep neural networks. We address these challenges by providing a unified view of GNN computation, I/O, and m…

    Submitted 23 August, 2023; originally announced August 2023.

  13. Maximum Flows in Parametric Graph Templates

    Authors: Tal Ben-Nun, Lukas Gianinazzi, Torsten Hoefler, Yishai Oltchik

    Abstract: Execution graphs of parallel loop programs exhibit a nested, repeating structure. We show how such graphs that are the result of nested repetition can be represented by succinct parametric structures. This parametric graph template representation allows us to reason about the execution graph of a parallel program at a cost that only depends on the program size. We develop structurally-parametric p…

    Submitted 17 July, 2023; originally announced July 2023.

    Comments: arXiv admin note: substantial text overlap with arXiv:2011.07001

    ACM Class: F.2.2

  14. arXiv:2306.16178

    cs.SE

    FuzzyFlow: Leveraging Dataflow To Find and Squash Program Optimization Bugs

    Authors: Philipp Schaad, Timo Schneider, Tal Ben-Nun, Alexandru Calotoiu, Alexandros Nikolaos Ziogas, Torsten Hoefler

    Abstract: The current hardware landscape and application scale is driving performance engineers towards writing bespoke optimizations. Verifying such optimizations, and generating minimal failing cases, is important for robustness in the face of changing program conditions, such as inputs and sizes. However, isolation of minimal test-cases from existing applications and generating new configurations are oft…

    Submitted 28 June, 2023; originally announced June 2023.

  15. Bridging Control-Centric and Data-Centric Optimization

    Authors: Tal Ben-Nun, Berke Ates, Alexandru Calotoiu, Torsten Hoefler

    Abstract: With the rise of specialized hardware and new programming languages, code optimization has shifted its focus towards promoting data locality. Most production-grade compilers adopt a control-centric mindset - instruction-driven optimization augmented with scalar-based dataflow - whereas other approaches provide domain-specific and general purpose data movement minimization, which can miss important…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: CGO'23

  16. arXiv:2304.07613

    cs.LG

    STen: Productive and Efficient Sparsity in PyTorch

    Authors: Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Saleh Ashkboos, Torsten Hoefler

    Abstract: As deep learning models grow, sparsity is becoming an increasingly critical component of deep neural networks, enabling improved performance and reduced storage. However, existing frameworks offer poor support for sparsity. Specialized sparsity engines focus exclusively on sparse inference, while general frameworks primarily focus on sparse tensors in classical formats and neglect the broader spar…

    Submitted 15 April, 2023; originally announced April 2023.

  17. arXiv:2303.08142

    cs.SE cs.DC cs.LG

    Performance Embeddings: A Similarity-based Approach to Automatic Performance Optimization

    Authors: Lukas Trümper, Tal Ben-Nun, Philipp Schaad, Alexandru Calotoiu, Torsten Hoefler

    Abstract: Performance optimization is an increasingly challenging but often repetitive task. While each platform has its quirks, the underlying code transformations rely on data movement and computational characteristics that recur across applications. This paper proposes to leverage those similarities by constructing an embedding space for subprograms. The continuous space captures both static and dynamic…

    Submitted 14 March, 2023; originally announced March 2023.

  18. arXiv:2301.01048

    cs.DC cs.LG

    A Theory of I/O-Efficient Sparse Neural Network Inference

    Authors: Niels Gleinig, Tal Ben-Nun, Torsten Hoefler

    Abstract: As the accuracy of machine learning models increases at a fast rate, so does their demand for energy and compute resources. On a low level, the major part of these resources is consumed by data movement between different memory units. Modern hardware architectures contain a form of fast memory (e.g., cache, registers), which is small, and a slow memory (e.g., DRAM), which is larger but expensive t…

    Submitted 3 January, 2023; originally announced January 2023.

  19. arXiv:2212.13768

    cs.DC cs.PL

    Python FPGA Programming with Data-Centric Multi-Level Design

    Authors: Johannes de Fine Licht, Tiziano De Matteis, Tal Ben-Nun, Andreas Kuster, Oliver Rausch, Manuel Burger, Carl-Johannes Johnsen, Torsten Hoefler

    Abstract: Although high-level synthesis (HLS) tools have significantly improved programmer productivity over hardware description languages, developing for FPGAs remains tedious and error prone. Programmers must learn and implement a large set of vendor-specific syntax, patterns, and tricks to optimize (or even successfully compile) their applications, while dealing with ever-changing toolflows from the FPG…

    Submitted 28 December, 2022; originally announced December 2022.

  20. arXiv:2210.04598

    cs.DC cs.PF

    Temporal Vectorization: A Compiler Approach to Automatic Multi-Pumping

    Authors: Carl-Johannes Johnsen, Tiziano De Matteis, Tal Ben-Nun, Johannes de Fine Licht, Torsten Hoefler

    Abstract: The multi-pumping resource sharing technique can overcome the limitations commonly found in single-clocked FPGA designs by allowing hardware components to operate at a higher clock frequency than the surrounding system. However, this optimization cannot be expressed in high levels of abstraction, such as HLS, requiring the use of hand-optimized RTL. In this paper we show how to leverage multiple c…

    Submitted 19 September, 2022; originally announced October 2022.

  21. Boosting Performance Optimization with Interactive Data Movement Visualization

    Authors: Philipp Schaad, Tal Ben-Nun, Torsten Hoefler

    Abstract: Optimizing application performance in today's hardware architecture landscape is an important, but increasingly complex task, often requiring detailed performance analyses. In particular, data movement and reuse play a crucial role in optimization and are often hard to improve without detailed program inspection. Performance visualizations can assist in the diagnosis of performance problems, but g…

    Submitted 24 August, 2022; v1 submitted 21 June, 2022; originally announced July 2022.

  22. arXiv:2206.14786

    cs.LG physics.ao-ph

    ENS-10: A Dataset For Post-Processing Ensemble Weather Forecasts

    Authors: Saleh Ashkboos, Langwen Huang, Nikoli Dryden, Tal Ben-Nun, Peter Dueben, Lukas Gianinazzi, Luca Kummer, Torsten Hoefler

    Abstract: Post-processing ensemble prediction systems can improve the reliability of weather forecasting, especially for extreme event prediction. In recent years, different machine learning models have been developed to improve the quality of weather post-processing. However, these models require a comprehensive dataset of weather simulations to produce high-accuracy results, which comes at a high computat…

    Submitted 7 November, 2022; v1 submitted 29 June, 2022; originally announced June 2022.

    Comments: Accepted version of the paper

  23. arXiv:2206.08301

    cs.DC

    Deinsum: Practically I/O Optimal Multilinear Algebra

    Authors: Alexandros Nikolaos Ziogas, Grzegorz Kwasniewski, Tal Ben-Nun, Timo Schneider, Torsten Hoefler

    Abstract: Multilinear algebra kernel performance on modern massively-parallel systems is determined mainly by data movement. However, deriving data movement-optimal distributed schedules for programs with many high-dimensional inputs is a notoriously hard problem. State-of-the-art libraries rely on heuristics and often fall back to suboptimal tensor folding and BLAS calls. We present Deinsum, an automated f…

    Submitted 16 June, 2022; originally announced June 2022.

  24. arXiv:2205.04934

    cs.DS cs.DC

    The spatial computer: A model for energy-efficient parallel computation

    Authors: Lukas Gianinazzi, Tal Ben-Nun, Maciej Besta, Saleh Ashkboos, Yves Baumann, Piotr Luczynski, Torsten Hoefler

    Abstract: We present a new parallel model of computation suitable for spatial architectures, for which the energy used for communication heavily depends on the distance of the communicating processors. In our model, processors have locations on a conceptual two-dimensional grid, and their distance therein determines their communication cost. In particular, we introduce the energy cost of a spatial computati…

    Submitted 17 January, 2023; v1 submitted 10 May, 2022; originally announced May 2022.

    ACM Class: F.2.0

  25. arXiv:2205.04148

    cs.DC

    Productive Performance Engineering for Weather and Climate Modeling with Python

    Authors: Tal Ben-Nun, Linus Groner, Florian Deconinck, Tobias Wicky, Eddie Davis, Johann Dahm, Oliver D. Elbert, Rhea George, Jeremy McGibbon, Lukas Trümper, Elynn Wu, Oliver Fuhrer, Thomas Schulthess, Torsten Hoefler

    Abstract: Earth system models are developed with a tight coupling to target hardware, often containing specialized code predicated on processor characteristics. This coupling stems from using imperative languages that hard-code computation schedules and layout. We present a detailed account of optimizing the Finite Volume Cubed-Sphere Dynamical Core (FV3), improving productivity and performance. By using a…

    Submitted 25 August, 2022; v1 submitted 9 May, 2022; originally announced May 2022.

  26. arXiv:2112.11879

    cs.PL cs.DC cs.PF

    Lifting C Semantics for Dataflow Optimization

    Authors: Alexandru Calotoiu, Tal Ben-Nun, Grzegorz Kwasniewski, Johannes de Fine Licht, Timo Schneider, Philipp Schaad, Torsten Hoefler

    Abstract: C is the lingua franca of programming and almost any device can be programmed using C. However, programming modern heterogeneous architectures such as multi-core CPUs and GPUs requires explicitly expressing parallelism as well as device-specific properties such as memory hierarchies. The resulting code is often hard to understand, debug, and modify for different architectures. We propose to lift…

    Submitted 24 May, 2022; v1 submitted 22 December, 2021; originally announced December 2021.

  27. arXiv:2110.10802

    cs.LG cs.DC cs.PF

    A Data-Centric Optimization Framework for Machine Learning

    Authors: Oliver Rausch, Tal Ben-Nun, Nikoli Dryden, Andrei Ivanov, Shigang Li, Torsten Hoefler

    Abstract: Rapid progress in deep learning is leading to a diverse set of quickly changing models, with a dramatically growing demand for compute. However, as frameworks specialize performance optimization to patterns in popular networks, they implicitly constrain novel and diverse models that drive progress in research. We empower deep learning researchers by defining a flexible and user-customizable pipeli…

    Submitted 29 August, 2022; v1 submitted 20 October, 2021; originally announced October 2021.

    Comments: 13 pages, 12 figures, published at Proceedings of the ACM International Conference on Supercomputing (ICS'22)

  28. arXiv:2108.09337

    cs.DC cs.CC cs.PF

    On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal Matrix Factorizations

    Authors: Grzegorz Kwasniewski, Marko Kabić, Tal Ben-Nun, Alexandros Nikolaos Ziogas, Jens Eirik Saethre, André Gaillard, Timo Schneider, Maciej Besta, Anton Kozhevnikov, Joost VandeVondele, Torsten Hoefler

    Abstract: Matrix factorizations are among the most important building blocks of scientific computing. State-of-the-art libraries, however, are not communication-optimal, underutilizing current parallel architectures. We present novel algorithms for Cholesky and LU factorizations that utilize an asymptotically communication-optimal 2.5D decomposition. We first establish a theoretical framework for deriving p…

    Submitted 25 April, 2023; v1 submitted 20 August, 2021; originally announced August 2021.

    Comments: 15 pages (including references), 11 figures. arXiv admin note: substantial text overlap with arXiv:2010.05975

    Journal ref: Published at Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2021 (SC'21)

  29. arXiv:2107.00555

    cs.PL cs.DC cs.PF

    Productivity, Portability, Performance: Data-Centric Python

    Authors: Alexandros Nikolaos Ziogas, Timo Schneider, Tal Ben-Nun, Alexandru Calotoiu, Tiziano De Matteis, Johannes de Fine Licht, Luca Lavarini, Torsten Hoefler

    Abstract: Python has become the de facto language for scientific computing. Programming in Python is highly productive, mainly due to its rich science-oriented software ecosystem built around the NumPy module. As a result, the demand for Python support in High Performance Computing (HPC) has skyrocketed. However, the Python language itself does not necessarily offer high performance. In this work, we presen…

    Submitted 23 August, 2021; v1 submitted 1 July, 2021; originally announced July 2021.

  30. arXiv:2106.03594

    cs.LG

    Learning Combinatorial Node Labeling Algorithms

    Authors: Lukas Gianinazzi, Maximilian Fries, Nikoli Dryden, Tal Ben-Nun, Maciej Besta, Torsten Hoefler

    Abstract: We present a novel neural architecture to solve graph optimization problems where the solution consists of arbitrary node labels, allowing us to solve hard problems like graph coloring. We train our model using reinforcement learning, specifically policy gradients, which gives us both a greedy and a probabilistic policy. Our architecture builds on a graph attention network and uses several inducti…

    Submitted 10 May, 2022; v1 submitted 7 June, 2021; originally announced June 2021.

    ACM Class: I.2.2; I.2.8

  31. Pebbles, Graphs, and a Pinch of Combinatorics: Towards Tight I/O Lower Bounds for Statically Analyzable Programs

    Authors: Grzegorz Kwasniewski, Tal Ben-Nun, Lukas Gianinazzi, Alexandru Calotoiu, Timo Schneider, Alexandros Nikolaos Ziogas, Maciej Besta, Torsten Hoefler

    Abstract: Determining I/O lower bounds is a crucial step in obtaining communication-efficient parallel algorithms, both across the memory hierarchy and between processors. Current approaches either study specific algorithms individually, disallow programmatic motifs such as recomputation, or produce asymptotic bounds that exclude important constants. We propose a novel approach for obtaining precise I/O low…

    Submitted 15 May, 2021; originally announced May 2021.

    Comments: 13 pages, 4 figures, published at Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA'21)

  32. arXiv:2102.00554

    cs.LG cs.AI cs.AR cs.CV cs.NE

    Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks

    Authors: Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, Alexandra Peste

    Abstract: The growing energy and performance costs of deep learning have driven the community to reduce the size of neural networks by selectively pruning components. Similarly to their biological counterparts, sparse networks generalize just as well, if not better than, the original dense networks. Sparsity can reduce the memory footprint of regular networks to fit mobile devices, as well as shorten traini…

    Submitted 31 January, 2021; originally announced February 2021.

    Comments: 90 pages, 26 figures

  33. arXiv:2101.08734

    cs.DC cs.LG

    Clairvoyant Prefetching for Distributed Machine Learning I/O

    Authors: Nikoli Dryden, Roman Böhringer, Tal Ben-Nun, Torsten Hoefler

    Abstract: I/O is emerging as a major bottleneck for machine learning training, especially in distributed environments. Indeed, at large scale, I/O takes as much as 85% of training time. Addressing this I/O bottleneck necessitates careful optimization, as optimal data ingestion pipelines differ between systems, and require a delicate balance between access to local storage, external filesystems, and remote n…

    Submitted 10 June, 2021; v1 submitted 21 January, 2021; originally announced January 2021.

    Comments: 13 pages, 16 figures; major revisions

  34. arXiv:2012.01470

    cs.PL cs.LG

    Deep Data Flow Analysis

    Authors: Chris Cummins, Hugh Leather, Zacharias Fisches, Tal Ben-Nun, Torsten Hoefler, Michael O'Boyle

    Abstract: Compiler architects increasingly look to machine learning when building heuristics for compiler optimization. The promise of automatic heuristic design, freeing the compiler engineer from the complex interactions of program, architecture, and other optimizations, is alluring. However, most machine learning methods cannot replicate even the simplest of the abstract interpretations of data flow anal…

    Submitted 20 November, 2020; originally announced December 2020.

    Comments: 9 pages, plus appendices. arXiv admin note: text overlap with arXiv:2003.10536

  35. arXiv:2011.07001

    cs.DS

    Parametric Graph Templates: Properties and Algorithms

    Authors: Tal Ben-Nun, Lukas Gianinazzi, Torsten Hoefler, Yishai Oltchik

    Abstract: Hierarchical structure and repetition are prevalent in graphs originating from nature or engineering. These patterns can be represented by a class of parametric-structure graphs, which are defined by templates that generate structure by way of repeated instantiation. We propose a class of parametric graph templates that can succinctly represent a wide variety of graphs. Using parametric graph temp…

    Submitted 13 November, 2020; originally announced November 2020.

    MSC Class: 68W05; ACM Class: G.2.2

  36. arXiv:2010.15218

    cs.DC

    StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems

    Authors: Johannes de Fine Licht, Andreas Kuster, Tiziano De Matteis, Tal Ben-Nun, Dominic Hofer, Torsten Hoefler

    Abstract: Spatial computing devices have been shown to significantly accelerate stencil computations, but have so far relied on unrolling the iterative dimension of a single stencil operation to increase temporal locality. This work considers the general case of mapping directed acyclic graphs of heterogeneous stencil computations to spatial computing systems, assuming large input programs without an iterat…

    Submitted 11 January, 2021; v1 submitted 28 October, 2020; originally announced October 2020.

  37. arXiv:2010.14684

    cs.DC cs.AR cs.DS

    Substream-Centric Maximum Matchings on FPGA

    Authors: Maciej Besta, Marc Fischer, Tal Ben-Nun, Dimitri Stanojevic, Johannes De Fine Licht, Torsten Hoefler

    Abstract: Developing high-performance and energy-efficient algorithms for maximum matchings is becoming increasingly important in social network analysis, computational sciences, scheduling, and others. In this work, we propose the first maximum matching algorithm designed for FPGAs; it is energy-efficient and has provable guarantees on accuracy, performance, and storage utilization. To achieve this, we for…

    Submitted 27 October, 2020; originally announced October 2020.

    Comments: Best Paper finalist at ACM FPGA'19, invited to special issue of ACM TRETS'20

    Journal ref: Proceedings of the ACM Transactions on Reconfigurable Technology and Systems (TRETS), 2020. Proceedings of the 27th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2019

  38. arXiv:2010.05975

    cs.DC

    On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal LU Factorization

    Authors: Grzegorz Kwasniewski, Tal Ben-Nun, Alexandros Nikolaos Ziogas, Timo Schneider, Maciej Besta, Torsten Hoefler

    Abstract: Dense linear algebra kernels, such as linear solvers or tensor contractions, are fundamental components of many scientific computing applications. In this work, we present a novel method of deriving parallel I/O lower bounds for this broad family of programs. Based on the X-partitioning abstraction, our method explicitly captures inter-statement dependencies. Applying our analysis to LU factorizat…

    Submitted 12 October, 2020; originally announced October 2020.

    Comments: 13 pages without references, 12 figures, submitted to PPoPP 2021: 26th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming

  39. arXiv:2007.00072

    cs.LG stat.ML

    Data Movement Is All You Need: A Case Study on Optimizing Transformers

    Authors: Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, Torsten Hoefler

    Abstract: Transformers are one of the most important machine learning workloads today. Training one is a very compute-intensive task, often taking days or weeks, and significant attention has been given to optimizing transformers. Despite this, existing implementations do not efficiently utilize GPUs. We find that data movement is the key bottleneck when training. Due to Amdahl's Law and massive improvement…

    Submitted 8 November, 2021; v1 submitted 30 June, 2020; originally announced July 2020.

    Comments: 22 pages, 8 figures; MLSys 2021 camera ready

  40. arXiv:2005.08748  [pdf, other]

    cs.LG eess.SP physics.ao-ph stat.ML

    Deep Learning for Post-Processing Ensemble Weather Forecasts

    Authors: Peter Grönquist, Chengyuan Yao, Tal Ben-Nun, Nikoli Dryden, Peter Dueben, Shigang Li, Torsten Hoefler

    Abstract: Quantifying uncertainty in weather forecasts is critical, especially for predicting extreme weather events. This is typically accomplished with ensemble prediction systems, which consist of many perturbed numerical weather simulations, or trajectories, run in parallel. These systems are associated with a high computational cost and often involve statistical post-processing steps to inexpensively i…

    Submitted 21 September, 2020; v1 submitted 18 May, 2020; originally announced May 2020.

  41. Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging

    Authors: Shigang Li, Tal Ben-Nun, Giorgi Nadiradze, Salvatore Di Girolamo, Nikoli Dryden, Dan Alistarh, Torsten Hoefler

    Abstract: Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve the same accuracy as their globally-communicating…

    Submitted 21 August, 2025; v1 submitted 30 April, 2020; originally announced May 2020.

    Comments: Published in IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS), vol. 32, no. 7, pp. 1725-1739, 1 July 2021, DOI: https://doi.org/10.1109/TPDS.2020.3040606

    ACM Class: C.1.4; D.1.3; I.2

    Journal ref: IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 7, pp. 1725-1739, 1 July 2021

  42. arXiv:2003.10536  [pdf, other]

    cs.LG cs.PF cs.PL stat.ML

    ProGraML: Graph-based Deep Learning for Program Optimization and Analysis

    Authors: Chris Cummins, Zacharias V. Fisches, Tal Ben-Nun, Torsten Hoefler, Hugh Leather

    Abstract: The increasing complexity of computing systems places a tremendous burden on optimizing compilers, requiring ever more accurate and aggressive optimizations. Machine learning offers significant benefits for constructing optimization heuristics but there remains a gap between what state-of-the-art methods achieve and the performance of an optimal heuristic. Closing this gap requires improvements in…

    Submitted 23 March, 2020; originally announced March 2020.

    Comments: 20 pages, author preprint

  43. A Data-Centric Approach to Extreme-Scale Ab initio Dissipative Quantum Transport Simulations

    Authors: Alexandros Nikolaos Ziogas, Tal Ben-Nun, Guillermo Indalecio Fernández, Timo Schneider, Mathieu Luisier, Torsten Hoefler

    Abstract: The computational efficiency of a state-of-the-art ab initio quantum transport (QT) solver, capable of revealing the coupled electro-thermal properties of atomically-resolved nano-transistors, has been improved by up to two orders of magnitude through a data-centric reorganization of the application. The approach yields coarse- and fine-grained data-movement characteristics that can be used for per…

    Submitted 18 December, 2019; originally announced December 2019.

    Comments: 13 pages, 13 figures, SC19

  44. Optimizing the Data Movement in Quantum Transport Simulations via Data-Centric Parallel Programming

    Authors: Alexandros Nikolaos Ziogas, Tal Ben-Nun, Guillermo Indalecio Fernández, Timo Schneider, Mathieu Luisier, Torsten Hoefler

    Abstract: Designing efficient cooling systems for integrated circuits (ICs) relies on a deep understanding of the electro-thermal properties of transistors. To shed light on this issue in currently fabricated FinFETs, a quantum mechanical solver capable of revealing atomically-resolved electron and phonon transport phenomena from first-principles is required. In this paper, we consider a global, data-centri…

    Submitted 18 December, 2019; originally announced December 2019.

    Comments: 12 pages, 18 figures, SC19

  45. arXiv:1911.00630  [pdf, other]

    cs.LG physics.ao-ph stat.ML

    Predicting Weather Uncertainty with Deep Convnets

    Authors: Peter Grönquist, Tal Ben-Nun, Nikoli Dryden, Peter Dueben, Luca Lavarini, Shigang Li, Torsten Hoefler

    Abstract: Modern weather forecast models perform uncertainty quantification using ensemble prediction systems, which collect nonparametric statistics based on multiple perturbed simulations. To provide accurate estimation, dozens of such computationally intensive simulations must be run. We show that deep neural networks can be used on a small set of numerical weather simulations to estimate the spread of a…

    Submitted 4 December, 2019; v1 submitted 1 November, 2019; originally announced November 2019.

    Comments: Poster presentation at the NeurIPS 2019 "Machine Learning and the Physical Sciences" Workshop

    MSC Class: I.2.10; I.2.1

    ACM Class: I.2.10; I.2.1

  46. arXiv:1908.08986  [pdf, other]

    cs.CV cs.LG stat.ML

    Mix & Match: training convnets with mixed image sizes for improved accuracy, speed and scale resiliency

    Authors: Elad Hoffer, Berry Weinstein, Itay Hubara, Tal Ben-Nun, Torsten Hoefler, Daniel Soudry

    Abstract: Convolutional neural networks (CNNs) are commonly trained using a fixed spatial image size predetermined for a given model. Although trained on images of a specific size, it is well established that CNNs can be used to evaluate a wide range of image sizes at test time, by adjusting the size of intermediate feature maps. In this work, we describe and evaluate a novel mixed-size training regime that…

    Submitted 12 August, 2019; originally announced August 2019.

  47. Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations

    Authors: Shigang Li, Tal Ben-Nun, Salvatore Di Girolamo, Dan Alistarh, Torsten Hoefler

    Abstract: Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every training step. In this paper, we propose eager-SGD, w…

    Submitted 21 August, 2025; v1 submitted 12 August, 2019; originally announced August 2019.

    Comments: Published in Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'20), pp. 45-61. 2020, Best Paper Nomination

  48. arXiv:1903.06697  [pdf, other]

    cs.DC cs.AR

    Graph Processing on FPGAs: Taxonomy, Survey, Challenges

    Authors: Maciej Besta, Dimitri Stanojevic, Johannes De Fine Licht, Tal Ben-Nun, Torsten Hoefler

    Abstract: Graph processing has become an important part of various areas, such as machine learning, computational sciences, medical applications, social network analysis, and many others. Various graphs, for example web or social networks, may contain up to trillions of edges. The sheer size of such datasets, combined with the irregular nature of graph processing, poses unique challenges for the runtime and…

    Submitted 27 April, 2019; v1 submitted 24 February, 2019; originally announced March 2019.

  49. arXiv:1902.10345  [pdf, other]

    cs.PL cs.DC cs.PF

    Stateful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures

    Authors: Tal Ben-Nun, Johannes de Fine Licht, Alexandros Nikolaos Ziogas, Timo Schneider, Torsten Hoefler

    Abstract: The ubiquity of accelerators in high-performance computing has driven programming complexity beyond the skill-set of the average domain scientist. To maintain performance portability in the future, it is imperative to decouple architecture-specific programming paradigms from the underlying scientific computations. We present the Stateful DataFlow multiGraph (SDFG), a data-centric intermediate repr…

    Submitted 2 January, 2020; v1 submitted 27 February, 2019; originally announced February 2019.

  50. arXiv:1901.10183  [pdf, other]

    cs.DC cs.LG cs.PF

    A Modular Benchmarking Infrastructure for High-Performance and Reproducible Deep Learning

    Authors: Tal Ben-Nun, Maciej Besta, Simon Huber, Alexandros Nikolaos Ziogas, Daniel Peter, Torsten Hoefler

    Abstract: We introduce Deep500: the first customizable benchmarking infrastructure that enables fair comparison of the plethora of deep learning frameworks, algorithms, libraries, and techniques. The key idea behind Deep500 is its modular design, where deep learning is factorized into four distinct levels: operators, network processing, training, and distributed training. Our evaluation illustrates that Dee…

    Submitted 13 June, 2019; v1 submitted 29 January, 2019; originally announced January 2019.

    Comments: Accepted to IPDPS 2019