-
Training LLMs on HPC Systems: Best Practices from the OpenGPT-X Project
Authors:
Carolin Penke,
Chelsea Maria John,
Jan Ebert,
Stefan Kesselheim,
Andreas Herten
Abstract:
The training of large language models (LLMs) requires substantial computational resources, complex software stacks, and carefully designed workflows to achieve scalability and efficiency. This report presents best practices and insights gained from the OpenGPT-X project, a German initiative focused on developing open, multilingual LLMs optimized for European languages. We detail the use of high-performance computing (HPC) systems, primarily JUWELS Booster at JSC, for training Teuken-7B, a 7-billion-parameter transformer model. The report covers system architecture, training infrastructure, software choices, profiling and benchmarking tools, as well as engineering and operational challenges.
Submitted 14 April, 2025;
originally announced April 2025.
-
Performance and Power: Systematic Evaluation of AI Workloads on Accelerators with CARAML
Authors:
Chelsea Maria John,
Stepan Nassyr,
Carolin Penke,
Andreas Herten
Abstract:
The rapid advancement of machine learning (ML) technologies has driven the development of specialized hardware accelerators designed to facilitate more efficient model training. This paper introduces the CARAML benchmark suite, which is employed to assess performance and energy consumption during the training of transformer-based large language models and computer vision models on a range of hardware accelerators, including systems from NVIDIA, AMD, and Graphcore. CARAML provides a compact, automated, extensible, and reproducible framework for assessing the performance and energy of ML workloads across various novel hardware architectures. The design and implementation of CARAML, along with a custom power measurement tool called jpwr, are discussed in detail.
Submitted 29 October, 2024; v1 submitted 19 September, 2024;
originally announced September 2024.
-
Application-Driven Exascale: The JUPITER Benchmark Suite
Authors:
Andreas Herten,
Sebastian Achilles,
Damian Alvarez,
Jayesh Badwaik,
Eric Behle,
Mathis Bode,
Thomas Breuer,
Daniel Caviedes-Voullième,
Mehdi Cherti,
Adel Dabah,
Salem El Sayed,
Wolfgang Frings,
Ana Gonzalez-Nicolas,
Eric B. Gregory,
Kaveh Haghighi Mood,
Thorsten Hater,
Jenia Jitsev,
Chelsea Maria John,
Jan H. Meinke,
Catrin I. Meyer,
Pavel Mezentsev,
Jan-Oliver Mirus,
Stepan Nassyr,
Carolin Penke,
Manoel Römmer, et al. (6 additional authors not shown)
Abstract:
Benchmarks are essential in the design of modern HPC installations, as they define key aspects of system components. Beyond synthetic workloads, it is crucial to include real applications that represent user requirements into benchmark suites, to guarantee high usability and widespread adoption of a new system. Given the significant investments in leadership-class supercomputers of the exascale era, this is even more important and necessitates alignment with a vision of Open Science and reproducibility. In this work, we present the JUPITER Benchmark Suite, which incorporates 16 applications from various domains. It was designed for and used in the procurement of JUPITER, the first European exascale supercomputer. We identify requirements and challenges and outline the project and software infrastructure setup. We provide descriptions and scalability studies of selected applications and a set of key takeaways. The JUPITER Benchmark Suite is released as open source software with this work at https://github.com/FZJ-JSC/jubench.
Submitted 30 August, 2024;
originally announced August 2024.
-
Noise2Noise Denoising of CRISM Hyperspectral Data
Authors:
Robert Platt,
Rossella Arcucci,
Cédric M. John
Abstract:
Hyperspectral data acquired by the Compact Reconnaissance Imaging Spectrometer for Mars (CRISM) have allowed for unparalleled mapping of the surface mineralogy of Mars. Due to sensor degradation over time, a significant portion of the recently acquired data is considered unusable. Here a new data-driven model architecture, Noise2Noise4Mars (N2N4M), is introduced to remove noise from CRISM images. Our model is self-supervised and does not require zero-noise target data, making it well suited for use in Planetary Science applications where high quality labelled data is scarce. We demonstrate its strong performance on synthetic-noise data and CRISM images, and its impact on downstream classification performance, outperforming benchmark methods on most metrics. This allows for detailed analysis for critical sites of interest on the Martian surface, including proposed lander sites.
Submitted 26 March, 2024;
originally announced March 2024.