-
Comprehensive Performance Modeling and System Design Insights for Foundation Models
Authors: Shashank Subramanian, Ermal Rrapaj, Peter Harrington, Smeet Chheda, Steven Farrell, Brian Austin, Samuel Williams, Nicholas Wright, Wahid Bhimji
Abstract:
Generative AI, in particular large transformer models, is increasingly driving HPC system design in science and industry. We analyze performance characteristics of such transformer models and discuss their sensitivity to the transformer type, parallelization strategy, and HPC system features (accelerators and interconnects). We utilize a performance model that allows us to explore this complex design space and highlight its key components. We find that different transformer types demand different parallelism and system characteristics in different training regimes. Large Language Models are performant with 3D parallelism and amplify network needs only at pre-training scales, with reduced dependence on accelerator capacity and bandwidth. On the other hand, long-sequence transformers, representative of scientific foundation models, depend more uniformly on network and capacity and require 4D parallelism. Our analysis emphasizes the need for closer performance modeling of different transformer types keeping system features in mind and demonstrates a path towards this. Our code is available as open-source.
Submitted 30 September, 2024; originally announced October 2024.
-
Ookami: An A64FX Computing Resource
Authors: A. C. Calder, E. Siegmann, C. Feldman, S. Chheda, D. C. Smolarski, F. D. Swesty, A. Curtis, J. Dey, D. Carlson, B. Michalowicz, R. J. Harrison
Abstract:
We present a look at Ookami, a project providing community access to a testbed supercomputer with the ARM-based A64FX processors developed by a collaboration between RIKEN and Fujitsu and deployed in the Japanese supercomputer Fugaku. We describe the project, provide details about the user base and education/training program, and present highlights from performance studies of two astrophysical simulation codes.
Submitted 7 November, 2023; originally announced November 2023.
-
A Further Study of Linux Kernel Hugepages on A64FX with FLASH, an Astrophysical Simulation Code
Authors: Catherine Feldman, Smeet Chheda, Alan C. Calder, Eva Siegmann, John Dey, Tony Curtis, Robert J. Harrison
Abstract:
We present an expanded study of the performance of FLASH when using Linux Kernel Hugepages on Ookami, an HPE Apollo 80 A64FX platform. FLASH is a multi-scale, multi-physics simulation code written principally in modern Fortran and makes use of the PARAMESH library to manage a block-structured adaptive mesh. Our initial study used only the Fujitsu compiler to utilize standard hugepages (hp), but further investigation allowed us to utilize hp with multiple compilers by linking to the Fujitsu library libmpg, and to utilize transparent hugepages (thp) by enabling them at the node level. By comparing the results of hardware counters and in-code timers, we found that hp and thp do not significantly impact the runtime performance of FLASH. Interestingly, we observed a significant reduction in TLB misses, differences in cache and memory access counters, and strange behavior when using thp.
Submitted 8 September, 2023; originally announced September 2023.
-
On Using Linux Kernel Huge Pages with FLASH, an Astrophysical Simulation Code
Authors: Alan C. Calder, Catherine Feldman, Eva Siegmann, John Dey, Anthony Curtis, Smeet Chheda, Robert J. Harrison
Abstract:
We present efforts at improving the performance of FLASH, a multi-scale, multi-physics simulation code principally for astrophysical applications, by using huge pages on Ookami, an HPE Apollo 80 A64FX platform. FLASH is written principally in modern Fortran and makes use of the PARAMESH library to manage a block-structured adaptive mesh. We explored options for enabling the use of huge pages with several compilers, but we were only able to successfully use huge pages when compiling with the Fujitsu compiler. The use of huge pages substantially reduced the number of translation lookaside buffer misses, but overall performance gains were marginal.
Submitted 27 July, 2022; originally announced July 2022.