Search | arXiv e-print repository

Comprehensive Performance Modeling and System Design Insights for Foundation Models

Authors: Shashank Subramanian, Ermal Rrapaj, Peter Harrington, Smeet Chheda, Steven Farrell, Brian Austin, Samuel Williams, Nicholas Wright, Wahid Bhimji

Abstract: Generative AI, in particular large transformer models, are increasingly driving HPC system design in science and industry. We analyze performance characteristics of such transformer models and discuss their sensitivity to the transformer type, parallelization strategy, and HPC system features (accelerators and interconnects). We utilize a performance model that allows us to explore this complex de… ▽ More Generative AI, in particular large transformer models, are increasingly driving HPC system design in science and industry. We analyze performance characteristics of such transformer models and discuss their sensitivity to the transformer type, parallelization strategy, and HPC system features (accelerators and interconnects). We utilize a performance model that allows us to explore this complex design space and highlight its key components. We find that different transformer types demand different parallelism and system characteristics at different training regimes. Large Language Models are performant with 3D parallelism and amplify network needs only at pre-training scales with reduced dependence on accelerator capacity and bandwidth. On the other hand, long-sequence transformers, representative of scientific foundation models, place a more uniform dependence on network and capacity with necessary 4D parallelism. Our analysis emphasizes the need for closer performance modeling of different transformer types keeping system features in mind and demonstrates a path towards this. Our code is available as open-source. △ Less

Submitted 30 September, 2024; originally announced October 2024.

Comments: 17 pages, PMBS 2024

arXiv:2311.04259 [pdf, other]

Ookami: An A64FX Computing Resource

Authors: A. C. Calder, E. Siegmann, C. Feldman, S. Chheda, D. C. Smolarski, F. D. Swesty, A. Curtis, J. Dey, D. Carlson, B. Michalowicz, R. J. Harrison

Abstract: We present a look at Ookami, a project providing community access to a testbed supercomputer with the ARM-based A64FX processors developed by a collaboration between RIKEN and Fujitsu and deployed in the Japanese supercomputer Fugaku. We describe the project, provide details about the user base and education/training program, and present highlights from performance studies of two astrophysical sim… ▽ More We present a look at Ookami, a project providing community access to a testbed supercomputer with the ARM-based A64FX processors developed by a collaboration between RIKEN and Fujitsu and deployed in the Japanese supercomputer Fugaku. We describe the project, provide details about the user base and education/training program, and present highlights from performance studies of two astrophysical simulation codes. △ Less

Submitted 7 November, 2023; originally announced November 2023.

Comments: 9 pages, 3 figures, submitted to the Proceedings of 15th International Conference on Numerical Modeling of Space Plasma Flows

arXiv:2309.04652 [pdf, other]

doi 10.1145/3569951.3597583

A Further Study of Linux Kernel Hugepages on A64FX with FLASH, an Astrophysical Simulation Code

Authors: Catherine Feldman, Smeet Chheda, Alan C. Calder, Eva Siegmann, John Dey, Tony Curtis, Robert J. Harrison

Abstract: We present an expanded study of the performance of FLASH when using Linux Kernel Hugepages on Ookami, an HPE Apollo 80 A64FX platform. FLASH is a multi-scale, multi-physics simulation code written principally in modern Fortran and makes use of the PARAMESH library to manage a block-structured adaptive mesh. Our initial study used only the Fujitsu compiler to utilize standard hugepages (hp), but fu… ▽ More We present an expanded study of the performance of FLASH when using Linux Kernel Hugepages on Ookami, an HPE Apollo 80 A64FX platform. FLASH is a multi-scale, multi-physics simulation code written principally in modern Fortran and makes use of the PARAMESH library to manage a block-structured adaptive mesh. Our initial study used only the Fujitsu compiler to utilize standard hugepages (hp), but further investigation allowed us to utilize hp for multiple compilers by linking to the Fujitsu library libmpg and transparent hugepages (thp) by enabling it at the node level. By comparing the results of hardware counters and in-code timers, we found that hp and thp do not significantly impact the runtime performance of FLASH. Interestingly, there is a significant reduction in the TLB misses, differences in cache and memory access counters, and strange behavior is observed when using thp. △ Less

Submitted 8 September, 2023; originally announced September 2023.

Comments: 10 pages, 2 figures, 7 tables. Proceedings for Practice and Experience in Advanced Research Computing (PEARC '23), July 23--27, 2023, Portland, OR, USA

ACM Class: C.1.4; I.6.0; J.2

Journal ref: Practice and Experience in Advanced Research Computing (PEARC '23). Association for Computing Machinery, New York, NY, USA, 186-195. (July 2023)

arXiv:2207.13685 [pdf, ps, other]

On Using Linux Kernel Huge Pages with FLASH, an Astrophysical Simulation Code

Authors: Alan C. Calder, Catherine Feldman, Eva Siegmann, John Dey, Anthony Curtis, Smeet Chheda, Robert J. Harrison

Abstract: We present efforts at improving the performance of FLASH, a multi-scale, multi-physics simulation code principally for astrophysical applications, by using huge pages on Ookami, an HPE Apollo 80 A64FX platform. FLASH is written principally in modern Fortran and makes use of the PARAMESH library to manage a block-structured adaptive mesh. We explored options for enabling the use of huge pages with… ▽ More We present efforts at improving the performance of FLASH, a multi-scale, multi-physics simulation code principally for astrophysical applications, by using huge pages on Ookami, an HPE Apollo 80 A64FX platform. FLASH is written principally in modern Fortran and makes use of the PARAMESH library to manage a block-structured adaptive mesh. We explored options for enabling the use of huge pages with several compilers, but we were only able to successfully use huge pages when compiling with the Fujitsu compiler. The use of huge pages substantially reduced the number of translation lookaside buffer misses, but overall performance gains were marginal. △ Less

Submitted 27 July, 2022; originally announced July 2022.

Comments: 6 pages, 1 figure, accepted to Embracing Arm for HPC, An IEEE Cluster 2022 Workshop

Showing 1–4 of 4 results for author: Chheda, S