-
GPU Sharing with Triples Mode
Authors:
Chansup Byun,
Albert Reuther,
LaToya Anderson,
William Arcand,
Bill Bergeron,
David Bestor,
Alexander Bonn,
Daniel Burrill,
Vijay Gadepally,
Michael Houle,
Matthew Hubbell,
Hayden Jananthan,
Michael Jones,
Piotr Luszczek,
Peter Michaleas,
Lauren Milechin,
Guillermo Morales,
Julie Mullen,
Andrew Prout,
Antonio Rosa,
Charles Yee,
Jeremy Kepner
Abstract:
Interest in AI/ML technologies has surged with the proliferation of generative AI applications such as ChatGPT. This trend has significantly increased demand for GPUs, which are the workhorses for training AI models. Because GPUs are costly and in short supply, HPC centers have a strong interest in optimizing GPU usage. The MIT Lincoln Laboratory Supercomputing Center (LLSC) has developed an easy-to-use GPU sharing feature supported by LLSC-developed tools, including LLsub and LLMapReduce. This approach overcomes some of the limitations of existing GPU sharing methods and lets users apply GPU sharing whenever possible while developing AI/ML models, running parametric studies on those models, or executing other GPU applications. In our initial experiments, GPU sharing with triples mode was easy to use and achieved significant improvements in GPU usage and throughput for certain types of AI applications.
Submitted 29 October, 2024;
originally announced October 2024.
-
LLload: An Easy-to-Use HPC Utilization Tool
Authors:
Chansup Byun,
Albert Reuther,
Julie Mullen,
LaToya Anderson,
William Arcand,
Bill Bergeron,
David Bestor,
Alexander Bonn,
Daniel Burrill,
Vijay Gadepally,
Michael Houle,
Matthew Hubbell,
Hayden Jananthan,
Michael Jones,
Piotr Luszczek,
Peter Michaleas,
Lauren Milechin,
Guillermo Morales,
Andrew Prout,
Antonio Rosa,
Charles Yee,
Jeremy Kepner
Abstract:
The increasing use and cost of high performance computing (HPC) require new easy-to-use tools that let HPC users and HPC systems engineers transparently understand resource utilization. The MIT Lincoln Laboratory Supercomputing Center (LLSC) has developed a simple command, LLload, to monitor and characterize HPC workloads. LLload plays an important role in identifying opportunities for better utilization of compute resources. It can monitor jobs both programmatically and interactively, and its options can characterize users' jobs so that they run more efficiently. This information helps users optimize HPC workloads and improve both CPU and GPU utilization, including through judicious oversubscription of computing resources. Preliminary results suggest significant improvement in GPU utilization and overall throughput with GPU overloading in some cases. By enabling users to observe and fix incorrect job submissions and inappropriate execution setups, LLload can increase resource usage and improve overall throughput. LLload is a lightweight, easy-to-use tool with which both HPC users and HPC systems engineers can monitor HPC workloads to improve system utilization and efficiency.
Submitted 28 October, 2024;
originally announced October 2024.
-
HPC with Enhanced User Separation
Authors:
Andrew Prout,
Albert Reuther,
Michael Houle,
Michael Jones,
Peter Michaleas,
LaToya Anderson,
William Arcand,
Bill Bergeron,
David Bestor,
Alex Bonn,
Daniel Burrill,
Chansup Byun,
Vijay Gadepally,
Matthew Hubbell,
Hayden Jananthan,
Piotr Luszczek,
Lauren Milechin,
Guillermo Morales,
Julie Mullen,
Antonio Rosa,
Charles Yee,
Jeremy Kepner
Abstract:
HPC systems used for research run a wide variety of software and workflows. This software is often written or modified by users to meet the needs of their research projects and is rarely built with security in mind. In this paper we explore several key techniques that the MIT Lincoln Laboratory Supercomputing Center has deployed on its systems to manage the security implications of these workflows. These techniques enforce separation of processes, filesystem access, network traffic, and accelerators so that every user feels like they are running on a personal HPC system.
Submitted 16 September, 2024;
originally announced September 2024.