Empirically-Calibrated H100 Node Power Models for Reducing Uncertainty in AI Training Energy Estimation
Authors:
Alex C. Newkirk,
Jared Fernandez,
Jonathan Koomey,
Imran Latif,
Emma Strubell,
Arman Shehabi,
Constantine Samaras
Abstract:
As AI's energy demand continues to grow, it is critical to enhance the understanding of characteristics of this demand, to improve grid infrastructure planning and environmental assessment. By combining empirical measurements from Brookhaven National Laboratory during AI training on 8-GPU H100 systems with open-source benchmarking data, we develop statistical models relating computational intensit…
▽ More
As AI's energy demand continues to grow, it is critical to enhance the understanding of characteristics of this demand, to improve grid infrastructure planning and environmental assessment. By combining empirical measurements from Brookhaven National Laboratory during AI training on 8-GPU H100 systems with open-source benchmarking data, we develop statistical models relating computational intensity to node-level power consumption. We measure the gap between manufacturer-rated thermal design power (TDP) and actual power demand during AI training. Our analysis reveals that even computationally intensive workloads operate at only 76% of the 10.2 kW TDP rating. Our architecture-specific model, calibrated to floating-point operations, predicts energy consumption with 11.4% mean absolute percentage error, significantly outperforming TDP-based approaches (27-37% error). We identified distinct power signatures between transformer and CNN architectures, with transformers showing characteristic fluctuations that may impact grid stability.
△ Less
Submitted 17 June, 2025;
originally announced June 2025.
Empirical Measurements of AI Training Power Demand on a GPU-Accelerated Node
Authors:
Imran Latif,
Alex C. Newkirk,
Matthew R. Carbone,
Arslan Munir,
Yuewei Lin,
Jonathan Koomey,
Xi Yu,
Zhiuha Dong
Abstract:
The expansion of artificial intelligence (AI) applications has driven substantial investment in computational infrastructure, especially by cloud computing providers. Quantifying the energy footprint of this infrastructure requires models parameterized by the power demand of AI hardware during training. We empirically measured the instantaneous power draw of an 8-GPU NVIDIA H100 HGX node during th…
▽ More
The expansion of artificial intelligence (AI) applications has driven substantial investment in computational infrastructure, especially by cloud computing providers. Quantifying the energy footprint of this infrastructure requires models parameterized by the power demand of AI hardware during training. We empirically measured the instantaneous power draw of an 8-GPU NVIDIA H100 HGX node during the training of open-source image classifier (ResNet) and large-language models (Llama2-13b). The maximum observed power draw was approximately 8.4 kW, 18% lower than the manufacturer-rated 10.2 kW, even with GPUs near full utilization. Holding model architecture constant, increasing batch size from 512 to 4096 images for ResNet reduced total training energy consumption by a factor of 4. These findings can inform capacity planning for data center operators and energy use estimates by researchers. Future work will investigate the impact of cooling technology and carbon-aware scheduling on AI workload energy consumption.
△ Less
Submitted 20 December, 2024; v1 submitted 11 December, 2024;
originally announced December 2024.