SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving

Kakolyris, Andreas Kosmas; Masouros, Dimosthenis; Vavaroutsos, Petros; Xydis, Sotirios; Soudris, Dimitrios

arXiv:2408.05235 (cs)

[Submitted on 5 Aug 2024]

Abstract: As Large Language Models (LLMs) gain traction, their reliance on power-hungry GPUs drives ever-increasing energy demand, raising environmental and monetary concerns. Inference dominates LLM workloads, presenting a critical challenge for providers: minimizing energy costs under Service-Level Objectives (SLOs) that ensure optimal user experience. In this paper, we present throttLL'eM, a framework that reduces energy consumption while meeting SLOs through the use of instance and GPU frequency scaling. throttLL'eM features mechanisms that project future KV cache usage and batch size. Leveraging a Machine-Learning (ML) model that receives these projections as inputs, throttLL'eM manages performance at the iteration level to satisfy SLOs with reduced frequencies and instance sizes. We show that the proposed ML model achieves R² scores greater than 0.97 and mispredicts performance by less than 1 iteration per second on average. Experimental results on LLM inference traces show that throttLL'eM achieves up to 43.8% lower energy consumption and an energy efficiency improvement of at least 1.71× under SLOs, when compared to NVIDIA's Triton server.
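The selection step the abstract describes (feed projected batch size and KV cache usage to a performance model, then run at a reduced frequency that still satisfies the SLO) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the frequency list, the SLO expressed as a required iterations-per-second rate, and the toy predictor standing in for the learned ML model are all assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence


def pick_lowest_frequency(
    frequencies_mhz: Sequence[int],
    predict_ips: Callable[[int, int, int], float],
    projected_batch_size: int,
    projected_kv_blocks: int,
    required_ips: float,
) -> Optional[int]:
    """Return the lowest candidate frequency whose predicted iterations/s
    meets the required rate, or None if no candidate does."""
    for freq in sorted(frequencies_mhz):
        predicted = predict_ips(freq, projected_batch_size, projected_kv_blocks)
        if predicted >= required_ips:
            return freq
    return None


def toy_predictor(freq_mhz: int, batch_size: int, kv_blocks: int) -> float:
    # Hypothetical stand-in for the learned performance model:
    # throughput rises with frequency and falls as batch size and
    # KV cache usage grow. The coefficients are made up.
    return freq_mhz / (20.0 + 0.5 * batch_size + 0.01 * kv_blocks)


if __name__ == "__main__":
    candidates = [600, 900, 1200, 1500]  # assumed frequency steps in MHz
    choice = pick_lowest_frequency(
        candidates,
        toy_predictor,
        projected_batch_size=8,
        projected_kv_blocks=1000,
        required_ips=30.0,
    )
    print(choice)  # 1200
```

Scanning candidates in ascending order returns the first (lowest) frequency that meets the target, which is the energy-saving choice under the assumption that lower frequency means lower power draw. When no candidate qualifies, the sketch returns None, leaving the fallback (for example, scaling the instance size) to the caller.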

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Cite as:	arXiv:2408.05235 [cs.DC]
	(or arXiv:2408.05235v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2408.05235

Submission history

From: Andreas Kosmas Kakolyris
[v1] Mon, 5 Aug 2024 09:07:06 UTC (14,542 KB)
