NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference

Jiang, Xuanlin; Zhou, Yang; Cao, Shiyi; Stoica, Ion; Yu, Minlan

arXiv:2411.01142 (cs)

[Submitted on 2 Nov 2024]

Abstract: Online LLM inference powers many exciting applications, such as intelligent chatbots and autonomous agents. Modern LLM inference engines rely heavily on request batching to improve throughput and make inference cost-efficient on expensive GPU accelerators. However, limited GPU memory constrains the batch size achievable in practice, leaving significant GPU compute resources idle.
We present NEO, an online LLM inference system that offloads part of the attention compute and KV cache state from the GPU to the local host CPU, effectively increasing the GPU batch size and thus inference throughput. To this end, NEO proposes asymmetric GPU-CPU pipelining and load-aware scheduling to balance GPU and CPU loads and fully utilize their compute and memory resources. We evaluate NEO on a wide range of workloads (code generation, text summarization), GPUs (T4, A10G, H100), and LLMs (7B, 8B, 70B). NEO achieves up to 7.5$\times$, 26%, and 14% higher throughput than a GPU-only approach on T4, A10G, and H100 GPUs, respectively, while maintaining the same latency; with more powerful CPUs, NEO achieves up to a 79.3% throughput gain on the A10G GPU.
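
The memory argument in the abstract can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the paper: the model shape (32 layers, 32 KV heads, head dimension 128, fp16), the 2 GiB GPU budget for KV cache, the 30 GiB of host memory, and the function names are all hypothetical. It uses only the standard KV cache sizing rule (keys plus values, per layer, per KV head) and ignores the compute balance between GPU and CPU that NEO's scheduling addresses.

```python
GIB = 1024 ** 3

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def max_batch_size(kv_budget_bytes, tokens_per_request, bytes_per_token):
    """Largest number of concurrent requests whose KV cache fits the budget."""
    return kv_budget_bytes // (tokens_per_request * bytes_per_token)

# Hypothetical 7B-class configuration in fp16 (assumed, not from the paper).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

# Assumed budgets: 2 GiB of GPU memory left for KV cache, 30 GiB of host memory.
gpu_only = max_batch_size(2 * GIB, tokens_per_request=1024, bytes_per_token=per_token)
with_host = max_batch_size((2 + 30) * GIB, tokens_per_request=1024, bytes_per_token=per_token)

print(per_token, gpu_only, with_host)
```

Under these assumed numbers, each token needs 0.5 MiB of KV cache, so the GPU-only budget holds 4 requests of 1024 tokens while the combined budget holds 64. Whether the larger batch actually raises throughput depends on how well the CPU keeps up with its share of the attention work.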

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2411.01142 [cs.DC]
	(or arXiv:2411.01142v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2411.01142

Submission history

From: Yang Zhou
[v1] Sat, 2 Nov 2024 05:15:44 UTC (460 KB)
