Towards More Economical Context-Augmented LLM Generation by Reusing Stored KV Cache

Li, Hanchen; Liu, Yuhan; Cheng, Yihua; Du, Kuntai; Jiang, Junchen

arXiv:2503.14647 (cs)

[Submitted on 18 Mar 2025]

Abstract: Across large language model (LLM) applications, we observe an emerging trend of reusing KV caches to avoid the prefill delay of reprocessing input text that repeats across different LLM inputs. This has led to a broad design space, ranging from colocating stored KV caches with (or close to) GPUs to various KV cache compression techniques. However, a key question remains unanswered: can these delay reductions also be economically favorable? Specifically, we ask whether a developer can use public cloud services to store precomputed KV caches and reuse them to reduce delay without incurring higher costs in compute, storage, and network. To answer this question, we propose a validated analytical model for the cloud cost (in compute, storage, and network) of storing and reusing KV caches, based on workload parameters such as reuse frequency, generated text length, and model size. Preliminary results show that KV cache reuse can save both delay and cloud cost across a range of long-context workloads. We call for more effort toward building more economical context-augmented LLM generation through KV cache reuse.
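
The kind of comparison the abstract describes can be illustrated with a minimal sketch. This is not the paper's analytical model: every name below (`Prices`, `kv_cache_gb`, `recompute_cost`, `reuse_cost`) and every parameter is a hypothetical chosen for illustration, and any prices passed in are placeholders rather than real cloud rates. The sketch assumes the cost of decoding is the same with or without reuse and so leaves it out, and it ignores GPU time spent waiting for a cache to load.

```python
from dataclasses import dataclass


@dataclass
class Prices:
    """Hypothetical unit prices; substitute real cloud rates before use."""
    gpu_per_hour: float          # cost of one GPU-hour
    storage_per_gb_hour: float   # cost of keeping 1 GB stored for one hour
    network_per_gb: float        # cost of transferring 1 GB to the GPU node


def kv_cache_gb(context_tokens: int, num_layers: int, num_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """Size of the KV cache for a context, in GB (1e9 bytes).

    The leading factor of 2 accounts for storing both keys and values.
    """
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1e9


def prefill_gpu_hours(context_tokens: int, prefill_tokens_per_sec: float) -> float:
    """GPU time to prefill the context once, assuming a fixed throughput."""
    return context_tokens / prefill_tokens_per_sec / 3600


def recompute_cost(requests: int, context_tokens: int,
                   prefill_tokens_per_sec: float, prices: Prices) -> float:
    """Baseline: every request prefills the shared context from scratch."""
    hours = prefill_gpu_hours(context_tokens, prefill_tokens_per_sec)
    return requests * hours * prices.gpu_per_hour


def reuse_cost(requests: int, context_tokens: int,
               prefill_tokens_per_sec: float, cache_gb: float,
               storage_hours: float, prices: Prices) -> float:
    """Reuse: prefill once, keep the cache stored, load it for later requests."""
    compute = (prefill_gpu_hours(context_tokens, prefill_tokens_per_sec)
               * prices.gpu_per_hour)
    storage = cache_gb * storage_hours * prices.storage_per_gb_hour
    network = max(requests - 1, 0) * cache_gb * prices.network_per_gb
    return compute + storage + network
```

Under these assumptions, reuse is cheaper whenever the storage and transfer charges stay below the cost of the prefills it avoids, which is why reuse frequency and context length appear as key workload parameters.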

Subjects:	Networking and Internet Architecture (cs.NI)
Cite as:	arXiv:2503.14647 [cs.NI]
	(or arXiv:2503.14647v1 [cs.NI] for this version)
	https://doi.org/10.48550/arXiv.2503.14647

Submission history

From: Hanchen Li
[v1] Tue, 18 Mar 2025 18:52:03 UTC (10,134 KB)
