LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important

Liang, Manlai; Zhang, JiaMing; Li, Xiong; Li, Jinlong

Computer Science > Machine Learning

arXiv:2504.04704 (cs)

[Submitted on 7 Apr 2025]

Abstract: The growing size of the Key-Value (KV) cache during long-context inference with Large Language Models is the main obstacle to balancing deployment cost against task accuracy. To reduce the KV cache size in such scenarios, most previous efforts have relied on attention weights to evict non-critical cache tokens. These methods come with a trade-off: they usually require major modification of the inference infrastructure and incur significant computation overhead. Based on the fact that Large Language Models are autoregressive, we propose {\it LagKV}, a KV allocation strategy that relies only on straightforward comparisons among the KV entries themselves. It is a completely attention-free method that integrates easily into mainstream inference platforms and delivers performance comparable to that of more complicated KV compression methods. Results on LongBench and PasskeyRetrieval show that our approach achieves nearly zero loss at a compression ratio of $2\times$ and retains $\approx 90\%$ of the original model performance at $8\times$. In particular, on the 64-digit passkey retrieval task, our method outperforms the attention-weight-based method $H_2O$ by over $60\%$ at the same compression ratios. Our code is available at \url{this https URL}.
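
The abstract does not spell out the scoring rule, so the following is only a minimal sketch of what an attention-free, lag-relative selection could look like, not the authors' implementation. It assumes a single attention head, that each chunk of `lag` tokens is scored against the chunk that follows it (per-channel min-max normalization, then the spread across channels, then a softmax over the chunk), and that a few leading "sink" tokens and the trailing tokens are always kept. The names `lag_relative_scores`, `select_tokens`, `sink`, `lag`, and `ratio` are hypothetical.

```python
import math
from statistics import pstdev


def lag_relative_scores(chunk, ref):
    """Score tokens in `chunk` relative to `ref` (assumed: the next chunk).

    Both arguments are lists of equal-length float vectors, one per token.
    """
    dims = range(len(ref[0]))
    lo = [min(tok[d] for tok in ref) for d in dims]
    hi = [max(tok[d] for tok in ref) for d in dims]
    raw = []
    for tok in chunk:
        # Normalize each channel by the reference chunk's range (assumption).
        normed = [(tok[d] - lo[d]) / (hi[d] - lo[d] + 1e-6) for d in dims]
        # Spread across channels serves as the raw importance signal (assumption).
        raw.append(pstdev(normed))
    # Softmax over the tokens of this chunk.
    peak = max(raw)
    exps = [math.exp(r - peak) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]


def select_tokens(keys, values, sink=1, lag=4, ratio=2):
    """Return the sorted indices of the cache entries to keep."""
    n = len(keys)
    keep = list(range(min(sink, n)))  # leading sink tokens are always kept
    start = sink
    # Compress every full chunk that has a full following chunk as reference.
    while start + 2 * lag <= n:
        mid, end = start + lag, start + 2 * lag
        k_scores = lag_relative_scores(keys[start:mid], keys[mid:end])
        v_scores = lag_relative_scores(values[start:mid], values[mid:end])
        scores = [a + b for a, b in zip(k_scores, v_scores)]
        top = sorted(range(lag), key=lambda i: scores[i], reverse=True)
        keep.extend(start + i for i in sorted(top[: lag // ratio]))
        start = mid
    keep.extend(range(start, n))  # trailing tokens have no later reference
    return keep
```

Under these assumptions, selection only compares keys and values with their neighbors, so no attention weights are needed. With `ratio=2`, each compressed chunk retains half of its tokens, while the sink and trailing tokens stay untouched.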

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2504.04704 [cs.LG]
	(or arXiv:2504.04704v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2504.04704

Submission history

From: Manlai Liang
[v1] Mon, 7 Apr 2025 03:22:15 UTC (165 KB)
