DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs

Yao, Xiaozhe; Hu, Qinghao; Klimovic, Ana

arXiv:2312.05215 (cs)

[Submitted on 8 Dec 2023 (v1), last revised 25 Mar 2025 (this version, v3)]

Abstract: Fine-tuning large language models (LLMs) greatly improves model quality for downstream tasks. However, serving many fine-tuned LLMs concurrently is challenging due to the sporadic, bursty, and varying request patterns of different LLMs. To bridge this gap, we present DeltaZip, an LLM serving system that efficiently serves multiple full-parameter fine-tuned models concurrently by aggressively compressing model deltas by up to 10x while maintaining high model quality. The key insight behind this design is that fine-tuning results in small-magnitude changes to the pre-trained model. By co-designing the serving system with the compression algorithm, DeltaZip achieves 2x to 12x improvement in throughput compared to the state-of-the-art systems.
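The key insight in the abstract (fine-tuning changes the pre-trained weights only slightly) can be illustrated with a toy sketch. This is not DeltaZip's compression algorithm, which the abstract does not describe; it swaps in plain uniform 4-bit quantization as a stand-in. The synthetic weights, the perturbation size of 0.01, and the helper names `quantize`, `dequantize`, and `max_error` are all assumptions made for illustration.

```python
import random

def quantize(values, bits):
    """Uniform symmetric quantization: return (integer codes, scale)."""
    levels = 2 ** (bits - 1) - 1              # 7 for signed 4-bit codes
    max_abs = max(abs(v) for v in values)
    scale = max_abs / levels if max_abs > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]

def max_error(a, b):
    """Largest absolute element-wise difference between two lists."""
    return max(abs(x - y) for x, y in zip(a, b))

random.seed(0)
# Hypothetical "pre-trained" weights.
base = [random.gauss(0.0, 1.0) for _ in range(1000)]
# Assumption: fine-tuning perturbs every weight by a small amount.
finetuned = [w + random.gauss(0.0, 0.01) for w in base]

# The delta is what differs per fine-tuned model; the base is shared.
delta = [f - b for f, b in zip(finetuned, base)]

# Option A: quantize the full fine-tuned weights to 4 bits.
codes_full, scale_full = quantize(finetuned, bits=4)
recon_full = dequantize(codes_full, scale_full)

# Option B: quantize only the delta to 4 bits, then add the shared base back.
codes_delta, scale_delta = quantize(delta, bits=4)
recon_delta = [b + d for b, d in zip(base, dequantize(codes_delta, scale_delta))]

err_full = max_error(finetuned, recon_full)
err_delta = max_error(finetuned, recon_delta)
print(f"max error, quantizing full weights: {err_full:.6f}")
print(f"max error, quantizing delta only:   {err_delta:.6f}")
```

Because the delta spans a much narrower range than the full weights, the same 4-bit budget yields a far finer quantization step, so the reconstruction error in this toy setup is much smaller when only the delta is compressed.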

Comments:	EuroSys 2025
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Cite as:	arXiv:2312.05215 [cs.DC]
	(or arXiv:2312.05215v3 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2312.05215

Submission history

From: Xiaozhe Yao
[v1] Fri, 8 Dec 2023 18:07:05 UTC (3,920 KB)
[v2] Fri, 1 Nov 2024 21:56:48 UTC (801 KB)
[v3] Tue, 25 Mar 2025 14:48:01 UTC (804 KB)
