Efficient Large Foundation Model Inference: A Perspective From Model and System Co-Design

Liu, Dong; Lai, Zhixin; Wang, Yite; Wu, Jing; Yu, Yanxuan; Wan, Zhongwei; Lengerich, Benjamin; Wu, Ying Nian

arXiv:2409.01990v2 (cs)

[Submitted on 3 Sep 2024 (v1), revised 11 Dec 2024 (this version, v2), latest version 14 Apr 2025 (v5)]

Abstract: As Large Language Models (LLMs) become popular, the need for efficient LLM design grows. Although LLMs produce excellent output, contemporary LLMs still suffer from slow inference and high memory consumption. This paper focuses on modern techniques for efficient LLM inference and presents them from two perspectives: model design and system design. These methods optimize LLM inference in different ways to save computational resources, making LLMs more efficient, affordable, and accessible.

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Cite as:	arXiv:2409.01990 [cs.DC]
	(or arXiv:2409.01990v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2409.01990

Submission history

From: Dong Liu
[v1] Tue, 3 Sep 2024 15:35:01 UTC (7 KB)
[v2] Wed, 11 Dec 2024 11:39:41 UTC (71 KB)
[v3] Mon, 13 Jan 2025 10:02:27 UTC (104 KB)
[v4] Mon, 24 Feb 2025 06:57:40 UTC (124 KB)
[v5] Mon, 14 Apr 2025 07:09:15 UTC (147 KB)
