FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

Ye, Zihao; Chen, Lequn; Lai, Ruihang; Lin, Wuwei; Zhang, Yineng; Wang, Stephanie; Chen, Tianqi; Kasikci, Baris; Grover, Vinod; Krishnamurthy, Arvind; Ceze, Luis

arXiv:2501.01005 (cs)

[Submitted on 2 Jan 2025 (v1), last revised 21 Apr 2025 (this version, v2)]

Abstract: Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for high-throughput and low-latency inference. Diverse LLM applications demand flexible and high-performance attention solutions. We present FlashInfer: a customizable and efficient attention engine for LLM serving. FlashInfer tackles KV-cache storage heterogeneity using a block-sparse format and composable formats to optimize memory access and reduce redundancy. It also offers a customizable attention template, enabling adaptation to various settings through Just-In-Time (JIT) compilation. Additionally, FlashInfer's load-balanced scheduling algorithm adapts to the dynamism of user requests while maintaining compatibility with CUDAGraph, which requires static configuration. FlashInfer has been integrated into leading LLM serving frameworks such as SGLang, vLLM, and MLC-Engine. Comprehensive kernel-level and end-to-end evaluations demonstrate FlashInfer's ability to significantly boost kernel performance across diverse inference scenarios. Compared to state-of-the-art LLM serving solutions, FlashInfer achieves a 29-69% inter-token-latency reduction relative to compiler backends on an LLM serving benchmark, a 28-30% latency reduction for long-context inference, and a 13-17% speedup for LLM serving with parallel generation.
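To make the block-sparse view of the KV-cache more concrete, here is a minimal sketch, assuming a paged cache in which each request owns a list of fixed-size pages indexed in a CSR-like layout. The names `gather_kv_slots`, `indptr`, `indices`, and `last_page_len` are hypothetical and chosen for illustration; this is not FlashInfer's API and the abstract does not specify the exact layout.

```python
from typing import List


def gather_kv_slots(
    indptr: List[int],
    indices: List[int],
    last_page_len: List[int],
    page_size: int,
    request: int,
) -> List[int]:
    """Return the flat KV-cache slot ids owned by one request.

    Assumed layout (illustrative only):
      indptr[r]..indptr[r+1] delimits the pages of request r inside `indices`,
      indices[k] is a page id in a shared page pool,
      last_page_len[r] is the number of used slots in request r's final page.
    """
    start, end = indptr[request], indptr[request + 1]
    pages = indices[start:end]
    slots: List[int] = []
    for i, page in enumerate(pages):
        # Every page is full except possibly the last one.
        used = page_size if i < len(pages) - 1 else last_page_len[request]
        slots.extend(page * page_size + offset for offset in range(used))
    return slots


# Two requests drawing pages from one shared pool, 4 slots per page.
PAGE_SIZE = 4
indptr = [0, 2, 3]       # request 0 owns 2 pages, request 1 owns 1 page
indices = [5, 1, 2]      # page ids, not necessarily contiguous
last_page_len = [3, 2]   # used slots in each request's final page

slots_req0 = gather_kv_slots(indptr, indices, last_page_len, PAGE_SIZE, 0)
slots_req1 = gather_kv_slots(indptr, indices, last_page_len, PAGE_SIZE, 1)
```

Because each request's pages are addressed through an index array rather than stored contiguously, requests of very different lengths can share one pool, which is the kind of storage heterogeneity the abstract refers to.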

Comments:	Accepted by MLSys 2025, code available at this http URL
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2501.01005 [cs.DC]
	(or arXiv:2501.01005v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2501.01005

Submission history

From: Zihao Ye
[v1] Thu, 2 Jan 2025 02:02:20 UTC (1,326 KB)
[v2] Mon, 21 Apr 2025 20:10:11 UTC (1,376 KB)