Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding

Li, Jinze; Xu, Yixing; Huang, Haiduo; Yin, Xuanwu; Li, Dong; Ngai, Edith C. H.; Barsoum, Emad

arXiv:2503.10135 (cs)

[Submitted on 13 Mar 2025 (v1), last revised 30 Jun 2025 (this version, v2)]

Abstract: Speculative decoding (SPD) aims to accelerate the auto-regressive token generation process of a target Large Language Model (LLM). Some approaches employ a draft model with multiple heads to predict a sequence of future tokens, where each head handles one token in the sequence. The target LLM verifies the predicted sequence and accepts aligned tokens, enabling efficient multi-token generation. However, existing methods assume that all tokens within a sequence are equally important, employing identical head structures and relying on a single generation paradigm, either serial or parallel. In contrast, we theoretically demonstrate that initial tokens in the draft sequence are more important than later ones. Building on this insight, we propose Gumiho, a hybrid model combining serial and parallel heads. Specifically, given the critical importance of early tokens, we employ a sophisticated Transformer architecture for the early draft heads in a serial configuration to improve accuracy. For later tokens, we utilize multiple lightweight MLP heads operating in parallel to enhance efficiency. By allocating more advanced model structures and longer running times to the early heads, Gumiho achieves improved overall performance. Experimental results show that our method outperforms existing approaches, validating its effectiveness.
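
The claim that early draft tokens matter more than later ones can be illustrated with a small sketch. It is not the paper's proof: it assumes greedy verification (a draft token is accepted only if it equals the target's token, and acceptance stops at the first mismatch) and, as a simplification, independent per-position acceptance probabilities. The function names and the probability values are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def accepted_prefix_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count the leading draft tokens that match the target's tokens.

    Assumption: greedy verification, where acceptance stops at the
    first mismatch and all later draft tokens are discarded.
    """
    accepted = 0
    for draft_token, target_token in zip(draft, target):
        if draft_token != target_token:
            break
        accepted += 1
    return accepted


def expected_accepted_length(accept_probs: Sequence[float]) -> float:
    """Expected number of accepted draft tokens.

    Assumption: position i is accepted with probability accept_probs[i],
    independently of the others. Position k only counts if every earlier
    position was accepted, so the expectation is the sum of prefix products.
    """
    expected = 0.0
    prefix_prob = 1.0
    for p in accept_probs:
        prefix_prob *= p
        expected += prefix_prob
    return expected


# Hypothetical acceptance probabilities for a four-token draft.
baseline = [0.8, 0.8, 0.8, 0.8]
boost_first = [0.9, 0.8, 0.8, 0.8]  # same accuracy gain, spent on the first head
boost_last = [0.8, 0.8, 0.8, 0.9]   # same accuracy gain, spent on the last head

if __name__ == "__main__":
    print(accepted_prefix_length([5, 7, 9, 2], [5, 7, 1, 2]))
    for name, probs in [("baseline", baseline),
                        ("boost_first", boost_first),
                        ("boost_last", boost_last)]:
        print(name, round(expected_accepted_length(probs), 4))
```

Under these assumptions the first position's probability multiplies every term of the sum, while the last position's probability appears only in the final term. The same accuracy gain therefore raises the expected accepted length more when it is spent on the first head than on the last, which is consistent with giving the early heads the heavier serial Transformer and the later heads lightweight parallel MLPs.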

Comments:	Accepted to the 42nd International Conference on Machine Learning (ICML 2025). Code: this https URL
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2503.10135 [cs.CL]
	(or arXiv:2503.10135v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2503.10135

Submission history

From: Jinze Li
[v1] Thu, 13 Mar 2025 07:55:38 UTC (1,971 KB)
[v2] Mon, 30 Jun 2025 04:51:00 UTC (1,972 KB)
