Accelerating Mobile Language Model via Speculative Decoding and NPU-Coordinated Execution

Chen, Zhiyang; Xu, Daliang; Shen, Haiyang; Xu, Mengwei; Wang, Shangguang; Ma, Yun

arXiv:2510.15312 (cs)

[Submitted on 17 Oct 2025 (v1), last revised 23 Oct 2025 (this version, v3)]

Abstract: Enhancing on-device large language models (LLMs) with contextual information from local data enables personalized and task-aware generation, powering use cases such as intelligent assistants and UI agents. While recent developments in neural processors have substantially improved the efficiency of prefill on mobile devices, the token-by-token generation process still suffers from high latency and limited hardware utilization because it is inherently memory-bound. This work presents a mobile inference framework that integrates speculative decoding with dynamic hardware scheduling to accelerate context-aware text generation on mobile devices. The framework introduces three synergistic components: (1) adaptive execution scheduling, which dynamically balances compute graphs between prefill and decoding phases; (2) context-aligned drafting, which improves speculative efficiency through lightweight online calibration to current tasks; and (3) hardware-efficient draft extension, which reuses and expands intermediate sequences to improve processing parallelism and reduce verification cost. Experiments on multiple smartphones and representative workloads show consistent improvements of up to 3.8x in generation speed and 4.7x in energy efficiency compared with existing mobile inference solutions. Component-level analysis further validates the contribution of each optimization.
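For readers unfamiliar with speculative decoding, the following is a minimal sketch of the generic greedy draft-then-verify loop that the abstract builds on. It is not the paper's implementation: the scheduling, calibration, and draft-extension components are not modeled, and every name here (`speculative_decode`, `greedy_decode`, `toy_target`, `toy_draft`) is a hypothetical stand-in, with toy integer functions in place of real models.

```python
from typing import Callable, List, Tuple

Token = int
# Assumption: a "model" is any function mapping a token context to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def greedy_decode(target: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[Token], max_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Greedy draft-then-verify. Returns the tokens and the number of verification passes."""
    out = list(prompt)
    goal = len(prompt) + max_new
    passes = 0
    while len(out) < goal:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))

        # 2. The target checks every proposed position. A real system does this
        #    in one batched forward pass; here it is simulated call by call.
        passes += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token and discard the rest.
                accepted.append(expected)
                break
        else:
            # Every draft token matched, so the same pass yields one extra token.
            if len(out) + len(accepted) < goal:
                accepted.append(target(out + accepted))
        out.extend(accepted)
    return out, passes


def toy_target(ctx: List[Token]) -> Token:
    # Toy deterministic "model" used only for illustration.
    return (ctx[-1] * 3 + 1) % 11


def toy_draft(ctx: List[Token]) -> Token:
    # Agrees with the target except when the context length is a multiple of 3.
    guess = toy_target(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 11
```

Because each accepted token is either confirmed or supplied by the target, the output matches plain greedy decoding with the target alone; the gain comes from needing fewer target passes than generated tokens whenever the draft agrees often enough.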

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2510.15312 [cs.CL]
	(or arXiv:2510.15312v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2510.15312

Submission history

From: Zhiyang Chen
[v1] Fri, 17 Oct 2025 04:59:43 UTC (689 KB)
[v2] Mon, 20 Oct 2025 01:50:51 UTC (689 KB)
[v3] Thu, 23 Oct 2025 09:30:23 UTC (689 KB)
