Uncertainty-Aware Hybrid Inference with On-Device Small and Remote Large Language Models

Oh, Seungeun; Kim, Jinhyuk; Park, Jihong; Ko, Seung-Woo; Quek, Tony Q. S.; Kim, Seong-Lyun

Computer Science > Machine Learning

arXiv:2412.12687 (cs)

[Submitted on 17 Dec 2024 (v1), last revised 18 Mar 2025 (this version, v3)]

Abstract: This paper studies a hybrid language model (HLM) architecture that integrates a small language model (SLM) operating on a mobile device with a large language model (LLM) hosted at the base station (BS) of a wireless network. The HLM token generation process follows the speculative inference principle: the SLM's vocabulary distribution is uploaded to the LLM, which either accepts or rejects it, with rejected tokens being resampled by the LLM. While this approach ensures alignment between the vocabulary distributions of the SLM and LLM, it suffers from low token throughput due to uplink transmission and the computation costs of running both language models. To address this, we propose a novel HLM structure coined Uncertainty-aware opportunistic HLM (U-HLM), wherein the SLM locally measures its output uncertainty and skips both uplink transmissions and LLM operations for tokens that are likely to be accepted. This opportunistic skipping is enabled by our empirical finding of a linear correlation between the SLM's uncertainty and the LLM's rejection probability. We analytically derive the uncertainty threshold and evaluate its expected risk of rejection. Simulations show that U-HLM reduces uplink transmissions and LLM computations by 45.93%, while achieving up to 97.54% of the LLM's inference accuracy and 2.54$\times$ faster token throughput than HLM without skipping.
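The per-token decision described in the abstract can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not specify the uncertainty measure or the threshold value, so normalized entropy stands in as a placeholder, and the accept/reject step uses the standard speculative-sampling rule (accept with probability min(1, y/x), otherwise resample from the normalized positive part of the difference). All function names and parameters here are hypothetical.

```python
import math
import random


def sample(dist, rng):
    """Draw a token index from a probability vector."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def entropy_uncertainty(dist):
    """Placeholder uncertainty (assumption): Shannon entropy scaled to [0, 1]."""
    h = -sum(p * math.log(p) for p in dist if p > 0.0)
    return h / math.log(len(dist))


def verify(draft, slm_dist, llm_dist, rng):
    """LLM-side check using the standard speculative-sampling rule (assumed)."""
    x, y = slm_dist[draft], llm_dist[draft]
    if rng.random() < min(1.0, y / x):
        return draft, True
    # Rejected: resample from the positive part of (LLM - SLM).
    residual = [max(0.0, q - p) for p, q in zip(slm_dist, llm_dist)]
    if sum(residual) == 0.0:  # defensive guard for unnormalized inputs
        return draft, True
    return sample(residual, rng), False


def u_hlm_step(slm_dist, llm_dist_fn, threshold, rng):
    """One token step: skip uplink and LLM work when uncertainty is low."""
    draft = sample(slm_dist, rng)
    if entropy_uncertainty(slm_dist) <= threshold:
        return draft, "skipped"  # no upload, LLM never invoked
    # llm_dist_fn is only called here, modelling the uplink + LLM forward pass.
    token, accepted = verify(draft, slm_dist, llm_dist_fn(), rng)
    return token, "accepted" if accepted else "resampled"
```

Passing the LLM distribution as a callable makes the saving explicit: on the skipped path the remote model is never queried, which is where the reported reduction in uplink transmissions and LLM computations would come from.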

Comments:	7 pages, 6 figures; to be presented at IEEE International Conference on Machine Learning for Communication and Networking (ICMLCN) 2025
Subjects:	Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Cite as:	arXiv:2412.12687 [cs.LG]
	(or arXiv:2412.12687v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2412.12687

Submission history

From: Seungeun Oh [view email]
[v1] Tue, 17 Dec 2024 09:08:18 UTC (37,279 KB)
[v2] Wed, 18 Dec 2024 08:14:35 UTC (37,277 KB)
[v3] Tue, 18 Mar 2025 10:50:58 UTC (37,279 KB)
