QUART-Online: Latency-Free Large Multimodal Language Model for Quadruped Robot Learning

Tong, Xinyang; Ding, Pengxiang; Fan, Yiguo; Wang, Donglin; Zhang, Wenjie; Cui, Can; Sun, Mingyang; Zhao, Han; Zhang, Hongyin; Dang, Yonghao; Huang, Siteng; Lyu, Shangke

arXiv:2412.15576 (cs)

[Submitted on 20 Dec 2024 (v1), last revised 27 May 2025 (this version, v5)]

Abstract: This paper addresses the inherent inference latency challenges associated with deploying multimodal large language models (MLLMs) in quadruped vision-language-action (QUAR-VLA) tasks. Our investigation reveals that conventional parameter reduction techniques ultimately impair the performance of the language foundation model during the action instruction tuning phase, making them unsuitable for this purpose. We introduce a novel latency-free quadruped MLLM, dubbed QUART-Online, designed to enhance inference efficiency without degrading the performance of the language foundation model. By incorporating Action Chunk Discretization (ACD), we compress the original action representation space, mapping continuous action values onto a smaller set of discrete representative vectors while preserving critical information. Subsequently, we fine-tune the MLLM to integrate vision, language, and compressed actions into a unified semantic space. Experimental results demonstrate that QUART-Online operates in tandem with the existing MLLM system, achieving real-time inference in sync with the underlying controller frequency and significantly boosting the success rate across various tasks by 65%. Our project page is this https URL.
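
The abstract describes ACD only at a high level: continuous action values are mapped onto a smaller set of discrete representative vectors. As a rough illustration of that general idea, and not the authors' implementation, the sketch below groups consecutive actions into chunks and quantizes each chunk to its nearest entry in a codebook fitted with plain k-means. The use of k-means, the chunk length, the codebook size, and all function names (`chunk_actions`, `fit_codebook`, `nearest_code`) are assumptions made for this example.

```python
import random
from typing import List, Sequence

Vector = List[float]


def chunk_actions(actions: Sequence[Sequence[float]], chunk_len: int) -> List[Vector]:
    """Flatten each run of `chunk_len` consecutive action vectors into one vector.

    Trailing actions that do not fill a whole chunk are dropped.
    """
    chunks: List[Vector] = []
    for start in range(0, len(actions) - chunk_len + 1, chunk_len):
        flat = [v for step in actions[start:start + chunk_len] for v in step]
        chunks.append(flat)
    return chunks


def nearest_code(chunk: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook vector closest to `chunk` (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(chunk, codebook[i])),
    )


def fit_codebook(chunks: Sequence[Vector], num_codes: int,
                 iters: int = 10, seed: int = 0) -> List[Vector]:
    """Fit a codebook with plain k-means (an assumption, not the paper's procedure)."""
    rng = random.Random(seed)
    codebook = [list(c) for c in rng.sample(list(chunks), num_codes)]
    for _ in range(iters):
        buckets: List[List[Vector]] = [[] for _ in range(num_codes)]
        for c in chunks:
            buckets[nearest_code(c, codebook)].append(c)
        for i, bucket in enumerate(buckets):
            if bucket:  # keep the old code if nothing was assigned to it
                codebook[i] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return codebook


if __name__ == "__main__":
    # Toy data: 40 two-dimensional actions (purely synthetic).
    rng = random.Random(1)
    actions = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(40)]
    chunks = chunk_actions(actions, chunk_len=4)       # 10 chunks of dimension 8
    codebook = fit_codebook(chunks, num_codes=3)
    tokens = [nearest_code(c, codebook) for c in chunks]  # discrete action tokens
    decoded = [codebook[t] for t in tokens]               # approximate chunks
    print(tokens)
```

In a sketch like this, each integer token stands in for a whole chunk of actions, so a model that emits one token per step would produce several control actions at once; how QUART-Online actually learns and decodes its discrete vectors is specified in the paper, not here.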

Comments:	Accepted to ICRA 2025; GitHub page: this https URL
Subjects:	Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2412.15576 [cs.RO]
	(or arXiv:2412.15576v5 [cs.RO] for this version)
	https://doi.org/10.48550/arXiv.2412.15576

Submission history

From: Xinyang Tong
[v1] Fri, 20 Dec 2024 05:17:06 UTC (15,329 KB)
[v2] Mon, 23 Dec 2024 06:06:17 UTC (4,475 KB)
[v3] Tue, 11 Mar 2025 14:09:50 UTC (4,855 KB)
[v4] Thu, 24 Apr 2025 08:00:36 UTC (4,855 KB)
[v5] Tue, 27 May 2025 07:05:44 UTC (4,856 KB)