DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

Wen, Junjie; Zhu, Yichen; Li, Jinming; Tang, Zhibin; Shen, Chaomin; Feng, Feifei

arXiv:2502.05855 (cs)

[Submitted on 9 Feb 2025 (v1), last revised 9 Aug 2025 (this version, v3)]

Abstract: Enabling robots to perform diverse tasks across varied environments is a central challenge in robot learning. While vision-language-action (VLA) models have shown promise for generalizable robot skills, realizing their full potential requires addressing limitations in action representation and efficient training. Current VLA models often focus on scaling the vision-language model (VLM) component, while the action space representation remains a critical bottleneck. This paper introduces DexVLA, a novel framework designed to enhance the efficiency and generalization capabilities of VLAs for complex, long-horizon tasks across diverse robot embodiments. DexVLA features a novel diffusion-based action expert, scaled to one billion parameters, designed for cross-embodiment learning. A novel embodiment curriculum learning strategy facilitates efficient training: (1) pre-training the diffusion expert that is separable from the VLA on cross-embodiment data, (2) aligning the VLA model to specific embodiments, and (3) post-training for rapid adaptation to new tasks. We conduct comprehensive experiments across multiple embodiments, including single-arm, bimanual, and dexterous hand, demonstrating DexVLA's adaptability to challenging tasks without task-specific adaptation, its ability to learn dexterous skills on novel embodiments with limited data, and its capacity to complete complex, long-horizon tasks using only direct language prompting, such as laundry folding. In all settings, our method demonstrates superior performance compared to state-of-the-art models like Octo, OpenVLA, and Diffusion Policy.

Comments:	The webpage is at this https URL. DexVLA has been accepted to CoRL 2025.
Subjects:	Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2502.05855 [cs.RO]
	(or arXiv:2502.05855v3 [cs.RO] for this version)
	https://doi.org/10.48550/arXiv.2502.05855

Submission history

From: Yichen Zhu
[v1] Sun, 9 Feb 2025 11:25:56 UTC (15,891 KB)
[v2] Tue, 13 May 2025 10:55:53 UTC (34,551 KB)
[v3] Sat, 9 Aug 2025 10:58:17 UTC (34,552 KB)
