ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models

Dey, Sombit; Zaech, Jan-Nico; Nikolov, Nikolay; Van Gool, Luc; Paudel, Danda Pani

arXiv:2409.15250 (cs)

[Submitted on 23 Sep 2024 (v1), last revised 20 May 2025 (this version, v3)]

Abstract: Recent progress in large language models and access to large-scale robotic datasets have sparked a paradigm shift in robotics models, transforming them into generalists able to adapt to various tasks, scenes, and robot modalities. Open Vision Language Action models, which showcase strong performance on a wide variety of tasks, are a large step for the community. In this work, we study the visual generalization capabilities of three existing robotic foundation models and propose a corresponding evaluation framework. Our study shows that the existing models are not robust to visual out-of-domain scenarios. This is potentially caused by limited variation in the training data and/or catastrophic forgetting, leading to domain limitations in the vision foundation models. We further explore OpenVLA, which uses two pre-trained vision foundation models and is therefore expected to generalize to out-of-domain experiments. However, we show that DINO-v2 suffers catastrophic forgetting inside OpenVLA, as evidenced by its failure on the task of depth regression. To overcome this visual catastrophic forgetting, we propose a gradual backbone reversal approach founded on model merging. This enables OpenVLA, which requires the adaptation of the visual backbones during initial training, to regain its visual generalization ability. Regaining this capability enables our ReVLA model to improve over OpenVLA by 77% and 66% for grasping and lifting, respectively, in visual out-of-domain (OOD) tasks. Comprehensive evaluations, episode rollouts, and model weights are available on the ReVLA Page.
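The abstract names a gradual backbone reversal founded on model merging but does not spell out the merging rule or the schedule. As an illustration only, the sketch below assumes the simplest form of model merging, a linear interpolation between the adapted backbone weights and the original pre-trained weights, with a linearly increasing mixing coefficient. The names `merge_weights`, `reversal_schedule`, and `alpha`, as well as the linear ramp itself, are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Toy stand-in for a state dict: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def merge_weights(adapted: Weights, pretrained: Weights, alpha: float) -> Weights:
    """Return (1 - alpha) * adapted + alpha * pretrained for every parameter.

    alpha = 0 keeps the adapted backbone unchanged; alpha = 1 fully restores
    the original pre-trained backbone. (Assumed merging rule, not the paper's.)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if adapted.keys() != pretrained.keys():
        raise ValueError("both backbones must expose the same parameter names")
    merged: Weights = {}
    for name, adapted_values in adapted.items():
        pretrained_values = pretrained[name]
        if len(adapted_values) != len(pretrained_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * a + alpha * p
            for a, p in zip(adapted_values, pretrained_values)
        ]
    return merged


def reversal_schedule(step: int, total_steps: int) -> float:
    """Hypothetical linear ramp of alpha from 0 to 1, clamped to [0, 1]."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(max(step / total_steps, 0.0), 1.0)


if __name__ == "__main__":
    adapted = {"patch_embed.weight": [2.0, 4.0]}
    pretrained = {"patch_embed.weight": [0.0, 0.0]}
    for step in (0, 5, 10):
        alpha = reversal_schedule(step, total_steps=10)
        print(step, alpha, merge_weights(adapted, pretrained, alpha))
```

In a real setting the merged weights would be loaded back into the visual backbone at each stage while the rest of the model continues training, so the policy can adapt as the backbone moves back toward its pre-trained state. How the actual method interleaves merging and training is described in the paper, not in this sketch.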

Comments:	Accepted at ICRA-2025, Atlanta
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Cite as:	arXiv:2409.15250 [cs.CV]
	(or arXiv:2409.15250v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2409.15250

Submission history

From: Sombit Dey
[v1] Mon, 23 Sep 2024 17:47:59 UTC (9,095 KB)
[v2] Thu, 13 Mar 2025 12:18:17 UTC (9,095 KB)
[v3] Tue, 20 May 2025 17:23:45 UTC (9,096 KB)
