RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation

Han, Mingfei; Ma, Liang; Zhumakhanova, Kamila; Radionova, Ekaterina; Zhang, Jingyi; Chang, Xiaojun; Liang, Xiaodan; Laptev, Ivan

arXiv:2412.08591 (cs)

[Submitted on 11 Dec 2024 (v1), last revised 19 Mar 2025 (this version, v2)]

Abstract: Vision-and-Language Navigation (VLN) suffers from the limited diversity and scale of training data, primarily constrained by the manual curation of existing simulators. To address this, we introduce RoomTour3D, a video-instruction dataset derived from web-based room tour videos that capture real-world indoor spaces and human walking demonstrations. Unlike existing VLN datasets, RoomTour3D leverages the scale and diversity of online videos to generate open-ended human walking trajectories and open-world navigable instructions. To compensate for the lack of navigation data in online videos, we perform 3D reconstruction and obtain 3D trajectories of walking paths augmented with additional information on the room types, object locations and 3D shape of surrounding scenes. Our dataset includes $\sim$100K open-ended description-enriched trajectories with $\sim$200K instructions, and 17K action-enriched trajectories from 1847 room tour environments. We demonstrate experimentally that RoomTour3D enables significant improvements across multiple VLN tasks including CVDN, SOON, R2R, and REVERIE. Moreover, RoomTour3D facilitates the development of trainable zero-shot VLN agents, showcasing the potential and challenges of advancing towards open-world navigation.

Comments:	CVPR2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Cite as:	arXiv:2412.08591 [cs.CV]
	(or arXiv:2412.08591v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.08591

Submission history

From: Mingfei Han
[v1] Wed, 11 Dec 2024 18:10:21 UTC (40,209 KB)
[v2] Wed, 19 Mar 2025 10:05:05 UTC (40,209 KB)