Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers

Jain, Vidhi; Attarian, Maria; Joshi, Nikhil J; Wahid, Ayzaan; Driess, Danny; Vuong, Quan; Sanketi, Pannag R; Sermanet, Pierre; Welker, Stefan; Chan, Christine; Gilitschenski, Igor; Bisk, Yonatan; Dwibedi, Debidatta

arXiv:2403.12943 (cs)

[Submitted on 19 Mar 2024 (v1), last revised 27 Aug 2024 (this version, v2)]

Abstract: Large-scale multi-task robotic manipulation systems often rely on text to specify the task. In this work, we explore whether a robot can learn by observing humans. To do so, the robot must understand a person's intent and perform the inferred task despite differences in the embodiments and environments. We introduce Vid2Robot, an end-to-end video-conditioned policy that takes human videos demonstrating manipulation tasks as input and produces robot actions. Our model is trained with a large dataset of prompt video-robot trajectory pairs to learn unified representations of human and robot actions from videos. Vid2Robot uses cross-attention transformer layers between video features and the current robot state to produce the actions and perform the same task as shown in the video. We use auxiliary contrastive losses to align the prompt and robot video representations for better policies. We evaluate Vid2Robot on real-world robots and observe over 20% improvement over BC-Z when using human prompt videos. Further, we also show cross-object motion transfer ability that enables video-conditioned policies to transfer a motion observed on one object in the prompt video to another object in the robot's own environment. Videos available at this https URL
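
The two mechanisms named in the abstract, cross-attention between prompt-video features and the robot state, and a contrastive loss aligning prompt and robot video representations, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cross_attention` and `info_nce` are hypothetical, the attention is single-head with no learned projections, and InfoNCE is used here only as a common stand-in, since the abstract does not specify which contrastive losses the paper uses.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(state_tokens, video_tokens):
    """Single-head scaled dot-product cross-attention (illustrative only).

    Queries come from the robot-state tokens; keys and values are the
    prompt-video tokens. Learned projections are omitted for brevity.
    Returns one fused vector per state token plus the attention weights.
    """
    dim = len(video_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused, weights = [], []
    for q in state_tokens:
        w = softmax([dot(q, k) * scale for k in video_tokens])
        fused.append(
            [sum(w_j * v[i] for w_j, v in zip(w, video_tokens)) for i in range(dim)]
        )
        weights.append(w)
    return fused, weights


def info_nce(prompt_embs, robot_embs, temperature=0.1):
    """InfoNCE-style loss (an assumption, not confirmed by the abstract).

    The i-th prompt embedding is treated as the positive match for the
    i-th robot embedding; all other pairs in the batch are negatives.
    Embeddings are assumed to be nonzero vectors.
    """
    def normalize(v):
        n = math.sqrt(dot(v, v))
        return [x / n for x in v]

    p = [normalize(v) for v in prompt_embs]
    r = [normalize(v) for v in robot_embs]
    loss = 0.0
    for i, p_i in enumerate(p):
        probs = softmax([dot(p_i, r_j) / temperature for r_j in r])
        loss -= math.log(probs[i])
    return loss / len(p)


# Toy usage: one state token attends over two prompt-video tokens.
video_tokens = [[1.0, 0.0], [0.0, 1.0]]
state_tokens = [[4.0, 0.0]]
fused, weights = cross_attention(state_tokens, video_tokens)

# Matched pairs should give a lower loss than deliberately swapped pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the state token places most of its attention on the first video token, the one it is most similar to, and the loss is lower when each prompt embedding is paired with its own robot embedding than when the pairs are swapped. A real policy would add learned query, key, and value projections, multiple heads, and an action decoder on top of the fused features.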

Comments:	Robotics: Science & Systems (RSS) 2024. this https URL
Subjects:	Robotics (cs.RO); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2403.12943 [cs.RO]
	(or arXiv:2403.12943v2 [cs.RO] for this version)
	https://doi.org/10.48550/arXiv.2403.12943

Submission history

From: Vidhi Jain
[v1] Tue, 19 Mar 2024 17:47:37 UTC (4,658 KB)
[v2] Tue, 27 Aug 2024 23:15:11 UTC (12,681 KB)
