Target-Driven Structured Transformer Planner for Vision-Language Navigation

Zhao, Yusheng; Chen, Jinyu; Gao, Chen; Wang, Wenguan; Yang, Lirong; Ren, Haibing; Xia, Huaxia; Liu, Si

arXiv:2207.11201 (cs)

[Submitted on 19 Jul 2022]

Abstract: Vision-language navigation is the task of directing an embodied agent to navigate in 3D scenes with natural language instructions. For the agent, inferring the long-term navigation target from visual-linguistic clues is crucial for reliable path planning, yet this has rarely been studied in the literature. In this article, we propose a Target-Driven Structured Transformer Planner (TD-STP) for long-horizon, goal-guided, and room layout-aware navigation. Specifically, we devise an Imaginary Scene Tokenization mechanism for explicit estimation of the long-term target, even when it is located in unexplored environments. In addition, we design a Structured Transformer Planner that incorporates the explored room layout into a neural attention architecture for structured and global planning. Experimental results demonstrate that TD-STP improves on the previous best methods' success rate by 2% and 5% on the test sets of the R2R and REVERIE benchmarks, respectively. Our code is available at this https URL.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2207.11201 [cs.CV]
	(or arXiv:2207.11201v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2207.11201

Submission history

From: Yusheng Zhao
[v1] Tue, 19 Jul 2022 06:46:21 UTC (42,681 KB)
