Target-dependent UNITER: A Transformer-Based Multimodal Language Comprehension Model for Domestic Service Robots

Ishikawa, Shintaro; Sugiura, Komei

Computer Science > Robotics

arXiv:2107.00811 (cs)

[Submitted on 2 Jul 2021]

Title:Target-dependent UNITER: A Transformer-Based Multimodal Language Comprehension Model for Domestic Service Robots

Authors:Shintaro Ishikawa, Komei Sugiura

View PDF

Abstract:Currently, domestic service robots have an insufficient ability to interact naturally through language. This is because understanding human instructions is complicated by various ambiguities and missing information. In existing methods, the referring expressions that specify the relationships between objects are insufficiently modeled. In this paper, we propose Target-dependent UNITER, which learns the relationship between the target object and other objects directly by focusing on the relevant regions within an image, rather than the whole image. Our method is an extension of the UNITER-based Transformer that can be pretrained on general-purpose datasets. We extend the UNITER approach by introducing a new architecture for handling the target candidates. Our model is validated on two standard datasets, and the results show that Target-dependent UNITER outperforms the baseline method in terms of classification accuracy.

Comments:	Accepted for presentation at IROS2021
Subjects:	Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2107.00811 [cs.RO]
	(or arXiv:2107.00811v1 [cs.RO] for this version)
	https://doi.org/10.48550/arXiv.2107.00811

Submission history

From: Komei Sugiura [view email]
[v1] Fri, 2 Jul 2021 03:11:02 UTC (26,784 KB)

Full-text links:

Access Paper:

view license

Current browse context:

cs.RO

< prev | next >

new | recent | 2021-07

Change to browse by:

cs
cs.CL
cs.CV

References & Citations

DBLP - CS Bibliography

listing | bibtex

Komei Sugiura

export BibTeX citation

Computer Science > Robotics

Title:Target-dependent UNITER: A Transformer-Based Multimodal Language Comprehension Model for Domestic Service Robots

Submission history

Access Paper:

References & Citations

DBLP - CS Bibliography

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Robotics

Title:Target-dependent UNITER: A Transformer-Based Multimodal Language Comprehension Model for Domestic Service Robots

Submission history

Access Paper:

References & Citations

DBLP - CS Bibliography

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators