Vision Language Transformers: A Survey

Fields, Clayton; Kennington, Casey

arXiv:2307.03254 (cs)

[Submitted on 6 Jul 2023]

Abstract: Vision language tasks, such as answering questions about an image or generating captions that describe it, are difficult for computers to perform. A relatively recent body of research has adapted the pretrained transformer architecture introduced by Vaswani et al. (2017) to vision language modeling. Transformer models have greatly improved performance and versatility over previous vision language models. They do so by pretraining on large generic datasets and transferring what they learn to new tasks with minor changes in architecture and parameter values. This type of transfer learning has become the standard modeling practice in both natural language processing and computer vision. Vision language transformers offer the promise of producing similar advancements in tasks that require both vision and language. In this paper, we provide a broad synthesis of the currently available research on vision language transformer models and offer some analysis of their strengths, their limitations, and some open questions that remain.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2307.03254 [cs.CV]
	(or arXiv:2307.03254v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2307.03254

Submission history

From: Clayton Fields
[v1] Thu, 6 Jul 2023 19:08:56 UTC (5,253 KB)
