Towards Local Visual Modeling for Image Captioning

Ma, Yiwei; Ji, Jiayi; Sun, Xiaoshuai; Zhou, Yiyi; Ji, Rongrong


arXiv:2302.06098 (cs)

[Submitted on 13 Feb 2023]



Abstract: In this paper, we study local visual modeling with grid features for image captioning, which is critical for generating accurate and detailed captions. To this end, we propose a Locality-Sensitive Transformer Network (LSTNet) with two novel designs, namely Locality-Sensitive Attention (LSA) and Locality-Sensitive Fusion (LSF). LSA is deployed for intra-layer interaction in the Transformer by modeling the relationship between each grid and its neighbors, which reduces the difficulty of local object recognition during captioning. LSF is used for inter-layer information fusion: it aggregates the information of different encoder layers for cross-layer semantic complementarity. With these two designs, the proposed LSTNet can model the local visual information of grid features to improve captioning quality. To validate LSTNet, we conduct extensive experiments on the competitive MS-COCO benchmark. The experimental results show that LSTNet is not only capable of local visual modeling, but also outperforms a number of state-of-the-art captioning models, reaching 134.8 CIDEr on the offline test and 136.3 CIDEr on the online test. In addition, the generalization of LSTNet is verified on the Flickr8k and Flickr30k datasets.
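The abstract describes LSA only at a high level, as modeling the relationship between each grid and its neighbors. As a rough illustration of that general idea, and not the authors' LSA, the sketch below restricts plain self-attention over grid features to a square neighborhood. The function names, the `radius` parameter, the Chebyshev-distance neighborhood, and the omission of learned query/key/value projections are all assumptions made for this example.

```python
import numpy as np

def neighborhood_mask(height, width, radius=1):
    """Boolean (H*W, H*W) mask: True where grid j is within `radius`
    rows and columns of grid i (hypothetical neighborhood definition)."""
    rows, cols = np.divmod(np.arange(height * width), width)
    row_dist = np.abs(rows[:, None] - rows[None, :])
    col_dist = np.abs(cols[:, None] - cols[None, :])
    return np.maximum(row_dist, col_dist) <= radius

def local_self_attention(features, height, width, radius=1):
    """Self-attention over flattened grid features of shape (H*W, D),
    where each grid attends only to its spatial neighbors.
    Illustrative assumption: no learned projections, single head."""
    n, dim = features.shape
    assert n == height * width, "features must be a flattened H x W grid"
    scores = features @ features.T / np.sqrt(dim)
    mask = neighborhood_mask(height, width, radius)
    # Grids outside the neighborhood get zero weight after the softmax.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ features, weights

# Usage: a 7 x 7 grid of 16-dimensional features (random placeholder data).
rng = np.random.default_rng(0)
grid_features = rng.normal(size=(7 * 7, 16))
attended, attention = local_self_attention(grid_features, 7, 7, radius=1)
```

With `radius=1`, an interior grid attends to a 3 x 3 window and a corner grid to a 2 x 2 window; a larger `radius` widens the local context toward ordinary global attention.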

Comments:	Preprint
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2302.06098 [cs.CV]
	(or arXiv:2302.06098v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2302.06098

Submission history

From: Yiwei Ma
[v1] Mon, 13 Feb 2023 04:42:00 UTC (2,806 KB)
