Explain Images with Multimodal Recurrent Neural Networks

Mao, Junhua; Xu, Wei; Yang, Yi; Wang, Jiang; Yuille, Alan L.

arXiv:1410.1090 (cs)

[Submitted on 4 Oct 2014]

Abstract: In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel sentence descriptions to explain the content of images. It directly models the probability distribution of generating a word given previous words and the image. Image descriptions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on three benchmark datasets (IAPR TC-12, Flickr 8K, and Flickr 30K). Our model outperforms the state-of-the-art generative method. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.
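
As a rough illustration of the idea in the abstract (a next-word distribution conditioned on the previous words and the image, with captions drawn by sampling), the following is a minimal sketch and not the authors' implementation. Everything in it is an assumption made for illustration: the class name `TinyMRNN`, the layer sizes, the choice of activations, the random untrained weights, and the plain list `image_feat` that stands in for features a convolutional network would produce.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyMRNN:
    """Hypothetical toy model: untrained weights, illustrative sizes only."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, image_dim,
                 fused_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self.embed = rand_matrix(vocab_size, embed_dim, rng)   # word embeddings
        self.W_in = rand_matrix(hidden_dim, embed_dim, rng)    # word -> recurrent
        self.W_rec = rand_matrix(hidden_dim, hidden_dim, rng)  # recurrent -> recurrent
        self.V_word = rand_matrix(fused_dim, embed_dim, rng)   # word -> multimodal
        self.V_rec = rand_matrix(fused_dim, hidden_dim, rng)   # recurrent -> multimodal
        self.V_img = rand_matrix(fused_dim, image_dim, rng)    # image -> multimodal
        self.W_out = rand_matrix(vocab_size, fused_dim, rng)   # multimodal -> vocabulary

    def step(self, word_id, hidden, image_feat):
        """Return P(next word | previous words, image) and the new hidden state."""
        w = self.embed[word_id]
        # Recurrent sub-network: summarizes the words seen so far.
        h = [max(0.0, a + b)
             for a, b in zip(matvec(self.W_in, w), matvec(self.W_rec, hidden))]
        # Multimodal layer: fuses word, recurrent state, and image features.
        m = [math.tanh(a + b + c)
             for a, b, c in zip(matvec(self.V_word, w),
                                matvec(self.V_rec, h),
                                matvec(self.V_img, image_feat))]
        return softmax(matvec(self.W_out, m)), h

    def sample(self, image_feat, start_id, end_id, max_len=10, seed=0):
        """Draw word ids one at a time from the conditional distribution."""
        rng = random.Random(seed)
        hidden = [0.0] * self.hidden_dim
        word, out = start_id, []
        for _ in range(max_len):
            probs, hidden = self.step(word, hidden, image_feat)
            word = rng.choices(range(len(probs)), weights=probs)[0]
            if word == end_id:
                break
            out.append(word)
        return out


if __name__ == "__main__":
    model = TinyMRNN(vocab_size=6, embed_dim=4, hidden_dim=5,
                     image_dim=3, fused_dim=4)
    print(model.sample([0.2, -0.1, 0.4], start_id=0, end_id=1))
```

Because the weights are random, the sampled ids carry no meaning; the sketch only shows how an image vector and a word history can jointly determine a distribution over the next word.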

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
ACM classes:	I.2.6; I.2.7; I.2.10
Cite as:	arXiv:1410.1090 [cs.CV]
	(or arXiv:1410.1090v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1410.1090

Submission history

From: Junhua Mao
[v1] Sat, 4 Oct 2014 20:24:34 UTC (346 KB)
