Impact of Visual Context on Noisy Multimodal NMT: An Empirical Study for English to Indian Languages

Gain, Baban; Bandyopadhyay, Dibyanayan; Mukherjee, Samrat; Adak, Chandranath; Ekbal, Asif

arXiv:2308.16075 (cs)

[Submitted on 30 Aug 2023 (v1), last revised 23 Jun 2025 (this version, v2)]

Abstract: Neural Machine Translation (NMT) has made remarkable progress using large-scale textual data, but the potential of incorporating multimodal inputs, especially visual information, remains underexplored in high-resource settings. While prior research has focused on using multimodal data in low-resource scenarios, this study examines how image features impact translation when added to a large-scale, pre-trained unimodal NMT system. Surprisingly, the study finds that images might be redundant in this context. Additionally, the research introduces synthetic noise to assess whether images help the model handle textual noise. Multimodal models slightly outperform text-only models in noisy settings, even when random images are used. The study's experiments translate from English to Hindi, Bengali, and Malayalam, significantly outperforming state-of-the-art benchmarks. Interestingly, the effect of visual context varies with the level of source text noise: no visual context works best for non-noisy translations, cropped image features are optimal for low noise, and full image features perform better in high-noise scenarios. This sheds light on the role of visual context, especially in noisy settings, and opens up a new research direction for Noisy Neural Machine Translation in multimodal setups. The research emphasizes the importance of combining visual and textual information to improve translation across various environments. Our code is publicly available at this https URL.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2308.16075 [cs.CL]
	(or arXiv:2308.16075v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2308.16075

Submission history

From: Baban Gain
[v1] Wed, 30 Aug 2023 14:52:14 UTC (10,036 KB)
[v2] Mon, 23 Jun 2025 19:07:19 UTC (2,360 KB)
