Polos: Multimodal Metric Learning from Human Feedback for Image Captioning

Wada, Yuiga; Kaneda, Kanta; Saito, Daichi; Sugiura, Komei

arXiv:2402.18091 (cs)

[Submitted on 28 Feb 2024]

Abstract: Establishing an automatic evaluation metric that closely aligns with human judgments is essential for effectively developing image captioning models. Recent data-driven metrics have demonstrated a stronger correlation with human judgments than classic metrics such as CIDEr; however, they lack sufficient capabilities to handle hallucinations and to generalize across diverse images and texts, partly because they compute scalar similarities using only embeddings learned from tasks unrelated to image captioning evaluation. In this study, we propose Polos, a supervised automatic evaluation metric for image captioning models. Polos computes scores from multimodal inputs, using a parallel feature extraction mechanism that leverages embeddings trained through large-scale contrastive learning. To train Polos, we introduce Multimodal Metric Learning from Human Feedback (M$^2$LHF), a framework for developing metrics based on human feedback. We constructed the Polaris dataset, which comprises 131K human judgments from 550 evaluators and is approximately ten times larger than standard datasets. Our approach achieved state-of-the-art performance on Composite, Flickr8K-Expert, Flickr8K-CF, PASCAL-50S, FOIL, and the Polaris dataset, thereby demonstrating its effectiveness and robustness.
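
The abstract compares metrics by how strongly they correlate with human judgments. As a rough illustration of what such a comparison involves, the sketch below computes Kendall's tau-a rank correlation between a metric's scores and human ratings. It is not taken from the paper: the function name `kendall_tau_a` is hypothetical, the score lists are made-up values, and the abstract does not say which correlation coefficient or protocol the authors use on each benchmark.

```python
from itertools import combinations


def kendall_tau_a(metric_scores, human_scores):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    if len(metric_scores) < 2:
        raise ValueError("need at least two items to rank")

    concordant = discordant = 0
    for i, j in combinations(range(len(metric_scores)), 2):
        # The sign of the product shows whether the metric and the human
        # raters order this pair the same way; ties contribute nothing.
        product = (metric_scores[i] - metric_scores[j]) * (
            human_scores[i] - human_scores[j]
        )
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    n_pairs = len(metric_scores) * (len(metric_scores) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up illustrative values, not data from the paper.
metric = [0.91, 0.35, 0.62, 0.10]
human = [1.00, 0.25, 0.75, 0.50]
tau = kendall_tau_a(metric, human)
```

A value near 1 means the metric ranks captions much as the human raters do, a value near -1 means it ranks them in the opposite order, and a value near 0 means there is little rank agreement. Tau-a ignores ties in the denominator; variants such as tau-b and tau-c handle ties differently, so the choice of variant affects reported numbers.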

Comments:	CVPR 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2402.18091 [cs.CV]
	(or arXiv:2402.18091v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2402.18091

Submission history

From: Yuiga Wada
[v1] Wed, 28 Feb 2024 06:24:39 UTC (16,896 KB)
