Mitigating Hallucination in Large Vision-Language Models through Aligning Attention Distribution to Information Flow

Zhao, Jianfei; Zhang, Feng; Sun, Xin; Feng, Chong

arXiv:2505.14257 (cs)

[Submitted on 20 May 2025 (v1), last revised 3 Sep 2025 (this version, v2)]

Abstract: Due to the unidirectional masking mechanism, Decoder-Only models propagate information from left to right. LVLMs (Large Vision-Language Models) follow the same architecture, with visual information gradually integrated into semantic representations during forward propagation. Through systematic analysis, we observe that the majority of the visual information is absorbed into the semantic representations. However, the model's attention distribution does not exhibit sufficient emphasis on semantic representations. This misalignment between the attention distribution and the actual information flow undermines the model's visual understanding ability and contributes to hallucinations. To address this issue, we enhance the model's visual understanding by leveraging the core information embedded in semantic representations. Specifically, we identify attention heads that focus on core semantic representations based on their attention distributions. Then, through a two-stage optimization paradigm, we propagate the advantages of these attention heads across the entire model, aligning the attention distribution with the actual information flow. We evaluate our method on three image captioning benchmarks using five different LVLMs, demonstrating its effectiveness in significantly reducing hallucinations. Further experiments reveal a trade-off between reduced hallucinations and richer details. Notably, our method allows for manual adjustment of the model's conservativeness, enabling flexible control to meet diverse real-world requirements.
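
The head-identification step mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the scoring criterion, so everything below is an assumption made for illustration: each head is scored by the share of one query token's attention mass that falls on a given set of "semantic" key positions, and the heads with the largest share are kept. The names `semantic_attention_share`, `rank_heads`, `toy_attn`, and `semantic_idx` are hypothetical and do not come from the paper, and the two-stage optimization itself is not modeled here.

```python
from typing import List, Sequence, Tuple


def semantic_attention_share(attn_row: Sequence[float],
                             semantic_idx: Sequence[int]) -> float:
    """Fraction of one query token's attention mass on the semantic positions.

    Assumption: attn_row holds non-negative attention weights over all key
    positions for a single query token of a single head.
    """
    total = sum(attn_row)
    if total == 0:
        return 0.0
    return sum(attn_row[i] for i in semantic_idx) / total


def rank_heads(attn: Sequence[Sequence[Sequence[float]]],
               semantic_idx: Sequence[int],
               top_k: int) -> List[Tuple[int, int]]:
    """Return the (layer, head) pairs with the highest semantic attention share.

    Assumption: attn[layer][head] is the attention row of one query token.
    Ties are broken by layer index, then head index.
    """
    scored = []
    for layer, heads in enumerate(attn):
        for head, row in enumerate(heads):
            share = semantic_attention_share(row, semantic_idx)
            scored.append((share, layer, head))
    # Sort by descending share, then ascending layer and head for stability.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(layer, head) for _, layer, head in scored[:top_k]]


# Toy data: 2 layers x 2 heads, 6 key positions.
# Hypothetical layout: positions 0-3 are image tokens, 4-5 are semantic tokens.
toy_attn = [
    [[0.2, 0.2, 0.2, 0.2, 0.1, 0.1], [0.05, 0.05, 0.05, 0.05, 0.4, 0.4]],
    [[0.1, 0.1, 0.1, 0.1, 0.3, 0.3], [0.25, 0.25, 0.25, 0.25, 0.0, 0.0]],
]
semantic_idx = [4, 5]
top_heads = rank_heads(toy_attn, semantic_idx, top_k=2)
print(top_heads)
```

In this toy layout the two selected heads are the ones that place most of their attention on positions 4 and 5. With a real model, the attention rows would come from the model's attention outputs, and the choice of query token and of semantic positions would have to follow the paper's definitions rather than this sketch.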

Comments:	Accepted to Findings of EMNLP 2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2505.14257 [cs.CV]
	(or arXiv:2505.14257v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2505.14257

Submission history

From: Jianfei Zhao [view email]
[v1] Tue, 20 May 2025 12:10:13 UTC (2,239 KB)
[v2] Wed, 3 Sep 2025 11:34:49 UTC (2,058 KB)
