LLaVA-MLB: Mitigating and Leveraging Attention Bias for Training-Free Video LLMs

Shen, Leqi; He, Tao; Gong, Guoqiang; Yang, Fan; Zhang, Yifeng; Liu, Pengzhang; Zhao, Sicheng; Ding, Guiguang

arXiv:2503.11205 (cs)

[Submitted on 14 Mar 2025]

Abstract: Training-free video large language models (LLMs) leverage pretrained Image LLMs to process video content without further training. A key challenge in such approaches is retaining essential visual and temporal information under the token limits of Image LLMs. To address this, we propose a two-stage method for selecting query-relevant tokens based on the LLM attention scores: first compressing the video sequence, then expanding it. However, during the compression stage, Image LLMs often exhibit a positional attention bias in video sequences: attention is overly concentrated on later frames, so early-frame information is underutilized. To alleviate this attention bias during sequence compression, we propose Gridded Attention Pooling, which preserves spatiotemporal structure. Additionally, we introduce Visual Summarization Tail to exploit this bias, facilitating overall video understanding during sequence expansion. In this way, our method effectively Mitigates and Leverages attention Bias (LLaVA-MLB), enabling the frozen Image LLM to perform detailed video understanding. Experiments on several benchmarks demonstrate that our approach outperforms state-of-the-art methods in both efficiency and accuracy. Our code will be released.
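
The positional bias described in the abstract can be illustrated with a toy example. The sketch below is not the paper's implementation: the abstract does not specify how Gridded Attention Pooling works, so the per-frame selection here (one hypothetical grid cell per frame) is only an assumption, as are the function names `global_top_k` and `per_frame_top_k` and the made-up scores. It shows why selecting tokens by attention score over the whole sequence can discard early frames, and how restricting the selection to local cells keeps every frame represented.

```python
from typing import List, Tuple

# scores[f][t]: attention from the query to token t of frame f (toy values).
Scores = List[List[float]]


def global_top_k(scores: Scores, k: int) -> List[Tuple[int, int]]:
    """Keep the k highest-scoring (frame, token) pairs across the whole video."""
    flat = [(s, f, t) for f, row in enumerate(scores) for t, s in enumerate(row)]
    flat.sort(key=lambda item: -item[0])
    return sorted((f, t) for _, f, t in flat[:k])


def per_frame_top_k(scores: Scores, k_per_frame: int) -> List[Tuple[int, int]]:
    """Keep the k_per_frame highest-scoring tokens inside each frame.

    Assumption: each frame is treated as a single grid cell; the paper's
    actual gridding may differ.
    """
    kept: List[Tuple[int, int]] = []
    for f, row in enumerate(scores):
        order = sorted(range(len(row)), key=lambda t: -row[t])
        kept.extend((f, t) for t in sorted(order[:k_per_frame]))
    return kept


# Toy scores with a positional bias: later frames receive more attention.
scores = [
    [0.01, 0.03, 0.02, 0.01],  # frame 0
    [0.05, 0.04, 0.06, 0.03],  # frame 1
    [0.20, 0.15, 0.25, 0.10],  # frame 2
]

biased = global_top_k(scores, k=3)
balanced = per_frame_top_k(scores, k_per_frame=1)
print(biased)    # [(2, 0), (2, 1), (2, 2)]
print(balanced)  # [(0, 1), (1, 2), (2, 2)]
```

With the same budget of three tokens, the global selection keeps only tokens from the last frame, while the per-frame selection keeps one token from every frame.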

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2503.11205 [cs.CV]
	(or arXiv:2503.11205v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.11205

Submission history

From: Leqi Shen
[v1] Fri, 14 Mar 2025 08:49:52 UTC (388 KB)
