Apollo: An Exploration of Video Understanding in Large Multimodal Models

Zohar, Orr; Wang, Xiaohan; Dubois, Yann; Mehta, Nikhil; Xiao, Tong; Hansen-Estruch, Philippe; Yu, Licheng; Wang, Xiaofang; Juefei-Xu, Felix; Zhang, Ning; Yeung-Levy, Serena; Xia, Xide

arXiv:2412.10360 (cs)

[Submitted on 13 Dec 2024]

Abstract: Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high computational cost of training and evaluating such models, coupled with limited open research, hinders the development of video-LMMs. To address this, we present a comprehensive study that helps uncover what effectively drives video understanding in LMMs.
We begin by critically examining the primary contributors to the high computational requirements associated with video-LMM research and discover Scaling Consistency, wherein design and training decisions made on smaller models and datasets (up to a critical size) effectively transfer to larger models. Leveraging these insights, we explore many video-specific aspects of video-LMMs, including video sampling, architectures, data composition, and training schedules. For example, we demonstrate that fps sampling during training is vastly preferable to uniform frame sampling, and we identify which vision encoders are best for video representation.
Guided by these findings, we introduce Apollo, a state-of-the-art family of LMMs that achieve superior performance across different model sizes. Our models can perceive hour-long videos efficiently, with Apollo-3B outperforming most existing 7B models with a score of 55.1 on LongVideoBench. Apollo-7B is state-of-the-art among 7B LMMs, scoring 70.9 on MLVU and 63.3 on Video-MME.
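
The abstract contrasts fps sampling with uniform frame sampling but does not define either. As an illustration only, the sketch below uses the common reading of the two terms: uniform sampling picks a fixed number of frames spread across the whole clip, while fps sampling picks frames at a fixed rate in time. The function names, parameters, and the centre-of-segment choice are assumptions made for this example and are not taken from the paper or its code.

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick a fixed number of frame indices spread evenly over the clip.

    Hypothetical helper: the time gap between samples grows with clip length.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    # Take the centre of each of the num_samples equal-length segments.
    return [int(step * i + step / 2) for i in range(num_samples)]


def fps_frame_indices(total_frames: int, native_fps: float, target_fps: float) -> list[int]:
    """Pick frame indices at a fixed temporal rate (target_fps).

    Hypothetical helper: the time gap between samples is constant, so the
    number of sampled frames grows with clip length instead.
    """
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    # Never sample faster than the video's own frame rate.
    stride = max(1.0, native_fps / target_fps)
    indices = []
    i = 0
    while int(i * stride) < total_frames:
        indices.append(int(i * stride))
        i += 1
    return indices
```

Under these assumptions, a 10-second clip at 30 fps (300 frames) sampled at 2 fps yields 20 frames, and a 100-second clip (3000 frames) yields 200, always 0.5 seconds apart. Uniform sampling with 8 frames always returns 8 frames, but they are about 1.25 seconds apart in the short clip and 12.5 seconds apart in the long one, so the apparent speed of motion seen by the model varies with video length.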

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2412.10360 [cs.CV]
	(or arXiv:2412.10360v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.10360

Submission history

From: Orr Zohar Mr
[v1] Fri, 13 Dec 2024 18:53:24 UTC (31,798 KB)
