Chrono: A Simple Blueprint for Representing Time in MLLMs

Meinardus, Boris; Rodriguez, Hector; Batra, Anil; Rohrbach, Anna; Rohrbach, Marcus

arXiv:2406.18113 (cs)

[Submitted on 26 Jun 2024 (v1), last revised 11 Mar 2025 (this version, v5)]

Abstract: The recent success of Large Language Models (LLMs) has prompted their extension to the multimodal domain, first with image-text Multimodal LLMs (MLLMs) and then with video-text models. In this work, we investigate the challenge of contextual and temporal comprehension in video-language models by exploring the task of temporal localization in videos. To address this problem, prior works have developed complex task-specific architectures, introduced novel modules to embed time into MLLMs, or leveraged additional input signals such as video transcripts to better encode contextual and temporal information. Interestingly, we find that most of these efforts are surpassed by a much simpler design. We introduce Chrono, a universal sequence blueprint that can be applied to an image-text pretrained MLLM. Through extensive ablations across different MLLM architectures, finetuning and zero-shot settings, and different datasets, we achieve a new SOTA in moment retrieval on the most widely used benchmarks, Charades-STA, QVHighlights, and ActivityNet Captions, and in grounded video question answering on NeXT-GQA.

Comments:	Code: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2406.18113 [cs.CV]
	(or arXiv:2406.18113v5 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.18113

Submission history

From: Boris Meinardus
[v1] Wed, 26 Jun 2024 06:59:09 UTC (6,672 KB)
[v2] Wed, 24 Jul 2024 06:43:07 UTC (6,671 KB)
[v3] Mon, 14 Oct 2024 06:50:19 UTC (6,671 KB)
[v4] Fri, 21 Feb 2025 00:49:07 UTC (15,466 KB)
[v5] Tue, 11 Mar 2025 10:03:46 UTC (15,466 KB)
