SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning

Thoker, Fida Mohammad; Jiang, Letian; Zhao, Chen; Bagad, Piyush; Doughty, Hazel; Ghanem, Bernard; Snoek, Cees G. M.

Abstract: Continued advances in self-supervised learning have led to significant progress in video representation learning, offering a scalable alternative to supervised approaches by removing the need for manual annotations. Despite strong performance on standard action recognition benchmarks, video self-supervised learning methods are largely evaluated under narrow protocols, typically pretraining on Kinetics-400 and fine-tuning on similar datasets, which limits our understanding of their generalization in real-world scenarios. In this work, we present a comprehensive evaluation of modern video self-supervised models, focusing on generalization across four key downstream factors: domain shift, sample efficiency, action granularity, and task diversity. Building on our prior work analyzing benchmark sensitivity in CNN-based contrastive learning, we extend the study to cover state-of-the-art transformer-based video-only and video-text models. Specifically, we benchmark 12 transformer-based methods (7 video-only, 5 video-text) and compare them to 10 CNN-based methods, totaling over 1100 experiments across 8 datasets and 7 downstream tasks. Our analysis shows that, despite architectural advances, transformer-based models remain sensitive to downstream conditions. No method generalizes consistently across all factors: video-only transformers perform better under domain shifts, CNNs perform better on fine-grained tasks, and video-text models often underperform despite large-scale pretraining. We also find that recent transformer models do not consistently outperform earlier approaches. Our findings provide a detailed view of the strengths and limitations of current video SSL methods and offer a unified benchmark for evaluating generalization in video representation learning.

Comments:	Under Review
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2504.05706 [cs.CV]
	(or arXiv:2504.05706v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2504.05706
