Scaling 4D Representations
Authors:
João Carreira,
Dilara Gokay,
Michael King,
Chuhan Zhang,
Ignacio Rocco,
Aravindh Mahendran,
Thomas Albert Keck,
Joseph Heyward,
Skanda Koppula,
Etienne Pot,
Goker Erdogan,
Yana Hasson,
Yi Yang,
Klaus Greff,
Guillaume Le Moing,
Sjoerd van Steenkiste,
Daniel Zoran,
Drew A. Hudson,
Pedro Vélez,
Luisa Polanía,
Luke Friedman,
Chris Duvarney,
Ross Goroshin,
Kelsey Allen,
Jacob Walker,
et al. (10 additional authors not shown)
Abstract:
Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks – action classification, ImageNet classification, etc. In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models actually scales, consistently improving performance on these 4D tasks as model size increases from 20M parameters all the way to 22B parameters, by far the largest self-supervised video model reported. Rigorous apples-to-apples comparison with many recent image and video models demonstrates the benefits of scaling 4D representations. Pretrained models are available at https://github.com/google-deepmind/representations4d .
Submitted 9 July, 2025; v1 submitted 19 December, 2024;
originally announced December 2024.
Moving Off-the-Grid: Scene-Grounded Video Representations
Authors:
Sjoerd van Steenkiste,
Daniel Zoran,
Yi Yang,
Yulia Rubanova,
Rishabh Kabra,
Carl Doersch,
Dilara Gokay,
Joseph Heyward,
Etienne Pot,
Klaus Greff,
Drew A. Hudson,
Thomas Albert Keck,
Joao Carreira,
Alexey Dosovitskiy,
Mehdi S. M. Sajjadi,
Thomas Kipf
Abstract:
Current vision models typically maintain a fixed correspondence between their representation structure and image space. Each layer comprises a set of tokens arranged "on-the-grid," which biases patches or tokens to encode information at a specific spatio(-temporal) location. In this work we present Moving Off-the-Grid (MooG), a self-supervised video representation model that offers an alternative approach, allowing tokens to move "off-the-grid" so that they can represent scene elements consistently, even as those elements move across the image plane through time. By using a combination of cross-attention and positional embeddings we disentangle the representation structure and image structure. We find that a simple self-supervised objective (next-frame prediction), trained on video data, results in a set of latent tokens that bind to specific scene structures and track them as they move. We demonstrate the usefulness of MooG's learned representation both qualitatively and quantitatively by training readouts on top of the learned representation on a variety of downstream tasks. We show that MooG can provide a strong foundation for different vision tasks when compared to "on-the-grid" baselines.
Submitted 8 November, 2024;
originally announced November 2024.