-
Learning from Streaming Video with Orthogonal Gradients
Authors:
Tengda Han,
Dilara Gokay,
Joseph Heyward,
Chuhan Zhang,
Daniel Zoran,
Viorica Pătrăucean,
João Carreira,
Dima Damen,
Andrew Zisserman
Abstract:
We address the challenge of representation learning from a continuous stream of video as input, in a self-supervised manner. This differs from the standard approaches to video learning where videos are chopped and shuffled during training in order to create a non-redundant batch that satisfies the independently and identically distributed (IID) sample assumption expected by conventional training paradigms. When videos are only available as a continuous stream of input, the IID assumption is evidently broken, leading to poor performance. We demonstrate the drop in performance when moving from shuffled to sequential learning on three tasks: the one-video representation learning method DoRA, standard VideoMAE on multi-video datasets, and the task of future video prediction. To address this drop, we propose a geometric modification to standard optimizers, to decorrelate batches by utilising orthogonal gradients during training. The proposed modification can be applied to any optimizer -- we demonstrate it with Stochastic Gradient Descent (SGD) and AdamW. Our proposed orthogonal optimizer allows models trained from streaming videos to alleviate the drop in representation learning performance, as evaluated on downstream tasks. Across all three scenarios (DoRA, VideoMAE, future prediction), our orthogonal optimizer outperforms the strong AdamW baseline.
Submitted 2 April, 2025;
originally announced April 2025.
-
Scaling 4D Representations
Authors:
João Carreira,
Dilara Gokay,
Michael King,
Chuhan Zhang,
Ignacio Rocco,
Aravindh Mahendran,
Thomas Albert Keck,
Joseph Heyward,
Skanda Koppula,
Etienne Pot,
Goker Erdogan,
Yana Hasson,
Yi Yang,
Klaus Greff,
Guillaume Le Moing,
Sjoerd van Steenkiste,
Daniel Zoran,
Drew A. Hudson,
Pedro Vélez,
Luisa Polanía,
Luke Friedman,
Chris Duvarney,
Ross Goroshin,
Kelsey Allen,
Jacob Walker
, et al. (10 additional authors not shown)
Abstract:
Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks – action classification, ImageNet classification, etc. In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models actually scales, consistently improving performance on these 4D tasks, as model size increases from 20M all the way to the largest by far reported self-supervised video model – 22B parameters. Rigorous apples-to-apples comparison with many recent image and video models demonstrates the benefits of scaling 4D representations.
Submitted 19 December, 2024;
originally announced December 2024.
-
Moving Off-the-Grid: Scene-Grounded Video Representations
Authors:
Sjoerd van Steenkiste,
Daniel Zoran,
Yi Yang,
Yulia Rubanova,
Rishabh Kabra,
Carl Doersch,
Dilara Gokay,
Joseph Heyward,
Etienne Pot,
Klaus Greff,
Drew A. Hudson,
Thomas Albert Keck,
Joao Carreira,
Alexey Dosovitskiy,
Mehdi S. M. Sajjadi,
Thomas Kipf
Abstract:
Current vision models typically maintain a fixed correspondence between their representation structure and image space. Each layer comprises a set of tokens arranged "on-the-grid," which biases patches or tokens to encode information at a specific spatio(-temporal) location. In this work we present Moving Off-the-Grid (MooG), a self-supervised video representation model that offers an alternative approach, allowing tokens to move "off-the-grid" to better enable them to represent scene elements consistently, even as they move across the image plane through time. By using a combination of cross-attention and positional embeddings we disentangle the representation structure and image structure. We find that a simple self-supervised objective--next frame prediction--trained on video data, results in a set of latent tokens which bind to specific scene structures and track them as they move. We demonstrate the usefulness of MooG's learned representation both qualitatively and quantitatively by training readouts on top of the learned representation on a variety of downstream tasks. We show that MooG can provide a strong foundation for different vision tasks when compared to "on-the-grid" baselines.
Submitted 8 November, 2024;
originally announced November 2024.
-
BootsTAP: Bootstrapped Training for Tracking-Any-Point
Authors:
Carl Doersch,
Pauline Luc,
Yi Yang,
Dilara Gokay,
Skanda Koppula,
Ankush Gupta,
Joseph Heyward,
Ignacio Rocco,
Ross Goroshin,
João Carreira,
Andrew Zisserman
Abstract:
To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes. This can be formalized as Tracking-Any-Point (TAP), which requires the algorithm to track any point on solid surfaces in a video, potentially densely in space and time. Large-scale ground-truth training data for TAP is only available in simulation, which currently has a limited variety of objects and motion. In this work, we demonstrate how large-scale, unlabeled, uncurated real-world data can improve a TAP model with minimal architectural changes, using a self-supervised student-teacher setup. We demonstrate state-of-the-art performance on the TAP-Vid benchmark, surpassing previous results by a wide margin: for example, TAP-Vid-DAVIS performance improves from 61.3% to 67.4%, and TAP-Vid-Kinetics from 57.2% to 62.5%. For visualizations, see our project webpage at https://bootstap.github.io/
Submitted 23 May, 2024; v1 submitted 1 February, 2024;
originally announced February 2024.
-
Learning from One Continuous Video Stream
Authors:
João Carreira,
Michael King,
Viorica Pătrăucean,
Dilara Gokay,
Cătălin Ionescu,
Yi Yang,
Daniel Zoran,
Joseph Heyward,
Carl Doersch,
Yusuf Aytar,
Dima Damen,
Andrew Zisserman
Abstract:
We introduce a framework for online learning from a single continuous video stream -- the way people and animals learn, without mini-batches, data augmentation or shuffling. This poses great challenges given the high correlation between consecutive video frames and there is very little prior work on it. Our framework allows us to do a first deep dive into the topic and includes a collection of streams and tasks composed from two existing video datasets, plus methodology for performance evaluation that considers both adaptation and generalization. We employ pixel-to-pixel modelling as a practical and flexible way to switch between pre-training and single-stream evaluation as well as between arbitrary tasks, without ever requiring changes to models and always using the same pixel loss. Equipped with this framework we obtained large single-stream learning gains from pre-training with a novel family of future prediction tasks, found that momentum hurts, and that the pace of weight updates matters. The combination of these insights leads to matching the performance of IID learning with batch size 1, when using the same architecture and without costly replay buffers.
Submitted 28 March, 2024; v1 submitted 1 December, 2023;
originally announced December 2023.
-
TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement
Authors:
Carl Doersch,
Yi Yang,
Mel Vecerik,
Dilara Gokay,
Ankush Gupta,
Yusuf Aytar,
Joao Carreira,
Andrew Zisserman
Abstract:
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence. Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations. The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS. Our model facilitates fast inference on long and high-resolution video sequences. On a modern GPU, our implementation has the capacity to track points faster than real-time, and can be flexibly extended to higher-resolution videos. Given the high-quality trajectories extracted from a large dataset, we demonstrate a proof-of-concept diffusion model which generates trajectories from static images, enabling plausible animations. Visualizations, source code, and pretrained models can be found on our project webpage.
Submitted 30 August, 2023; v1 submitted 14 June, 2023;
originally announced June 2023.
-
Graph2Pix: A Graph-Based Image to Image Translation Framework
Authors:
Dilara Gokay,
Enis Simsar,
Efehan Atici,
Alper Ahmetoglu,
Atif Emre Yuksel,
Pinar Yanardag
Abstract:
In this paper, we propose a graph-based image-to-image translation framework for generating images. We use rich data collected from the popular creativity platform Artbreeder (http://artbreeder.com), where users interpolate multiple GAN-generated images to create artworks. This unique approach of creating new images leads to a tree-like structure where one can track historical data about the creation of a particular image. Inspired by this structure, we propose a novel graph-to-image translation model called Graph2Pix, which takes a graph and corresponding images as input and generates a single image as output. Our experiments show that Graph2Pix is able to outperform several image-to-image translation frameworks on benchmark metrics, including LPIPS (with a 25% improvement) and human perception studies (n=60), where users preferred the images generated by our method 81.5% of the time. Our source code and dataset are publicly available at https://github.com/catlab-team/graph2pix.
Submitted 22 August, 2021;
originally announced August 2021.
-
Analytical Derivation of the Impulse Response for the Bounded 2-D Diffusion Channel
Authors:
Fatih Dinc,
Bayram Cevdet Akdeniz,
Ecda Erol,
Dilara Gokay,
Ezgi Tekgul,
Ali Emre Pusane,
Tuna Tugcu
Abstract:
This paper focuses on the derivation of the distribution of diffused particles absorbed by an agent in a bounded environment. In particular, we derive, by analogy, the impulse response of a molecular communication channel in 2-D and 3-D environments. In 2-D, the channel involves a point transmitter that releases molecules to a circular absorbing receiver that absorbs incoming molecules in an environment surrounded by a circular reflecting boundary. Considering this setup, the joint distribution of the molecules on the circular absorbing receiver with respect to time and angle is derived. Using this distribution, the channel characteristics are examined. Furthermore, we extend this channel model to 3-D using a cylindrical receiver and investigate the channel properties. We also propose how to obtain an analytical solution for the unbounded 2-D channel from our derived solutions, as no analytical derivation for this channel is present in the literature.
Submitted 24 September, 2018;
originally announced September 2018.