Video OWL-ViT: Temporally-consistent open-world localization in video

Heigold, Georg; Minderer, Matthias; Gritsenko, Alexey; Bewley, Alex; Keysers, Daniel; Lučić, Mario; Yu, Fisher; Kipf, Thomas

arXiv:2308.11093 (cs)

[Submitted on 22 Aug 2023]

Abstract: We present an architecture and a training recipe that adapt pre-trained open-world image models to localization in videos. Understanding the open visual world (without being constrained by fixed label spaces) is crucial for many real-world vision tasks. Contrastive pre-training on large image-text datasets has recently led to significant improvements for image-level tasks. For more structured tasks involving object localization, applying pre-trained models is more challenging. This is particularly true for video tasks, where task-specific data is limited. We show successful transfer of open-world models by building on the OWL-ViT open-vocabulary detection model and adapting it to video by adding a transformer decoder. The decoder propagates object representations recurrently through time by using the output tokens for one frame as the object queries for the next. Our model is end-to-end trainable on video data and shows improved temporal consistency compared to tracking-by-detection baselines, while retaining the open-world capabilities of the backbone detector. We evaluate our model on the challenging TAO-OW benchmark and demonstrate that open-world capabilities, learned from large-scale image-text pre-training, can be transferred successfully to open-world localization across diverse videos.
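
The recurrence described in the abstract, where one frame's output tokens become the next frame's object queries, can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_decoder` is a hypothetical stand-in (a single weight-free cross-attention step with a residual connection), and the function names, shapes, and random inputs are assumptions chosen only to show the data flow.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_decoder(queries, frame_tokens):
    """Hypothetical decoder step: each query attends over one frame's tokens.

    A real transformer decoder has learned projections, several layers and
    self-attention; this stand-in has no learned weights at all.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, tok)) / math.sqrt(dim)
            for tok in frame_tokens
        ]
        weights = softmax(scores)
        attended = [
            sum(w * tok[d] for w, tok in zip(weights, frame_tokens))
            for d in range(dim)
        ]
        # Residual connection: the output keeps part of the incoming query.
        outputs.append([qd + ad for qd, ad in zip(q, attended)])
    return outputs


def track_video(frames, initial_queries):
    """Process frames in order, feeding each frame's outputs back as queries."""
    queries = initial_queries
    per_frame_outputs = []
    for frame_tokens in frames:
        outputs = toy_decoder(queries, frame_tokens)
        per_frame_outputs.append(outputs)
        # Recurrence: query slot i at frame t+1 is output slot i at frame t,
        # so a slot can stay associated with one object over time.
        queries = outputs
    return per_frame_outputs


if __name__ == "__main__":
    rng = random.Random(0)
    dim, num_queries, num_tokens, num_frames = 4, 3, 5, 2
    frames = [
        [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
        for _ in range(num_frames)
    ]
    initial_queries = [
        [rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)
    ]
    outputs = track_video(frames, initial_queries)
    print(len(outputs), "frames,", len(outputs[0]), "object slots per frame")
```

In this sketch the number of object slots is fixed by `initial_queries`, and identity across frames is carried by slot index rather than by a separate matching step, which is the contrast with tracking-by-detection that the abstract draws.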

Comments:	ICCV 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2308.11093 [cs.CV]
	(or arXiv:2308.11093v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2308.11093

Submission history

From: Thomas Kipf
[v1] Tue, 22 Aug 2023 00:21:32 UTC (10,429 KB)
