Showing 1–48 of 48 results for author: Alahari, K

  1. arXiv:2504.13987  [pdf, other]

    cs.CV cs.AI

    Entropy Rectifying Guidance for Diffusion and Flow Models

    Authors: Tariq Berrada Ifriqi, Adriana Romero-Soriano, Michal Drozdzal, Jakob Verbeek, Karteek Alahari

    Abstract: Guidance techniques are commonly used in diffusion and flow models to improve image quality and consistency for conditional generative tasks such as class-conditional and text-to-image generation. In particular, classifier-free guidance (CFG) -- the most widely adopted guidance technique -- contrasts conditional and unconditional predictions to improve the generated images. This results, however,…

    Submitted 18 April, 2025; originally announced April 2025.

  2. arXiv:2503.00677  [pdf, other]

    cs.CV

    Advancing Prompt-Based Methods for Replay-Independent General Continual Learning

    Authors: Zhiqi Kang, Liyuan Wang, Xingxing Zhang, Karteek Alahari

    Abstract: General continual learning (GCL) is a broad concept to describe real-world continual learning (CL) problems, which are often characterized by online data streams without distinct transitions between tasks, i.e., blurry task boundaries. Such requirements result in poor initial performance, limited generalizability, and severe catastrophic forgetting, heavily impacting the effectiveness of mainstrea…

    Submitted 1 March, 2025; originally announced March 2025.

    Comments: ICLR 2025

  3. arXiv:2411.04873  [pdf, other]

    cs.CV

    Boosting Latent Diffusion with Perceptual Objectives

    Authors: Tariq Berrada, Pietro Astolfi, Melissa Hall, Marton Havasi, Yohann Benchetrit, Adriana Romero-Soriano, Karteek Alahari, Michal Drozdzal, Jakob Verbeek

    Abstract: Latent diffusion models (LDMs) power state-of-the-art high-resolution generative image models. LDMs learn the data distribution in the latent space of an autoencoder (AE) and produce images by mapping the generated latents into RGB image space using the AE decoder. While this approach allows for efficient model training and sampling, it induces a disconnect between the training of the diffusion mo…

    Submitted 16 January, 2025; v1 submitted 6 November, 2024; originally announced November 2024.

    Comments: Pre-print

  4. arXiv:2411.03177  [pdf, other]

    cs.CV cs.AI

    On Improved Conditioning Mechanisms and Pre-training Strategies for Diffusion Models

    Authors: Tariq Berrada Ifriqi, Pietro Astolfi, Melissa Hall, Reyhane Askari-Hemmat, Yohann Benchetrit, Marton Havasi, Matthew Muckley, Karteek Alahari, Adriana Romero-Soriano, Jakob Verbeek, Michal Drozdzal

    Abstract: Large-scale training of latent diffusion models (LDMs) has enabled unprecedented quality in image generation. However, the key components of the best performing LDM training recipes are oftentimes not available to the research community, preventing apple-to-apple comparisons and hindering the validation of progress in the field. In this work, we perform an in-depth study of LDM training recipes fo…

    Submitted 20 January, 2025; v1 submitted 5 November, 2024; originally announced November 2024.

    Comments: Accepted as a conference paper (poster) for NeurIPS 2024

  5. arXiv:2312.13314  [pdf, other]

    cs.CV cs.AI cs.LG

    Unlocking Pre-trained Image Backbones for Semantic Image Synthesis

    Authors: Tariq Berrada, Jakob Verbeek, Camille Couprie, Karteek Alahari

    Abstract: Semantic image synthesis, i.e., generating images from user-provided semantic label maps, is an important conditional image generation task as it allows to control both the content as well as the spatial layout of generated images. Although diffusion models have pushed the state of the art in generative image modeling, the iterative nature of their inference process makes them computationally dema…

    Submitted 8 January, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

  6. arXiv:2311.18572  [pdf, other]

    cs.CV

    Overcoming Label Noise for Source-free Unsupervised Video Domain Adaptation

    Authors: Avijit Dasgupta, C. V. Jawahar, Karteek Alahari

    Abstract: Despite the progress seen in classification methods, current approaches for handling videos with distribution shifts in source and target domains remain source-dependent as they require access to the source data during the adaptation stage. In this paper, we present a self-training based source-free video domain adaptation approach to address this challenge by bridging the gap between the source a…

    Submitted 30 November, 2023; originally announced November 2023.

    Comments: Extended version of our ICVGIP paper

  7. arXiv:2308.09610  [pdf, other]

    cs.CV

    On the Effectiveness of LayerNorm Tuning for Continual Learning in Vision Transformers

    Authors: Thomas De Min, Massimiliano Mancini, Karteek Alahari, Xavier Alameda-Pineda, Elisa Ricci

    Abstract: State-of-the-art rehearsal-free continual learning methods exploit the peculiarities of Vision Transformers to learn task-specific prompts, drastically reducing catastrophic forgetting. However, there is a tradeoff between the number of learned parameters and the performance, making such models computationally expensive. In this work, we aim to reduce this cost while maintaining competitive perfor…

    Submitted 18 August, 2023; originally announced August 2023.

    Comments: In The First Workshop on Visual Continual Learning (ICCVW 2023); Oral

  8. arXiv:2308.02668  [pdf, other]

    cs.CV cs.AI cs.LG

    Guided Distillation for Semi-Supervised Instance Segmentation

    Authors: Tariq Berrada, Camille Couprie, Karteek Alahari, Jakob Verbeek

    Abstract: Although instance segmentation methods have improved considerably, the dominant paradigm is to rely on fully-annotated training images, which are tedious to obtain. To alleviate this reliance, and boost results, semi-supervised approaches leverage unlabeled data as an additional training signal that limits overfitting to the labeled samples. In this context, we present novel design choices to sign…

    Submitted 14 December, 2023; v1 submitted 3 August, 2023; originally announced August 2023.

    Comments: Accepted at the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2024)

  9. arXiv:2307.08528  [pdf, other]

    cs.CV cs.LG

    Multi-Domain Learning with Modulation Adapters

    Authors: Ekaterina Iakovleva, Karteek Alahari, Jakob Verbeek

    Abstract: Deep convolutional networks are ubiquitous in computer vision, due to their excellent performance across different tasks for various domains. Models are, however, often trained in isolation for each task, failing to exploit relatedness between tasks and domains to learn more compact models that generalise better in low-data regimes. Multi-domain learning aims to handle related tasks, such as image…

    Submitted 17 July, 2023; originally announced July 2023.

  10. arXiv:2306.07483  [pdf, other]

    cs.CV

    Semi-supervised learning made simple with self-supervised clustering

    Authors: Enrico Fini, Pietro Astolfi, Karteek Alahari, Xavier Alameda-Pineda, Julien Mairal, Moin Nabi, Elisa Ricci

    Abstract: Self-supervised learning models have been shown to learn rich visual representations without requiring human annotations. However, in many real-world scenarios, labels are partially available, motivating a recent line of work on semi-supervised methods inspired by self-supervised principles. In this paper, we propose a conceptually simple yet empirically powerful approach to turn clustering-based…

    Submitted 12 June, 2023; originally announced June 2023.

    Comments: CVPR 2023 - Code available at https://github.com/pietroastolfi/suave-daino

    Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023) 3187-3197

  11. arXiv:2304.11063  [pdf, other]

    cs.CL cs.AI

    Think Before You Act: Unified Policy for Interleaving Language Reasoning with Actions

    Authors: Lina Mezghani, Piotr Bojanowski, Karteek Alahari, Sainbayar Sukhbaatar

    Abstract: The success of transformer models trained with a language modeling objective brings a promising opportunity to the reinforcement learning framework. Decision Transformer is a step towards this direction, showing how to train transformers with a similar next-step prediction objective on offline data. Another important development in this area is the recent emergence of large-scale datasets collecte…

    Submitted 18 April, 2023; originally announced April 2023.

    Journal ref: Reincarnating Reinforcement Learning Workshop at ICLR 2023

  12. arXiv:2301.02099  [pdf, other]

    cs.RO cs.AI cs.LG

    Learning Goal-Conditioned Policies Offline with Self-Supervised Reward Shaping

    Authors: Lina Mezghani, Sainbayar Sukhbaatar, Piotr Bojanowski, Alessandro Lazaric, Karteek Alahari

    Abstract: Developing agents that can execute multiple skills by learning from pre-collected datasets is an important problem in robotics, where online interaction with the environment is extremely time-consuming. Moreover, manually designing reward functions for every single desired skill is prohibitive. Prior works targeted these challenges by learning goal-conditioned policies from offline datasets withou…

    Submitted 5 January, 2023; originally announced January 2023.

    Comments: Code: https://github.com/facebookresearch/go-fresh

    Journal ref: 6th Conference on Robot Learning (CoRL 2022)

  13. arXiv:2212.08420  [pdf, other]

    cs.CV cs.LG

    Fake it till you make it: Learning transferable representations from synthetic ImageNet clones

    Authors: Mert Bulent Sariyildiz, Karteek Alahari, Diane Larlus, Yannis Kalantidis

    Abstract: Recent image generation models such as Stable Diffusion have exhibited an impressive ability to generate fairly realistic images starting from a simple text prompt. Could such models render real images obsolete for training image prediction models? In this paper, we answer part of this provocative question by investigating the need for real images when training models for ImageNet classification.…

    Submitted 28 March, 2023; v1 submitted 16 December, 2022; originally announced December 2022.

    Comments: Accepted to CVPR 2023

  14. arXiv:2212.05102  [pdf, other]

    cs.CV cs.LG

    A soft nearest-neighbor framework for continual semi-supervised learning

    Authors: Zhiqi Kang, Enrico Fini, Moin Nabi, Elisa Ricci, Karteek Alahari

    Abstract: Despite significant advances, the performance of state-of-the-art continual learning approaches hinges on the unrealistic scenario of fully labeled data. In this paper, we tackle this challenge and propose an approach for continual semi-supervised learning--a setting where not all the data samples are labeled. A primary issue in this scenario is the model forgetting representations of unlabeled da…

    Submitted 11 September, 2023; v1 submitted 9 December, 2022; originally announced December 2022.

    Comments: Accepted at ICCV 2023

  15. arXiv:2212.00394  [pdf, other]

    cs.CV cs.AI eess.IV stat.ML

    From CNNs to Shift-Invariant Twin Models Based on Complex Wavelets

    Authors: Hubert Leterme, Kévin Polisano, Valérie Perrier, Karteek Alahari

    Abstract: We propose a novel method to increase shift invariance and prediction accuracy in convolutional neural networks. Specifically, we replace the first-layer combination "real-valued convolutions + max pooling" (RMax) by "complex-valued convolutions + modulus" (CMod), which is stable to translations, or shifts. To justify our approach, we claim that CMod and RMax produce comparable outputs when the co…

    Submitted 31 May, 2024; v1 submitted 1 December, 2022; originally announced December 2022.

  16. arXiv:2211.16289  [pdf, other]

    cs.CV

    Lightweight Structure-Aware Attention for Visual Understanding

    Authors: Heeseung Kwon, Francisco M. Castro, Manuel J. Marin-Jimenez, Nicolas Guil, Karteek Alahari

    Abstract: Vision Transformers (ViTs) have become a dominant paradigm for visual representation learning with self-attention operators. Although these operators provide flexibility to the model with their adjustable attention kernels, they suffer from inherent limitations: (1) the attention kernel is not discriminative enough, resulting in high redundancy of the ViT layers, and (2) the complexity in computat…

    Submitted 29 November, 2022; originally announced November 2022.

    Comments: 8 pages, 5 figures

  17. arXiv:2210.11815  [pdf, other]

    cs.CV cs.LG

    Self-Supervised Pretraining on Satellite Imagery: a Case Study on Label-Efficient Vehicle Detection

    Authors: Jules Bourcier, Thomas Floquet, Gohar Dashyan, Tugdual Ceillier, Karteek Alahari, Jocelyn Chanussot

    Abstract: In defense-related remote sensing applications, such as vehicle detection on satellite imagery, supervised learning requires a huge number of labeled examples to reach operational performances. Such data are challenging to obtain as it requires military experts, and some observables are intrinsically rare. This limited labeling capability, as well as the large number of unlabeled images available…

    Submitted 21 October, 2022; originally announced October 2022.

    Journal ref: Conference on Artificial Intelligence for Defense (CAID) 2022, DGA Maîtrise de l'Information, Nov 2022, Rennes, France

  18. arXiv:2210.06786  [pdf, other]

    eess.IV cs.CV cs.LG

    Evaluating the Label Efficiency of Contrastive Self-Supervised Learning for Multi-Resolution Satellite Imagery

    Authors: Jules Bourcier, Gohar Dashyan, Jocelyn Chanussot, Karteek Alahari

    Abstract: The application of deep neural networks to remote sensing imagery is often constrained by the lack of ground-truth annotations. Addressing this issue requires models that generalize efficiently from limited amounts of labeled data, allowing us to tackle a wider range of Earth observation tasks. Another challenge in this domain is developing algorithms that operate at variable spatial resolutions, e…

    Submitted 13 October, 2022; originally announced October 2022.

    Journal ref: Image and Signal Processing for Remote Sensing XXVIII, SPIE, Sep 2022, Berlin, Germany

  19. arXiv:2209.11740  [pdf, other]

    cs.CV cs.AI eess.SP stat.ML

    On the Shift Invariance of Max Pooling Feature Maps in Convolutional Neural Networks

    Authors: Hubert Leterme, Kévin Polisano, Valérie Perrier, Karteek Alahari

    Abstract: This paper focuses on improving the mathematical interpretability of convolutional neural networks (CNNs) in the context of image classification. Specifically, we tackle the instability issue arising in their first layer, which tends to learn parameters that closely resemble oriented band-pass filters when trained on datasets like ImageNet. Subsampled convolutions with such Gabor-like filters are…

    Submitted 18 April, 2025; v1 submitted 19 September, 2022; originally announced September 2022.

  20. arXiv:2206.15369  [pdf, other]

    cs.CV cs.LG

    No Reason for No Supervision: Improved Generalization in Supervised Models

    Authors: Mert Bulent Sariyildiz, Yannis Kalantidis, Karteek Alahari, Diane Larlus

    Abstract: We consider the problem of training a deep neural network on a given classification task, e.g., ImageNet-1K (IN1K), so that it excels at both the training task as well as at other (future) transfer tasks. These two seemingly contradictory properties impose a trade-off between improving the model's generalization and maintaining its performance on the original task. Models trained with self-supervi…

    Submitted 10 March, 2023; v1 submitted 30 June, 2022; originally announced June 2022.

    Comments: Accepted to ICLR 2023 (spotlight)

  21. arXiv:2206.13294  [pdf, other]

    cs.CV cs.AI cs.RO

    LaRa: Latents and Rays for Multi-Camera Bird's-Eye-View Semantic Segmentation

    Authors: Florent Bartoccioni, Éloi Zablocki, Andrei Bursuc, Patrick Pérez, Matthieu Cord, Karteek Alahari

    Abstract: Recent works in autonomous driving have widely adopted the bird's-eye-view (BEV) semantic map as an intermediate representation of the world. Online prediction of these BEV maps involves non-trivial operations such as multi-camera data extraction as well as fusion and projection into a common topview grid. This is usually done with error-prone geometric operations (e.g., homography or back-project…

    Submitted 26 November, 2022; v1 submitted 27 June, 2022; originally announced June 2022.

    MSC Class: 68T45

    Journal ref: CoRL 2022 https://openreview.net/forum?id=abd_D-iVjk0

  22. arXiv:2206.11733  [pdf, other]

    cs.LG cs.AI cs.RO

    Walk the Random Walk: Learning to Discover and Reach Goals Without Supervision

    Authors: Lina Mezghani, Sainbayar Sukhbaatar, Piotr Bojanowski, Karteek Alahari

    Abstract: Learning a diverse set of skills by interacting with an environment without any external supervision is an important challenge. In particular, obtaining a goal-conditioned agent that can reach any given state is useful in many applications. We propose a novel method for training such a goal-conditioned agent without any external rewards or any domain knowledge. We use random walk to train a reacha…

    Submitted 23 June, 2022; originally announced June 2022.

  23. arXiv:2206.07684  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    AVATAR: Unconstrained Audiovisual Speech Recognition

    Authors: Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

    Abstract: Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth. Unlike works that simply focus on the lip motion, we investigate the contribution of entire visual frames (visual actions, objects, background etc.). This is particularly useful for unconstrained videos, where the speaker is not necessarily visible…

    Submitted 15 June, 2022; originally announced June 2022.

  24. arXiv:2203.00115  [pdf, other]

    cs.CV

    The Right Spin: Learning Object Motion from Rotation-Compensated Flow Fields

    Authors: Pia Bideau, Erik Learned-Miller, Cordelia Schmid, Karteek Alahari

    Abstract: Both a good understanding of geometrical concepts and a broad familiarity with objects lead to our excellent perception of moving objects. The human ability to detect and segment moving objects works in the presence of multiple objects, complex background geometry, motion of the observer and even camouflage. How humans perceive moving objects so reliably is a longstanding research question in comp…

    Submitted 28 February, 2022; originally announced March 2022.

  25. arXiv:2112.04215  [pdf, other]

    cs.CV cs.LG

    Self-Supervised Models are Continual Learners

    Authors: Enrico Fini, Victor G. Turrisi da Costa, Xavier Alameda-Pineda, Elisa Ricci, Karteek Alahari, Julien Mairal

    Abstract: Self-supervised models have been shown to produce comparable or better visual representations than their supervised counterparts when trained offline on unlabeled data at scale. However, their efficacy is catastrophically reduced in a Continual Learning (CL) scenario where data is presented to the model sequentially. In this paper, we show that self-supervised loss functions can be seamlessly conv…

    Submitted 1 April, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

  26. arXiv:2111.01300  [pdf, other]

    cs.CV

    Masking Modalities for Cross-modal Video Retrieval

    Authors: Valentin Gabeur, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

    Abstract: Pre-training on large scale unlabelled datasets has shown impressive performance improvements in the fields of computer vision and natural language processing. Given the advent of large-scale instructional video datasets, a common strategy for pre-training video encoders is to use the accompanying speech as weak supervision. However, as speech is used to supervise the pre-training, it is never see…

    Submitted 3 November, 2021; v1 submitted 1 November, 2021; originally announced November 2021.

    Comments: Accepted at WACV 2022

  27. arXiv:2110.14759  [pdf, ps, other]

    cs.LG cs.CV math.OC stat.ML

    Regularized Frank-Wolfe for Dense CRFs: Generalizing Mean Field and Beyond

    Authors: Đ. Khuê Lê-Huu, Karteek Alahari

    Abstract: We introduce regularized Frank-Wolfe, a general and effective algorithm for inference and learning of dense conditional random fields (CRFs). The algorithm optimizes a nonconvex continuous relaxation of the CRF inference problem using vanilla Frank-Wolfe with approximate updates, which are equivalent to minimizing a regularized energy function. Our proposed method is a generalization of existing a…

    Submitted 7 September, 2022; v1 submitted 27 October, 2021; originally announced October 2021.

    Comments: NeurIPS 2021. This version fixed some minor typos (constant factor 2 removed from bottom-right cell of Theorem 1's table, and from last row of Table 5)

  28. arXiv:2109.03569  [pdf, other]

    cs.CV cs.AI cs.RO

    LiDARTouch: Monocular metric depth estimation with a few-beam LiDAR

    Authors: Florent Bartoccioni, Éloi Zablocki, Patrick Pérez, Matthieu Cord, Karteek Alahari

    Abstract: Vision-based depth estimation is a key feature in autonomous systems, which often relies on a single camera or several independent ones. In such a monocular setup, dense depth is obtained with either additional input from one or several expensive LiDARs, e.g., with 64 beams, or camera-only methods, which suffer from scale-ambiguity and infinite-depth problems. In this paper, we propose a new alter…

    Submitted 25 November, 2022; v1 submitted 8 September, 2021; originally announced September 2021.

    MSC Class: 68T45

  29. arXiv:2101.05181  [pdf, other]

    cs.CV cs.AI cs.RO

    Memory-Augmented Reinforcement Learning for Image-Goal Navigation

    Authors: Lina Mezghani, Sainbayar Sukhbaatar, Thibaut Lavril, Oleksandr Maksymets, Dhruv Batra, Piotr Bojanowski, Karteek Alahari

    Abstract: In this work, we present a memory-augmented approach for image-goal navigation. Earlier attempts, including RL-based and SLAM-based approaches have either shown poor generalization performance, or are heavily-reliant on pose/depth sensors. Our method is based on an attention-based end-to-end model that leverages an episodic memory to learn to navigate. First, we train a state-embedding network in…

    Submitted 12 September, 2022; v1 submitted 13 January, 2021; originally announced January 2021.

    Journal ref: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2022

  30. arXiv:2012.05649  [pdf, other]

    cs.CV cs.LG

    Concept Generalization in Visual Representation Learning

    Authors: Mert Bulent Sariyildiz, Yannis Kalantidis, Diane Larlus, Karteek Alahari

    Abstract: Measuring concept generalization, i.e., the extent to which models trained on a set of (seen) visual concepts can be leveraged to recognize a new set of (unseen) concepts, is a popular way of evaluating visual representations, especially in a self-supervised learning framework. Nonetheless, the choice of unseen concepts for such an evaluation is usually made arbitrarily, and independently from the…

    Submitted 10 September, 2021; v1 submitted 10 December, 2020; originally announced December 2020.

    Comments: Accepted to ICCV 2021. See our project website: https://europe.naverlabs.com/cog-benchmark for code and ImageNet-CoG level files

  31. arXiv:2008.12037  [pdf, other]

    cs.LG cs.CV stat.ML

    Meta-Learning with Shared Amortized Variational Inference

    Authors: Ekaterina Iakovleva, Jakob Verbeek, Karteek Alahari

    Abstract: We propose a novel amortized variational inference scheme for an empirical Bayes meta-learning model, where model parameters are treated as latent variables. We learn the prior distribution over model parameters conditioned on limited training data using a variational autoencoder approach. Our framework proposes sharing the same amortized inference network between the conditional prior and variati…

    Submitted 27 August, 2020; originally announced August 2020.

    Comments: ICML 2020

  32. arXiv:2008.00744  [pdf, other]

    cs.CV

    The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020)

    Authors: Samuel Albanie, Yang Liu, Arsha Nagrani, Antoine Miech, Ernesto Coto, Ivan Laptev, Rahul Sukthankar, Bernard Ghanem, Andrew Zisserman, Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid, Shizhe Chen, Yida Zhao, Qin Jin, Kaixu Cui, Hui Liu, Chen Wang, Yudong Jiang, Xiaoshuai Hao

    Abstract: We present a new video understanding pentathlon challenge, an open competition held in conjunction with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020. The objective of the challenge was to explore and evaluate new methods for text-to-video retrieval-the task of searching for content within a corpus of videos using natural language queries. This report summarizes the re…

    Submitted 3 August, 2020; originally announced August 2020.

    Comments: Individual reports, dataset information, rules, and released source code can be found at the competition webpage (https://www.robots.ox.ac.uk/~vgg/challenges/video-pentathlon)

  33. arXiv:2007.10639  [pdf, other]

    cs.CV

    Multi-modal Transformer for Video Retrieval

    Authors: Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid

    Abstract: The task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets. Most of the existing methods for this caption-to-video retrieval problem do not fully exploit cross-modal cues present in video. Furthermore, they aggregate per-frame visual features with limited or no temporal information. In this paper, we present a mul…

    Submitted 21 July, 2020; originally announced July 2020.

    Comments: ECCV 2020 (spotlight paper)

  34. arXiv:2003.05614  [pdf, other]

    cs.CV

    Beyond the Camera: Neural Networks in World Coordinates

    Authors: Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Karteek Alahari

    Abstract: Eye movement and strategic placement of the visual field onto the retina, gives animals increased resolution of the scene and suppresses distracting information. This fundamental system has been missing from video understanding with deep networks, typically limited to 224 by 224 pixel content locked to the camera frame. We propose a simple idea, WorldFeatures, where each feature at every layer has…

    Submitted 12 March, 2020; originally announced March 2020.

  35. arXiv:1901.01091  [pdf, other]

    cs.CV cs.LG

    Adaptive Density Estimation for Generative Models

    Authors: Thomas Lucas, Konstantin Shmelkov, Karteek Alahari, Cordelia Schmid, Jakob Verbeek

    Abstract: Unsupervised learning of generative models has seen tremendous progress over recent years, in particular due to generative adversarial networks (GANs), variational autoencoders, and flow-based models. GANs have dramatically improved sample quality, but suffer from two drawbacks: (i) they mode-drop, i.e., do not cover the full support of the train data, and (ii) they do not allow for likelihood eva…

    Submitted 3 January, 2020; v1 submitted 4 January, 2019; originally announced January 2019.

  36. arXiv:1807.09536  [pdf, other]

    cs.CV

    End-to-End Incremental Learning

    Authors: Francisco M. Castro, Manuel J. Marín-Jiménez, Nicolás Guil, Cordelia Schmid, Karteek Alahari

    Abstract: Although deep learning approaches have stood out in recent years due to their state-of-the-art results, they continue to suffer from catastrophic forgetting, a dramatic decrease in overall performance when training with new classes added incrementally. This is due to current neural network architectures requiring the entire dataset, consisting of all the samples from the old as well as the new cla…

    Submitted 3 September, 2018; v1 submitted 25 July, 2018; originally announced July 2018.

    Comments: To appear in ECCV 2018

  37. arXiv:1807.09499  [pdf, other]

    cs.CV cs.LG

    How good is my GAN?

    Authors: Konstantin Shmelkov, Cordelia Schmid, Karteek Alahari

    Abstract: Generative adversarial networks (GANs) are one of the most popular methods for generating images today. While impressive results have been validated by visual inspection, a number of quantitative criteria have emerged only recently. We argue here that the existing ones are insufficient and need to be in adequation with the task at hand. In this paper we introduce two measures based on image classi…

    Submitted 25 July, 2018; originally announced July 2018.

    Comments: Accepted to ECCV2018

  38. arXiv:1804.09627  [pdf, other]

    cs.CV

    Actor and Observer: Joint Modeling of First and Third-Person Videos

    Authors: Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, Karteek Alahari

    Abstract: Several theories in cognitive neuroscience suggest that when people interact with the world, or simulate interactions, they do so from a first-person egocentric perspective, and seamlessly transfer knowledge between third-person (observer) and first-person (actor). Despite this, learning such models for human action recognition has not been achievable due to the lack of data. This paper takes a st…

    Submitted 25 April, 2018; originally announced April 2018.

    Comments: CVPR 2018 spotlight presentation

    Journal ref: CVPR 2018

  39. arXiv:1804.09626  [pdf, other]

    cs.CV

    Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos

    Authors: Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, Karteek Alahari

    Abstract: In Actor and Observer we introduced a dataset linking the first and third-person video understanding domains, the Charades-Ego Dataset. In this paper we describe the egocentric aspect of the dataset and present annotations for Charades-Ego with 68,536 activity instances in 68.8 hours of first and third-person video, making it one of the largest and most diverse egocentric datasets available. Chara…

    Submitted 30 April, 2018; v1 submitted 25 April, 2018; originally announced April 2018.

  40. arXiv:1712.01127  [pdf, other]

    cs.CV

    Learning to Segment Moving Objects

    Authors: Pavel Tokmakov, Cordelia Schmid, Karteek Alahari

    Abstract: We study the problem of segmenting moving objects in unconstrained videos. Given a video, the task is to segment all the objects that exhibit independent motion in at least one frame. We formulate this as a learning problem and design our framework with three cues: (i) independent object motion between a pair of frames, which complements object recognition, (ii) object appearance, which helps to c…

    Submitted 1 December, 2017; originally announced December 2017.

    Comments: arXiv admin note: text overlap with arXiv:1704.05737, arXiv:1612.07217

  41. arXiv:1708.06977  [pdf, other]

    cs.CV

    Incremental Learning of Object Detectors without Catastrophic Forgetting

    Authors: Konstantin Shmelkov, Cordelia Schmid, Karteek Alahari

    Abstract: Despite their success for object detection, convolutional neural networks are ill-equipped for incremental learning, i.e., adapting the original model trained on a set of classes to additionally detect objects of new classes, in the absence of the initial training data. They suffer from "catastrophic forgetting" - an abrupt degradation of performance on the original set of classes, when the traini…

    Submitted 23 August, 2017; originally announced August 2017.

    Comments: To appear in ICCV 2017

  42. arXiv:1707.06005  [pdf, other]

    cs.CV

    Detecting Parts for Action Localization

    Authors: Nicolas Chesneau, Grégory Rogez, Karteek Alahari, Cordelia Schmid

    Abstract: In this paper, we propose a new framework for action localization that tracks people in videos and extracts full-body human tubes, i.e., spatio-temporal regions localizing actions, even in the case of occlusions or truncations. This is achieved by training a novel human part detector that scores visible parts while regressing full-body bounding boxes. The core of our method is a convolutional neur…

    Submitted 21 July, 2017; v1 submitted 19 July, 2017; originally announced July 2017.

    Comments: BMVC 2017

  43. arXiv:1704.05737  [pdf, other]

    cs.CV

    Learning Video Object Segmentation with Visual Memory

    Authors: Pavel Tokmakov, Karteek Alahari, Cordelia Schmid

    Abstract: This paper addresses the task of segmenting moving objects in unconstrained videos. We introduce a novel two-stream neural network with an explicit memory module to achieve this. The two streams of the network encode spatial and temporal features in a video sequence respectively, while the memory module captures the evolution of objects over time. The module to build a "visual memory" in video, i.…

    Submitted 12 July, 2017; v1 submitted 19 April, 2017; originally announced April 2017.

  44. arXiv:1612.07217  [pdf, other]

    cs.CV

    Learning Motion Patterns in Videos

    Authors: Pavel Tokmakov, Karteek Alahari, Cordelia Schmid

    Abstract: The problem of determining whether an object is in motion, irrespective of camera motion, is far from being solved. We address this challenging task by learning motion patterns in videos. The core of our approach is a fully convolutional network, which is learned entirely from synthetic video sequences, and their ground-truth optical flow and motion segmentation. This encoder-decoder style archite…

    Submitted 10 April, 2017; v1 submitted 21 December, 2016; originally announced December 2016.

  45. arXiv:1603.07188  [pdf, other]

    cs.CV

    Weakly-Supervised Semantic Segmentation using Motion Cues

    Authors: Pavel Tokmakov, Karteek Alahari, Cordelia Schmid

    Abstract: Fully convolutional neural networks (FCNNs) trained on a large number of images with strong pixel-level annotations have become the new state of the art for the semantic segmentation task. While there have been recent attempts to learn FCNNs from image-level weak annotations, they need additional constraints, such as the size of an object, to obtain reasonable performance. To address this issue, w…

    Submitted 21 April, 2017; v1 submitted 23 March, 2016; originally announced March 2016.

    Comments: Extended version of our ECCV 2016 paper

  46. Enhancing Energy Minimization Framework for Scene Text Recognition with Top-Down Cues

    Authors: Anand Mishra, Karteek Alahari, C. V. Jawahar

    Abstract: Recognizing scene text is a challenging problem, even more so than the recognition of scanned documents. This problem has gained significant attention from the computer vision community in recent years, and several methods based on energy minimization frameworks and deep learning approaches have been proposed. In this work, we focus on the energy minimization framework and propose a model that exp…

    Submitted 12 January, 2016; originally announced January 2016.

  47. arXiv:1509.09114  [pdf, other]

    cs.CV

    Online Object Tracking with Proposal Selection

    Authors: Yang Hua, Karteek Alahari, Cordelia Schmid

    Abstract: Tracking-by-detection approaches are some of the most successful object trackers in recent years. Their success is largely determined by the detector model they learn initially and then update over time. However, under challenging conditions where an object can undergo transformations, e.g., severe rotation, these methods are found to be lacking. In this paper, we address this problem by formulati…

    Submitted 30 September, 2015; originally announced September 2015.

    Comments: ICCV 2015

  48. arXiv:1108.5710  [pdf, other]

    cs.CV cs.AI

    Generalized Fast Approximate Energy Minimization via Graph Cuts: Alpha-Expansion Beta-Shrink Moves

    Authors: Mark Schmidt, Karteek Alahari

    Abstract: We present alpha-expansion beta-shrink moves, a simple generalization of the widely-used alpha-beta swap and alpha-expansion algorithms for approximate energy minimization. We show that in a certain sense, these moves dominate both alpha-beta-swap and alpha-expansion moves, but unlike previous generalizations the new moves require no additional assumptions and are still solvable in polynomial-time…

    Submitted 29 August, 2011; originally announced August 2011.

    Comments: Conference on Uncertainty in Artificial Intelligence (2011)