Showing 1–50 of 136 results for author: Snoek, C G M

  1. arXiv:2507.08799  [pdf, ps, other]

    cs.CL cs.AI

    KV Cache Steering for Inducing Reasoning in Small Language Models

    Authors: Max Belitsky, Dawid J. Kopiczko, Michael Dorkenwald, M. Jehanzeb Mirza, Cees G. M. Snoek, Yuki M. Asano

    Abstract: We propose cache steering, a lightweight method for implicit steering of language models via a one-shot intervention applied directly to the key-value cache. To validate its effectiveness, we apply cache steering to induce chain-of-thought reasoning in small language models. Our approach leverages GPT-4o-generated reasoning traces to construct steering vectors that shift model behavior toward more…

    Submitted 11 July, 2025; originally announced July 2025.

  2. arXiv:2507.06137  [pdf, ps, other]

    cs.CL cs.AI cs.CV

    NeoBabel: A Multilingual Open Tower for Visual Generation

    Authors: Mohammad Mahdi Derakhshani, Dheeraj Varghese, Marzieh Fadaee, Cees G. M. Snoek

    Abstract: Text-to-image generation advancements have been predominantly English-centric, creating barriers for non-English speakers and perpetuating digital inequities. While existing systems rely on translation pipelines, these introduce semantic drift, computational overhead, and cultural misalignment. We introduce NeoBabel, a novel multilingual image generation framework that sets a new Pareto frontier i…

    Submitted 8 July, 2025; originally announced July 2025.

    Comments: 34 pages, 12 figures

  3. arXiv:2506.19331  [pdf, ps, other]

    cs.CV

    Segment Any 3D-Part in a Scene from a Sentence

    Authors: Hongyu Wu, Pengwan Yang, Yuki M. Asano, Cees G. M. Snoek

    Abstract: This paper aims to achieve the segmentation of any 3D part in a scene based on natural language descriptions, extending beyond traditional object-level 3D scene understanding and addressing both data and methodological challenges. Due to the expensive acquisition and annotation burden, existing datasets and methods are predominantly limited to object-level comprehension. To overcome the limitation…

    Submitted 24 June, 2025; originally announced June 2025.

  4. arXiv:2506.10710  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    Continual Hyperbolic Learning of Instances and Classes

    Authors: Melika Ayoughi, Mina Ghadimi Atigh, Mohammad Mahdi Derakhshani, Cees G. M. Snoek, Pascal Mettes, Paul Groth

    Abstract: Continual learning has traditionally focused on classifying either instances or classes, but real-world applications, such as robotics and self-driving cars, require models to handle both simultaneously. To mirror real-life scenarios, we introduce the task of continual learning of instances and classes, at the same time. This task challenges models to adapt to multiple levels of granularity over t…

    Submitted 12 June, 2025; originally announced June 2025.

  5. arXiv:2506.09095  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    Foundation Models in Medical Imaging -- A Review and Outlook

    Authors: Vivien van Veldhuizen, Vanessa Botha, Chunyao Lu, Melis Erdal Cesur, Kevin Groot Lipman, Edwin D. de Jong, Hugo Horlings, Clárisa I. Sanchez, Cees G. M. Snoek, Lodewyk Wessels, Ritse Mann, Eric Marcus, Jonas Teuwen

    Abstract: Foundation models (FMs) are changing the way medical images are analyzed by learning from large collections of unlabeled data. Instead of relying on manually annotated examples, FMs are pre-trained to learn general-purpose visual features that can later be adapted to specific clinical tasks with little additional supervision. In this review, we examine how FMs are being developed and applied in pa…

    Submitted 16 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

  6. arXiv:2506.08694  [pdf, ps, other]

    cs.CV

    MoSiC: Optimal-Transport Motion Trajectory for Dense Self-Supervised Learning

    Authors: Mohammadreza Salehi, Shashanka Venkataramanan, Ioana Simion, Efstratios Gavves, Cees G. M. Snoek, Yuki M Asano

    Abstract: Dense self-supervised learning has shown great promise for learning pixel- and patch-level representations, but extending it to videos remains challenging due to the complexity of motion dynamics. Existing approaches struggle as they rely on static augmentations that fail under object deformations, occlusions, and camera movement, leading to inconsistent feature learning over time. We propose a mo…

    Submitted 10 July, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

    Comments: Accepted to ICCV2025

  7. arXiv:2504.11055  [pdf, ps, other]

    cs.CV

    Crane: Context-Guided Prompt Learning and Attention Refinement for Zero-Shot Anomaly Detections

    Authors: Alireza Salehi, Mohammadreza Salehi, Reshad Hosseini, Cees G. M. Snoek, Makoto Yamada, Mohammad Sabokrou

    Abstract: Anomaly Detection (AD) involves identifying deviations from normal data distributions and is critical in fields such as medical diagnostics and industrial defect detection. Traditional AD methods typically require the availability of normal training samples; however, this assumption is not always feasible, as collecting such data can be impractical. Additionally, these methods often struggle to ge…

    Submitted 15 April, 2025; originally announced April 2025.

  8. arXiv:2504.05706  [pdf, other]

    cs.CV

    SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning

    Authors: Fida Mohammad Thoker, Letian Jiang, Chen Zhao, Piyush Bagad, Hazel Doughty, Bernard Ghanem, Cees G. M. Snoek

    Abstract: Continued advances in self-supervised learning have led to significant progress in video representation learning, offering a scalable alternative to supervised approaches by removing the need for manual annotations. Despite strong performance on standard action recognition benchmarks, video self-supervised learning methods are largely evaluated under narrow protocols, typically pretraining on Kine…

    Submitted 8 April, 2025; originally announced April 2025.

    Comments: Under Review

  9. arXiv:2503.16311  [pdf, other]

    cs.LG cs.AI cs.SD

    Structured-Noise Masked Modeling for Video, Audio and Beyond

    Authors: Aritra Bhowmik, Fida Mohammad Thoker, Carlos Hinojosa, Bernard Ghanem, Cees G. M. Snoek

    Abstract: Masked modeling has emerged as a powerful self-supervised learning framework, but existing methods largely rely on random masking, disregarding the structural properties of different modalities. In this work, we introduce structured noise-based masking, a simple yet effective approach that naturally aligns with the spatial, temporal, and spectral characteristics of video and audio data. By filteri…

    Submitted 20 March, 2025; originally announced March 2025.

  10. arXiv:2502.12195  [pdf, other]

    cs.LG

    GeneralizeFormer: Layer-Adaptive Model Generation across Test-Time Distribution Shifts

    Authors: Sameer Ambekar, Zehao Xiao, Xiantong Zhen, Cees G. M. Snoek

    Abstract: We consider the problem of test-time domain generalization, where a model is trained on several source domains and adjusted on target domains never seen during training. Different from the common methods that fine-tune the model or adjust the classifier parameters online, we propose to generate multiple layer parameters on the fly during inference by a lightweight meta-learned transformer, which w…

    Submitted 15 February, 2025; originally announced February 2025.

    Comments: WACV 2025

  11. arXiv:2502.02338  [pdf, other]

    cs.CV cs.LG

    Geometric Neural Process Fields

    Authors: Wenzhe Yin, Zehao Xiao, Jiayi Shen, Yunlu Chen, Cees G. M. Snoek, Jan-Jakob Sonke, Efstratios Gavves

    Abstract: This paper addresses the challenge of Neural Field (NeF) generalization, where models must efficiently adapt to new signals given only a few observations. To tackle this, we propose Geometric Neural Process Fields (G-NPF), a probabilistic framework for neural radiance fields that explicitly captures uncertainty. We formulate NeF generalization as a probabilistic problem, enabling direct inference…

    Submitted 4 February, 2025; originally announced February 2025.

  12. arXiv:2501.16404  [pdf, other]

    cs.LG cs.AI cs.CL

    DynaPrompt: Dynamic Test-Time Prompt Tuning

    Authors: Zehao Xiao, Shilin Yan, Jack Hong, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiayi Shen, Qi Wang, Cees G. M. Snoek

    Abstract: Test-time prompt tuning enhances zero-shot generalization of vision-language models but tends to ignore the relatedness among test samples during inference. Online test-time prompt tuning provides a simple way to leverage the information in previous test samples, albeit with the risk of prompt collapse due to error accumulation. To enhance test-time prompt tuning, we propose DynaPrompt, short for…

    Submitted 27 January, 2025; originally announced January 2025.

    Comments: ICLR 2025

  13. arXiv:2501.05069  [pdf, other]

    cs.CV cs.AI

    Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning

    Authors: Huabin Liu, Filip Ilievski, Cees G. M. Snoek

    Abstract: This paper proposes the first video-grounded entailment tree reasoning method for commonsense video question answering (VQA). Despite the remarkable progress of large visual-language models (VLMs), there are growing concerns that they learn spurious correlations between videos and likely answers, reinforced by their black-box nature and remaining benchmarking biases. Our method explicitly grounds…

    Submitted 24 March, 2025; v1 submitted 9 January, 2025; originally announced January 2025.

    Comments: Accepted by CVPR 2025

  14. arXiv:2412.11148  [pdf, other]

    cs.CV

    Redefining Normal: A Novel Object-Level Approach for Multi-Object Novelty Detection

    Authors: Mohammadreza Salehi, Nikolaos Apostolikas, Efstratios Gavves, Cees G. M. Snoek, Yuki M. Asano

    Abstract: In the realm of novelty detection, accurately identifying outliers in data without specific class information poses a significant challenge. While current methods excel in single-object scenarios, they struggle with multi-object situations due to their focus on individual objects. Our paper suggests a novel approach: redefining 'normal' at the object level in training datasets. Rather than the usu…

    Submitted 15 December, 2024; originally announced December 2024.

    Comments: Accepted at ACCV24(Oral)

  15. arXiv:2411.19534  [pdf, other]

    cs.CV cs.LG

    QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain

    Authors: Wenfang Sun, Yingjun Du, Gaowen Liu, Cees G. M. Snoek

    Abstract: We tackle the problem of quantifying the number of objects by a generative text-to-image model. Rather than retraining such a model for each new image domain of interest, which leads to high computational costs and limited scalability, we are the first to consider this problem from a domain-agnostic perspective. We propose QUOTA, an optimization framework for text-to-image models that enables effe…

    Submitted 29 November, 2024; originally announced November 2024.

    Comments: 12 pages, 6 figures

  16. arXiv:2411.11222  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    The Sound of Water: Inferring Physical Properties from Pouring Liquids

    Authors: Piyush Bagad, Makarand Tapaswi, Cees G. M. Snoek, Andrew Zisserman

    Abstract: We study the connection between audio-visual observations and the underlying physics of a mundane yet intriguing everyday activity: pouring liquids. Given only the sound of liquid pouring into a container, our objective is to automatically infer physical properties such as the liquid level, the shape and size of the container, the pouring rate and the time to fill. To this end, we: (i) show in the…

    Submitted 13 January, 2025; v1 submitted 17 November, 2024; originally announced November 2024.

    Comments: Project page at https://bpiyush.github.io/pouring-water-website. Short version accepted to ICASSP 2025

  17. arXiv:2411.04679  [pdf, other]

    cs.AI cs.CV cs.MA

    CaPo: Cooperative Plan Optimization for Efficient Embodied Multi-Agent Cooperation

    Authors: Jie Liu, Pan Zhou, Yingjun Du, Ah-Hwee Tan, Cees G. M. Snoek, Jan-Jakob Sonke, Efstratios Gavves

    Abstract: In this work, we address the cooperation problem among large language model (LLM) based embodied agents, where agents must cooperate to achieve a common goal. Previous methods often execute actions extemporaneously and incoherently, without long-term strategic and cooperative planning, leading to redundant steps, failures, and even serious repercussions in complex tasks like search-and-rescue miss…

    Submitted 1 March, 2025; v1 submitted 7 November, 2024; originally announced November 2024.

    Comments: Accepted in ICLR2025

  18. arXiv:2411.03687  [pdf, other]

    cs.LG cs.AI

    Beyond Model Adaptation at Test Time: A Survey

    Authors: Zehao Xiao, Cees G. M. Snoek

    Abstract: Machine learning algorithms have achieved remarkable success across various disciplines, use cases and applications, under the prevailing assumption that training and test samples are drawn from the same distribution. Consequently, these algorithms struggle and become brittle even when samples in the test distribution start to deviate from the ones observed during training. Domain adaptation and d…

    Submitted 6 November, 2024; originally announced November 2024.

  19. arXiv:2410.20164  [pdf, other]

    cs.LG cs.CV

    Prompt Diffusion Robustifies Any-Modality Prompt Learning

    Authors: Yingjun Du, Gaowen Liu, Yuzhang Shang, Yuguang Yao, Ramana Kompella, Cees G. M. Snoek

    Abstract: Foundation models enable prompt-based classifiers for zero-shot and few-shot learning. Nonetheless, the conventional method of employing fixed prompts suffers from distributional shifts that negatively impact generalizability to unseen samples. This paper introduces prompt diffusion, which uses a diffusion model to gradually refine the prompts to obtain a customized prompt for each sample. Specifi…

    Submitted 26 October, 2024; originally announced October 2024.

    Comments: Under review

  20. arXiv:2410.15397  [pdf, other]

    cs.LG cs.CL cs.CV

    IPO: Interpretable Prompt Optimization for Vision-Language Models

    Authors: Yingjun Du, Wenfang Sun, Cees G. M. Snoek

    Abstract: Pre-trained vision-language models like CLIP have remarkably adapted to various downstream tasks. Nonetheless, their performance heavily depends on the specificity of the input text prompts, which requires skillful prompt template engineering. Instead, current approaches to prompt optimization learn the prompts through gradient descent, where the prompts are treated as adjustable parameters. Howev…

    Submitted 20 October, 2024; originally announced October 2024.

    Comments: Accepted by NeurIPS 2024

  21. arXiv:2410.12407  [pdf, other]

    cs.CV cs.CL cs.MM

    Beyond Coarse-Grained Matching in Video-Text Retrieval

    Authors: Aozhu Chen, Hazel Doughty, Xirong Li, Cees G. M. Snoek

    Abstract: Video-text retrieval has seen significant advancements, yet the ability of models to discern subtle differences in captions still requires verification. In this paper, we introduce a new approach for fine-grained evaluation. Our approach can be applied to existing datasets by automatically generating hard negative test captions with subtle single-word variations across nouns, verbs, adjectives, ad…

    Submitted 17 October, 2024; v1 submitted 16 October, 2024; originally announced October 2024.

    Comments: Accepted to ACCV 2024

  22. arXiv:2410.12018  [pdf, other]

    cs.CV cs.CL cs.MM

    LocoMotion: Learning Motion-Focused Video-Language Representations

    Authors: Hazel Doughty, Fida Mohammad Thoker, Cees G. M. Snoek

    Abstract: This paper strives for motion-focused video-language representations. Existing methods to learn video-language representations use spatial-focused data, where identifying the objects and scene is often enough to distinguish the relevant caption. We instead propose LocoMotion to learn from motion-focused captions that describe the movement and temporal progression of local object motions. We achiev…

    Submitted 23 October, 2024; v1 submitted 15 October, 2024; originally announced October 2024.

    Comments: ACCV 2024 Oral

  23. arXiv:2410.10491  [pdf, other]

    cs.CV

    TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning

    Authors: Aritra Bhowmik, Mohammad Mahdi Derakhshani, Dennis Koelma, Yuki M. Asano, Martin R. Oswald, Cees G. M. Snoek

    Abstract: Spatial awareness is key to enable embodied multimodal AI systems. Yet, without vast amounts of spatial supervision, current Multimodal Large Language Models (MLLMs) struggle at this task. In this paper, we introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability without forgetting their existing image and language understanding skills. To this end, we propo…

    Submitted 20 March, 2025; v1 submitted 14 October, 2024; originally announced October 2024.

  24. arXiv:2410.10034  [pdf, other]

    cs.CV

    TULIP: Token-length Upgraded CLIP

    Authors: Ivona Najdenkoska, Mohammad Mahdi Derakhshani, Yuki M. Asano, Nanne van Noord, Marcel Worring, Cees G. M. Snoek

    Abstract: We address the challenge of representing long captions in vision-language models, such as CLIP. By design these models are limited by fixed, absolute positional encodings, restricting inputs to a maximum of 77 tokens and hindering performance on tasks requiring longer descriptions. Although recent work has attempted to overcome this limit, their proposed approaches struggle to model token relation…

    Submitted 28 March, 2025; v1 submitted 13 October, 2024; originally announced October 2024.

  25. arXiv:2410.07752  [pdf, other]

    cs.CV

    Lost in Time: A New Temporal Benchmark for VideoLLMs

    Authors: Daniel Cores, Michael Dorkenwald, Manuel Mucientes, Cees G. M. Snoek, Yuki M. Asano

    Abstract: Large language models have demonstrated impressive performance when integrated with vision models even enabling video understanding. However, evaluating video models presents its own unique challenges, for which several benchmarks have been proposed. In this paper, we show that the currently most used video-language benchmarks can be solved without requiring much temporal reasoning. We identified…

    Submitted 25 March, 2025; v1 submitted 10 October, 2024; originally announced October 2024.

  26. arXiv:2408.14371  [pdf, other]

    cs.CV cs.AI cs.LG

    SelEx: Self-Expertise in Fine-Grained Generalized Category Discovery

    Authors: Sarah Rastegar, Mohammadreza Salehi, Yuki M. Asano, Hazel Doughty, Cees G. M. Snoek

    Abstract: In this paper, we address Generalized Category Discovery, aiming to simultaneously uncover novel categories and accurately classify known ones. Traditional methods, which lean heavily on self-supervision and contrastive learning, often fall short when distinguishing between fine-grained categories. To address this, we introduce a novel concept called 'self-expertise', which enhances the model's ab…

    Submitted 10 November, 2024; v1 submitted 26 August, 2024; originally announced August 2024.

    Comments: Accepted by ECCV 2024

  27. arXiv:2407.15447  [pdf, other]

    cs.CV

    SIGMA: Sinkhorn-Guided Masked Video Modeling

    Authors: Mohammadreza Salehi, Michael Dorkenwald, Fida Mohammad Thoker, Efstratios Gavves, Cees G. M. Snoek, Yuki M. Asano

    Abstract: Video-based pretraining offers immense potential for learning strong visual representations on an unprecedented scale. Recently, masked video modeling methods have shown promising scalability, yet fall short in capturing higher-level semantics due to reconstructing predefined low-level targets such as pixels. To tackle this, we present Sinkhorn-guided Masked Video Modelling (SIGMA), a novel video…

    Submitted 22 July, 2024; originally announced July 2024.

    Comments: Accepted at ECCV 24

  28. arXiv:2407.12427  [pdf, other]

    cs.CV

    GeneralAD: Anomaly Detection Across Domains by Attending to Distorted Features

    Authors: Luc P. J. Sträter, Mohammadreza Salehi, Efstratios Gavves, Cees G. M. Snoek, Yuki M. Asano

    Abstract: In the domain of anomaly detection, methods often excel in either high-level semantic or low-level industrial benchmarks, rarely achieving cross-domain proficiency. Semantic anomalies are novelties that differ in meaning from the training set, like unseen objects in self-driving cars. In contrast, industrial anomalies are subtle defects that preserve semantic meaning, such as cracks in airplane co…

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: Accepted at ECCV 2024

  29. arXiv:2406.09415  [pdf, other]

    cs.CV cs.LG

    An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

    Authors: Duy-Kien Nguyen, Mahmoud Assran, Unnat Jain, Martin R. Oswald, Cees G. M. Snoek, Xinlei Chen

    Abstract: This work does not introduce a new method. Instead, we present an interesting finding that questions the necessity of the inductive bias of locality in modern computer vision architectures. Concretely, we find that vanilla Transformers can operate by directly treating each individual pixel as a token and achieve highly performant results. This is substantially different from the popular design in…

    Submitted 13 March, 2025; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: In Proceeding of ICLR'2025

  30. arXiv:2404.00701  [pdf, other]

    cs.CV

    Training-Free Semantic Segmentation via LLM-Supervision

    Authors: Wenfang Sun, Yingjun Du, Gaowen Liu, Ramana Kompella, Cees G. M. Snoek

    Abstract: Recent advancements in open vocabulary models, like CLIP, have notably advanced zero-shot classification and segmentation by utilizing natural language for class-specific embeddings. However, most research has focused on improving model accuracy through prompt engineering, prompt learning, or fine-tuning with limited labeled data, thereby overlooking the importance of refining the class descriptor…

    Submitted 31 March, 2024; originally announced April 2024.

    Comments: 22 pages, 10 figures, conference

  31. arXiv:2403.12143  [pdf, other]

    cs.LG cs.AI stat.ML

    Graph Neural Networks for Learning Equivariant Representations of Neural Networks

    Authors: Miltiadis Kofinas, Boris Knyazev, Yan Zhang, Yunlu Chen, Gertjan J. Burghouts, Efstratios Gavves, Cees G. M. Snoek, David W. Zhang

    Abstract: Neural networks that process the parameters of other neural networks find applications in domains as diverse as classifying implicit neural representations, generating neural network weights, and predicting generalization errors. However, existing approaches either overlook the inherent permutation symmetry in the neural network or rely on intricate weight-sharing patterns to achieve equivariance,…

    Submitted 23 July, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

    Comments: In ICLR 2024. Source code: https://github.com/mkofinas/neural-graphs

  32. arXiv:2402.10099  [pdf, other]

    cs.CV

    Any-Shift Prompting for Generalization over Distributions

    Authors: Zehao Xiao, Jiayi Shen, Mohammad Mahdi Derakhshani, Shengcai Liao, Cees G. M. Snoek

    Abstract: Image-language models with prompt learning have shown remarkable advances in numerous downstream vision tasks. Nevertheless, conventional prompt learning methods overfit their training distribution and lose the generalization ability on test distributions. To improve generalization across various distribution shifts, we propose any-shift prompting: a general probabilistic inference framework that…

    Submitted 15 February, 2024; originally announced February 2024.

  33. arXiv:2402.08657  [pdf, other]

    cs.CV

    PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs

    Authors: Michael Dorkenwald, Nimrod Barazani, Cees G. M. Snoek, Yuki M. Asano

    Abstract: Vision-Language Models (VLMs), such as Flamingo and GPT-4V, have shown immense potential by integrating large language models with vision systems. Nevertheless, these models face challenges in the fundamental computer vision task of object localisation, due to their training on multimodal data containing mostly captions without explicit spatial grounding. While it is possible to construct custom,…

    Submitted 13 February, 2024; originally announced February 2024.

  34. arXiv:2401.04716  [pdf, other]

    cs.CV

    Low-Resource Vision Challenges for Foundation Models

    Authors: Yunhua Zhang, Hazel Doughty, Cees G. M. Snoek

    Abstract: Low-resource settings are well-established in natural language processing, where many languages lack sufficient data for deep learning at scale. However, low-resource problems are under-explored in computer vision. In this paper, we address this gap and explore the challenges of low-resource image tasks with vision foundation models. We first collect a benchmark of genuinely low-resource image dat…

    Submitted 11 April, 2024; v1 submitted 9 January, 2024; originally announced January 2024.

    Comments: Accepted at CVPR2024

  35. arXiv:2312.10825  [pdf, other]

    cs.CV cs.LG

    Latent Space Editing in Transformer-Based Flow Matching

    Authors: Vincent Tao Hu, David W Zhang, Pascal Mettes, Meng Tang, Deli Zhao, Cees G. M. Snoek

    Abstract: This paper strives for image editing via generative models. Flow Matching is an emerging generative modeling technique that offers the advantage of simple and efficient training. Simultaneously, a new transformer-based U-ViT has recently been proposed to replace the commonly used UNet for better scalability and performance in generative modeling. Hence, Flow Matching with a transformer backbone of…

    Submitted 17 December, 2023; originally announced December 2023.

    Comments: AAAI 2024 with Appendix

  36. arXiv:2312.08895  [pdf, other]

    cs.CV

    Motion Flow Matching for Human Motion Synthesis and Editing

    Authors: Vincent Tao Hu, Wenzhe Yin, Pingchuan Ma, Yunlu Chen, Basura Fernando, Yuki M Asano, Efstratios Gavves, Pascal Mettes, Bjorn Ommer, Cees G. M. Snoek

    Abstract: Human motion synthesis is a fundamental task in computer animation. Recent methods based on diffusion models or GPT structure demonstrate commendable performance but exhibit drawbacks in terms of slow sampling speeds and error accumulation. In this paper, we propose Motion Flow Matching, a novel generative model designed for human motion generation featuring efficient sampling and effective…

    Submitted 14 December, 2023; originally announced December 2023.

    Comments: WIP

  37. arXiv:2312.08825  [pdf, other]

    cs.CV

    Guided Diffusion from Self-Supervised Diffusion Features

    Authors: Vincent Tao Hu, Yunlu Chen, Mathilde Caron, Yuki M. Asano, Cees G. M. Snoek, Bjorn Ommer

    Abstract: Guidance serves as a key concept in diffusion models, yet its effectiveness is often limited by the need for extra data annotation or classifier pretraining. That is why guidance was harnessed from self-supervised learning backbones, like DINO. However, recent studies have revealed that the feature representation derived from diffusion model itself is discriminative for numerous downstream tasks a…

    Submitted 14 December, 2023; originally announced December 2023.

    Comments: Work In Progress

  38. arXiv:2311.18512  [pdf, other]

    cs.CV cs.LG

    Union-over-Intersections: Object Detection beyond Winner-Takes-All

    Authors: Aritra Bhowmik, Pascal Mettes, Martin R. Oswald, Cees G. M. Snoek

    Abstract: This paper revisits the problem of predicting box locations in object detection architectures. Typically, each box proposal or box query aims to directly maximize the intersection-over-union score with the ground truth, followed by a winner-takes-all non-maximum suppression where only the highest scoring box in each region is retained. We observe that both steps are sub-optimal: the first involves…

    Submitted 19 December, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

    Comments: 17 pages, 6 figures, 12 tables

  39. arXiv:2311.17937  [pdf, other]

    cs.CV

    Unlocking Spatial Comprehension in Text-to-Image Diffusion Models

    Authors: Mohammad Mahdi Derakhshani, Menglin Xia, Harkirat Behl, Cees G. M. Snoek, Victor Rühle

    Abstract: We propose CompFuser, an image generation pipeline that enhances spatial comprehension and attribute assignment in text-to-image generative models. Our pipeline enables the interpretation of instructions defining spatial relationships between objects in a scene, such as 'An image of a gray cat on the left of an orange dog', and generate corresponding images. This is especially important in order t…

    Submitted 28 November, 2023; originally announced November 2023.

  40. arXiv:2311.13895  [pdf, other]

    cs.CV

    Query by Activity Video in the Wild

    Authors: Tao Hu, William Thong, Pascal Mettes, Cees G. M. Snoek

    Abstract: This paper focuses on activity retrieval from a video query in an imbalanced scenario. In current query-by-activity-video literature, a common assumption is that all activities have sufficient labelled examples when learning an embedding. This assumption does however practically not hold, as only a portion of activities have many examples, while other activities are only described by few examples.…

    Submitted 23 November, 2023; originally announced November 2023.

    Comments: An extended version of ICIP 2023

  41. arXiv:2311.08851  [pdf, other]

    cs.LG cs.CV

    Data Augmentations in Deep Weight Spaces

    Authors: Aviv Shamsian, David W. Zhang, Aviv Navon, Yan Zhang, Miltiadis Kofinas, Idan Achituve, Riccardo Valperga, Gertjan J. Burghouts, Efstratios Gavves, Cees G. M. Snoek, Ethan Fetaya, Gal Chechik, Haggai Maron

    Abstract: Learning in weight spaces, where neural networks process the weights of other deep neural networks, has emerged as a promising research direction with applications in various fields, from analyzing and editing neural fields and implicit neural representations, to network pruning and quantization. Recent works designed architectures for effective learning in that space, which takes into account its…

    Submitted 15 November, 2023; originally announced November 2023.

    Comments: Accepted to NeurIPS 2023 Workshop on Symmetry and Geometry in Neural Representations

  42. arXiv:2310.19776  [pdf, other]

    cs.CV cs.AI cs.IT cs.LG

    Learn to Categorize or Categorize to Learn? Self-Coding for Generalized Category Discovery

    Authors: Sarah Rastegar, Hazel Doughty, Cees G. M. Snoek

    Abstract: In the quest for unveiling novel categories at test time, we confront the inherent limitations of traditional supervised recognition models that are restricted by a predefined category set. While strides have been made in the realms of self-supervised and open-world learning towards test-time category discovery, a crucial yet often overlooked question persists: what exactly delineates a category?…

    Submitted 18 January, 2024; v1 submitted 30 October, 2023; originally announced October 2023.

    Comments: Accepted by NeurIPS 2023

    ACM Class: I.2.1.b; I.2.6.g; I.5.4.b; I.4

  43. arXiv:2310.05920  [pdf, other]

    cs.CV

    SimPLR: A Simple and Plain Transformer for Efficient Object Detection and Segmentation

    Authors: Duy-Kien Nguyen, Martin R. Oswald, Cees G. M. Snoek

    Abstract: The ability to detect objects in images at varying scales has played a pivotal role in the design of modern object detectors. Despite considerable progress in removing hand-crafted components and simplifying the architecture with transformers, multi-scale feature maps and pyramid designs remain a key factor for their empirical success. In this paper, we show that shifting the multiscale inductive…

    Submitted 13 March, 2025; v1 submitted 9 October, 2023; originally announced October 2023.

    Comments: In Proceeding of TMLR'2025

  44. arXiv:2310.00500  [pdf, other]

    cs.CV

    Self-Supervised Open-Ended Classification with Small Visual Language Models

    Authors: Mohammad Mahdi Derakhshani, Ivona Najdenkoska, Cees G. M. Snoek, Marcel Worring, Yuki M. Asano

    Abstract: We present Self-Context Adaptation (SeCAt), a self-supervised approach that unlocks few-shot abilities for open-ended classification with small visual language models. Our approach imitates image captions in a self-supervised way based on clustering a large pool of images followed by assigning semantically-unrelated names to clusters. By doing so, we construct a training signal consisting of inter…

    Submitted 6 December, 2023; v1 submitted 30 September, 2023; originally announced October 2023.

  45. arXiv:2308.11796  [pdf, other]

    cs.CV

    Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations

    Authors: Mohammadreza Salehi, Efstratios Gavves, Cees G. M. Snoek, Yuki M. Asano

    Abstract: Spatially dense self-supervised learning is a rapidly growing problem domain with promising applications for unsupervised segmentation and pretraining for dense downstream tasks. Despite the abundance of temporal data in the form of videos, this information-rich source has been largely overlooked. Our paper aims to address this gap by proposing a novel approach that incorporates temporal consisten…

    Submitted 22 August, 2023; originally announced August 2023.

  46. arXiv:2307.04033  [pdf, other]

    cs.LG cs.AI

    Probabilistic Test-Time Generalization by Variational Neighbor-Labeling

    Authors: Sameer Ambekar, Zehao Xiao, Jiayi Shen, Xiantong Zhen, Cees G. M. Snoek

    Abstract: This paper strives for domain generalization, where models are trained exclusively on source domains before being deployed on unseen target domains. We follow the strict separation of source training and target testing, but exploit the value of the unlabeled target data itself during inference. We make three contributions. First, we propose probabilistic pseudo-labeling of target samples to genera…

    Submitted 1 July, 2024; v1 submitted 8 July, 2023; originally announced July 2023.

    Comments: Accepted by CoLLAs 2024

  47. arXiv:2306.12795  [pdf, other]

    cs.CV cs.LG cs.MM

    Learning Unseen Modality Interaction

    Authors: Yunhua Zhang, Hazel Doughty, Cees G. M. Snoek

    Abstract: Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. In this paper, we challenge this modality-complete assumption for multimodal learning and instead strive for generalization to unseen modality combinations during inference. We pose the problem of unseen modality interaction and introduce a first solution. It exploi…

    Submitted 25 October, 2023; v1 submitted 22 June, 2023; originally announced June 2023.

    Comments: Published at NeurIPS 2023

  48. Multi-Label Meta Weighting for Long-Tailed Dynamic Scene Graph Generation

    Authors: Shuo Chen, Yingjun Du, Pascal Mettes, Cees G. M. Snoek

    Abstract: This paper investigates the problem of scene graph generation in videos with the aim of capturing semantic relations between subjects and objects in the form of ⟨subject, predicate, object⟩ triplets. Recognizing the predicate between subject and object pairs is imbalanced and multi-label in nature, ranging from ubiquitous interactions such as spatial relationships (e.g., in fro…

    Submitted 16 June, 2023; originally announced June 2023.

    Comments: ICMR 2023

    ACM Class: I.2.10

  49. arXiv:2306.05411  [pdf, other]

    cs.CV

    R-MAE: Regions Meet Masked Autoencoders

    Authors: Duy-Kien Nguyen, Vaibhav Aggarwal, Yanghao Li, Martin R. Oswald, Alexander Kirillov, Cees G. M. Snoek, Xinlei Chen

    Abstract: In this work, we explore regions as a potential visual analogue of words for self-supervised image representation learning. Inspired by Masked Autoencoding (MAE), a generative pre-training baseline, we propose masked region autoencoding to learn from groups of pixels or regions. Specifically, we design an architecture which efficiently addresses the one-to-many mapping between images and regions,…

    Submitted 4 January, 2024; v1 submitted 8 June, 2023; originally announced June 2023.

  50. arXiv:2306.05189  [pdf, other]

    cs.LG

    EMO: Episodic Memory Optimization for Few-Shot Meta-Learning

    Authors: Yingjun Du, Jiayi Shen, Xiantong Zhen, Cees G. M. Snoek

    Abstract: Few-shot meta-learning presents a challenge for gradient descent optimization due to the limited number of training samples per task. To address this issue, we propose an episodic memory optimization for meta-learning, which we call EMO, inspired by the human ability to recall past learning experiences from the brain's memory. EMO retains the gradient history of past experienced tasks in extern…

    Submitted 26 June, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

    Comments: Accepted by CoLLAs 2023