
Showing 1–50 of 123 results for author: Belongie, S

  1. arXiv:2505.22793  [pdf, other]

    cs.CV cs.CL

    Cultural Evaluations of Vision-Language Models Have a Lot to Learn from Cultural Theory

    Authors: Srishti Yadav, Lauren Tilton, Maria Antoniak, Taylor Arnold, Jiaang Li, Siddhesh Milind Pawar, Antonia Karamolegkou, Stella Frank, Zhaochong An, Negar Rostamzadeh, Daniel Hershcovich, Serge Belongie, Ekaterina Shutova

    Abstract: Modern vision-language models (VLMs) often fail at cultural competency evaluations and benchmarks. Given the diversity of applications built upon VLMs, there is renewed interest in understanding how they encode cultural nuances. While individual aspects of this problem have been studied, we still lack a comprehensive framework for systematically identifying and annotating the nuanced cultural dime…

    Submitted 28 May, 2025; originally announced May 2025.

  2. arXiv:2505.14462  [pdf, ps, other]

    cs.CV cs.CL

    RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding

    Authors: Jiaang Li, Yifei Yuan, Wenyan Li, Mohammad Aliannejadi, Daniel Hershcovich, Anders Søgaard, Ivan Vulić, Wenxuan Zhang, Paul Pu Liang, Yang Deng, Serge Belongie

    Abstract: As vision-language models (VLMs) become increasingly integrated into daily life, the need for accurate visual culture understanding is becoming critical. Yet, these models frequently fall short in interpreting cultural nuances effectively. Prior work has demonstrated the effectiveness of retrieval-augmented generation (RAG) in enhancing cultural understanding in text-only settings, while its appli…

    Submitted 20 May, 2025; originally announced May 2025.

  3. arXiv:2504.14988  [pdf, other]

    cs.CV

    Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation

    Authors: Hong-Tao Yu, Xiu-Shen Wei, Yuxin Peng, Serge Belongie

    Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both holistically and on specialized tasks, fine-grained image tasks-fundamental to computer vision-remain largely unexplored. To fill this gap, we introduce a comprehensive fine…

    Submitted 13 May, 2025; v1 submitted 21 April, 2025; originally announced April 2025.

  4. arXiv:2504.08111  [pdf, other]

    cs.CV

    POEM: Precise Object-level Editing via MLLM control

    Authors: Marco Schouten, Mehmet Onurcan Kaya, Serge Belongie, Dim P. Papadopoulos

    Abstract: Diffusion models have significantly improved text-to-image generation, producing high-quality, realistic images from textual descriptions. Beyond generation, object-level image editing remains a challenging problem, requiring precise modifications while preserving visual coherence. Existing text-based instructional editing methods struggle with localized shape and layout transformations, often int…

    Submitted 10 April, 2025; originally announced April 2025.

    Comments: Accepted to SCIA 2025

  5. arXiv:2504.05457  [pdf, other]

    cs.CV

    Taxonomy-Aware Evaluation of Vision-Language Models

    Authors: Vésteinn Snæbjarnarson, Kevin Du, Niklas Stoehr, Serge Belongie, Ryan Cotterell, Nico Lang, Stella Frank

    Abstract: When a vision-language model (VLM) is prompted to identify an entity depicted in an image, it may answer 'I see a conifer,' rather than the specific label 'norway spruce'. This raises two issues for evaluation: First, the unconstrained generated text needs to be mapped to the evaluation label space (i.e., 'conifer'). Second, a useful classification measure should give partial credit to less-specif…

    Submitted 7 April, 2025; originally announced April 2025.

    Comments: CVPR 2025

  6. arXiv:2504.02821  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models

    Authors: Mateusz Pach, Shyamgopal Karthik, Quentin Bouniot, Serge Belongie, Zeynep Akata

    Abstract: Given that interpretability and steerability are crucial to AI safety, Sparse Autoencoders (SAEs) have emerged as a tool to enhance them in Large Language Models (LLMs). In this work, we extend the application of SAEs to Vision-Language Models (VLMs), such as CLIP, and introduce a comprehensive framework for evaluating monosemanticity at the neuron-level in vision representations. To ensure that o…

    Submitted 6 June, 2025; v1 submitted 3 April, 2025; originally announced April 2025.

    Comments: Preprint

  7. arXiv:2503.20960  [pdf, ps, other]

    cs.CL cs.CY cs.LG

    Multi-Modal Framing Analysis of News

    Authors: Arnav Arora, Srishti Yadav, Maria Antoniak, Serge Belongie, Isabelle Augenstein

    Abstract: Automated frame analysis of political communication is a popular task in computational social science that is used to study how authors select aspects of a topic to frame its reception. So far, such studies have been narrow, in that they use a fixed set of pre-defined frames and focus only on the text, ignoring the visual contexts in which those texts appear. Especially for framing in the news, th…

    Submitted 29 May, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

  8. arXiv:2503.16282  [pdf, other]

    cs.CV

    Generalized Few-shot 3D Point Cloud Segmentation with Vision-Language Model

    Authors: Zhaochong An, Guolei Sun, Yun Liu, Runjia Li, Junlin Han, Ender Konukoglu, Serge Belongie

    Abstract: Generalized few-shot 3D point cloud segmentation (GFS-PCS) adapts models to new classes with few support samples while retaining base class segmentation. Existing GFS-PCS methods enhance prototypes via interacting with support or query features but remain limited by sparse knowledge from few-shot samples. Meanwhile, 3D vision-language models (3D VLMs), generalizing across open-world novel classes,…

    Submitted 20 May, 2025; v1 submitted 20 March, 2025; originally announced March 2025.

    Comments: Accepted to CVPR 2025

  9. arXiv:2502.20847  [pdf, other]

    cs.LG

    Gradient Imbalance in Direct Preference Optimization

    Authors: Qinwei Ma, Jingzhe Shi, Can Jin, Jenq-Neng Hwang, Serge Belongie, Lei Li

    Abstract: Direct Preference Optimization (DPO) has been proposed as a promising alternative to Proximal Policy Optimization (PPO) based Reinforcement Learning with Human Feedback (RLHF). However, empirical evaluations consistently reveal suboptimal performance in DPO compared to common RLHF pipelines. In this work, we conduct a systematic analysis of DPO's training dynamics and identify gradient imbalance a…

    Submitted 28 February, 2025; originally announced February 2025.

    Comments: 15 pages, 2 figures

  10. arXiv:2502.18180  [pdf, other]

    cs.AI cs.MA

    ChatMotion: A Multimodal Multi-Agent for Human Motion Analysis

    Authors: Lei Li, Sen Jia, Jianhao Wang, Zhaochong An, Jiaang Li, Jenq-Neng Hwang, Serge Belongie

    Abstract: Advancements in Multimodal Large Language Models (MLLMs) have improved human motion understanding. However, these models remain constrained by their "instruct-only" nature, lacking interactivity and adaptability for diverse analytical perspectives. To address these challenges, we introduce ChatMotion, a multimodal multi-agent framework for human motion analysis. ChatMotion dynamically interprets u…

    Submitted 27 February, 2025; v1 submitted 25 February, 2025; originally announced February 2025.

  11. arXiv:2501.13851  [pdf, other]

    cs.LG

    Large Vision-Language Models for Knowledge-Grounded Data Annotation of Memes

    Authors: Shiling Deng, Serge Belongie, Peter Ebert Christensen

    Abstract: Memes have emerged as a powerful form of communication, integrating visual and textual elements to convey humor, satire, and cultural messages. Existing research has focused primarily on aspects such as emotion classification, meme generation, propagation, interpretation, figurative language, and sociolinguistics, but has often overlooked deeper meme comprehension and meme-text retrieval. To addre…

    Submitted 23 January, 2025; originally announced January 2025.

    Comments: 18 pages, 5 figures, 13 tables, GitHub repository: https://github.com/Seefreem/meme_text_retrieval_p1

  12. arXiv:2410.22489  [pdf, other]

    cs.CV

    Multimodality Helps Few-shot 3D Point Cloud Semantic Segmentation

    Authors: Zhaochong An, Guolei Sun, Yun Liu, Runjia Li, Min Wu, Ming-Ming Cheng, Ender Konukoglu, Serge Belongie

    Abstract: Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal annotated support samples. While existing FS-PCS methods have shown promise, they primarily focus on unimodal point cloud inputs, overlooking the potential benefits of leveraging multimodal information. In this paper, we address this gap by introducing a multimodal FS-PCS setup, utili…

    Submitted 26 February, 2025; v1 submitted 29 October, 2024; originally announced October 2024.

    Comments: Published at ICLR 2025 (Spotlight)

  13. arXiv:2410.08069  [pdf, other]

    cs.LG cs.AI cs.CV

    Unlearning-based Neural Interpretations

    Authors: Ching Lam Choi, Alexandre Duplessis, Serge Belongie

    Abstract: Gradient-based interpretations often require an anchor point of comparison to avoid saturation in computing feature importance. We show that current baselines defined using static functions--constant mapping, averaging or blurring--inject harmful colour, texture or frequency assumptions that deviate from model behaviour. This leads to accumulation of irregular gradients, resulting in attribution m…

    Submitted 10 February, 2025; v1 submitted 10 October, 2024; originally announced October 2024.

    Comments: Accepted to ICLR 2025

    Journal ref: Choi, Ching Lam, Alexandre Duplessis, and Serge Belongie. 'Unlearning-Based Neural Interpretations'. In The Thirteenth International Conference on Learning Representations, 2025

  14. arXiv:2410.07173  [pdf, other]

    cs.CL cs.AI cs.CV

    Better Language Models Exhibit Higher Visual Alignment

    Authors: Jona Ruthardt, Gertjan J. Burghouts, Serge Belongie, Yuki M. Asano

    Abstract: How well do text-only Large Language Models (LLMs) naturally align with the visual world? We provide the first direct analysis by utilizing frozen text representations in a discriminative vision-language model framework and measuring zero-shot generalization on unseen classes. We find decoder-based LLMs exhibit high intrinsic visual alignment. In particular, more capable LLMs reliably demonstrate…

    Submitted 17 February, 2025; v1 submitted 9 October, 2024; originally announced October 2024.

  15. arXiv:2406.19238  [pdf, other]

    cs.CL cs.CY cs.LG

    Revealing Fine-Grained Values and Opinions in Large Language Models

    Authors: Dustin Wright, Arnav Arora, Nadav Borenstein, Srishti Yadav, Serge Belongie, Isabelle Augenstein

    Abstract: Uncovering latent values and opinions embedded in large language models (LLMs) can help identify biases and mitigate potential harm. Recently, this has been approached by prompting LLMs with survey questions and quantifying the stances in the outputs towards morally and politically charged statements. However, the stances generated by LLMs can vary greatly depending on how they are prompted, and t…

    Submitted 31 October, 2024; v1 submitted 27 June, 2024; originally announced June 2024.

    Comments: Findings of EMNLP 2024; 28 pages, 20 figures, 7 tables

  16. arXiv:2406.04898  [pdf, other]

    cs.CV

    Labeled Data Selection for Category Discovery

    Authors: Bingchen Zhao, Nico Lang, Serge Belongie, Oisin Mac Aodha

    Abstract: Category discovery methods aim to find novel categories in unlabeled visual data. At training time, a set of labeled and unlabeled images are provided, where the labels correspond to the categories present in the images. The labeled data provides guidance during training by indicating what types of visual properties and features are relevant for performing discovery in the unlabeled data. As a res…

    Submitted 18 July, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

  17. arXiv:2406.04332  [pdf, other]

    cs.CV cs.LG

    Coarse-To-Fine Tensor Trains for Compact Visual Representations

    Authors: Sebastian Loeschcke, Dan Wang, Christian Leth-Espensen, Serge Belongie, Michael J. Kastoryano, Sagie Benaim

    Abstract: The ability to learn compact, high-quality, and easy-to-optimize representations for visual data is paramount to many applications such as novel view synthesis and 3D reconstruction. Recent work has shown substantial success in using tensor networks to design such compact and high-quality representations. However, the ability to optimize tensor-based representations, and in particular, the highly…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Project webpage: https://sebulo.github.io/PuTT_website/

  18. arXiv:2405.16528  [pdf, other]

    cs.LG cs.CL

    LoQT: Low-Rank Adapters for Quantized Pretraining

    Authors: Sebastian Loeschcke, Mads Toftrup, Michael J. Kastoryano, Serge Belongie, Vésteinn Snæbjarnarson

    Abstract: Despite advances using low-rank adapters and quantization, pretraining of large models on consumer hardware has not been possible without model sharding, offloading during training, or per-layer gradient updates. To address these limitations, we propose Low-Rank Adapters for Quantized Training (LoQT), a method for efficiently training quantized models. LoQT uses gradient-based tensor factorization…

    Submitted 4 November, 2024; v1 submitted 26 May, 2024; originally announced May 2024.

  19. arXiv:2405.02771  [pdf, other]

    cs.CV cs.AI cs.LG

    MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning

    Authors: Vishal Nedungadi, Ankit Kariryaa, Stefan Oehmcke, Serge Belongie, Christian Igel, Nico Lang

    Abstract: The volume of unlabelled Earth observation (EO) data is huge, but many important applications lack labelled training data. However, EO data offers the unique opportunity to pair data from different modalities and sensors automatically based on geographic location and time, at virtually no human labor cost. We seize this opportunity to create MMEarth, a diverse multi-modal pretraining dataset at gl…

    Submitted 29 July, 2024; v1 submitted 4 May, 2024; originally announced May 2024.

    Comments: Accepted for ECCV 2024. Data and code: https://vishalned.github.io/mmearth Update arXiv v2 (ECCV): 1. Dataset fix: Removed duplicates and corrected ERA5 yearly statistics. 2. Data augmentation fix: Random crops are now aligned. 3. Test metrics fix: Metrics are now overall instead of mini-batch averages, matching GEO-Bench metrics. 4. Pretrained on MMEarth v001 & evaluated on GEO-Bench v1.0

  20. arXiv:2403.00592  [pdf, other]

    cs.CV

    Rethinking Few-shot 3D Point Cloud Semantic Segmentation

    Authors: Zhaochong An, Guolei Sun, Yun Liu, Fayao Liu, Zongwei Wu, Dan Wang, Luc Van Gool, Serge Belongie

    Abstract: This paper revisits few-shot 3D point cloud semantic segmentation (FS-PCS), with a focus on two significant issues in the state-of-the-art: foreground leakage and sparse point distribution. The former arises from non-uniform point sampling, allowing models to distinguish the density disparities between foreground and background for easier segmentation. The latter results from sampling only 2,048 p…

    Submitted 1 March, 2024; originally announced March 2024.

    Comments: Accepted to CVPR 2024

  21. arXiv:2311.05006  [pdf, other]

    cs.CV cs.LG

    Familiarity-Based Open-Set Recognition Under Adversarial Attacks

    Authors: Philip Enevoldsen, Christian Gundersen, Nico Lang, Serge Belongie, Christian Igel

    Abstract: Open-set recognition (OSR), the identification of novel categories, can be a critical component when deploying classification models in real-world applications. Recent work has shown that familiarity-based scoring rules such as the Maximum Softmax Probability (MSP) or the Maximum Logit Score (MLS) are strong baselines when the closed-set accuracy is high. However, one of the potential weaknesses o…

    Submitted 2 January, 2025; v1 submitted 8 November, 2023; originally announced November 2023.

    Comments: Published in: Proceedings of the 6th Northern Lights Deep Learning Conference (NLDL), PMLR 265, 2025

  22. arXiv:2309.10359  [pdf, other]

    cs.CL

    Prompt, Condition, and Generate: Classification of Unsupported Claims with In-Context Learning

    Authors: Peter Ebert Christensen, Srishti Yadav, Serge Belongie

    Abstract: Unsupported and unfalsifiable claims we encounter in our daily lives can influence our view of the world. Characterizing, summarizing, and -- more generally -- making sense of such claims, however, can be challenging. In this work, we focus on fine-grained debate topics and formulate a new task of distilling, from such claims, a countable set of narratives. We present a crowdsourced dataset of 12…

    Submitted 19 September, 2023; originally announced September 2023.

  23. arXiv:2308.16900  [pdf, other]

    cs.LG

    Learning to Taste: A Multimodal Wine Dataset

    Authors: Thoranna Bender, Simon Moe Sørensen, Alireza Kashani, K. Eldjarn Hjorleifsson, Grethe Hyldig, Søren Hauberg, Serge Belongie, Frederik Warburg

    Abstract: We present WineSensed, a large multimodal wine dataset for studying the relations between visual perception, language, and flavor. The dataset encompasses 897k images of wine labels and 824k reviews of wines curated from the Vivino platform. It has over 350k unique bottlings, annotated with year, region, rating, alcohol percentage, price, and grape composition. We obtained fine-grained flavor anno…

    Submitted 15 January, 2024; v1 submitted 31 August, 2023; originally announced August 2023.

    Comments: Accepted to NeurIPS 2023. See project page: https://thoranna.github.io/learning_to_taste/

  24. arXiv:2305.02360  [pdf, other]

    cs.CV cs.AI

    Fashionpedia-Ads: Do Your Favorite Advertisements Reveal Your Fashion Taste?

    Authors: Mengyun Shi, Claire Cardie, Serge Belongie

    Abstract: Consumers are exposed to advertisements across many different domains on the internet, such as fashion, beauty, car, food, and others. On the other hand, fashion represents second highest e-commerce shopping category. Does consumer digital record behavior on various fashion ad images reveal their fashion taste? Does ads from other domains infer their fashion taste as well? In this paper, we study…

    Submitted 3 May, 2023; originally announced May 2023.

  25. arXiv:2305.02307  [pdf, other]

    cs.CV cs.AI cs.DB

    Fashionpedia-Taste: A Dataset towards Explaining Human Fashion Taste

    Authors: Mengyun Shi, Serge Belongie, Claire Cardie

    Abstract: Existing fashion datasets do not consider the multi-facts that cause a consumer to like or dislike a fashion image. Even two consumers like a same fashion image, they could like this image for total different reasons. In this paper, we study the reason why a consumer like a certain fashion image. Towards this goal, we introduce an interpretability dataset, Fashionpedia-taste, consist of rich annot…

    Submitted 3 May, 2023; originally announced May 2023.

  26. arXiv:2303.17155  [pdf, other]

    cs.CV cs.AI

    Discriminative Class Tokens for Text-to-Image Diffusion Models

    Authors: Idan Schwartz, Vésteinn Snæbjarnarson, Hila Chefer, Ryan Cotterell, Serge Belongie, Lior Wolf, Sagie Benaim

    Abstract: Recent advances in text-to-image diffusion models have enabled the generation of diverse and high-quality images. While impressive, the images often fall short of depicting subtle details and are susceptible to errors due to ambiguity in the input text. One way of alleviating these issues is to train diffusion models on class-labeled datasets. This approach has two disadvantages: (i) supervised da…

    Submitted 9 January, 2025; v1 submitted 30 March, 2023; originally announced March 2023.

    Comments: ICCV 2023

  27. arXiv:2302.04862  [pdf, other]

    cs.CV cs.LG

    Polynomial Neural Fields for Subband Decomposition and Manipulation

    Authors: Guandao Yang, Sagie Benaim, Varun Jampani, Kyle Genova, Jonathan T. Barron, Thomas Funkhouser, Bharath Hariharan, Serge Belongie

    Abstract: Neural fields have emerged as a new paradigm for representing signals, thanks to their ability to do it compactly while being easy to optimize. In most applications, however, neural fields are treated like black boxes, which precludes many signal manipulation tasks. In this paper, we propose a new class of neural fields called polynomial neural fields (PNFs). The key advantage of a PNF is that it…

    Submitted 9 February, 2023; originally announced February 2023.

    Comments: Accepted to NeurIPS 2022

  28. arXiv:2212.10564  [pdf, other]

    cs.CL cs.AI cs.LG

    Re-evaluating the Need for Multimodal Signals in Unsupervised Grammar Induction

    Authors: Boyi Li, Rodolfo Corona, Karttikeya Mangalam, Catherine Chen, Daniel Flaherty, Serge Belongie, Kilian Q. Weinberger, Jitendra Malik, Trevor Darrell, Dan Klein

    Abstract: Are multimodal inputs necessary for grammar induction? Recent work has shown that multimodal training inputs can improve grammar induction. However, these improvements are based on comparisons to weak text-only baselines that were trained on relatively little textual data. To determine whether multimodal inputs are needed in regimes with large amounts of textual training data, we design a stronger…

    Submitted 12 April, 2024; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: NAACL Findings 2024

  29. arXiv:2211.15673  [pdf, other]

    cs.LG

    PyTorch Adapt

    Authors: Kevin Musgrave, Serge Belongie, Ser-Nam Lim

    Abstract: PyTorch Adapt is a library for domain adaptation, a type of machine learning algorithm that re-purposes existing models to work in new domains. It is a fully-featured toolkit, allowing users to create a complete train/test pipeline in a few lines of code. It is also modular, so users can import just the parts they need, and not worry about being locked into a framework. One defining feature of thi…

    Submitted 28 November, 2022; originally announced November 2022.

  30. arXiv:2211.09782  [pdf, other]

    cs.CV cs.CR cs.LG

    Assessing Neural Network Robustness via Adversarial Pivotal Tuning

    Authors: Peter Ebert Christensen, Vésteinn Snæbjarnarson, Andrea Dittadi, Serge Belongie, Sagie Benaim

    Abstract: The robustness of image classifiers is essential to their deployment in the real world. The ability to assess this resilience to manipulations or deviations from the training data is thus crucial. These modifications have traditionally consisted of minimal changes that still manage to fool classifiers, and modern approaches are increasingly robust to them. Semantic manipulations that modify elemen…

    Submitted 6 January, 2024; v1 submitted 17 November, 2022; originally announced November 2022.

    Comments: Major changes include new experiments in Table 1 on page 5 and Table 2-4 on page 6, new figure 5 on page 8. Paper accepted at WACV (oral)

  31. arXiv:2209.00495  [pdf, other]

    cs.CL cs.LG cs.SI

    Searching for Structure in Unfalsifiable Claims

    Authors: Peter Ebert Christensen, Frederik Warburg, Menglin Jia, Serge Belongie

    Abstract: Social media platforms give rise to an abundance of posts and comments on every topic imaginable. Many of these posts express opinions on various aspects of society, but their unfalsifiable nature makes them ill-suited to fact-checking pipelines. In this work, we aim to distill such posts into a small set of narratives that capture the essential claims related to a given topic. Understanding and v…

    Submitted 19 August, 2022; originally announced September 2022.

    Comments: 30 pages, 9 main Figures, 5 main Tables Website: https://captaine.github.io/Searching-for-Structure-in-Unfalsifiable-Claims/ Github repo: https://github.com/captainE/Searching-for-Structure-in-Unfalsifiable-Claims

  32. arXiv:2208.07360  [pdf, other]

    cs.CV cs.LG

    Three New Validators and a Large-Scale Benchmark Ranking for Unsupervised Domain Adaptation

    Authors: Kevin Musgrave, Serge Belongie, Ser-Nam Lim

    Abstract: Changes to hyperparameters can have a dramatic effect on model accuracy. Thus, the tuning of hyperparameters plays an important role in optimizing machine-learning models. An integral part of the hyperparameter-tuning process is the evaluation of model checkpoints, which is done through the use of "validators". In a supervised setting, these validators evaluate checkpoints by computing accuracy on…

    Submitted 17 May, 2023; v1 submitted 15 August, 2022; originally announced August 2022.

    Comments: This paper was previously titled Benchmarking Validation Methods for Unsupervised Domain Adaptation. This version contains new experiments, analysis, and figures

  33. arXiv:2207.10664  [pdf, other]

    cs.CV cs.LG

    Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset

    Authors: Grant Van Horn, Rui Qian, Kimberly Wilber, Hartwig Adam, Oisin Mac Aodha, Serge Belongie

    Abstract: We present a new benchmark dataset, Sapsucker Woods 60 (SSW60), for advancing research on audiovisual fine-grained categorization. While our community has made great strides in fine-grained visual categorization on images, the counterparts in audio and video fine-grained categorization are relatively unexplored. To encourage advancements in this space, we have carefully constructed the SSW60 datas…

    Submitted 21 July, 2022; originally announced July 2022.

    Comments: ECCV 2022 Camera Ready

  34. arXiv:2207.10225  [pdf, other]

    cs.CV cs.LG

    On Label Granularity and Object Localization

    Authors: Elijah Cole, Kimberly Wilber, Grant Van Horn, Xuan Yang, Marco Fornoni, Pietro Perona, Serge Belongie, Andrew Howard, Oisin Mac Aodha

    Abstract: Weakly supervised object localization (WSOL) aims to learn representations that encode object location using only image-level category labels. However, many objects can be labeled at different levels of granularity. Is it an animal, a bird, or a great horned owl? Which image-level labels should we use? In this paper we study the role of label granularity in WSOL. To facilitate this investigation w…

    Submitted 20 July, 2022; originally announced July 2022.

    Comments: ECCV 2022

  35. arXiv:2207.07646  [pdf, other]

    cs.CV cs.LG

    Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models

    Authors: Rui Qian, Yeqing Li, Zheng Xu, Ming-Hsuan Yang, Serge Belongie, Yin Cui

    Abstract: Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition. In this work, we extend this paradigm by leveraging motion and audio that naturally exist in video. We present MOV, a simple yet effective method for Multimodal Open-Vocabulary video classification. In M…

    Submitted 15 July, 2022; originally announced July 2022.

  36. arXiv:2206.12396  [pdf, other]

    cs.CV

    Text-Driven Stylization of Video Objects

    Authors: Sebastian Loeschcke, Serge Belongie, Sagie Benaim

    Abstract: We tackle the task of stylizing video objects in an intuitive and semantic manner following a user-specified text prompt. This is a challenging task as the resulting video must satisfy multiple properties: (1) it has to be temporally consistent and avoid jittering or similar artifacts, (2) the resulting stylization must preserve both the global semantics of the object and its fine-grained details,…

    Submitted 27 June, 2022; v1 submitted 24 June, 2022; originally announced June 2022.

  37. arXiv:2206.02776  [pdf, other]

    cs.CV

    Volumetric Disentanglement for 3D Scene Manipulation

    Authors: Sagie Benaim, Frederik Warburg, Peter Ebert Christensen, Serge Belongie

    Abstract: Recently, advances in differential volumetric rendering enabled significant breakthroughs in the photo-realistic and fine-detailed reconstruction of complex 3D scenes, which is key for many virtual reality applications. However, in the context of augmented reality, one may also wish to effect semantic manipulations or augmentations of objects within a scene. To this end, we propose a volumetric fr…

    Submitted 6 June, 2022; originally announced June 2022.

  38. arXiv:2203.12119  [pdf, other]

    cs.CV

    Visual Prompt Tuning

    Authors: Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, Ser-Nam Lim

    Abstract: The current modus operandi in adapting pre-trained models involves updating all the backbone parameters, ie, full fine-tuning. This paper introduces Visual Prompt Tuning (VPT) as an efficient and effective alternative to full fine-tuning for large-scale Transformer models in vision. Taking inspiration from recent advances in efficiently tuning large language models, VPT introduces only a small amo…

    Submitted 20 July, 2022; v1 submitted 22 March, 2022; originally announced March 2022.

    Comments: ECCV2022

  39. arXiv:2202.04036  [pdf, other]

    cs.CV

    Residual Aligned: Gradient Optimization for Non-Negative Image Synthesis

    Authors: Flora Yu Shen, Katie Luo, Guandao Yang, Harald Haraldsson, Serge Belongie

    Abstract: In this work, we address an important problem of optical see through (OST) augmented reality: non-negative image synthesis. Most of the image generation methods fail under this condition, since they assume full control over each pixel and cannot create darker pixels by adding light. In order to solve the non-negative image generation problem in AR image synthesis, prior works have attempted to uti…

    Submitted 8 February, 2022; originally announced February 2022.

  40. arXiv:2202.00659  [pdf, other]

    cs.CV cs.GR

    Stay Positive: Non-Negative Image Synthesis for Augmented Reality

    Authors: Katie Luo, Guandao Yang, Wenqi Xian, Harald Haraldsson, Bharath Hariharan, Serge Belongie

    Abstract: In applications such as optical see-through and projector augmented reality, producing images amounts to solving non-negative image generation, where one can only add light to an existing image. Most image generation methods, however, are ill-suited to this problem setting, as they make the assumption that one can assign arbitrary color to each pixel. In fact, naive application of existing methods…

    Submitted 1 February, 2022; originally announced February 2022.

    Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 10050-10060

  41. arXiv:2201.03546  [pdf, other]

    cs.CV cs.CL cs.LG

    Language-driven Semantic Segmentation

    Authors: Boyi Li, Kilian Q. Weinberger, Serge Belongie, Vladlen Koltun, René Ranftl

    Abstract: We present LSeg, a novel model for language-driven semantic image segmentation. LSeg uses a text encoder to compute embeddings of descriptive input labels (e.g., "grass" or "building") together with a transformer-based image encoder that computes dense per-pixel embeddings of the input image. The image encoder is trained with a contrastive objective to align pixel embeddings to the text embedding…

    Submitted 2 April, 2022; v1 submitted 10 January, 2022; originally announced January 2022.

    Comments: ICLR 2022

  42. arXiv:2112.08459  [pdf, other]

    cs.CV

    Rethinking Nearest Neighbors for Visual Classification

    Authors: Menglin Jia, Bor-Chun Chen, Zuxuan Wu, Claire Cardie, Serge Belongie, Ser-Nam Lim

    Abstract: Neural network classifiers have become the de-facto choice for current "pre-train then fine-tune" paradigms of visual classification. In this paper, we investigate k-Nearest-Neighbor (k-NN) classifiers, a classical model-free learning method from the pre-deep learning era, as an augmentation to modern neural network based approaches. As a lazy learning method, k-NN simply aggregates the distance b…

    Submitted 17 December, 2021; v1 submitted 15 December, 2021; originally announced December 2021.

    Comments: Modified paragraph spacing

  43. arXiv:2112.04480  [pdf, other]

    cs.CV cs.LG

    Exploring Temporal Granularity in Self-Supervised Video Representation Learning

    Authors: Rui Qian, Yeqing Li, Liangzhe Yuan, Boqing Gong, Ting Liu, Matthew Brown, Serge Belongie, Ming-Hsuan Yang, Hartwig Adam, Yin Cui

    Abstract: This work presents a self-supervised learning framework named TeG to explore Temporal Granularity in learning video representations. In TeG, we sample a long clip from a video and a short clip that lies inside the long clip. We then extract their dense temporal embeddings. The training objective consists of two parts: a fine-grained temporal learning objective to maximize the similarity between co…

    Submitted 8 December, 2021; originally announced December 2021.

  44. arXiv:2111.15672  [pdf, other]

    cs.CV

    Unsupervised Domain Adaptation: A Reality Check

    Authors: Kevin Musgrave, Serge Belongie, Ser-Nam Lim

    Abstract: Interest in unsupervised domain adaptation (UDA) has surged in recent years, resulting in a plethora of new algorithms. However, as is often the case in fast-moving fields, baseline algorithms are not tested to the extent that they should be. Furthermore, little attention has been paid to validation methods, i.e. the methods for estimating the accuracy of a model in the absence of target domain la…

    Submitted 30 November, 2021; originally announced November 2021.

  45. arXiv:2111.07950  [pdf, other]

    cs.CV

    Occluded Video Instance Segmentation: Dataset and ICCV 2021 Challenge

    Authors: Jiyang Qi, Yan Gao, Yao Hu, Xinggang Wang, Xiaoyu Liu, Xiang Bai, Serge Belongie, Alan Yuille, Philip H. S. Torr, Song Bai

    Abstract: Although deep learning methods have achieved advanced video object recognition performance in recent years, perceiving heavily occluded objects in a video is still a very challenging task. To promote the development of occlusion understanding, we collect a large-scale dataset called OVIS for video instance segmentation in the occluded scenario. OVIS consists of 296k high-quality instance masks and…

    Submitted 15 November, 2021; originally announced November 2021.

    Comments: Accepted by NeurIPS 2021 Datasets and Benchmarks Track. arXiv admin note: text overlap with arXiv:2102.01558

    MSC Class: 68T07; 68T45

  46. arXiv:2111.06119  [pdf, other]

    cs.CV cs.LG

    Fine-Grained Image Analysis with Deep Learning: A Survey

    Authors: Xiu-Shen Wei, Yi-Zhe Song, Oisin Mac Aodha, Jianxin Wu, Yuxin Peng, Jinhui Tang, Jian Yang, Serge Belongie

    Abstract: Fine-grained image analysis (FGIA) is a longstanding and fundamental problem in computer vision and pattern recognition, and underpins a diverse set of real-world applications. The task of FGIA targets analyzing visual objects from subordinate categories, e.g., species of birds or models of cars. The small inter-class and large intra-class variation inherent to fine-grained image analysis makes it…

    Submitted 19 November, 2021; v1 submitted 11 November, 2021; originally announced November 2021.

    Comments: Accepted by IEEE TPAMI

  47. arXiv:2109.02765  [pdf, other]

    cs.CV cs.CR cs.LG

    Robustness and Generalization via Generative Adversarial Training

    Authors: Omid Poursaeed, Tianxing Jiang, Harry Yang, Serge Belongie, Ser-Nam Lim

    Abstract: While deep neural networks have achieved remarkable success in various computer vision tasks, they often fail to generalize to new domains and subtle variations of input images. Several defenses have been proposed to improve the robustness against these variations. However, current defenses can only withstand the specific attack used in training, and the models often remain vulnerable to other inp…

    Submitted 6 September, 2021; originally announced September 2021.

    Comments: ICCV 2021. arXiv admin note: substantial text overlap with arXiv:1911.09058

  48. arXiv:2108.13246  [pdf, other]

    cs.CV

    LUAI Challenge 2021 on Learning to Understand Aerial Images

    Authors: Gui-Song Xia, Jian Ding, Ming Qian, Nan Xue, Jiaming Han, Xiang Bai, Michael Ying Yang, Shengyang Li, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, Liangpei Zhang, Qiang Zhou, Chao-hui Yu, Kaixuan Hu, Yingjia Bu, Wenming Tan, Zhe Yang, Wei Li, Shang Liu, Jiaxuan Zhao, Tianzhi Ma, Zi-han Gao, Lingqi Wang, et al. (11 additional authors not shown)

    Abstract: This report summarizes the results of the Learning to Understand Aerial Images (LUAI) 2021 challenge held at ICCV 2021, which focuses on object detection and semantic segmentation in aerial images. Using DOTA-v2.0 and GID-15 datasets, this challenge proposes three tasks for oriented object detection, horizontal object detection, and semantic segmentation of common categories in aerial images. This cha…

    Submitted 17 September, 2021; v1 submitted 30 August, 2021; originally announced August 2021.

    Comments: 7 pages, 2 figures, accepted by ICCVW 2021

  49. arXiv:2106.13804  [pdf, other]

    cs.CV cs.AI cs.LG

    SITTA: Single Image Texture Translation for Data Augmentation

    Authors: Boyi Li, Yin Cui, Tsung-Yi Lin, Serge Belongie

    Abstract: Recent advances in data augmentation enable one to translate images by learning the mapping between a source domain and a target domain. Existing methods tend to learn the distributions by training a model on a variety of datasets, with results evaluated largely in a subjective manner. Relatively few works in this area, however, study the potential use of image synthesis methods for recognition ta…

    Submitted 14 January, 2023; v1 submitted 25 June, 2021; originally announced June 2021.

    Comments: Learning from Limited and Imperfect Data (L2ID) Workshop, ECCV 2022

  50. arXiv:2105.13808  [pdf, other]

    cs.CV

    The Herbarium 2021 Half-Earth Challenge Dataset

    Authors: Riccardo de Lutio, Damon Little, Barbara Ambrose, Serge Belongie

    Abstract: Herbarium sheets present a unique view of the world's botanical history, evolution, and diversity. This makes them an all-important data source for botanical research. With the increased digitisation of herbaria worldwide and the advances in the fine-grained classification domain that can facilitate automatic identification of herbarium specimens, there are a lot of opportunities for supporting re…

    Submitted 28 May, 2021; originally announced May 2021.

    Comments: FGVC8 Workshop at CVPR 2021