-
Hidden in plain sight: VLMs overlook their visual representations
Authors:
Stephanie Fu,
Tyler Bonnen,
Devin Guillory,
Trevor Darrell
Abstract:
Language provides a natural interface to specify and evaluate performance on visual tasks. To realize this possibility, vision language models (VLMs) must successfully integrate visual and linguistic information. Our work compares VLMs to a direct readout of their visual encoders to understand their ability to integrate across these modalities. Across a series of vision-centric benchmarks (e.g., depth estimation, correspondence), we find that VLMs perform substantially worse than their visual encoders, dropping to near-chance performance. We investigate these results through a series of analyses across the entire VLM: namely 1) the degradation of vision representations, 2) brittleness to task prompt, and 3) the language model's role in solving the task. We find that the bottleneck in performing these vision-centric tasks lies in this third category; VLMs are not effectively using visual information easily accessible throughout the entire model, and they inherit the language priors present in the LLM. Our work helps diagnose the failure modes of open-source VLMs, and presents a series of evaluations useful for future investigations into visual understanding within VLMs.
Submitted 9 June, 2025;
originally announced June 2025.
-
Hyperbolic Active Learning for Semantic Segmentation under Domain Shift
Authors:
Luca Franco,
Paolo Mandica,
Konstantinos Kallidromitis,
Devin Guillory,
Yu-Teng Li,
Trevor Darrell,
Fabio Galasso
Abstract:
We introduce a hyperbolic neural network approach to pixel-level active learning for semantic segmentation. Analysis of the data statistics leads to a novel interpretation of the hyperbolic radius as an indicator of data scarcity. In HALO (Hyperbolic Active Learning Optimization), for the first time, we propose the use of epistemic uncertainty as a data acquisition strategy, following the intuition of selecting data points that are the least known. The hyperbolic radius, complemented by the widely adopted prediction entropy, effectively approximates epistemic uncertainty. We perform extensive experimental analysis based on two established synthetic-to-real benchmarks, i.e., GTAV $\rightarrow$ Cityscapes and SYNTHIA $\rightarrow$ Cityscapes. Additionally, we test HALO on Cityscapes $\rightarrow$ ACDC for domain adaptation under adverse weather conditions, and we benchmark both convolutional and attention-based backbones. HALO sets a new state-of-the-art in active learning for semantic segmentation under domain shift and it is the first active learning approach that surpasses the performance of supervised domain adaptation while using only a small portion of labels (i.e., 1%).
Submitted 4 June, 2024; v1 submitted 19 June, 2023;
originally announced June 2023.
-
Using Language to Extend to Unseen Domains
Authors:
Lisa Dunlap,
Clara Mohri,
Devin Guillory,
Han Zhang,
Trevor Darrell,
Joseph E. Gonzalez,
Aditi Raghunathan,
Anja Rohrbach
Abstract:
It is expensive to collect training data for every possible domain that a vision model may encounter when deployed. We instead consider how simply verbalizing the training domain (e.g. "photos of birds") as well as domains we want to extend to but do not have data for (e.g. "paintings of birds") can improve robustness. Using a multimodal model with a joint image and language embedding space, our method LADS learns a transformation of the image embeddings from the training domain to each unseen test domain, while preserving task relevant information. Without using any images from the unseen test domain, we show that over the extended domain containing both training and unseen test domains, LADS outperforms standard fine-tuning and ensemble approaches over a suite of four benchmarks targeting domain adaptation and dataset bias.
Submitted 29 April, 2023; v1 submitted 17 October, 2022;
originally announced October 2022.
-
Studying Bias in GANs through the Lens of Race
Authors:
Vongani H. Maluleke,
Neerja Thakkar,
Tim Brooks,
Ethan Weber,
Trevor Darrell,
Alexei A. Efros,
Angjoo Kanazawa,
Devin Guillory
Abstract:
In this work, we study how the performance and evaluation of generative image models are impacted by the racial composition of their training datasets. By examining and controlling the racial distributions in various training datasets, we are able to observe the impacts of different training distributions on generated image quality and the racial distributions of the generated images. Our results show that the racial compositions of generated images successfully preserve that of the training data. However, we observe that truncation, a technique used to generate higher quality images during inference, exacerbates racial imbalances in the data. Lastly, when examining the relationship between image quality and race, we find that the highest perceived visual quality images of a given race come from a distribution where that race is well-represented, and that annotators consistently prefer generated images of white people over those of Black people.
Submitted 14 September, 2022; v1 submitted 6 September, 2022;
originally announced September 2022.
-
Disentangled Action Recognition with Knowledge Bases
Authors:
Zhekun Luo,
Shalini Ghosh,
Devin Guillory,
Keizo Kato,
Trevor Darrell,
Huijuan Xu
Abstract:
Actions in video usually involve the interaction of humans with objects. Action labels are typically composed of various combinations of verbs and nouns, but we may not have training data for all possible combinations. In this paper, we aim to improve the generalization ability of the compositional action recognition model to novel verbs or novel nouns that are unseen during training time, by leveraging the power of knowledge graphs. Previous work utilizes verb-noun compositional action nodes in the knowledge graph, making it inefficient to scale since the number of compositional action nodes grows quadratically with respect to the number of verbs and nouns. To address this issue, we propose our approach: Disentangled Action Recognition with Knowledge-bases (DARK), which leverages the inherent compositionality of actions. DARK trains a factorized model by first extracting disentangled feature representations for verbs and nouns, and then predicting classification weights using relations in external knowledge graphs. The type constraint between verb and noun is extracted from external knowledge bases and finally applied when composing actions. DARK has better scalability in the number of objects and verbs, and achieves state-of-the-art performance on the Charades dataset. We further propose a new benchmark split based on the EPIC-Kitchens dataset, which is an order of magnitude larger in the number of classes and samples, and evaluate various models on it.
Submitted 4 July, 2022;
originally announced July 2022.
-
Predicting with Confidence on Unseen Distributions
Authors:
Devin Guillory,
Vaishaal Shankar,
Sayna Ebrahimi,
Trevor Darrell,
Ludwig Schmidt
Abstract:
Recent work has shown that the performance of machine learning models can vary substantially when models are evaluated on data drawn from a distribution that is close to but different from the training distribution. As a result, predicting model performance on unseen distributions is an important challenge. Our work connects techniques from the domain adaptation and predictive uncertainty literature, and allows us to predict model accuracy on challenging unseen distributions without access to labeled data. In the context of distribution shift, distributional distances are often used to adapt models and improve their performance on new domains; however, accuracy estimation and other forms of predictive uncertainty are often neglected in these investigations. Through investigating a wide range of established distributional distances, such as Fréchet distance or Maximum Mean Discrepancy, we determine that they fail to induce reliable estimates of performance under distribution shift. On the other hand, we find that the difference of confidences (DoC) of a classifier's predictions successfully estimates the classifier's performance change over a variety of shifts. We specifically investigate the distinction between synthetic and natural distribution shifts and observe that, despite its simplicity, DoC consistently outperforms other quantifications of distributional difference. DoC reduces predictive error by almost half (46%) on several realistic and challenging distribution shifts, e.g., on the ImageNet-Vid-Robust and ImageNet-Rendition datasets.
Submitted 19 August, 2021; v1 submitted 7 July, 2021;
originally announced July 2021.
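As a rough illustration of the difference-of-confidences idea described in the abstract above, the sketch below computes the average top-class softmax confidence on a base set and on a shifted set, and subtracts their difference from a known base accuracy. All function names are hypothetical, and using the raw difference directly as the estimated accuracy drop is a simplifying assumption, not necessarily the exact procedure of the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_confidence(batch_logits):
    """Mean of the top-class softmax probability over a batch."""
    return sum(max(softmax(row)) for row in batch_logits) / len(batch_logits)


def difference_of_confidences(base_logits, shifted_logits):
    """Average confidence on the base set minus that on the shifted set."""
    return average_confidence(base_logits) - average_confidence(shifted_logits)


def estimate_shifted_accuracy(base_accuracy, base_logits, shifted_logits):
    """Assumption: the accuracy drop is approximated by DoC itself.

    The result is clipped to [0, 1] so it remains a valid accuracy.
    """
    doc = difference_of_confidences(base_logits, shifted_logits)
    return min(1.0, max(0.0, base_accuracy - doc))
```

No labels from the shifted set are needed here: only the classifier's logits on both sets and its measured accuracy on the base set.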
-
Self-Supervised Pretraining Improves Self-Supervised Pretraining
Authors:
Colorado J. Reed,
Xiangyu Yue,
Ani Nrusimha,
Sayna Ebrahimi,
Vivek Vijaykumar,
Richard Mao,
Bo Li,
Shanghang Zhang,
Devin Guillory,
Sean Metzger,
Kurt Keutzer,
Trevor Darrell
Abstract:
While self-supervised pretraining has proven beneficial for many computer vision tasks, it requires expensive and lengthy computation and large amounts of data, and it is sensitive to data augmentation. Prior work demonstrates that models pretrained on datasets dissimilar to their target data, such as chest X-ray models trained on ImageNet, underperform models trained from scratch. Users who lack the resources to pretrain must use existing models with lower performance. This paper explores Hierarchical PreTraining (HPT), which decreases convergence time and improves accuracy by initializing the pretraining process with an existing pretrained model. Through experimentation on 16 diverse vision datasets, we show HPT converges up to 80x faster, improves accuracy across tasks, and improves the robustness of the self-supervised pretraining process to changes in the image augmentation policy or amount of pretraining data. Taken together, HPT provides a simple framework for obtaining better pretrained representations with fewer computational resources.
Submitted 24 March, 2021; v1 submitted 23 March, 2021;
originally announced March 2021.
-
Combating Anti-Blackness in the AI Community
Authors:
Devin Guillory
Abstract:
In response to a national and international awakening on the issues of anti-Blackness and systemic discrimination, we have penned this piece to serve as a resource for allies in the AI community who are wondering how they can more effectively engage with dismantling racist systems. This work aims to help elucidate areas where the AI community actively and passively contributes to anti-Blackness and offers actionable items on ways to reduce harm.
Submitted 18 June, 2020;
originally announced June 2020.
-
Weakly-Supervised Action Localization with Expectation-Maximization Multi-Instance Learning
Authors:
Zhekun Luo,
Devin Guillory,
Baifeng Shi,
Wei Ke,
Fang Wan,
Trevor Darrell,
Huijuan Xu
Abstract:
Weakly-supervised action localization requires training a model to localize the action segments in a video given only video-level action labels. It can be solved under the Multiple Instance Learning (MIL) framework, where a bag (video) contains multiple instances (action segments). Since only the bag's label is known, the main challenge is identifying which key instances within the bag trigger the bag's label. Most previous models use attention-based approaches that apply attention to generate the bag's representation from instances, and then train it via the bag's classification. These models, however, implicitly violate the MIL assumption that instances in negative bags should be uniformly negative. In this work, we explicitly model the key-instance assignment as a hidden variable and adopt an Expectation-Maximization (EM) framework. We derive two pseudo-label generation schemes to model the E and M process and iteratively optimize the likelihood lower bound. We show that our EM-MIL approach more accurately models both the learning objective and the MIL assumptions. It achieves state-of-the-art performance on two standard benchmarks, THUMOS14 and ActivityNet1.2.
Submitted 25 August, 2020; v1 submitted 31 March, 2020;
originally announced April 2020.
-
An Ensemble-based Approach to Click-Through Rate Prediction for Promoted Listings at Etsy
Authors:
Kamelia Aryafar,
Devin Guillory,
Liangjie Hong
Abstract:
Etsy is a global marketplace where people across the world connect to make, buy and sell unique goods. Sellers at Etsy can promote their product listings via advertising campaigns similar to traditional sponsored search ads. Click-Through Rate (CTR) prediction is an integral part of online search advertising systems, where it is used as an input to auctions that determine the final ranking of promoted listings shown to a particular user for each query. In this paper, we provide a holistic view of Etsy's promoted listings' CTR prediction system and propose an ensemble learning approach which is based on historical or behavioral signals for older listings as well as content-based features for new listings. We obtain representations from texts and images by utilizing state-of-the-art deep learning techniques and employ multimodal learning to combine these different signals. We compare the system to non-trivial baselines on a large-scale real-world dataset from Etsy, demonstrating the effectiveness of the model and strong correlations between offline experiments and online performance. The paper is also the first technical overview of this kind of product in an e-commerce context.
Submitted 21 November, 2017; v1 submitted 3 November, 2017;
originally announced November 2017.