Showing 1–50 of 68 results for author: Plummer, B

Searching in archive cs.
  1. arXiv:2505.21649  [pdf, ps, other]

    cs.CV

    Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks

    Authors: Keanu Nichols, Nazia Tasnim, Yuting Yan, Nicholas Ikechukwu, Elva Zou, Deepti Ghadiyaram, Bryan A. Plummer

    Abstract: Object orientation understanding represents a fundamental challenge in visual perception critical for applications like robotic manipulation and augmented reality. Current vision-language benchmarks fail to isolate this capability, often conflating it with positional relationships and general scene understanding. We introduce DORI (Discriminative Orientation Reasoning Intelligence), a comprehensiv… ▽ More

    Submitted 4 June, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

  2. arXiv:2504.02996  [pdf, other]

    cs.LG cs.CV

    Noise-Aware Generalization: Robustness to In-Domain Noise and Out-of-Domain Generalization

    Authors: Siqi Wang, Aoming Liu, Bryan A. Plummer

    Abstract: Multi-source Domain Generalization (DG) aims to improve model robustness to new distributions. However, DG methods often overlook the effect of label noise, which can confuse a model during training, reducing performance. Limited prior work has analyzed the noise robustness of DG methods, and it typically focuses on analyzing existing methods rather than proposing new solutions. In this paper, we investigate this u… ▽ More

    Submitted 3 April, 2025; originally announced April 2025.

  3. arXiv:2503.19331  [pdf, other]

    cs.CV cs.LG

    ChA-MAEViT: Unifying Channel-Aware Masked Autoencoders and Multi-Channel Vision Transformers for Improved Cross-Channel Learning

    Authors: Chau Pham, Juan C. Caicedo, Bryan A. Plummer

    Abstract: Prior work using Masked Autoencoders (MAEs) typically relies on random patch masking based on the assumption that images have significant redundancies across different channels, allowing for the reconstruction of masked content using cross-channel correlations. However, this assumption does not hold in Multi-Channel Imaging (MCI), where channels may provide complementary information with minimal f… ▽ More

    Submitted 24 March, 2025; originally announced March 2025.

  4. arXiv:2503.13652  [pdf, other]

    cs.CV

    Web Artifact Attacks Disrupt Vision Language Models

    Authors: Maan Qraitem, Piotr Teterwak, Kate Saenko, Bryan A. Plummer

    Abstract: Vision-language models (VLMs) (e.g., CLIP, LLaVA) are trained on large-scale, lightly curated web datasets, leading them to learn unintended correlations between semantic concepts and unrelated visual signals. These associations degrade model accuracy by causing predictions to rely on incidental patterns rather than genuine visual understanding. Prior work has weaponized these correlations as an a… ▽ More

    Submitted 17 March, 2025; originally announced March 2025.

  5. arXiv:2501.04666  [pdf, other]

    cs.CV

    Enhancing Virtual Try-On with Synthetic Pairs and Error-Aware Noise Scheduling

    Authors: Nannan Li, Kevin J. Shih, Bryan A. Plummer

    Abstract: Given an isolated garment image in a canonical product view and a separate image of a person, the virtual try-on task aims to generate a new image of the person wearing the target garment. Prior virtual try-on works face two major challenges in achieving this goal: a) the paired (human, garment) training data has limited availability; b) generating textures on the human that perfectly match that o… ▽ More

    Submitted 7 May, 2025; v1 submitted 8 January, 2025; originally announced January 2025.

    Comments: Accepted in CVPR 2025

  6. arXiv:2412.10362  [pdf, other]

    cs.LG cs.CV

    OP-LoRA: The Blessing of Dimensionality

    Authors: Piotr Teterwak, Kate Saenko, Bryan A. Plummer, Ser-Nam Lim

    Abstract: Low-rank adapters enable fine-tuning of large models with only a small number of parameters, thus reducing storage costs and minimizing the risk of catastrophic forgetting. However, they often pose optimization challenges, with poor convergence. To overcome these challenges, we introduce an over-parameterized approach that accelerates training without increasing inference costs. This method repara… ▽ More

    Submitted 13 December, 2024; originally announced December 2024.

  7. arXiv:2412.07755  [pdf, other]

    cs.CV cs.AI cs.GR cs.RO

    SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models

    Authors: Arijit Ray, Jiafei Duan, Ellis Brown, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A. Plummer, Ranjay Krishna, Kuo-Hao Zeng, Kate Saenko

    Abstract: Reasoning about motion and space is a fundamental cognitive capability that is required by multiple real-world applications. While many studies highlight that large multimodal language models (MLMs) struggle to reason about space, they only focus on static spatial relationships, and not dynamic awareness of motion and space, i.e., reasoning about the effect of egocentric and object motions on spat… ▽ More

    Submitted 3 April, 2025; v1 submitted 10 December, 2024; originally announced December 2024.

    Comments: Project webpage: https://arijitray.com/SAT/

  8. arXiv:2412.02856  [pdf, other]

    cs.CV cs.LG

    Is Large-Scale Pretraining the Secret to Good Domain Generalization?

    Authors: Piotr Teterwak, Kuniaki Saito, Theodoros Tsiligkaridis, Bryan A. Plummer, Kate Saenko

    Abstract: Multi-Source Domain Generalization (DG) is the task of training on multiple source domains and achieving high classification performance on unseen target domains. Recent methods combine robust features from web-scale pretrained backbones with new features learned from source data, and this has dramatically improved benchmark results. However, it remains unclear if DG finetuning methods are becomin… ▽ More

    Submitted 22 April, 2025; v1 submitted 3 December, 2024; originally announced December 2024.

    Comments: Accepted at ICLR 2025

  9. arXiv:2411.16870  [pdf, other]

    cs.CV cs.LG

    RECAST: Reparameterized, Compact weight Adaptation for Sequential Tasks

    Authors: Nazia Tasnim, Bryan A. Plummer

    Abstract: Incremental learning aims to adapt to new sets of categories over time with minimal computational overhead. Prior work often addresses this task by training efficient task-specific adaptors that modify frozen layer weights or features to capture relevant information without affecting predictions on previously learned categories. While these adaptors are generally more efficient than finetuning the… ▽ More

    Submitted 14 March, 2025; v1 submitted 25 November, 2024; originally announced November 2024.

    Comments: Accepted as a conference paper in ICLR, 2025

  10. arXiv:2408.02157  [pdf, other]

    cs.CV

    PanoFree: Tuning-Free Holistic Multi-view Image Generation with Cross-view Self-Guidance

    Authors: Aoming Liu, Zhong Li, Zhang Chen, Nannan Li, Yi Xu, Bryan A. Plummer

    Abstract: Immersive scene generation, notably panorama creation, benefits significantly from the adaptation of large pre-trained text-to-image (T2I) models for multi-view image generation. Due to the high cost of acquiring multi-view images, tuning-free generation is preferred. However, existing methods are either limited to simple correspondences or require extensive fine-tuning to capture complex ones. We… ▽ More

    Submitted 4 August, 2024; originally announced August 2024.

    Comments: Accepted by ECCV 2024

  11. arXiv:2406.07822  [pdf, other]

    cs.CV cs.CL

    Tell Me What's Next: Textual Foresight for Generic UI Representations

    Authors: Andrea Burns, Kate Saenko, Bryan A. Plummer

    Abstract: Mobile app user interfaces (UIs) are rich with action, text, structure, and image content that can be utilized to learn generic UI representations for tasks like automating user commands, summarizing content, and evaluating the accessibility of user interfaces. Prior work has learned strong visual representations with local or global captioning losses, but fails to retain both granularities. To co… ▽ More

    Submitted 7 August, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted to ACL 2024 Findings. Data and code to be released at https://github.com/aburns4/textualforesight

  12. arXiv:2406.01449  [pdf, other]

    cs.CV

    SLANT: Spurious Logo ANalysis Toolkit

    Authors: Maan Qraitem, Piotr Teterwak, Kate Saenko, Bryan A. Plummer

    Abstract: Online content is filled with logos, from ads and social media posts to website branding and product placements. Consequently, these logos are prevalent in the extensive web-scraped datasets used to pretrain Vision-Language Models, which are used for a wide array of tasks (content moderation, object classification). While these models have been shown to learn harmful correlations in various tasks,… ▽ More

    Submitted 3 June, 2024; originally announced June 2024.

  13. arXiv:2405.16419  [pdf, other]

    cs.CV cs.AI

    Enhancing Feature Diversity Boosts Channel-Adaptive Vision Transformers

    Authors: Chau Pham, Bryan A. Plummer

    Abstract: Multi-Channel Imaging (MCI) contains an array of challenges for encoding useful feature representations not present in traditional images. For example, images from two different satellites may both contain RGB channels, but the remaining channels can be different for each imaging source. Thus, MCI models must support a variety of channel configurations at test time. Recent work has extended tradit… ▽ More

    Submitted 28 October, 2024; v1 submitted 25 May, 2024; originally announced May 2024.

    Comments: Accepted to NeurIPS 2024

  14. arXiv:2404.04346  [pdf, other]

    cs.CV

    Koala: Key frame-conditioned long video-LLM

    Authors: Reuben Tan, Ximeng Sun, Ping Hu, Jui-hsien Wang, Hanieh Deilamsalehy, Bryan A. Plummer, Bryan Russell, Kate Saenko

    Abstract: Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships. State-of-the-art video Large Language Models (vLLMs) hold promise as a viable solution due to their demonstrated emergent capabilities on new tasks. However, despite being trained on millions of short seconds-long videos, vLLMs are unable to unde… ▽ More

    Submitted 3 May, 2024; v1 submitted 5 April, 2024; originally announced April 2024.

    Comments: Accepted at CVPR 2024 as a poster highlight

  15. arXiv:2402.11744  [pdf, other]

    cs.CL

    Machine-Generated Text Localization

    Authors: Zhongping Zhang, Wenda Qin, Bryan A. Plummer

    Abstract: Machine-Generated Text (MGT) detection aims to identify a piece of text as machine or human written. Prior work has primarily formulated MGT detection as a binary classification task over an entire document, with limited work exploring cases where only part of a document is machine generated. This paper provides the first in-depth study of MGT that localizes the portions of a document that were ma… ▽ More

    Submitted 10 June, 2024; v1 submitted 18 February, 2024; originally announced February 2024.

    Comments: ACL 2024 (findings)

  16. arXiv:2402.00626  [pdf, other]

    cs.CV cs.CR cs.LG

    Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks

    Authors: Maan Qraitem, Nazia Tasnim, Piotr Teterwak, Kate Saenko, Bryan A. Plummer

    Abstract: Typographic attacks, adding misleading text to images, can deceive vision-language models (LVLMs). The susceptibility of recent large LVLMs like GPT4-V to such attacks is understudied, raising concerns about amplified misinformation in personal assistant applications. Previous attacks use simple strategies, such as random misleading words, which don't fully exploit LVLMs' language reasoning abilit… ▽ More

    Submitted 12 February, 2025; v1 submitted 1 February, 2024; originally announced February 2024.

  17. arXiv:2312.14985  [pdf, other]

    cs.CV

    UniHuman: A Unified Model for Editing Human Images in the Wild

    Authors: Nannan Li, Qing Liu, Krishna Kumar Singh, Yilin Wang, Jianming Zhang, Bryan A. Plummer, Zhe Lin

    Abstract: Human image editing includes tasks like changing a person's pose, their clothing, or editing the image according to a text prompt. However, prior work often tackles these tasks separately, overlooking the benefit of mutual reinforcement from learning them jointly. In this paper, we propose UniHuman, a unified model that addresses multiple facets of human image editing in real-world settings. To en… ▽ More

    Submitted 31 March, 2024; v1 submitted 22 December, 2023; originally announced December 2023.

    Comments: Accepted to CVPR 2024

  18. arXiv:2312.01629  [pdf, other]

    cs.CV

    CLAMP: Contrastive LAnguage Model Prompt-tuning

    Authors: Piotr Teterwak, Ximeng Sun, Bryan A. Plummer, Kate Saenko, Ser-Nam Lim

    Abstract: Large language models (LLMs) have emerged as powerful general-purpose interfaces for many machine learning problems. Recent work has adapted LLMs to generative visual tasks like image captioning, visual question answering, and visual chat, using a relatively small amount of instruction-tuning data. In this paper, we explore whether modern LLMs can also be adapted to classifying an image into a set… ▽ More

    Submitted 26 March, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

  19. arXiv:2312.01274  [pdf, other]

    cs.CV

    Learning to Compose SuperWeights for Neural Parameter Allocation Search

    Authors: Piotr Teterwak, Soren Nelson, Nikoli Dryden, Dina Bashkirova, Kate Saenko, Bryan A. Plummer

    Abstract: Neural parameter allocation search (NPAS) automates parameter sharing by obtaining weights for a network given an arbitrary, fixed parameter budget. Prior work has two major drawbacks we aim to address. First, there is a disconnect in the sharing pattern between the search and training steps, where weights are warped for layers of different sizes during the search to measure similarity, but not du… ▽ More

    Submitted 2 December, 2023; originally announced December 2023.

    Comments: Accepted at IEEE Winter Conference on Applications of Computer Vision (WACV) 2024

  20. arXiv:2312.00827  [pdf, other]

    cs.CV

    A Unified Framework for Connecting Noise Modeling to Boost Noise Detection

    Authors: Siqi Wang, Chau Pham, Bryan A. Plummer

    Abstract: Noisy labels can impair model performance, making the study of learning with noisy labels an important topic. Two conventional approaches are noise modeling and noise detection. However, these two methods are typically studied independently, and there has been limited work on their collaboration. In this work, we explore the integration of these two approaches, proposing an interconnected structur… ▽ More

    Submitted 30 November, 2023; originally announced December 2023.

  21. arXiv:2311.04251  [pdf, other]

    cs.LG cs.AI cs.CV

    MixtureGrowth: Growing Neural Networks by Recombining Learned Parameters

    Authors: Chau Pham, Piotr Teterwak, Soren Nelson, Bryan A. Plummer

    Abstract: Most deep neural networks are trained under fixed network architectures and require retraining when the architecture changes. If expanding the network's size is needed, it is necessary to retrain from scratch, which is expensive. To avoid this, one can grow from a small network by adding random weights over time to gradually achieve the target network size. However, this naive approach falls short… ▽ More

    Submitted 7 November, 2023; originally announced November 2023.

    Comments: Accepted at IEEE Winter Conference on Applications of Computer Vision (WACV) 2024

  22. arXiv:2310.19224  [pdf, other]

    cs.CV

    CHAMMI: A benchmark for channel-adaptive models in microscopy imaging

    Authors: Zitong Chen, Chau Pham, Siqi Wang, Michael Doron, Nikita Moshkov, Bryan A. Plummer, Juan C. Caicedo

    Abstract: Most neural networks assume that input images have a fixed number of channels (three for RGB images). However, there are many settings where the number of channels may vary, such as microscopy images where the number of channels changes depending on instruments and experimental goals. Yet, there has not been a systematic attempt to create and evaluate neural networks that are invariant to the number… ▽ More

    Submitted 16 January, 2024; v1 submitted 29 October, 2023; originally announced October 2023.

    Comments: Accepted at NeurIPS Track on Datasets and Benchmarks, 2023

  23. arXiv:2310.06272  [pdf, other]

    cs.CL cs.AI cs.LG

    Let Models Speak Ciphers: Multiagent Debate through Embeddings

    Authors: Chau Pham, Boyi Liu, Yingxiang Yang, Zhengyu Chen, Tianyi Liu, Jianbo Yuan, Bryan A. Plummer, Zhaoran Wang, Hongxia Yang

    Abstract: Discussion and debate among Large Language Models (LLMs) have gained considerable attention due to their potential to enhance the reasoning ability of LLMs. Although natural language is an obvious choice for communication due to LLMs' language understanding capability, the token sampling step needed when generating natural language poses a potential risk of information loss, as it uses only one to… ▽ More

    Submitted 26 February, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

    Comments: Accepted to ICLR 2024

  24. arXiv:2308.16741  [pdf, other]

    cs.AI cs.CV

    Socratis: Are large multimodal models emotionally aware?

    Authors: Katherine Deng, Arijit Ray, Reuben Tan, Saadia Gabriel, Bryan A. Plummer, Kate Saenko

    Abstract: Existing emotion prediction benchmarks contain coarse emotion labels which do not consider the diversity of emotions that an image and text can elicit in humans due to various reasons. Learning diverse reactions to multimodal content is important as intelligent machines take a central role in generating and delivering content to society. To address this gap, we propose Socratis, a societal reactio… ▽ More

    Submitted 2 November, 2023; v1 submitted 31 August, 2023; originally announced August 2023.

    Comments: ICCV 2023 WECIA

  25. arXiv:2308.04553  [pdf, other]

    cs.CV cs.LG

    From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition

    Authors: Maan Qraitem, Kate Saenko, Bryan A. Plummer

    Abstract: Visual recognition models are prone to learning spurious correlations induced by a biased training set where certain conditions $B$ (e.g., Indoors) are over-represented in certain classes $Y$ (e.g., Big Dogs). Synthetic data from off-the-shelf large-scale generative models offers a promising direction to mitigate this issue by augmenting underrepresented subgroups in the real dataset. However, by us… ▽ More

    Submitted 17 July, 2024; v1 submitted 8 August, 2023; originally announced August 2023.

    Comments: Accepted at ECCV 2024

  26. arXiv:2307.12854  [pdf, other]

    cs.CV

    Multiscale Video Pretraining for Long-Term Activity Forecasting

    Authors: Reuben Tan, Matthias De Lange, Michael Iuzzolino, Bryan A. Plummer, Kate Saenko, Karl Ridgeway, Lorenzo Torresani

    Abstract: Long-term activity forecasting is an especially challenging research problem because it requires understanding the temporal relationships between observed actions, as well as the variability and complexity of human activities. Despite relying on strong supervision via expensive human annotations, state-of-the-art forecasting approaches often generalize poorly to unseen data. To alleviate this issu… ▽ More

    Submitted 24 July, 2023; originally announced July 2023.

  27. arXiv:2306.11911  [pdf, other]

    cs.CV

    LNL+K: Enhancing Learning with Noisy Labels Through Noise Source Knowledge Integration

    Authors: Siqi Wang, Bryan A. Plummer

    Abstract: Learning with noisy labels (LNL) aims to train a high-performing model using a noisy dataset. We observe that noise for a given class often comes from a limited set of categories, yet many LNL methods overlook this. For example, an image mislabeled as a cheetah is more likely a leopard than a hippopotamus due to its visual similarity. Thus, we explore Learning with Noisy Labels with noise source K… ▽ More

    Submitted 13 July, 2024; v1 submitted 20 June, 2023; originally announced June 2023.

    Comments: This paper is accepted by European Conference on Computer Vision (ECCV) 2024

  28. arXiv:2305.17489  [pdf, other]

    cs.CV

    Text-to-image Editing by Image Information Removal

    Authors: Zhongping Zhang, Jian Zheng, Jacob Zhiyuan Fang, Bryan A. Plummer

    Abstract: Diffusion models have demonstrated impressive performance in text-guided image generation. Current methods that leverage the knowledge of these models for image editing either fine-tune them using the input image (e.g., Imagic) or incorporate structure information as additional constraints (e.g., ControlNet). However, fine-tuning large-scale diffusion models on a single image can lead to severe ov… ▽ More

    Submitted 7 November, 2023; v1 submitted 27 May, 2023; originally announced May 2023.

    Comments: Full paper accepted at WACV 2024; best paper runner-up at AI4CC@CVPR 2023

  29. arXiv:2305.05432  [pdf, other]

    cs.CL cs.CV

    WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset

    Authors: Andrea Burns, Krishna Srinivasan, Joshua Ainslie, Geoff Brown, Bryan A. Plummer, Kate Saenko, Jianmo Ni, Mandy Guo

    Abstract: Webpages have been a rich resource for language and vision-language tasks. Yet only pieces of webpages are kept: image-caption pairs, long text articles, or raw HTML, never all in one place. As a result, webpage tasks have received little attention, and structured image-text data has been underused. To study multimodal webpage understanding, we introduce the Wikipedia Webpage 2M (WikiWeb2M) suite; the first… ▽ More

    Submitted 9 May, 2023; originally announced May 2023.

    Comments: Accepted at the WikiWorkshop 2023. Data is readily available at https://github.com/google-research-datasets/wit/blob/main/wikiweb2m.md. arXiv admin note: text overlap with arXiv:2305.03668

  30. arXiv:2305.03689  [pdf, other]

    cs.CV

    COLA: A Benchmark for Compositional Text-to-image Retrieval

    Authors: Arijit Ray, Filip Radenovic, Abhimanyu Dubey, Bryan A. Plummer, Ranjay Krishna, Kate Saenko

    Abstract: Compositional reasoning is a hallmark of human visual intelligence. Yet, despite the size of large vision-language models, they struggle to represent simple compositions by combining objects with their attributes. To measure this lack of compositional capability, we design Cola, a text-to-image retrieval benchmark to Compose Objects Localized with Attributes. To solve Cola, a model must retrieve i… ▽ More

    Submitted 2 November, 2023; v1 submitted 5 May, 2023; originally announced May 2023.

    Comments: Accepted to NeurIPS 2023. Webpage: https://cs-people.bu.edu/array/research/cola/

  31. arXiv:2305.03668  [pdf, other]

    cs.CL cs.CV

    A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding

    Authors: Andrea Burns, Krishna Srinivasan, Joshua Ainslie, Geoff Brown, Bryan A. Plummer, Kate Saenko, Jianmo Ni, Mandy Guo

    Abstract: Webpages have been a rich, scalable resource for vision-language and language-only tasks. Yet only pieces of webpages are kept in existing datasets: image-caption pairs, long text articles, or raw HTML, never all in one place. As a result, webpage tasks have received little attention, and structured image-text data has been left underused. To study multimodal webpage understanding, we introduce the Wikipedia… ▽ More

    Submitted 20 October, 2023; v1 submitted 5 May, 2023; originally announced May 2023.

    Comments: Accepted in EMNLP 2023, revision contains camera ready edits. Data can be downloaded at https://github.com/google-research-datasets/wit/blob/main/wikiweb2m.md

  32. arXiv:2304.01973  [pdf, other]

    cs.LG cs.CV

    ERM++: An Improved Baseline for Domain Generalization

    Authors: Piotr Teterwak, Kuniaki Saito, Theodoros Tsiligkaridis, Kate Saenko, Bryan A. Plummer

    Abstract: Domain Generalization (DG) aims to develop classifiers that can generalize to new, unseen data distributions, a critical capability when collecting new domain-specific data is impractical. A common DG baseline minimizes the empirical risk on the source domains. Recent studies have shown that this approach, known as Empirical Risk Minimization (ERM), can outperform most more complex DG methods when… ▽ More

    Submitted 9 December, 2024; v1 submitted 4 April, 2023; originally announced April 2023.

    Comments: An improved baseline for Domain Generalization. WACV 2025

  33. arXiv:2303.16342  [pdf, other]

    cs.CV cs.AI cs.CL

    Language-Guided Audio-Visual Source Separation via Trimodal Consistency

    Authors: Reuben Tan, Arijit Ray, Andrea Burns, Bryan A. Plummer, Justin Salamon, Oriol Nieto, Bryan Russell, Kate Saenko

    Abstract: We propose a self-supervised approach for learning to perform audio source separation in videos based on natural language queries, using only unlabeled video and audio pairs as training data. A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform, all without access to… ▽ More

    Submitted 23 September, 2023; v1 submitted 28 March, 2023; originally announced March 2023.

    Comments: Accepted at CVPR 2023

  34. arXiv:2211.12112  [pdf, other]

    cs.CV cs.AI cs.LG

    Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark

    Authors: Vitali Petsiuk, Alexander E. Siemenn, Saisamrit Surbehera, Zad Chin, Keith Tyser, Gregory Hunter, Arvind Raghavan, Yann Hicke, Bryan A. Plummer, Ori Kerret, Tonio Buonassisi, Kate Saenko, Armando Solar-Lezama, Iddo Drori

    Abstract: We provide a new multi-task benchmark for evaluating text-to-image models. We perform a human evaluation comparing the most common open-source (Stable Diffusion) and commercial (DALL-E 2) models. Twenty computer science AI graduate students evaluated the two models, on three tasks, at three difficulty levels, across ten prompts each, providing 3,600 ratings. Text-to-image generation has seen rapid… ▽ More

    Submitted 22 November, 2022; originally announced November 2022.

    Comments: NeurIPS 2022 Workshop on Human Evaluation of Generative Models (HEGM)

  35. arXiv:2210.01887  [pdf, other]

    cs.CV

    Collecting The Puzzle Pieces: Disentangled Self-Driven Human Pose Transfer by Permuting Textures

    Authors: Nannan Li, Kevin J. Shih, Bryan A. Plummer

    Abstract: Human pose transfer synthesizes new view(s) of a person for a given pose. Recent work achieves this via self-reconstruction, which disentangles a person's pose and texture information by breaking the person down into parts, then recombines them for reconstruction. However, part-level disentanglement preserves some pose information that can create unwanted artifacts. In this paper, we propose Pose… ▽ More

    Submitted 30 August, 2023; v1 submitted 4 October, 2022; originally announced October 2022.

    Comments: Accepted to ICCV 2023

  36. arXiv:2209.15605  [pdf, other]

    cs.CV

    Bias Mimicking: A Simple Sampling Approach for Bias Mitigation

    Authors: Maan Qraitem, Kate Saenko, Bryan A. Plummer

    Abstract: Prior work has shown that Visual Recognition datasets frequently underrepresent bias groups $B$ (e.g., Female) within class labels $Y$ (e.g., Programmers). This dataset bias can lead to models that learn spurious correlations between class labels and bias groups such as age, gender, or race. Most recent methods that address this problem require significant architectural changes or additional loss func… ▽ More

    Submitted 27 April, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

    Comments: Accepted at CVPR 2023

  37. arXiv:2207.13061  [pdf, other]

    cs.CV cs.AI cs.CL

    NewsStories: Illustrating articles with visual summaries

    Authors: Reuben Tan, Bryan A. Plummer, Kate Saenko, JP Lewis, Avneesh Sud, Thomas Leung

    Abstract: Recent self-supervised approaches have used large-scale image-text datasets to learn powerful representations that transfer to many tasks without finetuning. These methods often assume that there is a one-to-one correspondence between images and their (short) captions. However, many tasks require reasoning about multiple images and long text narratives, such as describing news articles with visu… ▽ More

    Submitted 14 August, 2022; v1 submitted 26 July, 2022; originally announced July 2022.

    Comments: Accepted at ECCV 2022

  38. arXiv:2207.06555  [pdf, other]

    cs.CV

    Supervised Attribute Information Removal and Reconstruction for Image Manipulation

    Authors: Nannan Li, Bryan A. Plummer

    Abstract: The goal of attribute manipulation is to control specified attribute(s) in given images. Prior work approaches this problem by learning a disentangled representation for each attribute, which makes it possible to change the encoded source attributes to the target attributes. However, encoded attributes are often correlated with relevant image content. Thus, the source attribute information can often be… ▽ More

    Submitted 13 July, 2022; originally announced July 2022.

    Comments: Accepted at ECCV 2022

  39. arXiv:2203.13281  [pdf, other]

    cs.CV

    Movie Genre Classification by Language Augmentation and Shot Sampling

    Authors: Zhongping Zhang, Yiwen Gu, Bryan A. Plummer, Xin Miao, Jiayi Liu, Huayan Wang

    Abstract: Video-based movie genre classification has garnered considerable attention due to its various applications in recommendation systems. Prior work has typically addressed this task by adapting models from traditional video classification tasks, such as action recognition or event detection. However, these models often neglect language elements (e.g., narrations or conversations) present in videos, w… ▽ More

    Submitted 7 November, 2023; v1 submitted 24 March, 2022; originally announced March 2022.

    Comments: Accepted at WACV 2024

  40. arXiv:2203.12849  [pdf, other]

    cs.CV

    Complex Scene Image Editing by Scene Graph Comprehension

    Authors: Zhongping Zhang, Huiwen He, Bryan A. Plummer, Zhenyu Liao, Huayan Wang

    Abstract: Conditional diffusion models have demonstrated impressive performance on various tasks like text-guided semantic image editing. Prior work requires image regions to be identified manually by human users or uses an object detector that only performs well for object-centric manipulations. For example, if an input image contains multiple objects with the same semantic meaning (such as a group of birds)… ▽ More

    Submitted 19 September, 2023; v1 submitted 24 March, 2022; originally announced March 2022.

    Comments: Accepted to BMVC 2023

  41. arXiv:2202.02312  [pdf, other]

    cs.CL cs.CV cs.HC

    A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility

    Authors: Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, Bryan A. Plummer

    Abstract: Vision-language navigation (VLN), in which an agent follows language instruction in a visual environment, has been studied under the premise that the input command is fully feasible in the environment. Yet in practice, a request may not be possible due to language ambiguity or environment changes. To study VLN with unknown command feasibility, we introduce a new dataset Mobile app Tasks with Itera… ▽ More

    Submitted 14 August, 2022; v1 submitted 4 February, 2022; originally announced February 2022.

    Comments: Accepted at the European Conference on Computer Vision (ECCV) 2022. This is a new version of the paper with additional experimental results and a few prior implementation bugs fixed

  42. arXiv:2201.12462  [pdf, other]

    cs.LG cs.AI cs.HC cs.RO

    Explaining Reinforcement Learning Policies through Counterfactual Trajectories

    Authors: Julius Frost, Olivia Watkins, Eric Weiner, Pieter Abbeel, Trevor Darrell, Bryan Plummer, Kate Saenko

    Abstract: In order for humans to confidently decide where to employ RL agents for real-world tasks, a human developer must validate that the agent will perform well at test-time. Some policy interpretability methods facilitate this by capturing the policy's decision making in a set of agent rollouts. However, even the most informative trajectories of training time behavior may give little insight into the a…

    Submitted 18 March, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

    Comments: Accepted at ICML HILL 2021 Workshop

    ACM Class: I.2.6

  43. arXiv:2112.05917  [pdf, other]

    cs.CL

    Show, Write, and Retrieve: Entity-aware Article Generation and Retrieval

    Authors: Zhongping Zhang, Yiwen Gu, Bryan A. Plummer

    Abstract: Article comprehension is an important challenge in natural language processing with many applications such as article generation or image-to-article retrieval. Prior work typically encodes all tokens in articles uniformly using pretrained language models. However, in many applications, such as understanding news stories, these articles are based on real-world events and may reference many named en…

    Submitted 20 October, 2023; v1 submitted 11 December, 2021; originally announced December 2021.

    Comments: Accepted at EMNLP 2023 Findings

  44. arXiv:2112.03237  [pdf, other]

    cs.CV

    From Coarse to Fine-grained Concept based Discrimination for Phrase Detection

    Authors: Maan Qraitem, Bryan A. Plummer

    Abstract: Phrase detection requires methods to identify if a phrase is relevant to an image and localize it, if applicable. A key challenge for training more discriminative detection models is sampling negatives. Sampling techniques from prior work focus primarily on hard, often noisy, negatives, disregarding the broader distribution of negative samples. Our proposed CFCD-Net addresses this through two novel…

    Submitted 14 November, 2022; v1 submitted 6 December, 2021; originally announced December 2021.

  45. arXiv:2112.03208  [pdf, other]

    cs.LG

    Anchoring to Exemplars for Training Mixture-of-Expert Cell Embeddings

    Authors: Siqi Wang, Manyuan Lu, Nikita Moshkov, Juan C. Caicedo, Bryan A. Plummer

    Abstract: Analyzing the morphology of cells in microscopy images can provide insights into the mechanism of compounds or the function of genes. Addressing this task requires methods that can not only extract biological information from the images, but also ignore technical variations, i.e., changes in experimental procedure or differences between the equipment used to collect microscopy images. We propose Treatm…

    Submitted 6 December, 2021; originally announced December 2021.

  46. arXiv:2110.10596  [pdf, other]

    cs.CV cs.LG

    Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos

    Authors: Reuben Tan, Bryan A. Plummer, Kate Saenko, Hailin Jin, Bryan Russell

    Abstract: We introduce the task of spatially localizing narrated interactions in videos. Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations. To achieve this goal, we propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training.…

    Submitted 2 December, 2021; v1 submitted 20 October, 2021; originally announced October 2021.

    Comments: Accepted at NeurIPS 2021 (Spotlight)

  47. arXiv:2105.01695  [pdf, other]

    cs.CV

    Effectively Leveraging Attributes for Visual Similarity

    Authors: Samarth Mishra, Zhongping Zhang, Yuan Shen, Ranjitha Kumar, Venkatesh Saligrama, Bryan Plummer

    Abstract: Measuring similarity between two images often requires performing complex reasoning along different axes (e.g., color, texture, or shape). Insights into what might be important for measuring similarity can be provided by annotated attributes, but prior work tends to view these annotations as complete, resulting in them using a simplistic approach of predicting attributes on single images, whic…

    Submitted 20 August, 2021; v1 submitted 4 May, 2021; originally announced May 2021.

    Comments: Accepted to ICCV 2021

  48. arXiv:2104.08560  [pdf, other]

    cs.CL cs.CV

    Mobile App Tasks with Iterative Feedback (MoTIF): Addressing Task Feasibility in Interactive Visual Environments

    Authors: Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, Bryan A. Plummer

    Abstract: In recent years, vision-language research has shifted to study tasks which require more complex reasoning, such as interactive question answering, visual common sense reasoning, and question-answer plausibility prediction. However, the datasets used for these problems fail to capture the complexity of real inputs and multimodal environments, such as ambiguous natural language requests and diverse…

    Submitted 17 April, 2021; originally announced April 2021.

    Comments: Accepted at the workshop on Visually Grounded Interaction and Language (ViGIL) at NAACL 2021

  49. Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News

    Authors: Reuben Tan, Bryan A. Plummer, Kate Saenko

    Abstract: Large-scale dissemination of disinformation online intended to mislead or deceive the general population is a major societal problem. Rapid progression in image, video, and natural language generative models has only exacerbated this situation and intensified our need for an effective defense mechanism. While existing approaches have been proposed to defend against neural fake news, they are gener…

    Submitted 21 October, 2020; v1 submitted 16 September, 2020; originally announced September 2020.

    Comments: Accepted at EMNLP 2020

  50. arXiv:2008.00348  [pdf, other]

    cs.CV

    Self-supervised Visual Attribute Learning for Fashion Compatibility

    Authors: Donghyun Kim, Kuniaki Saito, Samarth Mishra, Stan Sclaroff, Kate Saenko, Bryan A Plummer

    Abstract: Many self-supervised learning (SSL) methods have been successful in learning semantically meaningful visual representations by solving pretext tasks. However, prior work in SSL focuses on tasks like object recognition or detection, which aim to learn object shapes and assume that the features should be invariant to concepts like colors and textures. Thus, these SSL methods perform poorly on downst…

    Submitted 11 August, 2021; v1 submitted 1 August, 2020; originally announced August 2020.

    Comments: Accepted to VIPriors Workshop ICCV 2021