Showing 1–34 of 34 results for author: Yeung-Levy, S

  1. arXiv:2505.22946

    cs.CL cs.AI cs.CV cs.CY cs.LG

    NegVQA: Can Vision Language Models Understand Negation?

    Authors: Yuhui Zhang, Yuchang Su, Yiming Liu, Serena Yeung-Levy

    Abstract: Negation is a fundamental linguistic phenomenon that can entirely reverse the meaning of a sentence. As vision language models (VLMs) continue to advance and are deployed in high-stakes applications, assessing their ability to comprehend negation becomes essential. To address this, we introduce NegVQA, a visual question answering (VQA) benchmark consisting of 7,379 two-choice questions covering di…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: Published at ACL 2025 Findings

  2. arXiv:2505.22787

    cs.CL

    Can Large Language Models Match the Conclusions of Systematic Reviews?

    Authors: Christopher Polzak, Alejandro Lozano, Min Woo Sun, James Burgess, Yuhui Zhang, Kevin Wu, Serena Yeung-Levy

    Abstract: Systematic reviews (SR), in which experts summarize and analyze evidence across individual studies to provide insights on a specialized topic, are a cornerstone for evidence-based clinical decision-making, research, and policy. Given the exponential growth of scientific articles, there is growing interest in using large language models (LLMs) to automate SR generation. However, the ability of LLMs…

    Submitted 28 May, 2025; originally announced May 2025.

  3. arXiv:2504.02799

    cs.CV cs.AI

    Systematic Evaluation of Large Vision-Language Models for Surgical Artificial Intelligence

    Authors: Anita Rau, Mark Endo, Josiah Aklilu, Jaewoo Heo, Khaled Saab, Alberto Paderno, Jeffrey Jopling, F. Christopher Holsinger, Serena Yeung-Levy

    Abstract: Large Vision-Language Models offer a new paradigm for AI-driven image understanding, enabling models to perform tasks without task-specific training. This flexibility holds particular promise across medicine, where expert-annotated data is scarce. Yet, VLMs' practical utility in intervention-focused domains--especially surgery, where decision-making is subjective and clinical scenarios are variabl…

    Submitted 3 April, 2025; originally announced April 2025.

  4. arXiv:2503.22727

    cs.CL cs.LG

    A Large-Scale Vision-Language Dataset Derived from Open Scientific Literature to Advance Biomedical Generalist AI

    Authors: Alejandro Lozano, Min Woo Sun, James Burgess, Jeffrey J. Nirschl, Christopher Polzak, Yuhui Zhang, Liangyu Chen, Jeffrey Gu, Ivan Lopez, Josiah Aklilu, Anita Rau, Austin Wolfgang Katzer, Collin Chiu, Orr Zohar, Xiaohan Wang, Alfred Seunghoon Song, Chiang Chia-Chun, Robert Tibshirani, Serena Yeung-Levy

    Abstract: Despite the excitement behind biomedical artificial intelligence (AI), access to high-quality, diverse, and large-scale data - the foundation for modern AI systems - is still a bottleneck to unlocking its full potential. To address this gap, we introduce Biomedica, an open-source dataset derived from the PubMed Central Open Access subset, containing over 6 million scientific articles and 24 millio…

    Submitted 1 April, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

  5. arXiv:2503.13399

    cs.CV cs.AI cs.CL cs.LG q-bio.CB

    MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research

    Authors: James Burgess, Jeffrey J Nirschl, Laura Bravo-Sánchez, Alejandro Lozano, Sanket Rajan Gupte, Jesus G. Galaz-Montoya, Yuhui Zhang, Yuchang Su, Disha Bhowmik, Zachary Coman, Sarina M. Hasan, Alexandra Johannesson, William D. Leineweber, Malvika G Nair, Ridhi Yarlagadda, Connor Zuraski, Wah Chiu, Sarah Cohen, Jan N. Hansen, Manuel D Leonetti, Chad Liu, Emma Lundberg, Serena Yeung-Levy

    Abstract: Scientific research demands sophisticated reasoning over multimodal data, a challenge especially prevalent in biology. Despite recent advances in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks only target up to college-level difficulty, while research-level benchmarks emphasize lower-level perception, falling short of the complex multimo…

    Submitted 17 March, 2025; originally announced March 2025.

    Comments: CVPR 2025 (Conference on Computer Vision and Pattern Recognition) Project page at https://jmhb0.github.io/microvqa Benchmark at https://huggingface.co/datasets/jmhb/microvqa

  6. arXiv:2503.07860

    cs.CV cs.AI cs.LG

    Video Action Differencing

    Authors: James Burgess, Xiaohan Wang, Yuhui Zhang, Anita Rau, Alejandro Lozano, Lisa Dunlap, Trevor Darrell, Serena Yeung-Levy

    Abstract: How do two individuals differ when performing the same action? In this work, we introduce Video Action Differencing (VidDiff), the novel task of identifying subtle differences between videos of the same action, which has many applications, such as coaching and skill learning. To enable development on this new task, we first create VidDiffBench, a benchmark dataset containing 549 video pairs, with…

    Submitted 10 March, 2025; originally announced March 2025.

    Comments: ICLR 2025 (International Conference on Learning Representations) Project page: http://jmhb0.github.io/viddiff Benchmark: https://huggingface.co/datasets/jmhb/VidDiffBench

  7. arXiv:2503.03942

    cs.CV

    SurgiSAM2: Fine-tuning a foundational model for surgical video anatomy segmentation and detection

    Authors: Devanish N. Kamtam, Joseph B. Shrager, Satya Deepya Malla, Xiaohan Wang, Nicole Lin, Juan J. Cardona, Serena Yeung-Levy, Clarence Hu

    Abstract: Background: We evaluate SAM 2 for surgical scene understanding by examining its semantic segmentation capabilities for organs/tissues both in zero-shot scenarios and after fine-tuning. Methods: We utilized five public datasets to evaluate and fine-tune SAM 2 for segmenting anatomical tissues in surgical videos/images. Fine-tuning was applied to the image encoder and mask decoder. We limited traini…

    Submitted 5 March, 2025; originally announced March 2025.

  8. arXiv:2503.00838

    cs.LG cs.CV

    Foundation Models Secretly Understand Neural Network Weights: Enhancing Hypernetwork Architectures with Foundation Models

    Authors: Jeffrey Gu, Serena Yeung-Levy

    Abstract: Large pre-trained models, or foundation models, have shown impressive performance when adapted to a variety of downstream tasks, often out-performing specialized models. Hypernetworks, neural networks that generate some or all of the parameters of another neural network, have become an increasingly important technique for conditioning and generalizing implicit neural representations (INRs), which…

    Submitted 2 March, 2025; originally announced March 2025.

    Comments: ICLR 2025

  9. arXiv:2502.09775

    q-bio.QM cs.CV cs.LG q-bio.BM q-bio.CB

    CellFlux: Simulating Cellular Morphology Changes via Flow Matching

    Authors: Yuhui Zhang, Yuchang Su, Chenyu Wang, Tianhong Li, Zoe Wefers, Jeffrey Nirschl, James Burgess, Daisy Ding, Alejandro Lozano, Emma Lundberg, Serena Yeung-Levy

    Abstract: Building a virtual cell capable of accurately simulating cellular behaviors in silico has long been a dream in computational biology. We introduce CellFlux, an image-generative model that simulates cellular morphology changes induced by chemical and genetic perturbations using flow matching. Unlike prior methods, CellFlux models distribution-wise transformations from unperturbed to perturbed cell…

    Submitted 28 May, 2025; v1 submitted 13 February, 2025; originally announced February 2025.

    Comments: Published at ICML 2025

  10. arXiv:2501.13919

    cs.CV cs.AI cs.CL cs.LG cs.RO

    Temporal Preference Optimization for Long-Form Video Understanding

    Authors: Rui Li, Xiaohan Wang, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy

    Abstract: Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models. To address this limitation, we propose Temporal Preference Optimization (TPO), a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning. TPO adopts…

    Submitted 30 January, 2025; v1 submitted 23 January, 2025; originally announced January 2025.

  11. arXiv:2501.07171

    cs.CV cs.CL

    BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature

    Authors: Alejandro Lozano, Min Woo Sun, James Burgess, Liangyu Chen, Jeffrey J Nirschl, Jeffrey Gu, Ivan Lopez, Josiah Aklilu, Austin Wolfgang Katzer, Collin Chiu, Anita Rau, Xiaohan Wang, Yuhui Zhang, Alfred Seunghoon Song, Robert Tibshirani, Serena Yeung-Levy

    Abstract: The development of vision-language models (VLMs) is driven by large-scale and diverse multimodal datasets. However, progress toward generalist biomedical VLMs is limited by the lack of annotated, publicly accessible datasets across biology and medicine. Existing efforts are restricted to narrow domains, missing the full diversity of biomedical knowledge encoded in scientific literature. To address…

    Submitted 1 April, 2025; v1 submitted 13 January, 2025; originally announced January 2025.

  12. arXiv:2501.03225

    cs.CV cs.AI cs.CL cs.CY cs.LG

    Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation

    Authors: Yuhui Zhang, Yuchang Su, Yiming Liu, Xiaohan Wang, James Burgess, Elaine Sui, Chenyu Wang, Josiah Aklilu, Alejandro Lozano, Anjiang Wei, Ludwig Schmidt, Serena Yeung-Levy

    Abstract: The rapid development of vision language models (VLMs) demands rigorous and reliable evaluation. However, current visual question answering (VQA) benchmarks often depend on open-ended questions, making accurate evaluation difficult due to the variability in natural language responses. To address this, we introduce AutoConverter, an agentic framework that automatically converts these open-ended que…

    Submitted 9 April, 2025; v1 submitted 6 January, 2025; originally announced January 2025.

    Comments: CVPR 2025

  13. arXiv:2412.13180

    cs.CV

    Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration

    Authors: Mark Endo, Xiaohan Wang, Serena Yeung-Levy

    Abstract: Recent works on accelerating Vision-Language Models show that strong performance can be maintained across a variety of vision-language tasks despite highly compressing visual information. In this work, we examine the popular acceleration approach of early pruning of visual tokens inside the language model and find that its strong performance across many tasks is not due to an exceptional ability t…

    Submitted 17 December, 2024; originally announced December 2024.

    Comments: Project page: https://web.stanford.edu/~markendo/projects/feather.html

  14. arXiv:2412.10360

    cs.CV cs.AI

    Apollo: An Exploration of Video Understanding in Large Multimodal Models

    Authors: Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia

    Abstract: Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high computational cost of training and evaluating such models, coupled with limited open research, hinders…

    Submitted 13 December, 2024; originally announced December 2024.

    Comments: https://apollo-lmms.github.io

  15. arXiv:2411.11214

    cs.CV

    DeforHMR: Vision Transformer with Deformable Cross-Attention for 3D Human Mesh Recovery

    Authors: Jaewoo Heo, George Hu, Zeyu Wang, Serena Yeung-Levy

    Abstract: Human Mesh Recovery (HMR) is an important yet challenging problem with applications across various domains including motion capture, augmented reality, and biomechanics. Accurately predicting human pose parameters from a single image remains a challenging 3D computer vision task. In this work, we introduce DeforHMR, a novel regression-based monocular HMR framework designed to enhance the predictio…

    Submitted 17 November, 2024; originally announced November 2024.

    Comments: 11 pages, 5 figures, 3DV2025

  16. arXiv:2411.10582

    cs.CV

    Motion Diffusion-Guided 3D Global HMR from a Dynamic Camera

    Authors: Jaewoo Heo, Kuan-Chieh Wang, Karen Liu, Serena Yeung-Levy

    Abstract: Motion capture technologies have transformed numerous fields, from the film and gaming industries to sports science and healthcare, by providing a tool to capture and analyze human movement in great detail. The holy grail in the topic of monocular global human mesh and motion reconstruction (GHMR) is to achieve accuracy on par with traditional multi-view capture on any monocular videos captured wi…

    Submitted 15 November, 2024; originally announced November 2024.

    Comments: 15 pages, 2 figures, submitted to TMLR

  17. arXiv:2410.14340

    cs.CV

    Zero-shot Action Localization via the Confidence of Large Vision-Language Models

    Authors: Josiah Aklilu, Xiaohan Wang, Serena Yeung-Levy

    Abstract: Precise action localization in untrimmed video is vital for fields such as professional sports and minimally invasive surgery, where the delineation of particular motions in recordings can dramatically enhance analysis. But in many cases, large scale datasets with video-label pairs for localization are unavailable, limiting the opportunity to fine-tune video-understanding models. Recent developmen…

    Submitted 24 March, 2025; v1 submitted 18 October, 2024; originally announced October 2024.

  18. arXiv:2410.00309

    cs.CV cs.AI cs.LG

    Ask, Pose, Unite: Scaling Data Acquisition for Close Interactions with Vision Language Models

    Authors: Laura Bravo-Sánchez, Jaewoo Heo, Zhenzhen Weng, Kuan-Chieh Wang, Serena Yeung-Levy

    Abstract: Social dynamics in close human interactions pose significant challenges for Human Mesh Estimation (HME), particularly due to the complexity of physical contacts and the scarcity of training data. Addressing these challenges, we introduce a novel data generation method that utilizes Large Vision Language Models (LVLMs) to annotate contact maps which guide test-time optimization to produce paired im…

    Submitted 30 September, 2024; originally announced October 2024.

    Comments: Project webpage: https://laubravo.github.io/apu_website/

  19. arXiv:2409.11654

    q-bio.QM cs.AI cs.LG q-bio.NC

    How to Build the Virtual Cell with Artificial Intelligence: Priorities and Opportunities

    Authors: Charlotte Bunne, Yusuf Roohani, Yanay Rosen, Ankit Gupta, Xikun Zhang, Marcel Roed, Theo Alexandrov, Mohammed AlQuraishi, Patricia Brennan, Daniel B. Burkhardt, Andrea Califano, Jonah Cool, Abby F. Dernburg, Kirsty Ewing, Emily B. Fox, Matthias Haury, Amy E. Herr, Eric Horvitz, Patrick D. Hsu, Viren Jain, Gregory R. Johnson, Thomas Kalil, David R. Kelley, Shana O. Kelley, Anna Kreshuk, et al. (17 additional authors not shown)

    Abstract: The cell is arguably the most fundamental unit of life and is central to understanding biology. Accurate modeling of cells is important for this understanding as well as for determining the root causes of disease. Recent advances in artificial intelligence (AI), combined with the ability to generate large-scale experimental data, present novel opportunities to model cells. Here we propose a vision…

    Submitted 14 October, 2024; v1 submitted 17 September, 2024; originally announced September 2024.

  20. arXiv:2408.07867

    cs.CV

    Continuous Perception Benchmark

    Authors: Zeyu Wang, Zhenzhen Weng, Serena Yeung-Levy

    Abstract: Humans continuously perceive and process visual signals. However, current video models typically either sample key frames sparsely or divide videos into chunks and densely sample within each chunk. This approach stems from the fact that most existing video benchmarks can be addressed by analyzing key frames or aggregating information from separate chunks. We anticipate that the next generation of…

    Submitted 14 August, 2024; originally announced August 2024.

  21. arXiv:2407.06189

    cs.CV cs.AI

    Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

    Authors: Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, Serena Yeung-Levy

    Abstract: The performance of Large Vision Language Models (LVLMs) is dependent on the size and quality of their training datasets. Existing video instruction tuning datasets lack diversity as they are derived by prompting large language models with video captions to generate question-answer pairs, and are therefore mostly descriptive. Meanwhile, many labeled video datasets with diverse labels and supervisio…

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: Project page: https://orrzohar.github.io/projects/video-star/

  22. arXiv:2407.01791

    cs.CV cs.AI

    μ-Bench: A Vision-Language Benchmark for Microscopy Understanding

    Authors: Alejandro Lozano, Jeffrey Nirschl, James Burgess, Sanket Rajan Gupte, Yuhui Zhang, Alyssa Unell, Serena Yeung-Levy

    Abstract: Recent advances in microscopy have enabled the rapid generation of terabytes of image data in cell biology and biomedical research. Vision-language models (VLMs) offer a promising solution for large-scale biological image analysis, enhancing researchers' efficiency, identifying new image biomarkers, and accelerating hypothesis generation and scientific discovery. However, there is a lack of standa…

    Submitted 1 July, 2024; originally announced July 2024.

  23. arXiv:2405.18415

    cs.CV cs.AI cs.CL cs.LG

    Why are Visually-Grounded Language Models Bad at Image Classification?

    Authors: Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, Serena Yeung-Levy

    Abstract: Image classification is one of the most fundamental capabilities of machine vision intelligence. In this work, we revisit the image classification task using visually-grounded language models (VLMs) such as GPT-4V and LLaVA. We find that existing proprietary and public VLMs, despite often using CLIP as a vision encoder and having many more parameters, significantly underperform CLIP on standard im…

    Submitted 3 November, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: Published at NeurIPS 2024

  24. arXiv:2403.13206

    cs.CV cs.AI

    Depth-guided NeRF Training via Earth Mover's Distance

    Authors: Anita Rau, Josiah Aklilu, F. Christopher Holsinger, Serena Yeung-Levy

    Abstract: Neural Radiance Fields (NeRFs) are trained to minimize the rendering loss of predicted viewpoints. However, the photometric loss often does not provide enough information to disambiguate between different possible geometries yielding the same image. Previous work has thus incorporated depth supervision during NeRF training, leveraging dense predictions from pre-trained depth networks as pseudo-gro…

    Submitted 4 September, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

    Comments: Accepted to ECCV 2024

  25. arXiv:2403.12952

    cs.CV cs.AI cs.LG

    Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models

    Authors: Elaine Sui, Xiaohan Wang, Serena Yeung-Levy

    Abstract: Advancements in vision-language models (VLMs) have propelled the field of computer vision, particularly in the zero-shot learning setting. Despite their promise, the effectiveness of these models often diminishes due to domain shifts in test environments. To address this, we introduce the Test-Time Prototype Shifting (TPS) framework, a pioneering approach designed to adapt VLMs to test datasets us…

    Submitted 10 December, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

    Comments: Accepted at WACV 2025

  26. arXiv:2403.10517

    cs.CV cs.AI cs.CL cs.IR

    VideoAgent: Long-form Video Understanding with Large Language Model as Agent

    Authors: Xiaohan Wang, Yuhui Zhang, Orr Zohar, Serena Yeung-Levy

    Abstract: Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences. Motivated by the human cognitive process for long-form video understanding, we emphasize interactive reasoning and planning over the ability to process lengthy visual inputs. We introduce a novel agent-based system, VideoAgent, that employ…

    Submitted 15 March, 2024; originally announced March 2024.

  27. Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation

    Authors: Juan Manuel Zambrano Chaves, Shih-Cheng Huang, Yanbo Xu, Hanwen Xu, Naoto Usuyama, Sheng Zhang, Fei Wang, Yujia Xie, Mahmoud Khademi, Ziyi Yang, Hany Awadalla, Julia Gong, Houdong Hu, Jianwei Yang, Chunyuan Li, Jianfeng Gao, Yu Gu, Cliff Wong, Mu Wei, Tristan Naumann, Muhao Chen, Matthew P. Lungren, Akshay Chaudhari, Serena Yeung-Levy, Curtis P. Langlotz, et al. (2 additional authors not shown)

    Abstract: The scaling laws and extraordinary performance of large foundation models motivate the development and utilization of such models in biomedicine. However, despite early promising results on some biomedical benchmarks, there are still major challenges that need to be addressed before these models can be used in real-world clinics. Frontier general-domain models such as GPT-4V still have significant…

    Submitted 26 June, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

    Journal ref: Nature Communications volume 16, Article number: 3108 (2025)

  28. arXiv:2402.16806

    cs.CV

    Multi-Human Mesh Recovery with Transformers

    Authors: Zeyu Wang, Zhenzhen Weng, Serena Yeung-Levy

    Abstract: Conventional approaches to human mesh recovery predominantly employ a region-based strategy. This involves initially cropping out a human-centered region as a preprocessing step, with subsequent modeling focused on this zoomed-in image. While effective for single figures, this pipeline poses challenges when dealing with images featuring multiple individuals, as different people are processed separ…

    Submitted 26 February, 2024; originally announced February 2024.

  29. arXiv:2401.14555

    cs.CV cs.LG

    Revisiting Active Learning in the Era of Vision Foundation Models

    Authors: Sanket Rajan Gupte, Josiah Aklilu, Jeffrey J. Nirschl, Serena Yeung-Levy

    Abstract: Foundation vision or vision-language models are trained on large unlabeled or noisy data and learn robust representations that can achieve impressive zero- or few-shot performance on diverse tasks. Given these properties, they are a natural fit for active learning (AL), which aims to maximize labeling efficiency. However, the full potential of foundation models has not been explored in the context…

    Submitted 24 June, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

    Comments: Accepted to TMLR

  30. arXiv:2401.12175

    cs.CV

    Template-Free Single-View 3D Human Digitalization with Diffusion-Guided LRM

    Authors: Zhenzhen Weng, Jingyuan Liu, Hao Tan, Zhan Xu, Yang Zhou, Serena Yeung-Levy, Jimei Yang

    Abstract: Reconstructing 3D humans from a single image has been extensively investigated. However, existing approaches often fall short on capturing fine geometry and appearance details, hallucinating occluded parts with plausible details, and achieving generalization across unseen and in-the-wild datasets. We present Human-LRM, a diffusion-guided feed-forward model that predicts the implicit field of a hum…

    Submitted 14 March, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

    Comments: Project Page: https://zzweng.github.io/humanlrm

  31. arXiv:2401.08567

    cs.LG cs.AI cs.CL cs.CV

    Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data

    Authors: Yuhui Zhang, Elaine Sui, Serena Yeung-Levy

    Abstract: Building cross-modal applications is challenging due to limited paired multi-modal data. Recent works have shown that leveraging a pre-trained multi-modal contrastive representation space enables cross-modal tasks to be learned from uni-modal data. This is based on the assumption that contrastive optimization makes embeddings from different modalities interchangeable. However, this assumption is u…

    Submitted 16 January, 2024; originally announced January 2024.

    Comments: Published at ICLR 2024

  32. arXiv:2312.02974

    cs.CV cs.CL cs.CY cs.LG

    Describing Differences in Image Sets with Natural Language

    Authors: Lisa Dunlap, Yuhui Zhang, Xiaohan Wang, Ruiqi Zhong, Trevor Darrell, Jacob Steinhardt, Joseph E. Gonzalez, Serena Yeung-Levy

    Abstract: How do two sets of images differ? Discerning set-level differences is crucial for understanding model behaviors and analyzing datasets, yet manually sifting through thousands of images is impractical. To aid in this discovery process, we explore the task of automatically describing the differences between two $\textbf{sets}$ of images, which we term Set Difference Captioning. This task takes in im…

    Submitted 26 April, 2024; v1 submitted 5 December, 2023; originally announced December 2023.

    Comments: CVPR 2024 Oral

  33. arXiv:2309.07986

    cs.CV cs.AI cs.LG

    Viewpoint Textual Inversion: Discovering Scene Representations and 3D View Control in 2D Diffusion Models

    Authors: James Burgess, Kuan-Chieh Wang, Serena Yeung-Levy

    Abstract: Text-to-image diffusion models generate impressive and realistic images, but do they learn to represent the 3D world from only 2D supervision? We demonstrate that yes, certain 3D scene representations are encoded in the text embedding space of models like Stable Diffusion. Our approach, Viewpoint Neural Textual Inversion (ViewNeTI), is to discover 3D view tokens; these tokens control the 3D viewpo…

    Submitted 26 July, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: ECCV 2024 (European Conference on Computer Vision). Project page: https://jmhb0.github.io/view_neti/

  34. arXiv:2303.09541

    cs.CV

    Diffusion-HPC: Synthetic Data Generation for Human Mesh Recovery in Challenging Domains

    Authors: Zhenzhen Weng, Laura Bravo-Sánchez, Serena Yeung-Levy

    Abstract: Recent text-to-image generative models have exhibited remarkable abilities in generating high-fidelity and photo-realistic images. However, despite the visually impressive results, these models often struggle to preserve plausible human structure in the generations. Due to this reason, while generative models have shown promising results in aiding downstream image recognition tasks by generating l…

    Submitted 30 December, 2023; v1 submitted 16 March, 2023; originally announced March 2023.