
Showing 1–50 of 94 results for author: Lee, Y J

Searching in archive cs.
  1. arXiv:2504.20998  [pdf, other]

    cs.CV cs.AI

    YoChameleon: Personalized Vision and Language Generation

    Authors: Thao Nguyen, Krishna Kumar Singh, Jing Shi, Trung Bui, Yong Jae Lee, Yuheng Li

    Abstract: Large Multimodal Models (e.g., GPT-4, Gemini, Chameleon) have evolved into powerful tools with millions of users. However, they remain generic models and lack personalized knowledge of specific user concepts. Previous work has explored personalization for text generation, yet it remains unclear how these methods can be adapted to new modalities, such as image generation. In this paper, we introduc…

    Submitted 29 April, 2025; originally announced April 2025.

    Comments: CVPR 2025; Project page: https://thaoshibe.github.io/YoChameleon

  2. arXiv:2504.20996  [pdf, other]

    cs.CV

    X-Fusion: Introducing New Modality to Frozen Large Language Models

    Authors: Sicheng Mo, Thao Nguyen, Xun Huang, Siddharth Srinivasan Iyer, Yijun Li, Yuchen Liu, Abhishek Tandon, Eli Shechtman, Krishna Kumar Singh, Yong Jae Lee, Bolei Zhou, Yuheng Li

    Abstract: We propose X-Fusion, a framework that extends pretrained Large Language Models (LLMs) for multimodal tasks while preserving their language capabilities. X-Fusion employs a dual-tower design with modality-specific weights, keeping the LLM's parameters frozen while integrating vision-specific information for both understanding and generation. Our experiments demonstrate that X-Fusion consistently ou…

    Submitted 29 April, 2025; originally announced April 2025.

    Comments: Project Page: https://sichengmo.github.io/XFusion/

  3. arXiv:2504.00557  [pdf, other]

    cs.CV cs.LG

    Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features

    Authors: Jewon Lee, Ki-Ung Song, Seungmin Yang, Donguk Lim, Jaeyeon Kim, Wooksu Shin, Bo-Kyeong Kim, Yong Jae Lee, Tae-Ho Kim

    Abstract: Visual token reduction lowers inference costs caused by extensive image features in large vision-language models (LVLMs). Unlike relevant studies that prune tokens in self-attention-only LVLMs, our work uniquely addresses cross-attention-based models, which achieve superior performance. We identify that the key-value (KV) cache size for image tokens in cross-attention layers significantly exceeds…

    Submitted 1 April, 2025; originally announced April 2025.

    Comments: accepted at CVPR 2025 Workshop on ELVM

  4. arXiv:2503.13058  [pdf, other]

    cs.CV

    Do Vision Models Develop Human-Like Progressive Difficulty Understanding?

    Authors: Zeyi Huang, Utkarsh Ojha, Yuyang Ji, Donghyun Lee, Yong Jae Lee

    Abstract: When a human undertakes a test, their responses likely follow a pattern: if they answered an easy question $(2 \times 3)$ incorrectly, they would likely answer a more difficult one $(2 \times 3 \times 4)$ incorrectly; and if they answered a difficult question correctly, they would likely answer the easy one correctly. Anything else hints at memorization. Do current visual recognition models exhibi…

    Submitted 17 March, 2025; originally announced March 2025.

  5. Fits like a Flex-Glove: Automatic Design of Personalized FPCB-Based Tactile Sensing Gloves

    Authors: Devin Murphy, Yichen Li, Crystal Owens, Layla Stanton, Young Joong Lee, Paul Pu Liang, Yiyue Luo, Antonio Torralba, Wojciech Matusik

    Abstract: Resistive tactile sensing gloves have captured the interest of researchers spanning diverse domains, such as robotics, healthcare, and human-computer interaction. However, existing fabrication methods often require labor-intensive assembly or costly equipment, limiting accessibility. Leveraging flexible printed circuit board (FPCB) technology, we present an automated pipeline for generating resist…

    Submitted 8 March, 2025; originally announced March 2025.

    Comments: 8 pages, 6 figures, to be published in Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '25)

  6. arXiv:2502.07778  [pdf, other]

    cs.CV

    Stay-Positive: A Case for Ignoring Real Image Features in Fake Image Detection

    Authors: Anirudh Sundara Rajan, Yong Jae Lee

    Abstract: Detecting AI-generated images is a challenging yet essential task. A primary difficulty arises from the detector's tendency to rely on spurious patterns, such as compression artifacts, which can influence its decisions. These issues often stem from specific patterns that the detector associates with the real data distribution, making it difficult to isolate the actual generative traces. We argue th…

    Submitted 11 February, 2025; originally announced February 2025.

  7. arXiv:2501.11899  [pdf, other]

    cs.CV cs.LG

    LASER: Lip Landmark Assisted Speaker Detection for Robustness

    Authors: Le Thien Phuc Nguyen, Zhuoran Yu, Yong Jae Lee

    Abstract: Active Speaker Detection (ASD) aims to identify speaking individuals in complex visual scenes. While humans can easily detect speech by matching lip movements to audio, current ASD models struggle to establish this correspondence, often misclassifying non-speaking instances when audio and lip movements are unsynchronized. To address this limitation, we propose Lip landmark Assisted Speaker dEtecti…

    Submitted 21 January, 2025; originally announced January 2025.

  8. arXiv:2501.04336  [pdf, other]

    cs.CV

    Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs

    Authors: Zeyi Huang, Yuyang Ji, Xiaofang Wang, Nikhil Mehta, Tong Xiao, Donghyun Lee, Sigmund Vanvalkenburgh, Shengxin Zha, Bolin Lai, Licheng Yu, Ning Zhang, Yong Jae Lee, Miao Liu

    Abstract: Long-form video understanding with Large Vision Language Models is challenged by the need to analyze temporally dispersed yet spatially concentrated key moments within limited context windows. In this work, we introduce VideoMindPalace, a new framework inspired by the "Mind Palace", which organizes critical video moments into a topologically structured semantic graph. VideoMindPalace organizes key…

    Submitted 8 January, 2025; originally announced January 2025.

  9. arXiv:2501.02791  [pdf, other]

    math.NA cs.LG

    Orthogonal greedy algorithm for linear operator learning with shallow neural network

    Authors: Ye Lin, Jiwei Jia, Young Ju Lee, Ran Zhang

    Abstract: Greedy algorithms, particularly the orthogonal greedy algorithm (OGA), have proven effective in training shallow neural networks for fitting functions and solving partial differential equations (PDEs). In this paper, we extend the application of OGA to the tasks of linear operator learning, which is equivalent to learning the kernel function through integral transforms. Firstly, a novel greedy alg…

    Submitted 6 January, 2025; originally announced January 2025.

  10. arXiv:2410.11835  [pdf, other]

    cs.CV

    Aligned Datasets Improve Detection of Latent Diffusion-Generated Images

    Authors: Anirudh Sundara Rajan, Utkarsh Ojha, Jedidiah Schloesser, Yong Jae Lee

    Abstract: As latent diffusion models (LDMs) democratize image generation capabilities, there is a growing need to detect fake images. A good detector should focus on the generative model's fingerprints while ignoring image properties such as semantic content, resolution, file format, etc. Fake image detectors are usually built in a data-driven way, where a model is trained to separate real from fake images.…

    Submitted 26 February, 2025; v1 submitted 15 October, 2024; originally announced October 2024.

  11. arXiv:2410.10818  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

    Authors: Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, Jianwei Yang

    Abstract: Understanding fine-grained temporal dynamics is crucial for multimodal video comprehension and generation. Due to the lack of fine-grained temporal annotations, existing video benchmarks mostly resemble static image benchmarks and are incompetent at evaluating models for temporal understanding. In this paper, we introduce TemporalBench, a new benchmark dedicated to evaluating fine-grained temporal…

    Submitted 15 October, 2024; v1 submitted 14 October, 2024; originally announced October 2024.

    Comments: Project Page: https://temporalbench.github.io/

  12. arXiv:2410.02763  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos

    Authors: Jianrui Zhang, Mu Cai, Yong Jae Lee

    Abstract: There has been growing sentiment recently that modern large multimodal models (LMMs) have addressed most of the key challenges related to short video comprehension. As a result, both academia and industry are gradually shifting their attention towards the more complex challenges posed by understanding long-form videos. However, is this really the case? Our studies indicate that LMMs still lack man…

    Submitted 3 October, 2024; originally announced October 2024.

    Comments: Project Page: https://vinoground.github.io

  13. arXiv:2410.00905  [pdf, other]

    cs.CV

    Removing Distributional Discrepancies in Captions Improves Image-Text Alignment

    Authors: Yuheng Li, Haotian Liu, Mu Cai, Yijun Li, Eli Shechtman, Zhe Lin, Yong Jae Lee, Krishna Kumar Singh

    Abstract: In this paper, we introduce a model designed to improve the prediction of image-text alignment, targeting the challenge of compositional understanding in current visual-language models. Our approach focuses on generating high-quality training datasets for the alignment task by producing mixed-type negative captions derived from positive ones. Critically, we address the distribution imbalance betwe…

    Submitted 1 October, 2024; originally announced October 2024.

  14. arXiv:2409.12963  [pdf, other]

    cs.CV cs.AI cs.LG

    Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner

    Authors: Yuzhang Shang, Bingxin Xu, Weitai Kang, Mu Cai, Yuheng Li, Zehao Wen, Zhen Dong, Kurt Keutzer, Yong Jae Lee, Yan Yan

    Abstract: Advancements in Large Language Models (LLMs) inspire various strategies for integrating video modalities. A key approach is Video-LLMs, which incorporate an optimizable interface linking sophisticated video encoders to LLMs. However, due to computation and data limitations, these Video-LLMs are typically pre-trained to process only short videos, limiting their broader application for understanding…

    Submitted 1 October, 2024; v1 submitted 19 September, 2024; originally announced September 2024.

  15. arXiv:2409.06827  [pdf, other]

    cs.CV

    Cross-Modal Self-Supervised Learning with Effective Contrastive Units for LiDAR Point Clouds

    Authors: Mu Cai, Chenxu Luo, Yong Jae Lee, Xiaodong Yang

    Abstract: 3D perception in LiDAR point clouds is crucial for a self-driving vehicle to properly act in a 3D environment. However, manually labeling point clouds is hard and costly. There has been a growing interest in self-supervised pre-training of 3D perception models. Following the success of contrastive learning in images, current methods mostly conduct contrastive pre-training on point clouds only. Yet a…

    Submitted 10 September, 2024; originally announced September 2024.

    Comments: IROS 2024

  16. arXiv:2408.14419  [pdf, other]

    cs.AI cs.CL cs.CV

    CHARTOM: A Visual Theory-of-Mind Benchmark for Multimodal Large Language Models

    Authors: Shubham Bharti, Shiyun Cheng, Jihyun Rho, Jianrui Zhang, Mu Cai, Yong Jae Lee, Martina Rau, Xiaojin Zhu

    Abstract: We introduce CHARTOM, a visual theory-of-mind benchmark for multimodal large language models. CHARTOM consists of specially designed data visualizing charts. Given a chart, a language model needs to not only correctly comprehend the chart (the FACT question) but also judge if the chart will be misleading to a human reader (the MIND question). Both questions have significant societal benefits. We d…

    Submitted 9 May, 2025; v1 submitted 26 August, 2024; originally announced August 2024.

  17. arXiv:2407.10972  [pdf, other]

    cs.CV cs.AI cs.LG

    VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation

    Authors: Bocheng Zou, Mu Cai, Jianrui Zhang, Yong Jae Lee

    Abstract: In the realm of vision models, the primary mode of representation is using pixels to rasterize the visual world. Yet this is not always the best or unique way to represent visual content, especially for designers and artists who depict the world using geometry primitives such as polygons. Vector graphics (VG), on the other hand, offer a textual representation of visual content, which can be more c…

    Submitted 29 August, 2024; v1 submitted 15 July, 2024; originally announced July 2024.

    Comments: Project Page: https://vgbench.github.io

  18. arXiv:2407.09541  [pdf, other]

    cs.CL cs.AI cs.CV

    MATE: Meet At The Embedding -- Connecting Images with Long Texts

    Authors: Young Kyun Jang, Junmo Kang, Yong Jae Lee, Donghyun Kim

    Abstract: While advancements in Vision Language Models (VLMs) have significantly improved the alignment of visual and textual data, these models primarily focus on aligning images with short descriptive captions. This focus limits their ability to handle complex text interactions, particularly with longer texts such as lengthy captions or documents, which have not been extensively explored yet. In this pape…

    Submitted 26 June, 2024; originally announced July 2024.

  19. arXiv:2407.03593  [pdf, other]

    math.NA cs.LG

    Green Multigrid Network

    Authors: Ye Lin, Young Ju Lee, Jiwei Jia

    Abstract: GreenLearning networks (GL) directly learn Green's function in physical space, making them an interpretable model for capturing unknown solution operators of partial differential equations (PDEs). For many PDEs, the corresponding Green's function exhibits asymptotic smoothness. In this paper, we propose a framework named Green Multigrid networks (GreenMGNet), an operator learning algorithm designe…

    Submitted 3 July, 2024; originally announced July 2024.

  20. arXiv:2406.20095  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV cs.LG

    LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

    Authors: Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael S. Ryoo

    Abstract: Vision Language Models (VLMs) have recently been leveraged to generate robotic actions, forming Vision-Language-Action (VLA) models. However, directly adapting a pretrained VLM for robotic control remains challenging, particularly when constrained by a limited number of robot demonstrations. In this work, we introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot…

    Submitted 30 January, 2025; v1 submitted 28 June, 2024; originally announced June 2024.

    Comments: ICLR 2025

  21. arXiv:2406.09400  [pdf, other]

    cs.CV cs.LG

    Yo'LLaVA: Your Personalized Language and Vision Assistant

    Authors: Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, Yong Jae Lee

    Abstract: Large Multimodal Models (LMMs) have shown remarkable capabilities across a variety of tasks (e.g., image captioning, visual question answering). While broad, their knowledge remains generic (e.g., recognizing a dog), and they are unable to handle personalized subjects (e.g., recognizing a user's pet dog). Human reasoning, in contrast, typically operates within the context of specific subjects in o…

    Submitted 4 December, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: NeurIPS 2024; Project page: https://thaoshibe.github.io/YoLLaVA

  22. arXiv:2405.17430  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Matryoshka Multimodal Models

    Authors: Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee

    Abstract: Large Multimodal Models (LMMs) such as LLaVA have shown strong performance in visual-linguistic reasoning. These models first embed images into a fixed large number of visual tokens and then feed them into a Large Language Model (LLM). However, this design causes an excessive number of tokens for dense visual scenarios such as high-resolution images and videos, leading to great inefficiency. While…

    Submitted 29 July, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

    Comments: Project Page: https://matryoshka-mm.github.io/

  23. arXiv:2403.15388  [pdf, other]

    cs.CV cs.AI cs.CL

    LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

    Authors: Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan

    Abstract: Large Multimodal Models (LMMs) have shown significant visual reasoning capabilities by connecting a visual encoder and a large language model. LMMs typically take in a fixed and large amount of visual tokens, such as the penultimate layer features in the CLIP visual encoder, as the prefix content. Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which…

    Submitted 22 May, 2024; v1 submitted 22 March, 2024; originally announced March 2024.

    Comments: Project page: https://llava-prumerge.github.io/

  24. arXiv:2402.16363  [pdf, other]

    cs.CL cs.AI

    LLM Inference Unveiled: Survey and Roofline Model Insights

    Authors: Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer

    Abstract: The field of efficient Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges. Although the field has expanded and is vibrant, there hasn't been a concise framework that analyzes the various methods of LLM Inference to provide a clear understanding of this domain. Our survey stands out from traditional literature reviews by not only summ…

    Submitted 1 May, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

  25. arXiv:2402.15583  [pdf, other]

    cs.CV cs.LG

    Cohere3D: Exploiting Temporal Coherence for Unsupervised Representation Learning of Vision-based Autonomous Driving

    Authors: Yichen Xie, Hongge Chen, Gregory P. Meyer, Yong Jae Lee, Eric M. Wolff, Masayoshi Tomizuka, Wei Zhan, Yuning Chai, Xin Huang

    Abstract: Due to the lack of depth cues in images, multi-frame inputs are important for the success of vision-based perception, prediction, and planning in autonomous driving. Observations from different angles enable the recovery of 3D object states from 2D image inputs if we can identify the same instance in different input frames. However, the dynamic nature of autonomous driving scenes leads to signific…

    Submitted 23 February, 2024; originally announced February 2024.

  26. arXiv:2402.13254  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples

    Authors: Jianrui Zhang, Mu Cai, Tengyang Xie, Yong Jae Lee

    Abstract: We propose CounterCurate, a framework to comprehensively improve the visio-linguistic compositional reasoning capability for both contrastive and generative multimodal models. In particular, we identify two critical under-explored problems: the neglect of the physically grounded reasoning (counting and position understanding) and the potential of using highly capable text and image generation mode…

    Submitted 12 June, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

    Comments: 15 pages, 6 figures, 12 tables, Project Page: https://countercurate.github.io/

  27. arXiv:2401.10219  [pdf, other]

    cs.CV

    Edit One for All: Interactive Batch Image Editing

    Authors: Thao Nguyen, Utkarsh Ojha, Yuheng Li, Haotian Liu, Yong Jae Lee

    Abstract: In recent years, image editing has advanced remarkably. With increased human control, it is now possible to edit an image in a plethora of ways; from specifying in text what we want to change, to straight up dragging the contents of the image in an interactive point-based manner. However, most of the focus has remained on editing single images at a time. Whether and how we can simultaneously edit…

    Submitted 18 January, 2024; originally announced January 2024.

    Comments: Project page: https://thaoshibe.github.io/edit-one-for-all/

  28. arXiv:2312.07532  [pdf, other]

    cs.CV cs.AI cs.CL

    Interfacing Foundation Models' Embeddings

    Authors: Xueyan Zou, Linjie Li, Jianfeng Wang, Jianwei Yang, Mingyu Ding, Junyi Wei, Zhengyuan Yang, Feng Li, Hao Zhang, Shilong Liu, Arul Aravinthan, Yong Jae Lee, Lijuan Wang

    Abstract: Foundation models possess strong capabilities in reasoning and memorizing across modalities. To further unleash the power of foundation models, we present FIND, a generalized interface for aligning foundation models' embeddings with unified image and dataset-level understanding spanning modality and granularity. As shown in the teaser figure, a lightweight transformer interface without tuning any…

    Submitted 15 July, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

    Comments: CODE: https://github.com/UX-Decoder/FIND

  29. arXiv:2312.02253  [pdf, other]

    cs.CV cs.AI cs.LG

    Diversify, Don't Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images

    Authors: Zhuoran Yu, Chenchen Zhu, Sean Culatana, Raghuraman Krishnamoorthi, Fanyi Xiao, Yong Jae Lee

    Abstract: Recent advances in generative deep learning have enabled the creation of high-quality synthetic images in text-to-image generation. Prior work shows that fine-tuning a pretrained diffusion model on ImageNet and generating synthetic training images from the finetuned model can enhance an ImageNet classifier's performance. However, performance degrades as synthetic images outnumber real ones. In thi…

    Submitted 21 January, 2025; v1 submitted 4 December, 2023; originally announced December 2023.

    Comments: Accepted by Transactions on Machine Learning Research (TMLR)

  30. arXiv:2312.00784  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

    Authors: Mu Cai, Haotian Liu, Dennis Park, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Yong Jae Lee

    Abstract: While existing large vision-language multimodal models focus on whole image understanding, there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge, we introduce a novel multimodal model capable of decoding arbitrary visual…

    Submitted 26 April, 2024; v1 submitted 1 December, 2023; originally announced December 2023.

    Comments: Accepted to CVPR2024. Project page: https://vip-llava.github.io/

  31. arXiv:2311.07377  [pdf, other]

    cs.SE cs.AI cs.DC cs.RO

    Testing learning-enabled cyber-physical systems with Large-Language Models: A Formal Approach

    Authors: Xi Zheng, Aloysius K. Mok, Ruzica Piskac, Yong Jae Lee, Bhaskar Krishnamachari, Dakai Zhu, Oleg Sokolsky, Insup Lee

    Abstract: The integration of machine learning (ML) into cyber-physical systems (CPS) offers significant benefits, including enhanced efficiency, predictive capabilities, real-time responsiveness, and the enabling of autonomous operations. This convergence has accelerated the development and deployment of a range of real-world applications, such as autonomous vehicles, delivery drones, service robots, and te…

    Submitted 16 May, 2024; v1 submitted 13 November, 2023; originally announced November 2023.

  32. arXiv:2311.05889  [pdf, other]

    eess.IV cs.CV cs.LG

    Semantic Map Guided Synthesis of Wireless Capsule Endoscopy Images using Diffusion Models

    Authors: Haejin Lee, Jeongwoo Ju, Jonghyuck Lee, Yeoun Joo Lee, Heechul Jung

    Abstract: Wireless capsule endoscopy (WCE) is a non-invasive method for visualizing the gastrointestinal (GI) tract, crucial for diagnosing GI tract diseases. However, interpreting WCE results can be time-consuming and tiring. Existing studies have employed deep neural networks (DNNs) for automatic GI tract lesion detection, but acquiring sufficient training examples, particularly due to privacy concerns, r…

    Submitted 10 November, 2023; originally announced November 2023.

  33. arXiv:2310.03744  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Improved Baselines with Visual Instruction Tuning

    Authors: Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee

    Abstract: Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response form…

    Submitted 15 May, 2024; v1 submitted 5 October, 2023; originally announced October 2023.

    Comments: Camera ready, CVPR 2024 (highlight). LLaVA project page: https://llava-vl.github.io

  34. arXiv:2309.12530  [pdf, other]

    cs.CV

    A Sentence Speaks a Thousand Images: Domain Generalization through Distilling CLIP with Language Guidance

    Authors: Zeyi Huang, Andy Zhou, Zijian Lin, Mu Cai, Haohan Wang, Yong Jae Lee

    Abstract: Domain generalization studies the problem of training a model with samples from several domains (or distributions) and then testing the model with samples from a new, unseen domain. In this paper, we propose a novel approach for domain generalization that leverages recent advances in large vision-language models, specifically a CLIP teacher model, to train a smaller model that generalizes to unsee…

    Submitted 21 September, 2023; originally announced September 2023.

    Comments: to appear at ICCV2023

  35. arXiv:2309.10313  [pdf, other]

    cs.CL cs.AI cs.LG

    Investigating the Catastrophic Forgetting in Multimodal Large Language Models

    Authors: Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, Yi Ma

    Abstract: Following the success of GPT4, there has been a surge in interest in multimodal large language model (MLLM) research. This line of research focuses on developing general-purpose LLMs through fine-tuning pre-trained LLMs and vision models. However, catastrophic forgetting, a notorious phenomenon where the fine-tuned model fails to retain similar performance compared to the pre-trained model, still…

    Submitted 5 December, 2023; v1 submitted 19 September, 2023; originally announced September 2023.

  36. arXiv:2307.14331  [pdf, other]

    cs.CV

    Visual Instruction Inversion: Image Editing via Visual Prompting

    Authors: Thao Nguyen, Yuheng Li, Utkarsh Ojha, Yong Jae Lee

    Abstract: Text-conditioned image editing has emerged as a powerful tool for editing images. However, in many situations, language can be ambiguous and ineffective in describing specific image edits. When faced with such challenges, visual prompts can be a more informative and intuitive way to convey ideas. We present a method for image editing via visual prompting. Given pairs of examples that represent the…

    Submitted 26 July, 2023; originally announced July 2023.

    Comments: Project page: https://thaoshibe.github.io/visii/

  37. arXiv:2307.13697  [pdf, other]

    cs.CV cs.AI

    Benchmarking and Analyzing Generative Data for Visual Recognition

    Authors: Bo Li, Haotian Liu, Liangyu Chen, Yong Jae Lee, Chunyuan Li, Ziwei Liu

    Abstract: Advancements in large pre-trained generative models have expanded their potential as effective data generators in visual recognition. This work delves into the impact of generative images, primarily comparing paradigms that harness external data (i.e., generative vs. retrieval vs. original). Our key contributions are: 1) GenBench Construction: We devise GenBench, a broad benchmar…

    Submitted 25 July, 2023; originally announced July 2023.

    Comments: Research Report

  38. arXiv:2306.17154  [pdf, other]

    cs.CV

    Generate Anything Anywhere in Any Scene

    Authors: Yuheng Li, Haotian Liu, Yangming Wen, Yong Jae Lee

    Abstract: Text-to-image diffusion models have attracted considerable interest due to their wide applicability across diverse fields. However, challenges persist in creating controllable models for personalized object generation. In this paper, we first identify the entanglement issues in existing personalized generative models, and then propose a straightforward and efficient data augmentation training stra…

    Submitted 29 June, 2023; originally announced June 2023.

  39. arXiv:2306.06094  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding

    Authors: Mu Cai, Zeyi Huang, Yuheng Li, Utkarsh Ojha, Haohan Wang, Yong Jae Lee

    Abstract: Large language models (LLMs) have made significant advancements in natural language understanding. However, through that enormous semantic representation that the LLM has learnt, is it somehow possible for it to understand images as well? This work investigates this question. To enable the LLM to process images, we convert them into a representation given by Scalable Vector Graphics (SVG). To stud…

    Submitted 11 July, 2024; v1 submitted 9 June, 2023; originally announced June 2023.

  40. arXiv:2304.08485  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Visual Instruction Tuning

    Authors: Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee

    Abstract: Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce…

    Submitted 11 December, 2023; v1 submitted 17 April, 2023; originally announced April 2023.

    Comments: NeurIPS 2023 Oral; project page: https://llava-vl.github.io/

  41. arXiv:2304.06718  [pdf, other]

    cs.CV

    Segment Everything Everywhere All at Once

    Authors: Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, Yong Jae Lee

    Abstract: In this work, we present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image, as shown in Fig.1. In SEEM, we propose a novel decoding mechanism that enables diverse prompting for all types of segmentation tasks, aiming at a universal segmentation interface that behaves like large language models (LLMs). More specifically, SEEM is designed with four…

    Submitted 11 July, 2023; v1 submitted 13 April, 2023; originally announced April 2023.

  42. arXiv:2303.07269  [pdf, other]

    cs.CV cs.LG

    InPL: Pseudo-labeling the Inliers First for Imbalanced Semi-supervised Learning

    Authors: Zhuoran Yu, Yin Li, Yong Jae Lee

    Abstract: Recent state-of-the-art methods in imbalanced semi-supervised learning (SSL) rely on confidence-based pseudo-labeling with consistency regularization. To obtain high-quality pseudo-labels, a high confidence threshold is typically adopted. However, it has been shown that softmax-based confidence scores in deep networks can be arbitrarily high for samples far from the training data, and thus, the ps…

    Submitted 13 March, 2023; originally announced March 2023.

    Comments: Accepted by ICLR 2023

  43. arXiv:2302.10174  [pdf, other]

    cs.CV cs.LG

    Towards Universal Fake Image Detectors that Generalize Across Generative Models

    Authors: Utkarsh Ojha, Yuheng Li, Yong Jae Lee

    Abstract: With generative models proliferating at a rapid rate, there is a growing need for general purpose fake image detectors. In this work, we first show that the existing paradigm, which consists of training a deep network for real-vs-fake classification, fails to detect fake images from newer breeds of generative models when trained to detect GAN fake images. Upon analysis, we find that the resulting…

    Submitted 1 April, 2024; v1 submitted 20 February, 2023; originally announced February 2023.

  44. arXiv:2301.07094  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Learning Customized Visual Models with Retrieval-Augmented Knowledge

    Authors: Haotian Liu, Kilho Son, Jianwei Yang, Ce Liu, Jianfeng Gao, Yong Jae Lee, Chunyuan Li

    Abstract: Image-text contrastive learning models such as CLIP have demonstrated strong task transfer ability. The high generality and usability of these visual models is achieved via a web-scale data collection process to ensure broad concept coverage, followed by expensive pre-training to feed all the knowledge into model weights. Alternatively, we propose REACT, REtrieval-Augmented CusTomization, a framew…

    Submitted 17 January, 2023; originally announced January 2023.

  45. arXiv:2301.07093  [pdf, other]

    cs.CV cs.AI cs.CL cs.GR cs.LG

    GLIGEN: Open-Set Grounded Text-to-Image Generation

    Authors: Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, Yong Jae Lee

    Abstract: Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding…

    Submitted 16 April, 2023; v1 submitted 17 January, 2023; originally announced January 2023.

  46. arXiv:2212.11270  [pdf, other]

    cs.CV cs.CL

    Generalized Decoding for Pixel, Image, and Language

    Authors: Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee, Jianfeng Gao

    Abstract: We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that…

    Submitted 21 December, 2022; originally announced December 2022.

    Comments: https://x-decoder-vl.github.io

  47. arXiv:2212.04875  [pdf, other]

    cs.CV cs.AI

    Expeditious Saliency-guided Mix-up through Random Gradient Thresholding

    Authors: Minh-Long Luu, Zeyi Huang, Eric P. Xing, Yong Jae Lee, Haohan Wang

    Abstract: Mix-up training approaches have proven to be effective in improving the generalization ability of Deep Neural Networks. Over the years, the research community expands mix-up methods into two directions, with extensive efforts to improve saliency-guided procedures but minimal focus on the arbitrary path, leaving the randomization domain unexplored. In this paper, inspired by the superior qualities…

    Submitted 10 August, 2023; v1 submitted 9 December, 2022; originally announced December 2022.

    Comments: Accepted Long paper at 2nd Practical-DL Workshop at AAAI 2023

  48. arXiv:2211.02707  [pdf, other]

    cs.CV

    Contrastive Learning for Diverse Disentangled Foreground Generation

    Authors: Yuheng Li, Yijun Li, Jingwan Lu, Eli Shechtman, Yong Jae Lee, Krishna Kumar Singh

    Abstract: We introduce a new method for diverse foreground generation with explicit control over various factors. Existing image inpainting based foreground generation methods often struggle to generate diverse results and rarely allow users to explicitly control specific factors of variation (e.g., varying the facial identity or expression for face inpainting results). We leverage contrastive learning with…

    Submitted 4 November, 2022; originally announced November 2022.

    Comments: ECCV 2022

  49. arXiv:2209.06723  [pdf]

    cs.CL

    Toward Improving Health Literacy in Patient Education Materials with Neural Machine Translation Models

    Authors: David Oniani, Sreekanth Sreekumar, Renuk DeAlmeida, Dinuk DeAlmeida, Vivian Hui, Young Ji Lee, Yiye Zhang, Leming Zhou, Yanshan Wang

    Abstract: Health literacy is the central focus of Healthy People 2030, the fifth iteration of the U.S. national goals and objectives. People with low health literacy usually have trouble understanding health information, following post-visit instructions, and using prescriptions, which results in worse health outcomes and serious health disparities. In this study, we propose to leverage natural language pro…

    Submitted 14 September, 2022; originally announced September 2022.

  50. arXiv:2206.06359  [pdf, other]

    cs.CV cs.AI cs.LG

    EnergyMatch: Energy-based Pseudo-Labeling for Semi-Supervised Learning

    Authors: Zhuoran Yu, Yin Li, Yong Jae Lee

    Abstract: Recent state-of-the-art methods in semi-supervised learning (SSL) combine consistency regularization with confidence-based pseudo-labeling. To obtain high-quality pseudo-labels, a high confidence threshold is typically adopted. However, it has been shown that softmax-based confidence scores in deep networks can be arbitrarily high for samples far from the training data, and thus, the pseudo-labels…

    Submitted 13 June, 2022; originally announced June 2022.