
Showing 1–29 of 29 results for author: Lin, K Q

Searching in archive cs.
  1. arXiv:2505.21497  [pdf, ps, other]

    cs.CV cs.AI cs.CL cs.MA

    Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers

    Authors: Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, Philip Torr

    Abstract: Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Qualit…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: Project Page: https://github.com/Paper2Poster/Paper2Poster

  2. arXiv:2505.16854  [pdf, other]

    cs.AI cs.CV

    Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models

    Authors: Jiaqi Wang, Kevin Qinghong Lin, James Cheng, Mike Zheng Shou

    Abstract: Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human-like thinking process-where peop…

    Submitted 23 May, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

    Comments: update more examples in appendix

  3. arXiv:2503.15661  [pdf, other]

    cs.CV cs.AI cs.CL

    UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

    Authors: Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A. Rodriguez, Montek Kalsi, Rabiul Awal, Nicolas Chapados, M. Tamer Özsu, Aishwarya Agrawal, David Vazquez, Christopher Pal, Perouz Taslakian, Spandana Gella, Sai Rajeswar

    Abstract: Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments, critical for many professional and everyday tasks, remain underexplored due to data collection challenges and licensing issues. We introduce UI-Vision, the first…

    Submitted 6 May, 2025; v1 submitted 19 March, 2025; originally announced March 2025.

    Comments: This paper has been accepted to the 42nd International Conference on Machine Learning (ICML 2025)

  4. arXiv:2503.13444  [pdf, other]

    cs.CV cs.AI

    VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning

    Authors: Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, Mike Zheng Shou

    Abstract: Videos, with their unique temporal dimension, demand precise grounded understanding, where answers are directly linked to visual, interpretable evidence. Despite significant breakthroughs in reasoning capabilities within Large Language Models, multi-modal reasoning - especially for videos - remains unexplored. In this work, we introduce VideoMind, a novel video-language agent designed for temporal…

    Submitted 31 March, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

    Comments: Project Page: https://videomind.github.io/

  5. arXiv:2503.09402  [pdf, ps, other]

    cs.CV

    VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary

    Authors: Kevin Qinghong Lin, Mike Zheng Shou

    Abstract: Human daily activities can be concisely narrated as sequences of routine events (e.g., turning off an alarm) in video streams, forming an event vocabulary. Motivated by this, we introduce VLog, a novel video understanding framework that defines video narrations as vocabulary, going beyond the typical subword vocabularies in existing generative video-language models. Built on the lightweight languag…

    Submitted 9 June, 2025; v1 submitted 12 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR 2025. Github: https://github.com/showlab/VLog

  6. arXiv:2412.11621  [pdf, other]

    cs.CV cs.MM

    VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting

    Authors: Muhammet Furkan Ilaslan, Ali Koksal, Kevin Qinghong Lin, Burak Satar, Mike Zheng Shou, Qianli Xu

    Abstract: Large Language Model (LLM)-based agents have shown promise in procedural tasks, but the potential of multimodal instructions augmented by texts and videos to assist users remains under-explored. To address this gap, we propose the Visually Grounded Text-Video Prompting (VG-TVP) method which is a novel LLM-empowered Multimodal Procedural Planning (MPP) framework. It generates cohesive text and vide…

    Submitted 16 December, 2024; originally announced December 2024.

    Comments: Accepted for The 39th Annual AAAI Conference on Artificial Intelligence 2025 in Main Track, 19 pages, 24 figures

  7. arXiv:2411.17949  [pdf, other]

    cs.CV

    ROICtrl: Boosting Instance Control for Visual Generation

    Authors: Yuchao Gu, Yipin Zhou, Yunfan Ye, Yixin Nie, Licheng Yu, Pingchuan Ma, Kevin Qinghong Lin, Mike Zheng Shou

    Abstract: Natural language often struggles to accurately associate positional and attribute information with multiple instances, which limits current text-based visual generation models to simpler compositions featuring only a few dominant instances. To address this limitation, this work enhances diffusion models by introducing regional instance control, where each instance is governed by a bounding box pai…

    Submitted 26 November, 2024; originally announced November 2024.

    Comments: Project page at https://roictrl.github.io/

  8. arXiv:2411.17465  [pdf, other]

    cs.CV cs.AI cs.CL cs.HC

    ShowUI: One Vision-Language-Action Model for GUI Visual Agent

    Authors: Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou

    Abstract: Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source API with text-rich meta-information (e.g., HTML or accessibility tree), they show limitations in perceiving UI visuals as humans do, highlighting the need for GUI visual agents. In this work, we develop a vision-langu…

    Submitted 26 November, 2024; originally announced November 2024.

    Comments: Technical Report. Github: https://github.com/showlab/ShowUI

  9. arXiv:2411.15262  [pdf, other]

    cs.CV

    MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation

    Authors: Weijia Wu, Mingyu Liu, Zeyu Zhu, Xi Xia, Haoen Feng, Wen Wang, Kevin Qinghong Lin, Chunhua Shen, Mike Zheng Shou

    Abstract: Recent advancements in video generation models, like Stable Video Diffusion, show promising results, but primarily focus on short, single-scene videos. These models struggle with generating long videos that involve multiple scenes, coherent narratives, and consistent characters. Furthermore, there is no publicly available dataset tailored for the analysis, evaluation, and training of long video ge…

    Submitted 30 March, 2025; v1 submitted 22 November, 2024; originally announced November 2024.

    Comments: The project website is at: https://weijiawu.github.io/MovieBench/. Code: https://github.com/showlab/MovieBecnh

    Journal ref: CVPR 2025

  10. arXiv:2408.16730  [pdf, other]

    cs.CV

    VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

    Authors: Shiwei Wu, Joya Chen, Kevin Qinghong Lin, Qimeng Wang, Yan Gao, Qianli Xu, Tong Xu, Yao Hu, Enhong Chen, Mike Zheng Shou

    Abstract: A well-known dilemma in large vision-language models (e.g., GPT-4, LLaVA) is that while increasing the number of vision tokens generally enhances visual understanding, it also significantly raises memory and computational costs, especially in long-term, dense video frame streaming scenarios. Although learnable approaches like Q-Former and Perceiver Resampler have been developed to reduce the visio…

    Submitted 29 August, 2024; originally announced August 2024.

  11. arXiv:2408.12528  [pdf, other]

    cs.CV

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Authors: Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, Mike Zheng Shou

    Abstract: We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and generation. Unlike fully autoregressive models, Show-o unifies autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities. The unified model flexibly supports a wide range of vision-language tasks including visual question-answering, text-to-image…

    Submitted 20 October, 2024; v1 submitted 22 August, 2024; originally announced August 2024.

    Comments: Technical Report

  12. arXiv:2407.21757  [pdf, other]

    cs.CV cs.MM

    Learning Video Context as Interleaved Multimodal Sequences

    Authors: Kevin Qinghong Lin, Pengchuan Zhang, Difei Gao, Xide Xia, Joya Chen, Ziteng Gao, Jinheng Xie, Xuhong Xiao, Mike Zheng Shou

    Abstract: Narrative videos, such as movies, pose significant challenges in video understanding due to their rich contexts (characters, dialogues, storylines) and diverse demands (identify who, relationship, and reason). In this paper, we introduce MovieSeq, a multimodal language model developed to address the wide range of challenges in understanding video contexts. Our core idea is to represent videos as i…

    Submitted 12 September, 2024; v1 submitted 31 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  13. arXiv:2406.13719  [pdf, other]

    cs.CV

    GUI Action Narrator: Where and When Did That Action Take Place?

    Authors: Qinchen Wu, Difei Gao, Kevin Qinghong Lin, Zhuoyu Wu, Xiangwu Guo, Peiran Li, Weichen Zhang, Hengxu Wang, Mike Zheng Shou

    Abstract: The advent of Multimodal LLMs has significantly enhanced image OCR recognition capabilities, making GUI automation a viable reality for increasing efficiency in digital tasks. One fundamental aspect of developing a GUI automation system is understanding primitive GUI actions. This comprehension is crucial as it enables agents to learn from user demonstrations, an essential element of automation. T…

    Submitted 19 June, 2024; originally announced June 2024.

  14. arXiv:2406.11816  [pdf, other]

    cs.CV

    VideoLLM-online: Online Video Large Language Model for Streaming Video

    Authors: Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou

    Abstract: Recent Large Language Models have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content. However, the learning methods of these large multimodal models typically treat videos as predetermined clips, making them less effective and efficient at handling streaming video inputs. In this paper, we propose a novel Learning-In-Video-St…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: CVPR 2024. This arXiv version is upgraded with Llama-3

  15. arXiv:2406.10227  [pdf, other]

    cs.CV cs.AI

    VideoGUI: A Benchmark for GUI Automation from Instructional Videos

    Authors: Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou

    Abstract: Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks. Existing task formulations primarily focus on simple tasks that can be specified by a single, language-only instruction, such as "Insert a new slide." In this work, we introduce VideoGUI, a novel multi-modal benchmark designed to evaluate GUI assistants on visual-c…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: 24 pages, 16 tables, 17 figures

  16. arXiv:2404.15909  [pdf, other]

    cs.CV

    Learning Long-form Video Prior via Generative Pre-Training

    Authors: Jinheng Xie, Jiajun Feng, Zhaoxu Tian, Kevin Qinghong Lin, Yawen Huang, Xi Xia, Nanxu Gong, Xu Zuo, Jiaqi Yang, Yefeng Zheng, Mike Zheng Shou

    Abstract: Concepts involved in long-form videos such as people, objects, and their interactions, can be viewed as following an implicit prior. They are notably complex and continue to pose challenges to be comprehensively learned. In recent years, generative pre-training (GPT) has exhibited versatile capacities in modeling any kind of text content even visual locations. Can this manner work for learning lon…

    Submitted 24 April, 2024; originally announced April 2024.

  17. arXiv:2401.00849  [pdf, other]

    cs.CV

    COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

    Authors: Alex Jinpeng Wang, Linjie Li, Kevin Qinghong Lin, Jianfeng Wang, Kevin Lin, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou

    Abstract: In the evolution of Vision-Language Pre-training, shifting from short-text comprehension to encompassing extended textual contexts is pivotal. Recent autoregressive vision-language models like Flamingo and PaLM-E, leveraging the long-context capability of Large Language Models, have excelled in few-shot text generation tasks but face challenges in alignment tasks. Addressing this gap, we introd…

    Submitted 1 January, 2024; originally announced January 2024.

    Comments: 16 pages; Website: http://fingerrec.github.io/cosmo

  18. arXiv:2312.01987  [pdf, other]

    cs.CV

    Bootstrapping SparseFormers from Vision Foundation Models

    Authors: Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou

    Abstract: The recently proposed SparseFormer architecture provides an alternative approach to visual understanding by utilizing a significantly lower number of visual tokens via adjusting RoIs, greatly reducing computational costs while still achieving promising performance. However, training SparseFormers from scratch is still expensive, and scaling up the number of parameters can be challenging. In this p…

    Submitted 4 April, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

    Comments: CVPR 2024

  19. arXiv:2308.15109  [pdf, other]

    cs.CV

    DiffusionVMR: Diffusion Model for Joint Video Moment Retrieval and Highlight Detection

    Authors: Henghao Zhao, Kevin Qinghong Lin, Rui Yan, Zechao Li

    Abstract: Video moment retrieval and highlight detection have received attention in the current era of video content proliferation, aiming to localize moments and estimate clip relevances based on user-specific queries. Given that the video content is continuous in time, there is often a lack of clear boundaries between temporal events in a video. This boundary ambiguity makes it challenging for the model t…

    Submitted 2 March, 2024; v1 submitted 29 August, 2023; originally announced August 2023.

  20. arXiv:2307.16715  [pdf, other]

    cs.CV

    UniVTG: Towards Unified Video-Language Temporal Grounding

    Authors: Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou

    Abstract: Video Temporal Grounding (VTG), which aims to ground target clips from videos (such as consecutive intervals or disjoint shots) according to custom language queries (e.g., sentences or words), is key for video browsing on social media. Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detect…

    Submitted 18 August, 2023; v1 submitted 31 July, 2023; originally announced July 2023.

    Comments: Accepted by ICCV 2023. 16 pages, 10 figures, 13 tables. Code: https://github.com/showlab/UniVTG

  21. arXiv:2307.05463  [pdf, other]

    cs.CV

    EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone

    Authors: Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, Pengchuan Zhang

    Abstract: Video-language pre-training (VLP) has become increasingly important due to its ability to generalize to various vision and language tasks. However, existing egocentric VLP frameworks utilize separate video and language encoders and learn task-specific cross-modal information only during fine-tuning, limiting the development of a unified system. In this work, we introduce the second generation of e…

    Submitted 18 August, 2023; v1 submitted 11 July, 2023; originally announced July 2023.

    Comments: Published in ICCV 2023

  22. arXiv:2306.08640  [pdf, other]

    cs.CV

    AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn

    Authors: Difei Gao, Lei Ji, Luowei Zhou, Kevin Qinghong Lin, Joya Chen, Zihan Fan, Mike Zheng Shou

    Abstract: Recent research on Large Language Models (LLMs) has led to remarkable advancements in general NLP AI assistants. Some studies have further explored the use of LLMs for planning and invoking models or APIs to address more general multi-modal user queries. Despite this progress, complex visual-based tasks still remain challenging due to the diverse nature of visual tasks. This diversity is reflected…

    Submitted 28 June, 2023; v1 submitted 14 June, 2023; originally announced June 2023.

    Comments: Project page: https://showlab.github.io/assistgpt/

  23. arXiv:2305.20087  [pdf, other]

    cs.CV

    Too Large; Data Reduction for Vision-Language Pre-Training

    Authors: Alex Jinpeng Wang, Kevin Qinghong Lin, David Junhao Zhang, Stan Weixian Lei, Mike Zheng Shou

    Abstract: This paper examines the problems of severe image-text misalignment and high redundancy in the widely-used large-scale Vision-Language Pre-Training (VLP) datasets. To address these issues, we propose an efficient and straightforward Vision-Language learning algorithm called TL;DR, which aims to compress the existing large VLP data into a small, high-quality set. Our approach consists of two major s…

    Submitted 18 August, 2023; v1 submitted 31 May, 2023; originally announced May 2023.

    Comments: ICCV 2023. Code: https://github.com/showlab/datacentric.vlp

  24. arXiv:2305.13777  [pdf, other]

    cs.CV

    VisorGPT: Learning Visual Prior via Generative Pre-Training

    Authors: Jinheng Xie, Kai Ye, Yudong Li, Yuexiang Li, Kevin Qinghong Lin, Yefeng Zheng, Linlin Shen, Mike Zheng Shou

    Abstract: Various stuff and things in visual data possess specific traits, which can be learned by deep neural networks and are implicitly represented as the visual prior, e.g., object location and shape, in the model. Such prior potentially impacts many vision tasks. For example, in conditional image synthesis, spatial conditions failing to adhere to the prior can result in visually inaccurate synthetic re…

    Submitted 30 May, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: Project web-page: https://sierkinhane.github.io/visor-gpt/

  25. arXiv:2303.14644  [pdf, other]

    cs.CV

    Affordance Grounding from Demonstration Video to Target Image

    Authors: Joya Chen, Difei Gao, Kevin Qinghong Lin, Mike Zheng Shou

    Abstract: Humans excel at learning from expert demonstrations and solving their own problems. To equip intelligent robots and assistants, such as AR glasses, with this ability, it is essential to ground human hand interactions (i.e., affordances) from demonstration videos and apply them to a target image like a user's AR glass view. The video-to-image affordance grounding task is challenging due to (1) the…

    Submitted 26 March, 2023; originally announced March 2023.

    Comments: CVPR 2023

  26. arXiv:2209.11475  [pdf, other]

    cs.CV cs.IR

    Unsupervised Hashing with Semantic Concept Mining

    Authors: Rong-Cheng Tu, Xian-Ling Mao, Kevin Qinghong Lin, Chengfei Cai, Weize Qin, Hongfa Wang, Wei Wei, Heyan Huang

    Abstract: Recently, to improve the unsupervised image retrieval performance, plenty of unsupervised hashing methods have been proposed by designing a semantic similarity matrix, which is based on the similarities between image features extracted by a pre-trained CNN model. However, most of these methods tend to ignore high-level abstract semantic concepts contained in images. Intuitively, concepts play an i…

    Submitted 23 September, 2022; originally announced September 2022.

  27. arXiv:2207.01622  [pdf, other]

    cs.CV

    Egocentric Video-Language Pretraining @ Ego4D Challenge 2022

    Authors: Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou

    Abstract: In this report, we propose a video-language pretraining (VLP) based solution, EgoVLP (arXiv:2206.01670), for four Ego4D challenge tasks, including Natural Language Query (NLQ), Moment Query (MQ), Object State Change Classification (OSCC), and PNR Localization (PNR). Specifically, we exploit the recently released Ego4D dataset (Grauman et al., 2021) to pioneer Egocentric VLP from pretraining dataset, pre…

    Submitted 3 August, 2022; v1 submitted 4 July, 2022; originally announced July 2022.

    Comments: Preprint. 4 pages, 2 figures, 5 tables. Code: https://github.com/showlab/EgoVLP. The Ego4D challenge technical report of EgoVLP arXiv:2206.01670. See EPIC challenge technical report arXiv:2207.01334 for overlap

  28. arXiv:2207.01334  [pdf, other]

    cs.CV

    Egocentric Video-Language Pretraining @ EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022

    Authors: Kevin Qinghong Lin, Alex Jinpeng Wang, Rui Yan, Eric Zhongcong Xu, Rongcheng Tu, Yanru Zhu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Wei Liu, Mike Zheng Shou

    Abstract: In this report, we propose a video-language pretraining (VLP) based solution, EgoVLP (arXiv:2206.01670), for the EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge. Specifically, we exploit the recently released Ego4D dataset (Grauman et al., 2021) to pioneer Egocentric VLP from pretraining dataset, pretraining objective, and development set. Based on the above three designs, we develop a pretra…

    Submitted 3 August, 2022; v1 submitted 4 July, 2022; originally announced July 2022.

    Comments: To appear in CVPRW22. 5 pages, 2 figures, 2 tables. Code: https://github.com/showlab/EgoVLP. The EPIC challenge technical report of EgoVLP arXiv:2206.01670. See Ego4D challenge technical report arXiv:2207.01622

  29. arXiv:2206.01670  [pdf, other]

    cs.CV cs.AI

    Egocentric Video-Language Pretraining

    Authors: Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou

    Abstract: Video-Language Pretraining (VLP), which aims to learn transferable representation to advance a wide range of video-text downstream tasks, has recently received increasing attention. Best performing works rely on large-scale, 3rd-person video-text datasets, such as HowTo100M. In this work, we exploit the recently released Ego4D dataset to pioneer Egocentric VLP along three directions. (i) We create…

    Submitted 12 October, 2022; v1 submitted 3 June, 2022; originally announced June 2022.

    Comments: Accepted by NeurIPS 2022. Double champions at Ego4D and EPIC-Kitchens, CVPR 2022 challenges. 23 pages, 13 figures, 12 tables. Code: https://github.com/showlab/EgoVLP