Showing 1–50 of 188 results for author: Wen, B

  1. arXiv:2505.05474

    cs.CV

    3D Scene Generation: A Survey

    Authors: Beichen Wen, Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, Ziwei Liu

    Abstract: 3D scene generation seeks to synthesize spatially structured, semantically meaningful, and photorealistic environments for applications such as immersive media, robotics, autonomous driving, and embodied AI. Early methods based on procedural rules offered scalability but limited diversity. Recent advances in deep generative models (e.g., GANs, diffusion models) and 3D representations (e.g., NeRF,…

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: Project Page: https://github.com/hzxie/Awesome-3D-Scene-Generation

  2. arXiv:2505.02835

    cs.CV cs.CL

    R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning

    Authors: Yi-Fan Zhang, Xingyu Lu, Xiao Hu, Chaoyou Fu, Bin Wen, Tianke Zhang, Changyi Liu, Kaiyu Jiang, Kaibing Chen, Kaiyu Tang, Haojie Ding, Jiankang Chen, Fan Yang, Zhang Zhang, Tingting Gao, Liang Wang

    Abstract: Multimodal Reward Models (MRMs) play a crucial role in enhancing the performance of Multimodal Large Language Models (MLLMs). While recent advancements have primarily focused on improving the model structure and training data of MRMs, there has been limited exploration into the effectiveness of long-term reasoning capabilities for reward modeling and how to activate these capabilities in MRMs. In…

    Submitted 9 May, 2025; v1 submitted 5 May, 2025; originally announced May 2025.

    Comments: Home page: https://github.com/yfzhang114/r1_reward

  3. arXiv:2504.19549

    cs.CV

    DEEMO: De-identity Multimodal Emotion Recognition and Reasoning

    Authors: Deng Li, Bohao Xing, Xin Liu, Baiqiang Xia, Bihan Wen, Heikki Kälviäinen

    Abstract: Emotion understanding is a critical yet challenging task. Most existing approaches rely heavily on identity-sensitive information, such as facial expressions and speech, which raises concerns about personal privacy. To address this, we introduce the De-identity Multimodal Emotion Recognition and Reasoning (DEEMO), a novel task designed to enable emotion understanding using de-identified video and…

    Submitted 28 April, 2025; originally announced April 2025.

  4. arXiv:2504.19136

    cs.CV cs.AI eess.IV

    PAD: Phase-Amplitude Decoupling Fusion for Multi-Modal Land Cover Classification

    Authors: Huiling Zheng, Xian Zhong, Bin Liu, Yi Xiao, Bihan Wen, Xiaofeng Li

    Abstract: The fusion of Synthetic Aperture Radar (SAR) and RGB imagery for land cover classification remains challenging due to modality heterogeneity and the underutilization of spectral complementarity. Existing methods often fail to decouple shared structural features from modality-specific radiometric attributes, leading to feature conflicts and information loss. To address this issue, we propose Phase-…

    Submitted 27 April, 2025; originally announced April 2025.

    Comments: 13 pages, 8 figures

  5. arXiv:2504.14904

    cs.SI cs.AI cs.CL cs.MM

    VLM as Policy: Common-Law Content Moderation Framework for Short Video Platform

    Authors: Xingyu Lu, Tianke Zhang, Chang Meng, Xiaobei Wang, Jinpeng Wang, YiFan Zhang, Shisong Tang, Changyi Liu, Haojie Ding, Kaiyu Jiang, Kaiyu Tang, Bin Wen, Hai-Tao Zheng, Fan Yang, Tingting Gao, Di Zhang, Kun Gai

    Abstract: Exponentially growing short video platforms (SVPs) face significant challenges in moderating content detrimental to users' mental health, particularly for minors. The dissemination of such content on SVPs can lead to catastrophic societal consequences. Although substantial efforts have been dedicated to moderating such content, existing methods suffer from critical limitations: (1) Manual review i…

    Submitted 21 April, 2025; originally announced April 2025.

    Comments: 20 pages, 6 figures

  6. arXiv:2504.12711

    cs.CV cs.AI eess.IV

    NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results

    Authors: Xin Li, Yeying Jin, Xin Jin, Zongwei Wu, Bingchen Li, Yufei Wang, Wenhan Yang, Yu Li, Zhibo Chen, Bihan Wen, Robby T. Tan, Radu Timofte, Qiyu Rong, Hongyuan Jing, Mengmeng Zhang, Jinglong Li, Xiangyu Lu, Yi Ren, Yuting Liu, Meng Zhang, Xiang Chen, Qiyuan Guan, Jiangxin Dong, Jinshan Pan, Conglin Gou, et al. (112 additional authors not shown)

    Abstract: This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which includ…

    Submitted 19 April, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

    Comments: Challenge Report of CVPR NTIRE 2025; 26 pages; Methods from 32 teams

  7. arXiv:2504.10329

    cs.CV

    InstructEngine: Instruction-driven Text-to-Image Alignment

    Authors: Xingyu Lu, Yuhang Hu, YiFan Zhang, Kaiyu Jiang, Changyi Liu, Tianke Zhang, Jinpeng Wang, Chun Yuan, Bin Wen, Fan Yang, Tingting Gao, Di Zhang

    Abstract: Reinforcement Learning from Human/AI Feedback (RLHF/RLAIF) has been extensively utilized for preference alignment of text-to-image models. Existing methods face certain limitations in terms of both data and algorithm. For training data, most approaches rely on manual annotated preference data, either by directly fine-tuning the generators or by training reward models to provide training signals. H…

    Submitted 21 April, 2025; v1 submitted 14 April, 2025; originally announced April 2025.

    Comments: 8 pages, 7 figures

  8. arXiv:2504.09899

    cs.CV eess.IV

    Digital Staining with Knowledge Distillation: A Unified Framework for Unpaired and Paired-But-Misaligned Data

    Authors: Ziwang Xu, Lanqing Guo, Satoshi Tsutsui, Shuyan Zhang, Alex C. Kot, Bihan Wen

    Abstract: Staining is essential in cell imaging and medical diagnostics but poses significant challenges, including high cost, time consumption, labor intensity, and irreversible tissue alterations. Recent advances in deep learning have enabled digital staining through supervised model training. However, collecting large-scale, perfectly aligned pairs of stained and unstained images remains difficult. In th…

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: Accepted to IEEE Transactions on Medical Imaging

  9. arXiv:2504.08809

    cs.LG

    Decoupling Contrastive Decoding: Robust Hallucination Mitigation in Multimodal Large Language Models

    Authors: Wei Chen, Xin Yan, Bin Wen, Fan Yang, Tingting Gao, Di Zhang, Long Chen

    Abstract: Although multimodal large language models (MLLMs) exhibit remarkable reasoning capabilities on complex multimodal understanding tasks, they still suffer from the notorious hallucination issue: generating outputs misaligned with obvious visual or factual evidence. Currently, training-based solutions, like direct preference optimization (DPO), leverage paired preference data to suppress hallucinatio…

    Submitted 8 April, 2025; originally announced April 2025.

    Comments: 13 pages, 4 figures

  10. arXiv:2504.03151

    cs.CL cs.LG

    Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1)

    Authors: Jing Bi, Susan Liang, Xiaofei Zhou, Pinxin Liu, Junjia Guo, Yunlong Tang, Luchuan Song, Chao Huang, Guangyu Sun, Jinxi He, Jiarui Wu, Shu Yang, Daoan Zhang, Chen Chen, Lianggong Bruce Wen, Zhang Liu, Jiebo Luo, Chenliang Xu

    Abstract: Reasoning is central to human intelligence, enabling structured problem-solving across diverse tasks. Recent advances in large language models (LLMs) have greatly enhanced their reasoning abilities in arithmetic, commonsense, and symbolic domains. However, effectively extending these capabilities into multimodal contexts-where models must integrate both visual and textual inputs-continues to be a…

    Submitted 4 April, 2025; originally announced April 2025.

  11. arXiv:2503.23897

    cs.CV cs.AI

    Training-Free Text-Guided Image Editing with Visual Autoregressive Model

    Authors: Yufei Wang, Lanqing Guo, Zhihao Li, Jiaxing Huang, Pichao Wang, Bihan Wen, Jian Wang

    Abstract: Text-guided image editing is an essential task that enables users to modify images through natural language descriptions. Recent advances in diffusion models and rectified flows have significantly improved editing quality, primarily relying on inversion techniques to extract structured noise from input images. However, inaccuracies in inversion can propagate errors, leading to unintended modificat…

    Submitted 31 March, 2025; originally announced March 2025.

  12. arXiv:2503.20314

    cs.CV

    Wan: Open and Advanced Large-Scale Video Generative Models

    Authors: Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, et al. (37 additional authors not shown)

    Abstract: This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluat…

    Submitted 18 April, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

    Comments: 60 pages, 33 figures

  13. arXiv:2503.18673

    cs.CV cs.AI cs.RO

    Any6D: Model-free 6D Pose Estimation of Novel Objects

    Authors: Taeyeop Lee, Bowen Wen, Minjun Kang, Gyuree Kang, In So Kweon, Kuk-Jin Yoon

    Abstract: We introduce Any6D, a model-free framework for 6D object pose estimation that requires only a single RGB-D anchor image to estimate both the 6D pose and size of unknown objects in novel scenes. Unlike existing methods that rely on textured 3D models or multiple viewpoints, Any6D leverages a joint object alignment process to enhance 2D-3D alignment and metric scale estimation for improved pose accu…

    Submitted 25 March, 2025; v1 submitted 24 March, 2025; originally announced March 2025.

    Comments: CVPR 2025, Project Page: https://taeyeop.com/any6d

  14. arXiv:2503.15898

    cs.CV

    Reconstructing In-the-Wild Open-Vocabulary Human-Object Interactions

    Authors: Boran Wen, Dingbang Huang, Zichen Zhang, Jiahong Zhou, Jianbin Deng, Jingyu Gong, Yulong Chen, Lizhuang Ma, Yong-Lu Li

    Abstract: Reconstructing human-object interactions (HOI) from single images is fundamental in computer vision. Existing methods are primarily trained and tested on indoor scenes due to the lack of 3D data, particularly constrained by the object variety, making it challenging to generalize to real-world scenes with a wide range of objects. The limitations of previous 3D HOI datasets were primarily due to the…

    Submitted 20 March, 2025; originally announced March 2025.

    Comments: Accepted to CVPR 2025

  15. arXiv:2503.09994

    cs.CV

    TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs

    Authors: Yunxiao Wang, Meng Liu, Rui Shao, Haoyu Zhang, Bin Wen, Fan Yang, Tingting Gao, Di Zhang, Liqiang Nie

    Abstract: Video large language models have achieved remarkable performance in tasks such as video question answering, however, their temporal understanding remains suboptimal. To address this limitation, we curate a dedicated instruction fine-tuning dataset that focuses on enhancing temporal comprehension across five key dimensions. In order to reduce reliance on costly temporal annotations, we introduce a…

    Submitted 12 March, 2025; originally announced March 2025.

  16. arXiv:2503.09592

    cs.LG cs.SC

    Parsing the Language of Expression: Enhancing Symbolic Regression with Domain-Aware Symbolic Priors

    Authors: Sikai Huang, Yixin Berry Wen, Tara Adusumilli, Kusum Choudhary, Haizhao Yang

    Abstract: Symbolic regression is essential for deriving interpretable expressions that elucidate complex phenomena by exposing the underlying mathematical and physical relationships in data. In this paper, we present an advanced symbolic regression method that integrates symbol priors from diverse scientific domains - including physics, biology, chemistry, and engineering - into the regression process. By s…

    Submitted 12 March, 2025; originally announced March 2025.

  17. arXiv:2503.09143

    cs.CV

    Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding

    Authors: Haoyu Zhang, Qiaohui Chu, Meng Liu, Yunxiao Wang, Bin Wen, Fan Yang, Tingting Gao, Di Zhang, Yaowei Wang, Liqiang Nie

    Abstract: AI personal assistants, deployed through robots or wearables, require embodied understanding to collaborate effectively with humans. Current Multimodal Large Language Models (MLLMs) primarily focus on third-person (exocentric) vision, overlooking the unique aspects of first-person (egocentric) videos. Additionally, high acquisition costs limit data size, impairing MLLM performance. To address thes…

    Submitted 12 March, 2025; originally announced March 2025.

    Comments: Project: https://egovisiongroup.github.io/Exo2Ego.github.io/

  18. Generalizable and Explainable Deep Learning for Medical Image Computing: An Overview

    Authors: Ahmad Chaddad, Yan Hu, Yihang Wu, Binbin Wen, Reem Kateb

    Abstract: Objective. This paper presents an overview of generalizable and explainable artificial intelligence (XAI) in deep learning (DL) for medical imaging, aimed at addressing the urgent need for transparency and explainability in clinical applications. Methodology. We propose to use four CNNs in three medical datasets (brain tumor, skin cancer, and chest x-ray) for medical image classification tasks.…

    Submitted 11 March, 2025; originally announced March 2025.

    Comments: Published in Current Opinion in Biomedical Engineering

  19. arXiv:2503.05228

    cs.CV

    RecipeGen: A Benchmark for Real-World Recipe Image Generation

    Authors: Ruoxuan Zhang, Hongxia Xie, Yi Yao, Jian-Yu Jiang-Lin, Bin Wen, Ling Lo, Hong-Han Shuai, Yung-Hui Li, Wen-Huang Cheng

    Abstract: Recipe image generation is an important challenge in food computing, with applications from culinary education to interactive recipe platforms. However, there is currently no real-world dataset that comprehensively connects recipe goals, sequential steps, and corresponding images. To address this, we introduce RecipeGen, the first real-world goal-step-image benchmark for recipe generation, featuri…

    Submitted 7 March, 2025; originally announced March 2025.

  20. arXiv:2503.02748

    cs.RO

    Bridging VLM and KMP: Enabling Fine-grained robotic manipulation via Semantic Keypoints Representation

    Authors: Junjie Zhu, Huayu Liu, Jin Wang, Bangrong Wen, Kaixiang Huang, Xiaofei Li, Haiyun Zhan, Guodong Lu

    Abstract: From early Movement Primitive (MP) techniques to modern Vision-Language Models (VLMs), autonomous manipulation has remained a pivotal topic in robotics. As two extremes, VLM-based methods emphasize zero-shot and adaptive manipulation but struggle with fine-grained planning. In contrast, MP-based approaches excel in precise trajectory generalization but lack decision-making ability. To leverage the…

    Submitted 4 March, 2025; originally announced March 2025.

  21. arXiv:2503.01288

    cs.CV

    Reconciling Stochastic and Deterministic Strategies for Zero-shot Image Restoration using Diffusion Model in Dual

    Authors: Chong Wang, Lanqing Guo, Zixuan Fu, Siyuan Yang, Hao Cheng, Alex C. Kot, Bihan Wen

    Abstract: Plug-and-play (PnP) methods offer an iterative strategy for solving image restoration (IR) problems in a zero-shot manner, using a learned discriminative denoiser as the implicit prior. More recently, a sampling-based variant of this approach, which utilizes a pre-trained generative diffusion model, has gained great popularity for solving IR problems through stochastic sampling.…

    Submitted 3 March, 2025; originally announced March 2025.

    Comments: Accepted to CVPR 2025

  22. arXiv:2502.14914

    cs.CV cs.CL cs.LG

    What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness

    Authors: Zhihang Liu, Chen-Wei Xie, Bin Wen, Feiwu Yu, Jixuan Chen, Boqiang Zhang, Nianzu Yang, Pandeng Li, Yinglu Li, Zuan Gao, Yun Zheng, Hongtao Xie

    Abstract: Visual captioning benchmarks have become outdated with the emergence of modern multimodal large language models (MLLMs), as the brief ground-truth sentences and traditional metrics fail to assess detailed captions effectively. While recent benchmarks attempt to address this by focusing on keyword extraction or object-centric evaluation, they remain limited to vague-view or object-view analyses and…

    Submitted 15 April, 2025; v1 submitted 19 February, 2025; originally announced February 2025.

  23. arXiv:2502.14355

    cs.CV

    Triply Laplacian Scale Mixture Modeling for Seismic Data Noise Suppression

    Authors: Sirui Pan, Zhiyuan Zha, Shigang Wang, Yue Li, Zipei Fan, Gang Yan, Binh T. Nguyen, Bihan Wen, Ce Zhu

    Abstract: Sparsity-based tensor recovery methods have shown great potential in suppressing seismic data noise. These methods exploit tensor sparsity measures capturing the low-dimensional structures inherent in seismic data tensors to remove noise by applying sparsity constraints through soft-thresholding or hard-thresholding operators. However, in these methods, considering that real seismic data are non-s…

    Submitted 20 February, 2025; originally announced February 2025.

  24. arXiv:2502.13078

    cs.CV

    L4P: Low-Level 4D Vision Perception Unified

    Authors: Abhishek Badki, Hang Su, Bowen Wen, Orazio Gallo

    Abstract: The spatio-temporal relationship between the pixels of a video carries critical information for low-level 4D perception tasks. A single model that reasons about it should be able to solve several such tasks well. Yet, most state-of-the-art methods rely on architectures specialized for the task at hand. We present L4P, a feedforward, general-purpose architecture that solves low-level 4D perception…

    Submitted 25 April, 2025; v1 submitted 18 February, 2025; originally announced February 2025.

  25. arXiv:2502.13031

    cs.CL

    HPSS: Heuristic Prompting Strategy Search for LLM Evaluators

    Authors: Bosi Wen, Pei Ke, Yufei Sun, Cunxiang Wang, Xiaotao Gu, Jinfeng Zhou, Jie Tang, Hongning Wang, Minlie Huang

    Abstract: Since the adoption of large language models (LLMs) for text evaluation has become increasingly prevalent in the field of natural language processing (NLP), a series of existing works attempt to optimize the prompts for LLM evaluators to improve their alignment with human judgment. However, their efforts are limited to optimizing individual factors of evaluation prompts, such as evaluation criteria…

    Submitted 18 February, 2025; originally announced February 2025.

    Comments: 32 pages, 10 figures

  26. arXiv:2502.10391

    cs.CL cs.CV

    MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

    Authors: Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Fan Yang, Zhang Zhang, Tingting Gao, Di Zhang, Liang Wang, Rong Jin, Tieniu Tan

    Abstract: Despite notable advancements in Multimodal Large Language Models (MLLMs), most state-of-the-art models have not undergone thorough alignment with human preferences. This gap exists because current alignment research has primarily achieved progress in specific areas (e.g., hallucination reduction), while the broader question of whether aligning models with human preferences can systematically enhan…

    Submitted 14 February, 2025; originally announced February 2025.

    Comments: Project Page: https://mm-rlhf.github.io/

  27. arXiv:2502.09925

    cs.CV cs.AI

    TaskGalaxy: Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types

    Authors: Jiankang Chen, Tianke Zhang, Changyi Liu, Haojie Ding, Yaya Shi, Feng Cheng, Huihui Xiao, Bin Wen, Fan Yang, Tingting Gao, Di Zhang

    Abstract: Multimodal visual language models are gaining prominence in open-world applications, driven by advancements in model architectures, training techniques, and high-quality data. However, their performance is often limited by insufficient task-specific data, leading to poor generalization and biased outputs. Existing efforts to increase task diversity in fine-tuning datasets are hindered by the labor…

    Submitted 14 February, 2025; originally announced February 2025.

  28. arXiv:2502.06987

    eess.IV cs.CV

    Universal Vessel Segmentation for Multi-Modality Retinal Images

    Authors: Bo Wen, Anna Heinke, Akshay Agnihotri, Dirk-Uwe Bartsch, William Freeman, Truong Nguyen, Cheolhong An

    Abstract: We identify two major limitations in the existing studies on retinal vessel segmentation: (1) Most existing works are restricted to one modality, i.e., the Color Fundus (CF). However, multi-modality retinal images are used every day in the study of retina and retinal diseases, and the study of vessel segmentation on the other modalities is scarce; (2) Even though a small amount of works extended th…

    Submitted 8 March, 2025; v1 submitted 10 February, 2025; originally announced February 2025.

  29. arXiv:2502.04384

    cs.CL cs.AI cs.LG eess.SY

    Enhancing Reasoning to Adapt Large Language Models for Domain-Specific Applications

    Authors: Bo Wen, Xin Zhang

    Abstract: This paper presents SOLOMON, a novel Neuro-inspired Large Language Model (LLM) Reasoning Network architecture that enhances the adaptability of foundation models for domain-specific applications. Through a case study in semiconductor layout design, we demonstrate how SOLOMON enables swift adaptation of general-purpose LLMs to specialized tasks by leveraging Prompt Engineering and In-Context Learni…

    Submitted 5 February, 2025; originally announced February 2025.

    Comments: NeurIPS 2024 Workshop AFM (Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning)

    MSC Class: 68T09; 68T35; 68T45; 94C30 ACM Class: I.2.7; I.2.11; B.7.2

    Journal ref: https://neurips.cc/virtual/2024/104981

  30. arXiv:2502.01191

    cs.CV

    Towards Robust and Reliable Concept Representations: Reliability-Enhanced Concept Embedding Model

    Authors: Yuxuan Cai, Xiyu Wang, Satoshi Tsutsui, Winnie Pang, Bihan Wen

    Abstract: Concept Bottleneck Models (CBMs) aim to enhance interpretability by predicting human-understandable concepts as intermediates for decision-making. However, these models often face challenges in ensuring reliable concept representations, which can propagate to downstream tasks and undermine robustness, especially under distribution shifts. Two inherent issues contribute to concept unreliability: se…

    Submitted 3 February, 2025; originally announced February 2025.

  31. arXiv:2501.11462

    cs.CV eess.IV

    On the Adversarial Vulnerabilities of Transfer Learning in Remote Sensing

    Authors: Tao Bai, Xingjian Tian, Yonghao Xu, Bihan Wen

    Abstract: The use of pretrained models from general computer vision tasks is widespread in remote sensing, significantly reducing training costs and improving performance. However, this practice also introduces vulnerabilities to downstream tasks, where publicly available pretrained models can be used as a proxy to compromise downstream models. This paper presents a novel Adversarial Neuron Manipulation met…

    Submitted 20 January, 2025; originally announced January 2025.

    Comments: This work has been submitted to the IEEE for possible publication

  32. arXiv:2501.09898

    cs.CV cs.LG cs.RO

    FoundationStereo: Zero-Shot Stereo Matching

    Authors: Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, Stan Birchfield

    Abstract: Tremendous progress has been made in deep stereo matching to excel on benchmark datasets through per-domain fine-tuning. However, achieving strong zero-shot generalization - a hallmark of foundation models in other computer vision tasks - remains challenging for stereo matching. We introduce FoundationStereo, a foundation model for stereo depth estimation designed to achieve strong zero-shot gener…

    Submitted 3 April, 2025; v1 submitted 16 January, 2025; originally announced January 2025.

    Comments: CVPR 2025

  33. arXiv:2501.05205

    cs.CV cs.AI

    Discovering Hidden Visual Concepts Beyond Linguistic Input in Infant Learning

    Authors: Xueyi Ke, Satoshi Tsutsui, Yayun Zhang, Bihan Wen

    Abstract: Infants develop complex visual understanding rapidly, even preceding the acquisition of linguistic skills. As computer vision seeks to replicate the human vision system, understanding infant visual development may offer valuable insights. In this paper, we present an interdisciplinary study exploring this question: can a computational model that imitates the infant learning process develop broader…

    Submitted 25 March, 2025; v1 submitted 9 January, 2025; originally announced January 2025.

    Comments: Accepted at CVPR 2025

  34. arXiv:2412.19542

    cs.CV cs.AI cs.LG

    Interacted Object Grounding in Spatio-Temporal Human-Object Interactions

    Authors: Xiaoyang Liu, Boran Wen, Xinpeng Liu, Zizheng Zhou, Hongwei Fan, Cewu Lu, Lizhuang Ma, Yulong Chen, Yong-Lu Li

    Abstract: Spatio-temporal Human-Object Interaction (ST-HOI) understanding aims at detecting HOIs from videos, which is crucial for activity understanding. However, existing whole-body-object interaction video benchmarks overlook the truth that open-world objects are diverse, that is, they usually provide limited and predefined object classes. Therefore, we introduce a new open-world benchmark: Grounding Int…

    Submitted 23 February, 2025; v1 submitted 27 December, 2024; originally announced December 2024.

    Comments: To be published in the Proceedings of AAAI 2025. The first three authors contributed equally. Project: https://github.com/DirtyHarryLYL/HAKE-AVA

  35. arXiv:2412.18108

    cs.CV

    Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach

    Authors: Jing Bi, Junjia Guo, Yunlong Tang, Lianggong Bruce Wen, Zhang Liu, Chenliang Xu

    Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated remarkable progress in visual understanding. This impressive leap raises a compelling question: how can language models, initially trained solely on linguistic data, effectively interpret and process visual content? This paper aims to address this question with systematic investigation across 4 model families and 4 m…

    Submitted 23 December, 2024; originally announced December 2024.

  36. arXiv:2412.12798

    cs.CV

    ZoRI: Towards Discriminative Zero-Shot Remote Sensing Instance Segmentation

    Authors: Shiqi Huang, Shuting He, Bihan Wen

    Abstract: Instance segmentation algorithms in remote sensing are typically based on conventional methods, limiting their application to seen scenarios and closed-set predictions. In this work, we propose a novel task called zero-shot remote sensing instance segmentation, aimed at identifying aerial objects that are absent from training data. Challenges arise when classifying aerial categories with high inte…

    Submitted 17 December, 2024; originally announced December 2024.

    Comments: AAAI 2025, code see https://github.com/HuangShiqi128/ZoRI

  37. arXiv:2412.11912

    cs.CL

    CharacterBench: Benchmarking Character Customization of Large Language Models

    Authors: Jinfeng Zhou, Yongkang Huang, Bosi Wen, Guanqun Bi, Yuxuan Chen, Pei Ke, Zhuang Chen, Xiyao Xiao, Libiao Peng, Kuntian Tang, Rongsheng Zhang, Le Zhang, Tangjie Lv, Zhipeng Hu, Hongning Wang, Minlie Huang

    Abstract: Character-based dialogue (aka role-playing) enables users to freely customize characters for interaction, which often relies on LLMs, raising the need to evaluate LLMs' character customization capability. However, existing benchmarks fail to ensure a robust evaluation as they often only involve a single character category or evaluate limited dimensions. Moreover, the sparsity of character features…

    Submitted 16 December, 2024; originally announced December 2024.

    Comments: AAAI 2025

  38. arXiv:2412.02076

    cs.CV

    Topology-Preserving Image Segmentation with Spatial-Aware Persistent Feature Matching

    Authors: Bo Wen, Haochen Zhang, Dirk-Uwe G. Bartsch, William R. Freeman, Truong Q. Nguyen, Cheolhong An

    Abstract: Topological correctness is critical for segmentation of tubular structures. Existing topological segmentation loss functions are primarily based on the persistent homology of the image. They match the persistent features from the segmentation with the persistent features from the ground truth and minimize the difference between them. However, these methods suffer from an ambiguous matching problem…

    Submitted 8 March, 2025; v1 submitted 2 December, 2024; originally announced December 2024.

  39. arXiv:2412.00111

    cs.CV

    Video Set Distillation: Information Diversification and Temporal Densification

    Authors: Yinjie Zhao, Heng Zhao, Bihan Wen, Yew-Soon Ong, Joey Tianyi Zhou

    Abstract: The rapid development of AI models has led to a growing emphasis on enhancing their capabilities for complex input data such as videos. While large-scale video datasets have been introduced to support this growth, the unique challenges of reducing redundancies in video sets have not been explored. Compared to image datasets or individual videos, video sets have a two-layer nested…

    Submitted 28 November, 2024; originally announced December 2024.

  40. arXiv:2411.04799

    cs.CL cs.AI

    Kwai-STaR: Transform LLMs into State-Transition Reasoners

    Authors: Xingyu Lu, Yuhang Hu, Changyi Liu, Tianke Zhang, Zhenyu Yang, Zhixiang Ding, Shengsheng Qian, Meng Du, Ruiwen Kang, Kaiyu Tang, Fan Yang, Tingting Gao, Di Zhang, Hai-Tao Zheng, Bin Wen

    Abstract: Mathematical reasoning presents a significant challenge to the cognitive capabilities of LLMs. Various methods have been proposed to enhance the mathematical ability of LLMs. However, few recognize the value of state transition for LLM reasoning. In this work, we define mathematical problem-solving as a process of transiting from an initial unsolved state to the final resolved state, and propose K…

    Submitted 12 November, 2024; v1 submitted 7 November, 2024; originally announced November 2024.

    Comments: 6 pages, 2 figures

  41. arXiv:2411.00965  [pdf, other]

    cs.RO

    SPOT: SE(3) Pose Trajectory Diffusion for Object-Centric Manipulation

    Authors: Cheng-Chun Hsu, Bowen Wen, Jie Xu, Yashraj Narang, Xiaolong Wang, Yuke Zhu, Joydeep Biswas, Stan Birchfield

    Abstract: We introduce SPOT, an object-centric imitation learning framework. The key idea is to capture each task by an object-centric representation, specifically the SE(3) object pose trajectory relative to the target. This approach decouples embodiment actions from sensory inputs, facilitating learning from various demonstration types, including both action-based and action-less human hand demonstrations…

    Submitted 13 May, 2025; v1 submitted 1 November, 2024; originally announced November 2024.

  42. FeBiM: Efficient and Compact Bayesian Inference Engine Empowered with Ferroelectric In-Memory Computing

    Authors: Chao Li, Zhicheng Xu, Bo Wen, Ruibin Mao, Can Li, Thomas Kämpfe, Kai Ni, Xunzhao Yin

    Abstract: In scenarios with limited training data or where explainability is crucial, conventional neural network-based machine learning models often face challenges. In contrast, Bayesian inference-based algorithms excel in providing interpretable predictions and reliable uncertainty estimation in these scenarios. While many state-of-the-art in-memory computing (IMC) architectures leverage emerging non-vol…

    Submitted 25 October, 2024; originally announced October 2024.

    Comments: 6 pages, 8 figures, to be published in the 61st DAC (Design Automation Conference) proceedings

  43. arXiv:2410.18907  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    SkillMimicGen: Automated Demonstration Generation for Efficient Skill Learning and Deployment

    Authors: Caelan Garrett, Ajay Mandlekar, Bowen Wen, Dieter Fox

    Abstract: Imitation learning from human demonstrations is an effective paradigm for robot manipulation, but acquiring large datasets is costly and resource-intensive, especially for long-horizon tasks. To address this issue, we propose SkillMimicGen (SkillGen), an automated system for generating demonstration datasets from a few human demos. SkillGen segments human demos into manipulation skills, adapts the…

    Submitted 24 October, 2024; originally announced October 2024.

    Journal ref: 2024 Conference on Robot Learning (CoRL)

  44. arXiv:2409.11256  [pdf, other]

    cs.CV eess.IV

    Temporal As a Plugin: Unsupervised Video Denoising with Pre-Trained Image Denoisers

    Authors: Zixuan Fu, Lanqing Guo, Chong Wang, Yufei Wang, Zhihao Li, Bihan Wen

    Abstract: Recent advancements in deep learning have shown impressive results in image and video denoising, leveraging extensive pairs of noisy and noise-free data for supervision. However, the challenge of acquiring paired videos for dynamic scenes hampers the practical deployment of deep video denoising techniques. In contrast, this obstacle is less pronounced in image denoising, where paired data is more…

    Submitted 17 September, 2024; originally announced September 2024.

  45. arXiv:2409.07041  [pdf, other]

    cs.CV

    SoftShadow: Leveraging Soft Masks for Penumbra-Aware Shadow Removal

    Authors: Xinrui Wang, Lanqing Guo, Xiyu Wang, Siyu Huang, Bihan Wen

    Abstract: Recent advancements in deep learning have yielded promising results for the image shadow removal task. However, most existing methods rely on binary pre-generated shadow masks. The binary nature of such masks could potentially lead to artifacts near the boundary between shadow and non-shadow areas. In view of this, inspired by the physical model of shadow formation, we introduce novel soft shadow…

    Submitted 12 March, 2025; v1 submitted 11 September, 2024; originally announced September 2024.

    Comments: This paper has been accepted by CVPR 2025

  46. arXiv:2408.04587  [pdf, other]

    cs.RO

    FORGE: Force-Guided Exploration for Robust Contact-Rich Manipulation under Uncertainty

    Authors: Michael Noseworthy, Bingjie Tang, Bowen Wen, Ankur Handa, Chad Kessens, Nicholas Roy, Dieter Fox, Fabio Ramos, Yashraj Narang, Iretiayo Akinola

    Abstract: We present FORGE, a method for sim-to-real transfer of force-aware manipulation policies in the presence of significant pose uncertainty. During simulation-based policy learning, FORGE combines a force threshold mechanism with a dynamics randomization scheme to enable robust transfer of the learned policies to the real robot. At deployment, FORGE policies, conditioned on a maximum allowable force,…

    Submitted 2 January, 2025; v1 submitted 8 August, 2024; originally announced August 2024.

    Comments: IndustReal comparisons and snap-fit task added (v2)

  47. arXiv:2407.20177  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    AutoScale: Scale-Aware Data Mixing for Pre-Training LLMs

    Authors: Feiyang Kang, Yifan Sun, Bingbing Wen, Si Chen, Dawn Song, Rafid Mahmood, Ruoxi Jia

    Abstract: Domain reweighting is an emerging research area aimed at adjusting the relative weights of different data sources to improve the effectiveness and efficiency of LLM pre-training. We show that data mixtures that perform well at smaller scales may not retain their advantage at larger scales, challenging the existing practice of determining competitive mixtures in small-scale experiments and directly…

    Submitted 5 April, 2025; v1 submitted 29 July, 2024; originally announced July 2024.

    Comments: Preprint. Under review

  48. arXiv:2407.18418  [pdf, other]

    cs.CL

    Know Your Limits: A Survey of Abstention in Large Language Models

    Authors: Bingbing Wen, Jihan Yao, Shangbin Feng, Chenjun Xu, Yulia Tsvetkov, Bill Howe, Lucy Lu Wang

    Abstract: Abstention, the refusal of large language models (LLMs) to provide an answer, is increasingly recognized for its potential to mitigate hallucinations and enhance safety in LLM systems. In this survey, we introduce a framework to examine abstention from three perspectives: the query, the model, and human values. We organize the literature on abstention methods, benchmarks, and evaluation metrics us…

    Submitted 12 February, 2025; v1 submitted 25 July, 2024; originally announced July 2024.

    Comments: TACL 2024

  49. arXiv:2407.17996  [pdf, other]

    cs.CV

    Joint RGB-Spectral Decomposition Model Guided Image Enhancement in Mobile Photography

    Authors: Kailai Zhou, Lijing Cai, Yibo Wang, Mengya Zhang, Bihan Wen, Qiu Shen, Xun Cao

    Abstract: The integration of miniaturized spectrometers into mobile devices offers new avenues for image quality enhancement and facilitates novel downstream tasks. However, the broader application of spectral sensors in mobile photography is hindered by the inherent complexity of spectral images and the constraints of spectral imaging capabilities. To overcome these challenges, we propose a joint RGB-Spect…

    Submitted 28 November, 2024; v1 submitted 25 July, 2024; originally announced July 2024.

  50. arXiv:2407.14177  [pdf, other]

    cs.CV

    EVLM: An Efficient Vision-Language Model for Visual Understanding

    Authors: Kaibing Chen, Dong Shen, Hanwen Zhong, Huasong Zhong, Kui Xia, Di Xu, Wei Yuan, Yifei Hu, Bin Wen, Tianke Zhang, Changyi Liu, Dewen Fan, Huihui Xiao, Jiahong Wu, Fan Yang, Size Li, Di Zhang

    Abstract: In the field of multi-modal language models, the majority of methods are built on an architecture similar to LLaVA. These models use a single-layer ViT feature as a visual prompt, directly feeding it into the language models alongside textual tokens. However, when dealing with long sequences of visual signals or inputs such as videos, the self-attention mechanism of language models can lead to sig…

    Submitted 19 July, 2024; originally announced July 2024.