Showing 1–50 of 367 results for author: Sha, W

  1. arXiv:2507.01050  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Text Detoxification: Data Efficiency, Semantic Preservation and Model Generalization

    Authors: Jing Yu, Yibo Zhao, Jiapeng Zhu, Wenming Shao, Bo Pang, Zhao Zhang, Xiang Li

    Abstract: The widespread dissemination of toxic content on social media poses a serious threat to both online environments and public discourse, highlighting the urgent need for detoxification methods that effectively remove toxicity while preserving the original semantics. However, existing approaches often struggle to simultaneously achieve strong detoxification performance, semantic preservation, and rob…

    Submitted 23 June, 2025; originally announced July 2025.

  2. arXiv:2507.01029  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    PathCoT: Chain-of-Thought Prompting for Zero-shot Pathology Visual Reasoning

    Authors: Junjie Zhou, Yingli Zuo, Shichang Feng, Peng Wan, Qi Zhu, Daoqiang Zhang, Wei Shao

    Abstract: With the development of generative artificial intelligence and instruction tuning techniques, multimodal large language models (MLLMs) have made impressive progress on general reasoning tasks. Benefiting from the chain-of-thought (CoT) methodology, MLLMs can solve the visual reasoning problem step-by-step. However, existing MLLMs still face significant challenges when applied to pathology visual r…

    Submitted 18 June, 2025; originally announced July 2025.

  3. arXiv:2507.00392  [pdf, ps, other]

    cs.CV

    Learning Dense Feature Matching via Lifting Single 2D Image to 3D Space

    Authors: Yingping Liang, Yutao Hu, Wenqi Shao, Ying Fu

    Abstract: Feature matching plays a fundamental role in many computer vision tasks, yet existing methods heavily rely on scarce and clean multi-view image collections, which constrains their generalization to diverse and challenging scenarios. Moreover, conventional feature encoders are typically trained on single-view 2D images, limiting their capacity to capture 3D-aware correspondences. In this paper, we…

    Submitted 30 June, 2025; originally announced July 2025.

  4. arXiv:2506.18385  [pdf, ps, other]

    cs.CV

    InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models

    Authors: Nianchen Deng, Lixin Gu, Shenglong Ye, Yinan He, Zhe Chen, Songze Li, Haomin Wang, Xingguang Wei, Tianshuo Yang, Min Dou, Tong He, Wenqi Shao, Kaipeng Zhang, Yi Wang, Botian Shi, Yanting Zhang, Jifeng Dai, Yu Qiao, Hongjie Zhang, Wenhai Wang

    Abstract: Recent benchmarks and datasets have been proposed to improve spatial reasoning in vision-language models (VLMs), yet existing open resources remain limited in scale, visual diversity, and instruction expressiveness. In this work, we introduce InternSpatial, the largest open-source dataset for spatial reasoning in VLMs, along with InternSpatial-Bench, a corresponding evaluation benchmark designed t…

    Submitted 23 June, 2025; originally announced June 2025.

  5. arXiv:2506.17929  [pdf, ps, other]

    cs.LG cs.AI

    ASTER: Adaptive Spatio-Temporal Early Decision Model for Dynamic Resource Allocation

    Authors: Shulun Chen, Wei Shao, Flora D. Salim, Hao Xue

    Abstract: Supporting decision-making has long been a central vision in the field of spatio-temporal intelligence. While prior work has improved the timeliness and accuracy of spatio-temporal forecasting, converting these forecasts into actionable strategies remains a key challenge. A main limitation is the decoupling of the prediction and the downstream decision phases, which can significantly degrade the d…

    Submitted 22 June, 2025; originally announced June 2025.

    Comments: ASTER: Adaptive Spatio-Temporal Early Decision Model for Dynamic Resource Allocation

  6. arXiv:2506.17361  [pdf, ps, other]

    eess.IV cs.CV cs.LG

    Efficient Feedback Gate Network for Hyperspectral Image Super-Resolution

    Authors: Xufei Wang, Mingjian Zhang, Fei Ge, Jinchen Zhu, Wen Sha, Jifen Ren, Zhimeng Hou, Shouguo Zheng, ling Zheng, Shizhuang Weng

    Abstract: Even without auxiliary images, single hyperspectral image super-resolution (SHSR) methods can be designed to improve the spatial resolution of hyperspectral images. However, failing to explore coherence thoroughly along bands and spatial-spectral information leads to the limited performance of the SHSR. In this study, we propose a novel group-based SHSR method termed the efficient feedback gate ne…

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: 20 pages, 17 figures

  7. arXiv:2506.17202  [pdf, ps, other]

    cs.CV

    UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation

    Authors: Teng Li, Quanfeng Lu, Lirui Zhao, Hao Li, Xizhou Zhu, Yu Qiao, Jun Zhang, Wenqi Shao

    Abstract: Unified image understanding and generation has emerged as a promising paradigm in multimodal artificial intelligence. Despite recent progress, the optimal architectural design for such unified models remains an open challenge. In this work, we start by analyzing the modality alignment behaviors of task-specific expert models for understanding and generation, as well as current unified models. Our…

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: Code: https://github.com/tliby/UniFork

  8. arXiv:2506.07740  [pdf, other]

    cs.CV

    Flow-Anything: Learning Real-World Optical Flow Estimation from Large-Scale Single-view Images

    Authors: Yingping Liang, Ying Fu, Yutao Hu, Wenqi Shao, Jiaming Liu, Debing Zhang

    Abstract: Optical flow estimation is a crucial subfield of computer vision, serving as a foundation for video tasks. However, the real-world robustness is limited by animated synthetic datasets for training. This introduces domain gaps when applied to real-world applications and limits the benefits of scaling up datasets. To address these challenges, we propose Flow-Anything, a large-scale data gen…

    Submitted 9 June, 2025; originally announced June 2025.

  9. Venus Cloud Research: Progress and Perspectives

    Authors: Longkang Dai, Dmitrij V. Titov, Wencheng D. Shao, Xi Zhang, Jun Cui, Siteng Fan

    Abstract: Venus has regained attention on the international stage with the approval of three new missions by ESA and NASA. As the twin sister of Earth, Venus exhibits a distinct atmosphere, which casts a veil of mystery over the planetary evolution and is of great scientific significance. One of the most important components of Venus-the cloud-is believed to have significantly regulated its climate evolutio…

    Submitted 6 June, 2025; originally announced June 2025.

    Comments: 76 pages, 14 figures

    Journal ref: Space Sci Rev 221, 51 (2025)

  10. arXiv:2506.05781  [pdf, ps, other]

    cs.IR

    Generating Long Semantic IDs in Parallel for Recommendation

    Authors: Yupeng Hou, Jiacheng Li, Ashley Shin, Jinsung Jeon, Abhishek Santhanam, Wei Shao, Kaveh Hassani, Ning Yao, Julian McAuley

    Abstract: Semantic ID-based recommendation models tokenize each item into a small number of discrete tokens that preserve specific semantics, leading to better performance, scalability, and memory efficiency. While recent models adopt a generative approach, they often suffer from inefficient inference due to the reliance on resource-intensive beam search and multiple forward passes through the neural sequen…

    Submitted 6 June, 2025; originally announced June 2025.

    Comments: KDD 2025

  11. arXiv:2506.04217  [pdf, ps, other]

    cs.RO cs.AI

    OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis

    Authors: Junting Chen, Haotian Liang, Lingxiao Du, Weiyun Wang, Mengkang Hu, Yao Mu, Wenhai Wang, Jifeng Dai, Ping Luo, Wenqi Shao, Lin Shao

    Abstract: The rapid progress of navigation, manipulation, and vision models has made mobile manipulators capable in many specialized tasks. However, the open-world mobile manipulation (OWMM) task remains a challenge due to the need for generalization to open-ended instructions and environments, as well as the systematic complexity to integrate high-level decision making with low-level robot control based on…

    Submitted 21 June, 2025; v1 submitted 4 June, 2025; originally announced June 2025.

    Comments: 9 pages of main content, 19 pages in total

    ACM Class: I.2.4; I.2.9; I.2.10

  12. arXiv:2506.02648  [pdf, ps, other]

    cs.AI

    Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation

    Authors: Yue Yang, MingKang Chen, Qihua Liu, Mengkang Hu, Qiguang Chen, Gengrui Zhang, Shuyue Hu, Guangtao Zhai, Yu Qiao, Yu Wang, Wenqi Shao, Ping Luo

    Abstract: Recent advances in large language models (LLMs) have demonstrated impressive reasoning capacities that mirror human-like thinking. However, whether LLMs possess genuine fluid intelligence (i.e., the ability to reason abstractly and generalize rules in novel situations) remains an open question. Existing reasoning benchmarks either focus on domain-specific knowledge (crystallized intelligence) or l…

    Submitted 3 June, 2025; originally announced June 2025.

  13. arXiv:2505.23461  [pdf, ps, other]

    cs.CL

    UAQFact: Evaluating Factual Knowledge Utilization of LLMs on Unanswerable Questions

    Authors: Chuanyuan Tan, Wenbiao Shao, Hao Xiong, Tong Zhu, Zhenhua Liu, Kai Shi, Wenliang Chen

    Abstract: Handling unanswerable questions (UAQ) is crucial for LLMs, as it helps prevent misleading responses in complex situations. While previous studies have built several datasets to assess LLMs' performance on UAQ, these datasets lack factual knowledge support, which limits the evaluation of LLMs' ability to utilize their factual knowledge when handling UAQ. To address the limitation, we introduce a ne…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: ACL 2025 Findings

  14. arXiv:2505.22184  [pdf, ps, other]

    cs.CL cs.AI

    Breaking the Cloak! Unveiling Chinese Cloaked Toxicity with Homophone Graph and Toxic Lexicon

    Authors: Xuchen Ma, Jianxiang Yu, Wenming Shao, Bo Pang, Xiang Li

    Abstract: Social media platforms have experienced a significant rise in toxic content, including abusive language and discriminatory remarks, presenting growing challenges for content moderation. Some users evade censorship by deliberately disguising toxic words through homophonic cloak, which necessitates the task of unveiling cloaked toxicity. Existing methods are mostly designed for English texts, while…

    Submitted 5 June, 2025; v1 submitted 28 May, 2025; originally announced May 2025.

    Comments: 25 pages, 5 figures, 9 tables

  15. arXiv:2505.21355  [pdf, other]

    eess.IV cs.AI cs.CV

    Prostate Cancer Screening with Artificial Intelligence-Enhanced Micro-Ultrasound: A Comparative Study with Traditional Methods

    Authors: Muhammad Imran, Wayne G. Brisbane, Li-Ming Su, Jason P. Joseph, Wei Shao

    Abstract: Background and objective: Micro-ultrasound (micro-US) is a novel imaging modality with diagnostic accuracy comparable to MRI for detecting clinically significant prostate cancer (csPCa). We investigated whether artificial intelligence (AI) interpretation of micro-US can outperform clinical screening methods using PSA and digital rectal examination (DRE). Methods: We retrospectively studied 145 men…

    Submitted 27 May, 2025; originally announced May 2025.

  16. arXiv:2505.18958  [pdf, ps, other]

    cs.CV

    CDPDNet: Integrating Text Guidance with Hybrid Vision Encoders for Medical Image Segmentation

    Authors: Jiong Wu, Yang Xing, Boxiao Yu, Wei Shao, Kuang Gong

    Abstract: Most publicly available medical segmentation datasets are only partially labeled, with annotations provided for a subset of anatomical structures. When multiple datasets are combined for training, this incomplete annotation poses challenges, as it limits the model's ability to learn shared anatomical representations among datasets. Furthermore, vision-only frameworks often fail to capture complex…

    Submitted 27 May, 2025; v1 submitted 24 May, 2025; originally announced May 2025.

  17. arXiv:2505.18506  [pdf]

    physics.app-ph

    Capacity Enhancement Analysis and Implementation of a 3D Array Based on Miniaturized Dipole Antennas

    Authors: Yongzheng Li, Wanchen Yang, Shuai S. A. Yuan, Zhitao Ye, Chongwen Huang, Xiaoming Chen, Wenquan Che, Wei E. I. Sha

    Abstract: Theoretically, the three-dimensional (3D) array architecture provides a higher communication degree of freedom (DoF) compared to the planar arrays, allowing for greater capacity potential in multiple-input multiple-output (MIMO) systems. However, in practical implementations, the upper elements of 3D arrays significantly degrade the performance of the lower elements, leading to increased inter-ele…

    Submitted 24 May, 2025; originally announced May 2025.

    Comments: This manuscript has been submitted to IEEE Transactions on Antennas and Propagation and is currently under review.

  18. arXiv:2505.13427  [pdf, ps, other]

    cs.AI cs.CV

    MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision

    Authors: Lingxiao Du, Fanqing Meng, Zongkai Liu, Zhixiang Zhou, Ping Luo, Qiaosheng Zhang, Wenqi Shao

    Abstract: While Multimodal Large Language Models (MLLMs) have achieved impressive progress in vision-language understanding, they still struggle with complex multi-step reasoning, often producing logically inconsistent or partially correct solutions. A key limitation lies in the lack of fine-grained supervision over intermediate reasoning steps. To address this, we propose MM-PRM, a process reward model tra…

    Submitted 5 June, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

  19. arXiv:2505.12821  [pdf, other]

    cs.CL cs.AI

    SynDec: A Synthesize-then-Decode Approach for Arbitrary Textual Style Transfer via Large Language Models

    Authors: Han Sun, Zhen Sun, Zongmin Zhang, Linzhao Jia, Wei Shao, Min Zhang

    Abstract: Large Language Models (LLMs) are emerging as dominant forces for textual style transfer. However, for arbitrary style transfer, LLMs face two key challenges: (1) considerable reliance on manually-constructed prompts and (2) rigid stylistic biases inherent in LLMs. In this paper, we propose a novel Synthesize-then-Decode (SynDec) approach, which automatically synthesizes high-quality prompts and am…

    Submitted 19 May, 2025; originally announced May 2025.

  20. arXiv:2505.12504  [pdf, ps, other]

    cs.LG cs.AI

    CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models

    Authors: Zongkai Liu, Fanqing Meng, Lingxiao Du, Zhixiang Zhou, Chao Yu, Wenqi Shao, Qiaosheng Zhang

    Abstract: Recent advances in rule-based reinforcement learning (RL) have significantly improved the reasoning capability of language models (LMs) with rule-based rewards. However, existing RL methods -- such as GRPO, REINFORCE++, and RLOO -- often suffer from training instability, where large policy updates and improper clipping can lead to training collapse. To address this issue, we propose Clipped Policy…

    Submitted 18 May, 2025; originally announced May 2025.

  21. arXiv:2505.05155  [pdf, other]

    cs.LG cs.CR

    FedTDP: A Privacy-Preserving and Unified Framework for Trajectory Data Preparation via Federated Learning

    Authors: Zhihao Zeng, Ziquan Fang, Wei Shao, Lu Chen, Yunjun Gao

    Abstract: Trajectory data, which capture the movement patterns of people and vehicles over time and space, are crucial for applications like traffic optimization and urban planning. However, issues such as noise and incompleteness often compromise data quality, leading to inaccurate trajectory analyses and limiting the potential of these applications. While Trajectory Data Preparation (TDP) can enhance data…

    Submitted 8 May, 2025; originally announced May 2025.

  22. arXiv:2505.03383  [pdf, other]

    cs.CV

    Attention-aggregated Attack for Boosting the Transferability of Facial Adversarial Examples

    Authors: Jian-Wei Li, Wen-Ze Shao

    Abstract: Adversarial examples have revealed the vulnerability of deep learning models and raised serious concerns about information security. The transfer-based attack is a hot topic in black-box attacks that are practical to real-world scenarios where the training datasets, parameters, and structure of the target model are unknown to the attacker. However, few methods consider the particularity of class-s…

    Submitted 6 May, 2025; originally announced May 2025.

  23. arXiv:2505.03222  [pdf, ps, other]

    math.OC

    A Stochastic Gradient Descent Method with Global Convergence for Minimizing Nearly Convex Functions

    Authors: Chenglong Bao, Liang Chen, Weizhi Shao

    Abstract: This paper proposes a stochastic gradient descent method with an adaptive Gaussian noise term for minimizing nonconvex differentiable functions. The noise term in the algorithm, independent of the gradient, is determined by the difference between the function value at the current step and a lower bound estimate of the optimal value. In both probability space and state space, our theoretical analys…

    Submitted 6 May, 2025; originally announced May 2025.

    MSC Class: Primary 65K05; 90C26; Secondary 90C06

  24. arXiv:2504.14582  [pdf, other]

    cs.CV

    NTIRE 2025 Challenge on Image Super-Resolution ($\times$4): Methods and Results

    Authors: Zheng Chen, Kai Liu, Jue Gong, Jingkai Wang, Lei Sun, Zongwei Wu, Radu Timofte, Yulun Zhang, Xiangyu Kong, Xiaoxuan Yu, Hyunhee Park, Suejin Han, Hakjae Jeon, Dafeng Zhang, Hyung-Ju Chun, Donghun Ryou, Inju Ha, Bohyung Han, Lu Zhao, Yuyi Zhang, Pengyu Yan, Jiawei Hu, Pengwei Liu, Fengjun Guo, Hongyuan Yu, et al. (86 additional authors not shown)

    Abstract: This paper presents the NTIRE 2025 image super-resolution ($\times$4) challenge, one of the associated competitions of the 10th NTIRE Workshop at CVPR 2025. The challenge aims to recover high-resolution (HR) images from low-resolution (LR) counterparts generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective network designs or solutions that ach…

    Submitted 28 April, 2025; v1 submitted 20 April, 2025; originally announced April 2025.

    Comments: NTIRE 2025 webpage: https://www.cvlai.net/ntire/2025. Code: https://github.com/zhengchen1999/NTIRE2025_ImageSR_x4

  25. arXiv:2504.10479  [pdf, other]

    cs.CV

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Authors: Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, et al. (26 additional authors not shown)

    Abstract: We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single p…

    Submitted 18 April, 2025; v1 submitted 14 April, 2025; originally announced April 2025.

    Comments: Technical Report

  26. arXiv:2504.05782  [pdf, other]

    cs.CV cs.AI

    MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models

    Authors: Pengfei Zhou, Fanrui Zhang, Xiaopeng Peng, Zhaopan Xu, Jiaxin Ai, Yansheng Qiu, Chuanhao Li, Zhen Li, Ming Li, Yukang Feng, Jianwen Sun, Haoquan Zhang, Zizhen Li, Xiaofeng Mao, Wangbo Zhao, Kai Wang, Xiaojun Chang, Wenqi Shao, Yang You, Kaipeng Zhang

    Abstract: Multimodal reasoning, which integrates language and visual cues into problem solving and decision making, is a fundamental aspect of human intelligence and a crucial step toward artificial general intelligence. However, the evaluation of multimodal reasoning capabilities in Multimodal Large Language Models (MLLMs) remains inadequate. Most existing reasoning benchmarks are constrained by limited da…

    Submitted 8 April, 2025; originally announced April 2025.

    Comments: 11 pages, 8 figures

  27. arXiv:2504.02753  [pdf, other]

    quant-ph physics.optics

    Robust entangled photon generation enabled by single-shot Floquet driving

    Authors: Jun-Yong Yan, Paul C. A. Hagen, Hans-Georg Babin, Wei E. I. Sha, Andreas D. Wieck, Arne Ludwig, Chao-Yuan Jin, Vollrath M. Axt, Da-Wei Wang, Moritz Cygorek, Feng Liu

    Abstract: Quantum emitters driven by resonant two-photon excitation are a leading source for deterministically generated entangled photon pairs, essential for scalable photonic quantum technologies. However, conventional resonant schemes are highly sensitive to laser power fluctuations and pose additional experimental challenges for emitters with small biexciton binding energies. Here, we demonstrate how bi…

    Submitted 6 May, 2025; v1 submitted 3 April, 2025; originally announced April 2025.

    Comments: Manuscript with 10 pages and 4 figures plus Supplementary Information comprising 8 pages and 7 figures

  28. arXiv:2504.01886  [pdf, other]

    cs.CV

    GMAI-VL-R1: Harnessing Reinforcement Learning for Multimodal Medical Reasoning

    Authors: Yanzhou Su, Tianbin Li, Jiyao Liu, Chenglong Ma, Junzhi Ning, Cheng Tang, Sibo Ju, Jin Ye, Pengcheng Chen, Ming Hu, Shixiang Tang, Lihao Liu, Bin Fu, Wenqi Shao, Xiaowei Hu, Xiangwen Liao, Yuanfeng Ji, Junjun He

    Abstract: Recent advances in general medical AI have made significant strides, but existing models often lack the reasoning capabilities needed for complex medical decision-making. This paper presents GMAI-VL-R1, a multimodal medical reasoning model enhanced by reinforcement learning (RL) to improve its reasoning abilities. Through iterative training, GMAI-VL-R1 optimizes decision-making, significantly boos…

    Submitted 2 April, 2025; originally announced April 2025.

  29. arXiv:2503.20047  [pdf, other]

    cs.CV eess.IV

    Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis

    Authors: Yu Xin, Gorkem Can Ates, Kuang Gong, Wei Shao

    Abstract: Vision-language models (VLMs) have shown promise in 2D medical image analysis, but extending them to 3D remains challenging due to the high computational demands of volumetric data and the difficulty of aligning 3D spatial features with clinical text. We present Med3DVLM, a 3D VLM designed to address these challenges through three key innovations: (1) DCFormer, an efficient encoder that uses decom…

    Submitted 25 March, 2025; originally announced March 2025.

  30. arXiv:2503.16970  [pdf, other]

    cs.CV

    Distilling Monocular Foundation Model for Fine-grained Depth Completion

    Authors: Yingping Liang, Yutao Hu, Wenqi Shao, Ying Fu

    Abstract: Depth completion involves predicting dense depth maps from sparse LiDAR inputs. However, sparse depth annotations from sensors limit the availability of dense supervision, which is necessary for learning detailed geometric features. In this paper, we propose a two-stage knowledge distillation framework that leverages powerful monocular foundation models to provide dense supervision for depth compl…

    Submitted 21 March, 2025; originally announced March 2025.

  31. arXiv:2503.16779  [pdf, other]

    cs.CL cs.AI

    Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models

    Authors: Mengsong Wu, Tong Zhu, Han Han, Xiang Zhang, Wenbiao Shao, Wenliang Chen

    Abstract: Tool learning can further broaden the usage scenarios of large language models (LLMs). However most of the existing methods either need to finetune that the model can only use tools seen in the training data, or add tool demonstrations into the prompt with lower efficiency. In this paper, we present a new Tool Learning method Chain-of-Tools. It makes full use of the powerful semantic representatio…

    Submitted 20 March, 2025; originally announced March 2025.

    Comments: 11 pages, 10 figures

  32. arXiv:2503.15024  [pdf, other]

    cs.CV

    Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models

    Authors: Jin Wang, Chenghui Lv, Xian Li, Shichao Dong, Huadong Li, kelu Yao, Chao Li, Wenqi Shao, Ping Luo

    Abstract: Recently, the rapid development of AIGC has significantly boosted the diversities of fake media spread in the Internet, posing unprecedented threats to social security, politics, law, and etc. To detect the ever-increasingly diverse malicious fake media in the new era of AIGC, recent studies have proposed to exploit Large Vision Language Models (LVLMs) to design robust forgery detectors due to the…

    Submitted 23 March, 2025; v1 submitted 19 March, 2025; originally announced March 2025.

    Comments: 31 pages, 19 figures

  33. arXiv:2503.12545  [pdf, other]

    cs.CV

    PEBench: A Fictitious Dataset to Benchmark Machine Unlearning for Multimodal Large Language Models

    Authors: Zhaopan Xu, Pengfei Zhou, Weidong Tang, Jiaxin Ai, Wangbo Zhao, Xiaojiang Peng, Kai Wang, Yang You, Wenqi Shao, Hongxun Yao, Kaipeng Zhang

    Abstract: In recent years, Multimodal Large Language Models (MLLMs) have demonstrated remarkable advancements in tasks such as visual question answering, visual understanding, and reasoning. However, this impressive progress relies on vast amounts of data collected from the internet, raising significant concerns about privacy and security. To address these issues, machine unlearning (MU) has emerged as a pr…

    Submitted 16 March, 2025; originally announced March 2025.

  34. arXiv:2503.12505  [pdf, other]

    cs.AI cs.CV

    MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification

    Authors: Zhaopan Xu, Pengfei Zhou, Jiaxin Ai, Wangbo Zhao, Kai Wang, Xiaojiang Peng, Wenqi Shao, Hongxun Yao, Kaipeng Zhang

    Abstract: Reasoning is an essential capacity for large language models (LLMs) to address complex tasks, where the identification of process errors is vital for improving this ability. Recently, process-level reward models (PRMs) were proposed to provide step-wise rewards that facilitate reinforcement learning and data production during training and guide LLMs toward correct steps during inference, thereby i…

    Submitted 16 March, 2025; originally announced March 2025.

  35. arXiv:2503.12385  [pdf, other]

    cs.CV

    Car-1000: A New Large Scale Fine-Grained Visual Categorization Dataset

    Authors: Yutao Hu, Sen Li, Jincheng Yan, Wenqi Shao, Xiaoyan Luo

    Abstract: Fine-grained visual categorization (FGVC) is a challenging but significant task in computer vision, which aims to recognize different sub-categories of birds, cars, airplanes, etc. Among them, recognizing models of different cars has significant application value in autonomous driving, traffic surveillance and scene understanding, which has received considerable attention in the past few years. Ho…

    Submitted 16 March, 2025; originally announced March 2025.

    Comments: accepted to The Eleventh Workshop on Fine-Grained Visual Categorization in CVPR 2024

  36. arXiv:2503.09560  [pdf, other]

    eess.IV cs.CV

    FCaS: Fine-grained Cardiac Image Synthesis based on 3D Template Conditional Diffusion Model

    Authors: Jiahao Xia, Yutao Hu, Yaolei Qi, Zhenliang Li, Wenqi Shao, Junjun He, Ying Fu, Longjiang Zhang, Guanyu Yang

    Abstract: Solving medical imaging data scarcity through semantic image generation has attracted significant attention in recent years. However, existing methods primarily focus on generating whole-organ or large-tissue structures, showing limited effectiveness for organs with fine-grained structure. Due to stringent topological consistency, fragile coronary features, and complex 3D morphological heterogenei…

    Submitted 12 March, 2025; originally announced March 2025.

    Comments: 16 pages, 9 figures

  37. arXiv:2503.09496  [pdf, other]

    cs.CV

    Robust Multimodal Survival Prediction with the Latent Differentiation Conditional Variational AutoEncoder

    Authors: Junjie Zhou, Jiao Tang, Yingli Zuo, Peng Wan, Daoqiang Zhang, Wei Shao

    Abstract: The integrative analysis of histopathological images and genomic data has received increasing attention for survival prediction of human cancers. However, the existing studies always hold the assumption that full modalities are available. As a matter of fact, the cost for collecting genomic data is high, which sometimes makes genomic data unavailable in testing samples. A common way of tackling su…

    Submitted 18 March, 2025; v1 submitted 12 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR 2025

  38. arXiv:2503.09491  [pdf, other]

    cs.CV eess.IV

    DAMM-Diffusion: Learning Divergence-Aware Multi-Modal Diffusion Model for Nanoparticles Distribution Prediction

    Authors: Junjie Zhou, Shouju Wang, Yuxia Tang, Qi Zhu, Daoqiang Zhang, Wei Shao

    Abstract: The prediction of nanoparticles (NPs) distribution is crucial for the diagnosis and treatment of tumors. Recent studies indicate that the heterogeneity of tumor microenvironment (TME) highly affects the distribution of NPs across tumors. Hence, it has become a research hotspot to generate the NPs distribution by the aid of multi-modal TME components. However, the distribution divergence among mult…

    Submitted 12 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR 2025

  39. arXiv:2503.08422  [pdf, other]

    cs.CV

    JiSAM: Alleviate Labeling Burden and Corner Case Problems in Autonomous Driving via Minimal Real-World Data

    Authors: Runjian Chen, Wenqi Shao, Bo Zhang, Shaoshuai Shi, Li Jiang, Ping Luo

    Abstract: Deep-learning-based autonomous driving (AD) perception introduces a promising picture for safe and environment-friendly transportation. However, the over-reliance on real labeled data in LiDAR perception limits the scale of on-road attempts. 3D real world data is notoriously time-and-energy-consuming to annotate and lacks corner cases like rare traffic participants. On the contrary, in simulators…

    Submitted 13 March, 2025; v1 submitted 11 March, 2025; originally announced March 2025.

  40. arXiv:2503.07365  [pdf, other]

    cs.CV

    MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

    Authors: Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, Wenqi Shao

    Abstract: DeepSeek R1, and o1 have demonstrated powerful reasoning capabilities in the text domain through stable large-scale reinforcement learning. To enable broader applications, some works have attempted to transfer these capabilities to multimodal reasoning. However, these efforts have been limited by the limited difficulty of selected tasks and relatively small training scales, making it challenging t…

    Submitted 15 April, 2025; v1 submitted 10 March, 2025; originally announced March 2025.

  41. arXiv:2503.07167  [pdf, other]

    cs.CV cs.RO

    Temporal Overlapping Prediction: A Self-supervised Pre-training Method for LiDAR Moving Object Segmentation

    Authors: Ziliang Miao, Runjian Chen, Yixi Cai, Buwei He, Wenquan Zhao, Wenqi Shao, Bo Zhang, Fu Zhang

    Abstract: Moving object segmentation (MOS) on LiDAR point clouds is crucial for autonomous systems like self-driving vehicles. Previous supervised approaches rely heavily on costly manual annotations, while LiDAR sequences naturally capture temporal motion cues that can be leveraged for self-supervised learning. In this paper, we propose Temporal Overlapping Prediction (TO…

    Submitted 10 March, 2025; originally announced March 2025.

  42. arXiv:2503.00745  [pdf, other]

    eess.IV cs.CV

    Geodesic Diffusion Models for Medical Image-to-Image Generation

    Authors: Teng Zhang, Hongxu Jiang, Kuang Gong, Wei Shao

    Abstract: Diffusion models transform an unknown data distribution into a Gaussian prior by progressively adding noise until the data become indistinguishable from pure noise. This stochastic process traces a path in probability space, evolving from the original data distribution (considered as a Gaussian with near-zero variance) to an isotropic Gaussian. The denoiser then learns to reverse this process, gen…

    Submitted 2 March, 2025; originally announced March 2025.

  43. arXiv:2502.17241  [pdf, other]

    physics.comp-ph

    High-Order Modulation Large MIMO Detector Based on Physics-Inspired Methods

    Authors: Qing-Guo Zeng, Xiao-Peng Cui, Xian-Zhe Tao, Jia-Qi Hu, Shi-Jie Pan, Wei E. I. Sha, Man-Hong Yung

    Abstract: Applying quantum annealing or current quantum-/physics-inspired algorithms to MIMO detection always abandons the direct Gray-coded bit-to-symbol mapping in order to obtain an Ising form, leading to inconsistency errors. This often results in slow convergence rates and error floors, particularly with high-order modulations. We propose HOPbit, a novel MIMO detector designed to address this issue by tran…

    Submitted 24 February, 2025; originally announced February 2025.

  44. arXiv:2502.13092  [pdf, other]

    cs.CL cs.AI

    Text2World: Benchmarking Large Language Models for Symbolic World Model Generation

    Authors: Mengkang Hu, Tianxing Chen, Yude Zou, Yuheng Lei, Qiguang Chen, Ming Li, Yao Mu, Hongyuan Zhang, Wenqi Shao, Ping Luo

    Abstract: Recently, there has been growing interest in leveraging large language models (LLMs) to generate symbolic world models from textual descriptions. Although LLMs have been extensively explored in the context of world modeling, prior studies encountered several challenges, including evaluation randomness, dependence on indirect metrics, and a limited domain scope. To address these limitations, we int…

    Submitted 24 February, 2025; v1 submitted 18 February, 2025; originally announced February 2025.

    Comments: Project page: https://text-to-world.github.io/

  45. arXiv:2502.07508  [pdf, other]

    cs.CV

    Enhance-A-Video: Better Generated Video for Free

    Authors: Yang Luo, Xuanlei Zhao, Mengzhao Chen, Kaipeng Zhang, Wenqi Shao, Kai Wang, Zhangyang Wang, Yang You

    Abstract: DiT-based video generation has achieved remarkable results, but research into enhancing existing models remains relatively unexplored. In this work, we introduce a training-free approach to enhance the coherence and quality of DiT-based generated videos, named Enhance-A-Video. The core idea is enhancing the cross-frame correlations based on non-diagonal temporal attention distributions. Thanks to…

    Submitted 27 February, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

  46. arXiv:2502.06756  [pdf, other]

    cs.CV

    SAMRefiner: Taming Segment Anything Model for Universal Mask Refinement

    Authors: Yuqi Lin, Hengjia Li, Wenqi Shao, Zheng Yang, Jun Zhao, Xiaofei He, Ping Luo, Kaipeng Zhang

    Abstract: In this paper, we explore a principled way to enhance the quality of widely pre-existing coarse masks, enabling them to serve as reliable training data for segmentation models to reduce the annotation cost. In contrast to prior refinement techniques that are tailored to specific models or tasks in a closed-world manner, we propose SAMRefiner, a universal and efficient approach by adapting SAM to the…

    Submitted 16 March, 2025; v1 submitted 10 February, 2025; originally announced February 2025.

    Comments: Accepted to ICLR 2025

  47. arXiv:2502.05492  [pdf, other]

    physics.optics

    Extraction of power transmission parameters from PT-symmetric waveguides

    Authors: Chengnian Huang, Zhihao Lan, Menglin L. N. Chen, Wei E. I. Sha

    Abstract: The PT-symmetric waveguides have been frequently discussed in the photonics community due to their extraordinary properties. Especially, the study of power transmission is significant for switching applications. The aim of this study is to extract the mode power transmission parameters based on the coupled mode equations and analyze the power properties of the PT-symmetric system. The equations re…

    Submitted 8 February, 2025; originally announced February 2025.

    Comments: 15 pages, 10 figures, published in Optics Express

    Journal ref: Optics Express, vol. 33, no. 2, pp. 3162-3176, Jan. 2025

  48. arXiv:2502.05330  [pdf, other]

    eess.IV cs.AI cs.CV cs.LG

    Multi-Class Segmentation of Aortic Branches and Zones in Computed Tomography Angiography: The AortaSeg24 Challenge

    Authors: Muhammad Imran, Jonathan R. Krebs, Vishal Balaji Sivaraman, Teng Zhang, Amarjeet Kumar, Walker R. Ueland, Michael J. Fassler, Jinlong Huang, Xiao Sun, Lisheng Wang, Pengcheng Shi, Maximilian Rokuss, Michael Baumgartner, Yannick Kirchhof, Klaus H. Maier-Hein, Fabian Isensee, Shuolin Liu, Bing Han, Bong Thanh Nguyen, Dong-jin Shin, Park Ji-Woo, Mathew Choi, Kwang-Hyun Uhm, Sung-Jea Ko, Chanwoong Lee, et al. (38 additional authors not shown)

    Abstract: Multi-class segmentation of the aorta in computed tomography angiography (CTA) scans is essential for diagnosing and planning complex endovascular treatments for patients with aortic dissections. However, existing methods reduce aortic segmentation to a binary problem, limiting their ability to measure diameters across different branches and zones. Furthermore, no open-source dataset is currently…

    Submitted 7 February, 2025; originally announced February 2025.

  49. arXiv:2502.05091  [pdf, other]

    cs.CV

    DCFormer: Efficient 3D Vision-Language Modeling with Decomposed Convolutions

    Authors: Gorkem Can Ates, Yu Xin, Kuang Gong, Wei Shao

    Abstract: Vision-language models (VLMs) have been widely applied to 2D medical image analysis due to their ability to align visual and textual representations. However, extending VLMs to 3D imaging remains computationally challenging. Existing 3D VLMs often rely on Vision Transformers (ViTs), which are computationally expensive due to the quadratic complexity of self-attention, or on 3D convolutions, which…

    Submitted 25 April, 2025; v1 submitted 7 February, 2025; originally announced February 2025.

  50. arXiv:2502.03738  [pdf, other]

    cs.CV

    Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More

    Authors: Feng Wang, Yaodong Yu, Guoyizhe Wei, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie

    Abstract: Since the introduction of Vision Transformer (ViT), patchification has long been regarded as a de facto image tokenization approach for plain visual architectures. By compressing the spatial size of images, this approach can effectively shorten the token sequence and reduce the computational cost of ViT-like plain architectures. In this work, we aim to thoroughly examine the information loss cause…

    Submitted 5 February, 2025; originally announced February 2025.