Showing 1–50 of 13,444 results for author: Wang, Y

Searching in archive cs.
  1. arXiv:2507.01951  [pdf, ps, other]

    cs.LG cs.CL

    Test-Time Scaling with Reflective Generative Model

    Authors: Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie

    Abstract: We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3's performance via the self-supervised process reward model (SPRM). Through sharing the backbone network and using task-specific heads for next token prediction and process scoring respectively, SPRM successfully integrates the policy model and process reward model (PRM) into a unified interface without extra pr…

    Submitted 2 July, 2025; originally announced July 2025.

  2. arXiv:2507.01925  [pdf, ps, other]

    cs.RO

    A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

    Authors: Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, Zhiquan Qi, Yitao Liang, Yuanpei Chen, Yaodong Yang

    Abstract: The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation have sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: 70 pages, 5 figures

  3. arXiv:2507.01908  [pdf, ps, other]

    cs.CV

    Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning

    Authors: Qingdong He, Xueqin Chen, Chaoyi Wang, Yanjie Pan, Xiaobin Hu, Zhenye Gan, Yabiao Wang, Chengjie Wang, Xiangtai Li, Jiangning Zhang

    Abstract: Instruction-based image editing (IIE) has advanced rapidly with the success of diffusion models. However, existing efforts primarily focus on simple and explicit instructions to execute editing operations such as adding, deleting, moving, or swapping objects. They struggle to handle more complex implicit hypothetical instructions that require deeper reasoning to infer plausible visual changes and…

    Submitted 2 July, 2025; originally announced July 2025.

  4. arXiv:2507.01792  [pdf, ps, other]

    cs.CV

    FreeLoRA: Enabling Training-Free LoRA Fusion for Autoregressive Multi-Subject Personalization

    Authors: Peng Zheng, Ye Wang, Rui Ma, Zuxuan Wu

    Abstract: Subject-driven image generation plays a crucial role in applications such as virtual try-on and poster design. Existing approaches typically fine-tune pretrained generative models or apply LoRA-based adaptations for individual subjects. However, these methods struggle with multi-subject personalization, as combining independently adapted modules often requires complex re-tuning or joint optimizati…

    Submitted 2 July, 2025; originally announced July 2025.

  5. arXiv:2507.01653  [pdf, ps, other]

    cs.CV

    RobuSTereo: Robust Zero-Shot Stereo Matching under Adverse Weather

    Authors: Yuran Wang, Yingping Liang, Yutao Hu, Ying Fu

    Abstract: Learning-based stereo matching models struggle in adverse weather conditions due to the scarcity of corresponding training data and the challenges in extracting discriminative features from degraded images. These limitations significantly hinder zero-shot generalization to out-of-distribution weather conditions. In this paper, we propose RobuSTereo, a novel framework that enhances the zer…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Accepted by ICCV 2025

  6. arXiv:2507.01630  [pdf, ps, other]

    cs.CV cs.AI

    Prompt Guidance and Human Proximal Perception for HOT Prediction with Regional Joint Loss

    Authors: Yuxiao Wang, Yu Lei, Zhenao Wei, Weiying Xue, Xinyu Jiang, Nan Zhuang, Qi Liu

    Abstract: The task of Human-Object conTact (HOT) detection involves identifying the specific areas of the human body that are touching objects. Nevertheless, current models are restricted to just one type of image, often leading to too much segmentation in areas with little interaction, and struggling to maintain category consistency within specific regions. To tackle this issue, a HOT framework, termed \te…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Accepted by ICCV 2025

  7. arXiv:2507.01381  [pdf, ps, other]

    cs.LG cs.AI

    Distributional Soft Actor-Critic with Diffusion Policy

    Authors: Tong Liu, Yinuo Wang, Xujie Song, Wenjun Zou, Liangfa Chen, Likun Wang, Bin Shuai, Jingliang Duan, Shengbo Eben Li

    Abstract: Reinforcement learning has been proven to be highly effective in handling complex control tasks. Traditional methods typically use unimodal distributions, such as Gaussian distributions, to model the output of value distributions. However, unimodal distributions often cause bias in value function estimation, leading to poor algorithm performance. This paper proposes a distributional rei…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Accepted by IEEE ITSC 2025

  8. arXiv:2507.01335  [pdf, ps, other]

    cs.CL cs.AI

    LEDOM: An Open and Fundamental Reverse Language Model

    Authors: Xunjian Yin, Sitao Cheng, Yuxi Xie, Xinyu Hu, Li Lin, Xinyi Wang, Liangming Pan, William Yang Wang, Xiaojun Wan

    Abstract: We introduce LEDOM, the first purely reverse language model, trained autoregressively on 435B tokens with 2B and 7B parameter variants, which processes sequences in reverse temporal order through previous token prediction. For the first time, we present the reverse language model as a potential foundational model across general tasks, accompanied by a set of intriguing examples and insights. Based…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Work in progress

  9. arXiv:2507.01334  [pdf, ps, other]

    cs.CL

    Symbolic or Numerical? Understanding Physics Problem Solving in Reasoning LLMs

    Authors: Nifu Dan, Yujun Cai, Yiwei Wang

    Abstract: Navigating the complexities of physics reasoning has long been a difficult task for Large Language Models (LLMs), requiring a synthesis of profound conceptual understanding and adept problem-solving techniques. In this study, we investigate the application of advanced instruction-tuned reasoning models, such as Deepseek-R1, to address a diverse spectrum of physics problems curated from the challen…

    Submitted 1 July, 2025; originally announced July 2025.

  10. arXiv:2507.01281  [pdf, ps, other]

    cs.CL cs.AI

    Rethinking All Evidence: Enhancing Trustworthy Retrieval-Augmented Generation via Conflict-Driven Summarization

    Authors: Juan Chen, Baolong Bi, Wei Zhang, Jingyan Sui, Xiaofei Zhu, Yuanzhuo Wang, Lingrui Mei, Shenghua Liu

    Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating their parametric knowledge with external retrieved content. However, knowledge conflicts caused by internal inconsistencies or noisy retrieved content can severely undermine the generation reliability of RAG systems. In this work, we argue that LLMs should rethink all evidence, including both retrieved content…

    Submitted 1 July, 2025; originally announced July 2025.

  11. arXiv:2507.01041  [pdf, ps, other]

    cs.LG cs.AI

    Fast AI Model Splitting over Edge Networks

    Authors: Zuguang Li, Wen Wu, Shaohua Wu, Songge Zhang, Ye Wang, Xuemin Shen

    Abstract: Split learning (SL) has emerged as a computationally efficient approach for artificial intelligence (AI) model training, which can alleviate device-side computational workloads. However, complex AI model architectures pose high computational complexity to obtain the optimal model splitting. In this paper, we represent an arbitrary AI model as a directed acyclic graph (DAG), and then reformulate th…

    Submitted 23 June, 2025; originally announced July 2025.

    Comments: 13 pages, 14 figures

  12. arXiv:2507.01016  [pdf, ps, other]

    cs.RO cs.CV

    VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers

    Authors: Yating Wang, Haoyi Zhu, Mingyu Liu, Jiange Yang, Hao-Shu Fang, Tong He

    Abstract: In this paper, we introduce an innovative vector quantization based action tokenizer built upon the largest-scale action trajectory dataset to date, leveraging over 100 times more data than previous approaches. This extensive dataset enables our tokenizer to capture rich spatiotemporal dynamics, resulting in a model that not only accelerates inference but also generates smoother and more coherent…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Accepted by ICCV 2025

  13. arXiv:2507.01006  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    Authors: GLM-V Team: Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Boyan Shi, Changyu Pang, Chenhui Zhang, et al. (54 additional authors not shown)

    Abstract: We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the fi…

    Submitted 2 July, 2025; v1 submitted 1 July, 2025; originally announced July 2025.

  14. arXiv:2507.00992  [pdf, ps, other]

    cs.CV

    UniGlyph: Unified Segmentation-Conditioned Diffusion for Precise Visual Text Synthesis

    Authors: Yuanrui Wang, Cong Han, Yafei Li, Zhipeng Jin, Xiawei Li, SiNan Du, Wen Tao, Yi Yang, Shuanglong Li, Chun Yuan, Liu Lin

    Abstract: Text-to-image generation has greatly advanced content creation, yet accurately rendering visual text remains a key challenge due to blurred glyphs, semantic drift, and limited style control. Existing methods often rely on pre-rendered glyph images as conditions, but these struggle to retain original font styles and color cues, necessitating complex multi-branch designs that increase model overhead…

    Submitted 2 July, 2025; v1 submitted 1 July, 2025; originally announced July 2025.

    Comments: Accepted by ICCV 2025

  15. arXiv:2507.00949  [pdf, ps, other]

    cs.DC cs.AR

    How Fast Can Graph Computations Go on Fine-grained Parallel Architectures

    Authors: Yuqing Wang, Charles Colley, Brian Wheatman, Jiya Su, David F. Gleich, Andrew A. Chien

    Abstract: Large-scale graph problems are of critical and growing importance and historically parallel architectures have provided little support. In the spirit of co-design, we explore the question, How fast can graph computing go on a fine-grained architecture? We explore the possibilities of an architecture optimized for fine-grained parallelism, natural programming, and the irregularity and skew found in…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: 13 pages, 11 figures, 6 tables

  16. arXiv:2507.00891  [pdf, ps, other]

    cs.CL cs.AI

    MemeCMD: An Automatically Generated Chinese Multi-turn Dialogue Dataset with Contextually Retrieved Memes

    Authors: Yuheng Wang, Xianhe Tang, Pufeng Huang

    Abstract: Memes are widely used in online social interactions, providing vivid, intuitive, and often humorous means to express intentions and emotions. Existing dialogue datasets are predominantly limited to either manually annotated or pure-text conversations, lacking the expressiveness and contextual nuance that multimodal interactions provide. To address these challenges, we introduce MemeCMD, an automati…

    Submitted 1 July, 2025; originally announced July 2025.

  17. arXiv:2507.00880  [pdf, ps, other]

    cs.LG cs.AI

    NN-Former: Rethinking Graph Structure in Neural Architecture Representation

    Authors: Ruihan Xu, Haokui Zhang, Yaowei Wang, Wei Zeng, Shiliang Zhang

    Abstract: The growing use of deep learning necessitates efficient network design and deployment, making neural predictors vital for estimating attributes such as accuracy and latency. Recently, Graph Neural Networks (GNNs) and transformers have shown promising performance in representing neural architectures. However, each method has its disadvantages. GNNs lack the capabilities to represent compli…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Accepted to CVPR 2025. Code is available at https://github.com/XuRuihan/NNFormer

  18. arXiv:2507.00790  [pdf, ps, other]

    cs.CV cs.AI

    LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling

    Authors: Huaqiu Li, Yong Wang, Tongwen Huang, Hailang Huang, Haoqian Wang, Xiangxiang Chu

    Abstract: Unified image restoration is a significantly challenging task in low-level vision. Existing methods either make tailored designs for specific tasks, limiting their generalizability across various types of degradation, or rely on training with paired datasets, thereby suffering from closed-set constraints. To address these issues, we propose a novel, dataset-free, and unified approach through recur…

    Submitted 1 July, 2025; originally announced July 2025.

  19. arXiv:2507.00721  [pdf, ps, other]

    cs.CV

    UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement

    Authors: Xiao Zhang, Fei Wei, Yong Wang, Wenda Zhao, Feiyi Li, Xiangxiang Chu

    Abstract: Zero-shot domain adaptation (ZSDA) presents substantial challenges due to the lack of images in the target domain. Previous approaches leverage Vision-Language Models (VLMs) to tackle this challenge, exploiting their zero-shot learning capabilities. However, these methods primarily address domain distribution shifts and overlook the misalignment between the detection task and VLMs, which rely on m…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: ICCV 2025

  20. arXiv:2507.00699  [pdf, ps, other]

    cs.SE

    A Hierarchical and Evolvable Benchmark for Fine-Grained Code Instruction Following with Multi-Turn Feedback

    Authors: Guoliang Duan, Mingwei Liu, Yanlin Wang, Chong Wang, Xin Peng, Zibin Zheng

    Abstract: Large language models (LLMs) have advanced significantly in code generation, yet their ability to follow complex programming instructions with layered and diverse constraints remains underexplored. Existing benchmarks often prioritize functional correctness, overlooking the nuanced requirements found in real-world development. We introduce MultiCodeIF, a comprehensive benchmark designed to evaluat…

    Submitted 1 July, 2025; originally announced July 2025.

  21. arXiv:2507.00608  [pdf, ps, other]

    cs.CV

    De-Simplifying Pseudo Labels to Enhancing Domain Adaptive Object Detection

    Authors: Zehua Fu, Chenguang Liu, Yuyu Chen, Jiaqi Zhou, Qingjie Liu, Yunhong Wang

    Abstract: Despite its significant success, object detection in traffic and transportation scenarios requires time-consuming and laborious efforts in acquiring high-quality labeled data. Therefore, Unsupervised Domain Adaptation (UDA) for object detection has recently gained increasing research attention. UDA for object detection has been dominated by domain alignment methods, which achieve top performance.…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Accepted by IEEE Transactions on Intelligent Transportation Systems. 15 pages, 10 figures

  22. arXiv:2507.00501  [pdf, ps, other]

    cs.CV

    Laplace-Mamba: Laplace Frequency Prior-Guided Mamba-CNN Fusion Network for Image Dehazing

    Authors: Yongzhen Wang, Liangliang Chen, Bingwen Hu, Heng Liu, Xiao-Ping Zhang, Mingqiang Wei

    Abstract: Recent progress in image restoration has underscored Spatial State Models (SSMs) as powerful tools for modeling long-range dependencies, owing to their appealing linear complexity and computational efficiency. However, SSM-based approaches exhibit limitations in reconstructing localized structures and tend to be less effective when handling high-dimensional data, frequently resulting in suboptimal…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: 12 pages, 11 figures, 6 tables

  23. arXiv:2507.00435  [pdf, ps, other]

    cs.RO cs.AI cs.CV

    RoboEval: Where Robotic Manipulation Meets Structured and Scalable Evaluation

    Authors: Yi Ru Wang, Carter Ung, Grant Tannert, Jiafei Duan, Josephine Li, Amy Le, Rishabh Oswal, Markus Grotz, Wilbert Pumacay, Yuquan Deng, Ranjay Krishna, Dieter Fox, Siddhartha Srinivasa

    Abstract: We present RoboEval, a simulation benchmark and structured evaluation framework designed to reveal the limitations of current bimanual manipulation policies. While prior benchmarks report only binary task success, we show that such metrics often conceal critical weaknesses in policy behavior -- such as poor coordination, slipping during grasping, or asymmetric arm usage. RoboEval introduces a suit…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Project page: https://robo-eval.github.io

  24. arXiv:2507.00427  [pdf, ps, other]

    cs.DB

    Zero-Knowledge Verifiable Graph Query Evaluation via Expansion-Centric Operator Decomposition

    Authors: Hao Wu, Changzheng Wei, Yanhao Wang, Li Lin, Yilong Leng, Shiyu He, Minghao Zhao, Hanghang Wu, Ying Yan, Aoying Zhou

    Abstract: This paper investigates the feasibility of achieving zero-knowledge verifiability for graph databases, enabling database owners to cryptographically prove the query execution correctness without disclosing the underlying data. Although similar capabilities have been explored for relational databases, their implementation for graph databases presents unique challenges. This is mainly attributed to…

    Submitted 1 July, 2025; originally announced July 2025.

  25. arXiv:2507.00286  [pdf, ps, other]

    cs.HC cs.AI cs.ET

    Visual Privacy Management with Generative AI for Blind and Low-Vision People

    Authors: Tanusree Sharma, Yu-Yun Tseng, Lotus Zhang, Ayae Ide, Kelly Avery Mack, Leah Findlater, Danna Gurari, Yang Wang

    Abstract: Blind and low vision (BLV) individuals use Generative AI (GenAI) tools to interpret and manage visual content in their daily lives. While such tools can enhance the accessibility of visual content and so enable greater user independence, they also introduce complex challenges around visual privacy. In this paper, we investigate the current practices and future design preferences of blind and low v…

    Submitted 30 June, 2025; originally announced July 2025.

  26. arXiv:2507.00185  [pdf]

    eess.IV cs.AI cs.CV

    Multimodal, Multi-Disease Medical Imaging Foundation Model (MerMED-FM)

    Authors: Yang Zhou, Chrystie Wan Ning Quek, Jun Zhou, Yan Wang, Yang Bai, Yuhe Ke, Jie Yao, Laura Gutierrez, Zhen Ling Teo, Darren Shu Jeng Ting, Brian T. Soetikno, Christopher S. Nielsen, Tobias Elze, Zengxiang Li, Linh Le Dinh, Lionel Tim-Ee Cheng, Tran Nguyen Tuan Anh, Chee Leong Cheng, Tien Yin Wong, Nan Liu, Iain Beehuat Tan, Tony Kiat Hon Lim, Rick Siow Mong Goh, Yong Liu, Daniel Shu Wei Ting

    Abstract: Current artificial intelligence models for medical imaging are predominantly single modality and single disease. Attempts to create multimodal and multi-disease models have resulted in inconsistent clinical accuracy. Furthermore, training these models typically requires large, labour-intensive, well-labelled datasets. We developed MerMED-FM, a state-of-the-art multimodal, multi-specialty foundatio…

    Submitted 30 June, 2025; originally announced July 2025.

    Comments: 42 pages, 3 composite figures, 4 tables

  27. arXiv:2507.00008  [pdf, other]

    cs.AI cs.CV cs.HC

    DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning

    Authors: Hang Wu, Hongkai Chen, Yujun Cai, Chang Liu, Qingwen Ye, Ming-Hsuan Yang, Yiwei Wang

    Abstract: Grounding natural language queries in graphical user interfaces (GUIs) poses unique challenges due to the diversity of visual elements, spatial clutter, and the ambiguity of language. In this paper, we introduce DiMo-GUI, a training-free framework for GUI grounding that leverages two core strategies: dynamic visual grounding and modality-aware optimization. Instead of treating the GUI as a monolit…

    Submitted 11 June, 2025; originally announced July 2025.

    Comments: 8 pages, 6 figures

  28. arXiv:2506.24120  [pdf, ps, other]

    cs.LG cs.AI math.OC stat.ML

    Data Uniformity Improves Training Efficiency and More, with a Convergence Framework Beyond the NTK Regime

    Authors: Yuqing Wang, Shangding Gu

    Abstract: Data selection plays a crucial role in data-driven decision-making, including in large language models (LLMs), and is typically task-dependent. Properties such as data quality and diversity have been extensively studied and are known to enhance model performance. However, it remains unclear whether there exist other quantitative and general principles of data selection that can consistently improv…

    Submitted 30 June, 2025; originally announced June 2025.

  29. arXiv:2506.24081  [pdf, ps, other]

    quant-ph cs.AI cs.LG

    SQUASH: A SWAP-Based Quantum Attack to Sabotage Hybrid Quantum Neural Networks

    Authors: Rahul Kumar, Wenqi Wei, Ying Mao, Junaid Farooq, Ying Wang, Juntao Chen

    Abstract: We propose a circuit-level attack, SQUASH, a SWAP-Based Quantum Attack to sabotage Hybrid Quantum Neural Networks (HQNNs) for classification tasks. SQUASH is executed by inserting SWAP gate(s) into the variational quantum circuit of the victim HQNN. Unlike conventional noise-based or adversarial input attacks, SQUASH directly manipulates the circuit structure, leading to qubit misalignment and dis…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Keywords: Quantum Machine Learning, Hybrid Quantum Neural Networks, SWAP Test, Fidelity, Circuit-level Attack

  30. arXiv:2506.24063  [pdf, ps, other]

    cs.CV

    Continual Adaptation: Environment-Conditional Parameter Generation for Object Detection in Dynamic Scenarios

    Authors: Deng Li, Aming Wu, Yang Li, Yaowei Wang, Yahong Han

    Abstract: In practice, environments constantly change over time and space, posing significant challenges for object detectors trained based on a closed-set assumption, i.e., training and test data share the same distribution. To this end, continual test-time adaptation has attracted much attention, aiming to improve detectors' generalization by fine-tuning a few specific parameters, e.g., BatchNorm layers.…

    Submitted 30 June, 2025; originally announced June 2025.

  31. arXiv:2506.24044  [pdf, ps, other]

    cs.CV cs.AI cs.RO

    A Survey on Vision-Language-Action Models for Autonomous Driving

    Authors: Sicong Jiang, Zilin Huang, Kangan Qian, Ziang Luo, Tianze Zhu, Yang Zhong, Yihong Tang, Menglin Kong, Yunlong Wang, Siwen Jiao, Hao Ye, Zihao Sheng, Xin Zhao, Tuopu Wen, Zheng Fu, Sikai Chen, Kun Jiang, Diange Yang, Seongjin Choi, Lijun Sun

    Abstract: The rapid progress of multimodal large language models (MLLM) has paved the way for Vision-Language-Action (VLA) paradigms, which integrate visual perception, natural language understanding, and control within a single policy. Researchers in autonomous driving are actively adapting these methods to the vehicle domain. Such models promise autonomous vehicles that can interpret high-level instructio…

    Submitted 30 June, 2025; originally announced June 2025.

  32. arXiv:2506.24026  [pdf, ps, other]

    cs.AI

    Constructing Non-Markovian Decision Process via History Aggregator

    Authors: Yongyi Wang, Wenxin Li

    Abstract: In the domain of algorithmic decision-making, non-Markovian dynamics manifest as a significant impediment, especially for paradigms such as Reinforcement Learning (RL), thereby exerting far-reaching consequences on the advancement and effectiveness of the associated systems. Nevertheless, the existing benchmarks are deficient in comprehensively assessing the capacity of decision algorithms to hand…

    Submitted 30 June, 2025; originally announced June 2025.

  33. arXiv:2506.23854  [pdf, ps, other]

    cs.CV cs.GR

    HiNeuS: High-fidelity Neural Surface Mitigating Low-texture and Reflective Ambiguity

    Authors: Yida Wang, Xueyang Zhang, Kun Zhan, Peng Jia, Xianpeng Lang

    Abstract: Neural surface reconstruction faces persistent challenges in reconciling geometric fidelity with photometric consistency under complex scene conditions. We present HiNeuS, a unified framework that holistically addresses three core limitations in existing approaches: multi-view radiance inconsistency, missing keypoints in textureless regions, and structural degradation from over-enforced Eikonal co…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Published in International Conference on Computer Vision (ICCV) 2025

  34. arXiv:2506.23825  [pdf, ps, other]

    cs.CV

    Flash-VStream: Efficient Real-Time Understanding for Long Video Streams

    Authors: Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Xiaojie Jin

    Abstract: Benefiting from the advances in large language models and cross-modal alignment, existing multimodal large language models have achieved prominent performance in image and short video understanding. However, the understanding of long videos is still challenging, as their long-context nature results in significant computational and memory overhead. Most existing work treats long videos in the same…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Accepted by ICCV 2025

  35. arXiv:2506.23726  [pdf, ps, other]

    cs.LG cs.AI

    System-Embedded Diffusion Bridge Models

    Authors: Bartlomiej Sobieski, Matthew Tivnan, Yuang Wang, Siyeop Yoon, Pengfei Jin, Dufan Wu, Quanzheng Li, Przemyslaw Biecek

    Abstract: Solving inverse problems -- recovering signals from incomplete or noisy measurements -- is fundamental in science and engineering. Score-based generative models (SGMs) have recently emerged as a powerful framework for this task. Two main paradigms have formed: unsupervised approaches that adapt pretrained generative models to inverse problems, and supervised bridge methods that train stochastic pr…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Preprint

  36. arXiv:2506.23690  [pdf, ps, other]

    cs.CV

    SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation

    Authors: Shuai Tan, Biao Gong, Yujie Wei, Shiwei Zhang, Zhuoxin Liu, Dandan Zheng, Jingdong Chen, Yan Wang, Hao Ouyang, Kecheng Zheng, Yujun Shen

    Abstract: Diffusion-based video motion customization facilitates the acquisition of human motion representations from a few video samples, while achieving arbitrary subjects transfer through precise textual conditioning. Existing approaches often rely on semantic-level alignment, expecting the model to learn new motion concepts and combine them with other entities (e.g., "cats" or "dogs") to produce vis…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Project page: https://lucaria-academy.github.io/SynMotion/

  37. arXiv:2506.23643  [pdf, ps, other]

    cs.IR

    Act-With-Think: Chunk Auto-Regressive Modeling for Generative Recommendation

    Authors: Yifan Wang, Weinan Gan, Longtao Xiao, Jieming Zhu, Heng Chang, Haozhao Wang, Rui Zhang, Zhenhua Dong, Ruiming Tang, Ruixuan Li

    Abstract: Generative recommendation (GR) typically encodes behavioral or semantic aspects of item information into discrete tokens, leveraging the standard autoregressive (AR) generation paradigm to make predictions. However, existing methods tend to overlook their intrinsic relationship, that is, the semantic usually provides some reasonable explainability "why" for the behavior "…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: 9 pages, 2 figures

  38. arXiv:2506.23641  [pdf, ps, other]

    cs.CV cs.AI

    VAP-Diffusion: Enriching Descriptions with MLLMs for Enhanced Medical Image Generation

    Authors: Peng Huang, Junhu Fu, Bowen Guo, Zeju Li, Yuanyuan Wang, Yi Guo

    Abstract: As the appearance of medical images is influenced by multiple underlying factors, generative models require rich attribute information beyond labels to produce realistic and diverse images. For instance, generating an image of a skin lesion with specific patterns demands descriptions that go beyond diagnosis, such as shape, size, texture, and color. However, such detailed descriptions are not always…

    Submitted 30 June, 2025; originally announced June 2025.

  39. arXiv:2506.23590  [pdf, ps, other]

    cs.CV

    CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models

    Authors: Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Libo Qin, Ruihan Chen, Baohang Li, Kui Jiang, Yaowei Wang, Ting Liu, Bing Qin

    Abstract: Although Large Vision-Language Models (LVLMs) have demonstrated powerful capabilities in interpreting visual information, they frequently produce content that deviates from visual information, leading to object hallucination. To tackle this, recent works mostly depend on expensive manual annotations and training cost, or significantly increase inference time. In this work, we observe that LVLMs' a…

    Submitted 30 June, 2025; originally announced June 2025.

  40. arXiv:2506.23490  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    UltraTwin: Towards Cardiac Anatomical Twin Generation from Multi-view 2D Ultrasound

    Authors: Junxuan Yu, Yaofei Duan, Yuhao Huang, Yu Wang, Rongbo Ling, Weihao Luo, Ang Zhang, Jingxian Xu, Qiongying Ni, Yongsong Zhou, Binghan Li, Haoran Dou, Liping Liu, Yanfen Chu, Feng Geng, Zhe Sheng, Zhifeng Ding, Dingxin Zhang, Rui Huang, Yuhang Zhang, Xiaowei Xu, Tao Tan, Dong Ni, Zhongshan Gou, Xin Yang

    Abstract: Echocardiography is routine for cardiac examination. However, 2D ultrasound (US) struggles with accurate metric calculation and direct observation of 3D cardiac structures. Moreover, 3D US is limited by low resolution, small field of view and scarce availability in practice. Constructing the cardiac anatomical twin from 2D images is promising to provide precise treatment planning and clinical quan…

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: Accepted by MICCAI 2025

  41. arXiv:2506.23479  [pdf, ps, other]

    cs.CV

    Instant GaussianImage: A Generalizable and Self-Adaptive Image Representation via 2D Gaussian Splatting

    Authors: Zhaojie Zeng, Yuesong Wang, Chao Yang, Tao Guan, Lili Ju

    Abstract: Implicit Neural Representation (INR) has demonstrated remarkable advances in the field of image representation but demands substantial GPU resources. GaussianImage recently pioneered the use of Gaussian Splatting to mitigate this cost; however, the slow training process limits its practicality, and the fixed number of Gaussians per image limits its adaptability to varying information entropy. To a…

    Submitted 29 June, 2025; originally announced June 2025.

  42. arXiv:2506.23367  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties

    Authors: Paige Tuttösí, H. Henny Yeung, Yue Wang, Jean-Julien Aucouturier, Angelica Lim

    Abstract: We present the first text-to-speech (TTS) system tailored to second language (L2) speakers. We use duration differences between American English tense (longer) and lax (shorter) vowels to create a "clarity mode" for Matcha-TTS. Our perception studies showed that French-L1, English-L2 listeners had fewer (at least 9.15%) transcription errors when using our clarity mode, and found it more encouragin…

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: Accepted to ISCA Speech Synthesis Workshop, 2025

  43. arXiv:2506.23283  [pdf, ps, other]

    cs.CV

    MoMa: Modulating Mamba for Adapting Image Foundation Models to Video Recognition

    Authors: Yuhuan Yang, Chaofan Ma, Zhenjie Mao, Jiangchao Yao, Ya Zhang, Yanfeng Wang

    Abstract: Video understanding is a complex challenge that requires effective modeling of spatial-temporal dynamics. With the success of image foundation models (IFMs) in image understanding, recent approaches have explored parameter-efficient fine-tuning (PEFT) to adapt IFMs for video. However, most of these methods tend to process spatial and temporal information separately, which may fail to capture the f…

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: ICML 2025 paper

  44. arXiv:2506.23280  [pdf, ps, other]

    cs.LG

    BAPE: Learning an Explicit Bayes Classifier for Long-tailed Visual Recognition

    Authors: Chaoqun Du, Yulin Wang, Shiji Song, Gao Huang

    Abstract: Bayesian decision theory advocates the Bayes classifier as the optimal approach for minimizing the risk in machine learning problems. Current deep learning algorithms usually solve for the optimal classifier by implicitly estimating the posterior probabilities, e.g., by minimizing the Softmax cross-entropy loss. This simple methodology has been proven effective for meticulously balan…

    Submitted 29 June, 2025; originally announced June 2025.

  45. arXiv:2506.23257  [pdf, ps, other]

    cs.CV

    PCLVis: Visual Analytics of Process Communication Latency in Large-Scale Simulation

    Authors: Chongke Bi, Xin Gao, Baofeng Fu, Yuheng Zhao, Siming Chen, Ying Zhao, Yunhai Wang

    Abstract: Large-scale simulations on supercomputers have become important tools for users. However, their scalability remains a problem due to the huge communication cost among parallel processes. Most of the existing communication latency analysis methods rely on the physical link layer information, which is only available to administrators. In this paper, a framework called PCLVis is proposed to help gene…

    Submitted 29 June, 2025; originally announced June 2025.

  46. arXiv:2506.23152  [pdf, ps, other]

    cs.RO

    DexH2R: A Benchmark for Dynamic Dexterous Grasping in Human-to-Robot Handover

    Authors: Youzhuo Wang, Jiayi Ye, Chuyang Xiao, Yiming Zhong, Heng Tao, Hang Yu, Yumeng Liu, Jingyi Yu, Yuexin Ma

    Abstract: Handover between a human and a dexterous robotic hand is a fundamental yet challenging task in human-robot collaboration. It requires handling dynamic environments and a wide variety of objects and demands robust and adaptive grasping strategies. However, progress in developing effective dynamic dexterous grasping methods is limited by the absence of high-quality, real-world human-to-robot handove…

    Submitted 2 July, 2025; v1 submitted 29 June, 2025; originally announced June 2025.

    Comments: Accepted by ICCV 2025. Project page: https://dexh2r.github.io/

  47. arXiv:2506.23138  [pdf, ps, other]

    cs.CV

    VisualPrompter: Prompt Optimization with Visual Feedback for Text-to-Image Synthesis

    Authors: Shiyu Wu, Mingzhen Sun, Weining Wang, Yequan Wang, Jing Liu

    Abstract: Since there exists a notable gap between user-provided and model-preferred prompts, generating high-quality and satisfactory images using diffusion models often requires prompt engineering to optimize user inputs. Current studies on text-to-image prompt engineering can effectively enhance the style and aesthetics of generated images. However, they often neglect the semantic alignment between gener…

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: 12 pages, 5 figures

  48. arXiv:2506.22926  [pdf, ps, other]

    cs.HC cs.GR cs.MM

    Coordinated 2D-3D Visualization of Volumetric Medical Data in XR with Multimodal Interactions

    Authors: Qixuan Liu, Shi Qiu, Yinqiao Wang, Xiwen Wu, Kenneth Siu Ho Chok, Chi-Wing Fu, Pheng-Ann Heng

    Abstract: Volumetric medical imaging technologies produce detailed 3D representations of anatomical structures. However, effective medical data visualization and exploration pose significant challenges, especially for individuals with limited medical expertise. We introduce a novel XR-based system with two key innovations: (1) a coordinated visualization module integrating Multi-layered Multi-planar Reconst…

    Submitted 28 June, 2025; originally announced June 2025.

    Comments: IEEE VIS 2025 Short Paper

  49. arXiv:2506.22908  [pdf, ps, other]

    cs.CV

    Attention to Burstiness: Low-Rank Bilinear Prompt Tuning

    Authors: Yuzhu Wang, Manni Duan, Shu Kong

    Abstract: Visual Prompt Tuning (VPT) is a parameter-efficient fine-tuning technique that adapts a pre-trained vision Transformer (ViT) by learning a small set of parameters in the input space, known as prompts. In VPT, we uncover "burstiness" in the values arising from the interaction of image patch embeddings, and the key and query projectors within Transformer's self-attention module. Furthermore, the v…

    Submitted 28 June, 2025; originally announced June 2025.

    Comments: ICCV 2025

  50. arXiv:2506.22843  [pdf, ps, other]

    cs.CV

    AG-VPReID 2025: Aerial-Ground Video-based Person Re-identification Challenge Results

    Authors: Kien Nguyen, Clinton Fookes, Sridha Sridharan, Huy Nguyen, Feng Liu, Xiaoming Liu, Arun Ross, Dana Michalski, Tamás Endrei, Ivan DeAndres-Tame, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez, Javier Ortega-Garcia, Zijing Gong, Yuhao Wang, Xuehu Liu, Pingping Zhang, Md Rashidunnabi, Hugo Proença, Kailash A. Hambarde, Saeid Rezaei

    Abstract: Person re-identification (ReID) across aerial and ground vantage points has become crucial for large-scale surveillance and public safety applications. Although significant progress has been made in ground-only scenarios, bridging the aerial-ground domain gap remains a formidable challenge due to extreme viewpoint differences, scale variations, and occlusions. Building upon the achievements of the…

    Submitted 28 June, 2025; originally announced June 2025.