Showing 1–50 of 1,046 results for author: Yang*, D

Searching in archive cs.
  1. arXiv:2509.25958  [pdf, ps, other]

    cs.AI cs.CL

    RoRecomp: Enhancing Reasoning Efficiency via Rollout Response Recomposition in Reinforcement Learning

    Authors: Gang Li, Yulei Qin, Xiaoyu Tan, Dingkang Yang, Yuchen Shi, Zihan Xu, Xiang Li, Xing Sun, Ke Li

    Abstract: Reinforcement learning with verifiable rewards (RLVR) has proven effective in eliciting complex reasoning in large language models (LLMs). However, standard RLVR training often leads to excessively verbose processes (in reasoning tasks) and inefficient exploration trajectories (in agentic settings), as outcome-only rewards provide no incentive for efficiency and the high variance in response lengt…

    Submitted 30 September, 2025; originally announced September 2025.

  2. arXiv:2509.25655  [pdf, ps, other]

    cs.AI

    Landmark-Guided Knowledge for Vision-and-Language Navigation

    Authors: Dongsheng Yang, Meiling Zhu, Yinfeng Yu

    Abstract: Vision-and-language navigation is one of the core tasks in embodied intelligence, requiring an agent to autonomously navigate in an unfamiliar environment based on natural language instructions. However, existing methods often fail to match instructions with environmental information in complex scenarios, one reason being the lack of common-sense reasoning ability. This paper proposes a vision-and…

    Submitted 29 September, 2025; originally announced September 2025.

    Comments: Accepted for publication by International Conference on Intelligent Computing 2025

  3. arXiv:2509.24358  [pdf, ps, other]

    cs.CV cs.AI

    An Enhanced Pyramid Feature Network Based on Long-Range Dependencies for Multi-Organ Medical Image Segmentation

    Authors: Dayu Tan, Cheng Kong, Yansen Su, Hai Chen, Dongliang Yang, Junfeng Xia, Chunhou Zheng

    Abstract: In the field of multi-organ medical image segmentation, recent methods frequently employ Transformers to capture long-range dependencies from image features. However, these methods overlook the high computational cost of Transformers and their deficiencies in extracting local detailed information. To address high computational costs and inadequate local detail information, we reassess the design o…

    Submitted 29 September, 2025; originally announced September 2025.

  4. arXiv:2509.23719  [pdf, ps, other]

    cs.CV

    PD-Diag-Net: Clinical-Priors guided Network on Brain MRI for Auxiliary Diagnosis of Parkinson's Disease

    Authors: Shuai Shao, Shu Jiang, Shiyuan Zhao, Di Yang, Yan Wang, Yutong Bai, Jianguo Zhang, Jiangtao Wang

    Abstract: Parkinson's disease (PD) is a common neurodegenerative disorder that severely diminishes patients' quality of life. Its global prevalence has increased markedly in recent decades. Current diagnostic workflows are complex and heavily reliant on neurologists' expertise, often resulting in delays in early detection and missed opportunities for timely intervention. To address these issues, we propose…

    Submitted 28 September, 2025; originally announced September 2025.

  5. arXiv:2509.23683  [pdf, ps, other]

    cs.LG

    Decentralized Dynamic Cooperation of Personalized Models for Federated Continual Learning

    Authors: Danni Yang, Zhikang Chen, Sen Cui, Mengyue Yang, Ding Li, Abudukelimu Wuerkaixi, Haoxuan Li, Jinke Ren, Mingming Gong

    Abstract: Federated continual learning (FCL) has garnered increasing attention for its ability to support distributed computation in environments with evolving data distributions. However, the emergence of new tasks introduces both temporal and cross-client shifts, making catastrophic forgetting a critical challenge. Most existing works aggregate knowledge from clients into a global model, which may not enh…

    Submitted 28 September, 2025; originally announced September 2025.

  6. arXiv:2509.23105  [pdf, ps, other]

    cs.CV

    Towards Comprehensive Interactive Change Understanding in Remote Sensing: A Large-scale Dataset and Dual-granularity Enhanced VLM

    Authors: Junxiao Xue, Quan Deng, Xuecheng Wu, Kelu Yao, Xinyi Yin, Fei Yu, Wei Zhou, Yanfei Zhong, Yang Liu, Dingkang Yang

    Abstract: Remote sensing change understanding (RSCU) is essential for analyzing remote sensing images and understanding how human activities affect the environment. However, existing datasets lack deep understanding and interactions in the diverse change captioning, counting, and localization tasks. To tackle these gaps, we construct ChangeIMTI, a new large-scale interactive multi-task instruction dataset t…

    Submitted 27 September, 2025; originally announced September 2025.

  7. arXiv:2509.21950  [pdf, ps, other]

    cs.CV

    Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach

    Authors: Daiqing Wu, Dongbao Yang, Sicheng Zhao, Can Ma, Yu Zhou

    Abstract: Recently, Multimodal Large Language Models (MLLMs) have achieved exceptional performance across diverse tasks, continually surpassing previous expectations regarding their capabilities. Nevertheless, their proficiency in perceiving emotions from images remains debated, with studies yielding divergent results in zero-shot scenarios. We argue that this inconsistency stems partly from constraints in…

    Submitted 26 September, 2025; originally announced September 2025.

  8. arXiv:2509.21623  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Oja's Rule

    Authors: Yuxuan Zhu, David H. Yang, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Pin-Yu Chen

    Abstract: The expanding long-context capabilities of large language models are constrained by a significant memory bottleneck: the key-value (KV) cache required for autoregressive generation. This bottleneck is substantial; for instance, a Llama-3.1-8B model processing a 32K-token prompt at a batch size of 4 requires approximately 16GB for its KV cache, a size exceeding the model's weights. While KV-cache c…

    Submitted 25 September, 2025; originally announced September 2025.

  9. arXiv:2509.21371  [pdf, ps, other]

    cs.IR cs.AI cs.CL cs.LG

    ReGeS: Reciprocal Retrieval-Generation Synergy for Conversational Recommender Systems

    Authors: Dayu Yang, Hui Fang

    Abstract: Connecting conversation with external domain knowledge is vital for conversational recommender systems (CRS) to correctly understand user preferences. However, existing solutions either require domain-specific engineering, which limits flexibility, or rely solely on large language models, which increases the risk of hallucination. While Retrieval-Augmented Generation (RAG) holds promise, its naive…

    Submitted 22 September, 2025; originally announced September 2025.

    Comments: Accepted by WISE 2025: 26th International Web Information Systems Engineering conference. Our code is publicly available at the link: https://github.com/dayuyang1999/ReGeS

  10. arXiv:2509.21033  [pdf, ps, other]

    cs.SD cs.AI

    SupCLAP: Controlling Optimization Trajectory Drift in Audio-Text Contrastive Learning with Support Vector Regularization

    Authors: Jiehui Luo, Yuguo Yin, Yuxin Xie, Jinghan Ru, Xianwei Zhuang, Minghua He, Aofan Liu, Zihan Xiong, Dongchao Yang

    Abstract: Contrastive language-audio pretraining, which aims to unify multimodal representations in a shared embedding space, serves as a cornerstone for building a wide range of applications, from cross-modal retrieval to cutting-edge multimodal large language models. However, we find that the perpendicular component of the pushing force from negative samples in contrastive learning is a double-edged sword…

    Submitted 25 September, 2025; originally announced September 2025.

  11. arXiv:2509.20843  [pdf, ps, other]

    cs.RO

    MTRDrive: Memory-Tool Synergistic Reasoning for Robust Autonomous Driving in Corner Cases

    Authors: Ziang Luo, Kangan Qian, Jiahua Wang, Yuechen Luo, Jinyu Miao, Zheng Fu, Yunlong Wang, Sicong Jiang, Zilin Huang, Yifei Hu, Yuhao Yang, Hao Ye, Mengmeng Yang, Xiaojian Dong, Kun Jiang, Diange Yang

    Abstract: Vision-Language Models (VLMs) have demonstrated significant potential for end-to-end autonomous driving, yet a substantial gap remains between their current capabilities and the reliability necessary for real-world deployment. A critical challenge is their fragility, characterized by hallucinations and poor generalization in out-of-distribution (OOD) scenarios. To bridge this gap, we introduce MTRD…

    Submitted 25 September, 2025; originally announced September 2025.

    Comments: 8 pages

  12. arXiv:2509.19314  [pdf, ps, other]

    cs.CL cs.AI cs.CY

    Automated Item Neutralization for Non-Cognitive Scales: A Large Language Model Approach to Reducing Social-Desirability Bias

    Authors: Sirui Wu, Daijin Yang

    Abstract: This study evaluates item neutralization assisted by the large language model (LLM) to reduce social desirability bias in personality assessment. GPT-o3 was used to rewrite the International Personality Item Pool Big Five Measure (IPIP-BFM-50), and 203 participants completed either the original or neutralized form along with the Marlowe-Crowne Social Desirability Scale. The results showed preserve…

    Submitted 9 September, 2025; originally announced September 2025.

    Comments: Accepted for publication in NCME-AIME 2025

  13. arXiv:2509.18774  [pdf, ps, other]

    cs.IT

    A Two-Dimensional Super-Resolution Method for Reconfigurable Intelligent Surface-Assisted Near-Field Localization

    Authors: Feng Xi, Dehui Yang

    Abstract: Reconfigurable intelligent surface (RIS)-aided localization in the radiating near-field requires range-aware spherical-wave models, which inherently couple angles and ranges and thus complicate accurate 3D positioning. Using the Fresnel approximation, we show that the RIS response can be expressed as the element-wise product of a 2D far-field steering vector and a range-dependent quadratic-phase c…

    Submitted 23 September, 2025; originally announced September 2025.

  14. arXiv:2509.18752  [pdf, ps, other]

    cs.IT

    A Convex Demixing Approach for Hybrid-Field Channel Estimation of XL-MIMO Systems via Atomic Norm Minimization

    Authors: Dehui Yang, Feng Xi, Yanxian Zhu

    Abstract: Channel estimation is a critical task in extremely large-scale multiple-input multiple-output (XL-MIMO) systems for 6G wireless communications. A hybrid-field channel model effectively characterizes the mixed far-field and near-field scattering components in practical XL-MIMO systems. In this paper, we propose a convex demixing approach for hybrid-field channel estimation within the atomic norm mi…

    Submitted 23 September, 2025; originally announced September 2025.

  15. arXiv:2509.16679  [pdf, ps, other]

    cs.CL

    Reinforcement Learning Meets Large Language Models: A Survey of Advancements and Applications Across the LLM Lifecycle

    Authors: Keliang Liu, Dingkang Yang, Ziyun Qian, Weijie Yin, Yuchi Wang, Hongsheng Li, Jun Liu, Peng Zhai, Yang Liu, Lihua Zhang

    Abstract: In recent years, training methods centered on Reinforcement Learning (RL) have markedly enhanced the reasoning and alignment performance of Large Language Models (LLMs), particularly in understanding human intents, following user instructions, and bolstering inferential strength. Although existing surveys offer overviews of RL augmented LLMs, their scope is often limited, failing to provide a comp…

    Submitted 20 September, 2025; originally announced September 2025.

    Comments: A Survey of Reinforcement Learning for Large Language Models

  16. arXiv:2509.15156  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    Leveraging Geometric Visual Illusions as Perceptual Inductive Biases for Vision Models

    Authors: Haobo Yang, Minghao Guo, Dequan Yang, Wenyu Wang

    Abstract: Contemporary deep learning models have achieved impressive performance in image classification by primarily leveraging statistical regularities within large datasets, but they rarely incorporate structured insights drawn directly from perceptual psychology. To explore the potential of perceptually motivated inductive biases, we propose integrating classic geometric visual illusions well-studied ph…

    Submitted 18 September, 2025; originally announced September 2025.

  17. arXiv:2509.14033  [pdf, ps, other]

    cs.CV

    SAIL-VL2 Technical Report

    Authors: Weijie Yin, Yongjie Ye, Fangxun Shu, Yue Liao, Zijian Kang, Hongyuan Dong, Haiyang Yu, Dingkang Yang, Jiacong Wang, Han Wang, Wenzhuo Liu, Xiao Liang, Shuicheng Yan, Chao Feng

    Abstract: We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Its effectiveness is driven…

    Submitted 18 September, 2025; v1 submitted 17 September, 2025; originally announced September 2025.

    Comments: Technical Report

  18. arXiv:2509.13323  [pdf, ps, other]

    cs.HC econ.GN

    AI Behavioral Science

    Authors: Matthew O. Jackson, Qiaozhu Me, Stephanie W. Wang, Yutong Xie, Walter Yuan, Seth Benzell, Erik Brynjolfsson, Colin F. Camerer, James Evans, Brian Jabarian, Jon Kleinberg, Juanjuan Meng, Sendhil Mullainathan, Asuman Ozdaglar, Thomas Pfeiffer, Moshe Tennenholtz, Robb Willer, Diyi Yang, Teng Ye

    Abstract: We discuss the three main areas comprising the new and emerging field of "AI Behavioral Science". This includes not only how AI can enhance research in the behavioral sciences, but also how the behavioral sciences can be used to study and better design AI and to understand how the world will change as AI and humans interact in increasingly layered and complex ways.

    Submitted 17 August, 2025; originally announced September 2025.

  19. arXiv:2509.11713  [pdf, ps, other]

    cs.LG cs.NI

    Beyond Regularity: Modeling Chaotic Mobility Patterns for Next Location Prediction

    Authors: Yuqian Wu, Yuhong Peng, Jiapeng Yu, Xiangyu Liu, Zeting Yan, Kang Lin, Weifeng Su, Bingqing Qu, Raymond Lee, Dingqi Yang

    Abstract: Next location prediction is a key task in human mobility analysis, crucial for applications like smart city resource allocation and personalized navigation services. However, existing methods face two significant challenges: first, they fail to address the dynamic imbalance between periodic and chaotic mobile patterns, leading to inadequate adaptation over sparse trajectories; second, they underut…

    Submitted 15 September, 2025; originally announced September 2025.

    Comments: 12 pages, 5 figures

  20. arXiv:2509.11134  [pdf, ps, other]

    cs.DC

    GFS: A Preemption-aware Scheduling Framework for GPU Clusters with Predictive Spot Instance Management

    Authors: Jiaang Duan, Shenglin Xu, Shiyou Qian, Dingyu Yang, Kangjin Wang, Chenzhi Liao, Yinghao Yu, Qin Hua, Hanwen Hu, Qi Wang, Wenchao Wu, Dongqing Bao, Tianyu Lu, Jian Cao, Guangtao Xue, Guodong Yang, Liping Zhang, Gang Chen

    Abstract: The surge in large language models (LLMs) has fundamentally reshaped the landscape of GPU usage patterns, creating an urgent need for more efficient management strategies. While cloud providers employ spot instances to reduce costs for low-priority (LP) tasks, existing schedulers still grapple with high eviction rates and lengthy queuing times. To address these limitations, we present GFS, a novel…

    Submitted 14 September, 2025; originally announced September 2025.

    Comments: This paper has been accepted to the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2026)

  21. arXiv:2509.09713  [pdf, ps, other]

    cs.CL cs.AI

    HANRAG: Heuristic Accurate Noise-resistant Retrieval-Augmented Generation for Multi-hop Question Answering

    Authors: Duolin Sun, Dan Yang, Yue Shen, Yihan Jiao, Zhehao Tan, Jie Feng, Lianzhen Zhong, Jian Wang, Peng Wei, Jinjie Gu

    Abstract: The Retrieval-Augmented Generation (RAG) approach enhances question-answering systems and dialogue generation tasks by integrating information retrieval (IR) technologies with large language models (LLMs). This strategy, which retrieves information from external knowledge bases to bolster the response capabilities of generative models, has achieved certain successes. However, current RAG methods s…

    Submitted 8 September, 2025; originally announced September 2025.

  22. arXiv:2509.03421  [pdf]

    eess.IV cs.CV

    Generalist versus Specialist Vision Foundation Models for Ocular Disease and Oculomics

    Authors: Yukun Zhou, Paul Nderitu, Jocelyn Hui Lin Goh, Justin Engelmann, Siegfried K. Wagner, Anran Ran, Hongyang Jiang, Lie Ju, Ke Zou, Sahana Srinivasan, Hyunmin Kim, Takahiro Ninomiya, Zheyuan Wang, Gabriel Dawei Yang, Eden Ruffell, Dominic Williamson, Rui Santos, Gabor Mark Somfai, Carol Y. Cheung, Tien Yin Wong, Daniel C. Alexander, Yih Chung Tham, Pearse A. Keane

    Abstract: Medical foundation models, pre-trained with large-scale clinical data, demonstrate strong performance in diverse clinically relevant applications. RETFound, trained on nearly one million retinal images, exemplifies this approach in applications with retinal images. However, the emergence of increasingly powerful and multifold larger generalist foundation models such as DINOv2 and DINOv3 raises the…

    Submitted 3 September, 2025; originally announced September 2025.

    Comments: 39 pages, 8 Figures

    ACM Class: J.3; I.2.10

  23. arXiv:2509.03419  [pdf, ps, other]

    cs.CL

    Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges

    Authors: Weiyuan Li, Xintao Wang, Siyu Yuan, Rui Xu, Jiangjie Chen, Qingqing Dong, Yanghua Xiao, Deqing Yang

    Abstract: As large language models (LLMs) grow more capable, they face increasingly diverse and complex tasks, making reliable evaluation challenging. The paradigm of LLMs as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings. Their reliability in complex tasks--where multi-faceted rubrics, unstructured reference answers, and nuanced criteria are critical--remains…

    Submitted 3 September, 2025; originally announced September 2025.

    Comments: 8 pages, 4 figures, conference

  24. arXiv:2509.02993  [pdf, ps, other]

    cs.CV

    SPENet: Self-guided Prototype Enhancement Network for Few-shot Medical Image Segmentation

    Authors: Chao Fan, Xibin Jia, Anqi Xiao, Hongyuan Yu, Zhenghan Yang, Dawei Yang, Hui Xu, Yan Huang, Liang Wang

    Abstract: Few-Shot Medical Image Segmentation (FSMIS) aims to segment novel classes of medical objects using only a few labeled images. Prototype-based methods have made significant progress in addressing FSMIS. However, they typically generate a single global prototype for the support image to match with the query image, overlooking intra-class variations. To address this issue, we propose a Self-guided Pr…

    Submitted 2 September, 2025; originally announced September 2025.

    Comments: Accepted by MICCAI2025

  25. arXiv:2509.01080  [pdf, ps, other]

    cs.CV

    SpectMamba: Integrating Frequency and State Space Models for Enhanced Medical Image Detection

    Authors: Yao Wang, Dong Yang, Zhi Qiao, Wenjian Huang, Liuzhi Yang, Zhen Qian

    Abstract: Abnormality detection in medical imaging is a critical task requiring both high efficiency and accuracy to support effective diagnosis. While convolutional neural networks (CNNs) and Transformer-based models are widely used, both face intrinsic challenges: CNNs have limited receptive fields, restricting their ability to capture broad contextual information, and Transformers encounter prohibitive c…

    Submitted 31 August, 2025; originally announced September 2025.

  26. arXiv:2508.20471  [pdf, ps, other]

    cs.CV

    Realistic and Controllable 3D Gaussian-Guided Object Editing for Driving Video Generation

    Authors: Jiusi Li, Jackson Jiang, Jinyu Miao, Miao Long, Tuopu Wen, Peijin Jia, Shengxiang Liu, Chunlei Yu, Maolin Liu, Yuzhan Cai, Kun Jiang, Mengmeng Yang, Diange Yang

    Abstract: Corner cases are crucial for training and validating autonomous driving systems, yet collecting them from the real world is often costly and hazardous. Editing objects within captured sensor data offers an effective alternative for generating diverse scenarios, commonly achieved through 3D Gaussian Splatting or image generative models. However, these approaches often suffer from limited visual fid…

    Submitted 28 August, 2025; originally announced August 2025.

  27. arXiv:2508.19895  [pdf, ps, other]

    cs.CV

    PersonaAnimator: Personalized Motion Transfer from Unconstrained Videos

    Authors: Ziyun Qian, Runyu Xiao, Shuyuan Tu, Wei Xue, Dingkang Yang, Mingcheng Li, Dongliang Kou, Minghao Han, Zizhi Chen, Lihua Zhang

    Abstract: Recent advances in motion generation show remarkable progress. However, several limitations remain: (1) Existing pose-guided character motion transfer methods merely replicate motion without learning its style characteristics, resulting in inexpressive characters. (2) Motion style transfer methods rely heavily on motion capture data, which is difficult to obtain. (3) Generated motions sometimes vi…

    Submitted 27 August, 2025; originally announced August 2025.

  28. arXiv:2508.19227  [pdf, ps, other]

    cs.CL cs.AI cs.HC

    Generative Interfaces for Language Models

    Authors: Jiaqi Chen, Yanzhe Zhang, Yutong Zhang, Yijia Shao, Diyi Yang

    Abstract: Large language models (LLMs) are increasingly seen as assistants, copilots, and consultants, capable of supporting a wide range of tasks through natural conversation. However, most systems remain constrained by a linear request-response format that often makes interactions inefficient in multi-turn, information-dense, and exploratory tasks. To address these limitations, we propose Generative Inter…

    Submitted 26 August, 2025; originally announced August 2025.

    Comments: Preprint

  29. arXiv:2508.19182  [pdf, ps, other]

    cs.CV

    SoccerNet 2025 Challenges Results

    Authors: Silvio Giancola, Anthony Cioppa, Marc Gutiérrez-Pérez, Jan Held, Carlos Hinojosa, Victor Joos, Arnaud Leduc, Floriane Magera, Karen Sanchez, Vladimir Somers, Artur Xarles, Antonio Agudo, Alexandre Alahi, Olivier Barnich, Albert Clapés, Christophe De Vleeschouwer, Sergio Escalera, Bernard Ghanem, Thomas B. Moeslund, Marc Van Droogenbroeck, Tomoki Abe, Saad Alotaibi, Faisal Altawijri, Steven Araujo, Xiang Bai, et al. (93 additional authors not shown)

    Abstract: The SoccerNet 2025 Challenges mark the fifth annual edition of the SoccerNet open benchmarking effort, dedicated to advancing computer vision research in football video understanding. This year's challenges span four vision-based tasks: (1) Team Ball Action Spotting, focused on detecting ball-related actions in football broadcasts and assigning actions to teams; (2) Monocular Depth Estimation, tar…

    Submitted 26 August, 2025; originally announced August 2025.

  30. arXiv:2508.18791  [pdf, ps, other]

    cs.CL

    LaTeXTrans: Structured LaTeX Translation with Multi-Agent Coordination

    Authors: Ziming Zhu, Chenglong Wang, Shunjie Xing, Yifu Huo, Fengning Tian, Quan Du, Di Yang, Chunliang Zhang, Tong Xiao, Jingbo Zhu

    Abstract: Despite the remarkable progress of modern machine translation (MT) systems on general-domain texts, translating structured LaTeX-formatted documents remains a significant challenge. These documents typically interleave natural language with domain-specific syntax, such as mathematical equations, tables, figures, and cross-references, all of which must be accurately preserved to maintain semantic i…

    Submitted 26 August, 2025; originally announced August 2025.

  31. arXiv:2508.18734  [pdf, ps, other]

    cs.CV cs.AI cs.MM eess.AS eess.SP

    Improving Noise Robust Audio-Visual Speech Recognition via Router-Gated Cross-Modal Feature Fusion

    Authors: DongHoon Lim, YoungChae Kim, Dong-Hyun Kim, Da-Hee Yang, Joon-Hyuk Chang

    Abstract: Robust audio-visual speech recognition (AVSR) in noisy environments remains challenging, as existing systems struggle to estimate audio reliability and dynamically adjust modality reliance. We propose router-gated cross-modal feature fusion, a novel AVSR framework that adaptively reweights audio and visual features based on token-level acoustic corruption scores. Using an audio-visual feature fusi…

    Submitted 26 August, 2025; originally announced August 2025.

    Comments: Accepted to IEEE ASRU 2025

  32. arXiv:2508.17863  [pdf, ps, other]

    cs.CL cs.SD

    Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs

    Authors: Dingdong Wang, Junan Li, Mingyu Cui, Dongchao Yang, Xueyuan Chen, Helen Meng

    Abstract: With the rise of Speech Large Language Models (SpeechLLMs), two dominant approaches have emerged for speech processing: discrete tokens and continuous features. Each approach has demonstrated strong capabilities in audio-related processing tasks. However, the performance gap between these two paradigms has not been thoroughly explored. To address this gap, we present a fair comparison of self-supe…

    Submitted 25 August, 2025; originally announced August 2025.

    Comments: Accepted to EMNLP 2025 Main Conference

  33. arXiv:2508.16674  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    MedRepBench: A Comprehensive Benchmark for Medical Report Interpretation

    Authors: Fangxin Shang, Yuan Xia, Dalu Yang, Yahui Wang, Binglin Yang

    Abstract: Medical report interpretation plays a crucial role in healthcare, enabling both patient-facing explanations and effective information flow across clinical systems. While recent vision-language models (VLMs) and large language models (LLMs) have demonstrated general document understanding capabilities, there remains a lack of standardized benchmarks to assess structured interpretation quality in me…

    Submitted 21 August, 2025; originally announced August 2025.

  34. arXiv:2508.16620  [pdf, ps, other]

    cs.LG cs.AI

    STRelay: A Universal Spatio-Temporal Relaying Framework for Location Prediction with Future Spatiotemporal Contexts

    Authors: Bangchao Deng, Lianhua Ji, Chunhua Chen, Xin Jing, Ling Ding, Bingqing QU, Pengyang Wang, Dingqi Yang

    Abstract: Next location prediction is a critical task in human mobility modeling, enabling applications like travel planning and urban mobility management. Existing methods mainly rely on historical spatiotemporal trajectory data to train sequence models that directly forecast future locations. However, they often overlook the importance of the future spatiotemporal contexts, which are highly informative fo…

    Submitted 13 August, 2025; originally announced August 2025.

  35. arXiv:2508.14090  [pdf, ps, other]

    cs.CL cs.AI

    DLLMQuant: Quantizing Diffusion-based Large Language Models

    Authors: Chen Xu, Dawei Yang

    Abstract: Diffusion-based large language models (DLLMs) have shown promise for non-autoregressive text generation, but their deployment is constrained by large model sizes and heavy computational costs. Post-training quantization (PTQ), a widely used method for compressing and accelerating Large Language Models (LLMs), suffers from severe accuracy degradation and reduced generalization performance when dire…

    Submitted 25 August, 2025; v1 submitted 14 August, 2025; originally announced August 2025.

    Comments: 12 pages, 6 figures

  36. arXiv:2508.13826  [pdf, ps, other]

    eess.IV cs.CV

    Latent Interpolation Learning Using Diffusion Models for Cardiac Volume Reconstruction

    Authors: Niklas Bubeck, Suprosanna Shit, Chen Chen, Can Zhao, Pengfei Guo, Dong Yang, Georg Zitzlsberger, Daguang Xu, Bernhard Kainz, Daniel Rueckert, Jiazhen Pan

    Abstract: Cardiac Magnetic Resonance (CMR) imaging is a critical tool for diagnosing and managing cardiovascular disease, yet its utility is often limited by the sparse acquisition of 2D short-axis slices, resulting in incomplete volumetric information. Accurate 3D reconstruction from these sparse slices is essential for comprehensive cardiac assessment, but existing methods face challenges, including relia…

    Submitted 21 August, 2025; v1 submitted 19 August, 2025; originally announced August 2025.

  37. arXiv:2508.13517  [pdf]

    cs.IR cs.AI cs.LG cs.SI

    Heterogeneous Influence Maximization in User Recommendation

    Authors: Hongru Hou, Jiachen Sun, Wenqing Lin, Wendong Bi, Xiangrong Wang, Deqing Yang

    Abstract: User recommendation systems enhance user engagement by encouraging users to act as inviters to interact with other users (invitees), potentially fostering information propagation. Conventional recommendation methods typically focus on modeling interaction willingness. Influence-Maximization (IM) methods focus on identifying a set of users to maximize the information propagation. However, existing…

    Submitted 19 August, 2025; originally announced August 2025.

    Comments: Accepted in CIKM 2025

  38. arXiv:2508.10880  [pdf, ps, other]

    cs.CR cs.AI cs.CL

    Searching for Privacy Risks in LLM Agents via Simulation

    Authors: Yanzhe Zhang, Diyi Yang

    Abstract: The widespread deployment of LLM-based agents is likely to introduce a critical privacy threat: malicious agents that proactively engage others in multi-turn interactions to extract sensitive information. However, the evolving nature of such dynamic dialogues makes it challenging to anticipate emerging vulnerabilities and design effective defenses. To tackle this problem, we present a search-based…

    Submitted 25 September, 2025; v1 submitted 14 August, 2025; originally announced August 2025.

    Comments: Preprint

  39. arXiv:2508.09959  [pdf, ps, other]

    cs.CV

    LIA-X: Interpretable Latent Portrait Animator

    Authors: Yaohui Wang, Di Yang, Xinyuan Chen, Francois Bremond, Yu Qiao, Antitza Dantcheva

    Abstract: We introduce LIA-X, a novel interpretable portrait animator designed to transfer facial dynamics from a driving video to a source portrait with fine-grained control. LIA-X is an autoencoder that models motion transfer as a linear navigation of motion codes in latent space. Crucially, it incorporates a novel Sparse Motion Dictionary that enables the model to disentangle facial dynamics into interpr…

    Submitted 13 August, 2025; originally announced August 2025.

    Comments: Project Page: https://wyhsirius.github.io/LIA-X-project/

  40. arXiv:2508.09670  [pdf, ps, other]

    cs.AI

    MEML-GRPO: Heterogeneous Multi-Expert Mutual Learning for RLVR Advancement

    Authors: Weitao Jia, Jinghui Lu, Haiyang Yu, Siqi Wang, Guozhi Tang, An-Lan Wang, Weijie Yin, Dingkang Yang, Yuxiang Nie, Bin Shan, Hao Feng, Irene Li, Kun Yang, Han Wang, Jingqun Tang, Teng Fu, Changhong Jin, Chao Feng, Xiaohui Lv, Can Huang

    Abstract: Recent advances demonstrate that reinforcement learning with verifiable rewards (RLVR) significantly enhances the reasoning capabilities of large language models (LLMs). However, standard RLVR faces challenges with reward sparsity, where zero rewards from consistently incorrect candidate answers provide no learning signal, particularly in challenging tasks. To address this, we propose Multi-Expert…

    Submitted 13 August, 2025; originally announced August 2025.

  41. arXiv:2508.09158  [pdf, ps, other]

    cs.LG cs.AI

    EvaDrive: Evolutionary Adversarial Policy Optimization for End-to-End Autonomous Driving

    Authors: Siwen Jiao, Kangan Qian, Hao Ye, Yang Zhong, Ziang Luo, Sicong Jiang, Zilin Huang, Yangyi Fang, Jinyu Miao, Zheng Fu, Yunlong Wang, Kun Jiang, Diange Yang, Rui Fan, Baoyun Peng

    Abstract: Autonomous driving faces significant challenges in achieving human-like iterative decision-making, which continuously generates, evaluates, and refines trajectory proposals. Current generation-evaluation frameworks isolate trajectory generation from quality assessment, preventing iterative refinement essential for planning, while reinforcement learning methods collapse multi-dimensional preference…

    Submitted 14 August, 2025; v1 submitted 5 August, 2025; originally announced August 2025.

  42. arXiv:2508.09123  [pdf, ps, other]

    cs.AI cs.CV

    OpenCUA: Open Foundations for Computer-Use Agents

    Authors: Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Boyuan Zheng, Peihang Li, Fangyu Lei, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Jiarui Hu, Yuyan Wang, Jixuan Chen, et al. (17 additional authors not shown)

    Abstract: Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. As these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open…

    Submitted 14 August, 2025; v1 submitted 12 August, 2025; originally announced August 2025.

    Comments: Update author list, modify first page format, correct typos

  43. arXiv:2508.08961  [pdf, ps, other]

    cs.SD eess.AS

    DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models

    Authors: Yuanyuan Wang, Dongchao Yang, Yiwen Shao, Hangting Chen, Jiankun Zhao, Zhiyong Wu, Helen Meng, Xixin Wu

    Abstract: Extending pre-trained Large Language Models (LLMs)'s speech understanding or generation abilities by introducing various effective speech tokens has attracted great attention in the speech community. However, building a unified speech understanding and generation model still faces the following challenges: (1) Due to the huge modality gap between speech tokens and text tokens, extending text LLMs…

    Submitted 13 August, 2025; v1 submitted 12 August, 2025; originally announced August 2025.

  44. arXiv:2508.08774  [pdf, ps, other]

    cs.AI cs.CL

    Designing Memory-Augmented AR Agents for Spatiotemporal Reasoning in Personalized Task Assistance

    Authors: Dongwook Choi, Taeyoon Kwon, Dongil Yang, Hyojun Kim, Jinyoung Yeo

    Abstract: Augmented Reality (AR) systems are increasingly integrating foundation models, such as Multimodal Large Language Models (MLLMs), to provide more context-aware and adaptive user experiences. This integration has led to the development of AR agents to support intelligent, goal-directed interactions in real-world environments. While current AR agents effectively support immediate tasks, they struggle…

    Submitted 12 August, 2025; originally announced August 2025.

    Comments: 7 pages, 2 figures

  45. arXiv:2508.07995  [pdf, ps, other]

    cs.IR cs.AI

    DIVER: A Multi-Stage Approach for Reasoning-intensive Information Retrieval

    Authors: Meixiu Long, Duolin Sun, Dan Yang, Junjie Wang, Yue Shen, Jian Wang, Peng Wei, Jinjie Gu, Jiahai Wang

    Abstract: Retrieval-augmented generation has achieved strong performance on knowledge-intensive tasks where query-document relevance can be identified through direct lexical or semantic matches. However, many real-world queries involve abstract reasoning, analogical thinking, or multi-step inference, which existing retrievers often struggle to capture. To address this challenge, we present DIVER, a retrieva…

    Submitted 25 August, 2025; v1 submitted 11 August, 2025; originally announced August 2025.

  46. arXiv:2508.07626  [pdf, ps, other]

    cs.CV cs.RO

    AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning

    Authors: Dejie Yang, Zijing Zhao, Yang Liu

    Abstract: Visual Robot Manipulation (VRM) aims to enable a robot to follow natural language instructions based on robot states and visual observations, and therefore requires costly multi-modal data. To compensate for the deficiency of robot data, existing approaches have employed vision-language pretraining with large-scale data. However, they either utilize web data that differs from robotic tasks, or tra…

    Submitted 11 August, 2025; originally announced August 2025.

    Comments: Accepted by ICCV2025

  47. arXiv:2508.06988  [pdf, ps, other]

    cs.CV

    TADoc: Robust Time-Aware Document Image Dewarping

    Authors: Fangmin Zhao, Weichao Zeng, Zhenhang Li, Dongbao Yang, Yu Zhou

    Abstract: Flattening curved, wrinkled, and rotated document images captured by portable photographing devices, termed document image dewarping, has become an increasingly important task with the rise of digital economy and online working. Although many methods have been proposed recently, they often struggle to achieve satisfactory results when confronted with intricate document structures and higher degree…

    Submitted 9 August, 2025; originally announced August 2025.

    Comments: 8 pages, 8 figures

  48. arXiv:2508.06902  [pdf, ps, other]

    cs.CV

    eMotions: A Large-Scale Dataset and Audio-Visual Fusion Network for Emotion Analysis in Short-form Videos

    Authors: Xuecheng Wu, Dingkang Yang, Danlei Huang, Xinyi Yin, Yifan Wang, Jia Zhang, Jiayu Nie, Liangyu Fu, Yang Liu, Junxiao Xue, Hadi Amirpour, Wei Zhou

    Abstract: Short-form videos (SVs) have become a vital part of our online routine for acquiring and sharing information. Their multimodal complexity poses new challenges for video analysis, highlighting the need for video emotion analysis (VEA) within the community. Given the limited availability of SVs emotion data, we introduce eMotions, a large-scale dataset consisting of 27,996 videos with full-scale ann…

    Submitted 9 August, 2025; originally announced August 2025.

  49. arXiv:2508.06471  [pdf, ps, other]

    cs.CL

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    Authors: GLM-4.5 Team, Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, Kedong Wang, Lucen Zhong, Mingdao Liu, Rui Lu, Shulin Cao, Xiaohan Zhang, Xuancheng Huang, Yao Wei, Yean Cheng, Yifan An, Yilin Niu, Yuanhao Wen, Yushi Bai, et al. (147 additional authors not shown)

    Abstract: We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes. Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning, GLM-4.5 achieves strong performance acro…

    Submitted 8 August, 2025; originally announced August 2025.

  50. arXiv:2508.06074  [pdf, ps, other]

    cs.AI cs.RO

    ME$^3$-BEV: Mamba-Enhanced Deep Reinforcement Learning for End-to-End Autonomous Driving with BEV-Perception

    Authors: Siyi Lu, Run Liu, Dongsheng Yang, Lei He

    Abstract: Autonomous driving systems face significant challenges in perceiving complex environments and making real-time decisions. Traditional modular approaches, while offering interpretability, suffer from error propagation and coordination issues, whereas end-to-end learning systems can simplify the design but face computational bottlenecks. This paper presents a novel approach to autonomous driving usi…

    Submitted 8 August, 2025; originally announced August 2025.