Showing 1–50 of 1,178 results for author: Ma, L

Searching in archive cs.
  1. arXiv:2507.05727  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark

    Authors: He Wang, Linhan Ma, Dake Guo, Xiong Wang, Lei Xie, Jin Xu, Junyang Lin

    Abstract: Automatic Speech Recognition (ASR) has been extensively investigated, yet prior evaluative efforts have largely been restricted to contextless paradigms. This constraint stems from the limited proficiency of conventional ASR models in context modeling and their deficiency in memory and reasoning based on world knowledge. Recent breakthroughs in the development of Large Language Models (LLMs) and c…

    Submitted 8 July, 2025; originally announced July 2025.

    Comments: 18 pages, 4 figures

  2. arXiv:2507.05173  [pdf, ps, other]

    cs.CV

    Semantic Frame Interpolation

    Authors: Yijia Hong, Jiangning Zhang, Ran Yi, Yuji Wang, Weijian Cao, Xiaobin Hu, Zhucun Xue, Yabiao Wang, Chengjie Wang, Lizhuang Ma

    Abstract: Generating intermediate video content of varying lengths based on given first and last frames, along with text prompt information, offers significant research and application potential. However, traditional frame interpolation tasks primarily focus on scenarios with a small number of frames, no text control, and minimal differences between the first and last frames. Recent community developers hav…

    Submitted 7 July, 2025; originally announced July 2025.

    Comments: https://github.com/hyj542682306/Semantic-Frame-Interpolation

  3. arXiv:2507.04705  [pdf, ps, other]

    cs.CV

    Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations

    Authors: Yuji Wang, Moran Li, Xiaobin Hu, Ran Yi, Jiangning Zhang, Han Feng, Weijian Cao, Yabiao Wang, Chengjie Wang, Lizhuang Ma

    Abstract: Identity-preserving text-to-video (IPT2V) generation, which aims to create high-fidelity videos with consistent human identity, has become crucial for downstream applications. However, current end-to-end frameworks suffer a critical spatial-temporal trade-off: optimizing for spatially coherent layouts of key elements (e.g., character identity preservation) often compromises instruction-compliant t…

    Submitted 7 July, 2025; originally announced July 2025.

  4. arXiv:2507.03908  [pdf, ps, other]

    cs.CV

    Bridging Vision and Language: Optimal Transport-Driven Radiology Report Generation via LLMs

    Authors: Haifeng Zhao, Yufei Zhang, Leilei Ma, Shuo Xu, Dengdi Sun

    Abstract: Radiology report generation represents a significant application within medical AI, and has achieved impressive results. Concurrently, large language models (LLMs) have demonstrated remarkable performance across various domains. However, empirical validation indicates that general LLMs tend to focus more on linguistic fluency rather than clinical effectiveness, and lack the ability to effectively…

    Submitted 5 July, 2025; originally announced July 2025.

  5. arXiv:2507.03427  [pdf, ps, other]

    cs.CV

    Rectifying Adversarial Sample with Low Entropy Prior for Test-Time Defense

    Authors: Lina Ma, Xiaowei Fu, Fuxiang Huang, Xinbo Gao, Lei Zhang

    Abstract: Existing defense methods fail to defend against unknown attacks and thus raise generalization issue of adversarial robustness. To remedy this problem, we attempt to delve into some underlying common characteristics among various attacks for generality. In this work, we reveal the commonly overlooked low entropy prior (LE) implied in various adversarial samples, and shed light on the universal robu…

    Submitted 4 July, 2025; originally announced July 2025.

    Comments: To appear in IEEE Transactions on Multimedia

  6. arXiv:2507.02654  [pdf, ps, other]

    cs.AR

    Breaking the HBM Bit Cost Barrier: Domain-Specific ECC for AI Inference Infrastructure

    Authors: Rui Xie, Asad Ul Haq, Yunhua Fang, Linsen Ma, Sanchari Sen, Swagath Venkataramani, Liu Liu, Tong Zhang

    Abstract: High-Bandwidth Memory (HBM) delivers exceptional bandwidth and energy efficiency for AI workloads, but its high cost per bit, driven in part by stringent on-die reliability requirements, poses a growing barrier to scalable deployment. This work explores a system-level approach to cost reduction by eliminating on-die ECC and shifting all fault management to the memory controller. We introduce a dom…

    Submitted 3 July, 2025; originally announced July 2025.

  7. arXiv:2506.23995  [pdf, ps, other]

    cs.SE cs.AI cs.RO

    STCLocker: Deadlock Avoidance Testing for Autonomous Driving Systems

    Authors: Mingfei Cheng, Renzhi Wang, Xiaofei Xie, Yuan Zhou, Lei Ma

    Abstract: Autonomous Driving System (ADS) testing is essential to ensure the safety and reliability of autonomous vehicles (AVs) before deployment. However, existing techniques primarily focus on evaluating ADS functionalities in single-AV settings. As ADSs are increasingly deployed in multi-AV traffic, it becomes crucial to assess their cooperative performance, particularly regarding deadlocks, a fundament…

    Submitted 30 June, 2025; originally announced June 2025.

  8. arXiv:2506.23986  [pdf, ps, other]

    cs.SD eess.AS

    StreamFlow: Streaming Flow Matching with Block-wise Guided Attention Mask for Speech Token Decoding

    Authors: Dake Guo, Jixun Yao, Linhan Ma, He Wang, Lei Xie

    Abstract: Recent advancements in discrete token-based speech generation have highlighted the importance of token-to-waveform generation for audio quality, particularly in real-time interactions. Traditional frameworks integrating semantic tokens with flow matching (FM) struggle with streaming capabilities due to their reliance on a global receptive field. Additionally, directly implementing token-by-token s…

    Submitted 1 July, 2025; v1 submitted 30 June, 2025; originally announced June 2025.

  9. arXiv:2506.23461  [pdf, ps, other]

    cs.CV cs.AI

    Time-variant Image Inpainting via Interactive Distribution Transition Estimation

    Authors: Yun Xing, Qing Guo, Xiaoguang Li, Yihao Huang, Xiaofeng Cao, Di Lin, Ivor Tsang, Lei Ma

    Abstract: In this work, we focus on a novel and practical task, i.e., Time-vAriant iMage inPainting (TAMP). The aim of TAMP is to restore a damaged target image by leveraging the complementary information from a reference image, where both images captured the same scene but with a significant time gap in between, i.e., time-variant images. Different from conventional reference-guided image inpainting, the r…

    Submitted 29 June, 2025; originally announced June 2025.

  10. arXiv:2506.22039  [pdf, ps, other]

    cs.LG cs.AI

    UniCA: Adapting Time Series Foundation Model to General Covariate-Aware Forecasting

    Authors: Lu Han, Yu Liu, Qiwen Deng, Jian Jiang, Yinbo Sun, Zhe Yu, Binfeng Wang, Xingyu Lu, Lintao Ma, Han-Jia Ye, De-Chuan Zhan

    Abstract: Time Series Foundation Models (TSFMs) have achieved remarkable success through large-scale pretraining. However, their design primarily targets real-valued series, limiting their ability to handle general forecasting tasks involving diverse and often heterogeneous covariates--such as categorical variables and multimodal data (e.g., images, text)--which are typically task-specific and difficult to…

    Submitted 27 June, 2025; originally announced June 2025.

  11. arXiv:2506.20945  [pdf, ps, other]

    cs.SD eess.AS

    A Multi-Stage Framework for Multimodal Controllable Speech Synthesis

    Authors: Rui Niu, Weihao Wu, Jie Chen, Long Ma, Zhiyong Wu

    Abstract: Controllable speech synthesis aims to control the style of generated speech using reference input, which can be of various modalities. Existing face-based methods struggle with robustness and generalization due to data quality constraints, while text prompt methods offer limited diversity and fine-grained control. Although multimodal approaches aim to integrate various modalities, their reliance o…

    Submitted 25 June, 2025; originally announced June 2025.

    Comments: Accepted by ICME2025

  12. arXiv:2506.19681  [pdf, ps, other]

    cs.CV

    Genome-Anchored Foundation Model Embeddings Improve Molecular Prediction from Histology Images

    Authors: Cheng Jin, Fengtao Zhou, Yunfang Yu, Jiabo Ma, Yihui Wang, Yingxue Xu, Huajun Zhou, Hao Jiang, Luyang Luo, Luhui Mao, Zifan He, Xiuming Zhang, Jing Zhang, Ronald Chan, Herui Yao, Hao Chen

    Abstract: Precision oncology requires accurate molecular insights, yet obtaining these directly from genomics is costly and time-consuming for broad clinical use. Predicting complex molecular features and patient prognosis directly from routine whole-slide images (WSI) remains a major challenge for current deep learning methods. Here we introduce PathLUPI, which uses transcriptomic privileged information du…

    Submitted 24 June, 2025; originally announced June 2025.

    Comments: Under Review

  13. arXiv:2506.19665  [pdf, ps, other]

    cs.CV cs.CL

    Recurrent Visual Feature Extraction and Stereo Attentions for CT Report Generation

    Authors: Yuanhe Tian, Lei Mao, Yan Song

    Abstract: Generating reports for computed tomography (CT) images is a challenging task, while similar to existing studies for medical image report generation, yet has its unique characteristics, such as spatial encoding of multiple images, alignment between image volume and texts, etc. Existing solutions typically use general 2D or 3D image processing techniques to extract features from a CT volume, where t…

    Submitted 24 June, 2025; originally announced June 2025.

    Comments: 7 pages, 3 figures

  14. arXiv:2506.19022  [pdf, ps, other]

    cs.CV

    Orthogonal Projection Subspace to Aggregate Online Prior-knowledge for Continual Test-time Adaptation

    Authors: Jinlong Li, Dong Zhao, Qi Zang, Zequn Jie, Lin Ma, Nicu Sebe

    Abstract: Continual Test Time Adaptation (CTTA) is a task that requires a source pre-trained model to continually adapt to new scenarios with changing target distributions. Existing CTTA methods primarily focus on mitigating the challenges of catastrophic forgetting and error accumulation. Though there have been emerging methods based on forgetting adaptation with parameter-efficient fine-tuning, they still…

    Submitted 23 June, 2025; originally announced June 2025.

  15. arXiv:2506.16690  [pdf, ps, other]

    cs.CV

    DepthVanish: Optimizing Adversarial Interval Structures for Stereo-Depth-Invisible Patches

    Authors: Yun Xing, Yue Cao, Nhat Chung, Jie Zhang, Ivor Tsang, Ming-Ming Cheng, Yang Liu, Lei Ma, Qing Guo

    Abstract: Stereo Depth estimation is a critical task in autonomous driving and robotics, where inaccuracies (such as misidentifying nearby objects as distant) can lead to dangerous situations. Adversarial attacks against stereo depth estimation can help reveal vulnerabilities before deployment. Previous work has shown that repeating optimized textures can effectively mislead stereo depth estimation in digit…

    Submitted 19 June, 2025; originally announced June 2025.

  16. arXiv:2506.15711  [pdf, ps, other]

    cs.LG cs.AI cs.CR cs.CV

    Shadow defense against gradient inversion attack in federated learning

    Authors: Le Jiang, Liyan Ma, Guang Yang

    Abstract: Federated learning (FL) has emerged as a transformative framework for privacy-preserving distributed training, allowing clients to collaboratively train a global model without sharing their local data. This is especially crucial in sensitive fields like healthcare, where protecting patient data is paramount. However, privacy leakage remains a critical challenge, as the communication of model updat…

    Submitted 30 May, 2025; originally announced June 2025.

  17. arXiv:2506.15231  [pdf, ps, other]

    cs.CV

    Convolutional Feature Enhancement and Attention Fusion BiFPN for Ship Detection in SAR Images

    Authors: Liangjie Meng, Danxia Li, Jinrong He, Lili Ma, Zhixin Li

    Abstract: Synthetic Aperture Radar (SAR) enables submeter-resolution imaging and all-weather monitoring via active microwave and advanced signal processing. Currently, SAR has found extensive applications in critical maritime domains such as ship detection. However, SAR ship detection faces several challenges, including significant scale variations among ships, the presence of small offshore vessels mixed w…

    Submitted 18 June, 2025; originally announced June 2025.

    Comments: 5 pages, 4 figures, 2 tables. Code available at https://github.com/mlj666219/C-AFBiFPN/tree/master

  18. MOL: Joint Estimation of Micro-Expression, Optical Flow, and Landmark via Transformer-Graph-Style Convolution

    Authors: Zhiwen Shao, Yifan Cheng, Feiran Li, Yong Zhou, Xuequan Lu, Yuan Xie, Lizhuang Ma

    Abstract: Facial micro-expression recognition (MER) is a challenging problem, due to transient and subtle micro-expression (ME) actions. Most existing methods depend on hand-crafted features, key frames like onset, apex, and offset frames, or deep networks limited by small-scale and low-diversity datasets. In this paper, we propose an end-to-end micro-action-aware deep learning framework with advantages fro…

    Submitted 17 June, 2025; originally announced June 2025.

    Comments: This paper has been accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence

  19. arXiv:2506.14315  [pdf, ps, other]

    cs.GR cs.CV

    ImmerseGen: Agent-Guided Immersive World Generation with Alpha-Textured Proxies

    Authors: Jinyan Yuan, Bangbang Yang, Keke Wang, Panwang Pan, Lin Ma, Xuehai Zhang, Xiao Liu, Zhaopeng Cui, Yuewen Ma

    Abstract: Automatic creation of 3D scenes for immersive VR presence has been a significant research focus for decades. However, existing methods often rely on either high-poly mesh modeling with post-hoc simplification or massive 3D Gaussians, resulting in a complex pipeline or limited visual realism. In this paper, we demonstrate that such exhaustive modeling is unnecessary for achieving compelling immersi…

    Submitted 18 June, 2025; v1 submitted 17 June, 2025; originally announced June 2025.

    Comments: Project webpage: https://immersegen.github.io

  20. arXiv:2506.14113  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    SKOLR: Structured Koopman Operator Linear RNN for Time-Series Forecasting

    Authors: Yitian Zhang, Liheng Ma, Antonios Valkanas, Boris N. Oreshkin, Mark Coates

    Abstract: Koopman operator theory provides a framework for nonlinear dynamical system analysis and time-series forecasting by mapping dynamics to a space of real-valued measurement functions, enabling a linear operator representation. Despite the advantage of linearity, the operator is generally infinite-dimensional. Therefore, the objective is to learn measurement functions that yield a tractable finite-di…

    Submitted 16 June, 2025; originally announced June 2025.

  21. arXiv:2506.13056  [pdf, ps, other]

    cs.AI cs.CV cs.LG

    Metis-RISE: RL Incentivizes and SFT Enhances Multimodal Reasoning Model Learning

    Authors: Haibo Qiu, Xiaohan Lan, Fanfan Liu, Xiaohu Sun, Delian Ruan, Peng Shi, Lin Ma

    Abstract: Recent advancements in large language models (LLMs) have witnessed a surge in the development of advanced reasoning paradigms, which are now being integrated into multimodal large language models (MLLMs). However, existing approaches often fall short: methods solely employing reinforcement learning (RL) can struggle with sample inefficiency and activating entirely absent reasoning capabilities, wh…

    Submitted 26 June, 2025; v1 submitted 15 June, 2025; originally announced June 2025.

    Comments: Project Page: https://github.com/MM-Thinking/Metis-RISE

  22. arXiv:2506.12609  [pdf, ps, other]

    cs.CV

    Not All Tokens and Heads Are Equally Important: Dual-Level Attention Intervention for Hallucination Mitigation

    Authors: Lexiang Tang, Xianwei Zhuang, Bang Yang, Zhiyuan Hu, Hongxiang Li, Lu Ma, Jinghan Ru, Yuexian Zou

    Abstract: Large vision-language models (LVLMs) have shown remarkable capabilities across a wide range of multimodal tasks. However, they remain prone to visual hallucination (VH), often producing confident but incorrect descriptions of visual content. We present VisFlow, an efficient and training-free framework designed to mitigate VH by directly manipulating attention patterns during inference. Through sys…

    Submitted 14 June, 2025; originally announced June 2025.

  23. arXiv:2506.12401  [pdf, ps, other]

    cs.CV

    Feature Complementation Architecture for Visual Place Recognition

    Authors: Weiwei Wang, Meijia Wang, Haoyi Wang, Wenqiang Guo, Jiapan Guo, Changming Sun, Lingkun Ma, Weichuan Zhang

    Abstract: Visual place recognition (VPR) plays a crucial role in robotic localization and navigation. The key challenge lies in constructing feature representations that are robust to environmental changes. Existing methods typically adopt convolutional neural networks (CNNs) or vision Transformers (ViTs) as feature extractors. However, these architectures excel in different aspects -- CNNs are effective at…

    Submitted 14 June, 2025; originally announced June 2025.

  24. arXiv:2506.11908  [pdf, ps, other]

    cs.LG cs.AI

    Spectra-to-Structure and Structure-to-Spectra Inference Across the Periodic Table

    Authors: Yufeng Wang, Peiyao Wang, Lu Ma, Yuewei Lin, Qun Liu, Haibin Ling

    Abstract: X-ray Absorption Spectroscopy (XAS) is a powerful technique for probing local atomic environments, yet its interpretation remains limited by the need for expert-driven analysis, computationally expensive simulations, and element-specific heuristics. Recent advances in machine learning have shown promise for accelerating XAS interpretation, but many existing models are narrowly focused on specific…

    Submitted 13 June, 2025; originally announced June 2025.

  25. arXiv:2506.11160  [pdf, ps, other]

    eess.AS cs.SD

    S2ST-Omni: An Efficient Multilingual Speech-to-Speech Translation Framework via Seamless Speech-Text Alignment and Progressive Fine-tuning

    Authors: Yu Pan, Yuguang Yang, Yanni Hu, Jianhao Ye, Xiang Zhang, Hongbin Zhou, Lei Ma, Jianjun Zhao

    Abstract: Despite recent advances in multilingual speech-to-speech translation (S2ST), several critical challenges persist: 1) achieving high-quality translation remains a major hurdle, and 2) most existing methods heavily rely on large-scale parallel speech corpora, which are costly and difficult to obtain. To address these issues, we propose S2ST-Omni, an efficient and scalable framework for mult…

    Submitted 8 July, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

    Comments: Work in progress

  26. arXiv:2506.11127  [pdf, ps, other]

    cs.CL cs.AI

    GUIRoboTron-Speech: Towards Automated GUI Agents Based on Speech Instructions

    Authors: Wenkang Han, Zhixiong Zeng, Jing Huang, Shu Jiang, Liming Zheng, Longrong Yang, Haibo Qiu, Chang Yao, Jingyuan Chen, Lin Ma

    Abstract: Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions imposes limitations on accessibility and convenience, particularly in hands-free scenarios. To address this gap, we propose GUIRoboTron-Speech, the first end-to-end autonomous GUI agent that directly accepts speech instructions and on-device screensho…

    Submitted 10 June, 2025; originally announced June 2025.

  27. arXiv:2506.11119  [pdf]

    cs.CL cs.SD eess.AS

    Benchmarking Foundation Speech and Language Models for Alzheimer's Disease and Related Dementia Detection from Spontaneous Speech

    Authors: Jingyu Li, Lingchao Mao, Hairong Wang, Zhendong Wang, Xi Mao, Xuelei Sherry Ni

    Abstract: Background: Alzheimer's disease and related dementias (ADRD) are progressive neurodegenerative conditions where early detection is vital for timely intervention and care. Spontaneous speech contains rich acoustic and linguistic markers that may serve as non-invasive biomarkers for cognitive decline. Foundation models, pre-trained on large-scale audio or text data, produce high-dimensional embeddin…

    Submitted 9 June, 2025; originally announced June 2025.

    MSC Class: 68T10 (Primary); 68U99 (Secondary) ACM Class: I.2.1; J.3

  28. arXiv:2506.10915  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    M4V: Multi-Modal Mamba for Text-to-Video Generation

    Authors: Jiancheng Huang, Gengwei Zhang, Zequn Jie, Siyu Jiao, Yinlong Qian, Ling Chen, Yunchao Wei, Lin Ma

    Abstract: Text-to-video generation has significantly enriched content creation and holds the potential to evolve into powerful world simulators. However, modeling the vast spatiotemporal space remains computationally demanding, particularly when employing Transformers, which incur quadratic complexity in sequence processing and thus limit practical applications. Recent advancements in linear-time sequence m…

    Submitted 12 June, 2025; originally announced June 2025.

  29. arXiv:2506.10890  [pdf, ps, other]

    cs.CV

    CreatiPoster: Towards Editable and Controllable Multi-Layer Graphic Design Generation

    Authors: Zhao Zhang, Yutao Cheng, Dexiang Hong, Maoke Yang, Gonglei Shi, Lei Ma, Hui Zhang, Jie Shao, Xinglong Wu

    Abstract: Graphic design plays a crucial role in both commercial and personal contexts, yet creating high-quality, editable, and aesthetically pleasing graphic compositions remains a time-consuming and skill-intensive task, especially for beginners. Current AI tools automate parts of the workflow, but struggle to accurately incorporate user-supplied assets, maintain editability, and achieve professional vis…

    Submitted 12 June, 2025; originally announced June 2025.

  30. arXiv:2506.09378  [pdf, ps, other]

    cs.CV

    UniForward: Unified 3D Scene and Semantic Field Reconstruction via Feed-Forward Gaussian Splatting from Only Sparse-View Images

    Authors: Qijian Tian, Xin Tan, Jingyu Gong, Yuan Xie, Lizhuang Ma

    Abstract: We propose a feed-forward Gaussian Splatting model that unifies 3D scene and semantic field reconstruction. Combining 3D scenes with semantic fields facilitates the perception and understanding of the surrounding environment. However, key challenges include embedding semantics into 3D representations, achieving generalizable real-time reconstruction, and ensuring practical applicability by using o…

    Submitted 11 June, 2025; originally announced June 2025.

  31. arXiv:2506.08889  [pdf, ps, other]

    cs.LG cs.AI

    SeerAttention-R: Sparse Attention Adaptation for Long Reasoning

    Authors: Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, Hayden Kwok-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang

    Abstract: We introduce SeerAttention-R, a sparse attention framework specifically tailored for the long decoding of reasoning models. Extended from SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding. With a lightweight plug-in gating, SeerAttention-R is flexible and c…

    Submitted 10 June, 2025; originally announced June 2025.

  32. arXiv:2506.08708  [pdf, ps, other]

    cs.RO cs.AI cs.CV

    PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly

    Authors: Liang Ma, Jiajun Wen, Min Lin, Rongtao Xu, Xiwen Liang, Bingqian Lin, Jun Ma, Yongxin Wang, Ziming Wei, Haokun Lin, Mingfei Han, Meng Cao, Bokui Chen, Ivan Laptev, Xiaodan Liang

    Abstract: While vision-language models (VLMs) have demonstrated promising capabilities in reasoning and planning for embodied agents, their ability to comprehend physical phenomena, particularly within structured 3D environments, remains severely limited. To close this gap, we introduce PhyBlock, a progressive benchmark designed to assess VLMs on physical understanding and planning through robotic 3D block…

    Submitted 10 June, 2025; originally announced June 2025.

  33. arXiv:2506.08356  [pdf, ps, other]

    cs.CV

    MedMoE: Modality-Specialized Mixture of Experts for Medical Vision-Language Understanding

    Authors: Shivang Chopra, Gabriela Sanchez-Rodriguez, Lingchao Mao, Andrew J Feola, Jing Li, Zsolt Kira

    Abstract: Different medical imaging modalities capture diagnostic information at varying spatial resolutions, from coarse global patterns to fine-grained localized structures. However, most existing vision-language frameworks in the medical domain apply a uniform strategy for local feature extraction, overlooking the modality-specific demands. In this work, we present MedMoE, a modular and extensible vision…

    Submitted 11 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

  34. arXiv:2506.07527  [pdf, other]

    cs.AI cs.LG

    Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions

    Authors: Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Bin Cui, Wentao Zhang

    Abstract: Recent advances in large language model (LLM) reasoning have shown that sophisticated behaviors such as planning and self-reflection can emerge through reinforcement learning (RL). However, despite these successes, RL in its current form remains insufficient to induce capabilities that exceed the limitations of the base model, as it is primarily optimized based on existing knowledge of the model r…

    Submitted 9 June, 2025; originally announced June 2025.

    Comments: 12 pages, 5 figures

  35. arXiv:2506.07419  [pdf, ps, other]

    cs.SE

    Generate Realistic Test Scenes for V2X Communication Systems

    Authors: An Guo, Xinyu Gao, Chunrong Fang, Haoxiang Tian, Weisong Sun, Yanzhou Mu, Shuncheng Tang, Lei Ma, Zhenyu Chen

    Abstract: Accurately perceiving complex driving environments is essential for ensuring the safe operation of autonomous vehicles. With the tremendous progress in deep learning and communication technologies, cooperative perception with Vehicle-to-Everything (V2X) technologies has emerged as a solution to overcome the limitations of single-agent perception systems in perceiving distant objects and occlusions…

    Submitted 9 June, 2025; originally announced June 2025.

  36. arXiv:2506.03690  [pdf, other]

    cs.CL

    Robust Preference Optimization via Dynamic Target Margins

    Authors: Jie Sun, Junkang Wu, Jiancan Wu, Zhibo Zhu, Xingyu Lu, Jun Zhou, Lintao Ma, Xiang Wang

    Abstract: The alignment of Large Language Models (LLMs) is crucial for ensuring their safety and reliability in practical applications. Direct Preference Optimization (DPO) has emerged as an efficient method that directly optimizes models using preference pairs, significantly reducing resource demands. However, the effectiveness of DPO heavily depends on the data quality, which is frequently compromised by…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: 18 pages, 6 figures, accepted to The 63rd Annual Meeting of the Association for Computational Linguistics (ACL2025)

  37. arXiv:2506.03643  [pdf, ps, other]

    cs.CV

    Images are Worth Variable Length of Representations

    Authors: Lingjun Mao, Rodolfo Corona, Xin Liang, Wenhao Yan, Zineng Tang

    Abstract: Most existing vision encoders map images into a fixed-length sequence of tokens, overlooking the fact that different images contain varying amounts of information. For example, a visually complex image (e.g., a cluttered room) inherently carries more information and thus deserves more tokens than a simple image (e.g., a blank wall). To address this inefficiency, we propose DOVE, a dynamic vision e…

    Submitted 5 June, 2025; v1 submitted 4 June, 2025; originally announced June 2025.

  38. arXiv:2506.01783  [pdf, ps, other]

    cs.CV

    FaceCoT: A Benchmark Dataset for Face Anti-Spoofing with Chain-of-Thought Reasoning

    Authors: Honglu Zhang, Zhiqin Fang, Ningning Zhao, Saihui Hou, Long Ma, Renwang Pei, Zhaofeng He

    Abstract: Face Anti-Spoofing (FAS) typically depends on a single visual modality when defending against presentation attacks such as print attacks, screen replays, and 3D masks, resulting in limited generalization across devices, environments, and attack types. Meanwhile, Multimodal Large Language Models (MLLMs) have recently achieved breakthroughs in image-text understanding and semantic reasoning, suggest…

    Submitted 2 June, 2025; originally announced June 2025.

  39. arXiv:2506.01116  [pdf, ps, other]

    cs.AI q-bio.QM

    ChemAU: Harness the Reasoning of LLMs in Chemical Research with Adaptive Uncertainty Estimation

    Authors: Xinyi Liu, Lipeng Ma, Yixuan Li, Weidong Yang, Qingyuan Zhou, Jiayi Song, Shuhao Li, Ben Fei

    Abstract: Large Language Models (LLMs) are widely used across various scenarios due to their exceptional reasoning capabilities and natural language understanding. While LLMs demonstrate strong performance in tasks involving mathematics and coding, their effectiveness diminishes significantly when applied to chemistry-related problems. Chemistry problems typically involve long and complex reasoning steps, w…

    Submitted 1 June, 2025; originally announced June 2025.

  40. arXiv:2506.01039  [pdf, ps, other]

    eess.AS cs.SD

    PseudoVC: Improving One-shot Voice Conversion with Pseudo Paired Data

    Authors: Songjun Cao, Qinghua Wu, Jie Chen, Jin Li, Long Ma

    Abstract: As parallel training data is scarce for one-shot voice conversion (VC) tasks, waveform reconstruction is typically performed by various VC systems. A typical one-shot VC system comprises a content encoder and a speaker encoder. However, two types of mismatches arise: one for the inputs to the content encoder during training and inference, and another for the inputs to the speaker encoder. To addre…

    Submitted 1 June, 2025; originally announced June 2025.

    Comments: 5 pages, 3 figures

  41. arXiv:2506.00836  [pdf, ps, other]

    cs.CV

    Advancing from Automated to Autonomous Beamline by Leveraging Computer Vision

    Authors: Baolu Li, Hongkai Yu, Huiming Sun, Jin Ma, Yuewei Lin, Lu Ma, Yonghua Du

    Abstract: The synchrotron light source, a cutting-edge large-scale user facility, requires autonomous synchrotron beamline operations, a crucial technique that should enable experiments to be conducted automatically, reliably, and safely with minimum human intervention. However, current state-of-the-art synchrotron beamlines still heavily rely on human safety oversight. To bridge the gap between automated a…

    Submitted 1 June, 2025; originally announced June 2025.

  42. arXiv:2505.24329  [pdf, ps, other]

    cs.CV

    DisTime: Distribution-based Time Representation for Video Large Language Models

    Authors: Yingsen Zeng, Zepeng Huang, Yujie Zhong, Chengjian Feng, Jie Hu, Lin Ma, Yang Liu

    Abstract: Despite advances in general video understanding, Video Large Language Models (Video-LLMs) face challenges in precise temporal localization due to discrete time representations and limited temporally aware datasets. Existing methods for temporal expression either conflate time with text-based numerical values, add a series of dedicated temporal tokens, or regress time using specialized temporal gro…

    Submitted 30 May, 2025; originally announced May 2025.

  43. arXiv:2505.23862  [pdf, ps, other]

    q-bio.QM cs.AI cs.LG

    A New Deep-learning-Based Approach For mRNA Optimization: High Fidelity, Computation Efficiency, and Multiple Optimization Factors

    Authors: Zheng Gong, Ziyi Jiang, Weihao Gao, Deng Zhuo, Lan Ma

    Abstract: mRNA optimization is critical for therapeutic and biotechnological applications, since sequence features directly govern protein expression levels and efficacy. However, current methods face significant challenges in simultaneously achieving three key objectives: (1) fidelity (preventing unintended amino acid changes), (2) computational efficiency (speed and scalability), and (3) the scope of…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: You can also contact [email protected] for more information

  44. arXiv:2505.21969  [pdf, ps, other]

    cs.RO cs.AI

    DORAEMON: Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation

    Authors: Tianjun Gu, Linfeng Li, Xuhong Wang, Chenghua Gong, Jingyu Gong, Zhizhong Zhang, Yuan Xie, Lizhuang Ma, Xin Tan

    Abstract: Adaptive navigation in unfamiliar environments is crucial for household service robots but remains challenging due to the need for both low-level path planning and high-level scene understanding. While recent vision-language model (VLM) based zero-shot approaches reduce dependence on prior maps and scene-specific training data, they face significant limitations: spatiotemporal discontinuity from d…

    Submitted 5 June, 2025; v1 submitted 28 May, 2025; originally announced May 2025.

  45. arXiv:2505.21067  [pdf, ps, other]

    cs.AI

    Why Distillation can Outperform Zero-RL: The Role of Flexible Reasoning

    Authors: Xiao Hu, Xingyu Lu, Liyuan Mao, YiFan Zhang, Tianke Zhang, Bin Wen, Fan Yang, Tingting Gao, Guorui Zhou

    Abstract: Reinforcement learning (RL) has played an important role in improving the reasoning ability of large language models (LLMs). Some studies apply RL directly to smaller base models (known as zero-RL) and also achieve notable progress. However, in this paper, we show that using only 920 examples, a simple distillation method based on the base model can clearly outperform zero-RL, which typic…

    Submitted 27 May, 2025; originally announced May 2025.

  46. arXiv:2505.20902  [pdf]

    eess.IV cs.CV

    Multitemporal Latent Dynamical Framework for Hyperspectral Images Unmixing

    Authors: Ruiying Li, Bin Pan, Lan Ma, Xia Xu, Zhenwei Shi

    Abstract: Multitemporal hyperspectral unmixing can capture dynamical evolution of materials. Despite its capability, current methods emphasize variability of endmembers while neglecting dynamics of abundances, which motivates our adoption of neural ordinary differential equations to model abundances temporally. However, this motivation is hindered by two challenges: the inherent complexity in defining, mode…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: 11 pages, 8 figures

    MSC Class: 68T07; ACM Class: I.4.10

  47. arXiv:2505.20830  [pdf, ps, other]

    cs.CV

    Causality-Driven Infrared and Visible Image Fusion

    Authors: Linli Ma, Suzhen Lin, Jianchao Zeng, Zanxia Jin, Yanbo Wang, Fengyuan Li, Yubing Luo

    Abstract: Image fusion aims to combine complementary information from multiple source images to generate more comprehensive scene representations. Existing methods primarily rely on the stacking and design of network architectures to enhance the fusion performance, often ignoring the impact of dataset scene bias on model training. This oversight leads the model to learn spurious correlations between specifi…

    Submitted 27 May, 2025; originally announced May 2025.

  48. arXiv:2505.20469  [pdf, other]

    cs.CV cs.AI

    CCL-LGS: Contrastive Codebook Learning for 3D Language Gaussian Splatting

    Authors: Lei Tian, Xiaomin Li, Liqian Ma, Hefei Huang, Zirui Zheng, Hao Yin, Taiqing Li, Huchuan Lu, Xu Jia

    Abstract: Recent advances in 3D reconstruction techniques and vision-language models have fueled significant progress in 3D semantic understanding, a capability critical to robotics, autonomous driving, and virtual/augmented reality. However, methods that rely on 2D priors are prone to a critical challenge: cross-view semantic inconsistencies induced by occlusion, image blur, and view-dependent variations.…

    Submitted 26 May, 2025; originally announced May 2025.

  49. arXiv:2505.20148  [pdf, ps, other]

    cs.AI

    MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents

    Authors: Ziming Wei, Bingqian Lin, Zijian Jiao, Yunshuang Nie, Liang Ma, Yuecheng Liu, Yuzheng Zhuang, Xiaodan Liang

    Abstract: Spatial Planning is a crucial part in the field of spatial intelligence, which requires the understanding and planning about object arrangements in space perspective. AI agents with the spatial planning ability can better adapt to various real-world applications, including robotic manipulation, automatic assembly, urban planning, etc. Recent works have attempted to construct benchmarks for evaluati…

    Submitted 27 May, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

  50. arXiv:2505.19161  [pdf, ps, other]

    cs.CV

    Benchmarking Laparoscopic Surgical Image Restoration and Beyond

    Authors: Jialun Pei, Diandian Guo, Donghui Yang, Zhixi Li, Yuxin Feng, Long Ma, Bo Du, Pheng-Ann Heng

    Abstract: In laparoscopic surgery, a clear and high-quality visual field is critical for surgeons to make accurate intraoperative decisions. However, persistent visual degradation, including smoke generated by energy devices, lens fogging from thermal gradients, and lens contamination due to blood or tissue fluid splashes during surgical procedures, severely impairs visual clarity. These degenerations can se…

    Submitted 25 May, 2025; originally announced May 2025.