Showing 1–50 of 1,262 results for author: Ma, L

Searching in archive cs.
  1. arXiv:2509.26008

    cs.CV cs.AI cs.CG

    PFDepth: Heterogeneous Pinhole-Fisheye Joint Depth Estimation via Distortion-aware Gaussian-Splatted Volumetric Fusion

    Authors: Zhiwei Zhang, Ruikai Xu, Weijian Zhang, Zhizhong Zhang, Xin Tan, Jingyu Gong, Yuan Xie, Lizhuang Ma

    Abstract: In this paper, we present the first pinhole-fisheye framework for heterogeneous multi-view depth estimation, PFDepth. Our key insight is to exploit the complementary characteristics of pinhole and fisheye imagery (undistorted vs. distorted, small vs. large FOV, far vs. near field) for joint optimization. PFDepth employs a unified architecture capable of processing arbitrary combinations of pinhole…

    Submitted 30 September, 2025; originally announced September 2025.

    Comments: Accepted by ACM MM 2025 Conference

  2. arXiv:2509.25866

    cs.CV

    DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning

    Authors: Chi Zhang, Haibo Qiu, Qiming Zhang, Zhixiong Zeng, Lin Ma, Jing Zhang

    Abstract: The "thinking with images" paradigm represents a pivotal shift in the reasoning of Vision Language Models (VLMs), moving from text-dominant chain-of-thought to image-interactive reasoning. By invoking visual tools or generating intermediate visual representations, VLMs can iteratively attend to fine-grained regions, enabling deeper image understanding and more faithful multimodal reasoning. As an…

    Submitted 30 September, 2025; originally announced September 2025.

  3. arXiv:2509.25826

    cs.LG

    Kairos: Towards Adaptive and Generalizable Time Series Foundation Models

    Authors: Kun Feng, Shaocheng Lan, Yuchen Fang, Wenchao He, Lintao Ma, Xingyu Lu, Kan Ren

    Abstract: Time series foundation models (TSFMs) have emerged as a powerful paradigm for time series analysis, driven by large-scale pretraining on diverse data corpora. However, time series inherently exhibit heterogeneous information density over time, influenced by system states and signal complexity, presenting significant modeling challenges especially in a zero-shot scenario. Current TSFMs rely on non-…

    Submitted 30 September, 2025; originally announced September 2025.

  4. arXiv:2509.25646

    cs.LG math.NA

    Deep set based operator learning with uncertainty quantification

    Authors: Lei Ma, Ling Guo, Hao Wu, Tao Zhou

    Abstract: Learning operators from data is central to scientific machine learning. While DeepONets are widely used for their ability to handle complex domains, they require fixed sensor numbers and locations, lack mechanisms for uncertainty quantification (UQ), and are thus limited in practical applicability. Recent permutation-invariant extensions, such as the Variable-Input Deep Operator Network (VIDON), re…

    Submitted 29 September, 2025; originally announced September 2025.

  5. arXiv:2509.25027

    cs.CV

    STAGE: Stable and Generalizable GRPO for Autoregressive Image Generation

    Authors: Xiaoxiao Ma, Haibo Qiu, Guohui Zhang, Zhixiong Zeng, Siqi Yang, Lin Ma, Feng Zhao

    Abstract: Reinforcement learning has recently been explored to improve text-to-image generation, yet applying existing GRPO algorithms to autoregressive (AR) image models remains challenging. The instability of the training process easily disrupts the pretrained model capability during long runs, resulting in marginal gains, degraded image quality, and poor generalization. In this work, we revisit GRPO for…

    Submitted 29 September, 2025; originally announced September 2025.

    Comments: Code available at https://github.com/krennic999/STAGE

  6. arXiv:2509.23625

    cs.CV cs.AI cs.CL cs.LG

    RIV: Recursive Introspection Mask Diffusion Vision Language Model

    Authors: YuQian Li, Limeng Qiao, Lin Ma

    Abstract: Mask Diffusion-based Vision Language Models (MDVLMs) have achieved remarkable progress in multimodal understanding tasks. However, these models are unable to correct errors in generated tokens, meaning they lack self-correction capability. In this paper, we propose Recursive Introspection Mask Diffusion Vision Language Model (RIV), which equips the model with self-correction ability through two no…

    Submitted 28 September, 2025; originally announced September 2025.

  7. arXiv:2509.23368

    cs.CL cs.AI

    MedCritical: Enhancing Medical Reasoning in Small Language Models via Self-Collaborative Correction

    Authors: Xinchun Su, Chunxu Luo, Yixuan Li, Weidong Yang, Lipeng Ma

    Abstract: In the field of medicine, complex reasoning tasks such as clinical diagnosis, treatment planning, and medical knowledge integration pose significant challenges, where small language models often underperform compared to large language models like GPT-4 and Deepseek. Recent knowledge distillation-based methods aim to address these issues through teacher-guided error correction, but this LLM as judg…

    Submitted 27 September, 2025; originally announced September 2025.

  8. arXiv:2509.23352

    cs.CV cs.AI

    Dynamic-TreeRPO: Breaking the Independent Trajectory Bottleneck with Structured Sampling

    Authors: Xiaolong Fu, Lichen Ma, Zipeng Guo, Gaojing Zhou, Chongxiao Wang, ShiPing Dong, Shizhe Zhou, Ximan Liu, Jingling Fu, Tan Lit Sin, Yu Shi, Zhen Chen, Junshi Huang, Jason Li

    Abstract: The integration of Reinforcement Learning (RL) into flow matching models for text-to-image (T2I) generation has driven substantial advances in generation quality. However, these gains often come at the cost of exhaustive exploration and inefficient sampling strategies due to slight variation in the sampling group. Building on this insight, we propose Dynamic-TreeRPO, which implements the sliding-w…

    Submitted 27 September, 2025; originally announced September 2025.

  9. arXiv:2509.23331

    cs.CL

    C-Evolve: Consensus-based Evolution for Prompt Groups

    Authors: Tiancheng Li, Yuhang Wang, Zhiyang Chen, Zijun Wang, Liyuan Ma, Guo-jun Qi

    Abstract: Prompt evolution algorithms offer a powerful paradigm for enhancing AI systems based on closed-source models, while few works explore whether aggregating results from multiple prompts to reach a consensus can further advance the system capability boundary. In this paper, we introduce Consensus-Evolve (C-Evolve), an evolutionary algorithm that discovers a group of prompts whose aggregated outputs a…

    Submitted 27 September, 2025; originally announced September 2025.

    Comments: 70 pages, 7 figures

  10. arXiv:2509.23304

    cs.CV

    Seeing the Unseen in Low-light Spike Streams

    Authors: Liwen Hu, Yang Li, Mianzhi Liu, Yijia Guo, Shenghao Xie, Ziluo Ding, Tiejun Huang, Lei Ma

    Abstract: The spike camera, a type of neuromorphic sensor with high temporal resolution, shows great promise for high-speed visual tasks. Unlike traditional cameras, a spike camera continuously accumulates photons and fires asynchronous spike streams. Due to their unique data modality, spike streams require reconstruction methods to become perceptible to the human eye. However, many methods struggle to handle spik…

    Submitted 27 September, 2025; originally announced September 2025.

  11. arXiv:2509.23071

    cs.CL cs.AI

    From Evidence to Trajectory: Abductive Reasoning Path Synthesis for Training Retrieval-Augmented Generation Agents

    Authors: Muzhi Li, Jinhu Qi, Yihong Wu, Minghao Zhao, Liheng Ma, Yifan Li, Xinyu Wang, Yingxue Zhang, Ho-fung Leung, Irwin King

    Abstract: The development of retrieval-augmented generation agents is hindered by the lack of process-level supervision to effectively guide agentic capabilities like task decomposition, retriever invocation, and stepwise decision-making. While reinforcement learning offers a potential solution, it suffers from sparse rewards and the limited reasoning capabilities of large language models (LLMs). Meanwhile, existi…

    Submitted 26 September, 2025; originally announced September 2025.

  12. arXiv:2509.22720

    cs.CV cs.AI cs.LG

    LayoutAgent: A Vision-Language Agent Guided Compositional Diffusion for Spatial Layout Planning

    Authors: Zezhong Fan, Xiaohan Li, Luyi Ma, Kai Zhao, Liang Peng, Topojoy Biswas, Evren Korpeoglu, Kaushiki Nag, Kannan Achan

    Abstract: Designing realistic multi-object scenes requires not only generating images, but also planning spatial layouts that respect semantic relations and physical plausibility. On one hand, while recent advances in diffusion models have enabled high-quality image generation, they lack explicit spatial reasoning, leading to unrealistic object layouts. On the other hand, traditional spatial planning method…

    Submitted 24 September, 2025; originally announced September 2025.

    Comments: NeurIPS 2025 Workshop on SPACE in Vision, Language, and Embodied AI

  13. arXiv:2509.22281

    cs.CV cs.RO

    MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning

    Authors: Jinkun Hao, Naifu Liang, Zhen Luo, Xudong Xu, Weipeng Zhong, Ran Yi, Yichen Jin, Zhaoyang Lyu, Feng Zheng, Lizhuang Ma, Jiangmiao Pang

    Abstract: The ability of robots to interpret human instructions and execute manipulation tasks necessitates the availability of task-relevant tabletop scenes for training. However, traditional methods for creating these scenes rely on time-consuming manual layout design or purely randomized layouts, which are limited in terms of plausibility or alignment with the tasks. In this paper, we formulate a novel t…

    Submitted 26 September, 2025; originally announced September 2025.

    Comments: Accepted by NeurIPS 2025; Project page: https://mesatask.github.io/

  14. arXiv:2509.20881

    cs.SE

    PseudoBridge: Pseudo Code as the Bridge for Better Semantic and Logic Alignment in Code Retrieval

    Authors: Yixuan Li, Xinyi Liu, Weidong Yang, Ben Fei, Shuhao Li, Mingjie Zhou, Lipeng Ma

    Abstract: Code search aims to precisely find relevant code snippets that match natural language queries within massive codebases, playing a vital role in software development. Recent advances leverage pre-trained language models (PLMs) to bridge the semantic gap between unstructured natural language (NL) and structured programming languages (PL), yielding significant improvements over traditional informatio…

    Submitted 25 September, 2025; originally announced September 2025.

  15. arXiv:2509.20798

    cs.AI cs.SE

    LogReasoner: Empowering LLMs with Expert-like Coarse-to-Fine Reasoning for Automated Log Analysis

    Authors: Lipeng Ma, Yixuan Li, Weidong Yang, Mingjie Zhou, Xinyi Liu, Ben Fei, Shuhao Li, Xiaoyan Sun, Sihang Jiang, Yanghua Xiao

    Abstract: Log analysis is crucial for monitoring system health and diagnosing failures in complex systems. Recent advances in large language models (LLMs) offer new opportunities for automated log analysis, leveraging their reasoning capabilities to perform tasks such as anomaly detection and failure prediction. However, general-purpose LLMs struggle to formulate structured reasoning workflows that align wi…

    Submitted 27 September, 2025; v1 submitted 25 September, 2025; originally announced September 2025.

    Comments: under review

  16. arXiv:2509.20271

    cs.CV

    A Versatile Foundation Model for AI-enabled Mammogram Interpretation

    Authors: Fuxiang Huang, Jiayi Zhu, Yunfang Yu, Yu Xie, Yuan Guo, Qingcong Kong, Mingxiang Wu, Xinrui Jiang, Shu Yang, Jiabo Ma, Ziyi Liu, Zhe Xu, Zhixuan Chen, Yujie Tan, Zifan He, Luhui Mao, Xi Wang, Junlin Hou, Lei Zhang, Qiong Luo, Zhenhui Li, Herui Yao, Hao Chen

    Abstract: Breast cancer is the most commonly diagnosed cancer and the leading cause of cancer-related mortality in women globally. Mammography is essential for the early detection and diagnosis of breast lesions. Despite recent progress in foundation models (FMs) for mammogram analysis, their clinical translation remains constrained by several fundamental limitations, including insufficient diversity in tra…

    Submitted 24 September, 2025; originally announced September 2025.

    Comments: 64 pages, 7 figures, 40 tables

  17. arXiv:2509.18631

    cs.RO cs.AI

    Generalizable Domain Adaptation for Sim-and-Real Policy Co-Training

    Authors: Shuo Cheng, Liqian Ma, Zhenyang Chen, Ajay Mandlekar, Caelan Garrett, Danfei Xu

    Abstract: Behavior cloning has shown promise for robot manipulation, but real-world demonstrations are costly to acquire at scale. While simulated data offers a scalable alternative, particularly with advances in automated demonstration generation, transferring policies to the real world is hampered by various simulation and real domain gaps. In this work, we propose a unified sim-and-real co-training frame…

    Submitted 24 September, 2025; v1 submitted 23 September, 2025; originally announced September 2025.

  18. arXiv:2509.18445

    cs.LG physics.app-ph

    MeshODENet: A Graph-Informed Neural Ordinary Differential Equation Neural Network for Simulating Mesh-Based Physical Systems

    Authors: Kangzheng Liu, Leixin Ma

    Abstract: The simulation of complex physical systems using a discretized mesh is a cornerstone of applied mechanics, but traditional numerical solvers are often computationally prohibitive for many-query tasks. While Graph Neural Networks (GNNs) have emerged as powerful surrogate models for mesh-based data, their standard autoregressive application for long-term prediction is often plagued by error accumula…

    Submitted 22 September, 2025; originally announced September 2025.

    Comments: 9 pages, 7 figures

  19. arXiv:2509.17765

    cs.CL cs.AI cs.CV eess.AS

    Qwen3-Omni Technical Report

    Authors: Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, et al. (13 additional authors not shown)

    Abstract: We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omn…

    Submitted 22 September, 2025; originally announced September 2025.

    Comments: https://github.com/QwenLM/Qwen3-Omni

  20. arXiv:2509.17664

    cs.CV cs.AI

    SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models

    Authors: Pingyi Chen, Yujing Lou, Shen Cao, Jinhui Guo, Lubin Fan, Yue Wu, Lin Yang, Lizhuang Ma, Jieping Ye

    Abstract: While vision language models (VLMs) excel in 2D semantic visual understanding, their ability to quantitatively reason about 3D spatial relationships remains under-explored, due to the deficiency of 2D images' spatial representation ability. In this paper, we analyze the problem hindering VLMs' spatial understanding abilities and propose SD-VLM, a novel framework that significantly enhances fundame…

    Submitted 22 September, 2025; originally announced September 2025.

    Comments: Accepted by NeurIPS 2025

  21. arXiv:2509.17660

    cs.CV

    Development and validation of an AI foundation model for endoscopic diagnosis of esophagogastric junction adenocarcinoma: a cohort and deep learning study

    Authors: Yikun Ma, Bo Li, Ying Chen, Zijie Yue, Shuchang Xu, Jingyao Li, Lei Ma, Liang Zhong, Duowu Zou, Leiming Xu, Yunshi Zhong, Xiaobo Li, Weiqun Ding, Minmin Zhang, Dongli He, Zhenghong Li, Ye Chen, Ye Zhao, Jialong Zhuo, Xiaofen Wu, Lisha Yi, Miaojing Shi, Huihui Sun

    Abstract: The early detection of esophagogastric junction adenocarcinoma (EGJA) is crucial for improving patient prognosis, yet its current diagnosis is highly operator-dependent. This paper aims to make the first attempt to develop an artificial intelligence (AI) foundation model-based method for both screening and staging diagnosis of EGJA using endoscopic images. In this cohort and learning study, we con…

    Submitted 23 September, 2025; v1 submitted 22 September, 2025; originally announced September 2025.

    Comments: Accepted to eClinicalMedicine, Part of The Lancet Discovery Science

  22. arXiv:2509.16268

    cs.SE cs.AI

    Digging Into the Internal: Causality-Based Analysis of LLM Function Calling

    Authors: Zhenlan Ji, Daoyuan Wu, Wenxuan Wang, Pingchuan Ma, Shuai Wang, Lei Ma

    Abstract: Function calling (FC) has emerged as a powerful technique for facilitating large language models (LLMs) to interact with external systems and perform structured tasks. However, the mechanisms through which it influences model behavior remain largely under-explored. Besides, we discover that in addition to the regular usage of FC, this technique can substantially enhance the compliance of LLMs with…

    Submitted 18 September, 2025; originally announced September 2025.

  23. arXiv:2509.14436

    cs.IR cs.AI

    When Content is Goliath and Algorithm is David: The Style and Semantic Effects of Generative Search Engine

    Authors: Lijia Ma, Juan Qin, Xingchen Xu, Yong Tan

    Abstract: Generative search engines (GEs) leverage large language models (LLMs) to deliver AI-generated summaries with website citations, establishing novel traffic acquisition channels while fundamentally altering the search engine optimization landscape. To investigate the distinctive characteristics of GEs, we collect data through interactions with Google's generative and conventional search platforms, c…

    Submitted 17 September, 2025; originally announced September 2025.

    Comments: 59 pages, 6 figures, 20 tables

    ACM Class: H.3.3; I.2.7; J.4

  24. arXiv:2509.12765

    cs.IR cs.AI cs.CL

    InfoGain-RAG: Boosting Retrieval-Augmented Generation via Document Information Gain-based Reranking and Filtering

    Authors: Zihan Wang, Zihan Liang, Zhou Shao, Yufei Ma, Huangyu Dai, Ben Chen, Lingtao Mao, Chenyi Lei, Yuqing Ding, Han Li

    Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising approach to address key limitations of Large Language Models (LLMs), such as hallucination, outdated knowledge, and lacking reference. However, current RAG frameworks often struggle with identifying whether retrieved documents meaningfully contribute to answer generation. This shortcoming makes it difficult to filter out irrelevant or…

    Submitted 16 September, 2025; originally announced September 2025.

    Comments: EMNLP'25 Oral Presentation.

  25. arXiv:2509.10140

    cs.CV

    Scalable Training for Vector-Quantized Networks with 100% Codebook Utilization

    Authors: Yifan Chang, Jie Qin, Limeng Qiao, Xiaofeng Wang, Zheng Zhu, Lin Ma, Xingang Wang

    Abstract: Vector quantization (VQ) is a key component in discrete tokenizers for image generation, but its training is often unstable due to straight-through estimation bias, one-step-behind updates, and sparse codebook gradients, which lead to suboptimal reconstruction performance and low codebook usage. In this work, we analyze these fundamental challenges and provide a simple yet effective solution. To m…

    Submitted 12 September, 2025; originally announced September 2025.

  26. arXiv:2509.09630

    cs.SE cs.CR

    I Know Who Clones Your Code: Interpretable Smart Contract Similarity Detection

    Authors: Zhenguang Liu, Lixun Ma, Zhongzheng Mu, Chengkun Wei, Xiaojun Xu, Yingying Jiao, Kui Ren

    Abstract: Widespread reuse of open-source code in smart contract development boosts programming efficiency but significantly amplifies bug propagation across contracts, while dedicated methods for detecting similar smart contract functions remain very limited. Conventional abstract-syntax-tree (AST) based methods for smart contract similarity detection face challenges in handling intricate tree structures,…

    Submitted 11 September, 2025; originally announced September 2025.

  27. arXiv:2509.07604

    cs.LG

    K2-Think: A Parameter-Efficient Reasoning System

    Authors: Zhoujun Cheng, Richard Fan, Shibo Hao, Taylor W. Killian, Haonan Li, Suqi Sun, Hector Ren, Alexander Moreno, Daqian Zhang, Tianjun Zhong, Yuxin Xiong, Yuanzhe Hu, Yutao Xie, Xudong Han, Yuqi Wang, Varad Pimpalkhute, Yonghao Zhuang, Aaryamonvikram Singh, Xuezhi Liang, Anze Xie, Jianshu She, Desai Fan, Chengqian Gao, Liqun Ma, Mikhail Yurochkin, et al. (6 additional authors not shown)

    Abstract: K2-Think is a reasoning system that achieves state-of-the-art performance with a 32B parameter model, matching or surpassing much larger models like GPT-OSS 120B and DeepSeek v3.1. Built on the Qwen2.5 base model, our system shows that smaller models can compete at the highest levels by combining advanced post-training and test-time computation techniques. The approach is based on six key technica…

    Submitted 14 September, 2025; v1 submitted 9 September, 2025; originally announced September 2025.

    Comments: To access the K2-Think reasoning system, please visit www.k2think.ai

  28. Micro-Expression Recognition via Fine-Grained Dynamic Perception

    Authors: Zhiwen Shao, Yifan Cheng, Fan Zhang, Xuehuai Shi, Canlin Li, Lizhuang Ma, Dit-yan Yeung

    Abstract: Facial micro-expression recognition (MER) is a challenging task, due to the transience, subtlety, and dynamics of micro-expressions (MEs). Most existing methods resort to hand-crafted features or deep networks, in which the former often additionally requires key frames, and the latter suffers from small-scale and low-diversity training data. In this paper, we develop a novel fine-grained dynamic p…

    Submitted 7 September, 2025; originally announced September 2025.

  29. arXiv:2509.03505

    cs.LG cs.AI cs.CL

    LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence

    Authors: Xingxuan Zhang, Gang Ren, Han Yu, Hao Yuan, Hui Wang, Jiansheng Li, Jiayun Wu, Lang Mo, Li Mao, Mingchao Hao, Ningbo Dai, Renzhe Xu, Shuyang Li, Tianyang Zhang, Yue He, Yuanrui Wang, Yunjia Zhang, Zijing Xu, Dongzhe Li, Fang Gao, Hao Zou, Jiandong Liu, Jiashuo Liu, Jiawei Xu, Kaijie Cheng, et al. (13 additional authors not shown)

    Abstract: We argue that progress toward general intelligence requires complementary foundation models grounded in language, the physical world, and structured data. This report presents LimiX, the first installment of our large structured-data models (LDMs). LimiX treats structured data as a joint distribution over variables and missingness, thus capable of addressing a wide range of tabular tasks through q…

    Submitted 3 September, 2025; originally announced September 2025.

    Comments: 56 pages

  30. arXiv:2509.03377

    cs.AR

    Amplifying Effective CXL Memory Bandwidth for LLM Inference via Transparent Near-Data Processing

    Authors: Rui Xie, Asad Ul Haq, Linsen Ma, Yunhua Fang, Zirak Burzin Engineer, Liu Liu, Tong Zhang

    Abstract: Large language model (LLM) inference is bottlenecked by the limited bandwidth of CXL-based memory used for capacity expansion. We introduce CXL-NDP, a transparent near-data processing architecture that amplifies effective CXL bandwidth without requiring changes to the CXL.mem interface or AI models. CXL-NDP integrates a precision-scalable bit-plane layout for dynamic quantization with transparent…

    Submitted 8 September, 2025; v1 submitted 3 September, 2025; originally announced September 2025.

  31. arXiv:2509.02322

    cs.CV

    OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds

    Authors: Longrong Yang, Zhixiong Zeng, Yufeng Zhong, Jing Huang, Liming Zheng, Lei Chen, Haibo Qiu, Zequn Qin, Lin Ma, Xi Li

    Abstract: Multimodal large language models are evolving toward multimodal agents capable of proactively executing tasks. Most agent research focuses on GUI or embodied scenarios, which correspond to agents interacting with 2D virtual worlds or 3D real worlds, respectively. However, many complex tasks typically require agents to interleavely interact with these two types of environment. We initially mix GUI…

    Submitted 2 September, 2025; originally announced September 2025.

  32. arXiv:2509.01322

    cs.CL cs.AI cs.DC cs.LG

    LongCat-Flash Technical Report

    Authors: Meituan LongCat Team, Bayan, Bei Li, Bingye Lei, Bo Wang, Bolin Rong, Chao Wang, Chao Zhang, Chen Gao, Chen Zhang, Cheng Sun, Chengcheng Han, Chenguang Xi, Chi Zhang, Chong Peng, Chuan Qin, Chuyu Zhang, Cong Chen, Congkui Wang, Dan Ma, Daoru Pan, Defei Bu, Dengchang Zhao, Deyang Kong, Dishan Liu, et al. (157 additional authors not shown)

    Abstract: We introduce LongCat-Flash, a 560-billion-parameter Mixture-of-Experts (MoE) language model designed for both computational efficiency and advanced agentic capabilities. Stemming from the need for scalable efficiency, LongCat-Flash adopts two novel designs: (a) Zero-computation Experts, which enables dynamic computational budget allocation and activates 18.6B-31.3B (27B on average) per token depen…

    Submitted 19 September, 2025; v1 submitted 1 September, 2025; originally announced September 2025.

  33. arXiv:2509.00195

    cs.LG

    Democratizing Agentic AI with Fast Test-Time Scaling on the Edge

    Authors: Hao Mark Chen, Zhiwen Mo, Guanxi Lu, Shuang Liang, Lingxiao Ma, Wayne Luk, Hongxiang Fan

    Abstract: Deploying agentic AI on edge devices is crucial for privacy and responsiveness, but memory constraints typically relegate these systems to smaller Large Language Models (LLMs) with inferior reasoning capabilities. Test-Time Scaling (TTS) can bridge this reasoning gap by dedicating more compute during inference, but existing methods incur prohibitive overhead on edge hardware. To overcome this, we…

    Submitted 29 August, 2025; originally announced September 2025.

  34. arXiv:2508.21767

    cs.CV

    UItron: Foundational GUI Agent with Advanced Perception and Planning

    Authors: Zhixiong Zeng, Jing Huang, Liming Zheng, Wenkang Han, Yufeng Zhong, Lei Chen, Longrong Yang, Yingjie Chu, Yuzhi He, Lin Ma

    Abstract: GUI agent aims to enable automated operations on Mobile/PC devices, which is an important task toward achieving artificial general intelligence. The rapid advancement of VLMs accelerates the development of GUI agents, owing to their powerful capabilities in visual understanding and task planning. However, building a GUI agent remains a challenging task due to the scarcity of operation trajectories…

    Submitted 29 August, 2025; originally announced August 2025.

    Comments: 24 pages

  35. arXiv:2508.20835

    cs.CV

    PointDGRWKV: Generalizing RWKV-like Architecture to Unseen Domains for Point Cloud Classification

    Authors: Hao Yang, Qianyu Zhou, Haijia Sun, Xiangtai Li, Xuequan Lu, Lizhuang Ma, Shuicheng Yan

    Abstract: Domain Generalization (DG) has been recently explored to enhance the generalizability of Point Cloud Classification (PCC) models toward unseen domains. Prior works are based on convolutional networks, Transformer or Mamba architectures, either suffering from limited receptive fields or high computational cost, or insufficient long-range dependency modeling. RWKV, as an emerging architecture, posse…

    Submitted 29 August, 2025; v1 submitted 28 August, 2025; originally announced August 2025.

  36. arXiv:2508.18295

    cs.SD cs.AI cs.CL eess.AS

    H-PRM: A Pluggable Hotword Pre-Retrieval Module for Various Speech Recognition Systems

    Authors: Huangyu Dai, Lingtao Mao, Ben Chen, Zihan Wang, Zihan Liang, Ying Han, Chenyi Lei, Han Li

    Abstract: Hotword customization is crucial in ASR to enhance the accuracy of domain-specific terms. It has been primarily driven by the advancements in traditional models and Audio large language models (LLMs). However, existing models often struggle with large-scale hotwords, since the recognition rate drops dramatically as the number of hotwords increases. In this paper, we introduce a novel hotword custo…

    Submitted 22 August, 2025; originally announced August 2025.

  37. arXiv:2508.16995

    stat.ML cs.LG

    GraphPPD: Posterior Predictive Modelling for Graph-Level Inference

    Authors: Soumyasundar Pal, Liheng Ma, Amine Natik, Yingxue Zhang, Mark Coates

    Abstract: Accurate modelling and quantification of predictive uncertainty is crucial in deep learning since it allows a model to make safer decisions when the data is ambiguous and facilitates the users' understanding of the model's confidence in its predictions. Along with the tremendously increasing research focus on graph neural networks (GNNs) in recent years, there have been numerous techniques…

    Submitted 23 August, 2025; originally announced August 2025.

  38. arXiv:2508.15548

    cs.AI

    DeepThink3D: Enhancing Large Language Models with Programmatic Reasoning in Complex 3D Situated Reasoning Tasks

    Authors: Jiayi Song, Rui Wan, Lipeng Ma, Weidong Yang, Qingyuan Zhou, Yixuan Li, Ben Fei

    Abstract: This work enhances the ability of large language models (LLMs) to perform complex reasoning in 3D scenes. Recent work has addressed the 3D situated reasoning task by invoking tool usage through large language models. Large language models call tools via APIs and integrate the generated programs through a chain of thought to solve problems based on the program results. However, due to the simplicit…

    Submitted 21 August, 2025; originally announced August 2025.

  39. arXiv:2508.13587

    cs.AI cs.CV

    Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation

    Authors: Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Liming Zheng, Yufeng Zhong, Lin Ma

    Abstract: While reinforcement learning (RL) has proven highly effective for general reasoning in vision-language models, its application to tasks requiring in-depth understanding of information-rich images and generation of structured outputs remains underexplored. Chart-to-code generation exemplifies this challenge, demanding complex reasoning over visual charts to generate structured code. Supervised fine…

    Submitted 19 August, 2025; originally announced August 2025.

    Comments: technical report

  40. arXiv:2508.13579

    cs.AI

    Toward Better EHR Reasoning in LLMs: Reinforcement Learning with Expert Attention Guidance

    Authors: Yue Fang, Yuxin Guo, Jiaran Gao, Hongxin Ding, Xinke Jiang, Weibin Liao, Yongxin Xu, Yinghao Zhu, Zhibang Yang, Liantao Ma, Junfeng Zhao, Yasha Wang

    Abstract: Improving large language models (LLMs) for electronic health record (EHR) reasoning is essential for enabling accurate and generalizable clinical predictions. While LLMs excel at medical text understanding, they underperform on EHR-based prediction tasks due to challenges in modeling temporally structured, high-dimensional data. Existing approaches often rely on hybrid paradigms, where LLMs serve…

    Submitted 19 August, 2025; originally announced August 2025.

  41. arXiv:2508.13231  [pdf, ps, other]

    cs.AR cs.AI cs.PF

    Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System

    Authors: Yunhua Fang, Rui Xie, Asad Ul Haq, Linsen Ma, Kaoutar El Maghraoui, Naigang Wang, Meng Wang, Liu Liu, Tong Zhang

    Abstract: Large Language Model (LLM) inference is increasingly constrained by memory bandwidth, with frequent access to the key-value (KV) cache dominating data movement. While attention sparsity reduces some memory traffic, the relevance of past tokens varies over time, requiring the full KV cache to remain accessible and sustaining pressure on both bandwidth and capacity. With advances in interconnects su…

    Submitted 15 September, 2025; v1 submitted 17 August, 2025; originally announced August 2025.

    Comments: IEEE Computer Architecture Letter

  42. arXiv:2508.13070  [pdf, ps, other]

    cs.CL cs.AI

    Reinforced Context Order Recovery for Adaptive Reasoning and Planning

    Authors: Long Ma, Fangwei Zhong, Yizhou Wang

    Abstract: Modern causal language models, followed by rapid developments in discrete diffusion models, can now produce a wide variety of interesting and useful content. However, these families of models are predominantly trained to output tokens with a fixed (left-to-right) or random order, which may deviate from the logical order in which tokens are generated originally. In this paper, we observe that curre…

    Submitted 18 August, 2025; originally announced August 2025.

  43. arXiv:2508.11913  [pdf, ps, other]

    cs.CR

    WebGeoInfer: A Structure-Free and Multi-Stage Framework for Geolocation Inference of Devices Exposing Information

    Authors: Huipeng Yang, Li Yang, Lichuan Ma, Lu Zhou, Junbo Jia, Anyuan Sang, Xinyue Wang

    Abstract: Remote management devices facilitate critical infrastructure monitoring for administrators but simultaneously increase asset exposure. Sensitive geographical information overlooked in exposed device management pages poses substantial security risks. Therefore, identifying devices that reveal location information due to administrator negligence is crucial for cybersecurity regulation. Despite the r…

    Submitted 16 August, 2025; originally announced August 2025.

  44. arXiv:2508.10760  [pdf, ps, other]

    q-bio.BM cs.AI

    FROGENT: An End-to-End Full-process Drug Design Agent

    Authors: Qihua Pan, Dong Xu, Jenna Xinyi Yao, Lijia Ma, Zexuan Zhu, Junkai Ji

    Abstract: Powerful AI tools for drug discovery reside in isolated web apps, desktop programs, and code libraries. Such fragmentation forces scientists to manage incompatible interfaces and specialized scripts, which can be a cumbersome and repetitive process. To address this issue, a Full-pROcess druG dEsign ageNT, named FROGENT, has been proposed. Specifically, FROGENT utilizes a Large Language Model and t…

    Submitted 14 August, 2025; originally announced August 2025.

    Comments: 9 pages, 5 figures

  45. arXiv:2508.09476  [pdf, ps, other]

    cs.CV

    From Large Angles to Consistent Faces: Identity-Preserving Video Generation via Mixture of Facial Experts

    Authors: Yuji Wang, Moran Li, Xiaobin Hu, Ran Yi, Jiangning Zhang, Chengming Xu, Weijian Cao, Yabiao Wang, Chengjie Wang, Lizhuang Ma

    Abstract: Current video generation models struggle with identity preservation under large facial angles, primarily facing two challenges: the difficulty in exploring an effective mechanism to integrate identity features into DiT structure, and the lack of targeted coverage of large facial angles in existing open-source video datasets. To address these, we present two key innovations. First, we introduce a M…

    Submitted 14 August, 2025; v1 submitted 13 August, 2025; originally announced August 2025.

  46. arXiv:2508.08730  [pdf, ps, other]

    cs.CL

    Magical: Medical Lay Language Generation via Semantic Invariance and Layperson-tailored Adaptation

    Authors: Weibin Liao, Tianlong Wang, Yinghao Zhu, Yasha Wang, Junyi Gao, Liantao Ma

    Abstract: Medical Lay Language Generation (MLLG) plays a vital role in improving the accessibility of complex scientific content for broader audiences. Recent literature on MLLG commonly employs parameter-efficient fine-tuning methods such as Low-Rank Adaptation (LoRA) to fine-tune large language models (LLMs) using paired expert-lay language datasets. However, LoRA struggles with the challenges posed by m…

    Submitted 12 August, 2025; originally announced August 2025.

  47. arXiv:2508.08549  [pdf, ps, other]

    cs.CV

    Boosting Generic Semi-Supervised Medical Image Segmentation via Diverse Teaching and Label Propagation

    Authors: Wei Li, Pengcheng Zhou, Linye Ma, Wenyi Zhao, Huihua Yang

    Abstract: Both limited annotation and domain shift are significant challenges frequently encountered in medical image segmentation, leading to derivative scenarios like semi-supervised medical image segmentation (SSMIS), semi-supervised medical domain generalization (Semi-MDG) and unsupervised medical domain adaptation (UMDA). Conventional methods are generally tailored to specific tasks in isolation, the error accumulation h…

    Submitted 11 August, 2025; originally announced August 2025.

  48. arXiv:2508.08219  [pdf, ps, other]

    cs.CV

    SAGOnline: Segment Any Gaussians Online

    Authors: Wentao Sun, Quanyun Wu, Hanqing Xu, Kyle Gao, Zhengsen Xu, Yiping Chen, Dedong Zhang, Lingfei Ma, John S. Zelek, Jonathan Li

    Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful paradigm for explicit 3D scene representation, yet achieving efficient and consistent 3D segmentation remains challenging. Current methods suffer from prohibitive computational costs, limited 3D spatial reasoning, and an inability to track multiple objects simultaneously. We present Segment Any Gaussians Online (SAGOnline), a lightweight and z…

    Submitted 11 August, 2025; originally announced August 2025.

    Comments: 19 pages, 10 figures

  49. arXiv:2508.05602  [pdf, ps, other]

    cs.CV

    LLaVA-RE: Binary Image-Text Relevancy Evaluation with Multimodal Large Language Model

    Authors: Tao Sun, Oliver Liu, JinJin Li, Lan Ma

    Abstract: Multimodal generative AI usually involves generating image or text responses given inputs in another modality. The evaluation of image-text relevancy is essential for measuring response quality or ranking candidate responses. In particular, binary relevancy evaluation, i.e., "Relevant" vs. "Not Relevant", is a fundamental problem. However, this is a challenging task considering that texts have…

    Submitted 7 August, 2025; originally announced August 2025.

    Comments: Published in the First Workshop of Evaluation of Multi-Modal Generation 2025

  50. arXiv:2508.04915  [pdf, ps, other]

    cs.AI cs.CL cs.MA

    ConfAgents: A Conformal-Guided Multi-Agent Framework for Cost-Efficient Medical Diagnosis

    Authors: Huiya Zhao, Yinghao Zhu, Zixiang Wang, Yasha Wang, Junyi Gao, Liantao Ma

    Abstract: The efficacy of AI agents in healthcare research is hindered by their reliance on static, predefined strategies. This creates a critical limitation: agents can become better tool-users but cannot learn to become better strategic planners, a crucial skill for complex domains like healthcare. We introduce HealthFlow, a self-evolving AI agent that overcomes this limitation through a novel meta-level…

    Submitted 6 August, 2025; originally announced August 2025.

    Comments: Code: https://github.com/PKU-AICare/ConfAgents