
Showing 1–50 of 1,639 results for author: Ma, J

Searching in archive cs.
  1. arXiv:2505.09698  [pdf, ps, other]

    cs.RO cs.AI

    ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation

    Authors: Enyu Zhao, Vedant Raval, Hejia Zhang, Jiageng Mao, Zeyu Shangguan, Stefanos Nikolaidis, Yue Wang, Daniel Seita

    Abstract: Vision-Language Models (VLMs) have revolutionized artificial intelligence and robotics due to their commonsense reasoning capabilities. In robotic manipulation, VLMs are used primarily as high-level planners, but recent work has also studied their lower-level reasoning ability, which refers to making decisions about precise robot movements. However, the community currently lacks a clear and common…

    Submitted 14 May, 2025; originally announced May 2025.

    Comments: 47 pages, 29 figures. Under review

  2. arXiv:2505.09103  [pdf, ps, other]

    cs.RO

    VGC-RIO: A Tightly Integrated Radar-Inertial Odometry with Spatial Weighted Doppler Velocity and Local Geometric Constrained RCS Histograms

    Authors: Jianguang Xiang, Xiaofeng He, Zizhuo Chen, Lilian Zhang, Xincan Luo, Jun Mao

    Abstract: Recent advances in 4D radar-inertial odometry have demonstrated promising potential for autonomous localization in adverse conditions. However, effective handling of sparse and noisy radar measurements remains a critical challenge. In this paper, we propose a radar-inertial odometry with a spatial weighting method that adapts to unevenly distributed points and a novel point-description histogram…

    Submitted 14 May, 2025; v1 submitted 13 May, 2025; originally announced May 2025.

  3. arXiv:2505.08915  [pdf, ps, other]

    cs.LG cond-mat.dis-nn cond-mat.stat-mech

    An Analytical Characterization of Sloppiness in Neural Networks: Insights from Linear Models

    Authors: Jialin Mao, Itay Griniasty, Yan Sun, Mark K. Transtrum, James P. Sethna, Pratik Chaudhari

    Abstract: Recent experiments have shown that training trajectories of multiple deep neural networks with different architectures, optimization algorithms, hyper-parameter settings, and regularization methods evolve on a remarkably low-dimensional "hyper-ribbon-like" manifold in the space of probability distributions. Inspired by the similarities in the training trajectories of deep networks and linear netwo…

    Submitted 13 May, 2025; originally announced May 2025.

  4. arXiv:2505.06371  [pdf, ps, other]

    cs.LG cs.AI

    The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization

    Authors: Jae-Won Chung, Jiachen Liu, Jeff J. Ma, Ruofan Wu, Oh Jun Kweon, Yuxuan Xia, Zhiyu Wu, Mosharaf Chowdhury

    Abstract: As the adoption of Generative AI in real-world services grows explosively, energy has emerged as a critical bottleneck resource. However, energy remains a metric that is often overlooked, under-explored, or poorly understood in the context of building ML systems. We present the ML.ENERGY Benchmark, a benchmark suite and tool for measuring inference energy consumption under realistic service environ…

    Submitted 9 May, 2025; originally announced May 2025.

    Comments: Leaderboard: https://ml.energy/leaderboard

  5. arXiv:2505.06191  [pdf, ps, other]

    cs.AI cs.CL cs.CV cs.LG cs.RO

    Neuro-Symbolic Concepts

    Authors: Jiayuan Mao, Joshua B. Tenenbaum, Jiajun Wu

    Abstract: This article presents a concept-centric paradigm for building agents that can learn continually and reason flexibly. The concept-centric agent utilizes a vocabulary of neuro-symbolic concepts. These concepts, such as object, relation, and action concepts, are grounded on sensory inputs and actuation outputs. They are also compositional, allowing for the creation of novel concepts through their str…

    Submitted 9 May, 2025; originally announced May 2025.

    Comments: To appear in Communications of the ACM

  6. arXiv:2505.05360  [pdf, other]

    cs.RO

    DSDrive: Distilling Large Language Model for Lightweight End-to-End Autonomous Driving with Unified Reasoning and Planning

    Authors: Wenru Liu, Pei Liu, Jun Ma

    Abstract: We present DSDrive, a streamlined end-to-end paradigm tailored for integrating the reasoning and planning of autonomous vehicles into a unified framework. DSDrive leverages a compact LLM that employs a distillation method to preserve the enhanced reasoning capabilities of a larger-sized vision language model (VLM). To effectively align the reasoning and planning tasks, a waypoint-driven dual-head…

    Submitted 8 May, 2025; originally announced May 2025.

  7. arXiv:2505.04877  [pdf, other]

    cs.CV cs.AI

    Learning from Loss Landscape: Generalizable Mixed-Precision Quantization via Adaptive Sharpness-Aware Gradient Aligning

    Authors: Lianbo Ma, Jianlun Ma, Yuee Zhou, Guoyang Xie, Qiang He, Zhichao Lu

    Abstract: Mixed Precision Quantization (MPQ) has become an essential technique for optimizing neural networks by determining the optimal bitwidth per layer. Existing MPQ methods, however, face a major hurdle: they require a computationally expensive search for quantization policies on large-scale datasets. To resolve this issue, we introduce a novel approach that first searches for quantization policies on s…

    Submitted 7 May, 2025; originally announced May 2025.

  8. arXiv:2505.04723  [pdf, other]

    cs.CL

    SOAEsV2-7B/72B: Full-Pipeline Optimization for State-Owned Enterprise LLMs via Continual Pre-Training, Domain-Progressive SFT and Distillation-Enhanced Speculative Decoding

    Authors: Jingyang Deng, Ran Chen, Jo-Ku Cheng, Jinwen Ma

    Abstract: This study addresses key challenges in developing domain-specific large language models (LLMs) for Chinese state-owned assets and enterprises (SOAEs), where current approaches face three limitations: 1) constrained model capacity that limits knowledge integration and cross-task adaptability; 2) excessive reliance on domain-specific supervised fine-tuning (SFT) data, which neglects the broader appl…

    Submitted 7 May, 2025; originally announced May 2025.

  9. arXiv:2505.03460  [pdf, other]

    cs.RO

    LogisticsVLN: Vision-Language Navigation For Low-Altitude Terminal Delivery Based on Agentic UAVs

    Authors: Xinyuan Zhang, Yonglin Tian, Fei Lin, Yue Liu, Jing Ma, Kornélia Sára Szatmáry, Fei-Yue Wang

    Abstract: The growing demand for intelligent logistics, particularly fine-grained terminal delivery, underscores the need for autonomous UAV (Unmanned Aerial Vehicle)-based delivery systems. However, most existing last-mile delivery studies rely on ground robots, while current UAV-based Vision-Language Navigation (VLN) tasks primarily focus on coarse-grained, long-range goals, making them unsuitable for pre…

    Submitted 6 May, 2025; originally announced May 2025.

  10. arXiv:2505.03380  [pdf, other]

    cs.CV cs.AI eess.IV

    Reinforced Correlation Between Vision and Language for Precise Medical AI Assistant

    Authors: Haonan Wang, Jiaji Mao, Lehan Wang, Qixiang Zhang, Marawan Elbatel, Yi Qin, Huijun Hu, Baoxun Li, Wenhui Deng, Weifeng Qin, Hongrui Li, Jialin Liang, Jun Shen, Xiaomeng Li

    Abstract: Medical AI assistants support doctors in disease diagnosis, medical image analysis, and report generation. However, they still face significant challenges in clinical use, including limited accuracy with multimodal content and insufficient validation in real-world settings. We propose RCMed, a full-stack AI assistant that improves multimodal alignment in both input and output, enabling precise ana…

    Submitted 6 May, 2025; originally announced May 2025.

  11. arXiv:2505.03320  [pdf, other]

    cs.CL

    Recall with Reasoning: Chain-of-Thought Distillation for Mamba's Long-Context Memory and Extrapolation

    Authors: Junyu Ma, Tianqing Fang, Zhisong Zhang, Hongming Zhang, Haitao Mi, Dong Yu

    Abstract: Mamba's theoretical infinite-context potential is limited in practice when sequences far exceed training lengths. This work explores unlocking Mamba's long-context memory ability by a simple-yet-effective method, Recall with Reasoning (RwR), by distilling chain-of-thought (CoT) summarization from a teacher model. Specifically, RwR prepends these summarizations as CoT prompts during fine-tuning, tea…

    Submitted 6 May, 2025; originally announced May 2025.

  12. arXiv:2505.03007  [pdf, other]

    cs.CV

    NTIRE 2025 Challenge on UGC Video Enhancement: Methods and Results

    Authors: Nikolay Safonov, Alexey Bryncev, Andrey Moskalenko, Dmitry Kulikov, Dmitry Vatolin, Radu Timofte, Haibo Lei, Qifan Gao, Qing Luo, Yaqing Li, Jie Song, Shaozhe Hao, Meisong Zheng, Jingyi Xu, Chengbin Wu, Jiahui Liu, Ying Chen, Xin Deng, Mai Xu, Peipei Liang, Jie Ma, Junjie Jin, Yingxue Pang, Fangzhou Luo, Kai Chen, et al. (6 additional authors not shown)

    Abstract: This paper presents an overview of the NTIRE 2025 Challenge on UGC Video Enhancement. The challenge constructed a set of 150 user-generated content videos without reference ground truth, which suffer from real-world degradations such as noise, blur, faded colors, compression artifacts, etc. The goal of the participants was to develop an algorithm capable of improving the visual quality of such vid…

    Submitted 5 May, 2025; originally announced May 2025.

  13. arXiv:2505.01768  [pdf, ps, other]

    eess.IV cs.CV

    Continuous Filtered Backprojection by Learnable Interpolation Network

    Authors: Hui Lin, Dong Zeng, Qi Xie, Zerui Mao, Jianhua Ma, Deyu Meng

    Abstract: Accurate reconstruction of computed tomography (CT) images is crucial in the medical imaging field. However, there are unavoidable interpolation errors in the backprojection step of the conventional reconstruction methods, i.e., filtered-back-projection based methods, which are detrimental to the accurate reconstruction. In this study, to address this issue, we propose a novel deep learning model, nam…

    Submitted 3 May, 2025; originally announced May 2025.

    Comments: 14 pages, 10 figures

  14. arXiv:2505.00515  [pdf, other]

    cs.RO cs.AI cs.MA

    Safety-Critical Traffic Simulation with Guided Latent Diffusion Model

    Authors: Mingxing Peng, Ruoyu Yao, Xusen Guo, Yuting Xie, Xianda Chen, Jun Ma

    Abstract: Safety-critical traffic simulation plays a crucial role in evaluating autonomous driving systems under rare and challenging scenarios. However, existing approaches often generate unrealistic scenarios due to insufficient consideration of physical plausibility and suffer from low generation efficiency. To address these limitations, we propose a guided latent diffusion model (LDM) capable of generat…

    Submitted 1 May, 2025; originally announced May 2025.

    Comments: 7 pages, 3 figures

  15. arXiv:2504.21487  [pdf, other]

    cs.CV

    DGSolver: Diffusion Generalist Solver with Universal Posterior Sampling for Image Restoration

    Authors: Hebaixu Wang, Jing Zhang, Haonan Guo, Di Wang, Jiayi Ma, Bo Du

    Abstract: Diffusion models have achieved remarkable progress in universal image restoration. While existing methods speed up inference by reducing sampling steps, substantial step intervals often introduce cumulative errors. Moreover, they struggle to balance the commonality of degradation representations and restoration quality. To address these challenges, we introduce DGSolver, a diffusion gener…

    Submitted 8 May, 2025; v1 submitted 30 April, 2025; originally announced April 2025.

  16. arXiv:2504.20569  [pdf, other]

    cs.CR

    VIMU: Effective Physics-based Realtime Detection and Recovery against Stealthy Attacks on UAVs

    Authors: Yunbo Wang, Cong Sun, Qiaosen Liu, Bingnan Su, Zongxu Zhang, Michael Norris, Gang Tan, Jianfeng Ma

    Abstract: Sensor attacks on robotic vehicles have become pervasive and manipulative. Their latest advancements exploit sensor and detector characteristics to bypass detection. Recent security efforts have leveraged the physics-based model to detect or mitigate sensor attacks. However, these approaches are only resilient to a few sensor attacks and still need improvement in detection effectiveness. We presen…

    Submitted 29 April, 2025; originally announced April 2025.

    Comments: In proceedings of ACSAC 2024. Author version with figure fixes

  17. arXiv:2504.19867  [pdf, other]

    cs.CL cs.DC cs.LG

    semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage

    Authors: Ke Hong, Lufang Chen, Zhong Wang, Xiuhong Li, Qiuli Mao, Jianping Ma, Chao Xiong, Guanyu Wu, Buhe Han, Guohao Dai, Yun Liang, Yu Wang

    Abstract: Existing large language model (LLM) serving systems fall into two categories: 1) a unified system where prefill phase and decode phase are co-located on the same GPU, sharing the unified computational resource and storage, and 2) a disaggregated system where the two phases are disaggregated to different GPUs. The design of the disaggregated system addresses the latency interference and sophisticat…

    Submitted 28 April, 2025; originally announced April 2025.

    Comments: 18 pages, 16 figures

  18. arXiv:2504.19148  [pdf]

    cs.AI

    A Dynamic Fuzzy Rule and Attribute Management Framework for Fuzzy Inference Systems in High-Dimensional Data

    Authors: Ke Liu, Jing Ma, Edmund M-K Lai

    Abstract: This paper presents an Adaptive Dynamic Attribute and Rule (ADAR) framework designed to address the challenges posed by high-dimensional data in neuro-fuzzy inference systems. By integrating dual weighting mechanisms, which assign adaptive importance to both attributes and rules, together with automated growth and pruning strategies, ADAR adaptively streamlines complex fuzzy models without sacrificing…

    Submitted 27 April, 2025; originally announced April 2025.

  19. arXiv:2504.18904  [pdf, other]

    cs.RO

    RoboVerse: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning

    Authors: Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Charlie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, Yutong Liang, Dylan Goetting, Chaoyi Xu, Haozhe Chen, Yuxi Qian, Yiran Geng, Jiageng Mao, Weikang Wan, Mingtong Zhang, Jiangran Lyu, Siheng Zhao, Jiazhao Zhang, Jialiang Zhang, Chengyang Zhao, Haoran Lu, et al. (12 additional authors not shown)

    Abstract: Data scaling and standardized evaluation benchmarks have driven significant advances in natural language processing and computer vision. However, robotics faces unique challenges in scaling data and establishing evaluation protocols. Collecting real-world data is resource-intensive and inefficient, while benchmarking in real-world scenarios remains highly complex. Synthetic data and simulation off…

    Submitted 26 April, 2025; originally announced April 2025.

  20. arXiv:2504.17033  [pdf, ps, other]

    cs.DS

    Breaking the Sorting Barrier for Directed Single-Source Shortest Paths

    Authors: Ran Duan, Jiayi Mao, Xiao Mao, Xinkai Shu, Longhui Yin

    Abstract: We give a deterministic $O(m\log^{2/3}n)$-time algorithm for single-source shortest paths (SSSP) on directed graphs with real non-negative edge weights in the comparison-addition model. This is the first result to break the $O(m+n\log n)$ time bound of Dijkstra's algorithm on sparse graphs, showing that Dijkstra's algorithm is not optimal for SSSP.

    Submitted 23 April, 2025; originally announced April 2025.

    Comments: 17 pages

  21. arXiv:2504.16649  [pdf, other]

    cs.RO

    PP-Tac: Paper Picking Using Tactile Feedback in Dexterous Robotic Hands

    Authors: Pei Lin, Yuzhe Huang, Wanlin Li, Jianpeng Ma, Chenxi Xiao, Ziyuan Jiao

    Abstract: Robots are increasingly envisioned as human companions, assisting with everyday tasks that often involve manipulating deformable objects. Although recent advances in robotic hardware and embodied AI have expanded their capabilities, current systems still struggle with handling thin, flat, and deformable objects such as paper and fabric. This limitation arises from the lack of suitable perception t…

    Submitted 23 April, 2025; originally announced April 2025.

    Comments: Accepted by Robotics: Science and Systems (RSS) 2025

  22. arXiv:2504.15585  [pdf, other]

    cs.CR cs.AI cs.CL cs.LG

    A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment

    Authors: Kun Wang, Guibin Zhang, Zhenhong Zhou, Jiahao Wu, Miao Yu, Shiqian Zhao, Chenlong Yin, Jinhu Fu, Yibo Yan, Hanjun Luo, Liang Lin, Zhihao Xu, Haolang Lu, Xinye Cao, Xinyun Zhou, Weifei Jin, Fanci Meng, Junyuan Mao, Hao Wu, Minghe Wang, Fan Zhang, Junfeng Fang, Chengwei Liu, Yifan Zhang, Qiankun Li, et al. (57 additional authors not shown)

    Abstract: The remarkable success of Large Language Models (LLMs) has illuminated a promising pathway toward achieving Artificial General Intelligence for both academic and industrial communities, owing to their unprecedented performance across various applications. As LLMs continue to gain prominence in both research and commercial domains, their security and safety implications have become a growing concer…

    Submitted 22 April, 2025; originally announced April 2025.

  23. arXiv:2504.15415  [pdf, other]

    cs.CV cs.CL

    IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

    Authors: David Ma, Yuanxing Zhang, Jincheng Ren, Jarvis Guo, Yifan Yao, Zhenlin Wei, Zhenzhu Yang, Zhongyuan Peng, Boyu Feng, Jun Ma, Xiao Gu, Zhoufutu Wen, King Zhu, Yancheng He, Meng Cao, Shiwen Ni, Jiaheng Liu, Wenhao Huang, Ge Zhang, Xiaojie Jin

    Abstract: Existing evaluation frameworks for Multimodal Large Language Models (MLLMs) primarily focus on image reasoning or general video understanding tasks, largely overlooking the significant role of image context in video comprehension. To bridge this gap, we propose IV-Bench, the first comprehensive benchmark for evaluating Image-Grounded Video Perception and Reasoning. IV-Bench consists of 967 videos…

    Submitted 21 April, 2025; originally announced April 2025.

  24. arXiv:2504.14862  [pdf, other]

    cs.RO

    FERMI: Flexible Radio Mapping with a Hybrid Propagation Model and Scalable Autonomous Data Collection

    Authors: Yiming Luo, Yunfei Wang, Hongming Chen, Chengkai Wu, Ximin Lyu, Jinni Zhou, Jun Ma, Fu Zhang, Boyu Zhou

    Abstract: Communication is fundamental for multi-robot collaboration, with accurate radio mapping playing a crucial role in predicting signal strength between robots. However, modeling radio signal propagation in large and occluded environments is challenging due to complex interactions between signals and obstacles. Existing methods face two key limitations: they struggle to predict signal strength for tra…

    Submitted 21 April, 2025; originally announced April 2025.

    Comments: Published at RSS 2025

  25. arXiv:2504.14847  [pdf, other]

    cs.CV

    Reliable Multi-Modal Object Re-Identification via Modality-Aware Graph Reasoning

    Authors: Xixi Wan, Aihua Zheng, Zi Wang, Bo Jiang, Jin Tang, Jixin Ma

    Abstract: Multi-modal data provides abundant and diverse object information, crucial for effective modal interactions in Re-Identification (ReID) tasks. However, existing approaches often overlook the quality variations in local features and fail to fully leverage the complementary information across modalities, particularly in the case of low-quality features. In this paper, we propose to address this issu…

    Submitted 20 April, 2025; originally announced April 2025.

  26. arXiv:2504.14776  [pdf, other]

    cs.HC

    Script2Screen: Supporting Dialogue Scriptwriting with Interactive Audiovisual Generation

    Authors: Zhecheng Wang, Jiaju Ma, Eitan Grinspun, Bryan Wang, Tovi Grossman

    Abstract: Scriptwriting has traditionally been text-centric, a modality that only partially conveys the produced audiovisual experience. A formative study with professional writers informed us that connecting textual and audiovisual modalities can aid ideation and iteration, especially for writing dialogues. In this work, we present Script2Screen, an AI-assisted tool that integrates scriptwriting with audio…

    Submitted 20 April, 2025; originally announced April 2025.

  27. arXiv:2504.14482  [pdf, other]

    cs.CL cs.SD

    DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue

    Authors: Xiang Li, Duyi Pan, Hongru Xiao, Jiale Han, Jing Tang, Jiabao Ma, Wei Wang, Bo Cheng

    Abstract: Speech synthesis is crucial for human-computer interaction, enabling natural and intuitive communication. However, existing datasets involve high construction costs due to manual annotation and suffer from limited character diversity, contextual scenarios, and emotional expressiveness. To address these issues, we propose DialogueAgents, a novel hybrid agent-based speech synthesis framework, which…

    Submitted 20 April, 2025; originally announced April 2025.

    Comments: Accepted by ICME 2025. Dataset and code are publicly available: https://github.com/uirlx/DialogueAgents

  28. arXiv:2504.14478  [pdf, other]

    cs.RO

    ApexNav: An Adaptive Exploration Strategy for Zero-Shot Object Navigation with Target-centric Semantic Fusion

    Authors: Mingjie Zhang, Yuheng Du, Chengkai Wu, Jinni Zhou, Zhenchao Qi, Jun Ma, Boyu Zhou

    Abstract: Navigating unknown environments to find a target object is a significant challenge. While semantic information is crucial for navigation, relying solely on it for decision-making may not always be efficient, especially in environments with weak semantic cues. Additionally, many methods are susceptible to misdetections, especially in environments with visually similar objects. To address these limi…

    Submitted 22 April, 2025; v1 submitted 20 April, 2025; originally announced April 2025.

  29. arXiv:2504.13914  [pdf, other]

    cs.CL

    Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning

    Authors: ByteDance Seed: Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqi Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, Yufeng Yuan, Yu Yue, Lin Yan, Qiying Yu, Xiaochen Zuo, Chi Zhang, Ruofei Zhu, Zhecheng An, Zhihao Bai, Yu Bao, Xingyan Bin, Jiangjie Chen, Feng Chen, Hongmin Chen, et al. (249 additional authors not shown)

    Abstract: We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For in…

    Submitted 29 April, 2025; v1 submitted 10 April, 2025; originally announced April 2025.

  30. arXiv:2504.12773  [pdf, other]

    cs.CL cs.AI

    Enhancing the Geometric Problem-Solving Ability of Multimodal LLMs via Symbolic-Neural Integration

    Authors: Yicheng Pan, Zhenrong Zhang, Pengfei Hu, Jiefeng Ma, Jun Du, Jianshu Zhang, Quan Liu, Jianqing Gao, Feng Ma

    Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have achieved remarkable progress in general domains and demonstrated promise in multimodal mathematical reasoning. However, applying MLLMs to geometry problem solving (GPS) remains challenging due to lack of accurate step-by-step solution data and severe hallucinations during reasoning. In this paper, we propose GeoGen, a pipeline that c…

    Submitted 17 April, 2025; originally announced April 2025.

    Comments: 10 pages, 5 figures

  31. arXiv:2504.12330  [pdf, other]

    cs.CL cs.AI

    HM-RAG: Hierarchical Multi-Agent Multimodal Retrieval Augmented Generation

    Authors: Pei Liu, Xin Liu, Ruoyu Yao, Junming Liu, Siyuan Meng, Ding Wang, Jun Ma

    Abstract: While Retrieval-Augmented Generation (RAG) augments Large Language Models (LLMs) with external knowledge, conventional single-agent RAG remains fundamentally limited in resolving complex queries demanding coordinated reasoning across heterogeneous data ecosystems. We present HM-RAG, a novel Hierarchical Multi-agent Multimodal RAG framework that pioneers collaborative intelligence for dynamic knowl…

    Submitted 13 April, 2025; originally announced April 2025.

  32. arXiv:2504.12276  [pdf, other]

    cs.CV

    The Tenth NTIRE 2025 Image Denoising Challenge Report

    Authors: Lei Sun, Hang Guo, Bin Ren, Luc Van Gool, Radu Timofte, Yawei Li, Xiangyu Kong, Hyunhee Park, Xiaoxuan Yu, Suejin Han, Hakjae Jeon, Jia Li, Hyung-Ju Chun, Donghun Ryou, Inju Ha, Bohyung Han, Jingyu Ma, Zhijuan Huang, Huiyuan Fu, Hongyuan Yu, Boqi Zhang, Jiawei Shi, Heng Zhang, Huadong Ma, Deepak Kumar Tyagi, et al. (69 additional authors not shown)

    Abstract: This paper presents an overview of the NTIRE 2025 Image Denoising Challenge (σ = 50), highlighting the proposed methodologies and corresponding results. The primary objective is to develop a network architecture capable of achieving high-quality denoising performance, quantitatively evaluated using PSNR, without constraints on computational complexity or model size. The task assumes independent ad…

    Submitted 16 April, 2025; originally announced April 2025.

  33. arXiv:2504.12234  [pdf, other]

    cs.SE

    MOS: Towards Effective Smart Contract Vulnerability Detection through Mixture-of-Experts Tuning of Large Language Models

    Authors: Hang Yuan, Lei Yu, Zhirong Huang, Jingyuan Zhang, Junyi Lu, Shiqi Cheng, Li Yang, Fengjun Zhang, Jiajia Ma, Chun Zuo

    Abstract: Smart contract vulnerabilities pose significant security risks to blockchain systems, potentially leading to severe financial losses. Existing methods face several limitations: (1) Program analysis-based approaches rely on predefined patterns, lacking flexibility for new vulnerability types; (2) Deep learning-based methods lack explanations; (3) Large language model-based approaches suffer from hi…

    Submitted 16 April, 2025; originally announced April 2025.

  34. arXiv:2504.12034  [pdf, other]

    cs.SE cs.CR

    OpDiffer: LLM-Assisted Opcode-Level Differential Testing of Ethereum Virtual Machine

    Authors: Jie Ma, Ningyu He, Jinwen Xi, Mingzhe Xing, Haoyu Wang, Ying Gao, Yinliang Yue

    Abstract: As Ethereum continues to thrive, the Ethereum Virtual Machine (EVM) has become the cornerstone powering tens of millions of active smart contracts. Intuitively, security issues in EVMs could lead to inconsistent behaviors among smart contracts or even denial-of-service of the entire blockchain network. However, to the best of our knowledge, only a limited number of studies focus on the security of…

    Submitted 16 April, 2025; originally announced April 2025.

    Comments: To appear in ISSTA'25

  35. arXiv:2504.11671  [pdf, other]

    cs.AI cs.CY cs.LG econ.GN

    Steering Prosocial AI Agents: Computational Basis of LLM's Decision Making in Social Simulation

    Authors: Ji Ma

    Abstract: Large language models (LLMs) increasingly serve as human-like decision-making agents in social science and applied settings. These LLM-agents are typically assigned human-like characters and placed in real-life contexts. However, how these characters and contexts shape an LLM's behavior remains underexplored. This study proposes and tests methods for probing, quantifying, and modifying an LLM's in…

    Submitted 15 April, 2025; originally announced April 2025.

  36. arXiv:2504.11009  [pdf, other]

    cs.MM

    MMC: Iterative Refinement of VLM Reasoning via MCTS-based Multimodal Critique

    Authors: Shuhang Liu, Zhenrong Zhang, Pengfei Hu, Jiefeng Ma, Jun Du, Qing Wang, Jianshu Zhang, Quan Liu, Jianqing Gao, Feng Ma

    Abstract: Visual language models (VLMs) have demonstrated strong performance across diverse multimodal reasoning tasks but still face challenges such as hallucinations, resulting in incorrect reasoning outcomes. Inspired by recent research on external feedback mechanisms in large language models (LLMs), we propose a multimodal actor-critic framework to enhance VLM reasoning capabilities. Specifically, the a…

    Submitted 15 April, 2025; originally announced April 2025.

  37. arXiv:2504.10222  [pdf, other]

    cs.MM

    PRM-BAS: Enhancing Multimodal Reasoning through PRM-guided Beam Annealing Search

    Authors: Pengfei Hu, Zhenrong Zhang, Qikai Chang, Shuhang Liu, Jiefeng Ma, Jun Du, Jianshu Zhang, Quan Liu, Jianqing Gao, Feng Ma, Qingfeng Liu

    Abstract: Recent work increasingly focuses on improving the reasoning capabilities of Multimodal Large Language Models (MLLMs). Among existing methods, Process Reward Models (PRMs) stand out for offering dense, step-wise supervision to guide intermediate reasoning. However, how to effectively integrate PRMs into search strategies remains an open question. In this paper, we introduce PRM-BAS (PRM-Guided Beam…

    Submitted 14 April, 2025; originally announced April 2025.

  38. arXiv:2504.10146  [pdf, other]

    cs.LG cs.AI

    GeoUni: A Unified Model for Generating Geometry Diagrams, Problems and Problem Solutions

    Authors: Jo-Ku Cheng, Zeren Zhang, Ran Chen, Jingyang Deng, Ziran Qin, Jinwen Ma

    Abstract: We propose GeoUni, the first unified geometry expert model capable of generating problem solutions and diagrams within a single framework in a way that enables the creation of unique and individualized geometry problems. Traditionally, solving geometry problems and generating diagrams have been treated as separate tasks in machine learning, with no models successfully integrating both to support p…

    Submitted 8 May, 2025; v1 submitted 14 April, 2025; originally announced April 2025.

  39. arXiv:2504.09664  [pdf, other]

    cs.LG

    Adapting to the Unknown: Robust Meta-Learning for Zero-Shot Financial Time Series Forecasting

    Authors: Anxian Liu, Junying Ma, Guang Zhang

    Abstract: Financial time series forecasting in the zero-shot setting is essential for risk management and investment decision-making, particularly during abrupt market regime shifts or in emerging markets with limited historical data. While Model-Agnostic Meta-Learning (MAML)-based approaches have shown promise in this domain, existing meta task construction strategies often lead to suboptimal performance,…

    Submitted 13 April, 2025; originally announced April 2025.

  40. arXiv:2504.09662  [pdf, other]

    cs.MA cs.AI cs.HC

    AgentDynEx: Nudging the Mechanics and Dynamics of Multi-Agent Simulations

    Authors: Jenny Ma, Riya Sahni, Karthik Sreedhar, Lydia B. Chilton

    Abstract: Multi-agent large language model simulations have the potential to model complex human behaviors and interactions. If the mechanics are set up properly, unanticipated and valuable social dynamics can surface. However, it is challenging to consistently enforce simulation mechanics while still allowing for notable and emergent dynamics. We present AgentDynEx, an AI system that helps set up simulatio…

    Submitted 13 April, 2025; originally announced April 2025.

  41. arXiv:2504.09648  [pdf, other]

    cs.LG cs.CV stat.CO

    RANSAC Revisited: An Improved Algorithm for Robust Subspace Recovery under Adversarial and Noisy Corruptions

    Authors: Guixian Chen, Jianhao Ma, Salar Fattahi

    Abstract: In this paper, we study the problem of robust subspace recovery (RSR) in the presence of both strong adversarial corruptions and Gaussian noise. Specifically, given a limited number of noisy samples -- some of which are tampered by an adaptive and strong adversary -- we aim to recover a low-dimensional subspace that approximately contains a significant fraction of the uncorrupted samples, up to an…

    Submitted 13 April, 2025; originally announced April 2025.

  42. arXiv:2504.09075  [pdf]

    physics.geo-ph cs.DC

    Parallel Seismic Data Processing Performance with Cloud-based Storage

    Authors: Sasmita Mohapatra, Weiming Yang, Zhengtang Yang, Chenxiao Wang, Jinxin Ma, Gary L. Pavlis, Yinzhi Wang

    Abstract: This article introduces a general processing framework to effectively utilize waveform data stored on modern cloud platforms. The focus is hybrid processing schemes where a local system drives processing. We show that downloading files and doing all processing locally is problematic even when the local system is a high-performance compute cluster. Benchmark tests with parallel processing show that…

    Submitted 12 April, 2025; originally announced April 2025.

  43. arXiv:2504.08714  [pdf, other]

    cs.CV cs.CL cs.LG

    Generating Fine Details of Entity Interactions

    Authors: Xinyi Gu, Jiayuan Mao

    Abstract: Images not only depict objects but also encapsulate rich interactions between them. However, generating faithful and high-fidelity images involving multiple entities interacting with each other is a long-standing challenge. While pre-trained text-to-image models are trained on large-scale datasets to follow diverse text instructions, they struggle to generate accurate interactions, likely due to…

    Submitted 11 April, 2025; originally announced April 2025.

    Comments: Project Page: https://concepts-ai.com/p/detailscribe/

  44. arXiv:2504.08240  [pdf, other]

    cs.RO eess.SP

    InSPE: Rapid Evaluation of Heterogeneous Multi-Modal Infrastructure Sensor Placement

    Authors: Zhaoliang Zheng, Yun Zhang, Zongling Meng, Johnson Liu, Xin Xia, Jiaqi Ma

    Abstract: Infrastructure sensing is vital for traffic monitoring at safety hotspots (e.g., intersections) and serves as the backbone of cooperative perception in autonomous driving. While vehicle sensing has been extensively studied, infrastructure sensing has received little attention, especially given the unique challenges of diverse intersection geometries, complex occlusions, varying traffic conditions,…

    Submitted 10 April, 2025; originally announced April 2025.

  45. arXiv:2504.07981  [pdf, other]

    cs.CV cs.HC cs.MM

    ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use

    Authors: Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, Tat-Seng Chua

    Abstract: Recent advancements in Multi-modal Large Language Models (MLLMs) have led to significant progress in developing GUI agents for general tasks such as web browsing and mobile phone use. However, their application in professional domains remains under-explored. These specialized workflows introduce unique challenges for GUI perception models, including high-resolution displays, smaller target sizes,…

    Submitted 4 April, 2025; originally announced April 2025.

    Comments: 13 pages

    MSC Class: 68-11 68-04 ACM Class: I.2.7; I.2.10

  46. arXiv:2504.07898  [pdf, other]

    cs.IR cs.CL cs.LG

    How do Large Language Models Understand Relevance? A Mechanistic Interpretability Perspective

    Authors: Qi Liu, Jiaxin Mao, Ji-Rong Wen

    Abstract: Recent studies have shown that large language models (LLMs) can assess relevance and support information retrieval (IR) tasks such as document ranking and relevance judgment generation. However, the internal mechanisms by which off-the-shelf LLMs understand and operationalize relevance remain largely unexplored. In this paper, we systematically investigate how different LLM modules contribute to r…

    Submitted 10 April, 2025; originally announced April 2025.

  47. arXiv:2504.07570  [pdf, ps, other]

    cs.IR

    Exploring Human-Like Thinking in Search Simulations with Large Language Models

    Authors: Erhan Zhang, Xingzhu Wang, Peiyuan Gong, Zixuan Yang, Jiaxin Mao

    Abstract: Simulating user search behavior is a critical task in information retrieval, which can be employed for user behavior modeling, data augmentation, and system evaluation. Recent advancements in large language models (LLMs) have opened up new possibilities for generating human-like actions including querying, browsing, and clicking. In this work, we explore the integration of human-like thinking into…

    Submitted 10 April, 2025; originally announced April 2025.

  48. arXiv:2504.07481  [pdf]

    physics.ao-ph cs.LG

    A Mechanism-Learning Deeply Coupled Model for Remote Sensing Retrieval of Global Land Surface Temperature

    Authors: Tian Xie, Menghui Jiang, Huanfeng Shen, Huifang Li, Chao Zeng, Jun Ma, Guanhao Zhang, Liangpei Zhang

    Abstract: Land surface temperature (LST) retrieval from remote sensing data is pivotal for analyzing climate processes and surface energy budgets. However, LST retrieval is an ill-posed inverse problem, which becomes particularly severe when only a single band is available. In this paper, we propose a deeply coupled framework integrating mechanistic modeling and machine learning to enhance the accuracy and…

    Submitted 22 April, 2025; v1 submitted 10 April, 2025; originally announced April 2025.

  49. arXiv:2504.07439  [pdf, other]

    cs.IR cs.CL

    LLM4Ranking: An Easy-to-use Framework of Utilizing Large Language Models for Document Reranking

    Authors: Qi Liu, Haozhe Duan, Yiqun Chen, Quanfeng Lu, Weiwei Sun, Jiaxin Mao

    Abstract: Utilizing large language models (LLMs) for document reranking has been a popular and promising research direction in recent years, and many studies are dedicated to improving the performance and efficiency of using LLMs for reranking. Besides, it can also be applied in many real-world applications, such as search engines or retrieval-augmented generation. In response to the growing demand for research…

    Submitted 10 April, 2025; originally announced April 2025.

  50. arXiv:2504.07375  [pdf, other]

    cs.CV

    Novel Diffusion Models for Multimodal 3D Hand Trajectory Prediction

    Authors: Junyi Ma, Wentao Bao, Jingyi Xu, Guanzhong Sun, Xieyuanli Chen, Hesheng Wang

    Abstract: Predicting hand motion is critical for understanding human intentions and bridging the action space between human movements and robot manipulations. Existing hand trajectory prediction (HTP) methods forecast the future hand waypoints in 3D space conditioned on past egocentric observations. However, such models are only designed to accommodate 2D egocentric video inputs. There is a lack of awarenes…

    Submitted 9 April, 2025; originally announced April 2025.