
Showing 1–50 of 812 results for author: Mao, W

Searching in archive cs.
  1. arXiv:2505.08194  [pdf, other]

    cs.RO

    CLTP: Contrastive Language-Tactile Pre-training for 3D Contact Geometry Understanding

    Authors: Wenxuan Ma, Xiaoge Cao, Yixiang Zhang, Chaofan Zhang, Shaobo Yang, Peng Hao, Bin Fang, Yinghao Cai, Shaowei Cui, Shuo Wang

    Abstract: Recent advancements in integrating tactile sensing with vision-language models (VLMs) have demonstrated remarkable potential for robotic multimodal perception. However, existing tactile descriptions remain limited to superficial attributes like texture, neglecting critical contact states essential for robotic manipulation. To bridge this gap, we propose CLTP, an intuitive and effective language ta…

    Submitted 12 May, 2025; originally announced May 2025.

    Comments: 16 pages

  2. arXiv:2505.07608  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining

    Authors: Xiaomi LLM-Core Team: Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, et al. (40 additional authors not shown)

    Abstract: We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective…

    Submitted 12 May, 2025; originally announced May 2025.

  3. arXiv:2505.07286  [pdf, other]

    q-bio.BM cs.AI cs.LG

    Piloting Structure-Based Drug Design via Modality-Specific Optimal Schedule

    Authors: Keyue Qiu, Yuxuan Song, Zhehuan Fan, Peidong Liu, Zhe Zhang, Mingyue Zheng, Hao Zhou, Wei-Ying Ma

    Abstract: Structure-Based Drug Design (SBDD) is crucial for identifying bioactive molecules. Recent deep generative models are faced with challenges in geometric structure modeling. A major bottleneck lies in the twisted probability path of multi-modalities -- continuous 3D positions and discrete 2D topologies -- which jointly determine molecular geometries. By establishing the fact that noise schedules dec…

    Submitted 12 May, 2025; originally announced May 2025.

    Comments: Accepted to ICML 2025

  4. arXiv:2505.07096  [pdf, ps, other]

    cs.RO cs.AI cs.LG

    X-Sim: Cross-Embodiment Learning via Real-to-Sim-to-Real

    Authors: Prithwish Dan, Kushal Kedia, Angela Chao, Edward Weiyi Duan, Maximus Adrian Pace, Wei-Chiu Ma, Sanjiban Choudhury

    Abstract: Human videos offer a scalable way to train robot manipulation policies, but lack the action labels needed by standard imitation learning algorithms. Existing cross-embodiment approaches try to map human motion to robot actions, but often fail when the embodiments differ significantly. We propose X-Sim, a real-to-sim-to-real framework that uses object motion as a dense and transferable signal for l…

    Submitted 14 May, 2025; v1 submitted 11 May, 2025; originally announced May 2025.

  5. arXiv:2505.05240  [pdf, other]

    cs.CV

    PADriver: Towards Personalized Autonomous Driving

    Authors: Genghua Kou, Fan Jia, Weixin Mao, Yingfei Liu, Yucheng Zhao, Ziheng Zhang, Osamu Yoshie, Tiancai Wang, Ying Li, Xiangyu Zhang

    Abstract: In this paper, we propose PADriver, a novel closed-loop framework for personalized autonomous driving (PAD). Built upon Multi-modal Large Language Model (MLLM), PADriver takes streaming frames and personalized textual prompts as inputs. It autoregressively performs scene understanding, danger level estimation and action decision. The predicted danger level reflects the risk of the potential action…

    Submitted 8 May, 2025; originally announced May 2025.

  6. arXiv:2505.04519  [pdf, other]

    cs.CL

    Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs

    Authors: Yehui Tang, Yichun Yin, Yaoyuan Wang, Hang Zhou, Yu Pan, Wei Guo, Ziyang Zhang, Miao Rang, Fangcheng Liu, Naifu Zhang, Binghan Li, Yonghan Dong, Xiaojun Meng, Yasheng Wang, Dong Li, Yin Li, Dandan Tu, Can Chen, Youliang Yan, Fisher Yu, Ruiming Tang, Yunhe Wang, Botian Huang, Bo Wang, Boxiao Liu, et al. (49 additional authors not shown)

    Abstract: Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models. However, the massive model scale poses significant challenges for the underlying software and hardware systems. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing r…

    Submitted 7 May, 2025; originally announced May 2025.

  7. arXiv:2505.04481  [pdf, other]

    cs.CV

    CAD-Llama: Leveraging Large Language Models for Computer-Aided Design Parametric 3D Model Generation

    Authors: Jiahao Li, Weijian Ma, Xueyang Li, Yunzhong Lou, Guichun Zhou, Xiangdong Zhou

    Abstract: Recently, Large Language Models (LLMs) have achieved significant success, prompting increased interest in expanding their generative capabilities beyond general text into domain-specific areas. This study investigates the generation of parametric sequences for computer-aided design (CAD) models using LLMs. This endeavor represents an initial step towards creating parametric 3D shapes with LLMs, as…

    Submitted 7 May, 2025; originally announced May 2025.

  8. arXiv:2505.03175  [pdf, other]

    cs.IT

    Multiuser Communications Aided by Cross-Linked Movable Antenna Array: Architecture and Optimization

    Authors: Lipeng Zhu, He Sun, Wenyan Ma, Zhenyu Xiao, Rui Zhang

    Abstract: Movable antenna (MA) has been regarded as a promising technology to enhance wireless communication performance by enabling flexible antenna movement. However, the hardware cost of conventional MA systems scales with the number of movable elements due to the need for independently controllable driving components. To reduce hardware cost, we propose in this paper a novel architecture named cross-lin…

    Submitted 6 May, 2025; originally announced May 2025.

  9. arXiv:2505.02156  [pdf, other]

    cs.CL cs.AI cs.LG

    Think on your Feet: Adaptive Thinking via Reinforcement Learning for Social Agents

    Authors: Minzheng Wang, Yongbin Li, Haobo Wang, Xinghua Zhang, Nan Xu, Bingli Wu, Fei Huang, Haiyang Yu, Wenji Mao

    Abstract: Effective social intelligence simulation requires language agents to dynamically adjust reasoning depth, a capability notably absent in current approaches. Existing methods either lack this kind of reasoning capability or enforce uniform long chain-of-thought reasoning across all scenarios, resulting in excessive token usage and inappropriate social simulation. In this paper, we propose…

    Submitted 6 May, 2025; v1 submitted 4 May, 2025; originally announced May 2025.

    Comments: Work in Progress. The code and data are available, see https://github.com/MozerWang/AMPO

  10. arXiv:2505.01225  [pdf, other]

    cs.CV

    Core-Set Selection for Data-efficient Land Cover Segmentation

    Authors: Keiller Nogueira, Akram Zaytar, Wanli Ma, Ribana Roscher, Ronny Hänsch, Caleb Robinson, Anthony Ortiz, Simone Nsutezo, Rahul Dodhia, Juan M. Lavista Ferres, Oktay Karakuş, Paul L. Rosin

    Abstract: The increasing accessibility of remotely sensed data and the potential of such data to inform large-scale decision-making has driven the development of deep learning models for many Earth Observation tasks. Traditionally, such models must be trained on large datasets. However, the common assumption that broadly larger datasets lead to better outcomes tends to overlook the complexities of the data…

    Submitted 2 May, 2025; originally announced May 2025.

  11. arXiv:2505.00788  [pdf, other]

    cs.CV

    SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models

    Authors: Wufei Ma, Luoxin Ye, Nessa McWeeney, Celso M de Melo, Alan Yuille, Jieneng Chen

    Abstract: Humans naturally understand 3D spatial relationships, enabling complex reasoning like predicting collisions of vehicles from different directions. Current large multimodal models (LMMs), however, lack this capability for 3D spatial reasoning. This limitation stems from the scarcity of 3D training data and the bias in current model designs toward 2D data. In this paper, we systematically study th…

    Submitted 1 May, 2025; originally announced May 2025.

    Comments: CVPR 2025 highlight, camera ready version

  12. arXiv:2504.20024  [pdf, other]

    cs.CV

    SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning

    Authors: Wufei Ma, Yu-Cheng Chou, Qihao Liu, Xingrui Wang, Celso de Melo, Jieneng Chen, Jianwen Xie, Alan Yuille

    Abstract: Recent studies in 3D spatial reasoning explore data-driven approaches and achieve enhanced spatial reasoning performance with reinforcement learning (RL). However, these methods typically perform spatial reasoning in an implicit manner, and it remains underexplored whether the acquired 3D knowledge generalizes to unseen question types at any stage of the training. In this work we introduce Spatial…

    Submitted 28 April, 2025; originally announced April 2025.

    Comments: Project page: https://spatial-reasoner.github.io

  13. arXiv:2504.19819  [pdf, other]

    cs.CV

    Joint Optimization of Neural Radiance Fields and Continuous Camera Motion from a Monocular Video

    Authors: Hoang Chuong Nguyen, Wei Mao, Jose M. Alvarez, Miaomiao Liu

    Abstract: Neural Radiance Fields (NeRF) has demonstrated its superior capability to represent 3D geometry but requires accurately precomputed camera poses during training. To mitigate this requirement, existing methods jointly optimize camera poses and NeRF, often relying on good pose initialisation or depth priors. However, these approaches struggle in challenging scenarios, such as large rotations, as they…

    Submitted 28 April, 2025; originally announced April 2025.

  14. arXiv:2504.18509  [pdf, other]

    cs.CV

    Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation

    Authors: Shivam Duggal, Yushi Hu, Oscar Michel, Aniruddha Kembhavi, William T. Freeman, Noah A. Smith, Ranjay Krishna, Antonio Torralba, Ali Farhadi, Wei-Chiu Ma

    Abstract: Despite the unprecedented progress in the field of 3D generation, current systems still often fail to produce high-quality 3D assets that are visually appealing and geometrically and semantically consistent across multiple viewpoints. To effectively assess the quality of the generated 3D data, there is a need for a reliable 3D evaluation tool. Unfortunately, existing 3D evaluation metrics often ov…

    Submitted 25 April, 2025; originally announced April 2025.

    Comments: CVPR 2025. Project page and codes: https://eval3d.github.io/

  15. arXiv:2504.18186  [pdf, other]

    cs.RO

    Sampling-Based Grasp and Collision Prediction for Assisted Teleoperation

    Authors: Simon Manschitz, Berk Gueler, Wei Ma, Dirk Ruiken

    Abstract: Shared autonomy allows for combining the global planning capabilities of a human operator with the strengths of a robot such as repeatability and accurate control. In a real-time teleoperation setting, one possibility for shared autonomy is to let the human operator decide on the rough movement and to let the robot do fine adjustments, e.g., when the view of the operator is occluded. We present a…

    Submitted 25 April, 2025; originally announced April 2025.

  16. arXiv:2504.16665  [pdf, other]

    cs.CV

    A Diff-Attention Aware State Space Fusion Model for Remote Sensing Classification

    Authors: Wenping Ma, Boyou Xue, Mengru Ma, Chuang Chen, Hekai Zhang, Hao Zhu

    Abstract: Multispectral (MS) and panchromatic (PAN) images describe the same land surface, so these images not only have their own advantages, but also have a lot of similar information. To separate this similar information from their respective advantages and reduce the feature redundancy in the fusion stage, this paper introduces a diff-attention aware state space fusion model (DAS2F-Model) for mult…

    Submitted 23 April, 2025; originally announced April 2025.

    Comments: 12 pages, 9 figures

  17. arXiv:2504.15278  [pdf, other]

    cs.CV cs.RO

    DRAWER: Digital Reconstruction and Articulation With Environment Realism

    Authors: Hongchi Xia, Entong Su, Marius Memmel, Arhan Jain, Raymond Yu, Numfor Mbiziwo-Tiapo, Ali Farhadi, Abhishek Gupta, Shenlong Wang, Wei-Chiu Ma

    Abstract: Creating virtual digital replicas from real-world data unlocks significant potential across domains like gaming and robotics. In this paper, we present DRAWER, a novel framework that converts a video of a static indoor scene into a photorealistic and interactive digital environment. Our approach centers on two main contributions: (i) a reconstruction module based on a dual scene representation tha…

    Submitted 22 April, 2025; v1 submitted 21 April, 2025; originally announced April 2025.

    Comments: Project page: https://drawer-art.github.io/

  18. arXiv:2504.13914  [pdf, other]

    cs.CL

    Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning

    Authors: ByteDance Seed: Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqi Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, Yufeng Yuan, Yu Yue, Lin Yan, Qiying Yu, Xiaochen Zuo, Chi Zhang, Ruofei Zhu, Zhecheng An, Zhihao Bai, Yu Bao, Xingyan Bin, Jiangjie Chen, Feng Chen, Hongmin Chen, et al. (249 additional authors not shown)

    Abstract: We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For in…

    Submitted 29 April, 2025; v1 submitted 10 April, 2025; originally announced April 2025.

  19. arXiv:2504.13782  [pdf, other]

    quant-ph cs.DC

    Robust Decentralized Quantum Kernel Learning for Noisy and Adversarial Environment

    Authors: Wenxuan Ma, Kuan-Cheng Chen, Shang Yu, Mengxiang Liu, Ruilong Deng

    Abstract: This paper proposes a general decentralized framework for quantum kernel learning (QKL). It has robustness against quantum noise and can also be designed to defend against adversarial information attacks, forming a robust approach named RDQKL. We analyze the impact of noise on QKL and study the robustness of decentralized QKL to the noise. By integrating robust decentralized optimization techniques, our me…

    Submitted 18 April, 2025; originally announced April 2025.

  20. arXiv:2504.12104  [pdf, other]

    cs.CV

    Logits DeConfusion with CLIP for Few-Shot Learning

    Authors: Shuo Li, Fang Liu, Zehua Hao, Xinyi Wang, Lingling Li, Xu Liu, Puhua Chen, Wenping Ma

    Abstract: With its powerful visual-language alignment capability, CLIP performs well in zero-shot and few-shot learning tasks. However, we found in experiments that CLIP's logits suffer from serious inter-class confusion problems in downstream tasks, and the ambiguity between categories seriously affects the accuracy. To address this challenge, we propose a novel method called Logits DeConfusion, which effe…

    Submitted 16 April, 2025; originally announced April 2025.

    Comments: CVPR 2025

  21. arXiv:2504.10254  [pdf, other]

    cs.CV cs.AI

    MASSeg: 2nd Technical Report for 4th PVUW MOSE Track

    Authors: Xuqiang Cao, Linnan Zhao, Jiaxuan Zhao, Fang Liu, Puhua Chen, Wenping Ma

    Abstract: Complex video object segmentation continues to face significant challenges in small object recognition, occlusion handling, and dynamic scene modeling. This report presents our solution, which ranked second in the MOSE track of CVPR 2025 PVUW Challenge. Based on an existing segmentation framework, we propose an improved model named MASSeg for complex video object segmentation, and construct an enh…

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: 5 pages, 4 figures, Technical report on Complex Video Object Segmentation

  22. arXiv:2504.09858  [pdf, other]

    cs.AI cs.CL

    Reasoning Models Can Be Effective Without Thinking

    Authors: Wenjie Ma, Jingxuan He, Charlie Snell, Tyler Griggs, Sewon Min, Matei Zaharia

    Abstract: Recent LLMs have significantly improved reasoning capabilities, primarily by including an explicit, lengthy Thinking process as part of generation. In this paper, we question whether this explicit thinking is necessary. Using the state-of-the-art DeepSeek-R1-Distill-Qwen, we find that bypassing the thinking process via simple prompting, denoted as NoThinking, can be surprisingly effective. When co…

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: 33 pages, 7 main figures, 2 tables

  23. arXiv:2504.07998  [pdf, other]

    cs.GR cs.AI cs.AR cs.CV

    CDM-QTA: Quantized Training Acceleration for Efficient LoRA Fine-Tuning of Diffusion Model

    Authors: Jinming Lu, Minghao She, Wendong Mao, Zhongfeng Wang

    Abstract: Fine-tuning large diffusion models for custom applications demands substantial power and time, which poses significant challenges for efficient implementation on mobile devices. In this paper, we develop a novel training accelerator specifically for Low-Rank Adaptation (LoRA) of diffusion models, aiming to streamline the process and reduce computational complexity. By leveraging a fully quantized…

    Submitted 8 April, 2025; originally announced April 2025.

    Comments: ISCAS 2025

  24. arXiv:2504.07940  [pdf, other]

    cs.CV

    Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos

    Authors: Rundong Luo, Matthew Wallingford, Ali Farhadi, Noah Snavely, Wei-Chiu Ma

    Abstract: 360° videos have emerged as a promising medium to represent our dynamic visual world. Compared to the "tunnel vision" of standard cameras, their borderless field of view offers a more complete perspective of our surroundings. While existing video models excel at producing standard videos, their ability to generate full panoramic videos remains elusive. In this paper, we investigate the task of vid…

    Submitted 17 April, 2025; v1 submitted 10 April, 2025; originally announced April 2025.

    Comments: Project page: https://red-fairy.github.io/argus/

  25. arXiv:2504.04237  [pdf, other]

    cs.IR

    Short Video Segment-level User Dynamic Interests Modeling in Personalized Recommendation

    Authors: Zhiyu He, Zhixin Ling, Jiayu Li, Zhiqiang Guo, Weizhi Ma, Xinchen Luo, Min Zhang, Guorui Zhou

    Abstract: The rapid growth of short videos has necessitated effective recommender systems to match users with content tailored to their evolving preferences. Current video recommendation models primarily treat each video as a whole, overlooking the dynamic nature of user preferences with specific video segments. In contrast, our research focuses on segment-level user interest modeling, which is crucial for…

    Submitted 5 May, 2025; v1 submitted 5 April, 2025; originally announced April 2025.

    Comments: This paper has been accepted by SIGIR 2025

  26. arXiv:2504.03165  [pdf, other]

    cs.CL

    Efficient Dynamic Clustering-Based Document Compression for Retrieval-Augmented-Generation

    Authors: Weitao Li, Kaiming Liu, Xiangyu Zhang, Xuanyu Lei, Weizhi Ma, Yang Liu

    Abstract: Retrieval-Augmented Generation (RAG) has emerged as a widely adopted approach for knowledge integration during large language model (LLM) inference in recent years. However, current RAG implementations face challenges in effectively addressing noise, repetition and redundancy in retrieved content, primarily due to their limited ability to exploit fine-grained inter-document relationships. To addre…

    Submitted 4 April, 2025; originally announced April 2025.

  27. arXiv:2503.22164  [pdf, other]

    q-bio.BM cs.AI

    PharmAgents: Building a Virtual Pharma with Large Language Model Agents

    Authors: Bowen Gao, Yanwen Huang, Yiqiao Liu, Wenxuan Xie, Wei-Ying Ma, Ya-Qin Zhang, Yanyan Lan

    Abstract: The discovery of novel small molecule drugs remains a critical scientific challenge with far-reaching implications for treating diseases and advancing human health. Traditional drug development--especially for small molecule therapeutics--is a highly complex, resource-intensive, and time-consuming process that requires multidisciplinary collaboration. Recent breakthroughs in artificial intelligenc…

    Submitted 31 March, 2025; v1 submitted 28 March, 2025; originally announced March 2025.

  28. arXiv:2503.21411  [pdf, other]

    cs.AI

    Exploring the Roles of Large Language Models in Reshaping Transportation Systems: A Survey, Framework, and Roadmap

    Authors: Tong Nie, Jian Sun, Wei Ma

    Abstract: Modern transportation systems face pressing challenges due to increasing demand, dynamic environments, and heterogeneous information integration. The rapid evolution of Large Language Models (LLMs) offers transformative potential to address these challenges. Extensive knowledge and high-level capabilities derived from pretraining evolve the default role of LLMs as text generators to become versati…

    Submitted 27 March, 2025; originally announced March 2025.

  29. arXiv:2503.21323  [pdf]

    cs.CV cs.LG

    DuckSegmentation: A segmentation model based on the AnYue Hemp Duck Dataset

    Authors: Ling Feng, Tianyu Xie, Wei Ma, Ruijie Fu, Yingxiao Zhang, Jun Li, Bei Zhou

    Abstract: The modernization of smart farming is a way to improve agricultural production efficiency, and improve the agricultural production environment. Although many large models have achieved high accuracy in the task of object recognition and segmentation, they cannot really be put into use in the farming industry due to their own poor interpretability and limitations in computational volume. In this pa…

    Submitted 27 March, 2025; originally announced March 2025.

  30. arXiv:2503.20220  [pdf, other]

    cs.CV

    DINeMo: Learning Neural Mesh Models with no 3D Annotations

    Authors: Weijie Guo, Guofeng Zhang, Wufei Ma, Alan Yuille

    Abstract: Category-level 3D/6D pose estimation is a crucial step towards comprehensive 3D scene understanding, which would enable a broad range of applications in robotics and embodied AI. Recent works explored neural mesh models that approach a range of 2D and 3D tasks from an analysis-by-synthesis perspective. Despite the largely enhanced robustness to partial occlusion and domain shifts, these methods de…

    Submitted 26 March, 2025; originally announced March 2025.

    Comments: Technical report

  31. arXiv:2503.19325  [pdf, other]

    cs.CV

    Long-Context Autoregressive Video Modeling with Next-Frame Prediction

    Authors: Yuchao Gu, Weijia Mao, Mike Zheng Shou

    Abstract: Long-context autoregressive modeling has significantly advanced language generation, but video generation still struggles to fully utilize extended temporal contexts. To investigate long-context video modeling, we introduce Frame AutoRegressive (FAR), a strong baseline for video autoregressive modeling. Just as language models learn causal dependencies between tokens (i.e., Token AR), FAR models t…

    Submitted 17 April, 2025; v1 submitted 24 March, 2025; originally announced March 2025.

    Comments: Project page at https://farlongctx.github.io/

  32. arXiv:2503.17983  [pdf, other]

    cs.CV

    Histomorphology-driven multi-instance learning for breast cancer WSI classification

    Authors: Baizhi Wang, Rui Yan, Wenxin Ma, Xu Zhang, Yuhao Wang, Xiaolong Li, Yunjie Gu, Zihang Jiang, S. Kevin Zhou

    Abstract: Histomorphology is crucial in breast cancer diagnosis. However, existing whole slide image (WSI) classification methods struggle to effectively incorporate histomorphology information, limiting their ability to capture key and fine-grained pathological features. To address this limitation, we propose a novel framework that explicitly incorporates histomorphology (tumor cellularity, cellular morpho…

    Submitted 23 March, 2025; originally announced March 2025.

    Comments: 10 pages, 5 figures

  33. arXiv:2503.16709  [pdf, other]

    cs.CV cs.AI

    QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge

    Authors: Xuan Shen, Weize Ma, Jing Liu, Changdi Yang, Rui Ding, Quanyi Wang, Henghui Ding, Wei Niu, Yanzhi Wang, Pu Zhao, Jun Lin, Jiuxiang Gu

    Abstract: Monocular Depth Estimation (MDE) has emerged as a pivotal task in computer vision, supporting numerous real-world applications. However, deploying accurate depth estimation models on resource-limited edge devices, especially Application-Specific Integrated Circuits (ASICs), is challenging due to the high computational and memory demands. Recent advancements in foundational depth estimation deliver…

    Submitted 20 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR 2025

  34. arXiv:2503.14476  [pdf, other]

    cs.LG cs.CL

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Authors: Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, et al. (10 additional authors not shown)

    Abstract: Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in OpenAI o1 blog and DeepSeek R1 technical report), thus the community still struggles to reproduce their RL training results. We propose the Decouple…

    Submitted 18 March, 2025; originally announced March 2025.

    Comments: Project Page: https://dapo-sia.github.io/

  35. arXiv:2503.12374  [pdf, other]

    cs.SE cs.AI

    Unveiling Pitfalls: Understanding Why AI-driven Code Agents Fail at GitHub Issue Resolution

    Authors: Zhi Chen, Wei Ma, Lingxiao Jiang

    Abstract: AI-driven software development has rapidly advanced with the emergence of software development agents that leverage large language models (LLMs) to tackle complex, repository-level software engineering tasks. These agents go beyond just generation of final code; they engage in multi-step reasoning, utilize various tools for code modification and debugging, and interact with execution environments…

    Submitted 19 March, 2025; v1 submitted 16 March, 2025; originally announced March 2025.

  36. arXiv:2503.11579  [pdf, other]

    cs.CV

    Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers

    Authors: Weiming Ren, Wentao Ma, Huan Yang, Cong Wei, Ge Zhang, Wenhu Chen

    Abstract: State-of-the-art transformer-based large multimodal models (LMMs) struggle to handle hour-long video inputs due to the quadratic complexity of the causal self-attention operations, leading to high computational costs during training and inference. Existing token compression-based methods reduce the number of video tokens but often incur information loss and remain inefficient for extremely long se…

    Submitted 14 March, 2025; originally announced March 2025.

    Comments: Project Page: https://tiger-ai-lab.github.io/Vamba/

  37. arXiv:2503.08638  [pdf, other]

    eess.AS cs.AI cs.MM cs.SD

    YuE: Scaling Open Foundation Models for Long-Form Music Generation

    Authors: Ruibin Yuan, Hanfeng Lin, Shuyue Guo, Ge Zhang, Jiahao Pan, Yongyi Zang, Haohe Liu, Yiming Liang, Wenye Ma, Xingjian Du, Xinrun Du, Zhen Ye, Tianyu Zheng, Yinghao Ma, Minghao Liu, Zeyue Tian, Ziya Zhou, Liumeng Xue, Xingwei Qu, Yizhi Li, Shangda Wu, Tianhao Shen, Ziyang Ma, Jun Zhan, Chunhui Wang, et al. (32 additional authors not shown)

    Abstract: We tackle the task of long-form music generation--particularly the challenging lyrics-to-song problem--by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate…

    Submitted 11 March, 2025; originally announced March 2025.

    Comments: https://github.com/multimodal-art-projection/YuE

  38. arXiv:2503.06976  [pdf, other]

    cs.CV

    Task-Specific Knowledge Distillation from the Vision Foundation Model for Enhanced Medical Image Segmentation

    Authors: Pengchen Liang, Haishan Huang, Bin Pu, Jianguo Chen, Xiang Hua, Jing Zhang, Weibo Ma, Zhuangzhuang Chen, Yiwei Li, Qing Chang

    Abstract: Large-scale pre-trained models, such as Vision Foundation Models (VFMs), have demonstrated impressive performance across various downstream tasks by transferring generalized knowledge, especially when target data is limited. However, their high computational cost and the domain gap between natural and medical images limit their practical application in medical segmentation tasks. Motivated by this…

    Submitted 10 March, 2025; originally announced March 2025.

    Comments: 29 pages, 10 figures, 16 tables

  39. arXiv:2503.06955  [pdf, other]

    cs.CV

    Motion Anything: Any to Motion Generation

    Authors: Zeyu Zhang, Yiran Wang, Wei Mao, Danning Li, Rui Zhao, Biao Wu, Zirui Song, Bohan Zhuang, Ian Reid, Richard Hartley

    Abstract: Conditional motion generation has been extensively studied in computer vision, yet two critical challenges remain. First, while masked autoregressive methods have recently outperformed diffusion-based approaches, existing masking models lack a mechanism to prioritize dynamic frames and body parts based on given conditions. Second, existing methods for different conditioning modalities often fail t…

    Submitted 11 March, 2025; v1 submitted 10 March, 2025; originally announced March 2025.

  40. arXiv:2503.06661  [pdf, other]

    cs.CV cs.AI

    AA-CLIP: Enhancing Zero-shot Anomaly Detection via Anomaly-Aware CLIP

    Authors: Wenxin Ma, Xu Zhang, Qingsong Yao, Fenghe Tang, Chenxu Wu, Yingtai Li, Rui Yan, Zihang Jiang, S. Kevin Zhou

    Abstract: Anomaly detection (AD) identifies outliers for applications like defect and lesion detection. While CLIP shows promise for zero-shot AD tasks due to its strong generalization capabilities, its inherent Anomaly-Unawareness leads to limited discrimination between normal and abnormal features. To address this problem, we propose Anomaly-Aware CLIP (AA-CLIP), which enhances CLIP's anomaly discriminati…

    Submitted 9 March, 2025; originally announced March 2025.

    Comments: 8 pages, 7 figures

    Journal ref: CVPR 2025

  41. arXiv:2503.06218  [pdf, other]

    cs.CL

    KnowLogic: A Benchmark for Commonsense Reasoning via Knowledge-Driven Data Synthesis

    Authors: Weidong Zhan, Yue Wang, Nan Hu, Liming Xiao, Jingyuan Ma, Yuhang Qin, Zheng Li, Yixin Yang, Sirui Deng, Jinkun Ding, Wenhan Ma, Rui Li, Weilin Luo, Qun Liu, Zhifang Sui

    Abstract: Current evaluations of commonsense reasoning in LLMs are hindered by the scarcity of natural language corpora with structured annotations for reasoning tasks. To address this, we introduce KnowLogic, a benchmark generated through a knowledge-driven synthetic data strategy. KnowLogic integrates diverse commonsense knowledge, plausible scenarios, and various types of logical reasoning. One of the ke…

    Submitted 8 March, 2025; originally announced March 2025.

  42. arXiv:2503.06075  [pdf, other]

    cs.RO

    FSDP: Fast and Safe Data-Driven Overtaking Trajectory Planning for Head-to-Head Autonomous Racing Competitions

    Authors: Cheng Hu, Jihao Huang, Wule Mao, Yonghao Fu, Xuemin Chi, Haotong Qin, Nicolas Baumann, Zhitao Liu, Michele Magno, Lei Xie

    Abstract: Generating overtaking trajectories in autonomous racing is a challenging task, as the trajectory must satisfy the vehicle's dynamics and ensure safety and real-time performance running on resource-constrained hardware. This work proposes the Fast and Safe Data-Driven Planner to address this challenge. Sparse Gaussian predictions are introduced to improve both the computational efficiency and accur…

    Submitted 8 March, 2025; originally announced March 2025.

    Comments: submitted to IROS 2025

  43. arXiv:2503.05383  [pdf, other]

    cs.AI cs.MA

    AVA: Attentive VLM Agent for Mastering StarCraft II

    Authors: Weiyu Ma, Yuqian Fu, Zecheng Zhang, Bernard Ghanem, Guohao Li

    Abstract: We introduce Attentive VLM Agent (AVA), a multimodal StarCraft II agent that aligns artificial agent perception with the human gameplay experience. Traditional frameworks such as SMAC rely on abstract state representations that diverge significantly from human perception, limiting the ecological validity of agent behavior. Our agent addresses this limitation by incorporating RGB visual inputs and…

    Submitted 9 May, 2025; v1 submitted 7 March, 2025; originally announced March 2025.

    Comments: Under Review

  44. arXiv:2503.04952  [pdf, other]

    cs.AI cs.RO

    INTENT: Trajectory Prediction Framework with Intention-Guided Contrastive Clustering

    Authors: Yihong Tang, Wei Ma

    Abstract: Accurate trajectory prediction of road agents (e.g., pedestrians, vehicles) is an essential prerequisite for various intelligent systems applications, such as autonomous driving and robotic navigation. Recent research highlights the importance of environmental contexts (e.g., maps) and the "multi-modality" of trajectories, leading to increasingly complex model structures. However, real-world deplo…

    Submitted 6 March, 2025; originally announced March 2025.

  45. arXiv:2503.03651  [pdf, other]

    cs.CV

    DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles

    Authors: Rui Zhao, Weijia Mao, Mike Zheng Shou

    Abstract: Adapting generative models to specific domains presents an effective solution for satisfying specialized requirements. However, adapting to some complex domains remains challenging, especially when these domains require substantial paired data to capture the targeted distributions. Since unpaired data from a single modality, such as vision or language, is more readily available, we utilize the bid…

    Submitted 5 March, 2025; originally announced March 2025.

    Comments: CVPR 2025

  46. arXiv:2503.02918  [pdf, other]

    cs.LG cs.AI

    Straight-Line Diffusion Model for Efficient 3D Molecular Generation

    Authors: Yuyan Ni, Shikun Feng, Haohan Chi, Bowen Zheng, Huan-ang Gao, Wei-Ying Ma, Zhi-Ming Ma, Yanyan Lan

    Abstract: Diffusion-based models have shown great promise in molecular generation but often require a large number of sampling steps to generate valid samples. In this paper, we introduce a novel Straight-Line Diffusion Model (SLDM) to tackle this problem, by formulating the diffusion process to follow a linear trajectory. The proposed process aligns well with the noise sensitivity characteristic of molecul…

    Submitted 4 March, 2025; originally announced March 2025.

  47. arXiv:2503.01424  [pdf, other]

    cs.AI cs.CL

    From Hypothesis to Publication: A Comprehensive Survey of AI-Driven Research Support Systems

    Authors: Zekun Zhou, Xiaocheng Feng, Lei Huang, Xiachong Feng, Ziyun Song, Ruihan Chen, Liang Zhao, Weitao Ma, Yuxuan Gu, Baoxin Wang, Dayong Wu, Guoping Hu, Ting Liu, Bing Qin

    Abstract: Research is a fundamental process driving the advancement of human civilization, yet it demands substantial time and effort from researchers. In recent years, the rapid development of artificial intelligence (AI) technologies has inspired researchers to explore how AI can accelerate and enhance research. To monitor relevant advancements, this paper presents a systematic review of the progress in t…

    Submitted 3 March, 2025; originally announced March 2025.

  48. arXiv:2502.19108  [pdf, other]

    cs.IR cs.MM

    A 106K Multi-Topic Multilingual Conversational User Dataset with Emoticons

    Authors: Heng Er Metilda Chee, Jiayin Wang, Zhiqiang Guo, Weizhi Ma, Qinglang Guo, Min Zhang

    Abstract: Instant messaging has become a predominant form of communication, with texts and emoticons enabling users to express emotions and ideas efficiently. Emoticons, in particular, have gained significant traction as a medium for conveying sentiments and information, leading to the growing importance of emoticon retrieval and recommendation systems. However, one of the key challenges in this area has be…

    Submitted 26 February, 2025; originally announced February 2025.

  49. arXiv:2502.18538  [pdf, other]

    cs.LG cs.AI

    Revisiting Convolution Architecture in the Realm of DNA Foundation Models

    Authors: Yu Bo, Weian Mao, Yanjun Shao, Weiqiang Bai, Peng Ye, Xinzhu Ma, Junbo Zhao, Hao Chen, Chunhua Shen

    Abstract: In recent years, a variety of methods based on Transformer and state space model (SSM) architectures have been proposed, advancing foundational DNA language models. However, there is a lack of comparison between these recent approaches and the classical architecture convolutional networks (CNNs) on foundation model benchmarks. This raises the question: are CNNs truly being surpassed by these recen…

    Submitted 25 February, 2025; originally announced February 2025.

  50. arXiv:2502.17905  [pdf, other]

    cs.IT eess.SP

    A Tutorial on Movable Antennas for Wireless Networks

    Authors: Lipeng Zhu, Wenyan Ma, Weidong Mei, Yong Zeng, Qingqing Wu, Boyu Ning, Zhenyu Xiao, Xiaodan Shao, Jun Zhang, Rui Zhang

    Abstract: Movable antenna (MA) has been recognized as a promising technology to enhance the performance of wireless communication and sensing by enabling antenna movement. Such a significant paradigm shift from conventional fixed antennas (FAs) to MAs offers tremendous new opportunities towards realizing more versatile, adaptive and efficient next-generation wireless networks such as 6G. In this paper, we p…

    Submitted 25 February, 2025; originally announced February 2025.

    Comments: Accepted for publication in the IEEE Communications Surveys & Tutorials