Showing 1–50 of 308 results for author: Cao, M

  1. arXiv:2505.08129  [pdf, other]

    cs.LG cs.AI

    High-order Regularization for Machine Learning and Learning-based Control

    Authors: Xinghua Liu, Ming Cao

    Abstract: The paper proposes a novel regularization procedure for machine learning. The proposed high-order regularization (HR) provides new insight into regularization, which is widely used to train a neural network that can be utilized to approximate the action-value function in general reinforcement learning problems. The proposed HR method ensures the provable convergence of the approximation algorithm,…

    Submitted 12 May, 2025; originally announced May 2025.

  2. arXiv:2505.05467  [pdf, other]

    cs.CV cs.AI cs.CL

    StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant

    Authors: Haibo Wang, Bo Feng, Zhengfeng Lai, Mingze Xu, Shiyu Li, Weifeng Ge, Afshin Dehghan, Meng Cao, Ping Huang

    Abstract: We present StreamBridge, a simple yet effective framework that seamlessly transforms offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models into online scenarios: (1) limited capability for multi-turn real-time understanding, and (2) lack of proactive response mechanisms. Specifically, StreamBridge incorporates (1) a memory buffer comb…

    Submitted 8 May, 2025; originally announced May 2025.

  3. arXiv:2505.03673  [pdf, other]

    cs.RO

    RoboOS: A Hierarchical Embodied Framework for Cross-Embodiment and Multi-Agent Collaboration

    Authors: Huajie Tan, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Yaoxu Lyu, Mingyu Cao, Zhongyuan Wang, Shanghang Zhang

    Abstract: The dawn of embodied intelligence has ushered in an unprecedented imperative for resilient, cognition-enabled multi-agent collaboration across next-generation ecosystems, revolutionizing paradigms in autonomous manufacturing, adaptive service robotics, and cyber-physical production architectures. However, current robotic systems face significant limitations, such as limited cross-embodiment adapta…

    Submitted 6 May, 2025; originally announced May 2025.

    Comments: 22 pages, 10 figures

  4. arXiv:2504.20504  [pdf, other]

    eess.IV cs.LG physics.comp-ph

    Quality-factor inspired deep neural network solver for solving inverse scattering problems

    Authors: Yutong Du, Zicheng Liu, Miao Cao, Zupeng Liang, Yali Zong, Changyou Li

    Abstract: Deep neural networks have been applied to address electromagnetic inverse scattering problems (ISPs) and shown superior imaging performances, which can be affected by the training dataset, the network architecture and the applied loss function. Here, the quality of data samples is cared and valued by the defined quality factor. Based on the quality factor, the composition of the training dataset i…

    Submitted 29 April, 2025; originally announced April 2025.

  5. arXiv:2504.19314  [pdf, other]

    cs.CL

    BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

    Authors: Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, Yifan Shao, Qichen Ye, Dading Chong, Zhiling Jin, Chenxuan Xie, Meng Cao, Yuxin Gu, Sixin Hong, Jing Ren, Jian Chen, Chao Liu, Yining Hua

    Abstract: As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real-time has become a critical yardstick for measuring their reasoning and retrieval competence. Existing benchmarks such as BrowseComp concentrate on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems -- most notably Chinese.…

    Submitted 1 May, 2025; v1 submitted 27 April, 2025; originally announced April 2025.

    Comments: Under Review

  6. arXiv:2504.15415  [pdf, other]

    cs.CV cs.CL

    IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

    Authors: David Ma, Yuanxing Zhang, Jincheng Ren, Jarvis Guo, Yifan Yao, Zhenlin Wei, Zhenzhu Yang, Zhongyuan Peng, Boyu Feng, Jun Ma, Xiao Gu, Zhoufutu Wen, King Zhu, Yancheng He, Meng Cao, Shiwen Ni, Jiaheng Liu, Wenhao Huang, Ge Zhang, Xiaojie Jin

    Abstract: Existing evaluation frameworks for Multimodal Large Language Models (MLLMs) primarily focus on image reasoning or general video understanding tasks, largely overlooking the significant role of image context in video comprehension. To bridge this gap, we propose IV-Bench, the first comprehensive benchmark for evaluating Image-Grounded Video Perception and Reasoning. IV-Bench consists of 967 videos…

    Submitted 21 April, 2025; originally announced April 2025.

  7. arXiv:2504.12636  [pdf, other]

    cs.RO

    A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation

    Authors: Rongtao Xu, Jian Zhang, Minghao Guo, Youpeng Wen, Haoting Yang, Min Lin, Jianzheng Huang, Zhe Li, Kaidong Zhang, Liqiong Wang, Yuxuan Kuang, Meng Cao, Feng Zheng, Xiaodan Liang

    Abstract: Robotic manipulation faces critical challenges in understanding spatial affordances--the "where" and "how" of object interactions--essential for complex manipulation tasks like wiping a board or stacking objects. Existing methods, including modular-based and end-to-end approaches, often lack robust spatial reasoning capabilities. Unlike recent point-based and flow-based affordance methods that foc…

    Submitted 6 May, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

  8. arXiv:2504.10828  [pdf, other]

    cs.RO

    Following Is All You Need: Robot Crowd Navigation Using People As Planners

    Authors: Yuwen Liao, Xinhang Xu, Ruofei Bai, Yizhuo Yang, Muqing Cao, Shenghai Yuan, Lihua Xie

    Abstract: Navigating in crowded environments requires the robot to be equipped with high-level reasoning and planning techniques. Existing works focus on developing complex and heavyweight planners while ignoring the role of human intelligence. Since humans are highly capable agents who are also widely available in a crowd navigation setting, we propose an alternative scheme where the robot utilises people…

    Submitted 14 April, 2025; originally announced April 2025.

  9. arXiv:2503.23875  [pdf, other]

    cs.RO cs.AI cs.MA

    GenSwarm: Scalable Multi-Robot Code-Policy Generation and Deployment via Language Models

    Authors: Wenkang Ji, Huaben Chen, Mingyang Chen, Guobin Zhu, Lufeng Xu, Roderich Groß, Rui Zhou, Ming Cao, Shiyu Zhao

    Abstract: The development of control policies for multi-robot systems traditionally follows a complex and labor-intensive process, often lacking the flexibility to adapt to dynamic tasks. This has motivated research on methods to automatically create control policies. However, these methods require iterative processes of manually crafting and refining objective functions, thereby prolonging the development…

    Submitted 31 March, 2025; originally announced March 2025.

  10. arXiv:2503.23084  [pdf, other]

    cs.CL cs.AI

    The Reasoning-Memorization Interplay in Language Models Is Mediated by a Single Direction

    Authors: Yihuai Hong, Dian Zhou, Meng Cao, Lei Yu, Zhijing Jin

    Abstract: Large language models (LLMs) excel on a variety of reasoning benchmarks, but previous studies suggest they sometimes struggle to generalize to unseen questions, potentially due to over-reliance on memorized training examples. However, the precise conditions under which LLMs switch between reasoning and memorization during text generation remain unclear. In this work, we provide a mechanistic under…

    Submitted 29 March, 2025; originally announced March 2025.

  11. arXiv:2503.22272  [pdf, other]

    cs.RO

    Robust simultaneous UWB-anchor calibration and robot localization for emergency situations

    Authors: Xinghua Liu, Ming Cao

    Abstract: In this work, we propose a factor graph optimization (FGO) framework to simultaneously solve the calibration problem for Ultra-WideBand (UWB) anchors and the robot localization problem. Calibrating UWB anchors manually can be time-consuming and even impossible in emergencies or those situations without special calibration tools. Therefore, automatic estimation of the anchor positions becomes a nec…

    Submitted 28 March, 2025; originally announced March 2025.

  12. arXiv:2503.22171  [pdf, other]

    cs.CV

    An Empirical Study of Validating Synthetic Data for Text-Based Person Retrieval

    Authors: Min Cao, ZiYin Zeng, YuXin Lu, Mang Ye, Dong Yi, Jinqiao Wang

    Abstract: Data plays a pivotal role in Text-Based Person Retrieval (TBPR) research. Mainstream research paradigm necessitates real-world person images with manual textual annotations for training models, posing privacy-sensitive and labor-intensive issues. Several pioneering efforts explore synthetic data for TBPR but still rely on real data, keeping the aforementioned issues and also resulting in diversity…

    Submitted 28 March, 2025; originally announced March 2025.

    Comments: 20 pages, 13 figures

  13. arXiv:2503.18943  [pdf, other]

    cs.CV

    SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

    Authors: Mingze Xu, Mingfei Gao, Shiyu Li, Jiasen Lu, Zhe Gan, Zhengfeng Lai, Meng Cao, Kai Kang, Yinfei Yang, Afshin Dehghan

    Abstract: We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding. We incorporate the two-stream SlowFast mechanism into a streamlined training pipeline, and perform joint video-image training on a carefully curated data mixture of only publicly available datasets. Our primary focus is…

    Submitted 27 March, 2025; v1 submitted 24 March, 2025; originally announced March 2025.

    Comments: Technical report

  14. arXiv:2503.18923  [pdf, other]

    cs.CV

    Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models

    Authors: Meng Cao, Pengfei Hu, Yingyao Wang, Jihao Gu, Haoran Tang, Haoze Zhao, Jiahua Dong, Wangbo Yu, Ge Zhang, Ian Reid, Xiaodan Liang

    Abstract: Recent advancements in Large Video Language Models (LVLMs) have highlighted their potential for multi-modal understanding, yet evaluating their factual grounding in video contexts remains a critical unsolved challenge. To address this gap, we introduce Video SimpleQA, the first comprehensive benchmark tailored for factuality evaluation of LVLMs. Our work distinguishes from existing video benchmark…

    Submitted 24 March, 2025; originally announced March 2025.

    Comments: 24 pages

  15. arXiv:2503.16867  [pdf, other]

    cs.CV

    ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering

    Authors: Kaisi Guan, Zhengfeng Lai, Yuchong Sun, Peng Zhang, Wei Liu, Kieran Liu, Meng Cao, Ruihua Song

    Abstract: Precisely evaluating semantic alignment between text prompts and generated videos remains a challenge in Text-to-Video (T2V) Generation. Existing text-to-video alignment metrics like CLIPScore only generate coarse-grained scores without fine-grained alignment details, failing to align with human preference. To address this limitation, we propose ETVA, a novel Evaluation method of Text-to-Video Ali…

    Submitted 21 March, 2025; originally announced March 2025.

  16. arXiv:2503.07504  [pdf, other]

    cs.RO

    PIPE Planner: Pathwise Information Gain with Map Predictions for Indoor Robot Exploration

    Authors: Seungjae Baek, Brady Moon, Seungchan Kim, Muqing Cao, Cherie Ho, Sebastian Scherer, Jeong hwan Jeon

    Abstract: Autonomous exploration in unknown environments requires estimating the information gain of an action to guide planning decisions. While prior approaches often compute information gain at discrete waypoints, pathwise integration offers a more comprehensive estimation but is often computationally challenging or infeasible and prone to overestimation. In this work, we propose the Pathwise Information…

    Submitted 10 March, 2025; originally announced March 2025.

    Comments: 8 pages, 8 figures

  17. arXiv:2503.06446  [pdf, other]

    cs.CV

    M$^3$amba: CLIP-driven Mamba Model for Multi-modal Remote Sensing Classification

    Authors: Mingxiang Cao, Weiying Xie, Xin Zhang, Jiaqing Zhang, Kai Jiang, Jie Lei, Yunsong Li

    Abstract: Multi-modal fusion holds great promise for integrating information from different modalities. However, due to a lack of consideration for modal consistency, existing multi-modal fusion methods in the field of remote sensing still face challenges of incomplete semantic information and low computational efficiency in their fusion designs. Inspired by the observation that the visual language pre-trai…

    Submitted 9 March, 2025; originally announced March 2025.

  18. arXiv:2503.05639  [pdf, other]

    cs.CV cs.AI cs.MM

    VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control

    Authors: Yuxuan Bian, Zhaoyang Zhang, Xuan Ju, Mingdeng Cao, Liangbin Xie, Ying Shan, Qiang Xu

    Abstract: Video inpainting, which aims to restore corrupted video content, has experienced substantial progress. Despite these advances, existing methods, whether propagating unmasked region pixels through optical flow and receptive field priors, or extending image-inpainting models temporally, face challenges in generating fully masked objects or balancing the competing objectives of background context pre…

    Submitted 8 April, 2025; v1 submitted 7 March, 2025; originally announced March 2025.

    Comments: Project page available at https://yxbian23.github.io/project/video-painter

  19. arXiv:2503.01136  [pdf, other]

    cs.CV

    Prior-guided Hierarchical Harmonization Network for Efficient Image Dehazing

    Authors: Xiongfei Su, Siyuan Li, Yuning Cui, Miao Cao, Yulun Zhang, Zheng Chen, Zongliang Wu, Zedong Wang, Yuanlong Zhang, Xin Yuan

    Abstract: Image dehazing is a crucial task that involves the enhancement of degraded images to recover their sharpness and textures. While vision Transformers have exhibited impressive results in diverse dehazing tasks, their quadratic complexity and lack of dehazing priors pose significant drawbacks for real-world applications. In this paper, guided by triple priors, Bright Channel Prior (BCP), Dark Chan…

    Submitted 2 March, 2025; originally announced March 2025.

  20. arXiv:2502.18411  [pdf, other]

    cs.CV

    OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference

    Authors: Xiangyu Zhao, Shengyuan Ding, Zicheng Zhang, Haian Huang, Maosong Cao, Weiyun Wang, Jiaqi Wang, Xinyu Fang, Wenhai Wang, Guangtao Zhai, Haodong Duan, Hua Yang, Kai Chen

    Abstract: Recent advancements in open-source multi-modal large language models (MLLMs) have primarily focused on enhancing foundational capabilities, leaving a significant gap in human preference alignment. This paper introduces OmniAlign-V, a comprehensive dataset of 200K high-quality training samples featuring diverse images, complex questions, and varied response formats to improve MLLMs' alignment with…

    Submitted 28 February, 2025; v1 submitted 25 February, 2025; originally announced February 2025.

  21. arXiv:2502.16105  [pdf, other]

    cs.CV cs.AI

    NeurFlow: Interpreting Neural Networks through Neuron Groups and Functional Interactions

    Authors: Tue M. Cao, Nhat X. Hoang, Hieu H. Pham, Phi Le Nguyen, My T. Thai

    Abstract: Understanding the inner workings of neural networks is essential for enhancing model performance and interpretability. Current research predominantly focuses on examining the connection between individual neurons and the model's final predictions. Which suffers from challenges in interpreting the internal workings of the model, particularly when neurons encode multiple unrelated features. In this…

    Submitted 22 February, 2025; originally announced February 2025.

    Comments: The Thirteenth International Conference on Learning Representations

  22. arXiv:2502.15130  [pdf, other]

    cs.CV

    TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba

    Authors: Xiuwei Chen, Sihao Lin, Xiao Dong, Zisheng Chen, Meng Cao, Jianhua Han, Hang Xu, Xiaodan Liang

    Abstract: Transformers have been favored in both uni-modal and multi-modal foundation models for their flexible scalability in attention modules. Consequently, a number of pre-trained Transformer models, e.g., LLaVA, CLIP, and DEIT, are publicly available. Recent research has introduced subquadratic architectures like Mamba, which enables global awareness with linear complexity. Nevertheless, training speci…

    Submitted 20 February, 2025; originally announced February 2025.

  23. arXiv:2502.14739  [pdf, other]

    cs.CL

    SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

    Authors: M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, King Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixin Deng, Shawn Gavin, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, David Ma, Yuansheng Ni, Haoran Que, et al. (72 additional authors not shown)

    Abstract: Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields-particularly in light industry, agriculture, and service-orient…

    Submitted 28 March, 2025; v1 submitted 20 February, 2025; originally announced February 2025.

  24. arXiv:2502.13390  [pdf, other]

    eess.SP cs.IT cs.LG

    Deep-Unfolded Massive Grant-Free Transmission in Cell-Free Wireless Communication Systems

    Authors: Gangle Sun, Mengyao Cao, Wenjin Wang, Wei Xu, Christoph Studer

    Abstract: Grant-free transmission and cell-free communication are vital in improving coverage and quality-of-service for massive machine-type communication. This paper proposes a novel framework of joint active user detection, channel estimation, and data detection (JACD) for massive grant-free transmission in cell-free wireless communication systems. We formulate JACD as an optimization problem and solve i…

    Submitted 18 February, 2025; originally announced February 2025.

    Comments: To appear in the IEEE Transactions on Signal Processing

  25. arXiv:2502.11718  [pdf, other]

    cs.CL cs.CV

    ChineseSimpleVQA -- "See the World, Discover Knowledge": A Chinese Factuality Evaluation for Large Vision Language Models

    Authors: Jihao Gu, Yingyao Wang, Pi Bu, Chen Wang, Ziming Wang, Tengtao Song, Donglai Wei, Jiale Yuan, Yingxiu Zhao, Yancheng He, Shilong Li, Jiaheng Liu, Meng Cao, Jun Song, Yingshui Tan, Xiang Li, Wenbo Su, Zhicheng Zheng, Xiaoyong Zhu, Bo Zheng

    Abstract: The evaluation of factual accuracy in large vision language models (LVLMs) has lagged behind their rapid development, making it challenging to fully reflect these models' knowledge capacity and reliability. In this paper, we introduce the first factuality-based visual question-answering benchmark in Chinese, named ChineseSimpleVQA, aimed at assessing the visual factuality of LVLMs across 8 major t…

    Submitted 26 February, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

    Comments: 24 pages, 21 figures

  26. arXiv:2502.09325  [pdf, other]

    cs.CV

    A Benchmark for Crime Surveillance Video Analysis with Large Models

    Authors: Haoran Chen, Dong Yi, Moyan Cao, Chensen Huang, Guibo Zhu, Jinqiao Wang

    Abstract: Anomaly analysis in surveillance videos is a crucial topic in computer vision. In recent years, multimodal large language models (MLLMs) have outperformed task-specific models in various domains. Although MLLMs are particularly versatile, their abilities to understand anomalous concepts and details are insufficiently studied because of the outdated benchmarks of this field not providing MLLM-style…

    Submitted 13 February, 2025; originally announced February 2025.

  27. arXiv:2502.05792  [pdf, other]

    cs.RO

    AToM: Adaptive Theory-of-Mind-Based Human Motion Prediction in Long-Term Human-Robot Interactions

    Authors: Yuwen Liao, Muqing Cao, Xinhang Xu, Lihua Xie

    Abstract: Humans learn from observations and experiences to adjust their behaviours towards better performance. Interacting with such dynamic humans is challenging, as the robot needs to predict the humans accurately for safe and efficient operations. Long-term interactions with dynamic humans have not been extensively studied by prior works. We propose an adaptive human prediction model based on the Theory…

    Submitted 12 February, 2025; v1 submitted 9 February, 2025; originally announced February 2025.

    Comments: submitted to ICRA 2025

  28. arXiv:2502.04778  [pdf, other]

    cs.LG cs.AI

    Behavior-Regularized Diffusion Policy Optimization for Offline Reinforcement Learning

    Authors: Chen-Xiao Gao, Chenyang Wu, Mingjun Cao, Chenjun Xiao, Yang Yu, Zongzhang Zhang

    Abstract: The primary focus of offline reinforcement learning (RL) is to manage the risk of hazardous exploitation of out-of-distribution actions. An effective approach to achieve this goal is through behavior regularization, which augments conventional RL objectives by incorporating constraints that enforce the policy to remain close to the behavior policy. Nevertheless, existing literature on behavior-reg…

    Submitted 7 February, 2025; originally announced February 2025.

    Comments: Under review

  29. arXiv:2501.16309  [pdf, other]

    physics.med-ph cs.AI

    Evaluating The Performance of Using Large Language Models to Automate Summarization of CT Simulation Orders in Radiation Oncology

    Authors: Meiyun Cao, Shaw Hu, Jason Sharp, Edward Clouser, Jason Holmes, Linda L. Lam, Xiaoning Ding, Diego Santos Toesca, Wendy S. Lindholm, Samir H. Patel, Sujay A. Vora, Peilong Wang, Wei Liu

    Abstract: Purpose: This study aims to use a large language model (LLM) to automate the generation of summaries from the CT simulation orders and evaluate its performance. Materials and Methods: A total of 607 CT simulation orders for patients were collected from the Aria database at our institution. A locally hosted Llama 3.1 405B model, accessed via the Application Programming Interface (API) service, wa…

    Submitted 27 January, 2025; originally announced January 2025.

  30. arXiv:2501.14249  [pdf, other]

    cs.LG cs.AI cs.CL

    Humanity's Last Exam

    Authors: Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dmitry Dodonov, Tung Nguyen, Jaeho Lee, Daron Anderson, Mikhail Doroshenko, Alun Cennyth Stokes, et al. (1084 additional authors not shown)

    Abstract: Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of…

    Submitted 19 April, 2025; v1 submitted 24 January, 2025; originally announced January 2025.

    Comments: 29 pages, 6 figures

  31. arXiv:2501.13751  [pdf, other]

    eess.IV cs.CV

    On Disentangled Training for Nonlinear Transform in Learned Image Compression

    Authors: Han Li, Shaohui Li, Wenrui Dai, Maida Cao, Nuowen Kan, Chenglin Li, Junni Zou, Hongkai Xiong

    Abstract: Learned image compression (LIC) has demonstrated superior rate-distortion (R-D) performance compared to traditional codecs, but is challenged by training inefficiency that could incur more than two weeks to train a state-of-the-art model from scratch. Existing LIC methods overlook the slow convergence caused by compacting energy in learning nonlinear transforms. In this paper, we first reveal that…

    Submitted 15 February, 2025; v1 submitted 23 January, 2025; originally announced January 2025.

    Comments: Accepted by ICLR2025

  32. arXiv:2501.12273  [pdf, other]

    cs.CL cs.AI

    Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement

    Authors: Maosong Cao, Taolin Zhang, Mo Li, Chuyu Zhang, Yunxin Liu, Haodong Duan, Songyang Zhang, Kai Chen

    Abstract: The quality of Supervised Fine-Tuning (SFT) data plays a critical role in enhancing the conversational capabilities of Large Language Models (LLMs). However, as LLMs become more advanced, the availability of high-quality human-annotated SFT data has become a significant bottleneck, necessitating a greater reliance on synthetic training data. In this work, we introduce Condor, a novel two-stage syn…

    Submitted 21 January, 2025; originally announced January 2025.

    Comments: Tech Report. Github: https://github.com/InternLM/Condor

  33. arXiv:2501.06566  [pdf, other]

    cs.RO eess.SY

    Cooperative Aerial Robot Inspection Challenge: A Benchmark for Heterogeneous Multi-UAV Planning and Lessons Learned

    Authors: Muqing Cao, Thien-Minh Nguyen, Shenghai Yuan, Andreas Anastasiou, Angelos Zacharia, Savvas Papaioannou, Panayiotis Kolios, Christos G. Panayiotou, Marios M. Polycarpou, Xinhang Xu, Mingjie Zhang, Fei Gao, Boyu Zhou, Ben M. Chen, Lihua Xie

    Abstract: We propose the Cooperative Aerial Robot Inspection Challenge (CARIC), a simulation-based benchmark for motion planning algorithms in heterogeneous multi-UAV systems. CARIC features UAV teams with complementary sensors, realistic constraints, and evaluation metrics prioritizing inspection quality and efficiency. It offers a ready-to-use perception-control software stack and diverse scenarios to sup…

    Submitted 14 January, 2025; v1 submitted 11 January, 2025; originally announced January 2025.

    Comments: Please find our website at https://ntu-aris.github.io/caric

  34. arXiv:2412.16880  [pdf, other]

    cs.RO

    Large-Scale UWB Anchor Calibration and One-Shot Localization Using Gaussian Process

    Authors: Shenghai Yuan, Boyang Lou, Thien-Minh Nguyen, Pengyu Yin, Muqing Cao, Xinghang Xu, Jianping Li, Jie Xu, Siyu Chen, Lihua Xie

    Abstract: Ultra-wideband (UWB) is gaining popularity with devices like AirTags for precise home item localization but faces significant challenges when scaled to large environments like seaports. The main challenges are calibration and localization in obstructed conditions, which are common in logistics environments. Traditional calibration methods, dependent on line-of-sight (LoS), are slow, costly, and un…

    Submitted 6 March, 2025; v1 submitted 22 December, 2024; originally announced December 2024.

    Comments: This work has been accepted to IEEE International Conference on Robotics and Automation (ICRA). © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/redistribution, creating new works, or reuse of any copyrighted components of this work in other media

  35. arXiv:2412.15819  [pdf, other]

    cs.CV cs.HC eess.SP

    Robustness-enhanced Myoelectric Control with GAN-based Open-set Recognition

    Authors: Cheng Wang, Ziyang Feng, Pin Zhang, Manjiang Cao, Yiming Yuan, Tengfei Chang

    Abstract: Electromyography (EMG) signals are widely used in human motion recognition and medical rehabilitation, yet their variability and susceptibility to noise significantly limit the reliability of myoelectric control systems. Existing recognition algorithms often fail to handle unfamiliar actions effectively, leading to system instability and errors. This paper proposes a novel framework based on Gener…

    Submitted 20 December, 2024; originally announced December 2024.

    Comments: 11 pages, 14 figures

  36. arXiv:2412.14531  [pdf, other]

    cs.CV

    Consistent Human Image and Video Generation with Spatially Conditioned Diffusion

    Authors: Mingdeng Cao, Chong Mou, Ziyang Yuan, Xintao Wang, Zhaoyang Zhang, Ying Shan, Yinqiang Zheng

    Abstract: Consistent human-centric image and video synthesis aims to generate images or videos with new poses while preserving appearance consistency with a given reference image, which is crucial for low-cost visual content creation. Recent advances based on diffusion models typically rely on separate networks for reference appearance feature extraction and target visual generation, leading to inconsistent…

    Submitted 19 December, 2024; originally announced December 2024.

    Comments: Project page: https://github.com/ljzycmd/SCD

  37. arXiv:2412.14487  [pdf, other]

    cs.CV

    Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation

    Authors: Jihao Gu, Yingyao Wang, Meng Cao, Pi Bu, Jun Song, Yancheng He, Shilong Li, Bo Zheng

    Abstract: Direct Preference Optimization (DPO) has been demonstrated to be highly effective in mitigating hallucinations in Large Vision Language Models (LVLMs) by aligning their outputs more closely with human preferences. Despite the recent progress, existing methods suffer from two drawbacks: 1) Lack of scalable token-level rewards; and 2) Neglect of visual-anchored tokens. To this end, we propose a nove…

    Submitted 23 February, 2025; v1 submitted 18 December, 2024; originally announced December 2024.

  38. arXiv:2412.14161  [pdf, other]

    cs.CL

    TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

    Authors: Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, Graham Neubig

    Abstract: We interact with computers on an everyday basis, be it in everyday life or work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. But how performant are AI agen…

    Submitted 18 December, 2024; originally announced December 2024.

    Comments: Preprint

  39. arXiv:2412.12886  [pdf, other]

    cs.LG

    TimeCHEAT: A Channel Harmony Strategy for Irregularly Sampled Multivariate Time Series Analysis

    Authors: Jiexi Liu, Meng Cao, Songcan Chen

    Abstract: Irregularly sampled multivariate time series (ISMTS) are prevalent in reality. Due to their non-uniform intervals between successive observations and varying sampling rates among series, the channel-independent (CI) strategy, which has been demonstrated more desirable for complete multivariate time series forecasting in recent studies, has failed. This failure can be further attributed to the samp…

    Submitted 17 December, 2024; originally announced December 2024.

    Comments: Accepted by AAAI 2025

  40. arXiv:2412.12087  [pdf, other]

    cs.CV

    Instruction-based Image Manipulation by Watching How Things Move

    Authors: Mingdeng Cao, Xuaner Zhang, Yinqiang Zheng, Zhihao Xia

    Abstract: This paper introduces a novel dataset construction pipeline that samples pairs of frames from videos and uses multimodal large language models (MLLMs) to generate editing instructions for training instruction-based image manipulation models. Video frames inherently preserve the identity of subjects and scenes, ensuring consistent content preservation during editing. Additionally, video data captur…

    Submitted 16 December, 2024; originally announced December 2024.

    Comments: Project page: https://ljzycmd.github.io/projects/InstructMove/

  41. arXiv:2412.10089  [pdf, other]

    cs.CV

    Guidance Not Obstruction: A Conjugate Consistent Enhanced Strategy for Domain Generalization

    Authors: Meng Cao, Songcan Chen

    Abstract: Domain generalization addresses domain shift in real-world applications. Most approaches adopt a domain angle, seeking invariant representation across domains by aligning their marginal distributions, irrespective of individual classes, naturally leading to insufficient exploration of discriminative information. Switching to a class angle, we find that multiple domain-related peaks or clusters wit…

    Submitted 13 December, 2024; originally announced December 2024.

  42. arXiv:2412.09828  [pdf, other]

    cs.CV

    MSC: Multi-Scale Spatio-Temporal Causal Attention for Autoregressive Video Diffusion

    Authors: Xunnong Xu, Mengying Cao

    Abstract: Diffusion transformers enable flexible generative modeling for video. However, it is still technically challenging and computationally expensive to generate high-resolution videos with rich semantics and complex motion. Similar to languages, video data are also auto-regressive by nature, so it is counter-intuitive to use attention mechanism with bi-directional dependency in the model. Here we prop…

    Submitted 12 December, 2024; originally announced December 2024.

    Comments: Technical Report

  43. arXiv:2412.08087  [pdf, other]

    cs.CY cs.DL

    From Division to Unity: A Large-Scale Study on the Emergence of Computational Social Science, 1990-2021

    Authors: Honglin Bao, Jiawei Zhang, Mingxuan Cao, James A. Evans

    Abstract: We present a comprehensive study on the emergence of Computational Social Science (CSS) - an interdisciplinary field leveraging computational methods to address social science questions - and its impact on adjacent social sciences. We trained a robust CSS classifier using papers from CSS-focused venues and applied it to 11 million papers spanning 1990 to 2021. Our analysis yielded three key findin…

    Submitted 15 December, 2024; v1 submitted 10 December, 2024; originally announced December 2024.

  44. arXiv:2412.07119  [pdf, other]

    cs.CV

    DiffCLIP: Few-shot Language-driven Multimodal Classifier

    Authors: Jiaqing Zhang, Mingxiang Cao, Xue Yang, Kai Jiang, Yunsong Li

    Abstract: Visual language models like Contrastive Language-Image Pretraining (CLIP) have shown impressive performance in analyzing natural images with language information. However, these models often encounter challenges when applied to specialized domains such as remote sensing due to the limited availability of image-text pairs for training. To tackle this issue, we introduce DiffCLIP, a novel framework…

    Submitted 9 December, 2024; originally announced December 2024.

  45. arXiv:2412.04903  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation

    Authors: Yongxin Wang, Meng Cao, Haokun Lin, Mingfei Han, Liang Ma, Jin Jiang, Yuhao Cheng, Xiaodan Liang

    Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress on various visual question answering and reasoning tasks leveraging instruction fine-tuning specific datasets. They can also learn from preference data annotated by human to enhance their reasoning ability and mitigate hallucinations. Most of preference data is generated from the model itself. However, existing methods requ…

    Submitted 16 December, 2024; v1 submitted 6 December, 2024; originally announced December 2024.

    Comments: 19 pages

  46. arXiv:2412.01800  [pdf, other]

    cs.CV

    PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos

    Authors: Meng Cao, Haoran Tang, Haoze Zhao, Hangyu Guo, Jiaheng Liu, Ge Zhang, Ruyang Liu, Qiang Sun, Ian Reid, Xiaodan Liang

    Abstract: Recent advancements in video-based large language models (Video LLMs) have witnessed the emergence of diverse capabilities to reason and interpret dynamic visual content. Among them, gameplay videos stand out as a distinctive data source, often containing glitches that defy physics commonsense. This characteristic renders them an effective benchmark for assessing the under-explored capability of p…

    Submitted 2 December, 2024; originally announced December 2024.

  47. arXiv:2412.01063  [pdf, other]

    cs.LG stat.ML

    MuSiCNet: A Gradual Coarse-to-Fine Framework for Irregularly Sampled Multivariate Time Series Analysis

    Authors: Jiexi Liu, Meng Cao, Songcan Chen

    Abstract: Irregularly sampled multivariate time series (ISMTS) are prevalent in reality. Most existing methods treat ISMTS as synchronized regularly sampled time series with missing values, neglecting that the irregularities are primarily attributed to variations in sampling rates. In this paper, we introduce a novel perspective that irregularity is essentially relative in some senses. With sampling rates a…

    Submitted 1 December, 2024; originally announced December 2024.

    Comments: IJCAI2024 AI4TS workshop best paper runner-up

  48. arXiv:2412.00555  [pdf, other]

    cs.RO

    Learning Dynamic Weight Adjustment for Spatial-Temporal Trajectory Planning in Crowd Navigation

    Authors: Muqing Cao, Xinhang Xu, Yizhuo Yang, Jianping Li, Tongxing Jin, Pengfei Wang, Tzu-Yi Hung, Guosheng Lin, Lihua Xie

    Abstract: Robot navigation in dense human crowds poses a significant challenge due to the complexity of human behavior in dynamic and obstacle-rich environments. In this work, we propose a dynamic weight adjustment scheme using a neural network to predict the optimal weights of objectives in an optimization-based motion planner. We adopt a spatial-temporal trajectory planner and incorporate diverse objectiv…

    Submitted 30 November, 2024; originally announced December 2024.

    Comments: submitted to ICRA 2025

  49. arXiv:2412.00069  [pdf, other]

    cs.LG cs.CL

    Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning

    Authors: Mingyu Cao, Gen Li, Jie Ji, Jiaqi Zhang, Xiaolong Ma, Shiwei Liu, Lu Yin

    Abstract: Mixture-of-Experts (MoE) has garnered significant attention for its ability to scale up neural networks while utilizing the same or even fewer active parameters. However, MoE does not alleviate the massive memory requirements of networks, which limits their practicality in real-world applications, especially in the era of large language models (LLMs). While recent work explores the possibility of…

    Submitted 16 February, 2025; v1 submitted 25 November, 2024; originally announced December 2024.

  50. arXiv:2411.18328  [pdf, other]

    cs.CV

    EventCrab: Harnessing Frame and Point Synergy for Event-based Action Recognition and Beyond

    Authors: Meiqi Cao, Xiangbo Shu, Jiachao Zhang, Rui Yan, Zechao Li, Jinhui Tang

    Abstract: Event-based Action Recognition (EAR) possesses the advantages of high-temporal resolution capturing and privacy preservation compared with traditional action recognition. Current leading EAR solutions typically follow two regimes: project unconstructed event streams into dense constructed event frames and adopt powerful frame-specific networks, or employ lightweight point-specific networks to hand…

    Submitted 27 November, 2024; originally announced November 2024.