Showing 1–50 of 564 results for author: Zheng, C

Searching in archive cs.
  1. arXiv:2505.10527  [pdf, other]

    cs.CL

    WorldPM: Scaling Human Preference Modeling

    Authors: Binghai Wang, Runji Lin, Keming Lu, Le Yu, Zhenru Zhang, Fei Huang, Chujie Zheng, Kai Dang, Yang Fan, Xingzhang Ren, An Yang, Binyuan Hui, Dayiheng Liu, Tao Gui, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Bowen Yu, Jingren Zhou, Junyang Lin

    Abstract: Motivated by scaling laws in language modeling that demonstrate how test loss scales as a power law with model and dataset sizes, we find that similar laws exist in preference modeling. We propose World Preference Modeling (WorldPM) to emphasize this scaling potential, where World Preference embodies a unified representation of human preferences. In this paper, we collect preference data from pub… ▽ More

    Submitted 15 May, 2025; originally announced May 2025.

  2. arXiv:2505.10425  [pdf, ps, other]

    cs.LG

    Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs

    Authors: Jingyao Wang, Wenwen Qiang, Zeen Song, Changwen Zheng, Hui Xiong

    Abstract: Large language models (LLMs) excel at complex tasks thanks to advances in reasoning abilities. However, existing methods overlook the trade-off between reasoning effectiveness and computational efficiency, often encouraging unnecessarily long reasoning chains and wasting tokens. To address this, we propose Learning to Think (L2T), an information-theoretic reinforcement fine-tuning framework for LL… ▽ More

    Submitted 15 May, 2025; originally announced May 2025.

  3. arXiv:2505.09388  [pdf, other]

    cs.CL

    Qwen3 Technical Report

    Authors: An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, et al. (35 additional authors not shown)

    Abstract: In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration… ▽ More

    Submitted 14 May, 2025; originally announced May 2025.

  4. arXiv:2505.09118  [pdf, other]

    cs.CV

    Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning

    Authors: Dayong Liang, Changmeng Zheng, Zhiyuan Wen, Yi Cai, Xiao-Yong Wei, Qing Li

    Abstract: Traditional scene graphs primarily focus on spatial relationships, limiting vision-language models' (VLMs) ability to reason about complex interactions in visual scenes. This paper addresses two key challenges: (1) conventional detection-to-construction methods produce unfocused, contextually irrelevant relationship sets, and (2) existing approaches fail to form persistent memories for generalizin… ▽ More

    Submitted 14 May, 2025; originally announced May 2025.

  5. arXiv:2505.08265  [pdf, other]

    cs.LG cs.AI

    LLM Enhancers for GNNs: An Analysis from the Perspective of Causal Mechanism Identification

    Authors: Hang Gao, Wenxuan Huang, Fengge Wu, Junsuo Zhao, Changwen Zheng, Huaping Liu

    Abstract: The use of large language models (LLMs) as feature enhancers to optimize node representations, which are then used as inputs for graph neural networks (GNNs), has shown significant potential in graph representation learning. However, the fundamental properties of this approach remain underexplored. To address this issue, we propose conducting a more in-depth analysis of this issue based on the int… ▽ More

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: Accepted by ICML 2025

  6. arXiv:2505.06524  [pdf, ps, other]

    cs.CV

    Causal Prompt Calibration Guided Segment Anything Model for Open-Vocabulary Multi-Entity Segmentation

    Authors: Jingyao Wang, Jianqi Zhang, Wenwen Qiang, Changwen Zheng

    Abstract: Despite the strength of the Segment Anything Model (SAM), it struggles with generalization issues in open-vocabulary multi-entity segmentation (OVMS). Through empirical and causal analyses, we find that (i) the prompt bias is the primary cause of the generalization issues; (ii) this bias is closely tied to the task-irrelevant generating factors within the prompts, which act as confounders and affe… ▽ More

    Submitted 10 May, 2025; originally announced May 2025.

  7. arXiv:2505.06321  [pdf, other]

    cs.LG cs.AI

    Learn to Think: Bootstrapping LLM Reasoning Capability Through Graph Learning

    Authors: Hang Gao, Chenhao Zhang, Tie Wang, Junsuo Zhao, Fengge Wu, Changwen Zheng, Huaping Liu

    Abstract: Large Language Models (LLMs) have achieved remarkable success across various domains. However, they still face significant challenges, including high computational costs for training and limitations in solving complex reasoning problems. Although existing methods have extended the reasoning capabilities of LLMs through structured paradigms, these approaches often rely on task-specific prompts and… ▽ More

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: Accepted by IJCAI 2025

  8. arXiv:2505.01068  [pdf, other]

    cs.CL cs.AI

    Multimodal Transformers are Hierarchical Modal-wise Heterogeneous Graphs

    Authors: Yijie Jin, Junjie Peng, Xuanchao Lin, Haochen Yuan, Lan Wang, Cangzhi Zheng

    Abstract: Multimodal Sentiment Analysis (MSA) is a rapidly developing field that integrates multimodal information to recognize sentiments, and existing models have made significant progress in this area. The central challenge in MSA is multimodal fusion, which is predominantly addressed by Multimodal Transformers (MulTs). Although they act as the paradigm, MulTs suffer from efficiency concerns. In this work, fr… ▽ More

    Submitted 2 May, 2025; originally announced May 2025.

  9. arXiv:2504.21382  [pdf, ps, other]

    cs.DC

    Robust and Scalable Renaming with Subquadratic Bits

    Authors: Sirui Bai, Xinyu Fu, Yuheng Wang, Yuyi Wang, Chaodong Zheng

    Abstract: In the renaming problem, a set of $n$ nodes, each with a unique identity from a large namespace $[N]$, needs to obtain new unique identities in a smaller namespace $[M]$. A renaming algorithm is strong if $M=n$. Renaming is a classical problem in distributed computing with a range of applications, and there exist many time-efficient solutions for fault-tolerant renaming in synchronous message-pass… ▽ More

    Submitted 30 April, 2025; originally announced April 2025.

  10. arXiv:2504.15894  [pdf, ps, other]

    cs.HC cs.AI

    Supporting Data-Frame Dynamics in AI-assisted Decision Making

    Authors: Chengbo Zheng, Tim Miller, Alina Bialkowski, H Peter Soyer, Monika Janda

    Abstract: High stakes decision-making often requires a continuous interplay between evolving evidence and shifting hypotheses, a dynamic that is not well supported by current AI decision support systems. In this paper, we introduce a mixed-initiative framework for AI assisted decision making that is grounded in the data-frame theory of sensemaking and the evaluative AI paradigm. Our approach enables both hu… ▽ More

    Submitted 22 April, 2025; originally announced April 2025.

    Comments: Presented at the 2025 ACM Workshop on Human-AI Interaction for Augmented Reasoning, Report Number: CHI25-WS-AUGMENTED-REASONING

    Report number: CHI25-WS-AUGMENTED-REASONING

    Journal ref: Proceedings of the 2025 ACM CHI Workshop on Human-AI Interaction for Augmented Reasoning

  11. arXiv:2504.15649  [pdf, other]

    eess.IV cs.CV

    RepNet-VSR: Reparameterizable Architecture for High-Fidelity Video Super-Resolution

    Authors: Biao Wu, Diankai Zhang, Shaoli Liu, Si Gao, Chengjian Zheng, Ning Wang

    Abstract: As a fundamental challenge in visual computing, video super-resolution (VSR) focuses on reconstructing high-definition video sequences from their degraded low-resolution counterparts. While deep convolutional neural networks have demonstrated state-of-the-art performance in spatial-temporal super-resolution tasks, their computationally intensive nature poses significant deployment challenges for res… ▽ More

    Submitted 22 April, 2025; originally announced April 2025.

    Comments: Champion Solution for CVPR 2025 MAI VSR Track

  12. arXiv:2504.13914  [pdf, other]

    cs.CL

    Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning

    Authors: ByteDance Seed, Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqi Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, Yufeng Yuan, Yu Yue, Lin Yan, Qiying Yu, Xiaochen Zuo, Chi Zhang, Ruofei Zhu, Zhecheng An, Zhihao Bai, Yu Bao, Xingyan Bin, Jiangjie Chen, Feng Chen, Hongmin Chen, et al. (249 additional authors not shown)

    Abstract: We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For in… ▽ More

    Submitted 29 April, 2025; v1 submitted 10 April, 2025; originally announced April 2025.

  13. arXiv:2504.08958  [pdf, other]

    cs.CL cs.AI

    Generating Planning Feedback for Open-Ended Programming Exercises with LLMs

    Authors: Mehmet Arif Demirtaş, Claire Zheng, Max Fowler, Kathryn Cunningham

    Abstract: To complete an open-ended programming exercise, students need to both plan a high-level solution and implement it using the appropriate syntax. However, these problems are often autograded on the correctness of the final submission through test cases, and students cannot get feedback on their planning process. Large language models (LLM) may be able to generate this feedback by detecting the overa… ▽ More

    Submitted 11 April, 2025; originally announced April 2025.

    Comments: Accepted as full paper at AIED 2025

  14. arXiv:2504.07961  [pdf, other]

    cs.CV

    Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction

    Authors: Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, Andrea Vedaldi

    Abstract: We introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes. By leveraging the strong dynamic prior captured by such video models, Geo4D can be trained using only synthetic data while generalizing well to real data in a zero-shot manner. Geo4D predicts several complementary geometric modalities, namely point, depth, and ray maps. It uses a new… ▽ More

    Submitted 10 April, 2025; originally announced April 2025.

    Comments: 16 pages, 5 figures, Project page: https://geo4d.github.io/

    ACM Class: I.4.5

  15. arXiv:2504.03006  [pdf, other]

    cs.CV

    DiSRT-In-Bed: Diffusion-Based Sim-to-Real Transfer Framework for In-Bed Human Mesh Recovery

    Authors: Jing Gao, Ce Zheng, Laszlo A. Jeni, Zackory Erickson

    Abstract: In-bed human mesh recovery can be crucial and enabling for several healthcare applications, including sleep pattern monitoring, rehabilitation support, and pressure ulcer prevention. However, it is difficult to collect large real-world visual datasets in this domain, in part due to privacy and expense constraints, which in turn presents significant challenges for training and deploying deep learni… ▽ More

    Submitted 3 April, 2025; originally announced April 2025.

    Comments: 16 pages, 19 figures. Accepted to CVPR 2025

  16. arXiv:2504.02137  [pdf, other]

    cs.IR cs.AI

    Enhancing Embedding Representation Stability in Recommendation Systems with Semantic ID

    Authors: Carolina Zheng, Minhui Huang, Dmitrii Pedchenko, Kaushik Rangadurai, Siyu Wang, Gaby Nahum, Jie Lei, Yang Yang, Tao Liu, Zutian Luo, Xiaohan Wei, Dinesh Ramasamy, Jiyan Yang, Yiping Han, Lin Yang, Hangjun Xu, Rong Jin, Shuang Yang

    Abstract: The exponential growth of online content has posed significant challenges to ID-based models in industrial recommendation systems, ranging from extremely high cardinality and dynamically growing ID space, to highly skewed engagement distributions, to prediction instability as a result of natural ID life cycles (e.g., the birth of new IDs and retirement of old IDs). To address these issues, many sys… ▽ More

    Submitted 2 April, 2025; originally announced April 2025.

  17. arXiv:2504.00532  [pdf, other]

    cs.SE cs.CL

    SRLCG: Self-Rectified Large-Scale Code Generation with Multidimensional Chain-of-Thought and Dynamic Backtracking

    Authors: Hongru Ma, Yanjie Liang, Jiasheng Si, Weiyu Zhang, Hongjiao Guan, Chaoqun Zheng, Bing Xu, Wenpeng Lu

    Abstract: Large language models (LLMs) have revolutionized code generation, significantly enhancing developer productivity. However, for a vast number of users with minimal coding knowledge, LLMs provide little support, as they primarily generate isolated code snippets rather than complete, large-scale project code. Without coding expertise, these users struggle to interpret, modify, and iteratively refine… ▽ More

    Submitted 1 April, 2025; originally announced April 2025.

    Comments: 23 pages

  18. arXiv:2503.22677  [pdf, other]

    cs.CV cs.AI cs.LG

    DSO: Aligning 3D Generators with Simulation Feedback for Physical Soundness

    Authors: Ruining Li, Chuanxia Zheng, Christian Rupprecht, Andrea Vedaldi

    Abstract: Most 3D object generators focus on aesthetic quality, often neglecting physical constraints necessary in applications. One such constraint is that the 3D object should be self-supporting, i.e., remains balanced under gravity. Prior approaches to generating stable 3D objects used differentiable physics simulators to optimize geometry at test-time, which is slow, unstable, and prone to local optima.… ▽ More

    Submitted 28 March, 2025; originally announced March 2025.

    Comments: Project page: https://ruiningli.com/dso

  19. arXiv:2503.22344  [pdf, other]

    cs.CV

    Semantix: An Energy Guided Sampler for Semantic Style Transfer

    Authors: Huiang He, Minghui Hu, Chuanxia Zheng, Chaoyue Wang, Tat-Jen Cham

    Abstract: Recent advances in style and appearance transfer are impressive, but most methods isolate global style and local appearance transfer, neglecting semantic correspondence. Additionally, image and video tasks are typically handled in isolation, with little focus on integrating them for video transfer. To address these limitations, we introduce a novel task, Semantic Style Transfer, which involves tra… ▽ More

    Submitted 28 March, 2025; originally announced March 2025.

    Comments: 28 pages, 19 figures, Accepted to ICLR 2025

  20. arXiv:2503.21699  [pdf, other]

    cs.MM cs.AI cs.CV cs.SD eess.AS

    MAVERIX: Multimodal Audio-Visual Evaluation Reasoning IndeX

    Authors: Liuyue Xie, George Z. Wei, Avik Kuthiala, Ce Zheng, Ananya Bal, Mosam Dabhi, Liting Wen, Taru Rustagi, Ethan Lai, Sushil Khyalia, Rohan Choudhury, Morteza Ziyadi, Xu Zhang, Hao Yang, László A. Jeni

    Abstract: Frontier models have either been language-only or have primarily focused on vision and language modalities. Although recent advancements in models with vision and audio understanding capabilities have shown substantial progress, the field lacks a standardized evaluation framework for thoroughly assessing their cross-modality perception performance. We introduce MAVERIX (Multimodal Audio-Visual Eva… ▽ More

    Submitted 27 March, 2025; originally announced March 2025.

  21. arXiv:2503.21082  [pdf, other]

    cs.CV

    Can Video Diffusion Model Reconstruct 4D Geometry?

    Authors: Jinjie Mai, Wenxuan Zhu, Haozhe Liu, Bing Li, Cheng Zheng, Jürgen Schmidhuber, Bernard Ghanem

    Abstract: Reconstructing dynamic 3D scenes (i.e., 4D geometry) from monocular video is an important yet challenging problem. Conventional multiview geometry-based approaches often struggle with dynamic motion, whereas recent learning-based methods either require specialized 4D representation or sophisticated optimization. In this paper, we present Sora3R, a novel framework that taps into the rich spatiotemp… ▽ More

    Submitted 26 March, 2025; originally announced March 2025.

  22. arXiv:2503.17827  [pdf, other]

    cs.CV

    4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding

    Authors: Wenxuan Zhu, Bing Li, Cheng Zheng, Jinjie Mai, Jun Chen, Letian Jiang, Abdullah Hamdi, Sara Rojas Martinez, Chia-Wen Lin, Mohamed Elhoseiny, Bernard Ghanem

    Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive 2D image/video understanding capabilities. However, there are no publicly standardized benchmarks to assess the abilities of MLLMs in understanding the 4D objects (3D objects with temporal evolution over time). In this paper, we introduce 4D-Bench, the first benchmark to evaluate the capabilities of MLLMs in 4D object understand… ▽ More

    Submitted 22 March, 2025; originally announced March 2025.

  23. arXiv:2503.17784  [pdf, other]

    cs.AI

    MEPNet: Medical Entity-balanced Prompting Network for Brain CT Report Generation

    Authors: Xiaodan Zhang, Yanzhao Shi, Junzhong Ji, Chengxin Zheng, Liangqiong Qu

    Abstract: The automatic generation of brain CT reports has gained widespread attention, given its potential to assist radiologists in diagnosing cranial diseases. However, brain CT scans involve extensive medical entities, such as diverse anatomy regions and lesions, exhibiting highly inconsistent spatial patterns in 3D volumetric space. This leads to biased learning of medical entities in existing methods,… ▽ More

    Submitted 22 March, 2025; originally announced March 2025.

    Comments: AAAI 2025 Oral Paper

  24. arXiv:2503.13439  [pdf, other]

    cs.CV

    Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images

    Authors: Tianhao Wu, Chuanxia Zheng, Frank Guan, Andrea Vedaldi, Tat-Jen Cham

    Abstract: Most image-based 3D object reconstructors assume that objects are fully visible, ignoring occlusions that commonly occur in real-world scenarios. In this paper, we introduce Amodal3R, a conditional 3D generative model designed to reconstruct 3D objects from partial observations. We start from a "foundation" 3D generative model and extend it to recover plausible 3D geometry and appearance from occl… ▽ More

    Submitted 17 March, 2025; originally announced March 2025.

    Comments: Project Page: https://sm0kywu.github.io/Amodal3R/

  25. arXiv:2503.10529  [pdf, other]

    cs.CV cs.AI

    PiSA: A Self-Augmented Data Engine and Training Strategy for 3D Understanding with Large Models

    Authors: Zilu Guo, Hongbin Lin, Zhihao Yuan, Chaoda Zheng, Pengshuo Qiu, Dongzhi Jiang, Renrui Zhang, Chun-Mei Feng, Zhen Li

    Abstract: 3D Multimodal Large Language Models (MLLMs) have recently made substantial advancements. However, their potential remains untapped, primarily due to the limited quantity and suboptimal quality of 3D datasets. Current approaches attempt to transfer knowledge from 2D MLLMs to expand 3D instruction data, but still face modality and domain gaps. To this end, we introduce PiSA-Engine (Point-Self-Augmen… ▽ More

    Submitted 13 March, 2025; originally announced March 2025.

    Comments: Technical Report

  26. arXiv:2503.10432  [pdf, ps, other]

    cs.LG cs.CL

    BeamLLM: Vision-Empowered mmWave Beam Prediction with Large Language Models

    Authors: Can Zheng, Jiguang He, Guofa Cai, Zitong Yu, Chung G. Kang

    Abstract: In this paper, we propose BeamLLM, a vision-aided millimeter-wave (mmWave) beam prediction framework leveraging large language models (LLMs) to address the challenges of high training overhead and latency in mmWave communication systems. By combining computer vision (CV) with LLMs' cross-modal reasoning capabilities, the framework extracts user equipment (UE) positional features from RGB images an… ▽ More

    Submitted 13 March, 2025; originally announced March 2025.

    Comments: 6 pages, 7 figures, conference

  27. arXiv:2503.10170  [pdf, other]

    cs.RO cs.CV

    GS-SDF: LiDAR-Augmented Gaussian Splatting and Neural SDF for Geometrically Consistent Rendering and Reconstruction

    Authors: Jianheng Liu, Yunfei Wan, Bowen Wang, Chunran Zheng, Jiarong Lin, Fu Zhang

    Abstract: Digital twins are fundamental to the development of autonomous driving and embodied artificial intelligence. However, achieving high-granularity surface reconstruction and high-fidelity rendering remains a challenge. Gaussian splatting offers efficient photorealistic rendering but struggles with geometric inconsistencies due to fragmented primitives and sparse observational data in robotics applic… ▽ More

    Submitted 13 March, 2025; originally announced March 2025.

  28. arXiv:2503.06578  [pdf, other]

    cs.RO eess.SY

    Non-Equilibrium MAV-Capture-MAV via Time-Optimal Planning and Reinforcement Learning

    Authors: Canlun Zheng, Zhanyu Guo, Zikang Yin, Chunyu Wang, Zhikun Wang, Shiyu Zhao

    Abstract: The capture of flying MAVs (micro aerial vehicles) has garnered increasing research attention due to its intriguing challenges and promising applications. Despite recent advancements, a key limitation of existing work is that capture strategies are often relatively simple and constrained by platform performance. This paper addresses control strategies capable of capturing high-maneuverability targ… ▽ More

    Submitted 9 March, 2025; originally announced March 2025.

  29. arXiv:2503.06412  [pdf, other]

    cs.RO cs.MA eess.SY

    Vision-Based Cooperative MAV-Capturing-MAV

    Authors: Canlun Zheng, Yize Mi, Hanqing Guo, Huaben Chen, Shiyu Zhao

    Abstract: MAV-capturing-MAV (MCM) is one of the few effective methods for physically countering misused or malicious MAVs. This paper presents a vision-based cooperative MCM system, where multiple pursuer MAVs equipped with onboard vision systems detect, localize, and pursue a target MAV. To enhance robustness, a distributed state estimation and control framework enables the pursuer MAVs to autonomously coor… ▽ More

    Submitted 8 March, 2025; originally announced March 2025.

  30. arXiv:2503.05246  [pdf, other]

    cs.LG

    Mastering Continual Reinforcement Learning through Fine-Grained Sparse Network Allocation and Dormant Neuron Exploration

    Authors: Chengqi Zheng, Haiyan Yin, Jianda Chen, Terence Ng, Yew-Soon Ong, Ivor Tsang

    Abstract: Continual Reinforcement Learning (CRL) is essential for developing agents that can learn, adapt, and accumulate knowledge over time. However, a fundamental challenge persists as agents must strike a delicate balance between plasticity, which enables rapid skill acquisition, and stability, which ensures long-term knowledge retention while preventing catastrophic forgetting. In this paper, we introd… ▽ More

    Submitted 9 March, 2025; v1 submitted 7 March, 2025; originally announced March 2025.

  31. arXiv:2503.05164  [pdf, other]

    cs.RO cs.AI

    A Comprehensive LLM-powered Framework for Driving Intelligence Evaluation

    Authors: Shanhe You, Xuewen Luo, Xinhe Liang, Jiashu Yu, Chen Zheng, Jiangtao Gong

    Abstract: Evaluation methods for autonomous driving are crucial for algorithm optimization. However, due to the complexity of driving intelligence, there is currently no comprehensive evaluation method for the level of autonomous driving intelligence. In this paper, we propose an evaluation framework for driving behavior intelligence in complex traffic environments, aiming to fill this gap. We constructed a… ▽ More

    Submitted 7 March, 2025; originally announced March 2025.

    Comments: 8 pages, 3 figures

    MSC Class: 68T45

    Journal ref: ICRA2025

  32. arXiv:2503.01474  [pdf, other]

    cs.RO

    Interactive Navigation for Legged Manipulators with Learned Arm-Pushing Controller

    Authors: Zhihai Bi, Kai Chen, Chunxin Zheng, Yulin Li, Haoang Li, Jun Ma

    Abstract: Interactive navigation is crucial in scenarios where proactively interacting with objects can yield shorter paths, thus significantly improving traversal efficiency. Existing methods primarily focus on using the robot body to relocate large obstacles (which could be comparable to the size of a robot). However, they prove ineffective in narrow or constrained spaces where the robot's dimensions rest… ▽ More

    Submitted 3 March, 2025; originally announced March 2025.

  33. arXiv:2503.01125  [pdf, other]

    cs.RO

    TACO: General Acrobatic Flight Control via Target-and-Command-Oriented Reinforcement Learning

    Authors: Zikang Yin, Canlun Zheng, Shiliang Guo, Zhikun Wang, Shiyu Zhao

    Abstract: Although acrobatic flight control has been studied extensively, one key limitation of the existing methods is that they are usually restricted to specific maneuver tasks and cannot change flight pattern parameters online. In this work, we propose a target-and-command-oriented reinforcement learning (TACO) framework, which can handle different maneuver tasks in a unified way and allows online param… ▽ More

    Submitted 7 March, 2025; v1 submitted 2 March, 2025; originally announced March 2025.

    Comments: For the experiment video, please refer to https://youtu.be/x1v7nD2iHIk

  34. arXiv:2502.18277  [pdf, other]

    cs.CL

    Self-Adjust Softmax

    Authors: Chuanyang Zheng, Yihang Gao, Guoxuan Chen, Han Shi, Jing Xiong, Xiaozhe Ren, Chao Huang, Xin Jiang, Zhenguo Li, Yu Li

    Abstract: The softmax function is crucial in Transformer attention, which normalizes each row of the attention scores with summation to one, achieving superior performances over other alternative functions. However, the softmax function can face a gradient vanishing issue when some elements of the attention scores approach extreme values, such as probabilities close to one or zero. In this paper, we propose… ▽ More

    Submitted 25 February, 2025; originally announced February 2025.

    Comments: Tech Report

  35. arXiv:2502.14917  [pdf, other]

    cs.CV cs.AI

    Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning

    Authors: Rui Zhao, Qirui Yuan, Jinyu Li, Haofeng Hu, Yun Li, Chengyuan Zheng, Fei Gao

    Abstract: End-to-end autonomous driving, which directly maps raw sensor inputs to low-level vehicle controls, is an important part of Embodied AI. Despite successes in applying Multimodal Large Language Models (MLLMs) for high-level traffic scene semantic understanding, it remains challenging to effectively translate these conceptual semantics understandings into low-level motion control commands and achiev… ▽ More

    Submitted 19 February, 2025; originally announced February 2025.

  36. arXiv:2502.14739  [pdf, other]

    cs.CL

    SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

    Authors: M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, King Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixin Deng, Shawn Gavin, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, David Ma, Yuansheng Ni, Haoran Que, et al. (72 additional authors not shown)

    Abstract: Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields, particularly in light industry, agriculture, and service-orient… ▽ More

    Submitted 28 March, 2025; v1 submitted 20 February, 2025; originally announced February 2025.

  37. arXiv:2502.14583  [pdf, other]

    cs.LG cs.AI

    A Theory for Conditional Generative Modeling on Multiple Data Sources

    Authors: Rongzhen Wang, Yan Zhang, Chenyu Zheng, Chongxuan Li, Guoqiang Wu

    Abstract: The success of large generative models has driven a paradigm shift, leveraging massive multi-source data to enhance model capabilities. However, the interaction among these sources remains theoretically underexplored. This paper takes the first step toward a rigorous analysis of multi-source training in conditional generative modeling, where each condition represents a distinct data source. Specif… ▽ More

    Submitted 20 February, 2025; originally announced February 2025.

    Comments: 35 pages

  38. arXiv:2502.14361  [pdf, other]

    cs.AI cs.IR

    Retrieval-Augmented Process Reward Model for Generalizable Mathematical Reasoning

    Authors: Jiachen Zhu, Congmin Zheng, Jianghao Lin, Kounianhua Du, Ying Wen, Yong Yu, Jun Wang, Weinan Zhang

    Abstract: While large language models (LLMs) have significantly advanced mathematical reasoning, Process Reward Models (PRMs) have been developed to evaluate the logical validity of reasoning steps. However, PRMs still struggle with out-of-distribution (OOD) challenges. This paper identifies key OOD issues, including step OOD, caused by differences in reasoning patterns across model types and sizes, and que… ▽ More

    Submitted 20 February, 2025; originally announced February 2025.

  39. arXiv:2502.14317  [pdf, other]

    cs.CL

    ParallelComp: Parallel Long-Context Compressor for Length Extrapolation

    Authors: Jing Xiong, Jianghan Shen, Chuanyang Zheng, Zhongwei Wan, Chenyang Zhao, Chiwun Yang, Fanghua Ye, Hongxia Yang, Lingpeng Kong, Ngai Wong

    Abstract: Efficiently handling long contexts is crucial for large language models (LLMs). While rotary position embeddings (RoPEs) enhance length generalization, effective length extrapolation remains challenging and often requires costly fine-tuning. In contrast, recent training-free approaches suffer from the attention sink phenomenon, leading to severe performance degradation. In this paper, we introduce… ▽ More

    Submitted 20 February, 2025; originally announced February 2025.

    Comments: We will release the code soon

  40. ArtMentor: AI-Assisted Evaluation of Artworks to Explore Multimodal Large Language Models Capabilities

    Authors: Chanjin Zheng, Zengyi Yu, Yilin Jiang, Mingzi Zhang, Xunuo Lu, Jing Jin, Liteng Gao

    Abstract: Can Multimodal Large Language Models (MLLMs), with capabilities in perception, recognition, understanding, and reasoning, function as independent assistants in art evaluation dialogues? Current MLLM evaluation methods, which rely on subjective human scoring or costly interviews, lack comprehensive coverage of various scenarios. This paper proposes a process-oriented Human-Computer Interaction (HCI… ▽ More

    Submitted 19 February, 2025; originally announced February 2025.

    Comments: 18 pages, 12 figures. Accepted by CHI 2025

  41. arXiv:2502.12640  [pdf, other]

    cs.CV

    RecDreamer: Consistent Text-to-3D Generation via Uniform Score Distillation

    Authors: Chenxi Zheng, Yihong Lin, Bangzhen Liu, Xuemiao Xu, Yongwei Nie, Shengfeng He

    Abstract: Current text-to-3D generation methods based on score distillation often suffer from geometric inconsistencies, leading to repeated patterns across different poses of 3D assets. This issue, known as the Multi-Face Janus problem, arises because existing methods struggle to maintain consistency across varying poses and are biased toward a canonical pose. While recent work has improved pose control an… ▽ More

    Submitted 18 February, 2025; originally announced February 2025.

  42. arXiv:2502.12002  [pdf, other]

    cs.SD cs.CV eess.AS

    NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing

    Authors: Yifan Liang, Fangkun Liu, Andong Li, Xiaodong Li, Chengshi Zheng

    Abstract: Recent advancements in visual speech recognition (VSR) have promoted progress in lip-to-speech synthesis, where pre-trained VSR models enhance the intelligibility of synthesized speech by providing valuable semantic information. The success achieved by cascade frameworks, which combine pseudo-VSR with pseudo-text-to-speech (TTS) or implicitly utilize the transcribed text, highlights the benefits o…

    Submitted 17 February, 2025; originally announced February 2025.

  43. arXiv:2502.08974  [pdf, other]

    cs.CV

    Topo2Seq: Enhanced Topology Reasoning via Topology Sequence Learning

    Authors: Yiming Yang, Yueru Luo, Bingkun He, Erlong Li, Zhipeng Cao, Chao Zheng, Shuqi Mei, Zhen Li

    Abstract: Extracting lane topology from perspective views (PV) is crucial for planning and control in autonomous driving. This approach extracts potential drivable trajectories for self-driving vehicles without relying on high-definition (HD) maps. However, the unordered nature and weak long-range perception of the DETR-like framework can result in misaligned segment endpoints and limited topological predic…

    Submitted 13 February, 2025; originally announced February 2025.

  44. arXiv:2502.08089  [pdf, other]

    cs.RO eess.SY

    A Cooperative Bearing-Rate Approach for Observability-Enhanced Target Motion Estimation

    Authors: Canlun Zheng, Hanqing Guo, Shiyu Zhao

    Abstract: Vision-based target motion estimation is a fundamental problem in many robotic tasks. The existing methods have the limitation of low observability and, hence, face challenges in tracking highly maneuverable targets. Motivated by the aerial target pursuit task where a target may maneuver in 3D space, this paper studies how to further enhance observability by incorporating the bearing rate i…

    Submitted 11 February, 2025; originally announced February 2025.

    Comments: Accepted by ICRA 2025

  45. arXiv:2502.06398  [pdf, other]

    cs.LG stat.ML

    Learning Counterfactual Outcomes Under Rank Preservation

    Authors: Peng Wu, Haoxuan Li, Chunyuan Zheng, Yan Zeng, Jiawei Chen, Yang Liu, Ruocheng Guo, Kun Zhang

    Abstract: Counterfactual inference aims to estimate the counterfactual outcome at the individual level given knowledge of an observed treatment and the factual outcome, with broad applications in fields such as epidemiology, econometrics, and management science. Previous methods rely on a known structural causal model (SCM) or assume the homogeneity of the exogenous variable and strict monotonicity between…

    Submitted 10 February, 2025; originally announced February 2025.

  46. arXiv:2502.05870  [pdf, other]

    cs.HC

    Understanding Design Fixation in Generative AI

    Authors: Liuqing Chen, Yaxuan Song, Chunyuan Zheng, Qianzhi Jing, Preben Hansen, Lingyun Sun

    Abstract: Generative AI (GenAI) provides new opportunities for creativity support, but the phenomenon of GenAI design fixation remains underexplored. While human design fixation typically constrains ideas to familiar or existing solutions, our findings reveal that GenAI similarly experiences design fixation, limiting its ability to generate novel and diverse design outcomes. To advance understanding of GenAI…

    Submitted 9 February, 2025; originally announced February 2025.

  47. arXiv:2502.05454  [pdf, other]

    cs.RO cs.LG

    Temporal Representation Alignment: Successor Features Enable Emergent Compositionality in Robot Instruction Following

    Authors: Vivek Myers, Bill Chunyuan Zheng, Anca Dragan, Kuan Fang, Sergey Levine

    Abstract: Effective task representations should facilitate compositionality, such that after learning a variety of basic tasks, an agent can perform compound tasks consisting of multiple steps simply by composing the representations of the constituent steps together. While this is conceptually simple and appealing, it is not clear how to automatically learn representations that enable this sort of compositi…

    Submitted 13 February, 2025; v1 submitted 8 February, 2025; originally announced February 2025.

  48. Designing LLM-simulated Immersive Spaces to Enhance Autistic Children's Social Affordances Understanding

    Authors: Yancheng Cao, Yangyang HE, Yonglin Chen, Menghan Chen, Shanhe You, Yulin Qiu, Min Liu, Chuan Luo, Chen Zheng, Xin Tong, Jing Liang, Jiangtao Gong

    Abstract: One of the key challenges faced by autistic children is understanding social affordances in complex environments, which further impacts their ability to respond appropriately to social signals. In traffic scenarios, this impairment can even lead to safety concerns. In this paper, we introduce an LLM-simulated immersive projection environment designed to improve this ability in autistic children wh…

    Submitted 5 February, 2025; originally announced February 2025.

    Comments: IUI 2025

  49. Rotation-Adaptive Point Cloud Domain Generalization via Intricate Orientation Learning

    Authors: Bangzhen Liu, Chenxi Zheng, Xuemiao Xu, Cheng Xu, Huaidong Zhang, Shengfeng He

    Abstract: The vulnerability of 3D point cloud analysis to unpredictable rotations poses an open yet challenging problem: orientation-aware 3D domain generalization. Cross-domain robustness and adaptability of 3D representations are crucial but not easily achieved through rotation augmentation. Motivated by the inherent advantages of intricate orientations in enhancing generalizability, we propose an innovat…

    Submitted 4 February, 2025; originally announced February 2025.

    Comments: 13 pages, supplementary included, early accepted by TPAMI

    ACM Class: I.2.10

  50. arXiv:2501.18623  [pdf, other]

    cs.CV cs.GR

    VLMaterial: Procedural Material Generation with Large Vision-Language Models

    Authors: Beichen Li, Rundi Wu, Armando Solar-Lezama, Changxi Zheng, Liang Shi, Bernd Bickel, Wojciech Matusik

    Abstract: Procedural materials, represented as functional node graphs, are ubiquitous in computer graphics for photorealistic material appearance design. They allow users to perform intuitive and precise editing to achieve desired visual appearances. However, creating a procedural material given an input image requires professional knowledge and significant effort. In this work, we leverage the ability to c…

    Submitted 18 February, 2025; v1 submitted 26 January, 2025; originally announced January 2025.

    Comments: ICLR 2025 Spotlight