Showing 1–50 of 971 results for author: Dong, Y

Searching in archive cs.
  1. arXiv:2507.05920  [pdf, ps, other]

    cs.CV

    High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

    Authors: Xinyu Huang, Yuhao Dong, Weiwei Tian, Bo Li, Rui Feng, Ziwei Liu

    Abstract: State-of-the-art large multi-modal models (LMMs) face challenges when processing high-resolution images, as these inputs are converted into enormous visual tokens, many of which are irrelevant to the downstream task. In this paper, we propose Multi-turn Grounding-based Policy Optimization (MGPO), an end-to-end reinforcement learning (RL) framework that enables LMMs to iteratively focus on key visu…

    Submitted 8 July, 2025; originally announced July 2025.

  2. arXiv:2507.05649  [pdf, ps, other]

    cs.CR cs.AI cs.LG

    DESIGN: Encrypted GNN Inference via Server-Side Input Graph Pruning

    Authors: Kaixiang Zhao, Joseph Yousry Attalla, Qian Lou, Yushun Dong

    Abstract: Graph Neural Networks (GNNs) have achieved state-of-the-art performance in various graph-based learning tasks. However, enabling privacy-preserving GNNs in encrypted domains, such as under Fully Homomorphic Encryption (FHE), typically incurs substantial computational overhead, rendering real-time and privacy-preserving inference impractical. In this work, we propose DESIGN (EncrypteD GNN Inference…

    Submitted 8 July, 2025; originally announced July 2025.

    Comments: Under Review in Conference on Neural Information Processing Systems (NeurIPS 2025)

  3. Hardware-Free Event Cameras Temporal Synchronization Based on Event Density Alignment

    Authors: Wenxuan Li, Yan Dong, Shaoqiang Qiu, Bin Han

    Abstract: Event cameras are a novel type of sensor designed for capturing the dynamic changes of a scene. Due to factors such as trigger and transmission delays, a time offset exists in the data collected by multiple event cameras, leading to inaccurate information fusion. Thus, the collected data needs to be synchronized to overcome any potential time offset issue. Hardware synchronization methods require…

    Submitted 6 July, 2025; originally announced July 2025.

    Comments: 12 pages, 8 figures. Conference paper, International Conference on Intelligent Robotics and Applications 2024

    Journal ref: ICIRA 2023. Lecture Notes in Computer Science, vol 14273. Springer, Singapore

  4. arXiv:2507.04311  [pdf, ps, other]

    cs.RO

    Vibration-aware Lidar-Inertial Odometry based on Point-wise Post-Undistortion Uncertainty

    Authors: Yan Dong, Enci Xu, Shaoqiang Qiu, Wenxuan Li, Yang Liu, Bin Han

    Abstract: High-speed ground robots moving on unstructured terrains generate intense high-frequency vibrations, leading to LiDAR scan distortions in Lidar-inertial odometry (LIO). Accurate and efficient undistortion is extremely challenging due to (1) rapid and non-smooth state changes during intense vibrations and (2) unpredictable IMU noise coupled with a limited IMU sampling frequency. To address this iss…

    Submitted 6 July, 2025; originally announced July 2025.

    Comments: 8 pages, 10 figures, 5 tables. Accepted by Robotics and Automation Letters on June 30

  5. arXiv:2507.04105  [pdf, ps, other]

    cs.AI cs.MA

    Enhancing Robustness of LLM-Driven Multi-Agent Systems through Randomized Smoothing

    Authors: Jinwei Hu, Yi Dong, Zhengtao Ding, Xiaowei Huang

    Abstract: This paper presents a defense framework for enhancing the safety of large language model (LLM) empowered multi-agent systems (MAS) in safety-critical domains such as aerospace. We apply randomized smoothing, a statistical robustness certification technique, to the MAS consensus context, enabling probabilistic guarantees on agent decisions under adversarial influence. Unlike traditional verificatio…

    Submitted 5 July, 2025; originally announced July 2025.

    Comments: Preprint accepted by Chinese Journal of Aeronautics

  6. arXiv:2507.04100  [pdf, ps, other]

    cs.LG cs.AI eess.SY

    Hierarchical Testing with Rabbit Optimization for Industrial Cyber-Physical Systems

    Authors: Jinwei Hu, Zezhi Tang, Xin Jin, Benyuan Zhang, Yi Dong, Xiaowei Huang

    Abstract: This paper presents HERO (Hierarchical Testing with Rabbit Optimization), a novel black-box adversarial testing framework for evaluating the robustness of deep learning-based Prognostics and Health Management systems in Industrial Cyber-Physical Systems. Leveraging Artificial Rabbit Optimization, HERO generates physically constrained adversarial examples that align with real-world data distributio…

    Submitted 5 July, 2025; originally announced July 2025.

    Comments: Preprint accepted by IEEE Transactions on Industrial Cyber Physical Systems

  7. arXiv:2507.03402  [pdf, ps, other]

    cs.CV cs.AI

    Pose-Star: Anatomy-Aware Editing for Open-World Fashion Images

    Authors: Yuran Dong, Mang Ye

    Abstract: To advance real-world fashion image editing, we analyze existing two-stage pipelines (mask generation followed by diffusion-based editing), which overly prioritize generator optimization while neglecting mask controllability. This results in two critical limitations: I) poor user-defined flexibility (coarse-grained human masks restrict edits to predefined regions like upper torso; fine-grained clothe…

    Submitted 4 July, 2025; originally announced July 2025.

    Comments: 18 pages, 17 figures, ICCV25

  8. arXiv:2507.03271  [pdf, ps, other]

    stat.ML cs.LG

    LILI clustering algorithm: Limit Inferior Leaf Interval Integrated into Causal Forest for Causal Interference

    Authors: Yiran Dong, Di Fan, Chuanhou Gao

    Abstract: Causal forest methods are powerful tools in causal inference. Similar to traditional random forest in machine learning, causal forest independently considers each causal tree. However, this independence consideration increases the likelihood that classification errors in one tree are repeated in others, potentially leading to significant bias in causal effect estimation. In this paper, we propose a…

    Submitted 3 July, 2025; originally announced July 2025.

  9. arXiv:2507.02877  [pdf, ps, other]

    q-bio.GN cs.AI cs.GR cs.HC

    AuraGenome: An LLM-Powered Framework for On-the-Fly Reusable and Scalable Circular Genome Visualizations

    Authors: Chi Zhang, Yu Dong, Yang Wang, Yuetong Han, Guihua Shan, Bixia Tang

    Abstract: Circular genome visualizations are essential for exploring structural variants and gene regulation. However, existing tools often require complex scripting and manual configuration, making the process time-consuming, error-prone, and difficult to learn. To address these challenges, we introduce AuraGenome, an LLM-powered framework for rapid, reusable, and scalable generation of multi-layered circu…

    Submitted 17 June, 2025; originally announced July 2025.

  10. arXiv:2507.02546  [pdf, ps, other]

    cs.CV

    MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

    Authors: Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, Jiaolong Yang

    Abstract: We propose MoGe-2, an advanced open-domain geometry estimation model that recovers a metric scale 3D point map of a scene from a single image. Our method builds upon the recent monocular geometry estimation approach, MoGe, which predicts affine-invariant point maps with unknown scales. We explore effective strategies to extend MoGe for metric geometry prediction without compromising the relative g…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: Project page: https://wangrc.site/MoGe2Page/

  11. arXiv:2507.02270  [pdf, ps, other]

    cs.CV

    MAC-Lookup: Multi-Axis Conditional Lookup Model for Underwater Image Enhancement

    Authors: Fanghai Yi, Zehong Zheng, Zexiao Liang, Yihang Dong, Xiyang Fang, Wangyu Wu, Xuhang Chen

    Abstract: Enhancing underwater images is crucial for exploration. These images face visibility and color issues due to light changes, water turbidity, and bubbles. Traditional prior-based methods and pixel-based methods often fail, while deep learning lacks sufficient high-quality datasets. We introduce the Multi-Axis Conditional Lookup (MAC-Lookup) model, which enhances visual quality by improving color ac…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Accepted by IEEE SMC 2025

  12. arXiv:2507.01603  [pdf, ps, other]

    cs.CV

    DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation

    Authors: Yue-Jiang Dong, Wang Zhao, Jiale Xu, Ying Shan, Song-Hai Zhang

    Abstract: Diffusion-based video depth estimation methods have achieved remarkable success with strong generalization ability. However, predicting depth for long videos remains challenging. Existing methods typically split videos into overlapping sliding windows, leading to accumulated scale discrepancies across different windows, particularly as the number of windows increases. Additionally, these methods r…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Accepted by ICCV 2025

  13. arXiv:2507.01006  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    Authors: GLM-V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Boyan Shi, Changyu Pang, Chenhui Zhang, et al. (54 additional authors not shown)

    Abstract: We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the fi…

    Submitted 2 July, 2025; v1 submitted 1 July, 2025; originally announced July 2025.

  14. arXiv:2506.23844  [pdf, ps, other]

    cs.AI

    A Survey on Autonomy-Induced Security Risks in Large Model-Based Agents

    Authors: Hang Su, Jun Luo, Chang Liu, Xiao Yang, Yichi Zhang, Yinpeng Dong, Jun Zhu

    Abstract: Recent advances in large language models (LLMs) have catalyzed the rise of autonomous AI agents capable of perceiving, reasoning, and acting in dynamic, open-ended environments. These large-model agents mark a paradigm shift from static inference systems to interactive, memory-augmented entities. While these capabilities significantly expand the functional scope of AI, they also introduce qualitat…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: 18 pages

  15. arXiv:2506.22930  [pdf, ps, other]

    cs.CV

    Towards Explainable Bilingual Multimodal Misinformation Detection and Localization

    Authors: Yiwei He, Xiangtai Li, Zhenglin Huang, Yi Dong, Hao Fei, Jiangning Zhang, Baoyuan Wu, Guangliang Cheng

    Abstract: The increasing realism of multimodal content has made misinformation more subtle and harder to detect, especially in news media where images are frequently paired with bilingual (e.g., Chinese-English) subtitles. Such content often includes localized image edits and cross-lingual inconsistencies that jointly distort meaning while remaining superficially plausible. We introduce BiMi, a bilingual mu…

    Submitted 28 June, 2025; originally announced June 2025.

  16. arXiv:2506.22769  [pdf, ps, other]

    cs.RO

    Learning Efficient Robotic Garment Manipulation with Standardization

    Authors: Changshi Zhou, Feng Luan, Jiarui Hu, Shaoqiang Meng, Zhipeng Wang, Yanchao Dong, Yanmin Zhou, Bin He

    Abstract: Garment manipulation is a significant challenge for robots due to the complex dynamics and potential self-occlusion of garments. Most existing methods of efficient garment unfolding overlook the crucial role of standardization of flattened garments, which could significantly simplify downstream tasks like folding, ironing, and packing. This paper presents APS-Net, a novel approach to garment manip…

    Submitted 28 June, 2025; originally announced June 2025.

  17. arXiv:2506.22521  [pdf, ps, other]

    cs.CR cs.AI cs.LG

    A Survey on Model Extraction Attacks and Defenses for Large Language Models

    Authors: Kaixiang Zhao, Lincan Li, Kaize Ding, Neil Zhenqiang Gong, Yue Zhao, Yushun Dong

    Abstract: Model extraction attacks pose significant security threats to deployed language models, potentially compromising intellectual property and user privacy. This survey provides a comprehensive taxonomy of LLM-specific extraction attacks and defenses, categorizing attacks into functionality extraction, training data extraction, and prompt-targeted attacks. We analyze various attack methodologies inclu…

    Submitted 26 June, 2025; originally announced June 2025.

  18. arXiv:2506.21356  [pdf, ps, other]

    cs.CV

    ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models

    Authors: Hongbo Liu, Jingwen He, Yi Jin, Dian Zheng, Yuhao Dong, Fan Zhang, Ziqi Huang, Yinan He, Yangguang Li, Weichao Chen, Yu Qiao, Wanli Ouyang, Shengjie Zhao, Ziwei Liu

    Abstract: Cinematography, the fundamental visual language of film, is essential for conveying narrative, emotion, and aesthetic quality. While recent Vision-Language Models (VLMs) demonstrate strong general visual understanding, their proficiency in comprehending the nuanced cinematic grammar embedded within individual shots remains largely unexplored and lacks robust evaluation. This critical gap limits bo…

    Submitted 27 June, 2025; v1 submitted 26 June, 2025; originally announced June 2025.

  19. arXiv:2506.21018  [pdf, ps, other]

    cs.CV

    LASFNet: A Lightweight Attention-Guided Self-Modulation Feature Fusion Network for Multimodal Object Detection

    Authors: Lei Hao, Lina Xu, Chang Liu, Yanni Dong

    Abstract: Effective deep feature extraction via feature-level fusion is crucial for multimodal object detection. However, previous studies often involve complex training processes that integrate modality-specific features by stacking multiple feature-level fusion units, leading to significant computational overhead. To address this issue, we propose a new fusion detection baseline that uses a single feature…

    Submitted 26 June, 2025; originally announced June 2025.

  20. arXiv:2506.20241  [pdf, ps, other]

    cs.CL cs.AI

    Enhancing Large Language Models through Structured Reasoning

    Authors: Yubo Dong, Hehe Fan

    Abstract: Recent Large Language Models (LLMs) have significantly advanced natural language processing and automated decision-making. However, these models still encounter difficulties when performing complex reasoning tasks involving logical deduction and systematic planning, primarily due to their reliance on implicit statistical relationships without structured knowledge representation. Inspired by cogniti…

    Submitted 25 June, 2025; originally announced June 2025.

    Comments: Preprint. Under review

  21. arXiv:2506.18931  [pdf, ps, other]

    cs.LG cs.AI

    Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs

    Authors: Shuang Ao, Yi Dong, Jinwei Hu, Sarvapali Ramchurn

    Abstract: Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) enhances adaptability while reducing computational costs. However, fine-tuning can compromise safety alignment, even with benign data, increasing susceptibility to harmful outputs. Existing safety alignment methods struggle to capture complex parameter shifts, leading to suboptimal safety-utility trade-offs. To address this i…

    Submitted 21 June, 2025; originally announced June 2025.

    Comments: 13 pages, 3 figures

  22. arXiv:2506.17964  [pdf, ps, other]

    cs.CE

    Learning from the Storm: A Multivariate Machine Learning Approach to Predicting Hurricane-Induced Economic Losses

    Authors: Bolin Shen, Eren Erman Ozguven, Yue Zhao, Guang Wang, Yiqun Xie, Yushun Dong

    Abstract: Florida is particularly vulnerable to hurricanes, which frequently cause substantial economic losses. While prior studies have explored specific contributors to hurricane-induced damage, few have developed a unified framework capable of integrating a broader range of influencing factors to comprehensively assess the sources of economic loss. In this study, we propose a comprehensive modeling frame…

    Submitted 22 June, 2025; originally announced June 2025.

  23. arXiv:2506.17709  [pdf, ps, other]

    cs.LG cs.CR stat.ML

    CEGA: A Cost-Effective Approach for Graph-Based Model Extraction and Acquisition

    Authors: Zebin Wang, Menghan Lin, Bolin Shen, Ken Anderson, Molei Liu, Tianxi Cai, Yushun Dong

    Abstract: Graph Neural Networks (GNNs) have demonstrated remarkable utility across diverse applications, and their growing complexity has made Machine Learning as a Service (MLaaS) a viable platform for scalable deployment. However, this accessibility also exposes GNN to serious security threats, most notably model extraction attacks (MEAs), in which adversaries strategically query a deployed model to const…

    Submitted 21 June, 2025; originally announced June 2025.

    Report number: Accepted as a conference paper at ICML 2025

  24. arXiv:2506.17609  [pdf, ps, other]

    cs.CL cs.LG

    TyphoFormer: Language-Augmented Transformer for Accurate Typhoon Track Forecasting

    Authors: Lincan Li, Eren Erman Ozguven, Yue Zhao, Guang Wang, Yiqun Xie, Yushun Dong

    Abstract: Accurate typhoon track forecasting is crucial for early system warning and disaster response. While Transformer-based models have demonstrated strong performance in modeling the temporal dynamics of dense trajectories of humans and vehicles in smart cities, they usually lack access to broader contextual knowledge that enhances the forecasting reliability of sparse meteorological trajectories, such…

    Submitted 29 June, 2025; v1 submitted 21 June, 2025; originally announced June 2025.

    Comments: Short research paper

  25. arXiv:2506.17125  [pdf, ps, other]

    cs.SE

    Large Language Model Unlearning for Source Code

    Authors: Xue Jiang, Yihong Dong, Zheng Fang, Yingwei Ma, Tangxinyu Wang, Rongyu Cao, Binhua Li, Zhi Jin, Wenpin Jiao, Yongbin Li, Ge Li

    Abstract: LLM4SE has demonstrated significant success, but LLMs' potential memorization of sensitive or outdated training data introduces critical risks to legal compliance, software security, and code quality. LLM unlearning techniques, which can eliminate the influence of undesired data from LLMs in a post-training way, present a promising solution to address these concerns. While recent efforts in LLM un…

    Submitted 20 June, 2025; originally announced June 2025.

  26. arXiv:2506.17121  [pdf, ps, other]

    cs.CL

    Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs?

    Authors: Adithya Bhaskar, Alexander Wettig, Tianyu Gao, Yihe Dong, Danqi Chen

    Abstract: Language models handle increasingly long contexts for tasks such as book summarization, but this leads to growing memory costs for the key-value (KV) cache. Many prior works have proposed ways of discarding KVs from memory, but their approaches are tailored to favorable settings, obscuring caveats like high peak memory and performance degradation, and a fair comparison between methods is difficult…

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: We release our code publicly at https://github.com/princeton-pli/PruLong

  27. arXiv:2506.17074  [pdf, ps, other]

    cs.CV

    Assembler: Scalable 3D Part Assembly via Anchor Point Diffusion

    Authors: Wang Zhao, Yan-Pei Cao, Jiale Xu, Yuejiang Dong, Ying Shan

    Abstract: We present Assembler, a scalable and generalizable framework for 3D part assembly that reconstructs complete objects from input part meshes and a reference image. Unlike prior approaches that mostly rely on deterministic part pose prediction and category-specific training, Assembler is designed to handle diverse, in-the-wild objects with varying part counts, geometries, and structures. It addresse…

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: Technical Report. Project page: https://assembler3d.github.io

  28. arXiv:2506.15307  [pdf, ps, other]

    cs.LG

    SecFwT: Efficient Privacy-Preserving Fine-Tuning of Large Language Models Using Forward-Only Passes

    Authors: Jinglong Luo, Zhuo Zhang, Yehong Zhang, Shiyu Liu, Ye Dong, Xun Zhou, Hui Wang, Yue Yu, Zenglin Xu

    Abstract: Large language models (LLMs) have transformed numerous fields, yet their adaptation to specialized tasks in privacy-sensitive domains, such as healthcare and finance, is constrained by the scarcity of accessible training data due to stringent privacy requirements. Secure multi-party computation (MPC)-based privacy-preserving machine learning offers a powerful approach to protect both model paramet…

    Submitted 18 June, 2025; originally announced June 2025.

  29. arXiv:2506.15065  [pdf, ps, other]

    cs.LG cs.RO

    HEAL: An Empirical Study on Hallucinations in Embodied Agents Driven by Large Language Models

    Authors: Trishna Chakraborty, Udita Ghosh, Xiaopan Zhang, Fahim Faisal Niloy, Yue Dong, Jiachen Li, Amit K. Roy-Chowdhury, Chengyu Song

    Abstract: Large language models (LLMs) are increasingly being adopted as the cognitive core of embodied agents. However, inherited hallucinations, which stem from failures to ground user instructions in the observed physical environment, can lead to navigation errors, such as searching for a refrigerator that does not exist. In this paper, we present the first systematic study of hallucinations in LLM-based…

    Submitted 17 June, 2025; originally announced June 2025.

  30. arXiv:2506.14142  [pdf, ps, other]

    cs.CV cs.CL

    RadFabric: Agentic AI System with Reasoning Capability for Radiology

    Authors: Wenting Chen, Yi Dong, Zhaojun Ding, Yucheng Shi, Yifan Zhou, Fang Zeng, Yijun Luo, Tianyu Lin, Yihang Su, Yichen Wu, Kai Zhang, Zhen Xiang, Tianming Liu, Ninghao Liu, Lichao Sun, Yixuan Yuan, Xiang Li

    Abstract: Chest X-ray (CXR) imaging remains a critical diagnostic tool for thoracic conditions, but current automated systems face limitations in pathology coverage, diagnostic accuracy, and integration of visual and textual reasoning. To address these gaps, we propose RadFabric, a multi-agent, multimodal reasoning framework that unifies visual and textual analysis for comprehensive CXR interpretation. RadF…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: 4 figures, 2 tables

  31. arXiv:2506.13766  [pdf, ps, other]

    cs.CV

    PF-LHM: 3D Animatable Avatar Reconstruction from Pose-free Articulated Human Images

    Authors: Lingteng Qiu, Peihao Li, Qi Zuo, Xiaodong Gu, Yuan Dong, Weihao Yuan, Siyu Zhu, Xiaoguang Han, Guanying Chen, Zilong Dong

    Abstract: Reconstructing an animatable 3D human from casually captured images of an articulated subject without camera or human pose information is a practical yet challenging task due to view misalignment, occlusions, and the absence of structural priors. While optimization-based methods can produce high-fidelity results from monocular or multi-view videos, they require accurate pose estimation and slow it…

    Submitted 16 June, 2025; originally announced June 2025.

    Report number: 16

  32. arXiv:2506.13705  [pdf, ps, other]

    cs.LG cs.AI

    TimeMaster: Training Time-Series Multimodal LLMs to Reason via Reinforcement Learning

    Authors: Junru Zhang, Lang Feng, Xu Guo, Yuhan Wu, Yabo Dong, Duanqing Xu

    Abstract: Time-series reasoning remains a significant challenge in multimodal large language models (MLLMs) due to the dynamic temporal patterns, ambiguous semantics, and lack of temporal priors. In this work, we introduce TimeMaster, a reinforcement learning (RL)-based method that enables time-series MLLMs to perform structured, interpretable reasoning directly over visualized time-series inputs and task p…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: Preprint

  33. arXiv:2506.13654  [pdf, ps, other]

    cs.CV cs.AI

    Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning

    Authors: Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, Ziwei Liu

    Abstract: We introduce Ego-R1, a novel framework for reasoning over ultra-long (i.e., in days and weeks) egocentric videos, which leverages a structured Chain-of-Tool-Thought (CoTT) process, orchestrated by an Ego-R1 Agent trained via reinforcement learning (RL). Inspired by human problem-solving strategies, CoTT decomposes complex reasoning into modular steps, with the RL agent invoking specific tools, one…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: Project page: https://egolife-ai.github.io/Ego-R1/

  34. arXiv:2506.12336  [pdf, ps, other]

    cs.CV

    Understanding and Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding

    Authors: Youze Wang, Zijun Chen, Ruoyu Chen, Shishen Gu, Yinpeng Dong, Hang Su, Jun Zhu, Meng Wang, Richang Hong, Wenbo Hu

    Abstract: Recent advancements in multimodal large language models for video understanding (videoLLMs) have improved their ability to process dynamic multimodal data. However, trustworthiness challenges (factual inaccuracies, harmful content, biases, hallucinations, and privacy risks) undermine reliability due to video data's spatiotemporal complexities. This study introduces Trust-videoLLMs, a comprehensive…

    Submitted 14 June, 2025; originally announced June 2025.

  35. arXiv:2506.12012  [pdf, ps, other]

    cs.AI

    Tracing LLM Reasoning Processes with Strategic Games: A Framework for Planning, Revision, and Resource-Constrained Decision Making

    Authors: Xiaopeng Yuan, Xingjian Zhang, Ke Xu, Yifan Xu, Lijun Yu, Jindong Wang, Yushun Dong, Haohan Wang

    Abstract: Large language models (LLMs) are increasingly used for tasks that require complex reasoning. Most benchmarks focus on final outcomes but overlook the intermediate reasoning steps - such as planning, revision, and decision making under resource constraints. We argue that measuring these internal processes is essential for understanding model behavior and improving reliability. We propose using stra…

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: 19 pages, 7 figures. Under review

  36. arXiv:2506.11902  [pdf, ps, other]

    cs.LG cs.CL

    TreeRL: LLM Reinforcement Learning with On-Policy Tree Search

    Authors: Zhenyu Hou, Ziniu Hu, Yujiang Li, Rui Lu, Jie Tang, Yuxiao Dong

    Abstract: Reinforcement learning (RL) with tree search has demonstrated superior performance in traditional reasoning tasks. Compared to conventional independent chain sampling strategies with outcome supervision, tree search enables better exploration of the reasoning space and provides dense, on-policy process rewards during RL training but remains under-explored in On-Policy LLM RL. We propose TreeRL, a…

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: Accepted to ACL 2025 main conference

  37. arXiv:2506.11080  [pdf, ps, other]

    cs.CL

    MANBench: Is Your Multimodal Model Smarter than Human?

    Authors: Han Zhou, Qitong Xu, Yiheng Dong, Xin Yang

    Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has ignited discussions regarding their potential to surpass human performance in multimodal tasks. In response, we introduce MANBench (Multimodal Ability Norms Benchmark), a bilingual benchmark (English and Chinese) comprising 1,314 questions across nine tasks, spanning knowledge-based and non-knowledge-based domains. MANBench emph…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: Multimodal Benchmark, Project Url: https://github.com/micdz/MANBench, ACL2025 Findings

  38. arXiv:2506.10574  [pdf, ps, other]

    cs.CV cs.MM cs.SD eess.AS

    DanceChat: Large Language Model-Guided Music-to-Dance Generation

    Authors: Qing Wang, Xiaohang Yang, Yilan Dong, Naveen Raj Govindaraj, Gregory Slabaugh, Shanxin Yuan

    Abstract: Music-to-dance generation aims to synthesize human dance motion conditioned on musical input. Despite recent progress, significant challenges remain due to the semantic gap between music and dance motion, as music offers only abstract cues, such as melody, groove, and emotion, without explicitly specifying the physical movements. Moreover, a single piece of music can produce multiple plausible dan…

    Submitted 12 June, 2025; originally announced June 2025.

    Comments: check demos at https://dancechat.github.io/anon/

  39. arXiv:2506.09048  [pdf, other]

    cs.LG

    Understanding Task Vectors in In-Context Learning: Emergence, Functionality, and Limitations

    Authors: Yuxin Dong, Jiachen Jiang, Zhihui Zhu, Xia Ning

    Abstract: Task vectors offer a compelling mechanism for accelerating inference in in-context learning (ICL) by distilling task-specific information into a single, reusable representation. Despite their empirical success, the underlying principles governing their emergence and functionality remain unclear. This work proposes the Linear Combination Conjecture, positing that task vectors act as single in-conte…

    Submitted 10 June, 2025; originally announced June 2025.

  40. arXiv:2506.07636  [pdf, ps, other]

    cs.AI

    SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling

    Authors: Haoran Wang, Zhenyu Hou, Yao Wei, Jie Tang, Yuxiao Dong

    Abstract: Large language models (LLMs) have advanced rapidly from conversational problem solving to addressing real-world tasks involving tool use, such as software engineering (SWE). Recent LLM-powered toolkits, such as OpenAI Codex and Cursor, have offered end-to-end automation of the software development process. However, building effective SWE agents remains challenging due to the lack of high-quality t…

    Submitted 22 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

    Comments: Accepted to Findings of ACL'25

  41. arXiv:2506.06199  [pdf, ps, other]

    cs.RO cs.CV

    3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model

    Authors: Hongyan Zhi, Peihao Chen, Siyuan Zhou, Yubo Dong, Quanxi Wu, Lei Han, Mingkui Tan

    Abstract: Manipulation has long been a challenging task for robots, while humans can effortlessly perform complex interactions with objects, such as hanging a cup on the mug rack. A key reason is the lack of a large and uniform dataset for teaching robots manipulation skills. Current robot datasets often record robot action in different action spaces within a simple scene. This hinders the robot to learn a…

    Submitted 6 June, 2025; originally announced June 2025.

  42. arXiv:2506.05438  [pdf, ps, other]

    cs.LG cs.AI

    An Unsupervised Framework for Dynamic Health Indicator Construction and Its Application in Rolling Bearing Prognostics

    Authors: Tongda Sun, Chen Yin, Huailiang Zheng, Yining Dong

    Abstract: Health indicator (HI) plays a key role in degradation assessment and prognostics of rolling bearings. Although various HI construction methods have been investigated, most of them rely on expert knowledge for feature extraction and overlook capturing dynamic information hidden in sequential degradation processes, which limits the ability of the constructed HI for degradation trend representation a…

    Submitted 5 June, 2025; originally announced June 2025.

  43. arXiv:2506.04982  [pdf, ps, other]

    cs.RO

    GEX: Democratizing Dexterity with Fully-Actuated Dexterous Hand and Exoskeleton Glove

    Authors: Yunlong Dong, Xing Liu, Jun Wan, Zelin Deng

    Abstract: This paper introduces GEX, an innovative low-cost dexterous manipulation system that combines the GX11 tri-finger anthropomorphic hand (11 DoF) with the EX12 tri-finger exoskeleton glove (12 DoF), forming a closed-loop teleoperation framework through kinematic retargeting for high-fidelity control. Both components employ modular 3D-printed finger designs, achieving ultra-low manufacturing costs wh…

    Submitted 5 June, 2025; originally announced June 2025.

  44. arXiv:2506.03362  [pdf, ps, other]

    cs.RO

    Robustness-Aware Tool Selection and Manipulation Planning with Learned Energy-Informed Guidance

    Authors: Yifei Dong, Yan Zhang, Sylvain Calinon, Florian T. Pokorny

    Abstract: Humans subconsciously choose robust ways of selecting and using tools, based on years of embodied experience -- for example, choosing a ladle instead of a flat spatula to serve meatballs. However, robustness under uncertainty remains underexplored in robotic tool-use planning. This paper presents a robustness-aware framework that jointly selects tools and plans contact-rich manipulation trajectori…

    Submitted 3 June, 2025; originally announced June 2025.

  45. arXiv:2506.03214  [pdf, ps, other]

    q-bio.NC cs.AI cs.CL

    A Pre-trained Framework for Multilingual Brain Decoding Using Non-invasive Recordings

    Authors: Yi Guo, Yihang Dong, Michael Kwok-Po Ng, Shuqiang Wang

    Abstract: Brain-computer interfaces (BCIs) with speech decoding from brain recordings have broad application potential in fields such as clinical rehabilitation and cognitive neuroscience. However, current decoding methods remain limited to single-language, single-subject, and single neuroimaging modality settings, restricting their clinical applicability and generalizability. Here we propose a joint multil…

    Submitted 3 June, 2025; originally announced June 2025.

  46. arXiv:2506.02658  [pdf, ps, other]

    cs.SE

    Computational Thinking Reasoning in Large Language Models

    Authors: Kechi Zhang, Ge Li, Jia Li, Huangzhao Zhang, Jingjing Xu, Hao Zhu, Lecheng Wang, Jia Li, Yihong Dong, Jing Mai, Bin Gu, Zhi Jin

    Abstract: While large language models (LLMs) have demonstrated remarkable reasoning capabilities, they often struggle with complex tasks that require specific thinking paradigms, such as divide-and-conquer and procedural deduction, etc. Previous researches integrate external, reliable tools to alleviate logical inconsistencies and hallucinations in LLMs' problem-solving processes. However, we argue that the…

    Submitted 3 June, 2025; v1 submitted 3 June, 2025; originally announced June 2025.

  47. arXiv:2506.02362  [pdf, ps, other]

    cs.CR cs.AI

    MISLEADER: Defending against Model Extraction with Ensembles of Distilled Models

    Authors: Xueqi Cheng, Minxing Zheng, Shixiang Zhu, Yushun Dong

    Abstract: Model extraction attacks aim to replicate the functionality of a black-box model through query access, threatening the intellectual property (IP) of machine-learning-as-a-service (MLaaS) providers. Defending against such attacks is challenging, as it must balance efficiency, robustness, and utility preservation in the real-world scenario. Despite the recent advances, most existing defenses presume…

    Submitted 2 June, 2025; originally announced June 2025.

  48. arXiv:2506.01616  [pdf, ps, other]

    cs.AI

    MLA-Trust: Benchmarking Trustworthiness of Multimodal LLM Agents in GUI Environments

    Authors: Xiao Yang, Jiawei Chen, Jun Luo, Zhengwei Fang, Yinpeng Dong, Hang Su, Jun Zhu

    Abstract: The emergence of multimodal LLM-based agents (MLAs) has transformed interaction paradigms by seamlessly integrating vision, language, action and dynamic environments, enabling unprecedented autonomous capabilities across GUI applications ranging from web automation to mobile systems. However, MLAs introduce critical trustworthiness challenges that extend far beyond traditional language models' lim…

    Submitted 2 June, 2025; originally announced June 2025.

  49. arXiv:2506.01495  [pdf, ps, other]

    cs.CL

    CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models

    Authors: Ping Wu, Guobin Shen, Dongcheng Zhao, Yuwei Wang, Yiting Dong, Yu Shi, Enmeng Lu, Feifei Zhao, Yi Zeng

    Abstract: Ensuring that Large Language Models (LLMs) align with mainstream human values and ethical norms is crucial for the safe and sustainable development of AI. Current value evaluation and alignment are constrained by Western cultural bias and incomplete domestic frameworks reliant on non-native rules; furthermore, the lack of scalable, rule-driven scenario generation methods makes evaluations costly a…

    Submitted 26 June, 2025; v1 submitted 2 June, 2025; originally announced June 2025.

  50. arXiv:2506.01307  [pdf, ps, other]

    cs.CR cs.AI

    Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models

    Authors: Youze Wang, Wenbo Hu, Yinpeng Dong, Jing Liu, Hanwang Zhang, Richang Hong

    Abstract: Large Language Models (LLMs) have evolved into Multimodal Large Language Models (MLLMs), significantly enhancing their capabilities by integrating visual information and other types, thus aligning more closely with the nature of human intelligence, which processes a variety of data forms beyond just text. Despite advancements, the undesirable generation of these models remains a critical concern,…

    Submitted 2 June, 2025; originally announced June 2025.