-
Identification of gapless phases by squaring a twist operator
Authors:
Hang Su,
Yuan Yao,
Akira Furusaki
Abstract:
We propose a general necessary condition for a spin chain with SO(3) spin-rotation symmetry to be gapped. Specifically, we prove that the ground state(s) of an SO(3)-symmetric gapped spin chain must be spin singlet(s), and the expectation value of the square of a twist operator asymptotically approaches unity in the thermodynamic limit, where finite-size corrections are inversely proportional to the square root of the system size. This theorem provides (i) supporting evidence for various conjectured gapped phases, and (ii) a sufficient criterion for identifying gapless spin chains. We verify our theorem by numerical simulations for a variety of spin models and show that it offers a novel efficient way to identify gapless phases in spin chains with spin-rotation symmetry.
Submitted 3 June, 2025;
originally announced June 2025.
-
Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains
Authors:
Juncheng Wu,
Sheng Liu,
Haoqin Tu,
Hang Yu,
Xiaoke Huang,
James Zou,
Cihang Xie,
Yuyin Zhou
Abstract:
Recent advances in reasoning-enhanced Large Language Models such as OpenAI-o1/3 and DeepSeek-R1 have significantly improved performance on complex tasks. However, the quality and transparency of their internal reasoning processes remain underexplored. This work moves beyond final-answer accuracy and investigates step-by-step reasoning in the medical and mathematical domains by explicitly decomposing the thinking trajectories into two parts: knowledge and reasoning. Specifically, we introduce a fine-grained evaluation framework that judges: (1) the correctness of knowledge used (measured by Knowledge Index (KI)) and (2) the quality of reasoning (measured by Information Gain (InfoGain)). Using this framework, we study R1-distilled and base Qwen models trained with supervised fine-tuning (SFT) and/or reinforcement learning (RL) in the medical and math domains. Three intriguing findings emerge: (1) The general reasoning abilities in R1-distilled models do not transfer effectively to the medical domain through either SFT or RL. (2) SFT raises final-answer accuracy in both domains, but often at the cost of reasoning quality: InfoGain drops by 38.9% on average compared with untrained models. In the medical domain, however, SFT remains crucial because domain knowledge is indispensable. (3) RL enhances medical reasoning by pruning inaccurate or irrelevant knowledge from reasoning paths, thereby improving both reasoning accuracy and knowledge correctness.
Submitted 2 June, 2025;
originally announced June 2025.
-
SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis
Authors:
Zijian Wu,
Jinjie Ni,
Xiangyan Liu,
Zichen Liu,
Hang Yan,
Michael Qizhe Shieh
Abstract:
Vision-language models (VLMs) trained via reinforcement learning with verifiable reward (RLVR) have shown notable progress in scaling test-time compute effectively. In this work, we investigate how synthesized RL data can further improve RLVR. To this end, we propose SynthRL, a scalable and guaranteed pipeline for automatic data scaling in reasoning-oriented RL training. SynthRL comprises three key stages: (1) selecting seed questions with appropriate distribution, (2) augmenting them into more challenging variants while preserving the original answers, and (3) a guaranteed verification stage that ensures near-perfect correctness and difficulty enhancement. Our empirical experiments demonstrate SynthRL's scalability and effectiveness. When applied to the MMK12 dataset, SynthRL synthesizes over 3.3K additional verifiable, challenging questions from approximately 8K seed samples. Models trained with our synthesized data achieve consistent gains across five out-of-domain visual math reasoning benchmarks, with a significant improvement over baseline models trained on seed data alone. Notably, detailed analysis reveals that the gains are more pronounced on the most challenging evaluation samples, highlighting SynthRL's effectiveness in eliciting deeper and more complex reasoning patterns.
Submitted 2 June, 2025;
originally announced June 2025.
-
Online Competitive Information Gathering for Partially Observable Trajectory Games
Authors:
Mel Krusniak,
Hang Xu,
Parker Palermo,
Forrest Laine
Abstract:
Game-theoretic agents must make plans that optimally gather information about their opponents. These problems are modeled by partially observable stochastic games (POSGs), but planning in fully continuous POSGs is intractable without heavy offline computation or assumptions on the order of belief maintained by each player. We formulate a finite history/horizon refinement of POSGs which admits competitive information gathering behavior in trajectory space, and through a series of approximations, we present an online method for computing rational trajectory plans in these games which leverages particle-based estimations of the joint state space and performs stochastic gradient play. We also provide the necessary adjustments required to deploy this method on individual agents. The method is tested in continuous pursuit-evasion and warehouse-pickup scenarios (alongside extensions to $N > 2$ players and to more complex environments with visual and physical obstacles), demonstrating evidence of active information gathering and outperforming passive competitors.
Submitted 2 June, 2025;
originally announced June 2025.
-
MLA-Trust: Benchmarking Trustworthiness of Multimodal LLM Agents in GUI Environments
Authors:
Xiao Yang,
Jiawei Chen,
Jun Luo,
Zhengwei Fang,
Yinpeng Dong,
Hang Su,
Jun Zhu
Abstract:
The emergence of multimodal LLM-based agents (MLAs) has transformed interaction paradigms by seamlessly integrating vision, language, action and dynamic environments, enabling unprecedented autonomous capabilities across GUI applications ranging from web automation to mobile systems. However, MLAs introduce critical trustworthiness challenges that extend far beyond traditional language models' limitations, as they can directly modify digital states and trigger irreversible real-world consequences. Existing benchmarks inadequately tackle these unique challenges posed by MLAs' actionable outputs, long-horizon uncertainty and multimodal attack vectors. In this paper, we introduce MLA-Trust, the first comprehensive and unified framework that evaluates the MLA trustworthiness across four principled dimensions: truthfulness, controllability, safety and privacy. We utilize websites and mobile applications as realistic testbeds, designing 34 high-risk interactive tasks and curating rich evaluation datasets. Large-scale experiments involving 13 state-of-the-art agents reveal previously unexplored trustworthiness vulnerabilities unique to multimodal interactive scenarios. For instance, proprietary and open-source GUI-interacting MLAs pose more severe trustworthiness risks than static MLLMs, particularly in high-stakes domains; the transition from static MLLMs into interactive MLAs considerably compromises trustworthiness, enabling harmful content generation in multi-step interactions that standalone MLLMs would typically prevent; multi-step execution, while enhancing the adaptability of MLAs, involves latent nonlinear risk accumulation across successive interactions, circumventing existing safeguards and resulting in unpredictable derived risks. Moreover, we present an extensible toolbox to facilitate continuous evaluation of MLA trustworthiness across diverse interactive environments.
Submitted 2 June, 2025;
originally announced June 2025.
-
Unlocking Aha Moments via Reinforcement Learning: Advancing Collaborative Visual Comprehension and Generation
Authors:
Kaihang Pan,
Yang Wu,
Wendong Bu,
Kai Shen,
Juncheng Li,
Yingting Wang,
Yunfei Li,
Siliang Tang,
Jun Xiao,
Fei Wu,
Hang Zhao,
Yueting Zhuang
Abstract:
Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation. However, these two capabilities remain largely independent, as if they were two separate functions encapsulated within the same model. Consequently, visual comprehension does not enhance visual generation, and the reasoning mechanisms of LLMs have not been fully integrated to revolutionize image generation. In this paper, we propose to enable the collaborative co-evolution of visual comprehension and generation, advancing image generation into an iterative introspective process. We introduce a two-stage training approach: supervised fine-tuning equips the MLLM with the foundational ability to generate genuine CoT for visual generation, while reinforcement learning activates its full potential via an exploration-exploitation trade-off. Ultimately, we unlock the Aha moment in visual generation, advancing MLLMs from text-to-image tasks to unified image generation. Extensive experiments demonstrate that our model not only excels in text-to-image generation and image editing, but also functions as a superior image semantic evaluator with enhanced visual comprehension capabilities. Project Page: https://janus-pro-r1.github.io.
Submitted 2 June, 2025;
originally announced June 2025.
-
DeepRAG: Integrating Hierarchical Reasoning and Process Supervision for Biomedical Multi-Hop QA
Authors:
Yuelyu Ji,
Hang Zhang,
Shiven Verma,
Hui Ji,
Chun Li,
Yushui Han,
Yanshan Wang
Abstract:
We propose DeepRAG, a novel framework that integrates DeepSeek's hierarchical question decomposition capabilities with RAG Gym's unified retrieval-augmented generation optimization using process-level supervision. Targeting the challenging MedHopQA biomedical question answering task, DeepRAG systematically decomposes complex queries into precise sub-queries and employs concept-level reward signals informed by the UMLS ontology to enhance biomedical accuracy. Preliminary evaluations on the MedHopQA dataset indicate that DeepRAG significantly outperforms baseline models, including standalone DeepSeek and RAG Gym, achieving notable improvements in both Exact Match and concept-level accuracy.
Submitted 31 May, 2025;
originally announced June 2025.
-
Towards Multi-dimensional Evaluation of LLM Summarization across Domains and Languages
Authors:
Hyangsuk Min,
Yuho Lee,
Minjeong Ban,
Jiaqi Deng,
Nicole Hee-Yeon Kim,
Taewon Yun,
Hang Su,
Jason Cai,
Hwanjun Song
Abstract:
Evaluation frameworks for text summarization have evolved in terms of both domain coverage and metrics. However, existing benchmarks still lack domain-specific assessment criteria, remain predominantly English-centric, and face challenges with human annotation due to the complexity of reasoning. To address these, we introduce MSumBench, which provides a multi-dimensional, multi-domain evaluation of summarization in English and Chinese. It also incorporates specialized assessment criteria for each domain and leverages a multi-agent debate system to enhance annotation quality. By evaluating eight modern summarization models, we discover distinct performance patterns across domains and languages. We further examine large language models as summary evaluators, analyzing the correlation between their evaluation and summarization capabilities, and uncovering systematic bias in their assessment of self-generated summaries. Our benchmark dataset is publicly available at https://github.com/DISL-Lab/MSumBench.
Submitted 31 May, 2025;
originally announced June 2025.
-
M2WLLM: Multi-Modal Multi-Task Ultra-Short-term Wind Power Prediction Algorithm Based on Large Language Model
Authors:
Hang Fan,
Mingxuan Li,
Zuhan Zhang,
Long Cheng,
Yujian Ye,
Dunnan Liu
Abstract:
The integration of wind energy into power grids necessitates accurate ultra-short-term wind power forecasting to ensure grid stability and optimize resource allocation. This study introduces M2WLLM, an innovative model that leverages the capabilities of Large Language Models (LLMs) for predicting wind power output at granular time intervals. M2WLLM overcomes the limitations of traditional and deep learning methods by seamlessly integrating textual information and temporal numerical data, significantly improving wind power forecasting accuracy through multi-modal data. Its architecture features a Prompt Embedder and a Data Embedder, enabling an effective fusion of textual prompts and numerical inputs within the LLMs framework. The Semantic Augmenter within the Data Embedder translates temporal data into a format that the LLMs can comprehend, enabling it to extract latent features and improve prediction accuracy. The empirical evaluations conducted on wind farm data from three Chinese provinces demonstrate that M2WLLM consistently outperforms existing methods, such as GPT4TS, across various datasets and prediction horizons. The results highlight LLMs' ability to enhance accuracy and robustness in ultra-short-term forecasting and showcase their strong few-shot learning capabilities.
Submitted 31 May, 2025;
originally announced June 2025.
-
Characterizing the limiting critical Potts measures on locally regular-tree-like expander graphs
Authors:
Hang Du,
Yanxin Zhou
Abstract:
For any integers $d,q\ge 3$, we consider the $q$-state ferromagnetic Potts model with an external field on a sequence of expander graphs that converges to the $d$-regular tree $\mathtt{T}_d$ in the Benjamini-Schramm sense. We show that along the critical line, any subsequential local weak limit of the Potts measures is a mixture of the free and wired Potts Gibbs measures on $\mathtt{T}_d$. Furthermore, we show the possibility of an arbitrary extent of strong phase coexistence: for any $α\in [0,1]$, there exists a sequence of locally $\mathtt{T}_d$-like expander graphs $\{G_n\}$, such that the Potts measures on $\{G_n\}$ converge locally weakly to the $(α,1-α)$-mixture of the free and wired Potts Gibbs measures. Our result extends the results of \cite{HJP23}, which are restricted to the zero-field case and also require $q$ to be sufficiently large relative to $d$, and the results of \cite{BDS23}, which are restricted to the even $d$ case. We also confirm the phase coexistence prediction of \cite{BDS23}, asserting that the Potts local weak limit is a genuine mixture of the free and wired states in a generic setting. We further characterize the subsequential local weak limits of random cluster measures on such graph sequences, for any cluster parameter $q>2$ (not necessarily integer).
Submitted 30 May, 2025;
originally announced May 2025.
-
Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models
Authors:
Shilin Xu,
Yanwei Li,
Rui Yang,
Tao Zhang,
Yueyi Sun,
Wei Chow,
Linfeng Li,
Hang Song,
Qi Xu,
Yunhai Tong,
Xiangtai Li,
Hao Fei
Abstract:
Recent works on large language models (LLMs) have successfully demonstrated the emergence of reasoning capabilities via reinforcement learning (RL). Although recent efforts leverage group relative policy optimization (GRPO) for MLLM post-training, they consistently focus on a single aspect, such as grounding tasks, math problems, or chart analysis. No existing work leverages multi-source MLLM tasks for stable reinforcement learning. In this work, we present a unified perspective to solve this problem. We present Mixed-R1, a unified yet straightforward framework that contains a mixed reward function design (Mixed-Reward) and a mixed post-training dataset (Mixed-45K). We first design a data engine to select high-quality examples to build the Mixed-45K post-training dataset. Then, we present a Mixed-Reward design, which contains various reward functions for various MLLM tasks. In particular, it has four different reward functions: matching reward for binary answer or multiple-choice problems, chart reward for chart-aware datasets, IoU reward for grounding problems, and open-ended reward for long-form text responses such as caption datasets. To handle the various long-form text content, we propose a new open-ended reward named Bidirectional Max-Average Similarity (BMAS) by leveraging tokenizer embedding matching between the generated response and the ground truth. Extensive experiments show the effectiveness of our proposed method on various MLLMs, including Qwen2.5-VL and Intern-VL at various sizes. Our dataset and model are available at https://github.com/xushilin1/mixed-r1.
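As a rough illustration of two of the reward types named in this abstract, the following Python sketch shows what a matching reward and an IoU reward could look like. The function names, the `(x1, y1, x2, y2)` box format, and the case-insensitive string comparison are assumptions made for illustration only; the abstract does not specify how the authors implement either reward.

```python
def matching_reward(prediction: str, answer: str) -> float:
    """Hypothetical matching reward: 1.0 on an exact match after
    trimming whitespace and lowercasing, otherwise 0.0."""
    return 1.0 if prediction.strip().lower() == answer.strip().lower() else 0.0


def iou_reward(pred_box, gt_box) -> float:
    """Hypothetical IoU reward for two axis-aligned boxes.

    Boxes are assumed to be (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
    Returns intersection area divided by union area, in [0, 1].
    """
    # Corners of the intersection rectangle (may be empty).
    ix1 = max(pred_box[0], gt_box[0])
    iy1 = max(pred_box[1], gt_box[1])
    ix2 = min(pred_box[2], gt_box[2])
    iy2 = min(pred_box[3], gt_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_pred = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_gt = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    union = area_pred + area_gt - inter

    # Degenerate boxes with zero total area get zero reward.
    return inter / union if union > 0 else 0.0
```

Under these assumptions, two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, so `iou_reward((0, 0, 2, 2), (1, 1, 3, 3))` evaluates to 1/7.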
Submitted 29 May, 2025;
originally announced May 2025.
-
Beyond the LUMIR challenge: The pathway to foundational registration models
Authors:
Junyu Chen,
Shuwen Wei,
Joel Honkamaa,
Pekka Marttinen,
Hang Zhang,
Min Liu,
Yichao Zhou,
Zuopeng Tan,
Zhuoyuan Wang,
Yi Wang,
Hongchao Zhou,
Shunbo Hu,
Yi Zhang,
Qian Tao,
Lukas Förner,
Thomas Wendler,
Bailiang Jian,
Benedikt Wiestler,
Tim Hable,
Jin Kim,
Dan Ruan,
Frederic Madesta,
Thilo Sentker,
Wiebke Heyer,
Lianrui Zuo
, et al. (11 additional authors not shown)
Abstract:
Medical image challenges have played a transformative role in advancing the field, catalyzing algorithmic innovation and establishing new performance standards across diverse clinical applications. Image registration, a foundational task in neuroimaging pipelines, has similarly benefited from the Learn2Reg initiative. Building on this foundation, we introduce the Large-scale Unsupervised Brain MRI Image Registration (LUMIR) challenge, a next-generation benchmark designed to assess and advance unsupervised brain MRI registration. Distinct from prior challenges that leveraged anatomical label maps for supervision, LUMIR removes this dependency by providing over 4,000 preprocessed T1-weighted brain MRIs for training without any label maps, encouraging biologically plausible deformation modeling through self-supervision. In addition to evaluating performance on 590 held-out test subjects, LUMIR introduces a rigorous suite of zero-shot generalization tasks, spanning out-of-domain imaging modalities (e.g., FLAIR, T2-weighted, T2*-weighted), disease populations (e.g., Alzheimer's disease), acquisition protocols (e.g., 9.4T MRI), and species (e.g., macaque brains). A total of 1,158 subjects and over 4,000 image pairs were included for evaluation. Performance was assessed using both segmentation-based metrics (Dice coefficient, 95th percentile Hausdorff distance) and landmark-based registration accuracy (target registration error). Across both in-domain and zero-shot tasks, deep learning-based methods consistently achieved state-of-the-art accuracy while producing anatomically plausible deformation fields. The top-performing deep learning-based models demonstrated diffeomorphic properties and inverse consistency, outperforming several leading optimization-based methods, and showing strong robustness to most domain shifts, the exception being a drop in performance on out-of-domain contrasts.
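The Dice coefficient listed among the segmentation-based metrics is commonly defined for two binary masks A and B as 2|A∩B| / (|A| + |B|). A minimal Python sketch follows, assuming masks are given as flat sequences of 0/1 values; the function name and the convention of returning 1.0 for two empty masks are illustrative assumptions, not taken from the challenge's evaluation code.

```python
def dice_coefficient(mask_a, mask_b) -> float:
    """Dice overlap between two binary masks given as flat 0/1 sequences.

    Assumed convention: two empty masks count as a perfect match (1.0).
    """
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of elements")

    # Count voxels that are foreground in both masks.
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    size_a = sum(1 for a in mask_a if a)
    size_b = sum(1 for b in mask_b if b)

    if size_a + size_b == 0:
        return 1.0
    return 2.0 * intersection / (size_a + size_b)
```

For example, the masks `[1, 1, 0, 0]` and `[1, 0, 1, 0]` share one foreground element out of two each, giving a Dice score of 0.5.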
Submitted 29 May, 2025;
originally announced May 2025.
-
Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models
Authors:
Haohan Chi,
Huan-ang Gao,
Ziming Liu,
Jianing Liu,
Chenyu Liu,
Jinwei Li,
Kaisen Yang,
Yangcheng Yu,
Zeda Wang,
Wenyi Li,
Leichen Wang,
Xingtao Hu,
Hao Sun,
Hang Zhao,
Hao Zhao
Abstract:
Vision-Language-Action (VLA) models for autonomous driving show promise but falter in unstructured corner case scenarios, largely due to a scarcity of targeted benchmarks. To address this, we introduce Impromptu VLA. Our core contribution is the Impromptu VLA Dataset: over 80,000 meticulously curated video clips, distilled from over 2M source clips sourced from 8 open-source large-scale datasets. This dataset is built upon our novel taxonomy of four challenging unstructured categories and features rich, planning-oriented question-answering annotations and action trajectories. Crucially, experiments demonstrate that VLAs trained with our dataset achieve substantial performance gains on established benchmarks--improving closed-loop NeuroNCAP scores and collision rates, and reaching near state-of-the-art L2 accuracy in open-loop nuScenes trajectory prediction. Furthermore, our Q&A suite serves as an effective diagnostic, revealing clear VLM improvements in perception, prediction, and planning. Our code, data and models are available at https://github.com/ahydchh/Impromptu-VLA.
Submitted 29 May, 2025;
originally announced May 2025.
-
Quantum cohomology, shift operators, and Coulomb branches
Authors:
Ki Fung Chan,
Kwokwai Chan,
Chin Hang Eddie Lam
Abstract:
Given a complex reductive group $G$ and a $G$-representation $\mathbf{N}$, there is an associated quantized Coulomb branch algebra $\mathcal{A}_{G,\mathbf{N}}^\hbar$ defined by Braverman, Finkelberg and Nakajima. In this paper, we give a new interpretation of $\mathcal{A}_{G,\mathbf{N}}^\hbar$ as the largest subalgebra of the equivariant Borel--Moore homology of the affine Grassmannian on which shift operators can naturally be defined. As a main application, we show that if $X$ is a smooth semiprojective variety equipped with a $G$-action, and $X \to \mathbf{N}$ is a $G$-equivariant proper holomorphic map, then the equivariant big quantum cohomology $QH_G^\bullet(X)$ defines a quasi-coherent sheaf of algebras on the Coulomb branch with coisotropic support. Upon specializing the Novikov and bulk parameters, this sheaf becomes coherent with Lagrangian support. We also apply our construction to recover Teleman's gluing construction for Coulomb branches and derive different generalizations of the Peterson isomorphism.
Submitted 29 May, 2025;
originally announced May 2025.
-
Diffusion-Based Generative Models for 3D Occupancy Prediction in Autonomous Driving
Authors:
Yunshen Wang,
Yicheng Liu,
Tianyuan Yuan,
Yucheng Mao,
Yingshi Liang,
Xiuyu Yang,
Honggang Zhang,
Hang Zhao
Abstract:
Accurately predicting 3D occupancy grids from visual inputs is critical for autonomous driving, but current discriminative methods struggle with noisy data, incomplete observations, and the complex structures inherent in 3D scenes. In this work, we reframe 3D occupancy prediction as a generative modeling task using diffusion models, which learn the underlying data distribution and incorporate 3D scene priors. This approach improves prediction consistency and noise robustness, and better handles the intricacies of 3D spatial structures. Our extensive experiments show that diffusion-based generative models outperform state-of-the-art discriminative approaches, delivering more realistic and accurate occupancy predictions, especially in occluded or low-visibility regions. Moreover, the improved predictions significantly benefit downstream planning tasks, highlighting the practical advantages of our method for real-world autonomous driving applications.
Submitted 29 May, 2025;
originally announced May 2025.
-
Scalable Complexity Control Facilitates Reasoning Ability of LLMs
Authors:
Liangkai Hang,
Junjie Yao,
Zhiwei Bai,
Tianyi Chen,
Yang Chen,
Rongjie Diao,
Hezhou Li,
Pengxiao Lin,
Zhiwei Wang,
Cheng Xu,
Zhongwang Zhang,
Zhangchen Zhou,
Zhiyu Li,
Zehao Lin,
Kai Chen,
Feiyu Xiong,
Yaoyu Zhang,
Weinan E,
Hongkang Yang,
Zhi-Qin John Xu
Abstract:
The reasoning ability of large language models (LLMs) has been rapidly advancing in recent years, attracting interest in more fundamental approaches that can reliably enhance their generalizability. This work demonstrates that model complexity control, conveniently implementable by adjusting the initialization rate and weight decay coefficient, improves the scaling law of LLMs consistently over varying model sizes and data sizes. This gain is further illustrated by comparing the benchmark performance of 2.4B models pretrained on 1T tokens with different complexity hyperparameters. Instead of fixing the initialization std, we found that a constant initialization rate (the exponent of std) enables the scaling law to descend faster in both model and data sizes. These results indicate that complexity control is a promising direction for the continual advancement of LLMs.
Submitted 28 May, 2025;
originally announced May 2025.
-
Adapting Segment Anything Model for Power Transmission Corridor Hazard Segmentation
Authors:
Hang Chen,
Maoyuan Ye,
Peng Yang,
Haibin He,
Juhua Liu,
Bo Du
Abstract:
Power transmission corridor hazard segmentation (PTCHS) aims to separate transmission equipment and surrounding hazards from complex backgrounds, and is of great significance for maintaining electric power transmission safety. Recently, the Segment Anything Model (SAM) has emerged as a foundational vision model and pushed the boundaries of segmentation tasks. However, SAM struggles to deal with the target objects in complex transmission corridor scenarios, especially those with fine structure. In this paper, we propose ELE-SAM, adapting SAM for the PTCHS task. Technically, we develop a Context-Aware Prompt Adapter to achieve better prompt tokens by incorporating global-local features and focusing more on key regions. Subsequently, to tackle the hazard objects with fine structure in complex backgrounds, we design a High-Fidelity Mask Decoder by leveraging multi-granularity mask features and then scaling them to a higher resolution. Moreover, to train ELE-SAM and advance this field, we construct the ELE-40K benchmark, the first large-scale and real-world dataset for PTCHS including 44,094 image-mask pairs. Experimental results on ELE-40K show that ELE-SAM outperforms the baseline model by an average of 16.8% mIoU and 20.6% mBIoU. Moreover, compared with the state-of-the-art method on HQSeg-44K, average absolute improvements of 2.9% mIoU and 3.8% mBIoU further validate the effectiveness of our method on high-quality generic object segmentation. The source code and dataset are available at https://github.com/Hhaizee/ELE-SAM.
Submitted 28 May, 2025;
originally announced May 2025.
-
The experimental determination of exchange mass terms in surface states on both terminations of MnBi4Te7
Authors:
Dezhi Song,
Fuyang Hang,
Gang Yao,
Jun Zhang,
Ye-Ping Jiang,
Jin-Feng Jia
Abstract:
The intrinsic antiferromagnetic topological insulators in the Mn-Bi-Te family, composed of superlattice-like MnBi2Te4/(Bi2Te3)n (n = 0, 1, 2, 3...) layered structures, present intriguing states of matter such as the quantum anomalous Hall effect and the axion insulator. However, the surface state gap, which is the prerequisite for the observation of these states, remains elusive. Here, using molecular beam epitaxy, we obtain two types of MnBi4Te7 films with exclusively Bi2Te3 (BT) or MnBi2Te4 (MBT) terminations. Using scanning tunneling spectroscopy, we identify the mass terms in the surface states on both surface terminations. Experimental results reveal the existence of a hybridization gap of approximately 23 meV in surface states on the BT termination. This gap comes from the hybridization between the surface states and the spin-split states in the adjacent MBT layer. On the MBT termination, an exchange mass term of about 30 meV in surface states is identified from magnetic-field-dependent Landau level spectra combined with theoretical simulations. In addition, the mass term varies with the field in the film with a heavy BiMn doping level in the Mn layers. These findings demonstrate the existence of mass terms in surface states on both types of terminations in our epitaxial MnBi4Te7 films investigated by local probes.
Submitted 28 May, 2025;
originally announced May 2025.
-
Align-DA: Align Score-based Atmospheric Data Assimilation with Multiple Preferences
Authors:
Jing-An Sun,
Hang Fan,
Junchao Gong,
Ben Fei,
Kun Chen,
Fenghua Ling,
Wenlong Zhang,
Wanghan Xu,
Li Yan,
Pierre Gentine,
Lei Bai
Abstract:
Data assimilation (DA) aims to estimate the full state of a dynamical system by combining partial and noisy observations with a prior model forecast, commonly referred to as the background. In atmospheric applications, this problem is fundamentally ill-posed due to the sparsity of observations relative to the high-dimensional state space. Traditional methods address this challenge by simplifying background priors to regularize the solution, but these simplified priors are empirical and require continual tuning in practice. Inspired by alignment techniques in text-to-image diffusion models, we propose Align-DA, which formulates DA as a generative process and uses reward signals to guide background priors, replacing manual tuning with data-driven alignment. Specifically, we train a score-based model in the latent space to approximate the background-conditioned prior, and align it using three complementary reward signals for DA: (1) assimilation accuracy, (2) forecast skill initialized from the assimilated state, and (3) physical adherence of the analysis fields. Experiments with multiple reward signals demonstrate consistent improvements in analysis quality across different evaluation metrics and observation-guidance strategies. These results show that preference alignment, implemented as a soft constraint, can automatically adapt complex background priors tailored to DA, offering a promising new direction for advancing the field.
Submitted 28 May, 2025;
originally announced May 2025.
-
DiffDecompose: Layer-Wise Decomposition of Alpha-Composited Images via Diffusion Transformers
Authors:
Zitong Wang,
Hang Zhao,
Qianyu Zhou,
Xuequan Lu,
Xiangtai Li,
Yiren Song
Abstract:
Diffusion models have recently achieved great success in many generation tasks such as object removal. Nevertheless, existing image decomposition methods struggle to disentangle semi-transparent or transparent layer occlusions due to mask prior dependencies, static object assumptions, and the lack of datasets. In this paper, we delve into a novel task: Layer-Wise Decomposition of Alpha-Composited Images, aiming to recover constituent layers from single overlapped images under the condition of semi-transparent/transparent alpha layer non-linear occlusion. To address challenges in layer ambiguity, generalization, and data scarcity, we first introduce AlphaBlend, the first large-scale and high-quality dataset for transparent and semi-transparent layer decomposition, supporting six real-world subtasks (e.g., translucent flare removal, semi-transparent cell decomposition, glassware decomposition). Building on this dataset, we present DiffDecompose, a diffusion Transformer-based framework that learns the posterior over possible layer decompositions conditioned on the input image, semantic prompts, and blending type. Rather than regressing alpha mattes directly, DiffDecompose performs In-Context Decomposition, enabling the model to predict one or multiple layers without per-layer supervision, and introduces Layer Position Encoding Cloning to maintain pixel-level correspondence across layers. Extensive experiments on the proposed AlphaBlend dataset and the public LOGO dataset verify the effectiveness of DiffDecompose. The code and dataset will be available upon paper acceptance; the code will be available at https://github.com/Wangzt1121/DiffDecompose.
Submitted 30 May, 2025; v1 submitted 24 May, 2025;
originally announced May 2025.
-
ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models
Authors:
Dingming Li,
Hongxing Li,
Zixuan Wang,
Yuchen Yan,
Hang Zhang,
Siqi Chen,
Guiyang Hou,
Shengpei Jiang,
Wenqi Zhang,
Yongliang Shen,
Weiming Lu,
Yueting Zhuang
Abstract:
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content, but significant challenges persist in tasks requiring cross-viewpoint understanding and spatial reasoning. We identify a critical limitation: current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints when required to adopt another entity's spatial frame of reference. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for evaluating multi-viewpoint spatial localization across five distinct task types, supported by an automated 3D annotation pipeline that generates precise directional labels. Comprehensive evaluation of diverse VLMs on ViewSpatial-Bench reveals a significant performance disparity: models demonstrate reasonable performance on camera-perspective tasks but exhibit reduced accuracy when reasoning from a human viewpoint. By fine-tuning VLMs on our multi-perspective spatial dataset, we achieve an overall performance improvement of 46.24% across tasks, highlighting the efficacy of our approach. Our work establishes a crucial benchmark for spatial intelligence in embodied AI systems and provides empirical evidence that modeling 3D spatial relationships enhances VLMs' corresponding spatial comprehension capabilities.
Submitted 27 May, 2025;
originally announced May 2025.
-
Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity
Authors:
Yehui Tang,
Xiaosong Li,
Fangcheng Liu,
Wei Guo,
Hang Zhou,
Yaoyuan Wang,
Kai Han,
Xianzhi Yu,
Jinpeng Li,
Hui Zang,
Fei Mi,
Xiaojun Meng,
Zhicheng Liu,
Hanting Chen,
Binfan Zheng,
Can Chen,
Youliang Yan,
Ruiming Tang,
Peifeng Qin,
Xinghao Chen,
Dacheng Tao,
Yunhe Wang
Abstract:
The emergence of Mixture of Experts (MoE) in Large Language Models promises a much larger model parameter count and learning capacity at a small execution cost, because only a small fraction of parameters are activated for each input token. However, it is commonly observed that some experts are activated far more often than others, leading to system inefficiency when running the experts on different devices in parallel. Therefore, we introduce Mixture of Grouped Experts (MoGE), which groups the experts during selection and inherently balances the expert workload better than MoE. It constrains tokens to activate an equal number of experts within each predefined expert group. When model execution is distributed across multiple devices, this architectural design ensures a balanced computational load across devices, significantly enhancing throughput, particularly for the inference phase. Further, we build Pangu Pro MoE on Ascend NPUs, a sparse model based on MoGE with 72 billion total parameters, 16 billion of which are activated for each token. The configuration of Pangu Pro MoE is optimized for Ascend 300I Duo and 800I A2 through extensive system simulation studies. Our experiments indicate that MoGE indeed leads to better expert load balancing and more efficient execution for both model training and inference on Ascend NPUs. The inference performance of Pangu Pro MoE achieves 1148 tokens/s per card and can be further improved to 1528 tokens/s per card by speculative acceleration, outperforming comparable 32B and 72B Dense models. Furthermore, we achieve an excellent cost-to-performance ratio for model inference on Ascend 300I Duo. Our studies show that Ascend NPUs are capable of training Pangu Pro MoE with massive parallelization to make it a leading model within the sub-100B total parameter class, outperforming prominent open-source models like GLM-Z1-32B and Qwen3-32B.
Submitted 28 May, 2025; v1 submitted 27 May, 2025;
originally announced May 2025.
-
Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis
Authors:
Tianyi Xu,
Hongjie Chen,
Wang Qing,
Lv Hang,
Jian Kang,
Li Jie,
Zhennan Lin,
Yongxiang Li,
Xie Lei
Abstract:
Large-scale training corpora have significantly improved the performance of ASR models. Unfortunately, due to the relative scarcity of data, Chinese accents and dialects remain a challenge for most ASR models. Recent advancements in self-supervised learning have shown that self-supervised pre-training, combined with large language models (LLMs), can effectively enhance ASR performance in low-resource scenarios. We aim to investigate the effectiveness of this paradigm for Chinese dialects. Specifically, we pre-train a Data2vec2 model on 300,000 hours of unlabeled dialect and accented speech data and perform alignment training on a supervised dataset of 40,000 hours. Then, we systematically examine the impact of various projectors and LLMs on Mandarin, dialect, and accented speech recognition performance under this paradigm. Our method achieved SOTA results on multiple dialect datasets, including Kespeech. We will open-source our work to promote reproducible research.
Submitted 16 June, 2025; v1 submitted 27 May, 2025;
originally announced May 2025.
-
PMA: Towards Parameter-Efficient Point Cloud Understanding via Point Mamba Adapter
Authors:
Yaohua Zha,
Yanzi Wang,
Hang Guo,
Jinpeng Wang,
Tao Dai,
Bin Chen,
Zhihao Ouyang,
Xue Yuerong,
Ke Chen,
Shu-Tao Xia
Abstract:
Applying pre-trained models to assist point cloud understanding has recently become a mainstream paradigm in 3D perception. However, existing application strategies are straightforward, utilizing only the final output of the pre-trained model for various task heads. This neglects the rich complementary information in the intermediate layers, thereby failing to fully unlock the potential of pre-trained models. To overcome this limitation, we propose an orthogonal solution: Point Mamba Adapter (PMA), which constructs an ordered feature sequence from all layers of the pre-trained model and leverages Mamba to fuse all complementary semantics, thereby promoting comprehensive point cloud understanding. Constructing this ordered sequence is non-trivial due to the inherent isotropy of 3D space. Therefore, we further propose a geometry-constrained gate prompt generator (G2PG) shared across different layers, which applies shared geometric constraints to the output gates of the Mamba and dynamically optimizes the spatial order, thus enabling more effective integration of multi-layer information. Extensive experiments conducted on challenging point cloud datasets across various tasks demonstrate that our PMA elevates the capability for point cloud understanding to a new level by fusing diverse complementary intermediate features. Code is available at https://github.com/zyh16143998882/PMA.
Submitted 27 May, 2025;
originally announced May 2025.
-
Automated Privacy Information Annotation in Large Language Model Interactions
Authors:
Hang Zeng,
Xiangyu Liu,
Yong Hu,
Chaoyue Niu,
Fan Wu,
Shaojie Tang,
Guihai Chen
Abstract:
Users interacting with large language models (LLMs) under their real identifiers often unknowingly risk disclosing private information. Automatically notifying users whether their queries leak privacy and which phrases leak what private information has therefore become a practical need. Existing privacy detection methods, however, were designed for different objectives and application scenarios, typically tagging personally identifiable information (PII) in anonymous content. In this work, to support the development and evaluation of privacy detection models for LLM interactions that are deployable on local user devices, we construct a large-scale multilingual dataset with 249K user queries and 154K annotated privacy phrases. In particular, we build an automated privacy annotation pipeline with cloud-based strong LLMs to automatically extract privacy phrases from dialogue datasets and annotate leaked information. We also design evaluation metrics at the levels of privacy leakage, extracted privacy phrase, and privacy information. We further establish baseline methods using light-weight LLMs with both tuning-free and tuning-based methods, and report a comprehensive evaluation of their performance. Evaluation results reveal a gap between current performance and the requirements of real-world LLM applications, motivating future research into more effective local privacy detection methods grounded in our dataset.
Submitted 27 May, 2025;
originally announced May 2025.
-
MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness
Authors:
Yunlong Tang,
Pinxin Liu,
Mingqian Feng,
Zhangyun Tan,
Rui Mao,
Chao Huang,
Jing Bi,
Yunzhong Xiao,
Susan Liang,
Hang Hua,
Ali Vosoughi,
Luchuan Song,
Zeliang Zhang,
Chenliang Xu
Abstract:
Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs' understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities, such as vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, invariance to perspective-preserving transformations, etc. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns between model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources available at: https://yunlong10.github.io/MMPerspective/
Submitted 26 May, 2025;
originally announced May 2025.
-
Breaking the Quadrillion Determinant Barrier in Numerically Exact Configuration Interaction
Authors:
Agam Shayit,
Can Liao,
Shiv Upadhyay,
Hang Hu,
Tianyuan Zhang,
Eugene DePrince III,
Chao Yang,
Xiaosong Li
Abstract:
The combinatorial scaling of configuration interaction (CI) has long restricted its applicability to only the simplest molecular systems. Here, we report the first numerically exact CI calculation exceeding one quadrillion ($10^{15}$) determinants, enabled by lossless categorical compression within the small-tensor-product distributed active space (STP-DAS) framework. As a demonstration, we converged the relativistic full CI (FCI) ground state of a magnesium atom involving over $10^{15}$ complex-valued 2-spinor determinants in under 8.6 hours (time-to-completion) using 1500 nodes, representing the largest FCI calculation reported to date. Additionally, we achieved $\boldsymbol{\sigma}$-build times of just 5 minutes for systems with approximately 150 billion complex-valued 2-spinor determinants using only a few compute nodes. Extensive benchmarks confirm that the method retains numerical exactness with drastically reduced resource demands. Compared to previous state-of-the-art FCI calculations, this work represents a 3-orders-of-magnitude increase in CI space, a 6-orders-of-magnitude increase in FLOP count, and a 6-orders-of-magnitude improvement in computational speed. By introducing a lossless, categorically compressed representation of the CI expansion vectors and reformulating the $\boldsymbol{\sigma}$-build accordingly, we eliminate memory bottlenecks associated with storing excitation lists and CI vectors while significantly reducing computational cost. A compression-compatible preconditioner further enhances performance by generating compressed CI expansion vectors throughout Davidson iterations. This work establishes a new computational frontier for numerically exact CI methods, enabling chemically and physically accurate simulations of strongly correlated, spin-orbit coupled systems previously thought to be beyond reach.
Submitted 26 May, 2025;
originally announced May 2025.
-
SuperAD: A Training-free Anomaly Classification and Segmentation Method for CVPR 2025 VAND 3.0 Workshop Challenge Track 1: Adapt & Detect
Authors:
Huaiyuan Zhang,
Hang Chen,
Yu Cheng,
Shunyi Wu,
Linghao Sun,
Linao Han,
Zeyu Shi,
Lei Qi
Abstract:
In this technical report, we present our solution to the CVPR 2025 Visual Anomaly and Novelty Detection (VAND) 3.0 Workshop Challenge Track 1: Adapt & Detect: Robust Anomaly Detection in Real-World Applications. In real-world industrial anomaly detection, it is crucial to accurately identify anomalies with physical complexity, such as transparent or reflective surfaces, occlusions, and low-contrast contaminations. The recently proposed MVTec AD 2 dataset significantly narrows the gap between publicly available benchmarks and anomalies found in real-world industrial environments. To address the challenges posed by this dataset, such as complex and varying lighting conditions and real anomalies with large scale differences, we propose SuperAD, a fully training-free anomaly detection and segmentation method based on feature extraction with the DINOv2 model. Our method carefully selects a small number of normal reference images and constructs a memory bank by leveraging the strong representational power of DINOv2. Anomalies are then segmented by performing nearest neighbor matching between test image features and the memory bank. Our method achieves competitive results on both test sets of the MVTec AD 2 dataset.
Submitted 27 May, 2025; v1 submitted 26 May, 2025;
originally announced May 2025.
-
Flow approach on Riesz type nonlocal energies
Authors:
Jiaxin He,
Qinfeng Li,
Juncheng Wei,
Hang Yang
Abstract:
Via continuous deformations based on natural flow evolutions, we prove several novel monotonicity results for Riesz-type nonlocal energies on triangles and quadrilaterals. Some of these results imply new and simpler proofs for known theorems without relying on any symmetrization arguments.
Submitted 26 May, 2025;
originally announced May 2025.
-
In-depth Investigation of Conduction Mechanism on Defect-induced Proton-conducting Electrolytes BaHfO$_3$
Authors:
Peng Feng,
Hang Ma,
Kuan Yang,
Yingjie Lv,
Ying Liang,
Tianxing Ma,
Jiajun Linghu,
Zhi-Peng Li
Abstract:
This study uses first-principles computational methods to comprehensively analyze the impact of A-site doping on the proton conduction properties of BaHfO$_3$. The goal is to offer theoretical support for the advancement of electrolyte materials for solid oxide fuel cells. We find that BaHfO$_3$ shows promising potential for proton conduction, with a low proton migration barrier of 0.28 eV, suggesting that efficient proton conduction can be achieved at lower temperatures. Through A-site doping, particularly with low-valence state ions and the introduction of Ba vacancies, we can effectively decrease the formation energy of oxygen vacancies (Evac), leading to an increase in proton concentration. Additionally, our study reveals that the primary mechanism for proton migration in BaHfO$_3$ is the Grotthuss mechanism rather than the vehicle mechanism. Examination of the changes in lattice parameters during proton migration indicates that while doping or vacancy control strategies do not alter the mode of H$^+$ migration, they do influence the migration pathway and barrier. These findings provide valuable insights into optimizing the proton conduction properties of BaHfO$_3$ through A-site doping and lay a solid theoretical foundation for the development of novel, highly efficient solid oxide fuel cell electrolyte materials.
Submitted 26 May, 2025;
originally announced May 2025.
-
MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models
Authors:
Hang Hua,
Ziyun Zeng,
Yizhi Song,
Yunlong Tang,
Liu He,
Daniel Aliaga,
Wei Xiong,
Jiebo Luo
Abstract:
Recent multimodal image generators such as GPT-4o, Gemini 2.0 Flash, and Gemini 2.5 Pro excel at following complex instructions, editing images, and maintaining concept consistency. However, they are still evaluated by disjoint toolkits: text-to-image (T2I) benchmarks that lack multi-modal conditioning, and customized image generation benchmarks that overlook compositional semantics and common knowledge. We propose MMIG-Bench, a comprehensive Multi-Modal Image Generation Benchmark that unifies these tasks by pairing 4,850 richly annotated text prompts with 1,750 multi-view reference images across 380 subjects, spanning humans, animals, objects, and artistic styles. MMIG-Bench is equipped with a three-level evaluation framework: (1) low-level metrics for visual artifacts and identity preservation of objects; (2) a novel Aspect Matching Score (AMS), a VQA-based mid-level metric that delivers fine-grained prompt-image alignment and shows strong correlation with human judgments; and (3) high-level metrics for aesthetics and human preference. Using MMIG-Bench, we benchmark 17 state-of-the-art models, including Gemini 2.5 Pro, FLUX, DreamBooth, and IP-Adapter, and validate our metrics with 32k human ratings, yielding in-depth insights into architecture and data design.
Submitted 27 May, 2025; v1 submitted 25 May, 2025;
originally announced May 2025.
-
SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning
Authors:
Kun Xiang,
Heng Li,
Terry Jingchen Zhang,
Yinya Huang,
Zirong Liu,
Peixin Qu,
Jixi He,
Jiaqi Chen,
Yu-Jie Yuan,
Jianhua Han,
Hang Xu,
Hanhui Li,
Mrinmaya Sachan,
Xiaodan Liang
Abstract:
We present SeePhys, a large-scale multimodal benchmark for LLM reasoning grounded in physics questions ranging from middle school to PhD qualifying exams. The benchmark covers 7 fundamental domains spanning the physics discipline, incorporating 21 categories of highly heterogeneous diagrams. In contrast to prior works where visual elements mainly serve auxiliary purposes, our benchmark features a substantial proportion of vision-essential problems (75%) that mandate visual information extraction for correct solutions. Through extensive evaluation, we observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy on our benchmark. These results reveal fundamental challenges in current large language models' visual understanding capabilities, particularly in: (i) establishing rigorous coupling between diagram interpretation and physics reasoning, and (ii) overcoming their persistent reliance on textual cues as cognitive shortcuts.
Submitted 17 June, 2025; v1 submitted 25 May, 2025;
originally announced May 2025.
-
Designing Pin-pression Gripper and Learning its Dexterous Grasping with Online In-hand Adjustment
Authors:
Hewen Xiao,
Xiuping Liu,
Hang Zhao,
Jian Liu,
Kai Xu
Abstract:
We introduce a novel design of parallel-jaw grippers drawing inspiration from pin-pression toys. The proposed pin-pression gripper features a distinctive mechanism in which each finger integrates a 2D array of pins capable of independent extension and retraction. This unique design allows the gripper to instantaneously customize its finger's shape to conform to the object being grasped by dynamically adjusting the extension/retraction of the pins. In addition, the gripper excels in in-hand re-orientation of objects for enhanced grasping stability again via dynamically adjusting the pins. To learn the dynamic grasping skills of pin-pression grippers, we devise a dedicated reinforcement learning algorithm with careful designs of state representation and reward shaping. To achieve a more efficient grasp-while-lift grasping mode, we propose a curriculum learning scheme. Extensive evaluations demonstrate that our design, together with the learned skills, leads to highly flexible and robust grasping with much stronger generality to unseen objects than alternatives. We also highlight encouraging physical results of sim-to-real transfer on a physically manufactured pin-pression gripper, demonstrating the practical significance of our novel gripper design and grasping skill. Demonstration videos for this paper are available at https://github.com/siggraph-pin-pression-gripper/pin-pression-gripper-video.
Submitted 25 May, 2025;
originally announced May 2025.
-
A high-efficiency neuroevolution potential for tobermorite and calcium silicate hydrate systems with ab initio accuracy
Authors:
Xiao Xu,
Shijie Wang,
Haifeng Qin,
Zhiqiang Zhao,
Zheyong Fan,
Zhuhua Zhang,
Hang Yin
Abstract:
Tobermorite and Calcium Silicate Hydrate (C-S-H) systems are indispensable cement materials but still lack a satisfactory interatomic potential with both high accuracy and high computational efficiency for better understanding their mechanical performance. Here, we develop a Neuroevolution Machine Learning Potential (NEP) with a Ziegler-Biersack-Littmark hybrid framework for tobermorite and C-S-H systems, which delivers unprecedented efficiency in molecular dynamics simulations with substantially reduced training datasets. Our NEP model achieves prediction accuracy comparable to DFT calculations using just around 300 training structures, significantly fewer than other existing machine learning potentials trained for tobermorite. Critically, the GPU-accelerated NEP computations enable scalable simulations of large tobermorite systems, reaching several thousand atoms per GPU card with high efficiency. We demonstrate the NEP's versatility by accurately predicting mechanical properties, phonon density of states, and thermal conductivity of tobermorite. Furthermore, we extend the NEP application to large-scale simulations of amorphous C-S-H, highlighting its potential for comprehensive analysis of structural and mechanical behaviors under various realistic conditions.
Submitted 25 May, 2025;
originally announced May 2025.
-
LiteCUA: Computer as MCP Server for Computer-Use Agent on AIOS
Authors:
Kai Mei,
Xi Zhu,
Hang Gao,
Shuhang Lin,
Yongfeng Zhang
Abstract:
We present AIOS 1.0, a novel platform designed to advance computer-use agent (CUA) capabilities through environmental contextualization. While existing approaches primarily focus on building more powerful agent frameworks or enhancing agent models, we identify a fundamental limitation: the semantic disconnect between how language models understand the world and how computer interfaces are structured. AIOS 1.0 addresses this challenge by transforming computers into contextual environments that language models can natively comprehend, implementing a Model Context Protocol (MCP) server architecture to abstract computer states and actions. This approach effectively decouples interface complexity from decision complexity, enabling agents to reason more effectively about computing environments. To demonstrate our platform's effectiveness, we introduce LiteCUA, a lightweight computer-use agent built on AIOS 1.0 that achieves a 14.66% success rate on the OSWorld benchmark, outperforming several specialized agent frameworks despite its simple architecture. Our results suggest that contextualizing computer environments for language models represents a promising direction for developing more capable computer-use agents and advancing toward AI that can interact with digital systems. The source code of LiteCUA is available at https://github.com/agiresearch/LiteCUA, and it is also integrated into the AIOS main branch as part of AIOS at https://github.com/agiresearch/AIOS.
Submitted 24 May, 2025;
originally announced May 2025.
-
Towards Natural Language Communication for Cooperative Autonomous Driving via Self-Play
Authors:
Jiaxun Cui,
Chen Tang,
Jarrett Holtz,
Janice Nguyen,
Alessandro G. Allievi,
Hang Qiu,
Peter Stone
Abstract:
Past work has demonstrated that autonomous vehicles can drive more safely if they communicate with one another than if they do not. However, their communication has often not been human-understandable. Using natural language as a vehicle-to-vehicle (V2V) communication protocol offers the potential for autonomous vehicles to drive cooperatively not only with each other but also with human drivers. In this work, we propose a suite of traffic tasks in autonomous driving where vehicles in a traffic scenario need to communicate in natural language to facilitate coordination in order to avoid an imminent collision and/or support efficient traffic flow. To this end, this paper introduces a novel method, LLM+Debrief, to learn a message generation and high-level decision-making policy for autonomous vehicles through multi-agent discussion. To evaluate LLM agents for driving, we developed a gym-like simulation environment that contains a range of driving scenarios. Our experimental results demonstrate that LLM+Debrief is more effective at generating meaningful and human-understandable natural language messages to facilitate cooperation and coordination than a zero-shot LLM agent. Our code and demo videos are available at https://talking-vehicles.github.io/.
Submitted 23 May, 2025;
originally announced May 2025.
-
PhySense: Sensor Placement Optimization for Accurate Physics Sensing
Authors:
Yuezhou Ma,
Haixu Wu,
Hang Zhou,
Huikun Weng,
Jianmin Wang,
Mingsheng Long
Abstract:
Physics sensing plays a central role in many scientific and engineering domains, which inherently involves two coupled tasks: reconstructing dense physical fields from sparse observations and optimizing scattered sensor placements to observe maximum information. While deep learning has made rapid advances in sparse-data reconstruction, existing methods generally omit optimization of sensor placements, leaving the mutual enhancement between reconstruction and placement on the shelf. To change this suboptimal practice, we propose PhySense, a synergistic two-stage framework that learns to jointly reconstruct physical fields and to optimize sensor placements, both aiming for accurate physics sensing. The first stage involves a flow-based generative model enhanced by cross-attention to adaptively fuse sparse observations. Leveraging the reconstruction feedback, the second stage performs sensor placement via projected gradient descent to satisfy spatial constraints. We further prove that the learning objectives of the two stages are consistent with classical variance-minimization principles, providing theoretical guarantees. Extensive experiments across three challenging benchmarks, especially a 3D geometry dataset, indicate PhySense achieves state-of-the-art physics sensing accuracy and discovers informative sensor placements previously unconsidered.
Submitted 26 May, 2025; v1 submitted 19 May, 2025;
originally announced May 2025.
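The PhySense abstract above mentions sensor placement "via projected gradient descent to satisfy spatial constraints". The sketch below illustrates only the generic projected-gradient idea: take a gradient step, then project back onto the feasible set. The box constraint, the function names, and the toy objective are assumptions made for illustration and are not taken from the paper.

```python
def project_to_box(point, lo, hi):
    """Clamp every coordinate into [lo, hi] (assumed spatial constraint)."""
    return [min(max(x, lo), hi) for x in point]


def projected_gradient_descent(grad_fn, x0, lo, hi, lr=0.1, steps=100):
    """Alternate a gradient step with a projection onto the box [lo, hi]^d."""
    x = project_to_box(x0, lo, hi)
    for _ in range(steps):
        g = grad_fn(x)
        stepped = [xi - lr * gi for xi, gi in zip(x, g)]
        x = project_to_box(stepped, lo, hi)
    return x


def toy_gradient(p):
    """Gradient of the hypothetical objective (x - 2)^2 + (y + 1)^2."""
    return [2.0 * (p[0] - 2.0), 2.0 * (p[1] + 1.0)]


placement = projected_gradient_descent(toy_gradient, [0.5, 0.5], lo=0.0, hi=1.0)
```

The unconstrained minimum of the toy objective lies outside the unit square, so the projection is what keeps the hypothetical sensor position feasible.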
-
Not All Tokens Are What You Need In Thinking
Authors:
Hang Yuan,
Bin Yu,
Haotian Li,
Shijun Yang,
Christina Dan Wang,
Zhou Yu,
Xueyin Xu,
Weizhen Qi,
Kai Chen
Abstract:
Modern reasoning models, such as OpenAI's o1 and DeepSeek-R1, exhibit impressive problem-solving capabilities but suffer from critical inefficiencies: high inference latency, excessive computational resource consumption, and a tendency toward overthinking -- generating verbose chains of thought (CoT) laden with redundant tokens that contribute minimally to the final answer. To address these issues, we propose Conditional Token Selection (CTS), a token-level compression framework with a flexible and variable compression ratio that identifies and preserves only the most essential tokens in CoT. CTS evaluates each token's contribution to deriving correct answers using conditional importance scoring, then trains models on compressed CoT. Extensive experiments demonstrate that CTS effectively compresses long CoT while maintaining strong reasoning performance. Notably, on the GPQA benchmark, Qwen2.5-14B-Instruct trained with CTS achieves a 9.1% accuracy improvement with 13.2% fewer reasoning tokens (13% training token reduction). Further reducing training tokens by 42% incurs only a marginal 5% accuracy drop while yielding a 75.8% reduction in reasoning tokens, highlighting the prevalence of redundancy in existing CoT.
Submitted 23 May, 2025;
originally announced May 2025.
-
d-Boolean algebras and their bitopological representation
Authors:
Hang Yang,
Dexue Zhang
Abstract:
We present a Stone duality for bitopological spaces in analogy to the duality between Stone spaces and Boolean algebras, in the same vein as the duality between d-sober bitopological spaces and spatial d-frames established by Jung and Moshier. Precisely, we introduce the notion of d-Boolean algebras and prove that the category of such algebras is dually equivalent to the category of Stone bitopological spaces, which are compact and zero-dimensional bitopological spaces satisfying the T0 separation axiom.
Submitted 23 May, 2025;
originally announced May 2025.
-
Understanding Pre-training and Fine-tuning from Loss Landscape Perspectives
Authors:
Huanran Chen,
Yinpeng Dong,
Zeming Wei,
Yao Huang,
Yichi Zhang,
Hang Su,
Jun Zhu
Abstract:
Recent studies have revealed that the loss landscape of large language models resembles a basin, within which the models perform nearly identically, and outside of which they lose all their capabilities. In this work, we conduct further studies on the loss landscape of large language models. We discover that pre-training creates a "basic capability" basin, and subsequent fine-tuning creates "specific capability" basins (e.g., math, safety, coding) within the basic capability basin. We further investigate two types of loss landscapes: the most-case landscape (i.e., the landscape along most directions) and the worst-case landscape (i.e., the landscape along the worst direction). We argue that as long as benign fine-tuning remains within the most-case basin, it will not compromise previous capabilities. Similarly, any fine-tuning (including the adversarial one) that stays within the worst-case basin would not compromise previous capabilities. Finally, we theoretically demonstrate that the size of the most-case basin can bound the size of the worst-case basin and the robustness with respect to input perturbations. We also show that, due to the over-parameterization property of current large language models, one can easily enlarge the basins by five times.
Submitted 23 May, 2025;
originally announced May 2025.
-
Code Graph Model (CGM): A Graph-Integrated Large Language Model for Repository-Level Software Engineering Tasks
Authors:
Hongyuan Tao,
Ying Zhang,
Zhenhao Tang,
Hongen Peng,
Xukun Zhu,
Bingchang Liu,
Yingguang Yang,
Ziyin Zhang,
Zhaogui Xu,
Haipeng Zhang,
Linchao Zhu,
Rui Wang,
Hang Yu,
Jianguo Li,
Peng Di
Abstract:
Recent advances in Large Language Models (LLMs) have shown promise in function-level code generation, yet repository-level software engineering tasks remain challenging. Current solutions predominantly rely on proprietary LLM agents, which introduce unpredictability and limit accessibility, raising concerns about data privacy and model customization. This paper investigates whether open-source LLMs can effectively address repository-level tasks without requiring agent-based approaches. We demonstrate this is possible by enabling LLMs to comprehend functions and files within codebases through their semantic information and structural dependencies. To this end, we introduce Code Graph Models (CGMs), which integrate repository code graph structures into the LLM's attention mechanism and map node attributes to the LLM's input space using a specialized adapter. When combined with an agentless graph RAG framework, our approach achieves a 43.00% resolution rate on the SWE-bench Lite benchmark using the open-source Qwen2.5-72B model. This performance ranks first among open weight models, second among methods with open-source systems, and eighth overall, surpassing the previous best open-source model-based method by 12.33%.
Submitted 19 June, 2025; v1 submitted 22 May, 2025;
originally announced May 2025.
-
Towards Texture- And Shape-Independent 3D Keypoint Estimation in Birds
Authors:
Valentin Schmuker,
Alex Hoi Hang Chan,
Bastian Goldluecke,
Urs Waldmann
Abstract:
In this paper, we present a texture-independent approach to estimate and track 3D joint positions of multiple pigeons. For this purpose, we build upon the existing 3D-MuPPET framework, which estimates and tracks the 3D poses of up to 10 pigeons using a multi-view camera setup. We extend this framework by using a segmentation method that generates silhouettes of the individuals, which are then used to estimate 2D keypoints. Following 3D-MuPPET, these 2D keypoints are triangulated to infer 3D poses, and identities are matched in the first frame and tracked in 2D across subsequent frames. Our proposed texture-independent approach achieves comparable accuracy to the original texture-dependent 3D-MuPPET framework. Additionally, we explore our approach's applicability to other bird species. To do that, we infer the 2D joint positions of four bird species without additionally fine-tuning the model trained on pigeons and obtain promising preliminary results. Thus, we think that our approach serves as a solid foundation for, and can inspire, the development of more robust and accurate texture-independent pose estimation frameworks.
Submitted 22 May, 2025;
originally announced May 2025.
-
Monitoring in the Dark: Privacy-Preserving Runtime Verification of Cyber-Physical Systems
Authors:
Charles Koll,
Preston Tan Hang,
Mike Rosulek,
Houssam Abbas
Abstract:
In distributed Cyber-Physical Systems and Internet-of-Things applications, the nodes of the system send measurements to a monitor that checks whether these measurements satisfy given formal specifications. For instance in Urban Air Mobility, a local traffic authority will be monitoring drone traffic to evaluate its flow and detect emerging problematic patterns. Certain applications require both the specification and the measurements to be private -- i.e. known only to their owners. Examples include traffic monitoring, testing of integrated circuit designs, and medical monitoring by wearable or implanted devices. In this paper we propose a protocol that enables privacy-preserving robustness monitoring. By following our protocol, both system (e.g. drone) and monitor (e.g. traffic authority) only learn the robustness of the measured trace w.r.t. the specification. But the system learns nothing about the formula, and the monitor learns nothing about the signal monitored. We do this using garbled circuits, for specifications in Signal Temporal Logic interpreted over timed state sequences. We analyze the runtime and memory overhead of privacy preservation, the size of the circuits, and their practicality for three different usage scenarios: design testing, offline monitoring, and online monitoring of Cyber-Physical Systems.
Submitted 21 May, 2025;
originally announced May 2025.
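The monitoring abstract above has both parties learn only the robustness of a trace with respect to a Signal Temporal Logic specification. As a plain, non-private illustration of what that robustness value is, the sketch below uses the standard quantitative semantics for two simple formulas over a sampled signal. It assumes uniformly sampled, untimed data and predicates of the form x > c, and it involves no garbled circuits. The function names are hypothetical.

```python
def robustness_always_greater(signal, threshold):
    """Robustness of "always (x > threshold)": the worst-case margin over the trace."""
    return min(x - threshold for x in signal)


def robustness_eventually_greater(signal, threshold):
    """Robustness of "eventually (x > threshold)": the best-case margin over the trace."""
    return max(x - threshold for x in signal)


# Hypothetical altitude samples checked against a 2.0 m floor.
altitude = [3.0, 2.5, 4.0]
always_margin = robustness_always_greater(altitude, 2.0)
eventually_margin = robustness_eventually_greater(altitude, 2.0)
```

A positive value means the formula is satisfied with that much slack, and a negative value means it is violated by that amount.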
-
Challenger: Affordable Adversarial Driving Video Generation
Authors:
Zhiyuan Xu,
Bohan Li,
Huan-ang Gao,
Mingju Gao,
Yong Chen,
Ming Liu,
Chenxu Yan,
Hang Zhao,
Shuo Feng,
Hao Zhao
Abstract:
Generating photorealistic driving videos has seen significant progress recently, but current methods largely focus on ordinary, non-adversarial scenarios. Meanwhile, efforts to generate adversarial driving scenarios often operate on abstract trajectory or BEV representations, falling short of delivering realistic sensor data that can truly stress-test autonomous driving (AD) systems. In this work, we introduce Challenger, a framework that produces physically plausible yet photorealistic adversarial driving videos. Generating such videos poses a fundamental challenge: it requires jointly optimizing over the space of traffic interactions and high-fidelity sensor observations. Challenger makes this affordable through two techniques: (1) a physics-aware multi-round trajectory refinement process that narrows down candidate adversarial maneuvers, and (2) a tailored trajectory scoring function that encourages realistic yet adversarial behavior while maintaining compatibility with downstream video synthesis. As tested on the nuScenes dataset, Challenger generates a diverse range of aggressive driving scenarios -- including cut-ins, sudden lane changes, tailgating, and blind spot intrusions -- and renders them into multiview photorealistic videos. Extensive evaluations show that these scenarios significantly increase the collision rate of state-of-the-art end-to-end AD models (UniAD, VAD, SparseDrive, and DiffusionDrive), and importantly, adversarial behaviors discovered for one model often transfer to others.
Submitted 22 May, 2025; v1 submitted 21 May, 2025;
originally announced May 2025.
-
DSMentor: Enhancing Data Science Agents with Curriculum Learning and Online Knowledge Accumulation
Authors:
He Wang,
Alexander Hanbo Li,
Yiqun Hu,
Sheng Zhang,
Hideo Kobayashi,
Jiani Zhang,
Henry Zhu,
Chung-Wei Hang,
Patrick Ng
Abstract:
Large language model (LLM) agents have shown promising performance in generating code for solving complex data science problems. Recent studies primarily focus on enhancing in-context learning through improved search, sampling, and planning techniques, while overlooking the importance of the order in which problems are tackled during inference. In this work, we develop a novel inference-time optimization framework, referred to as DSMentor, which leverages curriculum learning -- a strategy that introduces simpler tasks first and progressively moves to more complex ones as the learner improves -- to enhance LLM agent performance in challenging data science tasks. Our mentor-guided framework organizes data science tasks in order of increasing difficulty and incorporates a growing long-term memory to retain prior experiences, guiding the agent's learning progression and enabling more effective utilization of accumulated knowledge. We evaluate DSMentor through extensive experiments on DSEval and QRData benchmarks. Experiments show that DSMentor using Claude-3.5-Sonnet improves the pass rate by up to 5.2% on DSEval and QRData compared to baseline agents. Furthermore, DSMentor demonstrates stronger causal reasoning ability, improving the pass rate by 8.8% on the causality problems compared to GPT-4 using Program-of-Thoughts prompts. Our work underscores the importance of developing effective strategies for accumulating and utilizing knowledge during inference, mirroring the human learning process and opening new avenues for improving LLM performance through curriculum-based inference optimization.
Submitted 20 May, 2025;
originally announced May 2025.
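The DSMentor abstract above describes ordering tasks by increasing difficulty while a long-term memory of prior experiences grows. A minimal sketch of that control flow could look as follows. The `difficulty` and `solve` callables, the memory layout, and the function name are hypothetical stand-ins, not the paper's interfaces.

```python
def run_curriculum(tasks, difficulty, solve):
    """Solve tasks from easiest to hardest, passing all earlier experiences to the solver."""
    memory = []   # grows by one (task, answer) pair per solved task
    results = {}
    for task in sorted(tasks, key=difficulty):
        answer = solve(task, list(memory))  # the solver sees a copy of the memory so far
        memory.append((task, answer))
        results[task] = answer
    return results, memory


# Toy stand-ins: difficulty is the string length, and the "answer" is simply
# how many earlier experiences were available when the task was attempted.
results, memory = run_curriculum(
    ["hard task", "easy", "medium"],
    difficulty=len,
    solve=lambda task, past: len(past),
)
```

In a real agent the solver would be an LLM call that retrieves relevant entries from memory; here it only records how much memory it was given.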
-
The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition
Authors:
Ming Gao,
Shilong Wu,
Hang Chen,
Jun Du,
Chin-Hui Lee,
Shinji Watanabe,
Jingdong Chen,
Siniscalchi Sabato Marco,
Odette Scharenborg
Abstract:
Meetings are a valuable yet challenging scenario for speech applications due to complex acoustic conditions. This paper summarizes the outcomes of the MISP 2025 Challenge, hosted at Interspeech 2025, which focuses on multi-modal, multi-device meeting transcription by incorporating video modality alongside audio. The tasks include Audio-Visual Speaker Diarization (AVSD), Audio-Visual Speech Recognition (AVSR), and Audio-Visual Diarization and Recognition (AVDR). We present the challenge's objectives, tasks, dataset, baseline systems, and solutions proposed by participants. The best-performing systems achieved significant improvements over the baseline: the top AVSD model achieved a Diarization Error Rate (DER) of 8.09%, improving by 7.43%; the top AVSR system achieved a Character Error Rate (CER) of 9.48%, improving by 10.62%; and the best AVDR system achieved a concatenated minimum-permutation Character Error Rate (cpCER) of 11.56%, improving by 72.49%.
Submitted 27 May, 2025; v1 submitted 20 May, 2025;
originally announced May 2025.
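The MISP abstract above reports Character Error Rate (CER) figures without defining the metric. CER is conventionally computed as the character-level edit distance between the reference and the hypothesis, divided by the reference length. The sketch below follows that convention and is not the challenge's official scoring tool. It does not cover the speaker-permutation step implied by cpCER, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance counting substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """Edit distance normalized by the number of reference characters."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


cer = character_error_rate("kitten", "sitting")
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.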
-
Depth Transfer: Learning to See Like a Simulator for Real-World Drone Navigation
Authors:
Hang Yu,
Christophe De Wagter,
Guido C. H. E de Croon
Abstract:
Sim-to-real transfer is a fundamental challenge in robot reinforcement learning. Discrepancies between simulation and reality can significantly impair policy performance, especially if it receives high-dimensional inputs such as dense depth estimates from vision. We propose a novel depth transfer method based on domain adaptation to bridge the visual gap between simulated and real-world depth data. A Variational Autoencoder (VAE) is first trained to encode ground-truth depth images from simulation into a latent space, which serves as input to a reinforcement learning (RL) policy. During deployment, the encoder is refined to align stereo depth images with this latent space, enabling direct policy transfer without fine-tuning. We apply our method to the task of autonomous drone navigation through cluttered environments. Experiments in IsaacGym show that our method nearly doubles the obstacle avoidance success rate when switching from ground-truth to stereo depth input. Furthermore, we demonstrate successful transfer to the photo-realistic simulator AvoidBench using only IsaacGym-generated stereo data, achieving superior performance compared to state-of-the-art baselines. Real-world evaluations in both indoor and outdoor environments confirm the effectiveness of our approach, enabling robust and generalizable depth-based navigation across diverse domains.
Submitted 18 May, 2025;
originally announced May 2025.
-
First Lasing and Stable Operation of a Direct-Amplification Enabled Harmonic Generation Free-Electron Laser
Authors:
Zheng Qi,
Junhao Liu,
Lanpeng Ni,
Tao Liu,
Zhen Wang,
Kaiqing Zhang,
Hanxiang Yang,
Zhangfeng Gao,
Nanshun Huang,
Si Chen,
Hang Luo,
Yaozong Xiao,
Cheng Yu,
Yongmei Wen,
Fei Gao,
Yangyang Lei,
Huan Zhao,
Yanyan Zhu,
Liping Sun,
Weiyi Yin,
Xingtao Wang,
Taihe Lan,
Xiaoqing Liu,
Lie Feng,
Wenyan Zhang
, et al. (5 additional authors not shown)
Abstract:
Seeded free-electron lasers (FELs) capable of operating at repetition rates up to the MHz level are in high demand for advanced time-resolved spectroscopies, which require both full longitudinal coherence and high average photon flux in the extreme ultraviolet (EUV) and x-ray regimes. However, conventional external-seed laser systems cannot sustain MHz operation at the required peak power of hundreds of megawatts because of their limited total power. Here, we report the first lasing and stable operation of a direct-amplification-enabled harmonic generation FEL driven by a weak seed laser with MW-level peak power. Beginning with an ultraviolet seed laser with only 0.75 μJ pulse energy, we demonstrate its direct amplification to over 10 μJ within an 8-meter-long modulator. We observe coherent harmonic generation up to the 12th harmonic of the seed and achieve saturation of the 7th harmonic in the radiator. These results represent a crucial milestone toward the realization of MHz-class, fully coherent EUV and x-ray light sources.
Submitted 18 May, 2025;
originally announced May 2025.
-
GeoMM: On Geodesic Perspective for Multi-modal Learning
Authors:
Shibin Mei,
Hang Wang,
Bingbing Ni
Abstract:
Geodesic distance serves as a reliable means of measuring distance in nonlinear spaces, and such nonlinear manifolds are prevalent in current multimodal learning. In these scenarios, some samples may exhibit high similarity, yet they convey different semantics, making traditional distance metrics inadequate for distinguishing between positive and negative samples. This paper introduces geodesic distance as a novel distance metric in multi-modal learning for the first time, to mine correlations between samples, aiming to address the limitations of common distance metrics. Our approach incorporates a comprehensive series of strategies to adapt geodesic distance for current multimodal learning. Specifically, we construct a graph structure to represent the adjacency relationships among samples by thresholding distances between them and then apply the shortest-path algorithm to obtain geodesic distance within this graph. To facilitate efficient computation, we further propose a hierarchical graph structure built through clustering and combine it with incremental update strategies for dynamic status updates. Extensive experiments across various downstream tasks validate the effectiveness of our proposed method, demonstrating its capability to capture complex relationships between samples and improve the performance of multimodal learning models.
Submitted 16 May, 2025;
originally announced May 2025.
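The GeoMM abstract above builds a graph by thresholding pairwise distances and then runs a shortest-path algorithm to obtain geodesic distances. The sketch below shows that two-step idea on toy 2D points. It assumes Euclidean base distances and uses Dijkstra's algorithm as the shortest-path routine. The hierarchical clustering and incremental updates mentioned in the abstract are omitted, and the function names are illustrative.

```python
import heapq
import math


def build_threshold_graph(points, threshold):
    """Connect two samples when their Euclidean distance is at most `threshold`."""
    graph = {i: [] for i in range(len(points))}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            if d <= threshold:
                graph[i].append((j, d))
                graph[j].append((i, d))
    return graph


def geodesic_distances(graph, source):
    """Dijkstra's algorithm; nodes unreachable from `source` keep distance infinity."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in graph[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


# Toy samples lying along an L-shaped path.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0)]
geo = geodesic_distances(build_threshold_graph(points, threshold=1.0), source=0)
```

For the first and last toy points the graph path is longer than the straight-line distance, which is the kind of gap between geodesic and ordinary metrics that the abstract refers to.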
-
Conditioning Matters: Training Diffusion Policies is Faster Than You Think
Authors:
Zibin Dong,
Yicheng Liu,
Yinchuan Li,
Hang Zhao,
Jianye Hao
Abstract:
Diffusion policies have emerged as a mainstream paradigm for building vision-language-action (VLA) models. Although they demonstrate strong robot control capabilities, their training efficiency remains suboptimal. In this work, we identify a fundamental challenge in conditional diffusion policy training: when generative conditions are hard to distinguish, the training objective degenerates into modeling the marginal action distribution, a phenomenon we term loss collapse. To overcome this, we propose Cocos, a simple yet general solution that modifies the source distribution in the conditional flow matching to be condition-dependent. By anchoring the source distribution around semantics extracted from condition inputs, Cocos encourages stronger condition integration and prevents the loss collapse. We provide theoretical justification and extensive empirical results across simulation and real-world benchmarks. Our method achieves faster convergence and higher success rates than existing approaches, matching the performance of large-scale pre-trained VLAs using significantly fewer gradient steps and parameters. Cocos is lightweight, easy to implement, and compatible with diverse policy architectures, offering a general-purpose improvement to diffusion policy training.
Submitted 16 May, 2025;
originally announced May 2025.
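The Cocos abstract above makes the source distribution in conditional flow matching depend on the condition. One way to picture this, sketched under assumptions: draw the source sample from a Gaussian centred on a condition-derived anchor instead of the origin, then form the usual linear-interpolation training pair. The anchor vector, `sigma`, and the function names are hypothetical, and the paper's exact formulation may differ.

```python
import random


def sample_source(anchor, sigma, rng):
    """Draw x0 ~ N(anchor, sigma^2 I); the anchor stands in for condition semantics."""
    return [a + sigma * rng.gauss(0.0, 1.0) for a in anchor]


def flow_matching_pair(x0, x1, t):
    """Linear-path flow matching: the point x_t and the velocity target x1 - x0."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity


rng = random.Random(0)
# With sigma = 0 the source collapses onto the anchor, which makes the example exact.
x0 = sample_source([1.0, -1.0], sigma=0.0, rng=rng)
x_t, velocity = flow_matching_pair(x0, x1=[3.0, 1.0], t=0.5)
```

A network trained to regress `velocity` from `x_t`, `t`, and the condition then starts each trajectory from a region that already differs across conditions.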