-
A Comprehensive Survey on Underwater Acoustic Target Positioning and Tracking: Progress, Challenges, and Perspectives
Authors:
Zhong Yang,
Zhengqiu Zhu,
Yong Zhao,
Yonglin Tian,
Changjun Fan,
Runkang Guo,
Wenhao Lu,
Jingwei Ge,
Bin Chen,
Yin Zhang,
Guohua Wu,
Rui Wang,
Gyorgy Eigner,
Guangquan Cheng,
Jincai Huang,
Zhong Liu,
Jun Zhang,
Imre J. Rudas,
Fei-Yue Wang
Abstract:
Underwater target tracking technology plays a pivotal role in marine resource exploration, environmental monitoring, and national defense security. Given that acoustic waves represent an effective medium for long-distance transmission in aquatic environments, underwater acoustic target tracking has become a prominent research area of underwater communications and networking. Existing literature reviews often offer a narrow perspective or inadequately address the paradigm shifts driven by emerging technologies like deep learning and reinforcement learning. To address these gaps, this work presents a systematic survey of this field and introduces an innovative multidimensional taxonomy framework based on target scale, sensor perception modes, and sensor collaboration patterns. Within this framework, we comprehensively survey the literature (more than 180 publications) over the period 2016-2025, spanning from the theoretical foundations to diverse algorithmic approaches in underwater acoustic target tracking. Particularly, we emphasize the transformative potential and recent advancements of machine learning techniques, including deep learning and reinforcement learning, in enhancing the performance and adaptability of underwater tracking systems. Finally, this survey concludes by identifying key challenges in the field and proposing future avenues based on emerging technologies such as federated learning, blockchain, embodied intelligence, and large models.
Submitted 16 June, 2025;
originally announced June 2025.
-
S$^4$C: Speculative Sampling with Syntactic and Semantic Coherence for Efficient Inference of Large Language Models
Authors:
Tao He,
Guang Huang,
Yu Yang,
Tianshi Xu,
Sicheng Zhao,
Guiguang Ding,
Pengyang Wang,
Feng Tian
Abstract:
Large language models (LLMs) exhibit remarkable reasoning capabilities across diverse downstream tasks. However, their autoregressive nature leads to substantial inference latency, posing challenges for real-time applications. Speculative sampling mitigates this issue by introducing a drafting phase followed by a parallel validation phase, enabling faster token generation and verification. Existing approaches, however, overlook the inherent coherence in text generation, limiting their efficiency. To address this gap, we propose a Speculative Sampling with Syntactic and Semantic Coherence (S$^4$C) framework, which extends speculative sampling by leveraging multi-head drafting for rapid token generation and a continuous verification tree for efficient candidate validation and feature reuse. Experimental results demonstrate that S$^4$C surpasses baseline methods across mainstream tasks, offering enhanced efficiency, parallelism, and the ability to generate more valid tokens with fewer computational resources. On Spec-bench benchmarks, S$^4$C achieves an acceleration ratio of 2.26x-2.60x, outperforming state-of-the-art methods.
Submitted 16 June, 2025;
originally announced June 2025.
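The draft-then-verify idea that the abstract above builds on can be sketched as follows. This is a minimal illustration of generic speculative decoding with greedy verification, not the S$^4$C method: it has no multi-head drafting and no verification tree. The names `toy_target`, `toy_draft`, `greedy_decode`, and `speculative_decode` are hypothetical, and the two "models" are toy deterministic functions standing in for real networks.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def toy_target(ctx: List[int]) -> int:
    # Hypothetical stand-in for the large target model (greedy next token).
    return (sum(ctx) * 3 + 1) % 11


def toy_draft(ctx: List[int]) -> int:
    # Hypothetical cheap draft model: agrees with the target most of the time.
    return toy_target(ctx) if len(ctx) % 3 else 0


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    # Plain autoregressive decoding, one target call per token.
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # Drafting phase: the cheap model proposes k tokens sequentially.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # Verification phase: a real system would score all k positions in
        # one parallel target pass; this sketch loops for clarity.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != tok:
                # Keep the accepted prefix and the target's correction.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            # Every drafted token was accepted; take one extra target token.
            out.extend(proposal)
            out.append(target(out))
    return out[:len(prompt) + max_new]
```

With greedy verification, the output matches plain greedy decoding of the target model; any speed-up comes only from how many drafted tokens are accepted per verification pass.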
-
DCRM: A Heuristic to Measure Response Pair Quality in Preference Optimization
Authors:
Chengyu Huang,
Tanya Goyal
Abstract:
Recent research has attempted to associate preference optimization (PO) performance with the underlying preference datasets. In this work, our observation is that the differences between the preferred response $y^+$ and dispreferred response $y^-$ influence what LLMs can learn, which may not match the desirable differences to learn. Therefore, we use distance and reward margin to quantify these differences, and combine them to get Distance Calibrated Reward Margin (DCRM), a metric that measures the quality of a response pair for PO. Intuitively, DCRM encourages minimal noisy differences and maximal desired differences. With this, we study 3 types of commonly used preference datasets, classified along two axes: the source of the responses and the preference labeling function. We establish a general correlation between higher DCRM of the training set and better learning outcome. Inspired by this, we propose a best-of-$N^2$ pairing method that selects response pairs with the highest DCRM. Empirically, in various settings, our method produces training datasets that can further improve models' performance on AlpacaEval, MT-Bench, and Arena-Hard over the existing training sets.
Submitted 16 June, 2025;
originally announced June 2025.
-
Tunable Hybrid-Mode Coupler Enabling Strong Interactions between Transmons at Centimeter-Scale Distance
Authors:
Jianwen Xu,
Xiang Deng,
Wen Zheng,
Wenchang Yan,
Tao Zhang,
Zhenchuan Zhang,
Wanli Huang,
Xiaoyu Xia,
Xudong Liao,
Yu Zhang,
Jie Zhao,
Shaoxiong Li,
Xinsheng Tan,
Dong Lan,
Yang Yu
Abstract:
The transmon, a fabrication-friendly superconducting qubit, remains a leading candidate for scalable quantum computing. Recent advances in tunable couplers have accelerated progress toward high-performance quantum processors. However, extending coherent interactions beyond millimeter scales to enhance quantum connectivity presents a critical challenge. Here, we introduce a hybrid-mode coupler exploiting resonator-transmon hybridization to simultaneously engineer the two lowest-frequency modes, enabling high-contrast coupling between centimeter-scale transmons. For a 1-cm coupler, our framework predicts flux-tunable $XX$ and $ZZ$ coupling strengths reaching 23 MHz and 100 MHz, with modulation contrasts exceeding $10^2$ and $10^4$, respectively, demonstrating quantitative agreement with an effective two-channel model. This work provides an efficient pathway to mitigate the inherent connectivity constraints imposed by short-range interactions, enabling transmon-based architectures compatible with hardware-efficient quantum tasks.
Submitted 16 June, 2025;
originally announced June 2025.
-
A Chebyshev criterion for at most two non-zero limit cycles in Abel equations
Authors:
Jianfeng Huang,
Renhao Tian,
Yulin Zhao
Abstract:
This paper investigates the Abel equation $\dot{x}=A(t)x^{3}+B(t)x^{2}$ on an interval $[0,T]$. The Smale-Pugh problem asks whether the maximum number of limit cycles of the equation is bounded in terms of a given class of coefficients. We establish for the first time a Chebyshev criterion, providing a positive answer to the problem when this class is spanned by an extended Chebyshev system (ET-system) $\mathcal{F}=\{f_{0},f_{1},f_{2}\}$ on $[0,T)$ with $f_{0}\neq 0$.
As an application, we prove that the equation has at most three limit cycles (including $x=0$) when the coefficients $A$ and $B$ are both linear trigonometric functions or quadratic polynomials. This reestablishes the result of Yu et al. (J. Differ. Equ., 2024) and improves the work of Bravo et al. (Disc. Cont. Dyn. Syst., 2015 & J. Differ. Equ., 2024). We also obtain the same maximum number of limit cycles for the equation with trinomial coefficients.
Submitted 16 June, 2025;
originally announced June 2025.
-
Into the Unknown: Applying Inductive Spatial-Semantic Location Embeddings for Predicting Individuals' Mobility Beyond Visited Places
Authors:
Xinglei Wang,
Tao Cheng,
Stephen Law,
Zichao Zeng,
Ilya Ilyankou,
Junyuan Liu,
Lu Yin,
Weiming Huang,
Natchapon Jongwiriyanurak
Abstract:
Predicting individuals' next locations is a core task in human mobility modelling, with wide-ranging implications for urban planning, transportation, public policy and personalised mobility services. Traditional approaches largely depend on location embeddings learned from historical mobility patterns, limiting their ability to encode explicit spatial information, integrate rich urban semantic context, and accommodate previously unseen locations. To address these challenges, we explore the application of CaLLiPer -- a multimodal representation learning framework that fuses spatial coordinates and semantic features of points of interest through contrastive learning -- for location embedding in individual mobility prediction. CaLLiPer's embeddings are spatially explicit, semantically enriched, and inductive by design, enabling robust prediction performance even in scenarios involving emerging locations. Through extensive experiments on four public mobility datasets under both conventional and inductive settings, we demonstrate that CaLLiPer consistently outperforms strong baselines, particularly excelling in inductive scenarios. Our findings highlight the potential of multimodal, inductive location embeddings to advance the capabilities of human mobility prediction systems. We also release the code and data (https://github.com/xlwang233/Into-the-Unknown) to foster reproducibility and future research.
Submitted 16 June, 2025;
originally announced June 2025.
-
MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation
Authors:
Xueqing Peng,
Lingfei Qian,
Yan Wang,
Ruoyu Xiang,
Yueru He,
Yang Ren,
Mingyang Jiang,
Jeff Zhao,
Huan He,
Yi Han,
Yun Feng,
Yuechen Jiang,
Yupeng Cao,
Haohang Li,
Yangyang Yu,
Xiaoyu Wang,
Penglei Gao,
Shengyuan Lin,
Keyi Wang,
Shanshan Yang,
Yilun Zhao,
Zhiwei Liu,
Peng Lu,
Jerry Huang,
Suyuchen Wang
, et al. (19 additional authors not shown)
Abstract:
Recent advances in large language models (LLMs) have accelerated progress in financial NLP and applications, yet existing benchmarks remain limited to monolingual and unimodal settings, often over-relying on simple tasks and failing to reflect the complexity of real-world financial communication. We introduce MultiFinBen, the first multilingual and multimodal benchmark tailored to the global financial domain, evaluating LLMs across modalities (text, vision, audio) and linguistic settings (monolingual, bilingual, multilingual) on domain-specific tasks. We introduce two novel tasks, including PolyFiQA-Easy and PolyFiQA-Expert, the first multilingual financial benchmarks requiring models to perform complex reasoning over mixed-language inputs; and EnglishOCR and SpanishOCR, the first OCR-embedded financial QA tasks challenging models to extract and reason over information from visual-text financial documents. Moreover, we propose a dynamic, difficulty-aware selection mechanism and curate a compact, balanced benchmark rather than simply aggregating existing datasets. Extensive evaluation of 22 state-of-the-art models reveals that even the strongest models, despite their general multimodal and multilingual capabilities, struggle dramatically when faced with complex cross-lingual and multimodal tasks in the financial domain. MultiFinBen is publicly released to foster transparent, reproducible, and inclusive progress in financial studies and applications.
Submitted 19 June, 2025; v1 submitted 16 June, 2025;
originally announced June 2025.
-
Disentangling 3D from Large Vision-Language Models for Controlled Portrait Generation
Authors:
Nick Yiwen Huang,
Akin Caliskan,
Berkay Kicanaoglu,
James Tompkin,
Hyeongwoo Kim
Abstract:
We consider the problem of disentangling 3D from large vision-language models, which we show on generative 3D portraits. This allows free-form text control of appearance attributes like age, hair style, and glasses, and 3D geometry control of face expression and camera pose. In this setting, we assume we use a pre-trained large vision-language model (LVLM; CLIP) to generate from a smaller 2D dataset with no additional paired labels and with a pre-defined 3D morphable model (FLAME). First, we disentangle using canonicalization to a 2D reference frame from a deformable neural 3D triplane representation. But another form of entanglement arises from the significant noise in the LVLM's embedding space that describes irrelevant features. This damages output quality and diversity, but we overcome this with a Jacobian regularization that can be computed efficiently with a stochastic approximator. Compared to existing methods, our approach produces portraits with added text and 3D control, where portraits remain consistent when either control is changed. Broadly, this approach lets creators control 3D generators on their own 2D face data without needing resources to label large data or train large models.
Submitted 16 June, 2025;
originally announced June 2025.
-
GRaD-Nav++: Vision-Language Model Enabled Visual Drone Navigation with Gaussian Radiance Fields and Differentiable Dynamics
Authors:
Qianzhong Chen,
Naixiang Gao,
Suning Huang,
JunEn Low,
Timothy Chen,
Jiankai Sun,
Mac Schwager
Abstract:
Autonomous drones capable of interpreting and executing high-level language instructions in unstructured environments remain a long-standing goal. Yet existing approaches are constrained by their dependence on hand-crafted skills, extensive parameter tuning, or computationally intensive models unsuitable for onboard use. We introduce GRaD-Nav++, a lightweight Vision-Language-Action (VLA) framework that runs fully onboard and follows natural-language commands in real time. Our policy is trained in a photorealistic 3D Gaussian Splatting (3DGS) simulator via Differentiable Reinforcement Learning (DiffRL), enabling efficient learning of low-level control from visual and linguistic inputs. At its core is a Mixture-of-Experts (MoE) action head, which adaptively routes computation to improve generalization while mitigating forgetting. In multi-task generalization experiments, GRaD-Nav++ achieves a success rate of 83% on trained tasks and 75% on unseen tasks in simulation. When deployed on real hardware, it attains 67% success on trained tasks and 50% on unseen ones. In multi-environment adaptation experiments, GRaD-Nav++ achieves an average success rate of 81% across diverse simulated environments and 67% across varied real-world settings. These results establish a new benchmark for fully onboard Vision-Language-Action (VLA) flight and demonstrate that compact, efficient models can enable reliable, language-guided navigation without relying on external infrastructure.
Submitted 16 June, 2025;
originally announced June 2025.
-
Development of non amplified Depleted MAPS sensors towards 50 ps timing resolution on charged particles
Authors:
Raimon Casanova,
Yavuz Degerli,
Yujing Gan,
Sebastian Grinstein,
Fabrice Guilloux,
Tomasz Hemperek,
G. Huang,
Jean-Pierre Meyer,
Philippe Schwemling
Abstract:
The MiniCactus sensors are demonstrator sensors designed in LFoundry LF15A 150 nm technology, intended to study the performance of non amplified High Voltage High Resistivity CMOS sensors for measurement of time of arrival of charged particles. This paper presents the context, design features and some of the first test-beam results obtained with the latest MiniCactus sensor version, MiniCactus V2. With a 175 micron thick sensor biased at -350 V, we have obtained a 60 ps time resolution on Minimum Ionizing Particles detected with a 500 micron by 500 micron pixel.
Submitted 16 June, 2025;
originally announced June 2025.
-
CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios
Authors:
Shiting Huang,
Zhen Fang,
Zehui Chen,
Siyu Yuan,
Junjie Ye,
Yu Zeng,
Lin Chen,
Qi Mao,
Feng Zhao
Abstract:
The ability of large language models (LLMs) to utilize external tools has enabled them to tackle an increasingly diverse range of tasks. However, as the tasks become more complex and long-horizon, the intricate tool utilization process may trigger various unexpected errors. Therefore, how to effectively handle such errors, including identifying, diagnosing, and recovering from them, has emerged as a key research direction for advancing tool learning. In this work, we first extensively analyze the types of errors encountered during the function-calling process on several competitive tool evaluation benchmarks. Based on this analysis, we introduce CRITICTOOL, a comprehensive critique evaluation benchmark specialized for tool learning. Building upon a novel evolutionary strategy for dataset construction, CRITICTOOL holds diverse tool-use errors with varying complexities, which better reflects real-world scenarios. We conduct extensive experiments on CRITICTOOL, and validate the generalization and effectiveness of our constructed benchmark strategy. We also provide an in-depth analysis of the tool reflection ability on various LLMs, offering a new perspective on the field of tool learning in LLMs. The code is available at https://github.com/Shellorley0513/CriticTool.
Submitted 11 June, 2025;
originally announced June 2025.
-
A Survey on World Models Grounded in Acoustic Physical Information
Authors:
Xiaoliang Chen,
Le Chang,
Xin Yu,
Yunhe Huang,
Xianling Tu
Abstract:
This survey provides a comprehensive overview of the emerging field of world models grounded in the foundation of acoustic physical information. It examines the theoretical underpinnings, essential methodological frameworks, and recent technological advancements in leveraging acoustic signals for high-fidelity environmental perception, causal physical reasoning, and predictive simulation of dynamic events. The survey explains how acoustic signals, as direct carriers of mechanical wave energy from physical events, encode rich, latent information about material properties, internal geometric structures, and complex interaction dynamics. Specifically, this survey establishes the theoretical foundation by explaining how fundamental physical laws govern the encoding of physical information within acoustic signals. It then reviews the core methodological pillars, including Physics-Informed Neural Networks (PINNs), generative models, and self-supervised multimodal learning frameworks. Furthermore, the survey details the significant applications of acoustic world models in robotics, autonomous driving, healthcare, and finance. Finally, it systematically outlines the important technical and ethical challenges while proposing a concrete roadmap for future research directions toward robust, causal, uncertainty-aware, and responsible acoustic intelligence. These elements collectively point to a research pathway towards embodied active acoustic intelligence, empowering AI systems to construct an internal "intuitive physics" engine through sound.
Submitted 16 June, 2025;
originally announced June 2025.
-
MLDebugging: Towards Benchmarking Code Debugging Across Multi-Library Scenarios
Authors:
Jinyang Huang,
Xiachong Feng,
Qiguang Chen,
Hanjie Zhao,
Zihui Cheng,
Jiesong Bai,
Jingxuan Zhou,
Min Li,
Libo Qin
Abstract:
Code debugging is a crucial task in software engineering that attracts increasing attention. While remarkable success has been achieved in the era of large language models (LLMs), current research still focuses on the simple no-library or single-library setting, ignoring the complex multi-library scenario in real-world applications. To address this limitation, we make the first attempt to introduce MLDebugging (Multi-Library Debugging), a comprehensive benchmark designed to assess debugging challenges within multi-library Python code. Specifically, MLDebugging encompasses 126 distinct Python libraries, covering a wide range of multi-library code issues, categorized into seven distinct types. Furthermore, we conduct a thorough evaluation of MLDebugging using both mainstream open-source and closed-source LLMs and highlight that current LLMs still struggle to correctly perform code debugging across multi-library scenarios. We hope this work can uncover the potential of LLMs in multi-library debugging scenarios and offer insights for future research.
Submitted 15 June, 2025;
originally announced June 2025.
-
Med-REFL: Medical Reasoning Enhancement via Self-Corrected Fine-grained Reflection
Authors:
Zongxian Yang,
Jiayu Qian,
Zegao Peng,
Haoyu Zhang,
Zhi-An Huang
Abstract:
Large reasoning models have recently made significant strides in mathematical and code reasoning, yet their success has not transferred smoothly to the medical domain. While multiple factors contribute to this disparity, a critical issue is the inadequate focus on the quality of intermediate reflection steps, which is particularly crucial in high-stakes medical scenarios. To address this challenge, we propose Med-REFL, a Medical Reasoning Enhancement via self-corrected Fine-grained refLection. Our method leverages a tree-of-thought approach to decompose medical questions into fine-grained reasoning paths, quantitatively evaluating each step and its subsequent reflections. These assessments enable automatic construction of direct preference optimization data, reducing reliance on expensive expert annotations while guiding models to identify and correct reasoning errors. Experimental results on the MedQA-USMLE benchmark demonstrate Med-REFL achieves consistent improvements, with average gains up to 4.11%. Notably, it further boosts the state-of-the-art performance of 7B/8B models by an additional 4.13%. Furthermore, Med-REFL exhibits strong generalization capabilities and robustness across several challenging medical question-answering datasets. Our work illustrates that prioritizing reflection quality leads to more accurate and trustworthy reasoning in medical AI applications. Checkpoints, code, and data can be found at https://github.com/TianYin123/Med-REFL.
Submitted 11 June, 2025;
originally announced June 2025.
-
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
Authors:
Zewei Zhou,
Tianhui Cai,
Seth Z. Zhao,
Yun Zhang,
Zhiyu Huang,
Bolei Zhou,
Jiaqi Ma
Abstract:
Recent advancements in Vision-Language-Action (VLA) models have shown promise for end-to-end autonomous driving by leveraging world knowledge and reasoning capabilities. However, current VLA models often struggle with physically infeasible action outputs, complex model structures, or unnecessarily long reasoning. In this paper, we propose AutoVLA, a novel VLA model that unifies reasoning and action generation within a single autoregressive generation model for end-to-end autonomous driving. AutoVLA performs semantic reasoning and trajectory planning directly from raw visual inputs and language instructions. We tokenize continuous trajectories into discrete, feasible actions, enabling direct integration into the language model. For training, we employ supervised fine-tuning to equip the model with dual thinking modes: fast thinking (trajectory-only) and slow thinking (enhanced with chain-of-thought reasoning). To further enhance planning performance and efficiency, we introduce a reinforcement fine-tuning method based on Group Relative Policy Optimization (GRPO), reducing unnecessary reasoning in straightforward scenarios. Extensive experiments across real-world and simulated datasets and benchmarks, including nuPlan, nuScenes, Waymo, and CARLA, demonstrate the competitive performance of AutoVLA in both open-loop and closed-loop settings. Qualitative results showcase the adaptive reasoning and accurate planning capabilities of AutoVLA in diverse scenarios.
Submitted 16 June, 2025;
originally announced June 2025.
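The abstract above mentions Group Relative Policy Optimization (GRPO). As a rough illustration of the group-relative part only, the sketch below normalizes the rewards of several responses sampled for the same input against their own group's mean and standard deviation. The function name, the use of the population standard deviation, and the `eps` constant are assumptions made for this example; AutoVLA's actual reward design and training loop are not described in the abstract and are not reproduced here.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    # Each sampled response is scored relative to its own group:
    # advantage_i = (r_i - mean(r)) / (std(r) + eps).
    # Population std and eps are illustrative choices, not taken from the paper.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled outputs of one driving scenario.
example_advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Responses scoring above the group mean receive positive advantages and those below receive negative ones, so no separate learned value function is needed to form the baseline.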
-
VideoPDE: Unified Generative PDE Solving via Video Inpainting Diffusion Models
Authors:
Edward Li,
Zichen Wang,
Jiahe Huang,
Jeong Joon Park
Abstract:
We present a unified framework for solving partial differential equations (PDEs) using video-inpainting diffusion transformer models. Unlike existing methods that devise specialized strategies for either forward or inverse problems under full or partial observation, our approach unifies these tasks under a single, flexible generative framework. Specifically, we recast PDE-solving as a generalized inpainting problem, e.g., treating forward prediction as inferring missing spatiotemporal information of future states from initial conditions. To this end, we design a transformer-based architecture that conditions on arbitrary patterns of known data to infer missing values across time and space. Our method proposes pixel-space video diffusion models for fine-grained, high-fidelity inpainting and conditioning, while enhancing computational efficiency through hierarchical modeling. Extensive experiments show that our video inpainting-based diffusion model offers an accurate and versatile solution across a wide range of PDEs and problem setups, outperforming state-of-the-art baselines.
Submitted 16 June, 2025; v1 submitted 16 June, 2025;
originally announced June 2025.
-
LeVERB: Humanoid Whole-Body Control with Latent Vision-Language Instruction
Authors:
Haoru Xue,
Xiaoyu Huang,
Dantong Niu,
Qiayuan Liao,
Thomas Kragerud,
Jan Tommy Gravdahl,
Xue Bin Peng,
Guanya Shi,
Trevor Darrell,
Koushil Screenath,
Shankar Sastry
Abstract:
Vision-language-action (VLA) models have demonstrated strong semantic understanding and zero-shot generalization, yet most existing systems assume an accurate low-level controller with hand-crafted action "vocabulary" such as end-effector pose or root velocity. This assumption confines prior work to quasi-static tasks and precludes the agile, whole-body behaviors required by humanoid whole-body control (WBC) tasks. To address this gap in the literature, we start by introducing the first sim-to-real-ready, vision-language, closed-loop benchmark for humanoid WBC, comprising over 150 tasks from 10 categories. We then propose LeVERB: Latent Vision-Language-Encoded Robot Behavior, a hierarchical latent instruction-following framework for humanoid vision-language WBC, the first of its kind. At the top level, a vision-language policy learns a latent action vocabulary from synthetically rendered kinematic demonstrations; at the low level, a reinforcement-learned WBC policy consumes these latent verbs to generate dynamics-level commands. In our benchmark, LeVERB can zero-shot attain an 80% success rate on simple visual navigation tasks, and a 58.5% success rate overall, outperforming a naive hierarchical whole-body VLA implementation by 7.8 times.
Submitted 19 June, 2025; v1 submitted 16 June, 2025;
originally announced June 2025.
-
CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding
Authors:
Wenxuan Song,
Jiayi Chen,
Pengxiang Ding,
Yuxin Huang,
Han Zhao,
Donglin Wang,
Haoang Li
Abstract:
In recent years, Vision-Language-Action (VLA) models have become a vital research direction in robotics due to their impressive multimodal understanding and generalization capabilities. Despite this progress, their practical deployment is severely constrained by inference speed bottlenecks, particularly in high-frequency and dexterous manipulation tasks. While recent studies have explored Jacobi decoding as a more efficient alternative to traditional autoregressive decoding, its practical benefits are marginal due to the lengthy iterations. To address this, we introduce consistency distillation training to predict multiple correct action tokens in each iteration, thereby achieving acceleration. In addition, we design mixed-label supervision to mitigate the error accumulation during distillation. Although distillation brings acceptable speedup, we identify that certain inefficient iterations remain a critical bottleneck. To tackle this, we propose an early-exit decoding strategy that moderately relaxes convergence conditions, which further improves average inference efficiency. Experimental results show that the proposed method achieves more than 4 times inference acceleration across different baselines while maintaining high task success rates in both simulated and real-world robot tasks. These experiments validate that our approach provides an efficient and general paradigm for accelerating multimodal decision-making in robotics. Our project page is available at https://irpn-eai.github.io/CEED-VLA/.
Submitted 16 June, 2025;
originally announced June 2025.
-
Leveraging erasure errors in logical qubits with metastable $^{171}$Yb atoms
Authors:
Bichen Zhang,
Genyue Liu,
Guillaume Bornet,
Sebastian P. Horvath,
Pai Peng,
Shuo Ma,
Shilin Huang,
Shruti Puri,
Jeff D. Thompson
Abstract:
Implementing large-scale quantum algorithms with practical advantage will require fault-tolerance achieved through quantum error correction, but the associated overhead is a significant cost. The overhead can be reduced by engineering physical qubits with fewer errors, and by shaping the residual errors to be more easily correctable. In this work, we demonstrate quantum error correcting codes and logical qubit circuits in a metastable ${}^{171}$Yb qubit with a noise bias towards erasure errors, that is, errors whose location can be detected separate from any syndrome information. We show that dephasing errors on the nuclear spin qubit during coherent transport can be strongly suppressed, and implement robust entangling gates that maintain a high fidelity in the presence of gate beam inhomogeneity or pointing error. We demonstrate logical qubit encoding in the $[[4,2,2]]$ code, with error correction during decoding based on mid-circuit erasure measurements despite the fact that the code is too small to correct any Pauli errors. Finally, we demonstrate logical qubit teleportation between multiple code blocks with conditionally selected ancillas based on mid-circuit erasure checks, which is a key ingredient for leakage-robust error correction with neutral atoms.
Submitted 16 June, 2025;
originally announced June 2025.
-
OneRec Technical Report
Authors:
Guorui Zhou,
Jiaxin Deng,
Jinghao Zhang,
Kuo Cai,
Lejian Ren,
Qiang Luo,
Qianqian Wang,
Qigen Hu,
Rui Huang,
Shiyao Wang,
Weifeng Ding,
Wuchao Li,
Xinchen Luo,
Xingmei Wang,
Zexuan Cheng,
Zixing Zhang,
Bin Zhang,
Boxuan Wang,
Chaoyi Ma,
Chengru Song,
Chenhui Wang,
Di Wang,
Dongxue Meng,
Fan Yang,
Fangyu Zhang
, et al. (40 additional authors not shown)
Abstract:
Recommender systems have been widely used in various large-scale user-oriented platforms for many years. However, compared to the rapid developments in the AI community, recommendation systems have not achieved a breakthrough in recent years. For instance, they still rely on a multi-stage cascaded architecture rather than an end-to-end approach, leading to computational fragmentation and optimization inconsistencies, and hindering the effective application of key breakthrough technologies from the AI community in recommendation scenarios.
To address these issues, we propose OneRec, which reshapes the recommendation system through an end-to-end generative approach and achieves promising results. Firstly, we have enhanced the computational FLOPs of the current recommendation model by 10 $\times$ and have identified the scaling laws for recommendations within certain boundaries. Secondly, reinforcement learning techniques, previously difficult to apply for optimizing recommendations, show significant potential in this framework. Lastly, through infrastructure optimizations, we have achieved 23.7% and 28.8% Model FLOPs Utilization (MFU) on flagship GPUs during training and inference, respectively, aligning closely with the LLM community. This architecture significantly reduces communication and storage overhead, resulting in operating expense that is only 10.6% of traditional recommendation pipelines. Deployed in Kuaishou/Kuaishou Lite APP, it handles 25% of total queries per second, enhancing overall App Stay Time by 0.54% and 1.24%, respectively. Additionally, we have observed significant increases in metrics such as 7-day Lifetime, which is a crucial indicator of recommendation experience. We also provide practical lessons and insights derived from developing, optimizing, and maintaining a production-scale recommendation system with significant real-world impact.
Submitted 16 June, 2025;
originally announced June 2025.
-
Do more observations bring more information in rare events?
Authors:
Danyang Huang,
Liyuan Wang,
Liping Zhu
Abstract:
It is generally believed that more observations provide more information. However, we observe that in the independence test for rare events, the power of the test is, surprisingly, determined by the number of rare events rather than the total sample size. Moreover, the correlations tend to shrink to zero even as the total sample size increases, as long as the proportion of rare events decreases. We demonstrate this phenomenon in both fixed and high-dimensional settings. To address these issues, we first rescale the covariances to account for the presence of rare events. We then propose a boosted procedure that uses only a small subset of non-rare events, yet achieves nearly the same power as using the full set of observations. As a result, computational complexity is significantly reduced. The theoretical properties, including asymptotic distribution and local power analysis, are carefully derived for both the rescaled statistic based on the full sample and the boosted test statistic based on subsampling. Furthermore, we extend the theory to multi-class rare events. Extensive simulations and real-world data analyses confirm the effectiveness and computational efficiency of the proposed approach.
Submitted 16 June, 2025;
originally announced June 2025.
-
EBS-CFL: Efficient and Byzantine-robust Secure Clustered Federated Learning
Authors:
Zhiqiang Li,
Haiyong Bao,
Menghong Guan,
Hao Pan,
Cheng Huang,
Hong-Ning Dai
Abstract:
Despite the potential of federated learning (FL) for collaborative learning, its performance deteriorates under the data heterogeneity of distributed users. Recently, clustered federated learning (CFL) has emerged to address this challenge by partitioning users into clusters according to their similarity. However, CFL faces difficulties in training when users are unwilling to share their cluster identities due to privacy concerns. To address these issues, we present an innovative Efficient and Robust Secure Aggregation scheme for CFL, dubbed EBS-CFL. EBS-CFL supports effective CFL training while keeping users' cluster identities confidential. Moreover, it detects potential poisoning attacks without compromising individual client gradients by discarding negatively correlated gradients and aggregating positively correlated ones using a weighted approach. The server also authenticates correct gradient encoding by clients. EBS-CFL has high efficiency with client-side overhead O(ml + m^2) for communication and O(m^2l) for computation, where m is the number of cluster identities, and l is the gradient size. When m = 1, EBS-CFL's client-side computational efficiency is at least O(log n) times better than that of the comparison schemes, where n is the number of clients. In addition, we validate the scheme through extensive experiments. Finally, we theoretically prove the scheme's security.
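The filtering rule mentioned in the abstract (discard negatively correlated gradients, aggregate positively correlated ones with weights) can be sketched in plaintext as follows. This is an illustration only, under strong assumptions: correlation is taken to be cosine similarity against a hypothetical reference direction, the weights are the similarities themselves, and the secure-aggregation and gradient-encoding parts of EBS-CFL are omitted entirely. The names `cosine` and `filter_and_aggregate` are not from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def filter_and_aggregate(gradients, reference):
    """Weighted average of gradients that correlate positively with `reference`.

    Assumption: `reference` is some trusted direction (hypothetical here).
    Gradients with non-positive similarity receive weight 0, i.e. are discarded.
    """
    weights = [max(0.0, cosine(g, reference)) for g in gradients]
    total = sum(weights)
    if total == 0.0:
        # Nothing passed the filter; return a zero update.
        return [0.0] * len(reference)
    return [
        sum(w * g[i] for w, g in zip(weights, gradients)) / total
        for i in range(len(reference))
    ]
```

For example, with two client gradients aligned with the reference and one pointing the opposite way, the opposing gradient gets weight zero and the result is the average of the two aligned gradients.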
Submitted 16 June, 2025;
originally announced June 2025.
-
Largest dyadic dual VC-dimension of non-piercing families
Authors:
Xinqi Huang,
Yuzhen Qi,
Mingyuan Rong,
Zixiang Xu
Abstract:
The dyadic dual VC-dimension of a set system \( \mathcal{F} \) is the largest integer \( \ell \) such that there exist \( \ell \) sets \( F_1, F_{2}, \dots, F_\ell \in \mathcal{F} \), where every pair \( \{i, j\} \in \binom{[\ell]}{2} \) is witnessed by an element \( a_{i,j} \in F_i \cap F_j \) that does not belong to any other set \( F_k \) with \( k \in [\ell] \setminus \{i, j\} \). In this paper, we show that the largest dyadic dual VC-dimension of a non-piercing family is exactly $4$, providing a rare example where the maximum of this parameter can be determined for a natural family arising from geometry. As an application, we give a short and direct proof that the transversal number \( \tau(\mathcal{F}) \) of any non-piercing family is at most \( C\nu(\mathcal{F})^9 \), where \( \nu(\mathcal{F}) \) is the matching number and $C$ is a constant. This improves a recent result of Pálvölgyi and Zólomy.
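The definition of the dyadic dual VC-dimension can be checked directly on small finite set systems. The brute-force sketch below follows the definition as stated in the abstract; the function names are hypothetical, the search is exponential in the number of sets, and it says nothing about non-piercing families or the paper's proof.

```python
from itertools import combinations

def all_pairs_witnessed(sets):
    """True if every pair (i, j) has an element of F_i & F_j in no other F_k."""
    for i, j in combinations(range(len(sets)), 2):
        # Union of all chosen sets other than F_i and F_j (empty if none).
        others = set().union(
            *(sets[k] for k in range(len(sets)) if k not in (i, j))
        )
        if not ((sets[i] & sets[j]) - others):
            return False
    return True

def dyadic_dual_vc_dimension(family):
    """Largest l such that some l sets of `family` have all pairs witnessed."""
    family = [frozenset(s) for s in family]
    for size in range(len(family), 0, -1):
        if any(all_pairs_witnessed(sub) for sub in combinations(family, size)):
            return size
    return 0
```

For instance, the three sets {a12, a13}, {a12, a23}, {a13, a23} have each pair witnessed by its own private element, so the function returns 3 on that family.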
Submitted 16 June, 2025;
originally announced June 2025.
-
Agent Capability Negotiation and Binding Protocol (ACNBP)
Authors:
Ken Huang,
Akram Sheriff,
Vineeth Sai Narajala,
Idan Habler
Abstract:
As multi-agent systems evolve to encompass increasingly diverse and specialized agents, the challenge of enabling effective collaboration between heterogeneous agents has become paramount, with traditional agent communication protocols often assuming homogeneous environments or predefined interaction patterns that limit their applicability in dynamic, open-world scenarios. This paper presents the Agent Capability Negotiation and Binding Protocol (ACNBP), a novel framework designed to facilitate secure, efficient, and verifiable interactions between agents in heterogeneous multi-agent systems through integration with an Agent Name Service (ANS) infrastructure that provides comprehensive discovery, negotiation, and binding mechanisms. The protocol introduces a structured 10-step process encompassing capability discovery, candidate pre-screening and selection, secure negotiation phases, and binding commitment with built-in security measures including digital signatures, capability attestation, and comprehensive threat mitigation strategies, while a key innovation of ACNBP is its protocolExtension mechanism that enables backward-compatible protocol evolution and supports diverse agent architectures while maintaining security and interoperability. We demonstrate ACNBP's effectiveness through a comprehensive security analysis using the MAESTRO threat modeling framework, practical implementation considerations, and a detailed example showcasing the protocol's application in a document translation scenario, with the protocol addressing critical challenges in agent autonomy, capability verification, secure communication, and scalable agent ecosystem management.
Submitted 16 June, 2025;
originally announced June 2025.
-
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
Authors:
MiniMax,
Aili Chen,
Aonian Li,
Bangwei Gong,
Binyang Jiang,
Bo Fei,
Bo Yang,
Boji Shan,
Changqing Yu,
Chao Wang,
Cheng Zhu,
Chengjun Xiao,
Chengyu Du,
Chi Zhang,
Chu Qiao,
Chunhao Zhang,
Chunhui Du,
Congchao Guo,
Da Chen,
Deming Ding,
Dianjun Sun,
Dong Li,
Enwei Jiao,
Haigang Zhou
, et al. (103 additional authors not shown)
Abstract:
We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model natively supports a context length of 1 million tokens, 8x the context size of DeepSeek R1. Furthermore, the lightning attention mechanism in MiniMax-M1 enables efficient scaling of test-time compute. These properties make M1 particularly suitable for complex tasks that require processing long inputs and thinking extensively. MiniMax-M1 is trained using large-scale reinforcement learning (RL) on diverse problems including sandbox-based, real-world software engineering environments. In addition to M1's inherent efficiency advantage for RL training, we propose CISPO, a novel RL algorithm to further enhance RL efficiency. CISPO clips importance sampling weights rather than token updates, outperforming other competitive RL variants. Combining hybrid-attention and CISPO enables MiniMax-M1's full RL training on 512 H800 GPUs to complete in only three weeks, with a rental cost of just $534,700. We release two versions of MiniMax-M1 models with 40K and 80K thinking budgets respectively, where the 40K model represents an intermediate phase of the 80K training. Experiments on standard benchmarks show that our models are comparable or superior to strong open-weight models such as the original DeepSeek-R1 and Qwen3-235B, with particular strengths in complex software engineering, tool utilization, and long-context tasks. We publicly release MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1.
Submitted 16 June, 2025;
originally announced June 2025.
-
Micro-macro Gaussian Splatting with Enhanced Scalability for Unconstrained Scene Reconstruction
Authors:
Yihui Li,
Chengxin Lv,
Hongyu Yang,
Di Huang
Abstract:
Reconstructing 3D scenes from unconstrained image collections poses significant challenges due to variations in appearance. In this paper, we propose Scalable Micro-macro Wavelet-based Gaussian Splatting (SMW-GS), a novel method that enhances 3D reconstruction across diverse scales by decomposing scene representations into global, refined, and intrinsic components. SMW-GS incorporates the following innovations: Micro-macro Projection, which enables Gaussian points to sample multi-scale details with improved diversity; and Wavelet-based Sampling, which refines feature representations using frequency-domain information to better capture complex scene appearances. To achieve scalability, we further propose a large-scale scene promotion strategy, which optimally assigns camera views to scene partitions by maximizing their contributions to Gaussian points, achieving consistent and high-quality reconstructions even in expansive environments. Extensive experiments demonstrate that SMW-GS significantly outperforms existing methods in both reconstruction quality and scalability, particularly excelling in large-scale urban environments with challenging illumination variations. Project is available at https://github.com/Kidleyh/SMW-GS.
Submitted 16 June, 2025;
originally announced June 2025.
-
Lorentz violation signatures in the low-energy sector of Hořava gravity from black hole shadow observations
Authors:
Wentao Liu,
Hongxia Huang,
Di Wu,
Jieci Wang
Abstract:
In this paper, we use the Hořava gravity model and EHT observations of supermassive black holes (BHs) to investigate signatures of Lorentz violation in real astrophysical environments. The Lorentz violation in the rotating Hořava BH spacetime is confined to the strong gravitational field region, being induced by the BH's rotation. Due to the non-separability of the photon motion equations in this spacetime, we employed a numerical backward ray-tracing method to generate shadow images for various BH parameters. Subsequently, we extracted coordinate positions characterizing the shadow shape from high-pixel images to evaluate the parameter space of the BH. When evaluating M87*, Lorentz violation can occur with arbitrary strength. However, for Sgr A*, we can impose certain parameter constraints on Lorentz violation. These constraints depend on the BH's spin. If future observations confirm that Sgr A*'s spin parameter is less than 0.81 at maximum inclination, current EHT results would challenge general relativity and support Lorentz violation in low-energy regimes.
Submitted 16 June, 2025;
originally announced June 2025.
-
Fast Transitions of X-ray Variability in the Neutron Star Low Mass X-ray Binary Cygnus X-2
Authors:
Liang Zhang,
Mariano Méndez,
Hua Feng,
Diego Altamirano,
Zi-xu Yang,
Qing-chang Zhao,
Shuang-nan Zhang,
Lian Tao,
Yue Huang,
Xiang Ma,
Shu-mei Jia,
Ming-yu Ge,
Li-ming Song,
Jin-lu Qu,
Shu Zhang
Abstract:
We present a spectral-timing analysis of two NICER observations of the weakly magnetized neutron star low-mass X-ray binary Cygnus X-2. During these observations, we detect a rapid transition from a narrow 50-Hz horizontal-branch oscillation to a broad 5-Hz normal-branch oscillation, accompanied by an increase in source flux and a decrease in spectral hardness. Thanks to the large effective area of NICER, we are able to conduct a detailed comparison of the spectra associated with different types of quasi-periodic oscillations (QPOs) on short timescales. By fitting the spectra with a model that includes a disc and Comptonization components plus two emission lines, we find that the parameters of the disc component do not change significantly during the transition. However, assuming a fixed electron temperature, the optical depth of the Comptonization component decreases significantly. This drop in optical depth may be attributed to the expansion of the boundary layer or spreading layer. In addition, we find that the rms spectra for both the HBO and NBO are hard, suggesting that the boundary layer or spreading layer is driving the variability. We discuss the potential physical origin of the different types of QPOs.
Submitted 16 June, 2025;
originally announced June 2025.
-
DDiT: Dynamic Resource Allocation for Diffusion Transformer Model Serving
Authors:
Heyang Huang,
Cunchen Hu,
Jiaqi Zhu,
Ziyuan Gao,
Liangliang Xu,
Yizhou Shan,
Yungang Bao,
Sun Ninghui,
Tianwei Zhang,
Sa Wang
Abstract:
The Text-to-Video (T2V) model aims to generate dynamic and expressive videos from textual prompts. The generation pipeline typically involves multiple modules, such as language encoder, Diffusion Transformer (DiT), and Variational Autoencoders (VAE). Existing serving systems often rely on monolithic model deployment, while overlooking the distinct characteristics of each module, leading to inefficient GPU utilization. In addition, DiT exhibits varying performance gains across different resolutions and degrees of parallelism, and significant optimization potential remains unexplored. To address these problems, we present DDiT, a flexible system that integrates both inter-phase and intra-phase optimizations. DDiT focuses on two key metrics: optimal degree of parallelism, which prevents excessive parallelism for specific resolutions, and starvation time, which quantifies the sacrifice of each request. To this end, DDiT introduces a decoupled control mechanism to minimize the computational inefficiency caused by imbalances in the degree of parallelism between the DiT and VAE phases. It also designs a greedy resource allocation algorithm with a novel scheduling mechanism that operates at the single-step granularity, enabling dynamic and timely resource scaling. Our evaluation on the T5 encoder, OpenSora SDDiT, and OpenSora VAE models across diverse datasets reveals that DDiT significantly outperforms state-of-the-art baselines by up to 1.44x in p99 latency and 1.43x in average latency.
Submitted 16 June, 2025;
originally announced June 2025.
-
GeoSDF: Plane Geometry Diagram Synthesis via Signed Distance Field
Authors:
Chengrui Zhang,
Maizhen Ning,
Zihao Zhou,
Jie Sun,
Kaizhu Huang,
Qiufeng Wang
Abstract:
Plane Geometry Diagram Synthesis has been a crucial task in computer graphics, with applications ranging from educational tools to AI-driven mathematical reasoning. Traditionally, precise diagrams are generated manually with computer tools (e.g., Matplotlib and GeoGebra), which usually requires extensive and complicated calculations. Recently, researchers have started to work on learning-based methods (e.g., Stable Diffusion and GPT4) to automatically generate diagrams, saving operational cost but usually suffering from limited realism and insufficient accuracy. In this paper, we propose a novel framework GeoSDF to automatically generate diagrams efficiently and accurately with Signed Distance Field (SDF). Specifically, we first represent geometric elements in the SDF, then construct a series of constraint functions to represent geometric relationships; next, we optimize these constraint functions to obtain an optimized field of both elements and constraints; finally, by rendering the optimized field, we obtain the synthesized diagram. In our GeoSDF, we define a symbolic language to easily represent geometric elements and those constraints, and our synthesized geometry diagrams can be self-verified in the SDF, ensuring both mathematical accuracy and visual plausibility. In experiments, our GeoSDF synthesized both normal high-school level and IMO-level geometry diagrams. Through both qualitative and quantitative analysis, we can see that synthesized diagrams are realistic and accurate, and our synthesizing process is simple and efficient. Furthermore, we obtain a very high accuracy of solving geometry problems (over 95%, while the current SOTA accuracy is around 75%) by leveraging our self-verification property. All of these demonstrate the advantage of GeoSDF, paving the way for more sophisticated, accurate, and flexible generation of geometric diagrams for a wide array of applications.
Submitted 16 June, 2025;
originally announced June 2025.
-
Self-Supervised Enhancement for Depth from a Lightweight ToF Sensor with Monocular Images
Authors:
Laiyan Ding,
Hualie Jiang,
Jiwei Chen,
Rui Huang
Abstract:
Depth map enhancement using paired high-resolution RGB images offers a cost-effective solution for improving low-resolution depth data from lightweight ToF sensors. Nevertheless, naively adopting a depth estimation pipeline to fuse the two modalities requires groundtruth depth maps for supervision. To address this, we propose a self-supervised learning framework, SelfToF, which generates detailed and scale-aware depth maps. Starting from an image-based self-supervised depth estimation pipeline, we add low-resolution depth as inputs, design a new depth consistency loss, propose a scale-recovery module, and finally obtain a large performance boost. Furthermore, since the ToF signal sparsity varies in real-world applications, we upgrade SelfToF to SelfToF* with submanifold convolution and guided feature fusion. Consequently, SelfToF* maintains robust performance across varying sparsity levels in ToF data. Overall, our proposed method is both efficient and effective, as verified by extensive experiments on the NYU and ScanNet datasets. The code is available at https://github.com/denyingmxd/selftof.
Submitted 17 June, 2025; v1 submitted 16 June, 2025;
originally announced June 2025.
-
PRO: Projection Domain Synthesis for CT Imaging
Authors:
Kang Chen,
Bin Huang,
Xuebin Yang,
Junyan Zhang,
Qiegen Liu
Abstract:
Synthesizing high quality CT projection data remains a significant challenge due to the limited availability of annotated data and the complex nature of CT imaging. In this work, we present PRO, a projection domain synthesis foundation model for CT imaging. To the best of our knowledge, this is the first study that performs CT synthesis in the projection domain. Unlike previous approaches that operate in the image domain, PRO learns rich structural representations from raw projection data and leverages anatomical text prompts for controllable synthesis. This projection domain strategy enables more faithful modeling of underlying imaging physics and anatomical structures. Moreover, PRO functions as a foundation model, capable of generalizing across diverse downstream tasks by adjusting its generative behavior via prompt inputs. Experimental results demonstrated that incorporating our synthesized data significantly improves performance across multiple downstream tasks, including low-dose and sparse-view reconstruction. These findings underscore the versatility and scalability of PRO in data generation for various CT applications. These results highlight the potential of projection domain synthesis as a powerful tool for data augmentation and robust CT imaging. Our source code is publicly available at: https://github.com/yqx7150/PRO.
Submitted 18 June, 2025; v1 submitted 16 June, 2025;
originally announced June 2025.
-
VLM-SFD: VLM-Assisted Siamese Flow Diffusion Framework for Dual-Arm Cooperative Manipulation
Authors:
Jiaming Chen,
Yiyu Jiang,
Aoshen Huang,
Yang Li,
Wei Pan
Abstract:
Dual-arm cooperative manipulation holds great promise for tackling complex real-world tasks that demand seamless coordination and adaptive dynamics. Despite substantial progress in learning-based motion planning, most approaches struggle to generalize across diverse manipulation tasks and adapt to dynamic, unstructured environments, particularly in scenarios involving interactions between two objects such as assembly, tool use, and bimanual grasping. To address these challenges, we introduce a novel VLM-Assisted Siamese Flow Diffusion (VLM-SFD) framework for efficient imitation learning in dual-arm cooperative manipulation. The proposed VLM-SFD framework exhibits outstanding adaptability, significantly enhancing the ability to rapidly adapt and generalize to diverse real-world tasks from only a minimal number of human demonstrations. Specifically, we propose a Siamese Flow Diffusion Network (SFDNet) that employs a dual-encoder-decoder Siamese architecture to embed two target objects into a shared latent space, while a diffusion-based conditioning process, conditioned by task instructions, generates two-stream object-centric motion flows that guide dual-arm coordination. We further design a dynamic task assignment strategy that seamlessly maps the predicted 2D motion flows into 3D space and incorporates a pre-trained vision-language model (VLM) to adaptively assign the optimal motion to each robotic arm over time. Experiments validate the effectiveness of the proposed method, demonstrating its ability to generalize to diverse manipulation tasks while maintaining high efficiency and adaptability. The code and demo videos are publicly available on our project website https://sites.google.com/view/vlm-sfd/.
Submitted 16 June, 2025;
originally announced June 2025.
-
TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Scale-Oriented Contrast
Authors:
Beilei Cui,
Yiming Huang,
Long Bai,
Hongliang Ren
Abstract:
This work presents a generalizable framework to transfer relative depth to metric depth. Current monocular depth estimation methods are mainly divided into metric depth estimation (MMDE) and relative depth estimation (MRDE). MMDEs estimate depth in metric scale but are often limited to a specific domain. MRDEs generalize well across different domains, but their uncertain scale hinders downstream applications. To this end, we aim to build a framework that resolves scale uncertainty and transfers relative depth to metric depth. Previous methods used language as input and estimated two factors for rescaling. Our approach, TR2M, utilizes both a text description and an image as inputs and estimates two rescale maps to transfer relative depth to metric depth at the pixel level. Features from the two modalities are fused with a cross-modality attention module to better capture scale information. A strategy is designed to construct and filter confident pseudo metric depth for more comprehensive supervision. We also develop scale-oriented contrastive learning, which uses the depth distribution as guidance to encourage the model to learn intrinsic knowledge aligned with the scale distribution. TR2M uses only a small number of trainable parameters to train on datasets in various domains, and experiments not only demonstrate TR2M's strong performance on seen datasets but also reveal superior zero-shot capabilities on five unseen datasets. We show the huge potential of transferring relative depth to metric depth pixel-wise with language assistance. (Code is available at: https://github.com/BeileiCui/TR2M)
Submitted 16 June, 2025;
originally announced June 2025.
-
AttentionDrag: Exploiting Latent Correlation Knowledge in Pre-trained Diffusion Models for Image Editing
Authors:
Biao Yang,
Muqi Huang,
Yuhui Zhang,
Yun Xiong,
Kun Zhou,
Xi Chen,
Shiyang Zhou,
Huishuai Bao,
Chuan Li,
Feng Shi,
Hualei Liu
Abstract:
Traditional point-based image editing methods rely on iterative latent optimization or geometric transformations, which are either inefficient in their processing or fail to capture the semantic relationships within the image. These methods often overlook the powerful yet underutilized image editing capabilities inherent in pre-trained diffusion models. In this work, we propose a novel one-step point-based image editing method, named AttentionDrag, which leverages the inherent latent knowledge and feature correlations within pre-trained diffusion models for image editing tasks. This framework enables semantic consistency and high-quality manipulation without the need for extensive re-optimization or retraining. Specifically, we reuse the latent correlation knowledge learned by the self-attention mechanism in the U-Net module during the DDIM inversion process to automatically identify and adjust relevant image regions, ensuring semantic validity and consistency. Additionally, AttentionDrag adaptively generates masks to guide the editing process, enabling precise and context-aware modifications with user-friendly interaction. Our results demonstrate performance that surpasses most state-of-the-art methods at significantly faster speeds, offering a more efficient and semantically coherent solution for point-based image editing tasks.
Submitted 16 June, 2025;
originally announced June 2025.
-
$0νββ$ decay nuclear matrix elements under Left-Right symmetric model from the spherical quasi-particle random phase approximation method with realistic force
Authors:
Ri-Guang Huang,
You-Cai Chen,
Dong-Liang Fang
Abstract:
We calculate the nuclear matrix elements (NMEs) for neutrinoless double beta decay under a Left-Right symmetric model mediated by light neutrinos, adopting the spherical quasi-particle random-phase approximation (QRPA) approach with realistic force. The related nuclear matrix elements are given for eight nuclei: $^{76}$Ge, $^{82}$Se, $^{96}$Zr, $^{100}$Mo, $^{116}$Cd, $^{128}$Te, $^{130}$Te and $^{136}$Xe. We analyze each term and give the detailed contributions of the different parts. For the $q$ term, we find that the weak-magnetism components of the nucleon current contribute as much as other components such as the axial-vector one. We also discuss the influence of short-range correlations on these NMEs. It is found that the $R$ term is more sensitive to short-range correlations than the other terms, because a large portion of its contribution comes from high exchange momenta.
Submitted 16 June, 2025;
originally announced June 2025.
-
NeuroPhysNet: A FitzHugh-Nagumo-Based Physics-Informed Neural Network Framework for Electroencephalograph (EEG) Analysis and Motor Imagery Classification
Authors:
Zhenyu Xia,
Xinlei Huang,
Suvash C. Saha
Abstract:
Electroencephalography (EEG) is extensively employed in medical diagnostics and brain-computer interface (BCI) applications due to its non-invasive nature and high temporal resolution. However, EEG analysis faces significant challenges, including noise, nonstationarity, and inter-subject variability, which hinder its clinical utility. Traditional neural networks often lack integration with biophysical knowledge, limiting their interpretability, robustness, and potential for medical translation. To address these limitations, this study introduces NeuroPhysNet, a novel Physics-Informed Neural Network (PINN) framework tailored for EEG signal analysis and motor imagery classification in medical contexts. NeuroPhysNet incorporates the FitzHugh-Nagumo model, embedding neurodynamical principles to constrain predictions and enhance model robustness. Evaluated on the BCIC-IV-2a dataset, the framework achieved superior accuracy and generalization compared to conventional methods, especially in data-limited and cross-subject scenarios, which are common in clinical settings. By effectively integrating biophysical insights with data-driven techniques, NeuroPhysNet not only advances BCI applications but also holds significant promise for enhancing the precision and reliability of clinical diagnostics, such as motor disorder assessments and neurorehabilitation planning.
Submitted 16 June, 2025;
originally announced June 2025.
-
Quantum Recurrent Embedding Neural Network
Authors:
Mingrui Jing,
Erdong Huang,
Xiao Shi,
Shengyu Zhang,
Xin Wang
Abstract:
Quantum neural networks have emerged as promising quantum machine learning models, leveraging the properties of quantum systems and classical optimization to solve complex problems in physics and beyond. However, previous studies have demonstrated inevitable trainability issues that severely limit their capabilities in the large-scale regime. In this work, we propose a quantum recurrent embedding neural network (QRENN) inspired by fast-track information pathways in ResNet and general quantum circuit architectures in quantum information theory. By employing dynamical Lie algebras, we provide a rigorous proof of the trainability of QRENN circuits, demonstrating that this deep quantum neural network can avoid barren plateaus. Notably, the general QRENN architecture resists classical simulation as it encompasses powerful quantum circuits such as QSP, QSVT, and DQC1, which are widely believed to be classically intractable. Building on this theoretical foundation, we apply our QRENN to accurately classify quantum Hamiltonians and detect symmetry-protected topological phases, demonstrating its applicability in quantum supervised learning. Our results highlight the power of recurrent data embedding in quantum neural networks and the potential for scalable quantum supervised learning in predicting physical properties and solving complex problems.
Submitted 16 June, 2025;
originally announced June 2025.
-
STAGE: A Stream-Centric Generative World Model for Long-Horizon Driving-Scene Simulation
Authors:
Jiamin Wang,
Yichen Yao,
Xiang Feng,
Hang Wu,
Yaming Wang,
Qingqiu Huang,
Yuexin Ma,
Xinge Zhu
Abstract:
The generation of temporally consistent, high-fidelity driving videos over extended horizons presents a fundamental challenge in autonomous driving world modeling. Existing approaches often suffer from error accumulation and feature misalignment due to inadequate decoupling of spatio-temporal dynamics and limited cross-frame feature propagation mechanisms. To address these limitations, we present STAGE (Streaming Temporal Attention Generative Engine), a novel auto-regressive framework that pioneers hierarchical feature coordination and multi-phase optimization for sustainable video synthesis. To achieve high-quality long-horizon driving video generation, we introduce Hierarchical Temporal Feature Transfer (HTFT) and a novel multi-stage training strategy. HTFT enhances temporal consistency between video frames throughout the video generation process by modeling the temporal and denoising processes separately and transferring denoising features between frames. The multi-stage training strategy divides training into three stages, using model decoupling and simulation of the auto-regressive inference process to accelerate model convergence and reduce error accumulation. Experiments on the nuScenes dataset show that STAGE significantly surpasses existing methods in the long-horizon driving video generation task. In addition, we explore STAGE's ability to generate unlimited-length driving videos. We generated 600 frames of high-quality driving videos on the nuScenes dataset, which far exceeds the maximum length achievable by existing methods.
Submitted 16 June, 2025;
originally announced June 2025.
-
AlphaEvolve: A coding agent for scientific and algorithmic discovery
Authors:
Alexander Novikov,
Ngân Vũ,
Marvin Eisenberger,
Emilien Dupont,
Po-Sen Huang,
Adam Zsolt Wagner,
Sergey Shirobokov,
Borislav Kozlovskii,
Francisco J. R. Ruiz,
Abbas Mehrabian,
M. Pawan Kumar,
Abigail See,
Swarat Chaudhuri,
George Holland,
Alex Davies,
Sebastian Nowozin,
Pushmeet Kohli,
Matej Balog
Abstract:
In this white paper, we present AlphaEvolve, an evolutionary coding agent that substantially enhances the capabilities of state-of-the-art LLMs on highly challenging tasks such as tackling open scientific problems or optimizing critical pieces of computational infrastructure. AlphaEvolve orchestrates an autonomous pipeline of LLMs, whose task is to improve an algorithm by making direct changes to the code. Using an evolutionary approach, continuously receiving feedback from one or more evaluators, AlphaEvolve iteratively improves the algorithm, potentially leading to new scientific and practical discoveries. We demonstrate the broad applicability of this approach by applying it to a number of important computational problems. When applied to optimizing critical components of large-scale computational stacks at Google, AlphaEvolve developed a more efficient scheduling algorithm for data centers, found a functionally equivalent simplification in the circuit design of hardware accelerators, and accelerated the training of the LLM underpinning AlphaEvolve itself. Furthermore, AlphaEvolve discovered novel, provably correct algorithms that surpass state-of-the-art solutions on a spectrum of problems in mathematics and computer science, significantly expanding the scope of prior automated discovery methods (Romera-Paredes et al., 2023). Notably, AlphaEvolve developed a search algorithm that found a procedure to multiply two $4 \times 4$ complex-valued matrices using $48$ scalar multiplications, offering the first improvement, after 56 years, over Strassen's algorithm in this setting. We believe AlphaEvolve and coding agents like it can have a significant impact in improving solutions of problems across many areas of science and computation.
Submitted 16 June, 2025;
originally announced June 2025.
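For context on the matrix-multiplication result in the AlphaEvolve abstract above: applying Strassen's 2×2 scheme recursively to a $4 \times 4$ product uses $7^2 = 49$ scalar multiplications, which is the baseline that the reported 48-multiplication procedure improves on. The following is a minimal Python sketch of that baseline only. It is not AlphaEvolve's procedure, which the abstract does not spell out, and the names `strassen`, `naive_matmul`, and the `counter` argument are illustrative assumptions.

```python
def add(A, B):
    """Element-wise sum of two equally sized square matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    """Element-wise difference of two equally sized square matrices."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def split(M):
    """Split a 2n x 2n matrix into its four n x n quadrants."""
    n = len(M) // 2
    return ([r[:n] for r in M[:n]], [r[n:] for r in M[:n]],
            [r[:n] for r in M[n:]], [r[n:] for r in M[n:]])


def strassen(A, B, counter):
    """Recursive Strassen product for sizes that are powers of two.

    counter is a one-element list; counter[0] is incremented once per
    scalar multiplication performed at the 1 x 1 base case.
    """
    if len(A) == 1:
        counter[0] += 1
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Strassen's seven block products.
    M1 = strassen(add(A11, A22), add(B11, B22), counter)
    M2 = strassen(add(A21, A22), B11, counter)
    M3 = strassen(A11, sub(B12, B22), counter)
    M4 = strassen(A22, sub(B21, B11), counter)
    M5 = strassen(add(A11, A12), B22, counter)
    M6 = strassen(sub(A21, A11), add(B11, B12), counter)
    M7 = strassen(sub(A12, A22), add(B21, B22), counter)
    # Recombine the block products into the four quadrants of the result.
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom


def naive_matmul(A, B):
    """Schoolbook product, used only to check the Strassen result."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    # Small Gaussian-integer entries keep the complex arithmetic exact.
    A = [[complex(i + j, i - j) for j in range(4)] for i in range(4)]
    B = [[complex(i * j, j + 1) for j in range(4)] for i in range(4)]
    counter = [0]
    C = strassen(A, B, counter)
    print("matches schoolbook product:", C == naive_matmul(A, B))
    print("scalar multiplications:", counter[0])
```

The counter reaches 49 because each of the seven 2×2 block products itself costs seven scalar multiplications, whereas the schoolbook method needs $4^3 = 64$.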
-
Accelerating PDE-Constrained Optimization by the Derivative of Neural Operators
Authors:
Ze Cheng,
Zhuoyu Li,
Xiaoqiang Wang,
Jianing Huang,
Zhizhou Zhang,
Zhongkai Hao,
Hang Su
Abstract:
PDE-Constrained Optimization (PDECO) problems can be accelerated significantly by employing gradient-based methods with surrogate models like neural operators compared to traditional numerical solvers. However, this approach faces two key challenges: (1) **Data inefficiency**: Lack of efficient data sampling and effective training for neural operators, particularly for optimization purposes. (2) **Instability**: High risk of optimization derailment due to inaccurate neural operator predictions and gradients. To address these challenges, we propose a novel framework: (1) **Optimization-oriented training**: We leverage data from full steps of traditional optimization algorithms and employ a specialized training method for neural operators. (2) **Enhanced derivative learning**: We introduce a *Virtual-Fourier* layer to enhance derivative learning within the neural operator, a crucial aspect for gradient-based optimization. (3) **Hybrid optimization**: We implement a hybrid approach that integrates neural operators with numerical solvers, providing robust regularization for the optimization process. Our extensive experimental results demonstrate the effectiveness of our model in accurately learning operators and their derivatives. Furthermore, our hybrid optimization approach exhibits robust convergence.
Submitted 16 June, 2025;
originally announced June 2025.
-
First-passage and extreme value statistics for overdamped Brownian motion in a linear potential
Authors:
Feng Huang,
Hanshuang Chen
Abstract:
We investigate the first-passage properties and extreme-value statistics of an overdamped Brownian particle confined by an external linear potential $V(x)=μ|x-x_0|$, where $μ>0$ is the strength of the potential and $x_0>0$ is the position of the lowest point of the potential, which coincides with the starting position of the particle. The Brownian motion terminates whenever the particle passes through the origin at a random time $t_f$. Our study reveals that the mean first-passage time $\langle t_f \rangle$ exhibits a nonmonotonic behavior with respect to $μ$, with a unique minimum occurring at an optimal value of $μ\simeq 1.24468D/x_0$, where $D$ is the diffusion constant of the Brownian particle. Moreover, we examine the distribution $P(M|x_0)$ of the maximum displacement $M$ during the first-passage process, as well as the statistics of the time $t_m$ at which $M$ is reached. Intriguingly, there exists another optimal $μ\simeq 1.24011 D/x_0$ that minimizes the mean time $\langle t_m \rangle$. All our analytical findings are corroborated through numerical simulations.
Submitted 16 June, 2025;
originally announced June 2025.
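As a plausibility check on the optimal potential strength quoted in the abstract above, the following is a sketch under stated assumptions, not the authors' derivation. Assuming unit mobility (so the drift is $-V'(x)$), the mean first-passage time $T(x)$ from position $x$ obeys $D\,T'' - V'(x)\,T' = -1$ with $T(0)=0$ and no exponentially growing term as $x \to \infty$. Solving piecewise on either side of $x_0$ and matching $T$ and $T'$ there gives $\langle t_f \rangle = (x_0^2/D)\,[\,2(e^{a}-1)/a^{2} - 1/a\,]$ with $a = μx_0/D$. Setting the derivative with respect to $a$ to zero leaves $e^{a}(2a-4) + a + 4 = 0$, whose positive root can be located by bisection. The function names below are illustrative.

```python
import math


def mfpt_scaled(a):
    """Mean first-passage time in units of x0^2 / D, for a = mu * x0 / D.

    Closed form derived in the lead-in under the unit-mobility assumption.
    """
    return 2.0 * math.expm1(a) / a**2 - 1.0 / a


def stationarity(a):
    """a^3 times the derivative of mfpt_scaled; zero at the optimal a."""
    return math.exp(a) * (2.0 * a - 4.0) + a + 4.0


def bisect(f, lo, hi, tol=1e-12):
    """Plain bisection on [lo, hi]; requires a sign change of f."""
    if f(lo) * f(hi) >= 0:
        raise ValueError("f must change sign on the bracket")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # a = 0 is a trivial root of stationarity, so bracket the positive one.
    a_opt = bisect(stationarity, 1.0, 2.0)
    print("optimal mu * x0 / D:", a_opt)
    print("minimal <t_f> * D / x0^2:", mfpt_scaled(a_opt))
```

The root found this way agrees with the value $μ \simeq 1.24468\,D/x_0$ stated in the abstract, and the scaled time diverges both as $a \to 0$ (free diffusion) and as $a \to \infty$ (strong confinement), which is consistent with the nonmonotonic behavior described.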
-
A Memetic Walrus Algorithm with Expert-guided Strategy for Adaptive Curriculum Sequencing
Authors:
Qionghao Huang,
Lingnuo Lu,
Xuemei Wu,
Fan Jiang,
Xizhe Wang,
Xun Wang
Abstract:
Adaptive Curriculum Sequencing (ACS) is essential for personalized online learning, yet current approaches struggle to balance complex educational constraints and maintain optimization stability. This paper proposes a Memetic Walrus Optimizer (MWO) that enhances optimization performance through three key innovations: (1) an expert-guided strategy with an aging mechanism that improves escape from local optima; (2) an adaptive control signal framework that dynamically balances exploration and exploitation; and (3) a three-tier priority mechanism for generating educationally meaningful sequences. We formulate ACS as a multi-objective optimization problem considering concept coverage, time constraints, and learning style compatibility. Experiments on the OULAD dataset demonstrate MWO's superior performance, achieving a 95.3% difficulty progression rate (compared to 87.2% in baseline methods) and significantly better convergence stability (standard deviation of 18.02 versus 28.29-696.97 in competing algorithms). Additional validation on benchmark functions confirms MWO's robust optimization capability across diverse scenarios. The results demonstrate MWO's effectiveness in generating personalized learning sequences while maintaining computational efficiency and solution quality.
Submitted 16 June, 2025;
originally announced June 2025.
-
CHARM: Considering Human Attributes for Reinforcement Modeling
Authors:
Qidi Fang,
Hang Yu,
Shijie Fang,
Jindan Huang,
Qiuyu Chen,
Reuben M. Aronson,
Elaine S. Short
Abstract:
Reinforcement Learning from Human Feedback has recently achieved significant success in various fields, and its performance is highly dependent on feedback quality. While much prior work has acknowledged that human teachers' characteristics affect human feedback patterns, little work has closely investigated the actual effects. In this work, we designed an exploratory study investigating how human feedback patterns are associated with human characteristics. We conducted a public space study with two long-horizon tasks and 46 participants. We found that feedback patterns are correlated not only with task statistics, such as rewards, but also with participants' characteristics, especially robot experience and educational background. Additionally, we demonstrated that human feedback value can be predicted more accurately with human characteristics than with task statistics alone. All the human feedback and characteristics we collected, along with the code for data collection and for predicting human feedback more accurately, are available at https://github.com/AABL-Lab/CHARM
Submitted 15 June, 2025;
originally announced June 2025.
-
Fast Convergence for High-Order ODE Solvers in Diffusion Probabilistic Models
Authors:
Daniel Zhengyu Huang,
Jiaoyang Huang,
Zhengjiang Lin
Abstract:
Diffusion probabilistic models generate samples by learning to reverse a noise-injection process that transforms data into noise. Reformulating this reverse process as a deterministic probability flow ordinary differential equation (ODE) enables efficient sampling using high-order solvers, often requiring only $\mathcal{O}(10)$ steps. Since the score function is typically approximated by a neural network, analyzing the interaction between its regularity, approximation error, and numerical integration error is key to understanding the overall sampling accuracy. In this work, we continue our analysis of the convergence properties of the deterministic sampling methods derived from probability flow ODEs [25], focusing on $p$-th order (exponential) Runge-Kutta schemes for any integer $p \geq 1$. Under the assumption that the first and second derivatives of the approximate score function are bounded, we develop $p$-th order (exponential) Runge-Kutta schemes and demonstrate that the total variation distance between the target distribution and the generated data distribution can be bounded above by \begin{align*}
O\bigl(d^{\frac{7}{4}}\varepsilon_{\text{score}}^{\frac{1}{2}} +d(dH_{\max})^p\bigr), \end{align*} where $\varepsilon^2_{\text{score}}$ denotes the $L^2$ error in the score function approximation, $d$ is the data dimension and $H_{\max}$ represents the maximum step size used in the solver. We numerically verify the regularity assumption on benchmark datasets, confirming that the first and second derivatives of the approximate score function remain bounded in practice. Our theoretical guarantees hold for general forward processes with arbitrary variance schedules.
Submitted 18 June, 2025; v1 submitted 15 June, 2025;
originally announced June 2025.
-
Evolution of ReID: From Early Methods to LLM Integration
Authors:
Amran Bhuiyan,
Mizanur Rahman,
Md Tahmid Rahman Laskar,
Aijun An,
Jimmy Xiangji Huang
Abstract:
Person re-identification (ReID) has evolved from handcrafted feature-based methods to deep learning approaches and, more recently, to models incorporating large language models (LLMs). Early methods struggled with variations in lighting, pose, and viewpoint, but deep learning addressed these issues by learning robust visual features. Building on this, LLMs now enable ReID systems to integrate semantic and contextual information through natural language. This survey traces that full evolution and offers one of the first comprehensive reviews of ReID approaches that leverage LLMs, where textual descriptions are used as privileged information to improve visual matching. A key contribution is the use of dynamic, identity-specific prompts generated by GPT-4o, which enhance the alignment between images and text in vision-language ReID systems. Experimental results show that these descriptions improve accuracy, especially in complex or ambiguous cases. To support further research, we release a large set of GPT-4o-generated descriptions for standard ReID datasets. By bridging computer vision and natural language processing, this survey offers a unified perspective on the field's development and outlines key future directions such as better prompt design, cross-modal transfer learning, and real-world adaptability.
Submitted 15 June, 2025;
originally announced June 2025.
-
HKD4VLM: A Progressive Hybrid Knowledge Distillation Framework for Robust Multimodal Hallucination and Factuality Detection in VLMs
Authors:
Zijian Zhang,
Xuecheng Wu,
Danlei Huang,
Siyu Yan,
Chong Peng,
Xuezhi Cao
Abstract:
Driven by the rapid progress in vision-language models (VLMs), the responsible behavior of large-scale multimodal models has become a prominent research area, particularly focusing on hallucination detection and factuality checking. In this paper, we present our solution for the two tracks of the Responsible AI challenge. Insights from the general domain show that a smaller distilled VLM can often outperform a larger VLM that is directly tuned on downstream tasks, while achieving higher efficiency. We thus jointly tackle the two tasks from the perspective of knowledge distillation and propose a progressive hybrid knowledge distillation framework termed HKD4VLM. Specifically, the overall framework can be decomposed into Pyramid-like Progressive Online Distillation and Ternary-Coupled Refinement Distillation, hierarchically moving from coarse-grained knowledge alignment to fine-grained refinement. In addition, we introduce mapping shift-enhanced inference and diverse augmentation strategies to enhance model performance and robustness. Extensive experimental results demonstrate the effectiveness of our HKD4VLM. Ablation studies provide insights into the critical design choices driving performance gains.
Submitted 17 June, 2025; v1 submitted 15 June, 2025;
originally announced June 2025.
-
Probing Dark Matter's Gravitational Effects Locally with TianQin
Authors:
Zheng-Cheng Liang,
Fa-Peng Huang,
Xuefeng Zhang,
Yi-Ming Hu
Abstract:
In this study, we explore the potential of using TianQin missions to probe the local gravitational effects of dark matter. The TianQin project plans to launch satellites into both low and high orbits. High-precision orbit determination is expected to assist in the detection of the Earth's gravity or gravitational waves. By comparing the masses derived in low and high orbits, it is possible to constrain the amount of dark matter between the two spheres, hence placing a local constraint on dark matter's gravitational effect. Our results show the capability of TianQin in detecting the density of dark matter around Earth, with an ultimate sensitivity to a value of $10^{-8}\,\,{\rm kg\,\,m^{-3}}$. This detection limit surpasses the estimated bounds for the solar system and the observation results for our Galaxy by approximately 7 and 14 orders of magnitude, respectively.
Submitted 15 June, 2025;
originally announced June 2025.
-
SmartHome-Bench: A Comprehensive Benchmark for Video Anomaly Detection in Smart Homes Using Multi-Modal Large Language Models
Authors:
Xinyi Zhao,
Congjing Zhang,
Pei Guo,
Wei Li,
Lin Chen,
Chaoyue Zhao,
Shuai Huang
Abstract:
Video anomaly detection (VAD) is essential for enhancing safety and security by identifying unusual events across different environments. Existing VAD benchmarks, however, are primarily designed for general-purpose scenarios, neglecting the specific characteristics of smart home applications. To bridge this gap, we introduce SmartHome-Bench, the first comprehensive benchmark specially designed for evaluating VAD in smart home scenarios, focusing on the capabilities of multi-modal large language models (MLLMs). Our newly proposed benchmark consists of 1,203 videos recorded by smart home cameras, organized according to a novel anomaly taxonomy that includes seven categories, such as Wildlife, Senior Care, and Baby Monitoring. Each video is meticulously annotated with anomaly tags, detailed descriptions, and reasoning. We further investigate adaptation methods for MLLMs in VAD, assessing state-of-the-art closed-source and open-source models with various prompting techniques. Results reveal significant limitations in the current models' ability to detect video anomalies accurately. To address these limitations, we introduce the Taxonomy-Driven Reflective LLM Chain (TRLC), a new LLM chaining framework that achieves a notable 11.62% improvement in detection accuracy. The benchmark dataset and code are publicly available at https://github.com/Xinyi-0724/SmartHome-Bench-LLM.
Submitted 15 June, 2025;
originally announced June 2025.
-
Improving spliced alignment by modeling splice sites with deep learning
Authors:
Siying Yang,
Neng Huang,
Heng Li
Abstract:
Motivation: Spliced alignment refers to the alignment of messenger RNA (mRNA) or protein sequences to eukaryotic genomes. It plays a critical role in gene annotation and the study of gene functions. Accurate spliced alignment demands sophisticated modeling of splice sites, but current aligners use simple models, which may affect their accuracy given dissimilar sequences.
Results: We implemented minisplice to learn splice signals with a one-dimensional convolutional neural network (1D-CNN) and trained a model with 7,026 parameters for vertebrate and insect genomes. It captures conserved splice signals across phyla and reveals GC-rich introns specific to mammals and birds. We used this model to estimate the empirical splicing probability for every GT and AG in genomes, and modified minimap2 and miniprot to leverage the pre-computed splicing probabilities during alignment. Evaluation on human long-read RNA-seq data and cross-species protein datasets showed that our method greatly improves junction accuracy, especially for noisy long RNA-seq reads and proteins of distant homology.
Availability and implementation: https://github.com/lh3/minisplice
Submitted 15 June, 2025;
originally announced June 2025.