-
Measurements of the absolute branching fractions of the doubly Cabibbo-suppressed decays $D^+\to K^+π^0$, $D^+\to K^+η$ and $D^+\to K^+η^{\prime}$
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere,
A. Brueggemann,
H. Cai
, et al. (697 additional authors not shown)
Abstract:
Using $20.3\,\rm fb^{-1}$ of $e^+e^-$ collision data collected at a center-of-mass energy of 3.773\,GeV with the BESIII detector, we present improved measurements of the absolute branching fractions of the doubly Cabibbo-suppressed decays $D^+\to K^+π^0$, $D^+\to K^+η$ and $ D^+ \to K^+ η^{\prime}$ with the double-tag method. The statistical significance of each signal decay exceeds $10σ$. The branching fractions are determined to be ${\mathcal B}(D^+\to K^+ π^0) = (1.45 \pm 0.06 \pm 0.06)\times 10^{-4}$, ${\mathcal B}(D^+\to K^+ η) = (1.17 \pm 0.10 \pm 0.03)\times 10^{-4}$ and ${\mathcal B}(D^+\to K^+ η^{\prime}) = (1.88 \pm 0.15 \pm 0.06)\times 10^{-4}$, where the first uncertainties are statistical and the second systematic. These results are consistent with the world average values but with significantly improved precision.
Submitted 18 June, 2025;
originally announced June 2025.
-
NTIRE 2025 Image Shadow Removal Challenge Report
Authors:
Florin-Alexandru Vasluianu,
Tim Seizinger,
Zhuyun Zhou,
Cailian Chen,
Zongwei Wu,
Radu Timofte,
Mingjia Li,
Jin Hu,
Hainuo Wang,
Hengxing Liu,
Jiarui Wang,
Qiming Hu,
Xiaojie Guo,
Xin Lu,
Jiarong Yang,
Yuanfei Bao,
Anya Hu,
Zihao Fan,
Kunyu Wang,
Jie Xiao,
Xi Wang,
Xueyang Fu,
Zheng-Jun Zha,
Yu-Fan Lin,
Chia-Ming Lee
, et al. (57 additional authors not shown)
Abstract:
This work examines the findings of the NTIRE 2025 Shadow Removal Challenge. A total of 306 participants have registered, with 17 teams successfully submitting their solutions during the final evaluation phase. Following the last two editions, this challenge had two evaluation tracks: one focusing on reconstruction fidelity and the other on visual perception through a user study. Both tracks were evaluated with images from the WSRD+ dataset, simulating interactions between self- and cast-shadows with a large number of diverse objects, textures, and materials.
Submitted 18 June, 2025;
originally announced June 2025.
-
Model Predictive Path-Following Control for a Quadrotor
Authors:
David Leprich,
Mario Rosenfelder,
Mario Hermle,
Jingshan Chen,
Peter Eberhard
Abstract:
Automating drone-assisted processes is a complex task. Many solutions rely on trajectory generation and tracking; in contrast, path-following control is a particularly promising alternative, offering an intuitive and natural way to automate tasks for drones and other vehicles. While different solutions to the path-following problem have been proposed, most of them lack the capability to explicitly handle state and input constraints, are formulated in a conservative two-stage approach, or are only applicable to linear systems. To address these challenges, the paper builds upon a Model Predictive Control-based path-following framework and extends its application to the Crazyflie quadrotor, which is investigated in hardware experiments. A cascaded control structure with an underlying attitude controller is included in the Model Predictive Path-Following Control formulation to meet the challenging real-time demands of quadrotor control. The effectiveness of the proposed method is demonstrated through real-world experiments, representing, to the best of the authors' knowledge, a novel application of this MPC-based path-following approach to the quadrotor. Additionally, as an extension to the original method, a corridor path-following approach is presented that allows deviations from the path in cases where following it precisely might be overly restrictive.
Submitted 18 June, 2025;
originally announced June 2025.
-
RA-NeRF: Robust Neural Radiance Field Reconstruction with Accurate Camera Pose Estimation under Complex Trajectories
Authors:
Qingsong Yan,
Qiang Wang,
Kaiyong Zhao,
Jie Chen,
Bo Li,
Xiaowen Chu,
Fei Deng
Abstract:
Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have emerged as powerful tools for 3D reconstruction and SLAM tasks. However, their performance depends heavily on accurate camera pose priors. Existing approaches attempt to address this issue by introducing external constraints but fall short of achieving satisfactory accuracy, particularly when camera trajectories are complex. In this paper, we propose a novel method, RA-NeRF, capable of predicting highly accurate camera poses even with complex camera trajectories. Following the incremental pipeline, RA-NeRF reconstructs the scene using NeRF with photometric consistency and incorporates flow-driven pose regulation to enhance robustness during initialization and localization. Additionally, RA-NeRF employs an implicit pose filter to capture the camera movement pattern and eliminate the noise for pose estimation. To validate our method, we conduct extensive experiments on the Tanks\&Temple dataset for standard evaluation, as well as the NeRFBuster dataset, which presents challenging camera pose trajectories. On both datasets, RA-NeRF achieves state-of-the-art results in both camera pose estimation and visual quality, demonstrating its effectiveness and robustness in scene reconstruction under complex pose trajectories.
Submitted 18 June, 2025;
originally announced June 2025.
-
Advancing Loss Functions in Recommender Systems: A Comparative Study with a Rényi Divergence-Based Solution
Authors:
Shengjia Zhang,
Jiawei Chen,
Changdong Li,
Sheng Zhou,
Qihao Shi,
Yan Feng,
Chun Chen,
Can Wang
Abstract:
Loss functions play a pivotal role in optimizing recommendation models. Among various loss functions, Softmax Loss (SL) and Cosine Contrastive Loss (CCL) are particularly effective. Their theoretical connections and differences warrant in-depth exploration. This work conducts comprehensive analyses of these losses, yielding significant insights: 1) Common strengths -- both can be viewed as augmentations of traditional losses with Distributional Robust Optimization (DRO), enhancing robustness to distributional shifts; 2) Respective limitations -- stemming from their use of different distribution distance metrics in DRO optimization, SL exhibits high sensitivity to false negative instances, whereas CCL suffers from low data utilization. To address these limitations, this work proposes a new loss function, DrRL, which generalizes SL and CCL by leveraging Rényi-divergence in DRO optimization. DrRL incorporates the advantageous structures of both SL and CCL, and can be demonstrated to effectively mitigate their limitations. Extensive experiments have been conducted to validate the superiority of DrRL on both recommendation accuracy and robustness.
Submitted 17 June, 2025;
originally announced June 2025.
-
3D Vision-tactile Reconstruction from Infrared and Visible Images for Robotic Fine-grained Tactile Perception
Authors:
Yuankai Lin,
Xiaofan Lu,
Jiahui Chen,
Hua Yang
Abstract:
To achieve human-like haptic perception in anthropomorphic grippers, the compliant sensing surfaces of vision tactile sensors (VTSs) must evolve from conventional planar configurations to biomimetically curved topographies with continuous surface gradients. However, planar VTSs face challenges when extended to curved surfaces, including insufficient lighting of surfaces, blurring in reconstruction, and complex spatial boundary conditions for surface structures. With an end goal of constructing a human-like fingertip, our research (i) develops GelSplitter3D by expanding imaging channels with a prism and a near-infrared (NIR) camera, (ii) proposes a photometric stereo neural network with a CAD-based normal ground truth generation method to calibrate tactile geometry, and (iii) devises a normal integration method with boundary constraints of depth prior information to correct the cumulative error of surface integrals. We demonstrate better tactile sensing performance, a 40$\%$ improvement in normal estimation accuracy, and the benefits of sensor shapes in grasping and manipulation tasks.
Submitted 17 June, 2025;
originally announced June 2025.
-
Truncated Proximal Policy Optimization
Authors:
Tiantian Fan,
Lingjun Liu,
Yu Yue,
Jiaze Chen,
Chengyi Wang,
Qiying Yu,
Chi Zhang,
Zhiqi Lin,
Ruofei Zhu,
Yufeng Yuan,
Xiaochen Zuo,
Bole Ma,
Mofan Zhang,
Gaohong Liu,
Ru Zhang,
Haotian Zhou,
Cong Xie,
Ruidong Zhu,
Zhi Zhang,
Xin Liu,
Mingxuan Wang,
Lin Yan,
Yonghui Wu
Abstract:
Recently, test-time scaling Large Language Models (LLMs) have demonstrated exceptional reasoning capabilities across scientific and professional tasks by generating long chains-of-thought (CoT). As a crucial component for developing these reasoning models, reinforcement learning (RL), exemplified by Proximal Policy Optimization (PPO) and its variants, allows models to learn through trial and error. However, PPO can be time-consuming due to its inherent on-policy nature, which is further exacerbated by increasing response lengths. In this work, we propose Truncated Proximal Policy Optimization (T-PPO), a novel extension to PPO that improves training efficiency by streamlining policy updates and length-restricted response generation. T-PPO mitigates the issue of low hardware utilization, an inherent drawback of fully synchronized long-generation procedures, where resources often sit idle while waiting for complete rollouts. Our contributions are twofold. First, we propose Extended Generalized Advantage Estimation (EGAE) for advantage estimation derived from incomplete responses while maintaining the integrity of policy learning. Second, we devise a computationally optimized mechanism that allows for the independent optimization of the policy and value models. By selectively filtering prompt and truncated tokens, this mechanism reduces redundant computations and accelerates the training process without sacrificing convergence performance. We demonstrate the effectiveness and efficacy of T-PPO on AIME 2024 with a 32B base model. The experimental results show that T-PPO improves the training efficiency of reasoning LLMs by up to 2.5x and outperforms its existing competitors.
Submitted 17 June, 2025;
originally announced June 2025.
-
Interpolation-based reproducing kernel particle method
Authors:
Jennifer E. Fromm,
John A. Evans,
J. S. Chen
Abstract:
Meshfree methods, including the reproducing kernel particle method (RKPM), have been widely used within the computational mechanics community to model physical phenomena in materials undergoing large deformations or extreme topology changes. RKPM shape functions and their derivatives cannot be accurately integrated with the Gauss-quadrature methods widely employed for the finite element method (FEM) and typically require sophisticated nodal integration techniques, preventing them from easily being implemented in existing FEM software. Interpolation-based methods have been developed to address similar problems with isogeometric and immersed boundary methods, allowing these techniques to be implemented within open-source finite element software. With interpolation-based methods, background basis functions are represented as linear combinations of Lagrange polynomial foreground basis functions defined upon a boundary-conforming foreground mesh. This work extends the applications of interpolation-based methods to implement RKPM within open-source finite element software. Interpolation-based RKPM is applied to several PDEs, and error convergence rates are equivalent to classic RKPM integrated using high-order Gauss-quadrature schemes. The interpolation-based method is able to exploit the continuity of the RKPM basis to solve higher-order PDEs, demonstrated through the biharmonic problem. The method is extended to multi-material problems through Heaviside enrichment schemes, using local foreground refinement to reduce geometric integration error and achieve high-order accuracy. The computational cost of interpolation-based RKPM is similar to the smoothed gradient nodal integration schemes, offering significant savings over Gauss-quadrature-based meshfree methods while enabling easy implementation within existing finite element software.
Submitted 17 June, 2025;
originally announced June 2025.
-
YOLOv11-RGBT: Towards a Comprehensive Single-Stage Multispectral Object Detection Framework
Authors:
Dahang Wan,
Rongsheng Lu,
Yang Fang,
Xianli Lang,
Shuangbao Shu,
Jingjing Chen,
Siyuan Shen,
Ting Xu,
Zecong Ye
Abstract:
Multispectral object detection, which integrates information from multiple bands, can enhance detection accuracy and environmental adaptability, holding great application potential across various fields. Although existing methods have made progress in cross-modal interaction, low-light conditions, and model lightweighting, challenges remain, such as the lack of a unified single-stage framework, difficulty in balancing performance and fusion strategy, and unreasonable modality weight allocation. To address these, based on the YOLOv11 framework, we present YOLOv11-RGBT, a new comprehensive multimodal object detection framework. We designed six multispectral fusion modes and successfully applied them to models from YOLOv3 to YOLOv12 and RT-DETR. After reevaluating the importance of the two modalities, we proposed a P3 mid-fusion strategy and a multispectral controllable fine-tuning (MCF) strategy for multispectral models. These improvements optimize feature fusion, reduce redundancy and mismatches, and boost overall model performance. Experiments show our framework excels on three major open-source multispectral object detection datasets, such as LLVIP and FLIR. In particular, the multispectral controllable fine-tuning strategy significantly enhanced model adaptability and robustness. On the FLIR dataset, it consistently improved YOLOv11 models' mAP by 3.41%-5.65%, reaching a maximum of 47.61%, verifying the effectiveness of the framework and strategies. The code is available at: https://github.com/wandahangFY/YOLOv11-RGBT.
Submitted 18 June, 2025; v1 submitted 17 June, 2025;
originally announced June 2025.
-
DreamLight: Towards Harmonious and Consistent Image Relighting
Authors:
Yong Liu,
Wenpeng Xiao,
Qianqian Wang,
Junlin Chen,
Shiyin Wang,
Yitong Wang,
Xinglong Wu,
Yansong Tang
Abstract:
We introduce a model named DreamLight for universal image relighting in this work, which can seamlessly composite subjects into a new background while maintaining aesthetic uniformity in terms of lighting and color tone. The background can be specified by natural images (image-based relighting) or generated from unlimited text prompts (text-based relighting). Existing studies primarily focus on image-based relighting, with scant exploration of text-based scenarios. Some works employ intricate disentanglement pipeline designs relying on environment maps to provide relevant information, which grapple with the expensive data cost required for intrinsic decomposition and light source. Other methods treat this task as an image translation problem and perform pixel-level transformation with an autoencoder architecture. While these methods have achieved decent harmonization effects, they struggle to generate realistic and natural light interaction effects between the foreground and background. To alleviate these challenges, we reorganize the input data into a unified format and leverage the semantic prior provided by the pretrained diffusion model to facilitate the generation of natural results. Moreover, we propose a Position-Guided Light Adapter (PGLA) that condenses light information from different directions in the background into designed light query embeddings, and modulates the foreground with direction-biased masked attention. In addition, we present a post-processing module named Spectral Foreground Fixer (SFF) to adaptively reorganize different frequency components of the subject and relighted background, which helps enhance the consistency of foreground appearance. Extensive comparisons and a user study demonstrate that our DreamLight achieves remarkable relighting performance.
Submitted 17 June, 2025;
originally announced June 2025.
-
GUI-Robust: A Comprehensive Dataset for Testing GUI Agent Robustness in Real-World Anomalies
Authors:
Jingqi Yang,
Zhilong Song,
Jiawei Chen,
Mingli Song,
Sheng Zhou,
linjun sun,
Xiaogang Ouyang,
Chun Chen,
Can Wang
Abstract:
The development of high-quality datasets is crucial for benchmarking and advancing research in Graphical User Interface (GUI) agents. Despite their importance, existing datasets are often constructed under idealized conditions, overlooking the diverse anomalies frequently encountered in real-world deployments. To address this limitation, we introduce GUI-Robust, a novel dataset designed for comprehensive GUI agent evaluation, explicitly incorporating seven common types of anomalies observed in everyday GUI interactions. Furthermore, we propose a semi-automated dataset construction paradigm that collects user action sequences from natural interactions via RPA tools and then generates corresponding step and task descriptions for these actions with the assistance of MLLMs. This paradigm reduces annotation time cost by a factor of over 19. Finally, we assess state-of-the-art GUI agents using the GUI-Robust dataset, revealing their substantial performance degradation in abnormal scenarios. We anticipate that our work will highlight the importance of robustness in GUI agents and inspire more future research in this direction. The dataset and code are available at https://github.com/chessbean1/GUI-Robust.
Submitted 17 June, 2025;
originally announced June 2025.
-
Compositional Attribute Imbalance in Vision Datasets
Authors:
Jiayi Chen,
Yanbiao Ma,
Andi Zhang,
Weidong Tang,
Wei Dai,
Bowei Liu
Abstract:
Visual attribute imbalance is a common yet underexplored issue in image classification, significantly impacting model performance and generalization. In this work, we first define the first-level and second-level attributes of images and then introduce a CLIP-based framework to construct a visual attribute dictionary, enabling automatic evaluation of image attributes. By systematically analyzing both single-attribute imbalance and compositional attribute imbalance, we reveal how the rarity of attributes affects model performance. To tackle these challenges, we propose adjusting the sampling probability of samples based on the rarity of their compositional attributes. This strategy is further integrated with various data augmentation techniques (such as CutMix, Fmix, and SaliencyMix) to enhance the model's ability to represent rare attributes. Extensive experiments on benchmark datasets demonstrate that our method effectively mitigates attribute imbalance, thereby improving the robustness and fairness of deep neural networks. Our research highlights the importance of modeling visual attribute distributions and provides a scalable solution for long-tail image classification tasks.
Submitted 17 June, 2025;
originally announced June 2025.
-
Measurements of the Diffuse Interstellar Bands at 5780, 5797, and 6614 Å in the Hot Stellar Spectra of the LAMOST LRS DR10
Authors:
Xiao-Xiao Ma,
A-Li Luo,
Jian-Jun Chen,
Jing Chen,
Jun-Chao Liang
Abstract:
Diffuse Interstellar Bands (DIBs) are crucial tracers of the interstellar medium (ISM), yet their carriers remain poorly understood. While large-scale surveys have advanced DIB studies in cool stellar spectra, measurements in hot stellar spectra are still limited. Using 287 277 high signal-to-noise (S/N $>$ 50) hot stellar spectra from the tenth data release of the Large Sky Area Multi-Object Fiber Spectroscopic Telescope low-resolution spectroscopic survey (LAMOST LRS DR10), we systematically measured the three prominent optical DIBs at 5780, 5797, and 6614 Å. We published three catalogs containing 285 103, 279 195, and 281 146 valid measurements for the DIBs at 5780, 5797, and 6614 Å, respectively. Among them, 112 479, 25 232, and 71 048 are high-quality samples after rigorous quality control. To our knowledge, these are the largest hot-star DIB datasets in the northern sky. The catalogs provide spectral metadata, added astrometric information, DIB profiles, and quality metrics. Our methodology and open-source pipeline ensure reproducibility, while the scale and precision of the data support future statistical studies. We anticipate that these catalogs will highlight LAMOST's role in advancing DIB research and deepening our understanding of the ISM.
Submitted 17 June, 2025;
originally announced June 2025.
-
From Black Boxes to Transparent Minds: Evaluating and Enhancing the Theory of Mind in Multimodal Large Language Models
Authors:
Xinyang Li,
Siqi Liu,
Bochao Zou,
Jiansheng Chen,
Huimin Ma
Abstract:
As large language models evolve, there is growing anticipation that they will emulate human-like Theory of Mind (ToM) to assist with routine tasks. However, existing methods for evaluating machine ToM focus primarily on unimodal models and largely treat these models as black boxes, lacking an interpretative exploration of their internal mechanisms. In response, this study adopts an approach based on internal mechanisms to provide an interpretability-driven assessment of ToM in multimodal large language models (MLLMs). Specifically, we first construct a multimodal ToM test dataset, GridToM, which incorporates diverse belief testing tasks and perceptual information from multiple perspectives. Next, our analysis shows that attention heads in multimodal large models can distinguish cognitive information across perspectives, providing evidence of ToM capabilities. Furthermore, we present a lightweight, training-free approach that significantly enhances the model's exhibited ToM by adjusting in the direction of the attention head.
Submitted 17 June, 2025;
originally announced June 2025.
-
VideoMAR: Autoregressive Video Generation with Continuous Tokens
Authors:
Hu Yu,
Biao Gong,
Hangjie Yuan,
DanDan Zheng,
Weilong Chai,
Jingdong Chen,
Kecheng Zheng,
Feng Zhao
Abstract:
Mask-based autoregressive models have demonstrated promising image generation capability in continuous space. However, their potential for video generation remains under-explored. In this paper, we propose VideoMAR, a concise and efficient decoder-only autoregressive image-to-video model with continuous tokens, composing temporal frame-by-frame and spatial masked generation. We first identify temporal causality and spatial bi-directionality as the first principle of video AR models, and propose the next-frame diffusion loss for the integration of mask and video generation. In addition, the huge cost and difficulty of long-sequence autoregressive modeling is a basic but crucial issue. To this end, we propose temporal short-to-long curriculum learning and spatial progressive resolution training, and employ a progressive temperature strategy at inference time to mitigate error accumulation. Furthermore, VideoMAR carries several unique capacities of language models over to video generation. It is inherently efficient due to simultaneous temporal-wise KV cache and spatial-wise parallel generation, and is capable of spatial and temporal extrapolation via 3D rotary embeddings. On the VBench-I2V benchmark, VideoMAR surpasses the previous state-of-the-art (Cosmos I2V) while requiring significantly fewer parameters ($9.3\%$), training data ($0.5\%$), and GPU resources ($0.2\%$).
Submitted 18 June, 2025; v1 submitted 17 June, 2025;
originally announced June 2025.
-
Toward a Graph Foundation Model: Pre-Training Transformers With Random Walks
Authors:
Ziyuan Tang,
Jie Chen
Abstract:
A foundation model like GPT elicits many emergent abilities, owing to the pre-training with broad inclusion of data and the use of the powerful Transformer architecture. While foundation models in natural languages are prevalent, can we build similar models for graphs? This paper describes an approach toward a graph foundation model that is pre-trained with diverse graph datasets by adapting the Transformer backbone. A central challenge toward this end is how a sequence model encodes graphs of varying sizes and from different domains. We propose representing a node as multiple random walks, such that the Transformer can extract node representations from sequences, which in turn form edge and graph representations. We develop a novel context prediction loss for these random walks and theoretically analyze their expressive power in distinguishing neighborhoods and graphs. We also demonstrate the pre-training of our model and its adaptation to downstream tasks, showcasing its potential as a foundation for processing and reasoning with graph-structured data.
Submitted 16 June, 2025;
originally announced June 2025.
-
Bridging Unsupervised and Semi-Supervised Anomaly Detection: A Theoretically-Grounded and Practical Framework with Synthetic Anomalies
Authors:
Matthew Lau,
Tian-Yi Zhou,
Xiangchi Yuan,
Jizhou Chen,
Wenke Lee,
Xiaoming Huo
Abstract:
Anomaly detection (AD) is a critical task across domains such as cybersecurity and healthcare. In the unsupervised setting, an effective and theoretically-grounded principle is to train classifiers to distinguish normal data from (synthetic) anomalies. We extend this principle to semi-supervised AD, where training data also include a limited labeled subset of the anomalies that may be present at test time. We propose a theoretically-grounded and empirically effective framework for semi-supervised AD that combines known and synthetic anomalies during training. To analyze semi-supervised AD, we introduce its first mathematical formulation, which generalizes unsupervised AD. Here, we show that synthetic anomalies enable (i) better anomaly modeling in low-density regions and (ii) optimal convergence guarantees for neural network classifiers -- the first theoretical result for semi-supervised AD. We empirically validate our framework on five diverse benchmarks, observing consistent performance gains. These improvements also extend beyond our theoretical framework to other classification-based AD methods, validating the generalizability of the synthetic anomaly principle in AD.
Submitted 16 June, 2025;
originally announced June 2025.
-
Dynamical quantum phase transition with divergent multipartite entanglement
Authors:
Jie Chen,
Ricardo Costa de Almeida,
Hendrik Weimer
Abstract:
We investigate the nonequilibrium quench dynamics of the one-dimensional transverse-field Ising model in both integrable and nonintegrable regimes. In particular, we report on a novel type of dynamical quantum phase transition (DQPT) that is characterized by a divergent multipartite entanglement at critical times in the post-quench dynamics. We quantify the multipartite entanglement of the state by the quantum Fisher information and demonstrate that the DQPT belongs to a different universality class than the ground-state phase transition. Furthermore, we perform a spectral analysis of the DQPT and demonstrate that it is a genuine nonequilibrium transition arising from the constructive interference of excited states of the system during the many-body dynamics. Finally, we discuss potential experimental realizations in Rydberg platforms as well as applications in the context of quantum metrology.
Submitted 16 June, 2025;
originally announced June 2025.
-
XGraphRAG: Interactive Visual Analysis for Graph-based Retrieval-Augmented Generation
Authors:
Ke Wang,
Bo Pan,
Yingchaojie Feng,
Yuwei Wu,
Jieyi Chen,
Minfeng Zhu,
Wei Chen
Abstract:
Graph-based Retrieval-Augmented Generation (RAG) has shown great capability in enhancing the answers of Large Language Models (LLMs) with an external knowledge base. Compared to traditional RAG, it introduces a graph as an intermediate representation to better capture structured relational knowledge in the corpus, improving the precision and comprehensiveness of generation results. However, developers usually face challenges in analyzing the effectiveness of GraphRAG on their dataset due to GraphRAG's complex information processing pipeline and the overwhelming number of LLM invocations involved during graph construction and query, which limits GraphRAG's interpretability and accessibility. This research proposes a visual analysis framework that helps RAG developers identify critical recalls of GraphRAG and trace these recalls through the GraphRAG pipeline. Based on this framework, we develop XGraphRAG, a prototype system incorporating a set of interactive visualizations to facilitate users' analysis process, supporting the collection of failure cases and the identification of improvement opportunities. Our evaluation demonstrates the effectiveness and usability of our approach. Our work is open-sourced and available at https://github.com/Gk0Wk/XGraphRAG.
Submitted 10 June, 2025;
originally announced June 2025.
-
CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding
Authors:
Wenxuan Song,
Jiayi Chen,
Pengxiang Ding,
Yuxin Huang,
Han Zhao,
Donglin Wang,
Haoang Li
Abstract:
In recent years, Vision-Language-Action (VLA) models have become a vital research direction in robotics due to their impressive multimodal understanding and generalization capabilities. Despite the progress, their practical deployment is severely constrained by inference speed bottlenecks, particularly in high-frequency and dexterous manipulation tasks. While recent studies have explored Jacobi decoding as a more efficient alternative to traditional autoregressive decoding, its practical benefits are marginal due to the lengthy iterations. To address it, we introduce consistency distillation training to predict multiple correct action tokens in each iteration, thereby achieving acceleration. Besides, we design mixed-label supervision to mitigate the error accumulation during distillation. Although distillation brings acceptable speedup, we identify that certain inefficient iterations remain a critical bottleneck. To tackle this, we propose an early-exit decoding strategy that moderately relaxes convergence conditions, which further improves average inference efficiency. Experimental results show that the proposed method achieves more than 4 times inference acceleration across different baselines while maintaining high task success rates in both simulated and real-world robot tasks. These experiments validate that our approach provides an efficient and general paradigm for accelerating multimodal decision-making in robotics. Our project page is available at https://irpn-eai.github.io/CEED-VLA/.
Submitted 16 June, 2025;
originally announced June 2025.
-
OneRec Technical Report
Authors:
Guorui Zhou,
Jiaxin Deng,
Jinghao Zhang,
Kuo Cai,
Lejian Ren,
Qiang Luo,
Qianqian Wang,
Qigen Hu,
Rui Huang,
Shiyao Wang,
Weifeng Ding,
Wuchao Li,
Xinchen Luo,
Xingmei Wang,
Zexuan Cheng,
Zixing Zhang,
Bin Zhang,
Boxuan Wang,
Chaoyi Ma,
Chengru Song,
Chenhui Wang,
Di Wang,
Dongxue Meng,
Fan Yang,
Fangyu Zhang
, et al. (40 additional authors not shown)
Abstract:
Recommender systems have been widely used in various large-scale user-oriented platforms for many years. However, compared to the rapid developments in the AI community, recommendation systems have not achieved a breakthrough in recent years. For instance, they still rely on a multi-stage cascaded architecture rather than an end-to-end approach, leading to computational fragmentation and optimization inconsistencies, and hindering the effective application of key breakthrough technologies from the AI community in recommendation scenarios.
To address these issues, we propose OneRec, which reshapes the recommendation system through an end-to-end generative approach and achieves promising results. Firstly, we have enhanced the computational FLOPs of the current recommendation model by 10 $\times$ and have identified the scaling laws for recommendations within certain boundaries. Secondly, reinforcement learning techniques, previously difficult to apply for optimizing recommendations, show significant potential in this framework. Lastly, through infrastructure optimizations, we have achieved 23.7% and 28.8% Model FLOPs Utilization (MFU) on flagship GPUs during training and inference, respectively, aligning closely with the LLM community. This architecture significantly reduces communication and storage overhead, resulting in operating expense that is only 10.6% of traditional recommendation pipelines. Deployed in Kuaishou/Kuaishou Lite APP, it handles 25% of total queries per second, enhancing overall App Stay Time by 0.54% and 1.24%, respectively. Additionally, we have observed significant increases in metrics such as 7-day Lifetime, which is a crucial indicator of recommendation experience. We also provide practical lessons and insights derived from developing, optimizing, and maintaining a production-scale recommendation system with significant real-world impact.
Submitted 16 June, 2025;
originally announced June 2025.
-
Seismic Acoustic Impedance Inversion Framework Based on Conditional Latent Generative Diffusion Model
Authors:
Jie Chen,
Hongling Chen,
Jinghuai Gao,
Chuangji Meng,
Tao Yang,
XinXin Liang
Abstract:
Seismic acoustic impedance plays a crucial role in lithological identification and subsurface structure interpretation. However, due to the inherently ill-posed nature of the inversion problem, directly estimating impedance from post-stack seismic data remains highly challenging. Recently, diffusion models have shown great potential in addressing such inverse problems due to their strong prior learning and generative capabilities. Nevertheless, most existing methods operate in the pixel domain and require multiple iterations, limiting their applicability to field data. To alleviate these limitations, we propose a novel seismic acoustic impedance inversion framework based on a conditional latent generative diffusion model, where the inversion process is performed in latent space. To avoid introducing additional training overhead when embedding conditional inputs, we integrate a lightweight wavelet-based module into the framework to project seismic data and reuse an encoder trained on impedance to embed low-frequency impedance into the latent space. Furthermore, we propose a model-driven sampling strategy during the inversion process of this framework to enhance accuracy and reduce the number of required diffusion steps. Numerical experiments on a synthetic model demonstrate that the proposed method achieves high inversion accuracy and strong generalization capability within only a few diffusion steps. Moreover, application to field data reveals enhanced geological detail and higher consistency with well-log measurements, validating the effectiveness and practicality of the proposed approach.
△ Less
Submitted 16 June, 2025;
originally announced June 2025.
-
Self-Supervised Enhancement for Depth from a Lightweight ToF Sensor with Monocular Images
Authors:
Laiyan Ding,
Hualie Jiang,
Jiwei Chen,
Rui Huang
Abstract:
Depth map enhancement using paired high-resolution RGB images offers a cost-effective solution for improving low-resolution depth data from lightweight ToF sensors. Nevertheless, naively adopting a depth estimation pipeline to fuse the two modalities requires groundtruth depth maps for supervision. To address this, we propose a self-supervised learning framework, SelfToF, which generates detailed and scale-aware depth maps. Starting from an image-based self-supervised depth estimation pipeline, we add low-resolution depth as inputs, design a new depth consistency loss, propose a scale-recovery module, and finally obtain a large performance boost. Furthermore, since the ToF signal sparsity varies in real-world applications, we upgrade SelfToF to SelfToF* with submanifold convolution and guided feature fusion. Consequently, SelfToF* maintains robust performance across varying sparsity levels in ToF data. Overall, our proposed method is both efficient and effective, as verified by extensive experiments on the NYU and ScanNet datasets. The code is available at https://github.com/denyingmxd/selftof.
Submitted 17 June, 2025; v1 submitted 16 June, 2025;
originally announced June 2025.
-
VLM-SFD: VLM-Assisted Siamese Flow Diffusion Framework for Dual-Arm Cooperative Manipulation
Authors:
Jiaming Chen,
Yiyu Jiang,
Aoshen Huang,
Yang Li,
Wei Pan
Abstract:
Dual-arm cooperative manipulation holds great promise for tackling complex real-world tasks that demand seamless coordination and adaptive dynamics. Despite substantial progress in learning-based motion planning, most approaches struggle to generalize across diverse manipulation tasks and adapt to dynamic, unstructured environments, particularly in scenarios involving interactions between two objects such as assembly, tool use, and bimanual grasping. To address these challenges, we introduce a novel VLM-Assisted Siamese Flow Diffusion (VLM-SFD) framework for efficient imitation learning in dual-arm cooperative manipulation. The proposed VLM-SFD framework exhibits outstanding adaptability, significantly enhancing the ability to rapidly adapt and generalize to diverse real-world tasks from only a minimal number of human demonstrations. Specifically, we propose a Siamese Flow Diffusion Network (SFDNet) that employs a dual-encoder-decoder Siamese architecture to embed two target objects into a shared latent space, while a diffusion-based conditioning process, conditioned by task instructions, generates two-stream object-centric motion flows that guide dual-arm coordination. We further design a dynamic task assignment strategy that seamlessly maps the predicted 2D motion flows into 3D space and incorporates a pre-trained vision-language model (VLM) to adaptively assign the optimal motion to each robotic arm over time. Experiments validate the effectiveness of the proposed method, demonstrating its ability to generalize to diverse manipulation tasks while maintaining high efficiency and adaptability. The code and demo videos are publicly available on our project website https://sites.google.com/view/vlm-sfd/.
Submitted 16 June, 2025;
originally announced June 2025.
-
ACM tilting bundles on a Geigle-Lenzing projective plane of type $(2,2,2,p)$
Authors:
Jianmin Chen,
Shiquan Ruan,
Weikang Weng
Abstract:
Let $\mathbb{X}$ be a Geigle-Lenzing projective plane of type $(2,2,2,p)$ and $\mathsf{coh} \mathbb{X}$ the category of coherent sheaves on $\mathbb{X}$. This paper is devoted to the study of ACM tilting bundles over $\mathbb{X}$, that is, tilting objects in the derived category $\mathsf{D}^{\rm b}(\mathsf{coh} \, \mathbb{X})$ that are also ACM bundles. We show that a tilting bundle consisting of line bundles is the $2$-canonical tilting bundle up to degree shift. We also provide a program to construct ACM tilting bundles, which give a rich source of (almost) $2$-representation infinite algebras. As an application, we give a classification result of ACM tilting bundles.
Submitted 16 June, 2025;
originally announced June 2025.
-
Measurement of the $Ω_c^0$ and $Ξ_c^0$ baryon lifetimes using hadronic $b$-baryon decays
Authors:
LHCb collaboration,
R. Aaij,
A. S. W. Abdelmotteleb,
C. Abellan Beteta,
F. Abudinén,
T. Ackernley,
A. A. Adefisoye,
B. Adeva,
M. Adinolfi,
P. Adlarson,
C. Agapopoulou,
C. A. Aidala,
Z. Ajaltouni,
S. Akar,
K. Akiba,
P. Albicocco,
J. Albrecht,
F. Alessio,
M. Alexander,
Z. Aliouche,
P. Alvarez Cartelle,
R. Amalric,
S. Amato,
J. L. Amey,
Y. Amhis
, et al. (1141 additional authors not shown)
Abstract:
The lifetimes of the $Ω_c^0$ and $Ξ_c^0$ baryons are measured using a $pp$ collision dataset collected by the LHCb experiment, corresponding to an integrated luminosity of $9~\rm{fb^{-1}}$. The charm baryons are produced in the fully reconstructed decay chains $Ω_b^- \rightarrow Ω_c^0 (\rightarrow pK^-K^-π^+)~π^-$ and $Ξ_b^- \rightarrow Ξ_c^0 (\rightarrow pK^-K^-π^+)~π^-$. The measurement uses topologically and kinematically similar $B^- \rightarrow D^0(\rightarrow K^-K^+π^-π^+)~π^-$ decays for normalisation. The measured lifetimes are
$τ_{Ω_c^0} = 276.3 \pm 19.4~\rm{(stat)} \pm 1.8~\rm{(syst)} \pm 0.7~(τ_{D^0})~\rm{fs}$,
$τ_{Ξ_c^0} = 149.2 \pm ~\,2.5~\rm{(stat)} \pm 0.9~\rm{(syst)} \pm 0.4~(τ_{D^0})~\rm{fs}$,
where the first uncertainty is statistical, the second systematic and the third due to the uncertainty of the $D^0$ lifetime. These results are consistent with previous measurements performed by the LHCb experiment.
Submitted 16 June, 2025;
originally announced June 2025.
-
A Game-Theoretic Negotiation Framework for Cross-Cultural Consensus in LLMs
Authors:
Guoxi Zhang,
Jiawei Chen,
Tianzhuo Yang,
Jiaming Ji,
Yaodong Yang,
Juntao Dai
Abstract:
The increasing prevalence of large language models (LLMs) is influencing global value systems. However, these models frequently exhibit a pronounced WEIRD (Western, Educated, Industrialized, Rich, Democratic) cultural bias due to lack of attention to minority values. This monocultural perspective may reinforce dominant values and marginalize diverse cultural viewpoints, posing challenges for the development of equitable and inclusive AI systems. In this work, we introduce a systematic framework designed to boost fair and robust cross-cultural consensus among LLMs. We model consensus as a Nash Equilibrium and employ a game-theoretic negotiation method based on Policy-Space Response Oracles (PSRO) to simulate an organized cross-cultural negotiation process. To evaluate this approach, we construct regional cultural agents using data transformed from the World Values Survey (WVS). Beyond the conventional model-level evaluation method, we further propose two quantitative metrics, Perplexity-based Acceptance and Values Self-Consistency, to assess consensus outcomes. Experimental results indicate that our approach generates consensus of higher quality while ensuring more balanced compromise compared to baselines. Overall, it mitigates WEIRD bias by guiding agents toward convergence through fair and gradual negotiation steps.
Submitted 16 June, 2025;
originally announced June 2025.
-
DVP-MVS++: Synergize Depth-Normal-Edge and Harmonized Visibility Prior for Multi-View Stereo
Authors:
Zhenlong Yuan,
Dapeng Zhang,
Zehao Li,
Chengxuan Qian,
Jianing Chen,
Yinda Chen,
Kehua Chen,
Tianlu Mao,
Zhaoxin Li,
Hao Jiang,
Zhaoqi Wang
Abstract:
Recently, patch deformation-based methods have demonstrated significant effectiveness in multi-view stereo due to their incorporation of deformable and expandable perception for reconstructing textureless areas. However, these methods generally focus on identifying reliable pixel correlations to mitigate matching ambiguity of patch deformation, while neglecting the deformation instability caused by edge-skipping and visibility occlusions, which may cause estimation deviations. To address these issues, we propose DVP-MVS++, an innovative approach that synergizes both depth-normal-edge aligned and harmonized cross-view priors for robust and visibility-aware patch deformation. Specifically, to avoid edge-skipping, we first apply DepthPro, Metric3Dv2 and Roberts operator to generate coarse depth maps, normal maps and edge maps, respectively. These maps are then aligned via an erosion-dilation strategy to produce fine-grained homogeneous boundaries for facilitating robust patch deformation. Moreover, we reformulate view selection weights as visibility maps, and then implement both an enhanced cross-view depth reprojection and an area-maximization strategy to help reliably restore visible areas and effectively balance deformed patches, thus acquiring harmonized cross-view priors for visibility-aware patch deformation. Additionally, we obtain geometry consistency by adopting both aggregated normals via view selection and projection depth differences via epipolar lines, and then employ SHIQ for highlight correction to enable geometry consistency with highlight-aware perception, thus improving reconstruction quality during the propagation and refinement stages. Evaluation results on ETH3D, Tanks & Temples and Strecha datasets demonstrate the state-of-the-art performance and robust generalization capability of our proposed method.
Submitted 16 June, 2025;
originally announced June 2025.
-
Reconfigurable Digital RRAM Logic Enables In-Situ Pruning and Learning for Edge AI
Authors:
Songqi Wang,
Yue Zhang,
Jia Chen,
Xinyuan Zhang,
Yi Li,
Ning Lin,
Yangu He,
Jichang Yang,
Yingjie Yu,
Yi Li,
Zhongrui Wang,
Xiaojuan Qi,
Han Wang
Abstract:
The human brain simultaneously optimizes synaptic weights and topology by growing, pruning, and strengthening synapses while performing all computation entirely in memory. In contrast, modern artificial-intelligence systems separate weight optimization from topology optimization and depend on energy-intensive von Neumann architectures. Here, we present a software-hardware co-design that bridges this gap. On the algorithmic side, we introduce a real-time dynamic weight-pruning strategy that monitors weight similarity during training and removes redundancies on the fly, reducing operations by 26.80% on MNIST and 59.94% on ModelNet10 without sacrificing accuracy (91.44% and 77.75%, respectively). On the hardware side, we fabricate a reconfigurable, fully digital compute-in-memory (CIM) chip based on 180 nm one-transistor-one-resistor (1T1R) RRAM arrays. Each array embeds flexible Boolean logic (NAND, AND, XOR, OR), enabling both convolution and similarity evaluation inside memory and eliminating all ADC/DAC overhead. The digital design achieves zero bit-error, reduces silicon area by 72.30% and overall energy by 57.26% compared to analogue RRAM CIM, and lowers energy by 75.61% and 86.53% on MNIST and ModelNet10, respectively, relative to an NVIDIA RTX 4090. Together, our co-design establishes a scalable brain-inspired paradigm for adaptive, energy-efficient edge intelligence in the future.
Submitted 16 June, 2025;
originally announced June 2025.
-
Implementing van der Waals forces for polytope particles in DEM simulations of clay
Authors:
Dominik Krengel,
Jian Chen,
Zhipeng Yu,
Hans-Georg Matuttis,
Takashi Matsushima
Abstract:
Clay minerals are non-spherical nano-scale particles that usually form flocculated, house-of-cards-like structures under the influence of inter-molecular forces. Numerical modeling of clays is still in its infancy as the required inter-particle forces are available only for spherical particles. A polytope approach would allow shape-accurate forces and torques while simultaneously being more performant. The Anandarajah solution provides an analytical formulation for van der Waals forces for cuboid particles but in its original form is not suitable for implementation in DEM simulations. In this work, we discuss the necessary changes for a functional implementation of the Anandarajah solution in a DEM simulation of rectangular particles and their extension to cuboid particles.
Submitted 15 June, 2025;
originally announced June 2025.
-
AFBS: Buffer Gradient Selection in Semi-asynchronous Federated Learning
Authors:
Chaoyi Lu,
Yiding Sun,
Jinqian Chen,
Zhichuan Yang,
Jiangming Pan,
Jihua Zhu
Abstract:
Asynchronous federated learning (AFL) accelerates training by eliminating the need to wait for stragglers, but its asynchronous nature introduces gradient staleness, where outdated gradients degrade performance. Existing solutions address this issue with gradient buffers, forming a semi-asynchronous framework. However, this approach struggles when buffers accumulate numerous stale gradients, as blindly aggregating all gradients can harm training. To address this, we propose AFBS (Asynchronous FL Buffer Selection), the first algorithm to perform gradient selection within buffers while ensuring privacy protection. Specifically, the client sends the random-projection-encrypted label distribution matrix before training, and the server performs client clustering based on it. During training, the server scores and selects gradients within each cluster based on their informational value, discarding low-value gradients to enhance semi-asynchronous federated learning. Extensive experiments in highly heterogeneous system and data environments demonstrate AFBS's superior performance compared to state-of-the-art methods. Notably, on the most challenging task, CIFAR-100, AFBS improves accuracy by up to 4.8% over the previous best algorithm and reduces the time to reach target accuracy by 75%.
Submitted 15 June, 2025;
originally announced June 2025.
-
Combining Self-attention and Dilation Convolutional for Semantic Segmentation of Coal Maceral Groups
Authors:
Zhenghao Xi,
Zhengnan Lv,
Yang Zheng,
Xiang Liu,
Zhuang Yu,
Junran Chen,
Jing Hu,
Yaqi Liu
Abstract:
The segmentation of coal maceral groups can be described as a semantic segmentation process of coal maceral group images, which is of great significance for studying the chemical properties of coal. Generally, existing semantic segmentation models of coal maceral groups stack parameters to achieve higher accuracy, which increases computational requirements and reduces model training efficiency. At the same time, because sampling coal maceral group images is specialized work and the images are diverse, obtaining enough samples for model training takes a long time and requires professional personnel. To address these issues, we have developed an IoT-based DA-VIT parallel network model. By utilizing this model, we can continuously broaden the dataset through IoT and achieve sustained improvement in the accuracy of coal maceral group segmentation. In addition, we decouple the parallel network from the backbone network to ensure the normal use of the backbone network during model data updates. Secondly, the DCSA mechanism of DA-VIT is introduced to enhance the local feature information of coal microscopic images. DCSA can decompose the large kernels of convolutional attention into multiple scales and reduce the number of parameters by 81.18%. Finally, we performed comparison and ablation experiments between DA-VIT and state-of-the-art methods on a range of evaluation metrics. Experimental results show that DA-VIT-Base achieves 92.14% pixel accuracy and 63.18% mIoU. The Params and FLOPs of DA-VIT-Tiny are 4.95M and 8.99G, respectively. On all evaluation metrics, the proposed DA-VIT outperforms other state-of-the-art methods.
Submitted 15 June, 2025;
originally announced June 2025.
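The abstract above reports pixel accuracy and mIoU. As a minimal sketch of how these two segmentation metrics are commonly computed from a confusion matrix (the function names and the toy labels are illustrative assumptions, not taken from the paper):

```python
def confusion_matrix(pred, truth, num_classes):
    """Count pixels per (true class, predicted class) pair."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, truth):
        cm[t][p] += 1
    return cm


def pixel_accuracy(cm):
    """Fraction of pixels whose predicted class equals the true class."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def mean_iou(cm):
    """Average of per-class intersection-over-union, skipping absent classes."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp   # predicted c, true class otherwise
        denom = tp + fp + fn
        if denom:
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Toy example: four pixels, two classes (hypothetical labels).
pred = [0, 0, 1, 1]
truth = [0, 1, 1, 1]
cm = confusion_matrix(pred, truth, num_classes=2)
print(pixel_accuracy(cm))        # 0.75
print(round(mean_iou(cm), 4))    # 0.5833
```

In the toy example, class 0 has an IoU of 1/2 and class 1 has an IoU of 2/3, so the mean is 7/12.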
-
Characterization of fiberwise bimeromorphism and specialization of bimeromorphic types I: the non-negative Kodaira dimension case
Authors:
Jian Chen,
Sheng Rao,
I-Hsun Tsai
Abstract:
Inspired by the recent works of M. Kontsevich--Y. Tschinkel and J. Nicaise--J. C. Ottem on specialization of birational types for smooth families (in the scheme category) and J. Kollár's work on fiberwise bimeromorphism, we focus on characterizing the fiberwise bimeromorphism and utilizing the characterization to investigate the specialization of bimeromorphic types for non-smooth families in the complex analytic setting. We provide some criteria for a bimeromorphic map between two families over the same base to be fiberwise bimeromorphic. By combining these criteria with ideas by D. Mumford--U. Persson and T. de Fernex--D. Fusi, as well as K. Timmerscheidt's approach via the relative Barlet cycle space theory, we establish the specialization of bimeromorphic types for locally Moishezon families with fibers having only canonical singularities and being of non-negative Kodaira dimension. These specialization results can easily lead to criteria for locally strongly bimeromorphic isotriviality. Throughout this paper, we unveil the connections among the four classical topics in bimeromorphic geometry: the deformation behavior of plurigenera (or even $1$-genus), fiberwise bimeromorphism, specialization of bimeromorphic types, and the bimeromorphic version of the deformation rigidity.
Submitted 14 June, 2025;
originally announced June 2025.
-
Exploring Audio Cues for Enhanced Test-Time Video Model Adaptation
Authors:
Runhao Zeng,
Qi Deng,
Ronghao Zhang,
Shuaicheng Niu,
Jian Chen,
Xiping Hu,
Victor C. M. Leung
Abstract:
Test-time adaptation (TTA) aims to boost the generalization capability of a trained model by conducting self-/unsupervised learning during the testing phase. While most existing TTA methods for video primarily utilize visual supervisory signals, they often overlook the potential contribution of inherent audio data. To address this gap, we propose a novel approach that incorporates audio information into video TTA. Our method capitalizes on the rich semantic content of audio to generate audio-assisted pseudo-labels, a new concept in the context of video TTA. Specifically, we propose an audio-to-video label mapping method by first employing pre-trained audio models to classify audio signals extracted from videos and then mapping the audio-based predictions to video label spaces through large language models, thereby establishing a connection between the audio categories and video labels. To effectively leverage the generated pseudo-labels, we present a flexible adaptation cycle that determines the optimal number of adaptation iterations for each sample, based on changes in loss and consistency across different views. This enables a customized adaptation process for each sample. Experimental results on two widely used datasets (UCF101-C and Kinetics-Sounds-C), as well as on two newly constructed audio-video TTA datasets (AVE-C and AVMIT-C) with various corruption types, demonstrate the superiority of our approach. Our method consistently improves adaptation performance across different video classification models and represents a significant step forward in integrating audio information into video TTA. Code: https://github.com/keikeiqi/Audio-Assisted-TTA.
Submitted 14 June, 2025;
originally announced June 2025.
-
Pushing the Limits of Safety: A Technical Report on the ATLAS Challenge 2025
Authors:
Zonghao Ying,
Siyang Wu,
Run Hao,
Peng Ying,
Shixuan Sun,
Pengyu Chen,
Junze Chen,
Hao Du,
Kaiwen Shen,
Shangkun Wu,
Jiwei Wei,
Shiyuan He,
Yang Yang,
Xiaohai Xu,
Ke Ma,
Qianqian Xu,
Qingming Huang,
Shi Lin,
Xun Wang,
Changting Lin,
Meng Han,
Yilei Jiang,
Siqi Lai,
Yaozhi Zheng,
Yifei Song
, et al. (22 additional authors not shown)
Abstract:
Multimodal Large Language Models (MLLMs) have enabled transformative advancements across diverse applications but remain susceptible to safety threats, especially jailbreak attacks that induce harmful outputs. To systematically evaluate and improve their safety, we organized the Adversarial Testing & Large-model Alignment Safety Grand Challenge (ATLAS) 2025. This technical report presents findings from the competition, which involved 86 teams testing MLLM vulnerabilities via adversarial image-text attacks in two phases: white-box and black-box evaluations. The competition results highlight ongoing challenges in securing MLLMs and provide valuable guidance for developing stronger defense mechanisms. The challenge establishes new benchmarks for MLLM safety evaluation and lays groundwork for advancing safer multimodal AI systems. The code and data for this challenge are openly available at https://github.com/NY1024/ATLAS_Challenge_2025.
Submitted 14 June, 2025;
originally announced June 2025.
-
Exploring the Secondary Risks of Large Language Models
Authors:
Jiawei Chen,
Zhengwei Fang,
Xiao Yang,
Chao Yu,
Zhaoxia Yin,
Hang Su
Abstract:
Ensuring the safety and alignment of Large Language Models is a significant challenge with their growing integration into critical applications and societal functions. While prior research has primarily focused on jailbreak attacks, less attention has been given to non-adversarial failures that subtly emerge during benign interactions. We introduce secondary risks, a novel class of failure modes marked by harmful or misleading behaviors during benign prompts. Unlike adversarial attacks, these risks stem from imperfect generalization and often evade standard safety mechanisms. To enable systematic evaluation, we introduce two risk primitives, verbose response and speculative advice, that capture the core failure patterns. Building on these definitions, we propose SecLens, a black-box, multi-objective search framework that efficiently elicits secondary risk behaviors by optimizing task relevance, risk activation, and linguistic plausibility. To support reproducible evaluation, we release SecRiskBench, a benchmark dataset of 650 prompts covering eight diverse real-world risk categories. Experimental results from extensive evaluations on 16 popular models demonstrate that secondary risks are widespread, transferable across models, and modality independent, emphasizing the urgent need for enhanced safety mechanisms to address benign yet harmful LLM behaviors in real-world deployments.
Submitted 14 June, 2025;
originally announced June 2025.
-
ExoStart: Efficient learning for dexterous manipulation with sensorized exoskeleton demonstrations
Authors:
Zilin Si,
Jose Enrique Chen,
M. Emre Karagozler,
Antonia Bronars,
Jonathan Hutchinson,
Thomas Lampe,
Nimrod Gileadi,
Taylor Howell,
Stefano Saliceti,
Lukasz Barczyk,
Ilan Olivarez Correa,
Tom Erez,
Mohit Shridhar,
Murilo Fernandes Martins,
Konstantinos Bousmalis,
Nicolas Heess,
Francesco Nori,
Maria Bauza Villalonga
Abstract:
Recent advancements in teleoperation systems have enabled high-quality data collection for robotic manipulators, showing impressive results in learning manipulation at scale. This progress suggests that extending these capabilities to robotic hands could unlock an even broader range of manipulation skills, especially if we could achieve the same level of dexterity that human hands exhibit. However, teleoperating robotic hands is far from a solved problem, as it presents a significant challenge due to the high degrees of freedom of robotic hands and the complex dynamics occurring during contact-rich settings. In this work, we present ExoStart, a general and scalable learning framework that leverages human dexterity to improve robotic hand control. In particular, we obtain high-quality data by collecting direct demonstrations without a robot in the loop using a sensorized low-cost wearable exoskeleton, capturing the rich behaviors that humans can demonstrate with their own hands. We also propose a simulation-based dynamics filter that generates dynamically feasible trajectories from the collected demonstrations and use the generated trajectories to bootstrap an auto-curriculum reinforcement learning method that relies only on simple sparse rewards. The ExoStart pipeline is generalizable and yields robust policies that transfer zero-shot to the real robot. Our results demonstrate that ExoStart can generate dexterous real-world hand skills, achieving a success rate above 50% on a wide range of complex tasks such as opening an AirPods case or inserting and turning a key in a lock. More details and videos can be found in https://sites.google.com/view/exostart.
Submitted 13 June, 2025;
originally announced June 2025.
-
DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs
Authors:
Bo-Cheng Chiu,
Jen-Jee Chen,
Yu-Chee Tseng,
Feng-Chi Chen
Abstract:
Large Language Models (LLMs) have recently been extended to the video domain, enabling sophisticated video-language understanding. However, existing Video LLMs often exhibit limitations in fine-grained temporal reasoning, restricting their ability to precisely attribute responses to specific video moments, especially under constrained supervision. We introduce DaMO, a data-efficient Video LLM explicitly designed for accurate temporal reasoning and multimodal understanding. At its core, the proposed Temporal-aware Fuseformer employs a hierarchical dual-stream architecture that progressively captures temporal dynamics within each modality and effectively fuses complementary visual and audio information. To further enhance computational efficiency, DaMO integrates a global residual that reduces spatial redundancy while preserving essential semantic details. We train DaMO via a structured four-stage progressive training paradigm, incrementally equipping the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities. This work also contributes multiple datasets augmented from existing ones with GPT-generated temporally grounded QA pairs for tasks requiring temporal supervision. Comprehensive experiments on temporal grounding and video QA benchmarks demonstrate that DaMO consistently surpasses prior methods, particularly in tasks demanding precise temporal alignment and reasoning. Our work establishes a promising direction for data-efficient video-language modeling.
Submitted 13 June, 2025;
originally announced June 2025.
-
FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation
Authors:
Zhuguanyu Wu,
Shihe Wang,
Jiayi Zhang,
Jiaxin Chen,
Yunhong Wang
Abstract:
Post-training quantization (PTQ) has stood out as a cost-effective and promising model compression paradigm in recent years, as it avoids computationally intensive model retraining. Nevertheless, current PTQ methods for Vision Transformers (ViTs) still suffer from significant accuracy degradation, especially under low-bit quantization. To address these shortcomings, we analyze the prevailing Hessian-guided quantization loss and uncover certain limitations of conventional Hessian approximations. Following the block-wise reconstruction framework, we propose a novel PTQ method for ViTs, dubbed FIMA-Q. Specifically, we first establish the connection between KL divergence and the Fisher information matrix (FIM), which enables fast computation of the quantization loss during reconstruction. We further propose an efficient FIM approximation method, namely DPLR-FIM, by employing the diagonal plus low-rank principle, and formulate the ultimate quantization loss. Our extensive experiments, conducted across various vision tasks with representative ViT-based architectures on public datasets, demonstrate that our method substantially improves accuracy compared to state-of-the-art approaches, especially in the case of low-bit quantization. The source code is available at https://github.com/ShiheWang/FIMA-Q.
Submitted 13 June, 2025;
originally announced June 2025.
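The abstract above mentions a diagonal plus low-rank approximation of the Fisher information matrix. The following is only a generic sketch of that principle, not the authors' DPLR-FIM: it assumes the matrix is approximated as diag(d) + U Uᵀ and evaluates the quadratic form δᵀ(diag(d) + U Uᵀ)δ without ever building the full matrix. The names `dplr_quadratic`, `delta`, `diag`, and `low_rank` are hypothetical.

```python
def dplr_quadratic(delta, diag, low_rank):
    """Evaluate delta^T (diag(d) + U U^T) delta.

    delta    -- perturbation vector (e.g. a quantization error), length n
    diag     -- diagonal entries d, length n
    low_rank -- columns of U, each a list of length n
    The cost is O(n * (1 + k)) for k columns instead of O(n^2).
    """
    # Diagonal part: sum_i d_i * delta_i^2
    diag_term = sum(d * x * x for d, x in zip(diag, delta))
    # Low-rank part: ||U^T delta||^2, one dot product per column of U
    low_rank_term = sum(
        sum(u_i * x for u_i, x in zip(u, delta)) ** 2 for u in low_rank
    )
    return diag_term + low_rank_term


# Toy example with n = 2 and a rank-1 correction (illustrative values).
print(dplr_quadratic(delta=[1, 2], diag=[3, 4], low_rank=[[1, 1]]))  # 28
```

In the toy example the diagonal part contributes 3·1 + 4·4 = 19 and the rank-1 part contributes (1 + 2)² = 9.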
-
Radii of spherical timelike geodesics in Kerr-Newman black holes
Authors:
Wei Huang,
Jun-Xu Chen,
Jia-Hui Huang
Abstract:
The existence, radii and radial stability of the equatorial and non-equatorial (particularly, the polar) spherical orbits are discussed for particles with different conserved energy. The radii of these orbits generally are solutions of a quintic polynomial equation with four dimensionless parameters. For the case with $γ=1$, we obtain the analytical expressions for the radii of the polar, equatorial and general orbits. The radial stability of the orbits outside the event horizon is also discussed. In the $(u, w, β)$ space, a no-orbit surface is found. When the parameters lie on this surface, there is no orbit outside the event horizon; otherwise, there is always one spherical orbit outside the event horizon. For the cases with $γ\neq1$, we focus on the study of polar and equatorial orbits. For polar orbits with $0<γ<1$, a boundary surface in $(u, w, γ)$ space is identified which determines the existence of spherical polar orbits outside the event horizon. Numerical results of the radii and radial stability of the polar orbits are shown for examples with specific values of $γ$. For polar orbits with $γ>1$, it is found that there is always one unstable orbit outside the event horizon. For equatorial orbits with $0<γ<1$, in each rotating case (prograde case and retrograde case), a boundary surface in $(u, w, γ)$ space is also identified which divides the parameter space into two regions: one region with two orbits (one stable and the other unstable) and the other with no orbit outside the event horizon. Parameters on the boundary surface correspond to ISCOs. An analytical formula for the ISCOs is derived by choosing $(w,γ)$ as independent variables. For equatorial orbits with $γ>1$, it is found that there is always one unstable orbit outside the event horizon.
Submitted 13 June, 2025;
originally announced June 2025.
-
Auto-Connect: Connectivity-Preserving RigFormer with Direct Preference Optimization
Authors:
Jingfeng Guo,
Jian Liu,
Jinnan Chen,
Shiwei Mao,
Changrong Hu,
Puhua Jiang,
Junlin Yu,
Jing Xu,
Qi Liu,
Lixin Xu,
Zhuo Chen,
Chunchao Guo
Abstract:
We introduce Auto-Connect, a novel approach for automatic rigging that explicitly preserves skeletal connectivity through a connectivity-preserving tokenization scheme. Unlike previous methods that predict bone positions represented as two joints or first predict points before determining connectivity, our method employs special tokens to define endpoints for each joint's children and for each hierarchical layer, effectively automating connectivity relationships. This approach significantly enhances topological accuracy by integrating connectivity information directly into the prediction framework. To further guarantee high-quality topology, we implement a topology-aware reward function that quantifies topological correctness, which is then utilized in a post-training phase through reward-guided Direct Preference Optimization. Additionally, we incorporate implicit geodesic features for latent top-k bone selection, which substantially improves skinning quality. By leveraging geodesic distance information within the model's latent space, our approach intelligently determines the most influential bones for each vertex, effectively mitigating common skinning artifacts. This combination of connectivity-preserving tokenization, reward-guided fine-tuning, and geodesic-aware bone selection enables our model to consistently generate more anatomically plausible skeletal structures with superior deformation properties.
Submitted 12 June, 2025;
originally announced June 2025.
-
GUIRoboTron-Speech: Towards Automated GUI Agents Based on Speech Instructions
Authors:
Wenkang Han,
Zhixiong Zeng,
Jing Huang,
Shu Jiang,
Liming Zheng,
Longrong Yang,
Haibo Qiu,
Chang Yao,
Jingyuan Chen,
Lin Ma
Abstract:
Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions imposes limitations on accessibility and convenience, particularly in hands-free scenarios. To address this gap, we propose GUIRoboTron-Speech, the first end-to-end autonomous GUI agent that directly accepts speech instructions and on-device screenshots to predict actions. Confronted with the scarcity of speech-based GUI agent datasets, we initially generated high-quality speech instructions for training by leveraging a random timbre text-to-speech (TTS) model to convert existing text instructions. We then develop GUIRoboTron-Speech's capabilities through progressive grounding and planning training stages. A key contribution is a heuristic mixed-instruction training strategy designed to mitigate the modality imbalance inherent in pre-trained foundation models. Comprehensive experiments on several benchmark datasets validate the robust and superior performance of GUIRoboTron-Speech, demonstrating the significant potential and widespread applicability of speech as an effective instruction modality for driving GUI agents. Our code and datasets are available at https://github.com/GUIRoboTron/GUIRoboTron-Speech.
Submitted 10 June, 2025;
originally announced June 2025.
-
MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning
Authors:
Yuxuan Luo,
Yuhui Yuan,
Junwen Chen,
Haonan Cai,
Ziyi Yue,
Yuwei Yang,
Fatima Zohra Daha,
Ji Li,
Zhouhui Lian
Abstract:
In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to probe the reasoning capability of image generation models. Knowledge images have been central to human civilization and to the mechanisms of human learning -- a fact underscored by dual-coding theory and the picture-superiority effect. Generating such images is challenging, demanding multimodal reasoning that fuses world knowledge with pixel-level grounding into clear explanatory visuals. To enable comprehensive evaluation, MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a target image's core entities and their dependencies. We further introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image generation models expose serious reasoning deficits -- low entity fidelity, weak relations, and clutter -- with GPT-4o achieving an MMMG-Score of only 50.20, underscoring the benchmark's difficulty. To spur further progress, we release FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs.
Submitted 13 June, 2025; v1 submitted 12 June, 2025;
originally announced June 2025.
-
SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks
Authors:
Lianghong Guo,
Yanlin Wang,
Caihua Li,
Pengyu Yang,
Jiachi Chen,
Wei Tao,
Yingtian Zou,
Duyu Tang,
Zibin Zheng
Abstract:
Constructing large-scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs). However, the traditional process for creating such benchmarks is notoriously challenging and labor-intensive, particularly in the stages of setting up evaluation environments, grading test outcomes, and validating task instances. In this paper, we propose SWE-Factory, an automated pipeline that addresses these challenges by integrating three core automated components. First, we introduce SWE-Builder, a multi-agent system that automates evaluation environment construction, which employs four specialized agents that work in a collaborative, iterative loop and leverages an environment memory pool to enhance efficiency. Second, we introduce a standardized, exit-code-based grading method that eliminates the need for manually writing custom parsers. Finally, we automate the fail2pass validation process using these reliable exit code signals. Experiments on 671 issues across four programming languages show that our pipeline can effectively construct valid task instances; for example, with GPT-4.1-mini, our SWE-Builder constructs 269 valid instances at $0.045 per instance, while with Gemini-2.5-flash, it achieves comparable performance at the lowest cost of $0.024 per instance. We also demonstrate that our exit-code-based grading achieves 100% accuracy compared to manual inspection, and our automated fail2pass validation reaches a precision of 0.92 and a recall of 1.00. We hope our automated pipeline will accelerate the collection of large-scale, high-quality GitHub issue resolution datasets for both training and evaluation. Our code and datasets are released at https://github.com/DeepSoftwareAnalytics/swe-factory.
Submitted 12 June, 2025;
originally announced June 2025.
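The abstract above describes exit-code-based grading and fail2pass validation, and reports precision and recall for the latter. A minimal sketch of the idea follows, assuming the usual convention that exit code 0 means the test run passed. The function names are hypothetical and are not taken from the SWE-Factory code base.

```python
def passed(exit_code):
    """Grade a test run purely by its process exit code (0 = pass)."""
    return exit_code == 0


def is_fail2pass(exit_before_patch, exit_after_patch):
    """An instance is fail2pass when tests fail before the fix and pass after it."""
    return (not passed(exit_before_patch)) and passed(exit_after_patch)


def precision_recall(predicted, actual):
    """Compare automated validity labels against manually checked labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative exit codes (before patch, after patch) for three instances.
runs = [(1, 0), (2, 0), (0, 0)]
predicted = [is_fail2pass(before, after) for before, after in runs]
print(predicted)  # [True, True, False]

# Hypothetical manual labels for the same three instances.
manual = [True, False, False]
print(precision_recall(predicted, manual))  # (0.5, 1.0)
```

Grading on the exit code alone avoids parsing framework-specific test logs, which is the motivation the abstract gives for the approach.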
-
RationalVLA: A Rational Vision-Language-Action Model with Dual System
Authors:
Wenxuan Song,
Jiayi Chen,
Wenxue Li,
Xu He,
Han Zhao,
Can Cui,
Pengxiang Ding,
Shiyan Su,
Feilong Tang,
Xuelian Cheng,
Donglin Wang,
Zongyuan Ge,
Xinhu Zheng,
Zhe Liu,
Hesheng Wang,
Haoang Li
Abstract:
A fundamental requirement for real-world robotic deployment is the ability to understand and respond to natural language instructions. Existing language-conditioned manipulation tasks typically assume that instructions are perfectly aligned with the environment. This assumption limits robustness and generalization in realistic scenarios where instructions may be ambiguous, irrelevant, or infeasible. To address this problem, we introduce RAtional MAnipulation (RAMA), a new benchmark that challenges models with both unseen executable instructions and defective ones that should be rejected. In RAMA, we construct a dataset with over 14,000 samples, including diverse defective instructions spanning six dimensions: visual, physical, semantic, motion, safety, and out-of-context. We further propose the Rational Vision-Language-Action model (RationalVLA). It is a dual system for robotic arms that integrates the high-level vision-language model with the low-level manipulation policy by introducing learnable latent space embeddings. This design enables RationalVLA to reason over instructions, reject infeasible commands, and execute manipulation effectively. Experiments demonstrate that RationalVLA outperforms state-of-the-art baselines on RAMA by a 14.5% higher success rate and 0.94 average task length, while maintaining competitive performance on standard manipulation tasks. Real-world trials further validate its effectiveness and robustness in practical applications. Our project page is https://irpn-eai.github.io/RationalVLA.
Submitted 13 June, 2025; v1 submitted 12 June, 2025;
originally announced June 2025.
-
OPT-BENCH: Evaluating LLM Agent on Large-Scale Search Spaces Optimization Problems
Authors:
Xiaozhe Li,
Jixuan Chen,
Xinyu Fang,
Shengyuan Ding,
Haodong Duan,
Qingwen Liu,
Kai Chen
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities in solving diverse tasks. However, their proficiency in iteratively optimizing complex solutions through learning from previous feedback remains insufficiently explored. To bridge this gap, we present OPT-BENCH, a comprehensive benchmark designed to evaluate LLM agents on large-scale search space optimization problems. OPT-BENCH includes 20 real-world machine learning tasks sourced from Kaggle and 10 classical NP problems, offering a diverse and challenging environment for assessing LLM agents on iterative reasoning and solution refinement. To enable rigorous evaluation, we introduce OPT-Agent, an end-to-end optimization framework that emulates human reasoning when tackling complex problems by generating, validating, and iteratively improving solutions through leveraging historical feedback. Through extensive experiments on 9 state-of-the-art LLMs from 6 model families, we analyze the effects of optimization iterations, temperature settings, and model architectures on solution quality and convergence. Our results demonstrate that incorporating historical context significantly enhances optimization performance across both ML and NP tasks. All datasets, code, and evaluation tools are open-sourced to promote further research in advancing LLM-driven optimization and iterative reasoning. Project page: https://github.com/OliverLeeXZ/OPT-BENCH.
Submitted 12 June, 2025;
originally announced June 2025.
-
Electric field control of third-order nonlinear Hall effect
Authors:
Jiaju Yang,
Lujun Wei,
Yanghui Li,
Lina Chen,
Wei Niu,
Jiarui Chen,
Jun Du,
Yong Pu
Abstract:
The third-order nonlinear Hall effect (NLHE) serves as a sensitive probe of the geometric properties of energy bands, providing a new paradigm for revealing the Berry curvature distribution and topological response of quantum materials. In the Weyl semimetal TaIrTe4, we report for the first time that the sign of the third-order NLHE reverses with decreasing temperature. Through scaling law analysis, we infer that the third-order NLHE at high (T > 23 K) and low (T < 23 K) temperatures is dominated by Berry-connection polarizability (BCP) and impurity scattering, respectively. The third-order NLHE response strength can be effectively modulated by an additional in-plane constant electric field. In the high-temperature region, the BCP reduction induced by the electric field leads to a decrease in the third-order NLHE response strength, while in the low-temperature region, the electric field causes both BCP and impurity scattering effects to weaken, resulting in a more significant modulation of the third-order NLHE response strength. At 4 K and an electric field strength of 0.3 kV/cm, the modulated relative response strength can reach up to 65.3%. This work provides a new means to explore the third-order NLHE and a valuable reference for the development of novel electronic devices.
Submitted 12 June, 2025; v1 submitted 12 June, 2025;
originally announced June 2025.
-
Unitary Scrambling and Collapse: A Quantum Diffusion Framework for Generative Modeling
Authors:
Yihua Li,
Jiayi Chen,
Tamanna S. Kumavat,
Kyriakos Flouris
Abstract:
Quantum computing, with its promise of exponential speedups, is rapidly emerging as a powerful paradigm for advancing artificial intelligence. We propose QSC-Diffusion, the first fully quantum diffusion-based framework for image generation. Our method integrates classical Gaussian noise with quantum scrambling in the forward process, and employs parameterized quantum circuits with measurement-induced collapse for reverse denoising -- enabling end-to-end sampling without reliance on classical neural architectures or preprocessing modules. To address optimization challenges in deep quantum models, we introduce a hybrid loss that balances fidelity and diversity, coupled with a divide-and-conquer training strategy to mitigate barren plateaus. Remarkably, QSC-Diffusion achieves competitive FID scores across multiple datasets while using orders of magnitude fewer parameters, outperforming even some quantum-classical hybrid baselines in efficiency. These results highlight the potential of quantum-native generative modeling and mark a foundational step toward scalable quantum machine learning.
Submitted 12 June, 2025;
originally announced June 2025.
-
Rethinking Generative Human Video Coding with Implicit Motion Transformation
Authors:
Bolin Chen,
Ru-Ling Liao,
Jie Chen,
Yan Ye
Abstract:
Beyond traditional hybrid video codecs, generative video codecs can achieve promising compression performance by evolving high-dimensional signals into compact feature representations for bitstream compactness at the encoder side and developing explicit motion fields as intermediate supervision for high-quality reconstruction at the decoder side. This paradigm has achieved significant success in face video compression. However, compared to facial videos, human body videos pose greater challenges due to their more complex and diverse motion patterns: when explicit motion guidance is used for Generative Human Video Coding (GHVC), the reconstruction results can suffer from severe distortions and inaccurate motion. As such, this paper highlights the limitations of explicit motion-based approaches for human body video compression and investigates the GHVC performance improvement with the aid of Implicit Motion Transformation (IMT). In particular, we propose to characterize complex human body signals as compact visual features and transform these features into implicit motion guidance for signal reconstruction. Experimental results demonstrate the effectiveness of the proposed IMT paradigm, which enables GHVC to achieve high-efficiency compression and high-fidelity synthesis.
Submitted 12 June, 2025;
originally announced June 2025.
-
Photon-mediated interactions by Floquet photonic lattices
Authors:
Jia-Qiang Chen,
Peng-Bo Li,
Álvaro Gómez-León,
Alejandro González-Tudela
Abstract:
We investigate the interactions between two-level emitters mediated by time-dependent, one-dimensional, structured photonic baths, focusing on Floquet topological lattices. Building on the framework of periodically driven photonic lattices, we demonstrate and characterize the emergence of tunable-range emitter interactions mediated by bound states absent in static photonic lattices. In particular, we show that one can obtain not only spatial interaction dependencies different from those of the static bath scenarios, but also interactions in qualitatively different regimes due to the time-dependent nature of the bath, for example, when the emitters have different frequencies. This work sheds light on the interplay between non-equilibrium photonics and quantum optics and can serve as the basis for analyzing Floquet photonic lattices in higher dimensions.
Submitted 12 June, 2025;
originally announced June 2025.