Search | arXiv e-print repository

FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation

Authors: Huihan Wang, Zhiwen Yang, Hui Zhang, Dan Zhao, Bingzheng Wei, Yan Xu

Abstract: Synthesizing high-quality dynamic medical videos remains a significant challenge due to the need for modeling both spatial consistency and temporal dynamics. Existing Transformer-based approaches face critical limitations, including insufficient channel interactions, high computational complexity from self-attention, and coarse denoising guidance from timestep embeddings when handling varying noise levels. In this work, we propose FEAT, a full-dimensional efficient attention Transformer, which addresses these issues through three key innovations: (1) a unified paradigm with sequential spatial-temporal-channel attention mechanisms to capture global dependencies across all dimensions, (2) a linear-complexity design for attention mechanisms in each dimension, utilizing weighted key-value attention and global channel attention, and (3) a residual value guidance module that provides fine-grained pixel-level guidance to adapt to different noise levels. We evaluate FEAT on standard benchmarks and downstream tasks, demonstrating that FEAT-S, with only 23% of the parameters of the state-of-the-art model Endora, achieves comparable or even superior performance. Furthermore, FEAT-L surpasses all comparison methods across multiple datasets, showcasing both superior effectiveness and scalability. Code is available at https://github.com/Yaziwel/FEAT.

Submitted 5 June, 2025; originally announced June 2025.

Comments: This paper has been early accepted by MICCAI 2025

arXiv:2506.04941 [pdf, ps, other]

ArtVIP: Articulated Digital Assets of Visual Realism, Modular Interaction, and Physical Fidelity for Robot Learning

Authors: Zhao Jin, Zhengping Che, Zhen Zhao, Kun Wu, Yuheng Zhang, Yinuo Zhao, Zehui Liu, Qiang Zhang, Xiaozhu Ju, Jing Tian, Yousong Xue, Jian Tang

Abstract: Robot learning increasingly relies on simulation to advance complex abilities such as dexterous manipulation and precise interaction, necessitating high-quality digital assets to bridge the sim-to-real gap. However, existing open-source articulated-object datasets for simulation are limited by insufficient visual realism and low physical fidelity, which hinder their utility for training models to master robotic tasks in the real world. To address these challenges, we introduce ArtVIP, a comprehensive open-source dataset comprising high-quality digital-twin articulated objects, accompanied by indoor-scene assets. Crafted by professional 3D modelers adhering to unified standards, ArtVIP ensures visual realism through precise geometric meshes and high-resolution textures, while physical fidelity is achieved via fine-tuned dynamic parameters. Meanwhile, the dataset pioneers embedded modular interaction behaviors within assets and pixel-level affordance annotations. Feature-map visualization and optical motion capture are employed to quantitatively demonstrate ArtVIP's visual and physical fidelity, with its applicability validated across imitation learning and reinforcement learning experiments. Provided in USD format with detailed production guidelines, ArtVIP is fully open-source, benefiting the research community and advancing robot learning research. Our project is at https://x-humanoid-artvip.github.io/.

Submitted 5 June, 2025; v1 submitted 5 June, 2025; originally announced June 2025.

arXiv:2506.04924 [pdf, ps, other]

Predicting ICU In-Hospital Mortality Using Adaptive Transformer Layer Fusion

Authors: Han Wang, Ruoyun He, Guoguang Lao, Ting Liu, Hejiao Luo, Changqi Qin, Hongying Luo, Junmin Huang, Zihan Wei, Lu Chen, Yongzhi Xu, Ziqian Bi, Junhao Song, Tianyang Wang, Chia Xin Liang, Xinyuan Song, Huafeng Liu, Junfeng Hao, Chunjie Tian

Abstract: Early identification of high-risk ICU patients is crucial for directing limited medical resources. We introduce ALFIA (Adaptive Layer Fusion with Intelligent Attention), a modular, attention-based architecture that jointly trains LoRA (Low-Rank Adaptation) adapters and an adaptive layer-weighting mechanism to fuse multi-layer semantic features from a BERT backbone. Trained on our rigorous cw-24 (CriticalWindow-24) benchmark, ALFIA surpasses state-of-the-art tabular classifiers in AUPRC while preserving a balanced precision-recall profile. The embeddings produced by ALFIA's fusion module, capturing both fine-grained clinical cues and high-level concepts, enable seamless pairing with GBDTs (CatBoost/LightGBM) as ALFIA-boost, and deep neural networks as ALFIA-nn, yielding additional performance gains. Our experiments confirm ALFIA's superior early-warning performance. By operating directly on routine clinical text, it furnishes clinicians with a convenient yet robust tool for risk stratification and timely intervention in critical-care settings.

Submitted 6 June, 2025; v1 submitted 5 June, 2025; originally announced June 2025.

Comments: 21 pages, 6 figures

arXiv:2506.04850 [pdf, ps, other]

doi 10.1051/0004-6361/202452550

The Chinese Pulsar Timing Array data release I. Single pulsar noise analysis

Authors: Siyuan Chen, Heng Xu, Yanjun Guo, Bojun Wang, R. Nicolas Caballero, Jinchen Jiang, Jiangwei Xu, Zihan Xue, Kejia Lee, Jianping Yuan, Yonghua Xu, Jingbo Wang, Longfei Hao, Jintao Luo, Jinlin Han, Peng Jiang, Zhiqiang Shen, Min Wang, Na Wang, Renxin Xu, Xiangping Wu, Lei Qian, Xin Guan, Menglin Huang, Chun Sun, et al. (1 additional author not shown)

Abstract: The Chinese Pulsar Timing Array (CPTA) has collected observations from 57 millisecond pulsars using the Five-hundred-meter Aperture Spherical Radio Telescope (FAST) for close to three years, for the purpose of searching for gravitational waves (GWs). To robustly search for ultra-low-frequency GWs, pulsar timing arrays (PTAs) need to use models to describe the noise from the individual pulsars. We report on the results from the single pulsar noise analysis of the CPTA data release I (DR1). Conventionally, power laws in the frequency domain are used to describe pulsar red noise and dispersion measurement (DM) variations over time. Employing Bayesian methods, we found the choice of number and range of frequency bins with the highest evidence for each pulsar individually. A comparison between a dataset using DM piecewise measured (DMX) values and a power-law Gaussian process to describe the DM variations shows strong Bayesian evidence in favour of the power-law model. Furthermore, we demonstrate that the constraints obtained from four independent software packages are very consistent with each other. The short time span of the CPTA DR1, paired with the large sensitivity of FAST, has proved to be a challenge for the conventional noise model using a power law. This mainly shows in the difficulty to separate different noise terms due to their covariances with each other. Nineteen pulsars are found to display covariances between the short-term white noise and long-term red and DM noise. With future CPTA datasets, we expect that the degeneracy can be broken.
Finally, we compared the CPTA DR1 results against the noise properties found by other PTA collaborations. While we can see broad agreement, there is some tension between different PTA datasets for some of the overlapping pulsars. This could be due to the differences in the methods and frequency range compared to the other PTAs.

Submitted 5 June, 2025; originally announced June 2025.

Comments: 17 pages, 4 figures, 10 tables

arXiv:2506.04660 [pdf]

doi 10.1177/14780771251335110

Adaptive recycled plastic architecture: Vacuum-Sealed Chainmail Structures Through Computational Design

Authors: Yi Xu, Farzin Lotfi-Jam, Mustafa Faruki

Abstract: The construction industry is a major consumer of raw materials, accounting for nearly half of global material usage annually, while generating significant waste that poses sustainability challenges. This paper explores the untapped potential of recycled plastics as a primary construction material, leveraging their lightweight, flexible, and customizable properties for advanced applications in modular chainmail systems. Through a computational workflow, the study optimizes the design, testing, and fabrication of vacuum-sealed chainmail structures composed of recycled plastic filaments, demonstrating their adaptability and structural performance for architectural use. Key contributions include a novel methodology for integrating recycled plastic filaments into chainmail geometries, validated through 2D sectional testing, 3D shell structure generation, and physical modeling under vacuum constraints. The research identifies the rectangular chainmail configuration as the most efficient and adaptable, achieving superior deformation capacity, material efficiency, and load-bearing performance. Optimization strategies for temporary structures highlight practical deployment potential, balancing material savings, usable area, and water drainage efficiency. The findings offer a foundation for innovative applications in extreme conditions, including disaster-prone areas, high-altitude environments, underwater platforms, and extraterrestrial habitats.
These applications leverage the lightweight, adaptable, and durable properties of recycled plastics and modular chainmail systems, bridging the gap between waste management and high-performance design while addressing unique challenges in harsh and resource-constrained environments.

Submitted 5 June, 2025; originally announced June 2025.

Comments: Accepted manuscript. Published in International Journal of Architectural Computing, April 2025

ACM Class: J.6; I.2.10

arXiv:2506.04592 [pdf, ps, other]

Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification

Authors: Chengwu Liu, Ye Yuan, Yichun Yin, Yan Xu, Xin Xu, Zaoyu Chen, Yasheng Wang, Lifeng Shang, Qun Liu, Ming Zhang

Abstract: Chain-of-Thought (CoT) prompting has become the de facto method to elicit reasoning capabilities from large language models (LLMs). However, to mitigate hallucinations in CoT that are notoriously difficult to detect, current methods such as process reward models (PRMs) or self-consistency operate as opaque boxes and do not provide checkable evidence for their judgments, possibly limiting their effectiveness. To address this issue, we draw inspiration from the idea that "the gold standard for supporting a mathematical claim is to provide a proof". We propose a retrospective, step-aware formal verification framework $Safe$. Rather than assigning arbitrary scores, we strive to articulate mathematical claims in formal mathematical language Lean 4 at each reasoning step and provide formal proofs to identify hallucinations. We evaluate our framework $Safe$ across multiple language models and various mathematical datasets, demonstrating a significant performance improvement while offering interpretable and verifiable evidence. We also propose $FormalStep$ as a benchmark for step correctness theorem proving with $30,809$ formal statements. To the best of our knowledge, our work represents the first endeavor to utilize formal mathematical language Lean 4 for verifying natural language content generated by LLMs, aligning with the reason why formal mathematical languages were created in the first place: to provide a robust foundation for hallucination-prone human-written proofs.

Submitted 4 June, 2025; originally announced June 2025.

Comments: Accepted in ACL 2025

arXiv:2506.03673 [pdf, ps, other]

Reason from Future: Reverse Thought Chain Enhances LLM Reasoning

Authors: Yinlong Xu, Yanzhao Zheng, Shuoshuo Sun, Shuaihan Huang, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu, Hongxia Xu, Jian Wu

Abstract: It has been demonstrated that carefully designed reasoning paradigms, like Chain-of-Thought (CoT) and Tree-of-Thought (ToT), can enhance the reasoning capabilities of small language models through detailed thinking and extensive thought searching. However, unbounded branching factors in the search space create prohibitive reasoning costs, and these methods fall into the trap of local optimum reasoning, which means the model lacks a global perspective while solving problems. We propose a novel reasoning paradigm called Reason from Future (RFF), which generates reasoning paths by bidirectional reasoning that combines top-down planning with bottom-up reasoning accumulation. The essence of RFF lies in its reverse reasoning mechanism, which prioritizes core logical relationships and imposes goal-oriented constraints on intermediate steps, thereby reducing the search space and mitigating the error accumulation inherent in sequential forward reasoning. Empirical evaluations across diverse experiments demonstrate that RFF outperforms conventional paradigms, solving complex tasks with higher accuracy and a smaller search space.

Submitted 4 June, 2025; originally announced June 2025.

Comments: Accepted by ACL 2025 findings

arXiv:2506.03509 [pdf, ps, other]

doi 10.1103/k2t5-wcq1

Multiband superconductivity in the topological Kramers nodal-line semimetals

Authors: Tian Shang, Jianzhou Zhao, Keqi Xia, Lun-Hui Hu, Yang Xu, Qingfeng Zhan, Dariusz Jakub Gawryluk, Toni Shiroka

Abstract: Recent band-structure calculations predict that the ruthenium-based ternary silicides are three-dimensional Kramers nodal line semimetals. Among them, NbRuSi and TaRuSi show bulk superconductivity (SC) below $T_c \sim 3$ K and 4 K, as well as spontaneous magnetic fields. The latter indicates the breaking of time-reversal symmetry and, thus, unconventional SC in both compounds. Previous temperature-dependent muon-spin spectroscopy studies failed to distinguish whether such compounds exhibit single-gap or multi-gap SC. Here, we report on systematic measurements of the field-dependent muon-spin relaxation rates in the superconducting state and on temperature-dependent electrical resistivity and specific heat under applied magnetic fields. Both the upper critical field and the field-dependent superconducting relaxation are well described by a two-band model. By combining our experimental results with numerical band-structure calculations, we provide solid evidence for multiband SC in NbRuSi and TaRuSi, and thus offer further insight into the unconventional and topological nature of their superconductivity.

Submitted 3 June, 2025; originally announced June 2025.

Comments: 9 pages, 8 figures

Journal ref: Phys. Rev. B 111, 214516 (2025)

arXiv:2506.03474 [pdf, ps, other]

CORE: Constraint-Aware One-Step Reinforcement Learning for Simulation-Guided Neural Network Accelerator Design

Authors: Yifeng Xiao, Yurong Xu, Ning Yan, Masood Mortazavi, Pierluigi Nuzzo

Abstract: Simulation-based design space exploration (DSE) aims to efficiently optimize high-dimensional structured designs under complex constraints and expensive evaluation costs. Existing approaches, including heuristic and multi-step reinforcement learning (RL) methods, struggle to balance sampling efficiency and constraint satisfaction due to sparse, delayed feedback, and large hybrid action spaces. In this paper, we introduce CORE, a constraint-aware, one-step RL method for simulation-guided DSE. In CORE, the policy agent learns to sample design configurations by defining a structured distribution over them, incorporating dependencies via a scaling-graph-based decoder, and by reward shaping to penalize invalid designs based on the feedback obtained from simulation. CORE updates the policy using a surrogate objective that compares the rewards of designs within a sampled batch, without learning a value function. This critic-free formulation enables efficient learning by encouraging the selection of higher-reward designs. We instantiate CORE for hardware-mapping co-design of neural network accelerators, demonstrating that it significantly improves sample efficiency and achieves better accelerator configurations compared to state-of-the-art baselines. Our approach is general and applicable to a broad class of discrete-continuous constrained design problems.

Submitted 3 June, 2025; originally announced June 2025.

Comments: Preprint. 10 pages + appendix. Submitted to NeurIPS 2025

ACM Class: I.2.6; C.3
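
The critic-free update described in the CORE abstract compares rewards within a sampled batch instead of learning a value function. The abstract does not give the surrogate objective, so the following is only a minimal sketch of one common way to do this: shape the rewards with a penalty for invalid designs, then centre them on the batch mean. The function name `batch_relative_advantages` and the constant `invalid_penalty` are hypothetical and are not taken from the paper.

```python
import statistics


def batch_relative_advantages(rewards, valid, invalid_penalty=1.0):
    """Shape rewards and centre them on the batch mean (no value function).

    rewards: simulator scores for a sampled batch of designs.
    valid:   whether each design satisfied the constraints.
    invalid_penalty: hypothetical shaping constant subtracted from the
        reward of every invalid design (an assumption for illustration).
    """
    shaped = [r if ok else r - invalid_penalty for r, ok in zip(rewards, valid)]
    baseline = statistics.fmean(shaped)  # batch mean replaces a learned critic
    return [s - baseline for s in shaped]


# Four sampled designs; the third one violates a constraint.
adv = batch_relative_advantages([0.9, 0.4, 0.7, 0.8], [True, True, False, True])
print([round(a, 3) for a in adv])
```

In a policy-gradient step, each of these values would weight the log-probability of the corresponding design, so designs that beat the batch average are made more likely and invalid ones less likely.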

arXiv:2506.03450 [pdf, ps, other]

SENMAP: Multi-objective data-flow mapping and synthesis for hybrid scalable neuromorphic systems

Authors: Prithvish V Nembhani, Oliver Rhodes, Guangzhi Tang, Alexandra F Dobrita, Yingfu Xu, Kanishkan Vadivel, Kevin Shidqi, Paul Detterer, Mario Konijnenburg, Gert-Jan van Schaik, Manolis Sifalakis, Zaid Al-Ars, Amirreza Yousefzadeh

Abstract: This paper introduces SENMap, a mapping and synthesis tool for scalable, energy-efficient neuromorphic computing architecture frameworks. SENECA is a flexible architectural design optimized for executing edge AI SNN/ANN inference applications efficiently. To speed up the silicon tape-out and chip design for SENECA, an accurate emulator, SENSIM, was designed. While SENSIM supports direct mapping of SNNs on neuromorphic architectures, as SNNs and ANNs grow in size, achieving optimal mapping for objectives like energy, throughput, area, and accuracy becomes challenging. This paper introduces SENMap, flexible mapping software for efficiently mapping large SNN and ANN applications onto adaptable architectures. SENMap considers architectural, pretrained SNN and ANN realistic examples, and event rate-based parameters and is open-sourced along with SENSIM to aid flexible neuromorphic chip design before fabrication. Experimental results show SENMap enables 40 percent energy improvements for a baseline SENSIM operating in timestep asynchronous mode of operation. SENMap is designed in such a way that it facilitates mapping large spiking neural networks for future modifications as well.

Submitted 16 June, 2025; v1 submitted 3 June, 2025; originally announced June 2025.

Comments: IJCNN conference, Italy, 2025, accepted, 30 June - 5 July

arXiv:2506.03408 [pdf, ps, other]

Trajectory Prediction Meets Large Language Models: A Survey

Authors: Yi Xu, Ruining Yang, Yitian Zhang, Yizhou Wang, Jianglin Lu, Mingyuan Zhang, Lili Su, Yun Fu

Abstract: Recent advances in large language models (LLMs) have sparked growing interest in integrating language-driven techniques into trajectory prediction. By leveraging their semantic and reasoning capabilities, LLMs are reshaping how autonomous systems perceive, model, and predict trajectories. This survey provides a comprehensive overview of this emerging field, categorizing recent work into five directions: (1) Trajectory prediction via language modeling paradigms, (2) Direct trajectory prediction with pretrained language models, (3) Language-guided scene understanding for trajectory prediction, (4) Language-driven data generation for trajectory prediction, (5) Language-based reasoning and interpretability for trajectory prediction. For each, we analyze representative methods, highlight core design choices, and identify open challenges. This survey bridges natural language processing and trajectory prediction, offering a unified perspective on how language can enrich trajectory prediction.

Submitted 3 June, 2025; originally announced June 2025.

Comments: 16 pages, GitHub: https://github.com/colorfulfuture/Awesome-Trajectory-Motion-Prediction-Papers

arXiv:2506.02969 [pdf, ps, other]

Measurement of the branching fractions of the Cabibbo-favored decays $Λ_{c}^{+}\toΛK_{S}^{0}K^{+}$ and $Λ_{c}^{+}\toΞ^{0}K_{S}^{0}π^{+}$ and search for $Λ_{c}^{+}\toΣ^{0} K_{S}^{0}K^{+}$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai, et al. (660 additional authors not shown)

Abstract: Based on $e^{+}e^{-}$ collision data corresponding to an integrated luminosity of about 4.5 fb$^{-1}$ collected at center-of-mass energies between 4599.53 MeV and 4698.82 MeV with the BESIII detector, the absolute branching fraction of the Cabibbo-favored decay $Λ_{c}^{+}\toΛK_{S}^{0}K^{+}$ is measured to be $(3.12\pm0.46\pm0.15)\times10^{-3}$. Combined with a previous measurement from the BESIII Collaboration, the branching fraction of the decay $Λ_{c}^{+}\toΛK_{S}^{0}K^{+}$ is calculated to be $(3.07\pm0.26\pm0.13)\times10^{-3}$. The decay $Λ_{c}^{+}\toΞ^{0}K_{S}^{0}π^{+}$ is observed for the first time with a statistical significance of $6.6σ$, and its branching fraction is determined to be $(3.70\pm0.60\pm0.21)\times10^{-3}$. In addition, a search for the decay $Λ_{c}^{+}\toΣ^{0} K_{S}^{0}K^{+}$ is performed and its branching fraction is determined to be $(0.80^{+0.28}_{-0.24}\pm0.16)\times10^{-3}$, corresponding to an upper limit of $1.28\times10^{-3}$ at $90\%$ confidence level. These measurements provide new information that can be used to distinguish between theoretical models.

Submitted 3 June, 2025; originally announced June 2025.

arXiv:2506.02521 [pdf, ps, other]

Improved Measurements of $D^+ \to ηe^+ν_e$ and $D^+ \to ημ^+ν_μ$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai, et al. (682 additional authors not shown)

Abstract: Using 20.3 fb$^{-1}$ of $e^+e^-$ collision data collected at the center-of-mass energy of 3.773 GeV with the BESIII detector, we measure the branching fractions of $D^+\to ηe^+ν_e$ and $D^+\to ημ^+ν_μ$ to be $(9.75\pm0.29\pm0.28)\times10^{-4}$ and $(9.08\pm0.35\pm0.23)\times10^{-4}$, where the first and second uncertainties are statistical and systematic, respectively. From a simultaneous fit to their partial decay rates, we determine the product of the hadronic form factor $f^η_+(0)$ and the modulus of the $c\to d$ Cabibbo-Kobayashi-Maskawa matrix element $|V_{cd}|$ to be $f^η_+(0)|V_{cd}|=0.078\pm0.002\pm0.001$. Taking the $|V_{cd}|$ value from the Standard Model global fit as input, we obtain $f^η_+(0)=0.345\pm0.008\pm0.003$. The ratio between the measured branching fractions of $D^+\toημ^+ν_μ$ and $D^+\toηe^+ν_e$ is determined to be $0.93\pm0.05_{\rm stat.}\pm0.02_{\rm syst.}$, indicating no violation of lepton flavor universality.

Submitted 3 June, 2025; originally announced June 2025.
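
The lepton-flavor-universality ratio quoted in the abstract above can be cross-checked from the two branching fractions. The sketch below is an illustrative calculation, not the collaboration's procedure: it assumes the two statistical uncertainties are uncorrelated and ignores the systematic term, part of which would be expected to cancel in the ratio. The helper name `ratio_with_stat` is hypothetical.

```python
import math

# Branching fractions quoted in the abstract, in units of 1e-4,
# as (central value, statistical uncertainty).
bf_mu = (9.08, 0.35)  # D+ -> eta mu+ nu_mu
bf_e = (9.75, 0.29)   # D+ -> eta e+ nu_e


def ratio_with_stat(num, den):
    """Ratio of two measurements with a naive statistical uncertainty.

    Assumes uncorrelated statistical uncertainties, which is a
    simplification made here for illustration only.
    """
    r = num[0] / den[0]
    rel = math.hypot(num[1] / num[0], den[1] / den[0])  # relative errors in quadrature
    return r, r * rel


ratio, stat = ratio_with_stat(bf_mu, bf_e)
print(f"R = {ratio:.2f} +/- {stat:.2f} (stat.)")  # prints "R = 0.93 +/- 0.05 (stat.)"
```

Under these assumptions the naive propagation reproduces the quoted central value and statistical uncertainty.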

arXiv:2506.02498 [pdf, ps, other]

Electronic structures and magnetism in van der Waals flat-band material Ni$_{3}$GeTe$_{2}$

Authors: Yuanji Xu, Xintao Jin, Haoyuan Tang, Fuyang Tian

Abstract: The study of magnetism in two-dimensional materials has garnered significant interest, driven by fundamental investigations into low-dimensional magnetic phenomena and their potential for applications in spintronic devices. Through dynamical mean-field theory calculations, we demonstrate that Ni$_{3}$GeTe$_{2}$ exhibits flat-band characteristics resulting from the geometric frustration of its layered triangular lattice. These flat bands are further renormalized due to electronic correlation. Our calculations reveal that the magnetic order of Ni atoms is significantly influenced by both the Coulomb interaction and Hund's coupling, indicating that the physics of Ni atoms is situated in an intermediate region between Hundness and Mottness. Additionally, our results show that Ni atoms experience significant spin fluctuations in their local moments, maintaining paramagnetism at low temperatures. Furthermore, we investigate the effect of vacancies, finding a substantial suppression of the density of states at the Fermi level. The physical mechanisms uncovered by our study provide a comprehensive understanding of the novel properties exhibited in this material.

Submitted 21 June, 2025; v1 submitted 3 June, 2025; originally announced June 2025.

arXiv:2506.02020 [pdf, other]

Improve Multi-Modal Embedding Learning via Explicit Hard Negative Gradient Amplifying

Authors: Youze Xue, Dian Li, Gang Liu

Abstract: With the rapid advancement of multi-modal large language models (MLLMs) in recent years, the foundational Contrastive Language-Image Pretraining (CLIP) framework has been successfully extended to MLLMs, enabling more powerful and universal multi-modal embeddings for a wide range of retrieval tasks. Despite these developments, the core contrastive learning paradigm remains largely unchanged from CLIP-style models to MLLMs. Within this framework, the effective mining of hard negative samples continues to be a critical factor for enhancing performance. Prior works have introduced both offline and online strategies for hard negative mining to improve the efficiency of contrastive learning. While these approaches have led to improved multi-modal embeddings, the specific contribution of each hard negative sample to the learning process has not been thoroughly investigated. In this work, we conduct a detailed analysis of the gradients of the info-NCE loss with respect to the query, positive, and negative samples, elucidating the role of hard negatives in updating model parameters. Building upon this analysis, we propose to explicitly amplify the gradients associated with hard negative samples, thereby encouraging the model to learn more discriminative embeddings. Our multi-modal embedding model, trained with the proposed Explicit Gradient Amplifier and based on the LLaVA-OneVision-7B architecture, achieves state-of-the-art performance on the MMEB benchmark compared to previous methods utilizing the same MLLM backbone.
Furthermore, when integrated with our self-developed MLLM, QQMM, our approach attains the top rank on the MMEB leaderboard. Code and models are released on https://github.com/QQ-MM/QQMM-embed.

Submitted 28 May, 2025; originally announced June 2025.
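
Illustrative note on the entry above: the abstract refers to the gradients of the info-NCE loss with respect to the positive and negative samples. The sketch below is not the authors' Explicit Gradient Amplifier; it only computes the textbook gradients of a standard info-NCE loss with respect to each similarity score, to show why harder negatives (those with higher similarity to the query) already receive larger gradient weights. The function name, the temperature value, and the example similarities are assumptions made for illustration.

```python
import math

def info_nce_similarity_grads(sim_pos, sim_negs, tau=0.1):
    """Gradients of a standard info-NCE loss w.r.t. each similarity score.

    Loss: L = -log( exp(s_pos/tau) / sum_j exp(s_j/tau) ), where the sum
    runs over the positive and all negatives.
    Hypothetical helper for illustration; not taken from the paper.
    """
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]         # softmax over positive + negatives
    grad_pos = (probs[0] - 1.0) / tau     # dL/ds_pos, always <= 0
    grad_negs = [p / tau for p in probs[1:]]  # dL/ds_neg_j, grows with similarity
    return grad_pos, grad_negs

# One hard negative (similarity 0.7) and one easy negative (similarity 0.1).
grad_pos, grad_negs = info_nce_similarity_grads(0.8, [0.7, 0.1], tau=0.1)
```

In this standard form the hard negative's gradient weight is far larger than the easy negative's, and the gradients over all similarity scores sum to zero.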

arXiv:2506.01968 [pdf, ps, other]

Efficient ANN-SNN Conversion with Error Compensation Learning

Authors: Chang Liu, Jiangrong Shen, Xuming Ran, Mingkun Xu, Qi Xu, Yi Xu, Gang Pan

Abstract: Artificial neural networks (ANNs) have demonstrated outstanding performance in numerous tasks, but deployment in resource-constrained environments remains a challenge due to their high computational and memory requirements. Spiking neural networks (SNNs) operate through discrete spike events and offer superior energy efficiency, providing a bio-inspired alternative. However, current ANN-to-SNN conversion often results in significant accuracy loss and increased inference time due to conversion errors such as clipping, quantization, and uneven activation. This paper proposes a novel ANN-to-SNN conversion framework based on error compensation learning. We introduce a learnable threshold clipping function, dual-threshold neurons, and an optimized membrane potential initialization strategy to mitigate the conversion error. Together, these techniques address the clipping error through adaptive thresholds, dynamically reduce the quantization error through dual-threshold neurons, and minimize the non-uniformity error by effectively managing the membrane potential. Experimental results on the CIFAR-10, CIFAR-100, and ImageNet datasets show that our method achieves high precision and ultra-low latency among existing conversion methods. Using only two time steps, our method significantly reduces the inference time while maintaining a competitive accuracy of 94.75% on the CIFAR-10 dataset with a ResNet-18 architecture. This research promotes the practical application of SNNs on low-power hardware, making efficient real-time processing possible.

Submitted 12 May, 2025; originally announced June 2025.

arXiv:2506.01829 [pdf, ps, other]

CiteEval: Principle-Driven Citation Evaluation for Source Attribution

Authors: Yumo Xu, Peng Qi, Jifan Chen, Kunlun Liu, Rujun Han, Lan Liu, Bonan Min, Vittorio Castelli, Arshit Gupta, Zhiguo Wang

Abstract: Citation quality is crucial in information-seeking systems, directly influencing trust and the effectiveness of information access. Current evaluation frameworks, both human and automatic, mainly rely on Natural Language Inference (NLI) to assess binary or ternary supportiveness from cited sources, which we argue is a suboptimal proxy for citation evaluation. In this work we introduce CiteEval, a citation evaluation framework driven by principles focusing on fine-grained citation assessment within a broad context, encompassing not only the cited sources but the full retrieval context, user query, and generated text. Guided by the proposed framework, we construct CiteBench, a multi-domain benchmark with high-quality human annotations on citation quality. To enable efficient evaluation, we further develop CiteEval-Auto, a suite of model-based metrics that exhibit strong correlation with human judgments. Experiments across diverse systems demonstrate CiteEval-Auto's superior ability to capture the multifaceted nature of citations compared to existing metrics, offering a principled and scalable approach to evaluate and improve model-generated citations.

Submitted 2 June, 2025; originally announced June 2025.

Comments: ACL 2025

arXiv:2506.01725 [pdf, ps, other]

VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking

Authors: Desen Meng, Rui Huang, Zhilin Dai, Xinhao Li, Yifan Xu, Jun Zhang, Zhenpeng Huang, Meng Zhang, Lingshu Zhang, Yi Liu, Limin Wang

Abstract: While recent advances in reinforcement learning have significantly enhanced reasoning capabilities in large language models (LLMs), these techniques remain underexplored in multi-modal LLMs for video captioning. This paper presents the first systematic investigation of GRPO-based RL post-training for video MLLMs, with the goal of enhancing video MLLMs' capability of describing actions in videos. Specifically, we develop VideoCap-R1, which is prompted to first perform structured thinking that analyzes video subjects with their attributes and actions before generating complete captions, supported by two specialized reward mechanisms: an LLM-free think scorer evaluating the structured thinking quality and an LLM-assisted caption scorer assessing the output quality. The RL training framework effectively establishes the connection between structured reasoning and comprehensive description generation, enabling the model to produce captions with more accurate actions. Our experiments demonstrate that VideoCap-R1 achieves substantial improvements over the Qwen2VL-7B baseline using limited samples (1.5k) across multiple video caption benchmarks (DREAM1K: +4.4 event F1, VDC: +4.2 Acc, CAREBENCH: +3.1 action F1, +6.9 object F1) while consistently outperforming the SFT-trained counterparts, confirming GRPO's superiority in enhancing MLLMs' captioning capabilities.

Submitted 2 June, 2025; originally announced June 2025.

arXiv:2506.01293 [pdf, ps, other]

Abstractive Visual Understanding of Multi-modal Structured Knowledge: A New Perspective for MLLM Evaluation

Authors: Yichi Zhang, Zhuo Chen, Lingbing Guo, Yajing Xu, Min Zhang, Wen Zhang, Huajun Chen

Abstract: Multi-modal large language models (MLLMs) incorporate heterogeneous modalities into LLMs, enabling a comprehensive understanding of diverse scenarios and objects. Despite the proliferation of evaluation benchmarks and leaderboards for MLLMs, they predominantly overlook the critical capacity of MLLMs to comprehend world knowledge with structured abstractions that appear in visual form. To address this gap, we propose a novel evaluation paradigm and devise M3STR, an innovative benchmark grounded in the Multi-Modal Map for STRuctured understanding. This benchmark leverages multi-modal knowledge graphs to synthesize images encapsulating subgraph architectures enriched with multi-modal entities. M3STR necessitates that MLLMs not only recognize the multi-modal entities within the visual inputs but also decipher intricate relational topologies among them. We delineate the benchmark's statistical profiles and automated construction pipeline, accompanied by an extensive empirical analysis of 26 state-of-the-art MLLMs. Our findings reveal persistent deficiencies in processing abstractive visual information with structured knowledge, thereby charting a pivotal trajectory for advancing MLLMs' holistic reasoning capacities. Our code and data are released at https://github.com/zjukg/M3STR

Submitted 2 June, 2025; originally announced June 2025.

Comments: Work in progress

arXiv:2506.00688 [pdf, ps, other]

Existing Large Language Model Unlearning Evaluations Are Inconclusive

Authors: Zhili Feng, Yixuan Even Xu, Alexander Robey, Robert Kirk, Xander Davies, Yarin Gal, Avi Schwarzschild, J. Zico Kolter

Abstract: Machine unlearning aims to remove sensitive or undesired data from large language models. However, recent studies suggest that unlearning is often shallow, claiming that removed knowledge can easily be recovered. In this work, we critically examine standard unlearning evaluation practices and uncover key limitations that shake our trust in those findings. First, we show that some evaluations introduce substantial new information into the model, potentially masking true unlearning performance by re-teaching the model during testing. Second, we demonstrate that evaluation outcomes vary significantly across tasks, undermining the generalizability of current evaluation routines. Finally, we find that many evaluations rely on spurious correlations, making their results difficult to trust and interpret. Taken together, these issues suggest that current evaluation protocols may both overstate and understate unlearning success. To address this, we propose two principles for future unlearning evaluations: minimal information injection and downstream task awareness. We validate these principles through a series of targeted experiments, showing how violations of each can lead to misleading conclusions.

Submitted 31 May, 2025; originally announced June 2025.

arXiv:2506.00677 [pdf, ps, other]

Review of Blockchain-Based Approaches to Spent Fuel Management in Nuclear Power Plants

Authors: Yuxiang Xu, Wenjuan Yu, Yuqian Wan, Zhongming Zhang

Abstract: This study addresses critical challenges in managing the transportation of spent nuclear fuel, including inadequate data transparency, stringent confidentiality requirements, and a lack of trust among collaborating parties, issues prevalent in traditional centralized management systems. Given the high risks involved, balancing data confidentiality with regulatory transparency is imperative. To overcome these limitations, a prototype system integrating blockchain technology and the Internet of Things (IoT) is proposed, featuring a multi-tiered consortium chain architecture. This system utilizes IoT sensors for real-time data collection, which is immutably recorded on the blockchain, while a hierarchical data structure (operational, supervisory, and public layers) manages access for diverse stakeholders. The results demonstrate that this approach significantly enhances data immutability, enables real-time multi-sensor data integration, improves decentralized transparency, and increases resilience compared to traditional systems. Ultimately, this blockchain-IoT framework improves the safety, transparency, and efficiency of spent fuel transportation, effectively resolving the conflict between confidentiality and transparency in nuclear data management and offering significant practical implications.

Submitted 31 May, 2025; originally announced June 2025.

arXiv:2506.00569 [pdf, ps, other]

AutoMixAlign: Adaptive Data Mixing for Multi-Task Preference Optimization in LLMs

Authors: Nicholas E. Corrado, Julian Katz-Samuels, Adithya Devraj, Hyokun Yun, Chao Zhang, Yi Xu, Yi Pan, Bing Yin, Trishul Chilimbi

Abstract: When aligning large language models (LLMs), their performance on various tasks (such as being helpful, harmless, and honest) depends heavily on the composition of their training data. However, selecting a data mixture that achieves strong performance across all tasks is challenging. Existing approaches rely on large ablation studies, heuristics, or human intuition, but these can be prohibitively expensive and suboptimal. We study this problem in the setting of preference optimization via DPO and introduce AutoMixAlign (AMA), a theoretically-grounded algorithm that adaptively mixes datasets during training to balance performance across tasks. AMA first trains specialist models for each task to determine losses that correspond to strong task performance. Then, it trains a generalist model using a novel minimax optimization that prioritizes tasks for which generalist model losses deviate most from specialist model losses. To optimize this problem, we propose two algorithms: (1) AMA-R, which adaptively reweights the objective to prioritize tasks, and (2) AMA-S, which adaptively adjusts how much data is sampled from each task to prioritize tasks. Both algorithms achieve a convergence rate of $O(1/\sqrt{T})$ in the convex case. AMA-R's convergence result follows from Sagawa et al. (2019), and we provide a convergence proof for AMA-S using online learning techniques such as EXP3. We evaluate AMA on several multitask alignment setups and find that AMA outperforms the standard alignment approach (which simply optimizes the total loss across all tasks) and also outperforms model merging methods.

Submitted 31 May, 2025; originally announced June 2025.

Comments: ACL 2025, Main Conference

arXiv:2506.00225 [pdf, other]

Understanding while Exploring: Semantics-driven Active Mapping

Authors: Liyan Chen, Huangying Zhan, Hairong Yin, Yi Xu, Philippos Mordohai

Abstract: Effective robotic autonomy in unknown environments demands proactive exploration and precise understanding of both geometry and semantics. In this paper, we propose ActiveSGM, an active semantic mapping framework designed to predict the informativeness of potential observations before execution. Built upon a 3D Gaussian Splatting (3DGS) mapping backbone, our approach employs semantic and geometric uncertainty quantification, coupled with a sparse semantic representation, to guide exploration. By enabling robots to strategically select the most beneficial viewpoints, ActiveSGM efficiently enhances mapping completeness, accuracy, and robustness to noisy semantic data, ultimately supporting more adaptive scene exploration. Our experiments on the Replica and Matterport3D datasets highlight the effectiveness of ActiveSGM in active semantic mapping tasks.

Submitted 30 May, 2025; originally announced June 2025.

arXiv:2505.24823 [pdf, other]

PhySense: Principle-Based Physics Reasoning Benchmarking for Large Language Models

Authors: Yinggan Xu, Yue Liu, Zhiqiang Gao, Changnan Peng, Di Luo

Abstract: Large language models (LLMs) have rapidly advanced and are increasingly capable of tackling complex scientific problems, including those in physics. Despite this progress, current LLMs often fail to emulate the concise, principle-based reasoning characteristic of human experts, instead generating lengthy and opaque solutions. This discrepancy highlights a crucial gap in their ability to apply core physical principles for efficient and interpretable problem solving. To systematically investigate this limitation, we introduce PhySense, a novel principle-based physics reasoning benchmark designed to be easily solvable by experts using guiding principles, yet deceptively difficult for LLMs without principle-first reasoning. Our evaluation across multiple state-of-the-art LLMs and prompt types reveals a consistent failure to align with expert-like reasoning paths, providing insights for developing AI systems with efficient, robust and interpretable principle-based scientific reasoning.

Submitted 30 May, 2025; originally announced May 2025.

arXiv:2505.24466 [pdf, ps, other]

SA-Person: Text-Based Person Retrieval with Scene-aware Re-ranking

Authors: Yingjia Xu, Jinlin Wu, Zhen Chen, Daming Gao, Yang Yang, Zhen Lei, Min Cao

Abstract: Text-based person retrieval aims to identify a target individual from a gallery of images based on a natural language description. It presents a significant challenge due to the complexity of real-world scenes and the ambiguity of appearance-related descriptions. Existing methods primarily emphasize appearance-based cross-modal retrieval, often neglecting the contextual information embedded within the scene, which can offer valuable complementary insights for retrieval. To address this, we introduce SCENEPERSON-13W, a large-scale dataset featuring over 100,000 scenes with rich annotations covering both pedestrian appearance and environmental cues. Based on this, we propose SA-Person, a two-stage retrieval framework. In the first stage, it performs discriminative appearance grounding by aligning textual cues with pedestrian-specific regions. In the second stage, it introduces SceneRanker, a training-free, scene-aware re-ranking method leveraging multimodal large language models to jointly reason over pedestrian appearance and the global scene context. Experiments on SCENEPERSON-13W validate the effectiveness of our framework in challenging scene-level retrieval scenarios. The code and dataset will be made publicly available.

Submitted 26 June, 2025; v1 submitted 30 May, 2025; originally announced May 2025.

Comments: 22 pages, 7 figures. Under review

arXiv:2505.24307 [pdf, ps, other]

Multi-Waveguide Pinching Antennas for ISAC

Authors: Weihao Mao, Yang Lu, Yanqing Xu, Bo Ai, Octavia A. Dobre, Dusit Niyato

Abstract: Recently, a novel flexible-antenna technology, called pinching antennas, has attracted growing academic interest. By inserting discrete dielectric materials, pinching antennas can be activated at arbitrary points along waveguides, allowing for flexible customization of large-scale path loss. This paper investigates a multi-waveguide pinching-antenna integrated sensing and communications (ISAC) system, where transmit pinching antennas (TPAs) and receive pinching antennas (RPAs) coordinate to simultaneously detect one potential target and serve one downlink user. We formulate a communication rate maximization problem subject to a radar signal-to-noise ratio (SNR) requirement, a transmit power budget, and the allowable movement region of the TPAs, by jointly optimizing TPA locations and transmit beamforming design. To address the non-convexity of the problem, we propose a novel fine-tuning approximation method to reformulate it into a tractable form, followed by a successive convex approximation (SCA)-based algorithm to obtain the solution efficiently. Extensive simulations validate both the system design and the proposed algorithm. Results show that the proposed method achieves near-optimal performance compared with the computationally intensive exhaustive search-based benchmark, and pinching-antenna ISAC systems exhibit a distinct communication-sensing trade-off compared with conventional systems.

Submitted 30 May, 2025; originally announced May 2025.

arXiv:2505.24221 [pdf, ps, other]

FOCUS: Boosting Schema-aware Access for KV Stores via Hierarchical Data Management

Authors: Zhen Liu, Wenzhe Zhu, Yongkun Li, Yinlong Xu

Abstract: Persistent key-value (KV) stores are critical infrastructure for data-intensive applications. Leveraging high-performance Non-Volatile Memory (NVM) to enhance KV stores has gained traction. However, previous work has primarily focused on optimizing KV stores themselves, without adequately addressing their integration into applications. Consequently, existing applications, represented by NewSQL databases, still resort to a flat mapping approach, which simply maps structured records into flat KV pairs to use KV stores. Such semantic mismatch may cause significant I/O amplification and I/O splitting under production workloads, harming the performance. To this end, we propose FOCUS, a log-structured KV store optimized for fine-grained hierarchical data organization and schema-aware access. FOCUS introduces a hierarchical KV model to provide native support for upper-layer structured data. We implemented FOCUS from scratch. Experiments show that FOCUS can increase throughput by 2.1-5.9x compared to mainstream NVM-backed KV stores under YCSB SQL workloads.

Submitted 30 May, 2025; originally announced May 2025.

arXiv:2505.24175 [pdf, ps, other]

Photometric redshift estimation for emission line galaxies of DESI Legacy Imaging Surveys by CNN-MLP

Authors: Shirui Wei, Changhua Li, Yanxia Zhang, Chenzhou Cui, Chao Tang, Jingyi Zhang, Yongheng Zhao, Xuebing Wu, Yihan Tao, Dongwei Fan, Shanshan Li, Yunfei Xu, Maoyuan Huang, Xingyu Yang, Zihan Kang, Jinghang Shi

Abstract: Emission Line Galaxies (ELGs) are crucial for cosmological studies, particularly in understanding the large-scale structure of the Universe and the role of dark energy. ELGs form an essential component of the target catalogue for the Dark Energy Spectroscopic Instrument (DESI), a major astronomical survey. However, the accurate selection of ELGs for such surveys is challenging due to the inherent uncertainties in determining their redshifts with photometric data. In order to improve the accuracy of photometric redshift estimation for ELGs, we propose a novel approach CNN-MLP that combines Convolutional Neural Networks (CNNs) with Multilayer Perceptrons (MLPs). This approach integrates both images and photometric data derived from the DESI Legacy Imaging Surveys Data Release 10. By leveraging the complementary strengths of CNNs (for image data processing) and MLPs (for photometric feature integration), the CNN-MLP model achieves a $\sigma_{\mathrm{NMAD}}$ (normalised median absolute deviation) of 0.0140 and an outlier fraction of 2.57%. Compared to other models, CNN-MLP demonstrates a significant improvement in the accuracy of ELG photometric redshift estimation, which directly benefits the target selection process for DESI. In addition, we explore the photometric redshifts of different galaxy types (Starforming, Starburst, AGN, Broadline). Furthermore, this approach will contribute to more reliable photometric redshift estimation in ongoing and future large-scale sky surveys (e.g. LSST, CSST, Euclid), enhancing the overall efficiency of cosmological research and galaxy surveys.

Submitted 29 May, 2025; originally announced May 2025.

Comments: 15 pages, 10 figures, 8 tables, accepted for publication in PASA
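
Illustrative note on the entry above: the abstract reports a normalised median absolute deviation and an outlier fraction without defining them. The sketch below uses one commonly used convention for photometric-redshift metrics; the exact definitions in the paper may differ. The scale factor 1.4826, the outlier threshold of 0.15, the function name, and the sample values are all assumptions made for illustration.

```python
from statistics import median

def photoz_metrics(z_phot, z_spec, outlier_threshold=0.15):
    """Return (sigma_nmad, outlier_fraction) under an assumed common convention.

    dz_i       = (z_phot_i - z_spec_i) / (1 + z_spec_i)
    sigma_nmad = 1.4826 * median(|dz_i - median(dz)|)
    outliers   = fraction of objects with |dz_i| > outlier_threshold
    Hypothetical helper; the threshold and scale factor are assumptions.
    """
    dz = [(zp - zs) / (1.0 + zs) for zp, zs in zip(z_phot, z_spec)]
    centre = median(dz)
    sigma_nmad = 1.4826 * median([abs(d - centre) for d in dz])
    outlier_fraction = sum(abs(d) > outlier_threshold for d in dz) / len(dz)
    return sigma_nmad, outlier_fraction

# Made-up redshifts: four good estimates and one catastrophic outlier.
z_spec = [1.0, 1.0, 1.0, 1.0, 1.0]
z_phot = [1.0, 1.02, 0.98, 1.0, 2.0]
sigma_nmad, outlier_fraction = photoz_metrics(z_phot, z_spec)
```

Because it is built from medians, sigma_nmad is barely affected by the single catastrophic estimate, which is why the outlier fraction is reported alongside it.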

arXiv:2505.24173 [pdf, ps, other]

DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis?

Authors: Tianhong Zhou, Yin Xu, Yingtao Zhu, Chuxi Xiao, Haiyang Bian, Lei Wei, Xuegong Zhang

Abstract: Vision-language models (VLMs) exhibit strong zero-shot generalization on natural images and show early promise in interpretable medical image analysis. However, existing benchmarks do not systematically evaluate whether these models truly reason like human clinicians or merely imitate superficial patterns. To address this gap, we propose DrVD-Bench, the first multimodal benchmark for clinical visual reasoning. DrVD-Bench consists of three modules: Visual Evidence Comprehension, Reasoning Trajectory Assessment, and Report Generation Evaluation, comprising a total of 7,789 image-question pairs. Our benchmark covers 20 task types, 17 diagnostic categories, and five imaging modalities: CT, MRI, ultrasound, radiography, and pathology. DrVD-Bench is explicitly structured to reflect the clinical reasoning workflow from modality recognition to lesion identification and diagnosis. We benchmark 19 VLMs, including general-purpose and medical-specific, open-source and proprietary models, and observe that performance drops sharply as reasoning complexity increases. While some models begin to exhibit traces of human-like reasoning, they often still rely on shortcut correlations rather than grounded visual understanding. DrVD-Bench offers a rigorous and structured evaluation framework to guide the development of clinically trustworthy VLMs.

Submitted 29 May, 2025; originally announced May 2025.

arXiv:2505.23871 [pdf, ps, other]

ADG: Ambient Diffusion-Guided Dataset Recovery for Corruption-Robust Offline Reinforcement Learning

Authors: Zeyuan Liu, Zhihe Yang, Jiawei Xu, Rui Yang, Jiafei Lyu, Baoxiang Wang, Yunjian Xu, Xiu Li

Abstract: Real-world datasets collected from sensors or human inputs are prone to noise and errors, posing significant challenges for applying offline reinforcement learning (RL). While existing methods have made progress in addressing corrupted actions and rewards, they remain insufficient for handling corruption in high-dimensional state spaces and for cases where multiple elements in the dataset are corrupted simultaneously. Diffusion models, known for their strong denoising capabilities, offer a promising direction for this problem, but their tendency to overfit noisy samples limits their direct applicability. To overcome this, we propose Ambient Diffusion-Guided Dataset Recovery (ADG), a novel approach that pioneers the use of diffusion models to tackle data corruption in offline RL. First, we introduce Ambient Denoising Diffusion Probabilistic Models (DDPM) from approximated distributions, which enable learning on partially corrupted datasets with theoretical guarantees. Second, we use the noise-prediction property of Ambient DDPM to distinguish between clean and corrupted data, and then use the clean subset to train a standard DDPM. Third, we employ the trained standard DDPM to refine the previously identified corrupted data, enhancing data quality for subsequent offline RL training. A notable strength of ADG is its versatility: it can be seamlessly integrated with any offline RL algorithm. Experiments on a range of benchmarks, including MuJoCo, Kitchen, and Adroit, demonstrate that ADG effectively mitigates the impact of corrupted data and improves the robustness of offline RL under various noise settings, achieving state-of-the-art results.

Submitted 4 June, 2025; v1 submitted 29 May, 2025; originally announced May 2025.

arXiv:2505.23866 [pdf, ps, other]

Towards Understanding The Calibration Benefits of Sharpness-Aware Minimization

Authors: Chengli Tan, Yubo Zhou, Haishan Ye, Guang Dai, Junmin Liu, Zengjie Song, Jiangshe Zhang, Zixiang Zhao, Yunda Hao, Yong Xu

Abstract: Deep neural networks have been increasingly used in safety-critical applications such as medical diagnosis and autonomous driving. However, many studies suggest that they are prone to being poorly calibrated and have a propensity for overconfidence, which may have disastrous consequences. In this paper, we show that, unlike standard training such as stochastic gradient descent, the recently proposed sharpness-aware minimization (SAM) counteracts this tendency towards overconfidence. The theoretical analysis suggests that SAM allows us to learn models that are already well-calibrated by implicitly maximizing the entropy of the predictive distribution. Inspired by this finding, we further propose a variant of SAM, coined as CSAM, to ameliorate model calibration. Extensive experiments on various datasets, including ImageNet-1K, demonstrate the benefits of SAM in reducing calibration error. Meanwhile, CSAM performs even better than SAM and consistently achieves lower calibration error than other approaches.

Submitted 29 May, 2025; originally announced May 2025.

Comments: 16 pages
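
The abstract above reports results in terms of calibration error without naming a specific metric. As a rough illustration only, the sketch below computes the commonly used expected calibration error (ECE) with equal-width confidence bins. The function name, the bin count, and the half-open binning convention are assumptions made for this example and are not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted gap between mean confidence and accuracy.

    confidences: predicted top-class probabilities in [0, 1]
    correct:     booleans, True where the prediction was right
    (Hypothetical helper; the paper's exact metric may differ.)
    """
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hits = [0] * n_bins

    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        b = min(int(conf * n_bins), n_bins - 1)
        counts[b] += 1
        conf_sums[b] += conf
        hits[b] += 1 if ok else 0

    ece = 0.0
    for count, conf_sum, hit in zip(counts, conf_sums, hits):
        if count:
            accuracy = hit / count
            mean_conf = conf_sum / count
            ece += (count / n) * abs(accuracy - mean_conf)
    return ece
```

An overconfident model, one that predicts with high confidence but is often wrong, yields a large value; a model whose confidence matches its accuracy in every bin yields zero.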

arXiv:2505.23826 [pdf, ps, other]

FinRipple: Aligning Large Language Models with Financial Market for Event Ripple Effect Awareness

Authors: Yuanjian Xu, Jianing Hao, Kunsheng Tang, Jingnan Chen, Anxian Liu, Peng Liu, Guang Zhang

Abstract: Financial markets exhibit complex dynamics where localized events trigger ripple effects across entities. Previous event studies, constrained by static single-company analyses and simplistic assumptions, fail to capture these ripple effects. While large language models (LLMs) offer emergent reasoning capabilities, their direct application falters due to structural market unawareness and limited capacity to analyze ripple effects. We propose FinRipple, an elegant framework that empowers LLMs with the ability to analyze ripple effects through financial theory-guided large-scale reinforcement learning. We begin by relaxing the assumptions of previous methods, incorporating a time-varying knowledge graph to accurately represent market structure. By seamlessly integrating classical asset pricing theory, we align the LLM with the market, enabling it to predict ripple effects. To the best of our knowledge, we are the first to provide a standardized definition of ripple effect prediction, a task that is extremely important yet unexplored in the financial domain. Extensive experiments demonstrate that FinRipple provides a promising solution to this task.

Submitted 28 May, 2025; originally announced May 2025.

arXiv:2505.23803 [pdf, ps, other]

MultiPhishGuard: An LLM-based Multi-Agent System for Phishing Email Detection

Authors: Yinuo Xue, Eric Spero, Yun Sing Koh, Giovanni Russello

Abstract: Phishing email detection faces critical challenges from evolving adversarial tactics and heterogeneous attack patterns. Traditional detection methods, such as rule-based filters and denylists, often struggle to keep pace with these evolving tactics, leading to false negatives and compromised security. While machine learning approaches have improved detection accuracy, they still face challenges adapting to novel phishing strategies. We present MultiPhishGuard, a dynamic LLM-based multi-agent detection system that synergizes specialized expertise with adversarial-aware reinforcement learning. Our framework employs five cooperative agents (text, URL, metadata, explanation simplifier, and adversarial agents) with automatically adjusted decision weights powered by a Proximal Policy Optimization reinforcement learning algorithm. To address emerging threats, we introduce an adversarial training loop featuring an adversarial agent that generates subtle context-aware email variants, creating a self-improving defense ecosystem and enhancing system robustness. Experimental evaluations on public datasets demonstrate that MultiPhishGuard significantly outperforms Chain-of-Thought, single-agent baselines and state-of-the-art detectors, as validated by ablation studies and comparative analyses. Experiments demonstrate that MultiPhishGuard achieves high accuracy (97.89%) with low false-positive (2.73%) and false-negative (0.20%) rates. Additionally, we incorporate an explanation simplifier agent, which provides users with clear and easily understandable explanations for why an email is classified as phishing or legitimate. This work advances phishing defense through dynamic multi-agent collaboration and generative adversarial resilience.

Submitted 26 May, 2025; originally announced May 2025.

arXiv:2505.23561 [pdf, ps, other]

Merge Hijacking: Backdoor Attacks to Model Merging of Large Language Models

Authors: Zenghui Yuan, Yangming Xu, Jiawen Shi, Pan Zhou, Lichao Sun

Abstract: Model merging for Large Language Models (LLMs) directly fuses the parameters of different models finetuned on various tasks, creating a unified model for multi-domain tasks. However, due to potential vulnerabilities in models available on open-source platforms, model merging is susceptible to backdoor attacks. In this paper, we propose Merge Hijacking, the first backdoor attack targeting model merging in LLMs. The attacker constructs a malicious upload model and releases it. Once a victim user merges it with any other models, the resulting merged model inherits the backdoor while maintaining utility across tasks. Merge Hijacking defines two main objectives, effectiveness and utility, and achieves them through four steps. Extensive experiments demonstrate the effectiveness of our attack across different models, merging algorithms, and tasks. Additionally, we show that the attack remains effective even when merging real-world models. Moreover, our attack demonstrates robustness against two inference-time defenses (Paraphrasing and CLEANGEN) and one training-time defense (Fine-pruning).

Submitted 29 May, 2025; originally announced May 2025.

Comments: This paper is accepted by ACL 2025 main conference

arXiv:2505.23466 [pdf, ps, other]

Importance of pressure-dependent electronic interactions and magnetic order on pressure-driven insulator-metal transitions in MnO and NiO

Authors: Bei-Lei Liu, Yue-Chao Wang, Yuan-Ji Xu, Xingyu Gao, Hai-Feng Liu, Hai-Feng Song

Abstract: The pressure-driven insulator-metal transition is a crucial topic in condensed matter physics. However, even for the prototypical strongly correlated system, NiO, the critical pressure for transition remains debated. In this work, we evaluated the electronic interactions over a wide range of pressures based on our developed doubly-screened Coulomb correction method and investigated the effects of pressure-dependent electronic interactions and their interplay with magnetic order on the transition. As a validation of the method, we also performed calculations on MnO. The results show that the hybrid functional combined with pressure-dependent screening parameters reasonably describes the insulator-metal transition in MnO. The insulating band gap of antiferromagnetic (AFM) NiO also matches experiments well in both trend and value, improving on the method that uses fixed parameters. Further calculations considering magnetic order indicate that as the electronic interactions weaken under pressure, the AFM state of NiO will no longer be stable, a phenomenon that was not observed in previous works. In addition, the results show that, compared with DFT+$U$ within the on-site Coulomb correction framework, the hybrid functional provides a more accurate description of the properties of MnO and NiO at high pressures, highlighting the key role of non-local effects. Our work provides a possible explanation for the long-standing discrepancies in NiO and offers guidance for the development of first-principles methods for correlated electron systems under pressure.

Submitted 29 May, 2025; originally announced May 2025.

Comments: 10 pages, 9 figures

arXiv:2505.23227 [pdf, ps, other]

Polymer-modulated evaporation flow enables scalable self-assembly of highly aligned nanowires

Authors: Liyiming Tao, Zechao Jiang, Shiyuan Hu, Lin Du, Qiuting Zhang, Jiajia Zhou, Masao Doi, Xiaojun Wu, Xingkun Man, Ye Xu

Abstract: Highly aligned nanowire networks are essential for enabling anisotropic optical, electrical, and sensing functionalities in next-generation devices. However, achieving such alignment typically requires complex fabrication methods or high-energy processing. Here, we present a simple and scalable self-assembly strategy that uses a viscosity-enhancing polymer additive to modulate fluid flows during solvent evaporation. The addition of carboxymethylcellulose sodium (CMC-Na) reshapes the evaporation-driven flow field and generates a compressional flow region near the drying edge. Within this region, rotation-inducing velocity gradients progressively align silver nanowires (AgNWs) into highly ordered arrays. This unique mechanism yields uniform AgNW coatings with a high degree of nanowire alignment and tunable areal density across centimeter-scale areas. The resulting films exhibit strong broadband anisotropy, including polarization-dependent transmission in both visible and terahertz (THz) regimes and angle-dependent electrical conductivity. The approach also integrates naturally with dip-coating-based shear alignment, enabling programmable control over alignment direction and spatial patterning. This work establishes a robust, polymer-enabled mechanism for bottom-up nanowire alignment and offers a passive, energy-efficient route for fabricating anisotropic nanostructured coatings.

Submitted 29 May, 2025; originally announced May 2025.

arXiv:2505.23038 [pdf, ps, other]

EL4NER: Ensemble Learning for Named Entity Recognition via Multiple Small-Parameter Large Language Models

Authors: Yuzhen Xiao, Jiahe Song, Yongxin Xu, Ruizhe Zhang, Yiqi Xiao, Xin Lu, Runchuan Zhu, Bowen Jiang, Junfeng Zhao

Abstract: The In-Context Learning (ICL) technique based on Large Language Models (LLMs) has gained prominence in Named Entity Recognition (NER) tasks for its lower computing resource consumption, less manual labeling overhead, and stronger generalizability. Nevertheless, most ICL-based NER methods depend on large-parameter LLMs: the open-source models demand substantial computational resources for deployment and inference, while the closed-source ones incur high API costs, raise data-privacy concerns, and hinder community collaboration. To address this issue, we propose an Ensemble Learning Method for Named Entity Recognition (EL4NER), which aims at aggregating the ICL outputs of multiple open-source, small-parameter LLMs to enhance overall performance in NER tasks at lower deployment and inference cost. Specifically, our method comprises three key components. First, we design a task decomposition-based pipeline that facilitates deep, multi-stage ensemble learning. Second, we introduce a novel span-level sentence similarity algorithm to establish an ICL demonstration retrieval mechanism better suited for NER tasks. Third, we incorporate a self-validation mechanism to mitigate the noise introduced during the ensemble process. We evaluated EL4NER on multiple widely adopted NER datasets from diverse domains. Our experimental results indicate that EL4NER surpasses most closed-source, large-parameter LLM-based methods at a lower parameter cost and even attains state-of-the-art (SOTA) performance among ICL-based methods on certain datasets. These results show the parameter efficiency of EL4NER and underscore the feasibility of employing open-source, small-parameter LLMs within the ICL paradigm for NER tasks.

Submitted 28 May, 2025; originally announced May 2025.

arXiv:2505.22999 [pdf, ps, other]

Online Selection with Uncertain Disruption

Authors: Yihua Xu, Süleyman Kerimov, Sebastian Perez-Salazar

Abstract: In numerous online selection problems, decision-makers (DMs) must allocate limited resources on the fly to customers with uncertain values. The DM faces the tension between allocating resources to currently observed values and saving them for potentially better, unobserved values in the future. Addressing this tension becomes more demanding if an uncertain disruption occurs while serving customers. Without any disruption, the DM gets access to the capacity information to serve customers throughout the time horizon. However, with uncertain disruption, the DM must act more cautiously due to the risk of running out of capacity abruptly or misusing the resources. Motivated by this tension, we introduce the Online Selection with Uncertain Disruption (OS-UD) problem. In OS-UD, a DM sequentially observes n non-negative values drawn from a common distribution and must commit to select or reject each value in real time, without revisiting past values. The disruption is modeled as a Bernoulli random variable with probability p each time the DM selects a value. We aim to design an online algorithm that maximizes the expected sum of selected values before a disruption occurs, if any. We evaluate online algorithms using the competitive ratio. Using a quantile-based approach, we devise a non-adaptive single-threshold algorithm that attains a competitive ratio of at least 1-1/e, and an adaptive threshold algorithm characterized by a sequence of non-increasing thresholds that attains an asymptotic competitive ratio of at least 0.745. Both of these results are worst-case optimal within their corresponding class of algorithms.

Submitted 28 May, 2025; originally announced May 2025.
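
To make the OS-UD setting in the abstract above concrete, here is a minimal simulation sketch of a single fixed-threshold policy. It is not the authors' quantile-based algorithm: the threshold is simply passed in, the function name is hypothetical, and it assumes that a selected value is collected first and the Bernoulli(p) disruption check happens right after that selection, which is one reading of "before a disruption occurs".

```python
import random


def run_single_threshold(values, threshold, p, rng):
    """Simulate one pass of a fixed-threshold policy under uncertain disruption.

    values:    non-negative values observed one at a time
    threshold: select a value if and only if it is >= threshold
    p:         probability that a disruption follows each selection
    rng:       a random.Random instance, so runs are reproducible
    """
    total = 0.0
    for v in values:
        if v >= threshold:
            # Assumption: the selected value counts, then disruption may fire.
            total += v
            if rng.random() < p:
                break  # disruption ends the selection process
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [rng.random() for _ in range(20)]
    print(run_single_threshold(sample, threshold=0.7, p=0.3, rng=rng))
```

Averaging the returned total over many sampled sequences gives a Monte Carlo estimate of the policy's expected reward, which could then be compared across candidate thresholds.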

arXiv:2505.22633 [pdf, other]

Spatial Knowledge Graph-Guided Multimodal Synthesis

Authors: Yida Xue, Zhen Bi, Jinnan Yang, Jungang Lou, Huajun Chen, Ningyu Zhang

Abstract: Recent advances in multimodal large language models (MLLMs) have significantly enhanced their capabilities; however, their spatial perception abilities remain a notable limitation. To address this challenge, multimodal data synthesis offers a promising solution. Yet, ensuring that synthesized data adhere to spatial common sense is a non-trivial task. In this work, we introduce SKG2Data, a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. SKG2Data automatically constructs a Spatial Knowledge Graph (SKG) to emulate human-like perception of spatial directions and distances, which is subsequently utilized to guide multimodal data synthesis. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, not only enhance the spatial perception and reasoning abilities of MLLMs but also exhibit strong generalization capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence.

Submitted 28 May, 2025; originally announced May 2025.

Comments: Ongoing work

arXiv:2505.22290 [pdf, ps, other]

Rethinking the Unsolvable: When In-Context Search Meets Test-Time Scaling

Authors: Fanzeng Xia, Yidong Luo, Tinko Sebastian Bartels, Yaqi Xu, Tongxin Li

Abstract: Recent research has highlighted that Large Language Models (LLMs), even when trained to generate long reasoning steps, still face significant challenges on hard reasoning problems. However, much of the existing literature relies on direct prompting with simple in-context learning examples for evaluation, which largely overlooks advanced techniques to elicit LLMs' deliberate reasoning before drawing conclusions that LLMs hit a performance ceiling. In this paper, we systematically explore the combined potential of in-context search and test-time scaling on super hard reasoning tasks. We find that by applying advanced in-context search prompting to LLMs augmented with internal scaling, one can achieve transformative performance breakthroughs on tasks previously deemed "unsolvable" (e.g., reported success rates below 5%). We provide both empirical results and theoretical analysis of how this combination can unleash LLM reasoning capabilities: i) Empirically, on controlled NP-hard tasks and complex real-world planning benchmarks, our approach achieves up to a 30x improvement in success rates compared to previously reported results without any external mechanisms; ii) Theoretically, we show that in-context search prompting, when combined with internal scaling, significantly extends the complexity class of solvable reasoning problems. These findings challenge prevailing assumptions about the limitations of LLMs on complex tasks, indicating that current evaluation paradigms systematically underestimate their true potential. Our work calls for a critical reassessment of how LLM reasoning is benchmarked and a more robust evaluation strategy that fully captures the true capabilities of contemporary LLMs, which can lead to a better understanding of their operational reasoning boundaries in real-world deployments.

Submitted 28 May, 2025; originally announced May 2025.

arXiv:2505.22167 [pdf, other]

Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers

Authors: Weilun Feng, Chuanguang Yang, Haotong Qin, Xiangqi Li, Yu Wang, Zhulin An, Libo Huang, Boyu Diao, Zixiang Zhao, Yongjun Xu, Michele Magno

Abstract: Diffusion transformers (DiT) have demonstrated exceptional performance in video generation. However, their large number of parameters and high computational complexity limit their deployment on edge devices. Quantization can reduce storage requirements and accelerate inference by lowering the bit-width of model parameters. Yet, existing quantization methods for image generation models do not generalize well to video generation tasks. We identify two primary challenges: the loss of information during quantization and the misalignment between optimization objectives and the unique requirements of video generation. To address these challenges, we present Q-VDiT, a quantization framework specifically designed for video DiT models. From the quantization perspective, we propose the Token-aware Quantization Estimator (TQE), which compensates for quantization errors in both the token and feature dimensions. From the optimization perspective, we introduce Temporal Maintenance Distillation (TMD), which preserves the spatiotemporal correlations between frames and enables the optimization of each frame with respect to the overall video context. Our W3A6 Q-VDiT achieves a scene consistency of 23.40, setting a new benchmark and outperforming current state-of-the-art quantization methods by 1.9×. Code will be available at https://github.com/cantbebetter2/Q-VDiT.

Submitted 28 May, 2025; originally announced May 2025.

Comments: Accepted to ICML2025
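
The abstract above notes that quantization lowers the bit-width of model parameters. The sketch below shows a generic symmetric uniform quantizer to illustrate that basic idea and the rounding error it introduces. It is not the paper's TQE or TMD; the function names and the per-tensor scaling choice are assumptions made for this example.

```python
def quantize_symmetric(weights, bits):
    """Map floats to signed integer codes in [-qmax, qmax] with one shared scale.

    Hypothetical helper illustrating plain uniform quantization only.
    """
    qmax = 2 ** (bits - 1) - 1          # e.g. 3 for 3-bit weights
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    w = [-3.0, 0.0, 1.0, 3.0, 0.4]
    codes, scale = quantize_symmetric(w, bits=3)
    approx = dequantize(codes, scale)
    # The per-element gap below is the information lost to quantization.
    print(codes, [abs(a - b) for a, b in zip(w, approx)])
```

At very low bit-widths only a handful of code values are available, so the reconstruction gap grows; methods such as the one in the abstract aim to compensate for that error.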

arXiv:2505.22140 [pdf, other]

Search for a dark baryon in the $Ξ^-\rightarrowπ^-+{\rm invisible}$ decay

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai, et al. (697 additional authors not shown)

Abstract: A search for a dark baryon is performed for the first time in the two-body decay $Ξ^-\rightarrowπ^-+{\rm invisible}$ using $(10.087\pm0.044)\times10^{9}$ $J/ψ$ events collected at a center-of-mass energy of $\sqrt{s}=3.097\,\mbox{GeV}$ with the BESIII detector at the BEPCII collider. No significant signal is observed, and the 90% (95%) confidence level upper limits on the branching fraction $B(Ξ^-\rightarrowπ^-+{\rm invisible})$ are determined to be $4.2\times10^{-5}$ ($5.2\times10^{-5}$), $6.9\times10^{-5}$ ($8.4\times10^{-5}$), $6.5\times10^{-4}$ ($7.6\times10^{-4}$), $1.1\times10^{-4}$ ($1.3\times10^{-4}$) and $4.5\times10^{-5}$ ($5.5\times10^{-5}$), under the dark baryon mass hypotheses of 1.07$\,\mbox{GeV}/c^2$, 1.10$\,\mbox{GeV}/c^2$, $m_Λ$ (1.116$\,\mbox{GeV}/c^2$), 1.13$\,\mbox{GeV}/c^2$, and 1.16$\,\mbox{GeV}/c^2$, respectively. The constraints obtained on the Wilson coefficients $C_{u s, s}^L$ and $C_{u s, s}^R$ are more stringent than the previous limits derived from the LHC searches for the colored mediators.

Submitted 28 May, 2025; originally announced May 2025.

Comments: 11 pages, 4 figures, 1 table

arXiv:2505.21906 [pdf, ps, other]

ChatVLA-2: Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge

Authors: Zhongyi Zhou, Yichen Zhu, Junjie Wen, Chaomin Shen, Yi Xu

Abstract: Vision-language-action (VLA) models have emerged as the next generation of models in robotics. However, despite leveraging powerful pre-trained Vision-Language Models (VLMs), existing end-to-end VLA systems often lose key capabilities during fine-tuning as the model adapts to specific robotic tasks. We argue that a generalizable VLA model should retain and expand upon the VLM's core competencies: 1) Open-world embodied reasoning - the VLA should inherit the knowledge from the VLM, i.e., recognize anything that the VLM can recognize, be capable of solving math problems, and possess visual-spatial intelligence, 2) Reasoning following - effectively translating the open-world reasoning into actionable steps for the robot. In this work, we introduce ChatVLA-2, a novel mixture-of-experts VLA model coupled with a specialized two-stage training pipeline designed to preserve the VLM's original strengths while enabling actionable reasoning. To validate our approach, we design a math-matching task wherein a robot interprets math problems written on a whiteboard and picks corresponding number cards from a table to solve equations. Remarkably, our method exhibits exceptional mathematical reasoning and OCR capabilities, despite these abilities not being explicitly trained within the VLA. Furthermore, we demonstrate that the VLA possesses strong spatial reasoning skills, enabling it to interpret novel directional instructions involving previously unseen objects. Overall, our method showcases reasoning and comprehension abilities that significantly surpass state-of-the-art imitation learning methods such as OpenVLA, DexVLA, and pi-zero. This work represents a substantial advancement toward developing truly generalizable robotic foundation models endowed with robust reasoning capacities.

Submitted 29 May, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

Comments: Project page: https://chatvla-2.github.io/

arXiv:2505.21822 [pdf, ps, other]

Compressive Fourier-Domain Intensity Coupling (C-FOCUS) enables near-millimeter deep imaging in the intact mouse brain in vivo

Authors: Renzhi He, Yucheng Li, Brianna Urbina, Jiandi Wan, Yi Xue

Abstract: Two-photon microscopy is a powerful tool for in vivo imaging, but its imaging depth is typically limited to a few hundred microns due to tissue scattering, even with existing scattering correction techniques. Moreover, most active scattering correction methods are restricted to small regions by the optical memory effect. Here, we introduce compressive Fourier-domain intensity coupling for scattering correction (C-FOCUS), an active scattering correction approach that integrates Fourier-domain intensity modulation with compressive sensing for two-photon microscopy. Using C-FOCUS, we demonstrate high-resolution imaging of YFP-labeled neurons and FITC-labeled blood vessels at depths exceeding 900 µm in the intact mouse brain in vivo. Furthermore, we achieve transcranial imaging of YFP-labeled dendritic structures through the intact adult mouse skull. C-FOCUS enables high-contrast fluorescence imaging at depths previously inaccessible using two-photon microscopy with 1035 nm excitation, enhancing fluorescence intensity by over 20-fold compared to uncorrected imaging. C-FOCUS provides a broadly applicable strategy for rapid, deep-tissue optical imaging in vivo.

Submitted 27 May, 2025; originally announced May 2025.

arXiv:2505.21568 [pdf, ps, other]

VoiceMark: Zero-Shot Voice Cloning-Resistant Watermarking Approach Leveraging Speaker-Specific Latents

Authors: Haiyun Li, Zhiyong Wu, Xiaofeng Xie, Jingran Xie, Yaoxun Xu, Hanyang Peng

Abstract: Voice cloning (VC)-resistant watermarking is an emerging technique for tracing and preventing unauthorized cloning. Existing methods effectively trace traditional VC models by training them on watermarked audio but fail in zero-shot VC scenarios, where models synthesize audio from an audio prompt without training. To address this, we propose VoiceMark, the first zero-shot VC-resistant watermarking method that leverages speaker-specific latents as the watermark carrier, allowing the watermark to transfer through the zero-shot VC process into the synthesized audio. Additionally, we introduce VC-simulated augmentations and VAD-based loss to enhance robustness against distortions. Experiments on multiple zero-shot VC models demonstrate that VoiceMark achieves over 95% accuracy in watermark detection after zero-shot VC synthesis, significantly outperforming existing methods, which only reach around 50%. See our code and demos at: https://huggingface.co/spaces/haiyunli/VoiceMark

Submitted 30 May, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

Comments: Accepted by Interspeech 2025

arXiv:2505.21527 [pdf, ps, other]

VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining

Authors: Jianheng Zhuo, Yifan Yang, Yiwen Shao, Yong Xu, Dong Yu, Kai Yu, Xie Chen

Abstract: Automatic speech recognition (ASR) has made remarkable progress but heavily relies on large-scale labeled data, which is scarce for low-resource languages like Vietnamese. While existing systems such as Whisper, USM, and MMS achieve promising performance, their efficacy remains inadequate in terms of training costs, latency, and accessibility. To address these issues, we propose VietASR, a novel ASR training pipeline that leverages vast amounts of unlabeled data and a small set of labeled data. Through multi-iteration ASR-biased self-supervised learning on a large-scale unlabeled dataset, VietASR offers a cost-effective and practical solution for enhancing ASR performance. Experiments demonstrate that pre-training on 70,000 hours of unlabeled data and fine-tuning on merely 50 hours of labeled data yield a lightweight but powerful ASR model. It outperforms Whisper Large-v3 and commercial ASR systems on real-world data. Our code and models will be open-sourced to facilitate research in low-resource ASR.

Submitted 29 May, 2025; v1 submitted 23 May, 2025; originally announced May 2025.

arXiv:2505.21070 [pdf, ps, other]

Minute-Long Videos with Dual Parallelisms

Authors: Zeqing Wang, Bowen Zheng, Xingyi Yang, Zhenxiong Tan, Yuecong Xu, Xinchao Wang

Abstract: Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy, termed DualParal. The core idea is that, instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs. However, a naive implementation of this division faces a key limitation: since diffusion models require synchronized noise levels across frames, this implementation leads to the serialization of original parallelisms. We leverage a block-wise denoising scheme to handle this. Namely, we process a sequence of frame blocks through the pipeline with progressively decreasing noise levels. Each GPU handles a specific block and layer subset while passing previous results to the next GPU, enabling asynchronous computation and communication. To further optimize performance, we incorporate two key enhancements. Firstly, a feature cache is implemented on each GPU to store and reuse features from the prior block as context, minimizing inter-GPU communication and redundant computation. Secondly, we employ a coordinated noise initialization strategy, ensuring globally consistent temporal dynamics by sharing initial noise patterns across GPUs without extra resource costs. Together, these enable fast, artifact-free, and infinitely long video generation. Applied to the latest diffusion transformer video generator, our method efficiently produces 1,025-frame videos with up to 6.54$\times$ lower latency and 1.48$\times$ lower memory cost on 8$\times$RTX 4090 GPUs.

Submitted 28 May, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

Comments: The code is available at https://github.com/DualParal-Project/DualParal

arXiv:2505.21050 [pdf, other]

Advancing high-fidelity 3D and Texture Generation with 2.5D latents

Authors: Xin Yang, Jiantao Lin, Yingjie Xu, Haodong Li, Yingcong Chen

Abstract: Despite the availability of large-scale 3D datasets and advancements in 3D generative models, the complexity and uneven quality of 3D geometry and texture data continue to hinder the performance of 3D generation techniques. In most existing approaches, 3D geometry and texture are generated in separate stages using different models and non-unified representations, frequently leading to unsatisfactory coherence between geometry and texture. To address these challenges, we propose a novel framework for joint generation of 3D geometry and texture. Specifically, we focus on generating versatile 2.5D representations that can be seamlessly transformed between 2D and 3D. Our approach begins by integrating multiview RGB, normal, and coordinate images into a unified representation, termed 2.5D latents. Next, we adapt pre-trained 2D foundation models for high-fidelity 2.5D generation, utilizing both text and image conditions. Finally, we introduce a lightweight 2.5D-to-3D refiner-decoder framework that efficiently generates detailed 3D representations from 2.5D images. Extensive experiments demonstrate that our model not only excels in generating high-quality 3D objects with coherent structure and color from text and image inputs but also significantly outperforms existing methods in geometry-conditioned texture generation.

Submitted 28 May, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

arXiv:2505.21049 [pdf, ps, other]

Robust Video-Based Pothole Detection and Area Estimation for Intelligent Vehicles with Depth Map and Kalman Smoothing

Authors: Dehao Wang, Haohang Zhu, Yiwen Xu, Kaiqi Liu

Abstract: Road potholes pose a serious threat to driving safety and comfort, making their detection and assessment a critical task in fields such as autonomous driving. When driving vehicles, the operators usually avoid large potholes and approach smaller ones at reduced speeds to ensure safety. Therefore, accurately estimating pothole area is of vital importance. Most existing vision-based methods rely on distance priors to construct geometric models. However, their performance is susceptible to variations in camera angles and typically relies on the assumption of a flat road surface, potentially leading to significant errors in complex real-world environments. To address these problems, a robust pothole area estimation framework that integrates object detection and monocular depth estimation in a video stream is proposed in this paper. First, to enhance pothole feature extraction and improve the detection of small potholes, ACSH-YOLOv8 is proposed with the ACmix module and a small object detection head. Then, the BoT-SORT algorithm is utilized for pothole tracking, while DepthAnything V2 generates depth maps for each frame. With the obtained depth maps and pothole labels, a novel Minimum Bounding Triangulated Pixel (MBTP) method is proposed for pothole area estimation. Finally, a Kalman Filter based on Confidence and Distance (CDKF) is developed to maintain consistency of estimation results across consecutive frames. The results show that the ACSH-YOLOv8 model achieves an AP(50) of 76.6%, representing a 7.6% improvement over YOLOv8. Through CDKF optimization across consecutive frames, pothole predictions become more robust, thereby enhancing the method's practical applicability.

Submitted 27 May, 2025; originally announced May 2025.

arXiv:2505.20747 [pdf, ps, other]

On Kernel Design for Regularized Volterra Series Identification of Wiener-Hammerstein Systems

Authors: Yu Xu, Biqiang Mu, Tianshi Chen

Abstract: There has been increasing interest in Volterra series identification with the kernel-based regularization method. The major difficulties lie in the kernel design and the efficiency of the corresponding implementation. In this paper, we first assume that the underlying system to be identified is the Wiener-Hammerstein (WH) system with polynomial nonlinearity. We then show how to design kernels with nonzero off-diagonal blocks for Volterra maps by taking into account the prior knowledge of the linear blocks and the structure of WH systems. Moreover, exploiting the structure of the designed kernels leads to the same computational complexity as the state-of-the-art result, i.e., $O(N^3)$, where $N$ is the sample size, but with the significant difference that the proposed kernels are designed in a direct and flexible way. In addition, for a special case of the kernel and a class of widely used input signals, further exploiting the separable structure of the output kernel matrix can lower the computational complexity from $O(N^3)$ to $O(Nγ^2)$, where $γ$ is the separability rank of the output kernel matrix and can be much smaller than $N$. We finally run Monte Carlo simulations to demonstrate the proposed kernels and the obtained theoretical results.

Submitted 27 May, 2025; originally announced May 2025.

Comments: 17 pages, 7 figures

Showing 151–200 of 7,791 results for author: Xu, Y