Search | arXiv e-print repository

arXiv:2505.10963 [pdf, ps, other]

Beyond real: Alternative unitary cluster Jastrow models for molecular electronic structure calculations on near-term quantum computers

Authors: Nikolay V. Tkachenko, Hang Ren, Wendy M. Billings, Rebecca Tomann, K. Birgitta Whaley, Martin Head-Gordon

Abstract: Near-term quantum devices require wavefunction ansätze that are expressive while also of shallow circuit depth in order to both accurately and efficiently simulate molecular electronic structure. While unitary coupled cluster (e.g., UCCSD) has become a standard, the high gate count of its implementation limits its feasibility on noisy intermediate-scale quantum (NISQ) hardware. $k$-fold unitary cluster Jastrow (uCJ) ansätze mitigate this challenge by providing $O(kN^2)$ circuit scaling and favorable linear depth circuit implementation. Previous work has focused on the real orbital-rotation (Re-uCJ) variant of uCJ, which allows an exact (Trotter-free) implementation. Here we extend and generalize the $k$-fold uCJ framework by introducing two new variants, Im-uCJ and g-uCJ, which incorporate imaginary and fully complex orbital rotation operators, respectively. Similar to Re-uCJ, both of the new variants achieve quadratic gate-count scaling. Our results focus on the simplest $k=1$ model, and show that the uCJ models frequently maintain energy errors within chemical accuracy. Both g-uCJ and Im-uCJ are more expressive in terms of capturing electron correlation and are also more accurate than the earlier Re-uCJ ansatz. We further show that Im-uCJ and g-uCJ circuits can also be implemented exactly, without any Trotter decomposition.
Numerical tests using $k=1$ on $H_2$, $H_3^+$, $Be_2$, $C_2H_4$, $C_2H_6$ and $C_6H_6$ in various basis sets confirm the practical feasibility of these shallow Jastrow-based ansätze for applications on near-term quantum hardware.

Submitted 16 May, 2025; originally announced May 2025.

arXiv:2505.10442 [pdf, ps, other]

IN-RIL: Interleaved Reinforcement and Imitation Learning for Policy Fine-Tuning

Authors: Dechen Gao, Hang Wang, Hanchu Zhou, Nejib Ammar, Shatadal Mishra, Ahmadreza Moradipari, Iman Soltani, Junshan Zhang

Abstract: Imitation learning (IL) and reinforcement learning (RL) each offer distinct advantages for robotics policy learning: IL provides stable learning from demonstrations, and RL promotes generalization through exploration. While existing robot learning approaches using IL-based pre-training followed by RL-based fine-tuning are promising, this two-step learning paradigm often suffers from instability and poor sample efficiency during the RL fine-tuning phase. In this work, we introduce IN-RIL, INterleaved Reinforcement learning and Imitation Learning, for policy fine-tuning, which periodically injects IL updates after multiple RL updates and hence can benefit from the stability of IL and the guidance of expert data for more efficient exploration throughout the entire fine-tuning process. Since IL and RL involve different optimization objectives, we develop gradient separation mechanisms to prevent destructive interference during IN-RIL fine-tuning, by separating possibly conflicting gradient updates in orthogonal subspaces. Furthermore, we conduct rigorous analysis, and our findings shed light on why interleaving IL with RL stabilizes learning and improves sample efficiency. Extensive experiments on 14 robot manipulation and locomotion tasks across 3 benchmarks, including FurnitureBench, OpenAI Gym, and Robomimic, demonstrate that IN-RIL can significantly improve sample efficiency and mitigate performance collapse during online fine-tuning in both long- and short-horizon tasks with either sparse or dense rewards.
IN-RIL, as a general plug-in compatible with various state-of-the-art RL algorithms, can significantly improve RL fine-tuning, e.g., from 12% to 88% with 6.3x improvement in the success rate on Robomimic Transport. Project page: https://github.com/ucd-dare/IN-RIL.

Submitted 15 May, 2025; originally announced May 2025.

arXiv:2505.10207 [pdf, other]

How to Color Temporal Graphs to Ensure Proper Transitions

Authors: Allen Ibiapina, Minh Hang Nguyen, Mikaël Rabie, Cléophée Robin

Abstract: Graph Coloring consists in assigning colors to vertices ensuring that two adjacent vertices do not have the same color. In dynamic graphs, this notion is not well defined, as we need to decide if different colors for adjacent vertices must happen all the time or not, and how to go from a coloring in one time to the next one. In this paper, we define a coloring notion for Temporal Graphs where at each step, the coloring must be proper. It uses a notion of compatibility between two consecutive snapshots that implies that the coloring stays proper while the transition happens. Given a graph, the minimum number of colors needed to ensure that such coloring exists is the Temporal Chromatic Number of this graph. With those notions, we provide some lower and upper bounds for the temporal chromatic number in the general case. We then dive into some specific classes of graphs such as trees, graphs with bounded degree or bounded degeneracy. Finally, we consider temporal graphs where grow pace is one, that is, a single edge can be added and a single other one can be removed between two time steps. In that case, we consider bipartite and bounded degree graphs. Even though the problem is defined with full knowledge of the temporal graph, our results also work in the case where future snapshots are given online: we need to choose the coloring of the next snapshot after having computed the current one, not knowing what

Submitted 15 May, 2025; originally announced May 2025.

Comments: 20 pages, 9 figures

arXiv:2505.10166 [pdf, ps, other]

Cavity-Mediated Electron-Electron Interactions: Renormalizing Dirac States in Graphene

Authors: Hang Liu, Francesco Troisi, Hannes Hübener, Simone Latini, Angel Rubio

Abstract: Embedding materials in optical cavities has emerged as a strategy for tuning material properties. Accurate simulations of electrons in materials interacting with quantum photon fluctuations of a cavity are crucial for understanding and predicting cavity-induced phenomena. In this article, we develop a non-perturbative quantum electrodynamical approach based on a photon-free self-consistent Hartree-Fock framework to model the coupling between electrons and cavity photons in crystalline materials. We apply this theoretical approach to investigate graphene coupled to the vacuum field fluctuations of cavity photon modes with different types of polarizations. The cavity photons introduce nonlocal electron-electron interactions, originating from the quantum nature of light, that lead to significant renormalization of the Dirac bands. In contrast to the case of graphene coupled to a classical circularly polarized light field, where a topological Dirac gap emerges, the nonlocal interactions induced by a quantum linearly polarized photon mode give rise to the formation of flat bands and the opening of a topologically trivial Dirac gap. When two symmetric cavity photon modes are introduced, Dirac cones remain gapless, but a Fermi velocity renormalization still indicates the relevant role of nonlocal interactions. These effects disappear in the classical limit for coherent photon modes.
This new self-consistent theoretical framework paves the way for the simulation of non-perturbative quantum effects in strongly coupled light-matter systems, and allows for a more comprehensive discovery of novel cavity-induced quantum phenomena.

Submitted 15 May, 2025; originally announced May 2025.

Comments: 20 pages, 10 figures

arXiv:2505.10039 [pdf, other]

Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates

Authors: Hang Chen, Jiaying Zhu, Xinyu Yang, Wenya Wang

Abstract: Circuit discovery has gradually become one of the prominent methods for mechanistic interpretability, and research on circuit completeness has also garnered increasing attention. Methods of circuit discovery that do not guarantee completeness not only result in circuits that are not fixed across different runs but also cause key mechanisms to be omitted. The nature of incompleteness arises from the presence of OR gates within the circuit, which are often only partially detected in standard circuit discovery methods. To this end, we systematically introduce three types of logic gates: AND, OR, and ADDER gates, and decompose the circuit into combinations of these logical gates. Through the concept of these gates, we derive the minimum requirements necessary to achieve faithfulness and completeness. Furthermore, we propose a framework that combines noising-based and denoising-based interventions, which can be easily integrated into existing circuit discovery methods without significantly increasing computational complexity. This framework is capable of fully identifying the logic gates and distinguishing them within the circuit. In addition to the extensive experimental validation of the framework's ability to restore the faithfulness, completeness, and sparsity of circuits, using this framework, we uncover fundamental properties of the three logic gates, such as their proportions and contributions to the output, and explore how they behave among the functionalities of language models.

Submitted 15 May, 2025; originally announced May 2025.

Comments: 10 pages

arXiv:2505.09684 [pdf, ps, other]

Demonstration of low-overhead quantum error correction codes

Authors: Ke Wang, Zhide Lu, Chuanyu Zhang, Gongyu Liu, Jiachen Chen, Yanzhe Wang, Yaozu Wu, Shibo Xu, Xuhao Zhu, Feitong Jin, Yu Gao, Ziqi Tan, Zhengyi Cui, Ning Wang, Yiren Zou, Aosai Zhang, Tingting Li, Fanhao Shen, Jiarun Zhong, Zehang Bao, Zitian Zhu, Yihang Han, Yiyang He, Jiayuan Shen, Han Wang, et al. (17 additional authors not shown)

Abstract: Quantum computers hold the potential to surpass classical computers in solving complex computational problems. However, the fragility of quantum information and the error-prone nature of quantum operations make building large-scale, fault-tolerant quantum computers a prominent challenge. To combat errors, pioneering experiments have demonstrated a variety of quantum error correction codes. Yet, most of these codes suffer from low encoding efficiency, and their scalability is hindered by prohibitively high resource overheads. Here, we report the demonstration of two low-overhead quantum low-density parity-check (qLDPC) codes, a distance-4 bivariate bicycle code and a distance-3 qLDPC code, on our latest superconducting processor, Kunlun, featuring 32 long-range-coupled transmon qubits. Utilizing a two-dimensional architecture with overlapping long-range couplers, we demonstrate simultaneous measurements of all nonlocal weight-6 stabilizers via the periodic execution of an efficient syndrome extraction circuit. We achieve a logical error rate per logical qubit per cycle of $(8.91 \pm 0.17)\%$ for the distance-4 bivariate bicycle code with four logical qubits and $(7.77 \pm 0.12)\%$ for the distance-3 qLDPC code with six logical qubits. Our results establish the feasibility of implementing various qLDPC codes with long-range coupled superconducting processors, marking a crucial step towards large-scale low-overhead quantum error correction.

Submitted 14 May, 2025; originally announced May 2025.

arXiv:2505.09665 [pdf, other]

Tales of the 2025 Los Angeles Fire: Hotwash for Public Health Concerns in Reddit via LLM-Enhanced Topic Modeling

Authors: Sulong Zhou, Qunying Huang, Shaoheng Zhou, Yun Hang, Xinyue Ye, Aodong Mei, Kathryn Phung, Yuning Ye, Uma Govindswamy, Zehan Li

Abstract: Wildfires have become increasingly frequent, irregular, and severe in recent years. Understanding how affected populations perceive and respond during wildfire crises is critical for timely and empathetic disaster response. Social media platforms offer a crowd-sourced channel to capture evolving public discourse, providing hyperlocal information and insight into public sentiment. This study analyzes Reddit discourse during the 2025 Los Angeles wildfires, spanning from the onset of the disaster to full containment. We collect 385 posts and 114,879 comments related to the Palisades and Eaton fires. We adopt topic modeling methods to identify the latent topics, enhanced by large language models (LLMs) and human-in-the-loop (HITL) refinement. Furthermore, we develop a hierarchical framework to categorize latent topics, consisting of two main categories, Situational Awareness (SA) and Crisis Narratives (CN). The volume of the SA category closely aligns with real-world fire progressions, peaking within the first 2-5 days as the fires reach the maximum extent. The most frequent co-occurring category set of public health and safety, loss and damage, and emergency resources expands on a wide range of health-related latent topics, including environmental health, occupational health, and one health. Grief signals and mental health risks consistently accounted for 60 percent and 40 percent of CN instances, respectively, with the highest total volume occurring at night.
This study contributes the first annotated social media dataset on the 2025 LA fires, and introduces a scalable multi-layer framework that leverages topic modeling for crisis discourse analysis. By identifying persistent public health concerns, our results can inform more empathetic and adaptive strategies for disaster response, public health communication, and future research in comparable climate-related disaster events.

Submitted 15 May, 2025; v1 submitted 14 May, 2025; originally announced May 2025.

Comments: Corrected capitalization errors in the section subtitle 3.4, 4.3, step 1 in section 3.3.2, and Supplementary Information. Fix typo with "Weighting" for step 4 in section 3.3.2

arXiv:2505.09201 [pdf]

Photoswitchable exceptional points derived from bound states in the continuum

Authors: Lei Wang, Hang Liu, Junwei Liu, Aoxuan Liu, Jialiang Huang, Qiannan Li, Hui Dai, Caihong Zhang, Jingbo Wu, Kebin Fan, Huabing Wang, Biaobing Jin, Jian Chen, Peiheng Wu

Abstract: Bound states in the continuum (BICs) and exceptional points (EPs), as two distinct physical singularities represented by complex frequencies in non-Hermitian systems, have garnered significant attention and clear definitions in their respective fields in recent years. They share overlapping applications in areas such as high-sensitivity sensing and laser emission. However, the transition between the two, inspired by these intersections, remains largely unexplored. In this work, we reveal the transition process in a non-Hermitian two-mode system, evolving from one bound singularity to a two-dimensional exceptional ring, where the EP is the coalescent state of the quasi-Friedrich-Wintgen (FW)-BIC. This phenomenon is experimentally validated through pored dielectric metasurfaces in the terahertz band. Furthermore, photocarriers induced by external pumping act as a dissipative perturbation that facilitates the breaking of degeneracy in the complex eigenfrequency and enables dynamic EP switching. Finally, we experimentally demonstrate a switchable terahertz beam deflection driven by the phase singularities of the EP. These findings are instrumental in advancing the development of compact devices for sensing and wavefront control within non-Hermitian systems.

Submitted 14 May, 2025; originally announced May 2025.

arXiv:2505.08690 [pdf, ps, other]

Adaptive Schema-aware Event Extraction with Retrieval-Augmented Generation

Authors: Sheng Liang, Hang Lv, Zhihao Wen, Yaxiong Wu, Yongyue Zhang, Hao Wang, Yong Liu

Abstract: Event extraction (EE) is a fundamental task in natural language processing (NLP) that involves identifying and extracting event information from unstructured text. Effective EE in real-world scenarios requires two key steps: selecting appropriate schemas from hundreds of candidates and executing the extraction process. Existing research exhibits two critical gaps: (1) the rigid schema fixation in existing pipeline systems, and (2) the absence of benchmarks for evaluating joint schema matching and extraction. Although large language models (LLMs) offer potential solutions, their schema hallucination tendencies and context window limitations pose challenges for practical deployment. In response, we propose Adaptive Schema-aware Event Extraction (ASEE), a novel paradigm combining schema paraphrasing with schema retrieval-augmented generation. ASEE adeptly retrieves paraphrased schemas and accurately generates targeted structures. To facilitate rigorous evaluation, we construct the Multi-Dimensional Schema-aware Event Extraction (MD-SEE) benchmark, which systematically consolidates 12 datasets across diverse domains, complexity levels, and language settings. Extensive evaluations on MD-SEE show that our proposed ASEE demonstrates strong adaptability across various scenarios, significantly improving the accuracy of event extraction.

Submitted 13 May, 2025; originally announced May 2025.

Comments: 15 pages, 3 figures

ACM Class: I.2.7

arXiv:2505.08293 [pdf, ps, other]

M3G: Multi-Granular Gesture Generator for Audio-Driven Full-Body Human Motion Synthesis

Authors: Zhizhuo Yin, Yuk Hang Tsui, Pan Hui

Abstract: Generating full-body human gestures encompassing face, body, hands, and global movements from audio is a valuable yet challenging task in virtual avatar creation. Previous systems focused on tokenizing the human gestures frame by frame and predicting the tokens of each frame from the input audio. However, one observation is that the number of frames required for a complete expressive human gesture, defined as granularity, varies among different human gesture patterns. Existing systems fail to model these gesture patterns due to the fixed granularity of their gesture tokens. To solve this problem, we propose a novel framework named Multi-Granular Gesture Generator (M3G) for audio-driven holistic gesture generation. In M3G, we propose a novel Multi-Granular VQ-VAE (MGVQ-VAE) to tokenize motion patterns and reconstruct motion sequences from different temporal granularities. Subsequently, we propose a multi-granular token predictor that extracts multi-granular information from audio and predicts the corresponding motion tokens. Then M3G reconstructs the human gestures from the predicted tokens using the MGVQ-VAE. Both objective and subjective experiments demonstrate that our proposed M3G framework outperforms the state-of-the-art methods in terms of generating natural and expressive full-body human gestures.

Submitted 19 May, 2025; v1 submitted 13 May, 2025; originally announced May 2025.

Comments: 9 Pages, 4 figures

ACM Class: I.3.6

arXiv:2505.08291 [pdf, ps, other]

Multireference error mitigation for quantum computation of chemistry

Authors: Hang Zou, Erika Magnusson, Hampus Brunander, Werner Dobrautz, Martin Rahm

Abstract: Quantum error mitigation (QEM) strategies are essential for improving the precision and reliability of quantum chemistry algorithms on noisy intermediate-scale quantum devices. Reference-state error mitigation (REM) is a cost-effective chemistry-inspired QEM method that performs exceptionally well for weakly correlated problems. However, the effectiveness of REM is often limited when applied to strongly correlated systems. Here, we introduce multireference-state error mitigation (MREM), an extension of REM that systematically captures quantum hardware noise in strongly correlated ground states by utilizing multireference states. A pivotal aspect of MREM is using Givens rotations to efficiently construct quantum circuits to generate multireference states. To strike a balance between circuit expressivity and noise sensitivity, we employ compact wavefunctions composed of a few dominant Slater determinants. These truncated multireference states, engineered to exhibit substantial overlap with the target ground state, can effectively enhance error mitigation in variational quantum eigensolver experiments. We demonstrate the effectiveness of MREM through comprehensive simulations of molecular systems $\mathrm{H_2O}$, $\mathrm{N_2}$, and $\mathrm{F_2}$, underscoring its ability to realize significant improvements in computational accuracy compared to the original REM method. MREM broadens the scope of error mitigation to encompass a wider variety of molecular systems, including those exhibiting pronounced electron correlation.

Submitted 13 May, 2025; originally announced May 2025.

arXiv:2505.08265 [pdf, ps, other]

LLM Enhancers for GNNs: An Analysis from the Perspective of Causal Mechanism Identification

Authors: Hang Gao, Wenxuan Huang, Fengge Wu, Junsuo Zhao, Changwen Zheng, Huaping Liu

Abstract: The use of large language models (LLMs) as feature enhancers to optimize node representations, which are then used as inputs for graph neural networks (GNNs), has shown significant potential in graph representation learning. However, the fundamental properties of this approach remain underexplored. To address this issue, we propose a more in-depth analysis based on the interchange intervention method. First, we construct a synthetic graph dataset with controllable causal relationships, enabling precise manipulation of semantic relationships and causal modeling to provide data for analysis. Using this dataset, we conduct interchange interventions to examine the deeper properties of LLM enhancers and GNNs, uncovering their underlying logic and internal mechanisms. Building on the analytical results, we design a plug-and-play optimization module to improve the information transfer between LLM enhancers and GNNs. Experiments across multiple datasets and models validate the proposed module.

Submitted 11 June, 2025; v1 submitted 13 May, 2025; originally announced May 2025.

Comments: Accepted by ICML 2025

arXiv:2505.08155 [pdf, other]

Efficient and Scalable Neural Symbolic Search for Knowledge Graph Complex Query Answering

Authors: Weizhi Fei, Zihao Wang, Hang Yin, Shukai Zhao, Wei Zhang, Yangqiu Song

Abstract: Complex Query Answering (CQA) aims to retrieve answer sets for complex logical formulas from incomplete knowledge graphs, which is a crucial yet challenging task in knowledge graph reasoning. While neuro-symbolic search methods that utilize neural link predictions achieve superior accuracy, they encounter significant complexity bottlenecks: (i) Data complexity typically scales quadratically with the number of entities in the knowledge graph, and (ii) Query complexity becomes NP-hard for cyclic queries. Consequently, these approaches struggle to effectively scale to larger knowledge graphs and more complex queries. To address these challenges, we propose an efficient and scalable symbolic search framework. First, we propose two constraint strategies to compute neural logical indices to reduce the domain of variables, thereby decreasing the data complexity of symbolic search. Additionally, we introduce an approximate algorithm based on local search to tackle the NP-hard query complexity of cyclic queries. Experiments on various CQA benchmarks demonstrate that our framework reduces the computational load of symbolic methods by 90% while maintaining nearly the same performance, thus alleviating both efficiency and scalability issues.

Submitted 20 May, 2025; v1 submitted 12 May, 2025; originally announced May 2025.

arXiv:2505.07916 [pdf, ps, other]

MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder

Authors: Bowen Zhang, Congchao Guo, Geng Yang, Hang Yu, Haozhe Zhang, Heidi Lei, Jialong Mai, Junjie Yan, Kaiyue Yang, Mingqi Yang, Peikai Huang, Ruiyang Jin, Sitan Jiang, Weihua Cheng, Yawei Li, Yichen Xiao, Yiying Zhou, Yongmao Zhang, Yuan Lu, Yucen He

Abstract: We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts timbre features from a reference audio without requiring its transcription. This enables MiniMax-Speech to produce highly expressive speech with timbre consistent with the reference in a zero-shot manner, while also supporting one-shot voice cloning with exceptionally high similarity to the reference voice. In addition, the overall quality of the synthesized audio is enhanced through the proposed Flow-VAE. Our model supports 32 languages and demonstrates excellent performance across multiple objective and subjective evaluation metrics. Notably, it achieves state-of-the-art (SOTA) results on objective voice cloning metrics (Word Error Rate and Speaker Similarity) and has secured the top position on the public TTS Arena leaderboard. Another key strength of MiniMax-Speech, granted by the robust and disentangled representations from the speaker encoder, is its extensibility without modifying the base model, enabling various applications such as: arbitrary voice emotion control via LoRA; text to voice (T2V) by synthesizing timbre features directly from text description; and professional voice cloning (PVC) by fine-tuning timbre features with additional data. We encourage readers to visit https://minimax-ai.github.io/tts_tech_report for more examples.

Submitted 12 May, 2025; originally announced May 2025.

arXiv:2505.07891 [pdf, ps, other]

doi 10.1109/TAI.2025.3567369

TrumorGPT: Graph-Based Retrieval-Augmented Large Language Model for Fact-Checking

Authors: Ching Nam Hang, Pei-Duo Yu, Chee Wei Tan

Abstract: In the age of social media, the rapid spread of misinformation and rumors has led to the emergence of infodemics, where false information poses a significant threat to society. To combat this issue, we introduce TrumorGPT, a novel generative artificial intelligence solution designed for fact-checking in the health domain. TrumorGPT aims to distinguish "trumors", which are health-related rumors that turn out to be true, providing a crucial tool in differentiating between mere speculation and verified facts. This framework leverages a large language model (LLM) with few-shot learning for semantic health knowledge graph construction and semantic reasoning. TrumorGPT incorporates graph-based retrieval-augmented generation (GraphRAG) to address the hallucination issue common in LLMs and the limitations of static training data. GraphRAG involves accessing and utilizing information from regularly updated semantic health knowledge graphs that consist of the latest medical news and health information, ensuring that fact-checking by TrumorGPT is based on the most recent data. Evaluating with extensive healthcare datasets, TrumorGPT demonstrates superior performance in fact-checking for public health claims. Its ability to effectively conduct fact-checking across various platforms marks a critical step forward in the fight against health-related misinformation, enhancing trust and accuracy in the digital information age.

Submitted 11 May, 2025; originally announced May 2025.

arXiv:2505.07680 [pdf, other]

SpecRouter: Adaptive Routing for Multi-Level Speculative Decoding in Large Language Models

Authors: Hang Wu, Jianian Zhu, Yinghui Li, Haojie Wang, Biao Hou, Jidong Zhai

Abstract: Large Language Models (LLMs) present a critical trade-off between inference quality and computational cost: larger models offer superior capabilities but incur significant latency, while smaller models are faster but less powerful. Existing serving strategies often employ fixed model scales or static two-stage speculative decoding, failing to dynamically adapt to the varying complexities of user requests or fluctuations in system performance. This paper introduces SpecRouter, a novel framework that reimagines LLM inference as an adaptive routing problem solved through multi-level speculative decoding. SpecRouter dynamically constructs and optimizes inference "paths" (chains of models) based on real-time feedback, addressing the limitations of static approaches. Our contributions are threefold: (1) An adaptive model chain scheduling mechanism that leverages performance profiling (execution times) and predictive similarity metrics (derived from token distribution divergence) to continuously select the optimal sequence of draft and verifier models, minimizing predicted latency per generated token. (2) A multi-level collaborative verification framework where intermediate models within the selected chain can validate speculative tokens, reducing the verification burden on the final, most powerful target model. (3) A synchronized state management system providing efficient, consistent KV cache handling across heterogeneous models in the chain, including precise, low-overhead rollbacks tailored for asynchronous batch processing inherent in multi-level speculation. Preliminary experiments demonstrate the validity of our method.

Submitted 12 May, 2025; originally announced May 2025.

Comments: 10 pages

arXiv:2505.06875 [pdf, ps, other]

Towards Human-Centric Autonomous Driving: A Fast-Slow Architecture Integrating Large Language Model Guidance with Reinforcement Learning

Authors: Chengkai Xu, Jiaqi Liu, Yicheng Guo, Yuhang Zhang, Peng Hang, Jian Sun

Abstract: Autonomous driving has made significant strides through data-driven techniques, achieving robust performance in standardized tasks. However, existing methods frequently overlook user-specific preferences, offering limited scope for interaction and adaptation with users. To address these challenges, we propose a "fast-slow" decision-making framework that integrates a Large Language Model (LLM) for high-level instruction parsing with a Reinforcement Learning (RL) agent for low-level real-time decision. In this dual system, the LLM operates as the "slow" module, translating user directives into structured guidance, while the RL agent functions as the "fast" module, making time-critical maneuvers under stringent latency constraints. By decoupling high-level decision making from rapid control, our framework enables personalized user-centric operation while maintaining robust safety margins. Experimental evaluations across various driving scenarios demonstrate the effectiveness of our method. Compared to baseline algorithms, the proposed architecture not only reduces collision rates but also aligns driving behaviors more closely with user preferences, thereby achieving a human-centric mode. By integrating user guidance at the decision level and refining it with real-time control, our framework bridges the gap between individual passenger needs and the rigor required for safe, reliable driving in complex traffic environments.

Submitted 11 May, 2025; originally announced May 2025.

arXiv:2505.06512 [pdf, other]

HCMA: Hierarchical Cross-model Alignment for Grounded Text-to-Image Generation

Authors: Hang Wang, Zhi-Qi Cheng, Chenhao Lin, Chao Shen, Lei Zhang

Abstract: Text-to-image synthesis has progressed to the point where models can generate visually compelling images from natural language prompts. Yet, existing methods often fail to reconcile high-level semantic fidelity with explicit spatial control, particularly in scenes involving multiple objects, nuanced relations, or complex layouts. To bridge this gap, we propose a Hierarchical Cross-Modal Alignment (HCMA) framework for grounded text-to-image generation. HCMA integrates two alignment modules into each diffusion sampling step: a global module that continuously aligns latent representations with textual descriptions to ensure scene-level coherence, and a local module that employs bounding-box layouts to anchor objects at specified locations, enabling fine-grained spatial control. Extensive experiments on the MS-COCO 2014 validation set show that HCMA surpasses state-of-the-art baselines, achieving a 0.69 improvement in Frechet Inception Distance (FID) and a 0.0295 gain in CLIP Score. These results demonstrate HCMA's effectiveness in faithfully capturing intricate textual semantics while adhering to user-defined spatial constraints, offering a robust solution for semantically grounded image generation. Our code is available at https://github.com/hwang-cs-ime/HCMA.

Submitted 14 May, 2025; v1 submitted 10 May, 2025; originally announced May 2025.

Comments: 10 pages, 4 figures

arXiv:2505.06455 [pdf, ps, other]

Reconstructing Real-Valued Quantum States

Authors: Zhixin Song, Hang Ren, Melody Lee, Bryan Gard, Nicolas Renaud, Spencer H. Bryngelson

Abstract: Quantum tomography is a crucial tool for characterizing quantum states and devices and estimating nonlinear properties of the systems. Performing full quantum state tomography (FQST) on an $N_\mathrm{q}$ qubit system requires an exponentially increasing overhead with $O(3^{N_\mathrm{q}})$ distinct Pauli measurement settings to resolve all complex phases and reconstruct the density matrix. However, many appealing applications of quantum computing, such as quantum linear system algorithms, require only real-valued amplitudes. Here we introduce a novel readout method for real-valued quantum states that reduces measurement settings required for state vector reconstruction to $O(N_\mathrm{q})$, while the post-processing cost remains exponential. This approach offers a substantial speedup over conventional tomography. We experimentally validate our method up to 10 qubits on the latest available IBM quantum processor and demonstrate that it accurately extracts key properties such as entanglement and magic. Our method also outperforms the standard SWAP test for state overlap estimation. This calculation resembles a numerical integration in certain cases and can be applied to extract nonlinear properties, which are important in application fields.

Submitted 9 May, 2025; originally announced May 2025.

arXiv:2505.06321 [pdf, other]

Learn to Think: Bootstrapping LLM Reasoning Capability Through Graph Representation Learning

Authors: Hang Gao, Chenhao Zhang, Tie Wang, Junsuo Zhao, Fengge Wu, Changwen Zheng, Huaping Liu

Abstract: Large Language Models (LLMs) have achieved remarkable success across various domains. However, they still face significant challenges, including high computational costs for training and limitations in solving complex reasoning problems. Although existing methods have extended the reasoning capabilities of LLMs through structured paradigms, these approaches often rely on task-specific prompts and predefined reasoning processes, which constrain their flexibility and generalizability. To address these limitations, we propose a novel framework that leverages graph learning to enable more flexible and adaptive reasoning capabilities for LLMs. Specifically, this approach models the reasoning process of a problem as a graph and employs LLM-based graph learning to guide the adaptive generation of each reasoning step. To further enhance the adaptability of the model, we introduce a Graph Neural Network (GNN) module to perform representation learning on the generated reasoning process, enabling real-time adjustments to both the model and the prompt. Experimental results demonstrate that this method significantly improves reasoning performance across multiple tasks without requiring additional training or task-specific prompt design. Code can be found in https://github.com/zch65458525/L2T.

Submitted 16 May, 2025; v1 submitted 8 May, 2025; originally announced May 2025.

Comments: Accepted by IJCAI 2025

arXiv:2505.05822 [pdf, ps, other]

Self-reorganization and Information Transfer in Massive Schools of Fish

Authors: Haotian Hang, Chenchen Huang, Alex Barnett, Eva Kanso

Abstract: The remarkable cohesion and coordination observed in moving animal groups and their collective responsiveness to threats are thought to be mediated by scale-free correlations, where changes in the behavior of one animal influence others in the group, regardless of the distance between them. But are these features independent of group size? Here, we investigate group cohesiveness and collective responsiveness in computational models of massive schools of fish of up to 50,000 individuals. We show that as the number of swimmers increases, flow interactions destabilize the school, creating clusters that constantly fragment, disperse, and regroup, similar to their biological counterparts. We calculate the spatial correlation and speed of information propagation in these dynamic clusters. Spatial correlations in cohesive and polarized clusters are indeed scale free, much like in natural animal groups, but fragmentation events are preceded by a decrease in correlation length, thus diminishing the group's collective responsiveness, leaving it more vulnerable to predation events. Importantly, in groups undergoing collective turns, the information about the change in direction propagates linearly in time among group members, thanks to the non-reciprocal nature of the visual interactions between individuals. Merging speeds up the transfer of information within each cluster by several fold, while fragmentation slows it down. Our findings suggest that flow interactions may have played an important role in group size regulation, behavioral adaptations, and dispersion in living animal groups.

Submitted 3 June, 2025; v1 submitted 9 May, 2025; originally announced May 2025.

arXiv:2505.05579

LaZagna: An Open-Source Framework for Flexible 3D FPGA Architectural Exploration

Authors: Ismael Youssef, Hang Yang, Cong Hao

Abstract: While 3D IC technology has been extensively explored for ASICs, its application to FPGAs remains limited. Existing studies on 3D FPGAs are often constrained to fixed prototypes, narrow architectural templates, and simulation-only evaluations. In this work, we present LaZagna, the first open-source framework for automated, end-to-end 3D FPGA architecture generation and evaluation. LaZagna supports high-level architectural specification, synthesizable RTL generation, and bitstream production, enabling comprehensive validation of 3D FPGA designs beyond simulation. It significantly broadens the design space compared to prior work by introducing customizable vertical interconnect patterns, novel 3D switch block designs, and support for heterogeneous logic layers. The framework also incorporates practical design constraints such as inter-layer via density and vertical interconnect delay. We demonstrate the capabilities of LaZagna by generating synthesizable RTL that can be taken through full physical design flows for fabric generation, along with functionally correct bitstreams. Furthermore, we conduct five case studies that explore various architectural parameters and evaluate their impact on wirelength, critical path delay, and routing runtime. These studies showcase the framework's scalability, flexibility, and effectiveness in guiding future 3D FPGA architectural and packaging decisions. LaZagna is fully open-source and available on GitHub.

Submitted 11 June, 2025; v1 submitted 8 May, 2025; originally announced May 2025.

Comments: Withdrawn due to an error in experimental setup that affected the results. A corrected version is in progress

arXiv:2505.05061 [pdf, other]

Seismic first-arrival traveltime simulation based on reciprocity-constrained PINN

Authors: Hang Geng, Chao Song, Umair bin Waheed, Cai Liu

Abstract: Simulating seismic first-arrival traveltime plays a crucial role in seismic tomography. First-arrival traveltime simulation relies on solving the eikonal equation. The accuracy of conventional numerical solvers is limited to a finite-difference approximation. In recent years, physics-informed neural networks (PINNs) have been applied to achieve this task. However, traditional PINNs encounter challenges in accurately solving the eikonal equation, especially in cases where the model exhibits directional scaling differences. These challenges result in substantial traveltime prediction errors when the traveling distance is long. To improve the accuracy of PINN in traveltime prediction, we incorporate the reciprocity principle as a constraint into the PINN training framework. Based on the reciprocity principle, which states that the traveltime between two points remains invariant when their roles as source and receiver are exchanged, we propose to apply this principle to multiple source-receiver pairs in PINN-based traveltime prediction. Furthermore, a dynamic weighting mechanism is proposed to balance the contributions of the eikonal equation loss and the reciprocity-constrained loss during the training process. This adaptive weighting evolves dynamically with the training epochs, enhancing the convergence of the training process. Experiments conducted on a simple lens velocity model, the Overthrust velocity model, and a 3D velocity model demonstrate that the introduction of the reciprocity-constrained PINN significantly improves the accuracy of traveltime predictions.

Submitted 8 May, 2025; originally announced May 2025.

arXiv:2505.04996 [pdf, other]

Inter-Diffusion Generation Model of Speakers and Listeners for Effective Communication

Authors: Jinhe Huang, Yongkang Cheng, Yuming Hang, Gaoge Han, Jinewei Li, Jing Zhang, Xingjian Gu

Abstract: Full-body gestures play a pivotal role in natural interactions and are crucial for achieving effective communication. Nevertheless, most existing studies primarily focus on the gesture generation of speakers, overlooking the vital role of listeners in the interaction process and failing to fully explore the dynamic interaction between them. This paper innovatively proposes an Inter-Diffusion Generation Model of Speakers and Listeners for Effective Communication. For the first time, we integrate the full-body gestures of listeners into the generation framework. By devising a novel inter-diffusion mechanism, this model can accurately capture the complex interaction patterns between speakers and listeners during communication. In the model construction process, based on the advanced diffusion model architecture, we innovatively introduce interaction conditions and the GAN model to increase the denoising step size. As a result, when generating gesture sequences, the model can not only dynamically generate based on the speaker's speech information but also respond in real time to the listener's feedback, enabling synergistic interaction between the two. Abundant experimental results demonstrate that compared with the current state-of-the-art gesture generation methods, the model we proposed has achieved remarkable improvements in the naturalness, coherence, and speech-gesture synchronization of the generated gestures. In the subjective evaluation experiments, users highly praised the generated interaction scenarios, believing that they are closer to real-life human communication situations. Objective index evaluations also show that our model outperforms the baseline methods in multiple key indicators, providing more powerful support for effective communication.

Submitted 8 May, 2025; originally announced May 2025.

Comments: accepted by ICMR 2025

arXiv:2505.04662 [pdf, other]

doi 10.1109/JAS.2025.125438

Crafting Physical Adversarial Examples by Combining Differentiable and Physically Based Renders

Authors: Yuqiu Liu, Huanqian Yan, Xiaopei Zhu, Xiaolin Hu, Liang Tang, Hang Su, Chen Lv

Abstract: Recently we have witnessed progress in hiding road vehicles against object detectors through adversarial camouflage in the digital world. The extension of this technique to the physical world is crucial for testing the robustness of autonomous driving systems. However, existing methods do not show good performance when applied to the physical world. This is partly due to insufficient photorealism in training examples, and lack of proper physical realization methods for camouflage. To generate a robust adversarial camouflage suitable for real vehicles, we propose a novel method called PAV-Camou. We propose to adjust the mapping from the coordinates in the 2D map to those of the corresponding 3D model. This process is critical for mitigating texture distortion and ensuring the camouflage's effectiveness when applied in the real world. Then we combine two renderers with different characteristics to obtain photorealistic adversarial examples that closely mimic real-world lighting and texture properties. The method ensures that the generated textures remain effective under diverse environmental conditions. Our adversarial camouflage can be optimized and printed in the form of 2D patterns, allowing for direct application on real vehicles. Extensive experiments demonstrated that our proposed method achieved good performance in both the digital world and the physical world.

Submitted 6 May, 2025; originally announced May 2025.

Comments: 13 pages, 15 figures; this paper has been accepted by IEEE/CAA Journal of Automatica Sinica

arXiv:2505.04519 [pdf, other]

Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs

Authors: Yehui Tang, Yichun Yin, Yaoyuan Wang, Hang Zhou, Yu Pan, Wei Guo, Ziyang Zhang, Miao Rang, Fangcheng Liu, Naifu Zhang, Binghan Li, Yonghan Dong, Xiaojun Meng, Yasheng Wang, Dong Li, Yin Li, Dandan Tu, Can Chen, Youliang Yan, Fisher Yu, Ruiming Tang, Yunhe Wang, Botian Huang, Bo Wang, Boxiao Liu , et al. (49 additional authors not shown)

Abstract: Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models. However, the massive model scale poses significant challenges for the underlying software and hardware systems. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware. To select model configurations suitable for Ascend NPUs without repeatedly running the expensive experiments, we leverage simulation to compare the trade-off of various model hyperparameters. This study led to Pangu Ultra MoE, a sparse LLM with 718 billion parameters, and we conducted experiments on the model to verify the simulation results. On the system side, we dig into Expert Parallelism to optimize the communication between NPU devices to reduce the synchronization overhead. We also optimize the memory efficiency within the devices to further reduce the parameter and activation management overhead. In the end, we achieve an MFU of 30.0% when training Pangu Ultra MoE, with performance comparable to that of DeepSeek R1, on 6K Ascend NPUs, and demonstrate that the Ascend system is capable of harnessing all the training stages of the state-of-the-art language models. Extensive experiments indicate that our recipe can lead to efficient training of large-scale sparse language models with MoE. We also study the behaviors of such models for future reference.

Submitted 7 May, 2025; originally announced May 2025.

arXiv:2505.04212 [pdf, other]

MAMMOTH-MOSFIRE: Environmental Effects on Galaxy Interstellar Medium at $z\sim2$

Authors: Hang Zhou, Xin Wang, Matthew A. Malkan, Tommaso Treu, Yiming Yang, Zheng Cai, Xiaohui Fan, Mengting Ju, Dong Dong Shi, Anahita Alavi, Fuyan Bian, James Colbert, Alaina L. Henry, Sijia Li, Zihao Li, Harry I. Teplitz, Hu Zhan, Xian Zhong Zheng, Zheng Zheng

Abstract: The MAMMOTH-MOSFIRE program is a deep Keck MOSFIRE K-band spectroscopic follow-up of emission-line galaxies identified in the MAMMOTH-Grism HST WFC3/G141 slitless spectroscopic survey, targeting the core regions of three most massive galaxy protoclusters at cosmic noon. To introduce this program, we present a comprehensive analysis of the emission-line diagnostics for a unique sample of 43 protocluster member galaxies at $z\sim2$, investigating the impact of the overdense environment on their interstellar medium conditions. We characterize their ionization and excitation state using the $\rm [N\,II]λ$6584, $\rm [S\,II]λλ$6717,6731, and $\rm [O\,I]λ$6300 BPT diagrams, from a full suite of rest-frame optical emission lines jointly covered by Keck MOSFIRE and HST G141 spectroscopy. Our analysis reveals a median electron density of $n_{\rm e}\approx290~{\rm cm}^{-3}$ from $\rm [S\,II]$ doublets, consistent with measurements from field galaxies at similar redshifts. Like their field counterparts at $z\sim2$, protocluster galaxies exhibit a systematic offset in the N2 BPT diagram compared to the local star-forming sequence, but no offset in the S2 BPT diagram. Notably, we find significantly enhanced $\rm [O\,I]/Hα$ ratios, which can be well explained by photoionization models incorporating both $\rm H\,II$ regions and shock excitation. This work highlights the powerful synergy between high-resolution Keck MOSFIRE K-band spectroscopy and HST G141 slitless spectroscopy, enabling comprehensive coverage of the rest-frame optical spectra of galaxies at $z\sim2$.

Submitted 7 May, 2025; originally announced May 2025.

Comments: 20 pages, 9 figures, 6 tables, submitted to ApJ

arXiv:2505.03756 [pdf, other]

Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management

Authors: Hang Zhang, Jiuchen Shi, Yixiao Wang, Quan Chen, Yizhou Shan, Minyi Guo

Abstract: Multiple Low-Rank Adapters (Multi-LoRAs) are gaining popularity for task-specific Large Language Model (LLM) applications. For multi-LoRA serving, caching hot KV caches and LoRA adapters in the high bandwidth memory (HBM) of accelerators can improve inference performance. However, existing Multi-LoRA inference systems fail to optimize serving performance like Time-To-First-Token (TTFT), neglecting usage dependencies when caching LoRAs and KVs. We therefore propose FASTLIBRA, a Multi-LoRA caching system to optimize the serving performance. FASTLIBRA comprises a dependency-aware cache manager and a performance-driven cache swapper. The cache manager maintains the usage dependencies between LoRAs and KV caches during the inference with a unified caching pool. The cache swapper determines the swap-in or out of LoRAs and KV caches based on a unified cost model, when the HBM is idle or busy, respectively. Experimental results show that ELORA reduces the TTFT by 63.4% on average, compared to state-of-the-art works.

Submitted 19 April, 2025; originally announced May 2025.

arXiv:2505.03739 [pdf, other]

VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

Authors: Zuwei Long, Yunhang Shen, Chaoyou Fu, Heting Gao, Lijiang Li, Peixian Chen, Mengdan Zhang, Hang Shao, Jian Li, Jinlong Peng, Haoyu Cao, Ke Li, Rongrong Ji, Xing Sun

Abstract: With the growing requirement for natural human-computer interaction, speech-based systems receive increasing attention as speech is one of the most common forms of daily communication. However, the existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates the inference but also significantly reduces the latency for generating the first audio in streaming scenarios. In addition, a four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality. To our knowledge, VITA-Audio is the first multi-modal large language model capable of generating audio output during the first forward pass, enabling real-time conversational capabilities with minimal latency. VITA-Audio is fully reproducible and is trained on open-source data only. Experimental results demonstrate that our model not only achieves an inference speedup of 3-5x at the 7B parameter scale, but also significantly outperforms open-source models of similar model size on multiple benchmarks for automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA) tasks.

Submitted 6 May, 2025; originally announced May 2025.

Comments: Training and Inference Codes: https://github.com/VITA-MLLM/VITA-Audio

arXiv:2505.03469 [pdf, other]

Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning Eliciting Efficient Reasoning in Large Language Models

Authors: Bin Yu, Hang Yuan, Haotian Li, Xueyin Xu, Yuliang Wei, Bailing Wang, Weizhen Qi, Kai Chen

Abstract: Recent advances in large language models have demonstrated that Supervised Fine-Tuning (SFT) with Chain-of-Thought (CoT) reasoning data distilled from large reasoning models (e.g., DeepSeek R1) can effectively transfer reasoning capabilities to non-reasoning models. However, models fine-tuned with this approach inherit the "overthinking" problem from teacher models, producing verbose and redundant reasoning chains during inference. To address this challenge, we propose Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning (LS-Mixture SFT), which combines a long CoT reasoning dataset with its short counterparts obtained through structure-preserved rewriting. Our experiments demonstrate that models trained using the LS-Mixture SFT method, compared to those trained with direct SFT, achieved an average accuracy improvement of 2.3% across various benchmarks while substantially reducing model response length by approximately 47.61%. This work offers an approach to endow non-reasoning models with reasoning capabilities through supervised fine-tuning while avoiding the inherent overthinking problems inherited from teacher models, thereby enabling efficient reasoning in the fine-tuned models.

Submitted 21 May, 2025; v1 submitted 6 May, 2025; originally announced May 2025.

Comments: 12 pages, 5 figures

arXiv:2505.02928 [pdf, other]

Redshift Assessment Infrastructure Layers (RAIL): Rubin-era photometric redshift stress-testing and at-scale production

Authors: The RAIL Team, Jan Luca van den Busch, Eric Charles, Johann Cohen-Tanugi, Alice Crafford, John Franklin Crenshaw, Sylvie Dagoret, Josue De-Santiago, Juan De Vicente, Qianjun Hang, Benjamin Joachimi, Shahab Joudaki, J. Bryce Kalmbach, Shuang Liang, Olivia Lynn, Alex I. Malz, Rachel Mandelbaum, Grant Merz, Irene Moskowitz, Drew Oldag, Jaime Ruiz-Zapatero, Mubdi Rahman, Samuel J. Schmidt, Jennifer Scora, Raphael Shirley , et al. (6 additional authors not shown)

Abstract: Virtually all extragalactic use cases of the Vera C. Rubin Observatory's Legacy Survey of Space and Time (LSST) require the use of galaxy redshift information, yet the vast majority of its sample of tens of billions of galaxies will lack high-fidelity spectroscopic measurements thereof, instead relying on photometric redshifts (photo-$z$) subject to systematic imprecision and inaccuracy best encapsulated by photo-$z$ probability density functions (PDFs). We present the version 1 release of Redshift Assessment Infrastructure Layers (RAIL), an open source Python library for at-scale probabilistic photo-$z$ estimation, initiated by the LSST Dark Energy Science Collaboration (DESC) with contributions from the LSST Interdisciplinary Network for Collaboration and Computing (LINCC) Frameworks team. RAIL's three subpackages provide modular tools for end-to-end stress-testing, including a forward modeling suite to generate realistically complex photometry, a unified API for estimating per-galaxy and ensemble redshift PDFs by an extensible set of algorithms, and built-in metrics of both photo-$z$ PDFs and point estimates. RAIL serves as a flexible toolkit enabling the derivation and optimization of photo-$z$ data products at scale for a variety of science goals and is not specific to LSST data. We thus describe to the extragalactic science community, including and beyond Rubin, the design and functionality of the RAIL software library so that any researcher may have access to its wide array of photo-$z$ characterization and assessment tools.

Submitted 5 May, 2025; originally announced May 2025.

Comments: Submitted to OJA, 21 pages, 6 figures, 5 tables. Comments welcomed!

arXiv:2505.02825 [pdf, ps, other]

Towards Application-Specific Evaluation of Vision Models: Case Studies in Ecology and Biology

Authors: Alex Hoi Hang Chan, Otto Brookes, Urs Waldmann, Hemal Naik, Iain D. Couzin, Majid Mirmehdi, Noël Adiko Houa, Emmanuelle Normand, Christophe Boesch, Lukas Boesch, Mimi Arandjelovic, Hjalmar Kühl, Tilo Burghardt, Fumihiro Kano

Abstract: Computer vision methods have demonstrated considerable potential to streamline ecological and biological workflows, with a growing number of datasets and models becoming available to the research community. However, these resources focus predominantly on evaluation using machine learning metrics, with relatively little emphasis on how their application impacts downstream analysis. We argue that models should be evaluated using application-specific metrics that directly represent model performance in the context of its final use case. To support this argument, we present two disparate case studies: (1) estimating chimpanzee abundance and density with camera trap distance sampling when using a video-based behaviour classifier and (2) estimating head rotation in pigeons using a 3D posture estimator. We show that even models with strong machine learning performance (e.g., 87% mAP) can yield data that leads to discrepancies in abundance estimates compared to expert-derived data. Similarly, the highest-performing models for posture estimation do not produce the most accurate inferences of gaze direction in pigeons. Motivated by these findings, we call for researchers to integrate application-specific metrics in ecological/biological datasets, allowing for models to be benchmarked in the context of their downstream application and to facilitate better integration of models into application workflows.

Submitted 6 May, 2025; v1 submitted 5 May, 2025; originally announced May 2025.

Comments: Accepted at CVPR Workshop, CV4Animals 2025

arXiv:2505.02498 [pdf, ps, other]

A higher index and rapidly decaying kernels

Authors: Hao Guo, Peter Hochs, Hang Wang

Abstract: We construct an index of first-order, self-adjoint, elliptic differential operators in the $K$-theory of a Fréchet algebra of smooth kernels with faster than exponential off-diagonal decay. We show that this index can be represented by an idempotent involving heat operators. The rapid decay of the kernels in the algebra used is helpful in proving convergence of pairings with cyclic cocycles. Representing the index in terms of heat operators allows one to use heat kernel asymptotics to compute such pairings. We give a link to von Neumann algebras and $L^2$-index theorems as an immediate application, and work out further applications in other papers.

Submitted 5 May, 2025; originally announced May 2025.

Comments: The preprint with ArXiv number 2407.16275 was split into two parts; this is the first part. arXiv admin note: substantial text overlap with arXiv:2407.16275

arXiv:2505.01983 [pdf, ps, other]

Association and Independence Test for Random Objects

Authors: Hang Zhou, Hans-Georg Müller

Abstract: We develop a unified framework for testing independence and quantifying association between random objects that are located in general metric spaces. Special cases include functional and high-dimensional data as well as networks, covariance matrices and data on Riemannian manifolds, among other metric space-valued data. A key concept is the profile association, a measure based on distance profiles that intrinsically characterize the distributions of random objects in metric spaces. We rigorously establish a connection between the Hoeffding D statistic and the profile association and derive a permutation test with theoretical guarantees for consistency and power under alternatives to the null hypothesis of independence/no association. We extend this framework to the conditional setting, where the independence between random objects given a Euclidean predictor is of interest. In simulations across various metric spaces, the proposed profile independence test is found to outperform existing approaches. The practical utility of this framework is demonstrated with applications to brain connectivity networks derived from magnetic resonance imaging and age-at-death distributions for males and females obtained from human mortality data.

Submitted 10 June, 2025; v1 submitted 4 May, 2025; originally announced May 2025.

arXiv:2505.01950 [pdf, other]

Segment Any RGB-Thermal Model with Language-aided Distillation

Authors: Dong Xing, Xianxun Zhu, Wei Zhou, Qika Lin, Hang Yang, Yuqing Wang

Abstract: The recent Segment Anything Model (SAM) demonstrates strong instance segmentation performance across various downstream tasks. However, SAM is trained solely on RGB data, limiting its direct applicability to RGB-thermal (RGB-T) semantic segmentation. Given that RGB-T provides a robust solution for scene understanding in adverse weather and lighting conditions, such as low light and overexposure, we propose a novel framework, SARTM, which customizes the powerful SAM for RGB-T semantic segmentation. Our key idea is to unleash the potential of SAM while introducing semantic understanding modules for RGB-T data pairs. Specifically, our framework first involves fine-tuning the original SAM by adding extra LoRA layers, aiming at preserving SAM's strong generalization and segmentation capabilities for downstream tasks. Secondly, we introduce language information as guidance for training our SARTM. To address cross-modal inconsistencies, we introduce a Cross-Modal Knowledge Distillation (CMKD) module that effectively achieves modality adaptation while maintaining its generalization capabilities. This semantic module enables the minimization of modality gaps and alleviates semantic ambiguity, facilitating the combination of any modality under any visual conditions. Furthermore, we enhance the segmentation performance by adjusting the segmentation head of SAM and incorporating an auxiliary semantic segmentation head, which integrates multi-scale features for effective fusion.
Extensive experiments are conducted across three multi-modal RGB-T semantic segmentation benchmarks: MFNET, PST900, and FMB. Both quantitative and qualitative results consistently demonstrate that the proposed SARTM significantly outperforms state-of-the-art approaches across a variety of conditions.

Submitted 3 May, 2025; originally announced May 2025.

Comments: arXiv admin note: text overlap with arXiv:2412.04220 by other authors

arXiv:2505.01458 [pdf, other]

A Survey of Robotic Navigation and Manipulation with Physics Simulators in the Era of Embodied AI

Authors: Lik Hang Kenny Wong, Xueyang Kang, Kaixin Bai, Jianwei Zhang

Abstract: Navigation and manipulation are core capabilities in Embodied AI, yet training agents with these capabilities in the real world faces high costs and time complexity. Therefore, sim-to-real transfer has emerged as a key approach, yet the sim-to-real gap persists. This survey examines how physics simulators address this gap by analyzing their properties overlooked in previous surveys. We also analyze their features for navigation and manipulation tasks, along with hardware requirements. Additionally, we offer a resource with benchmark datasets, metrics, simulation platforms, and cutting-edge methods (such as world models and geometric equivariance) to help researchers select suitable tools while accounting for hardware constraints.

Submitted 1 May, 2025; originally announced May 2025.

arXiv:2505.01383 [pdf, other]

FalconWing: An Open-Source Platform for Ultra-Light Fixed-Wing Aircraft Research

Authors: Yan Miao, Will Shen, Hang Cui, Sayan Mitra

Abstract: We present FalconWing -- an open-source, ultra-lightweight (150 g) fixed-wing platform for autonomy research. The hardware platform integrates a small camera, a standard airframe, offboard computation, and radio communication for manual overrides. We demonstrate FalconWing's capabilities by developing and deploying a purely vision-based control policy for autonomous landing (without IMU or motion capture) using a novel real-to-sim-to-real learning approach. Our learning approach: (1) constructs a photorealistic simulation environment via 3D Gaussian splatting trained on real-world images; (2) identifies nonlinear dynamics from vision-estimated real-flight data; and (3) trains a multi-modal Vision Transformer (ViT) policy through simulation-only imitation learning. The ViT architecture fuses a single RGB image with the history of control actions via self-attention, preserving temporal context while maintaining real-time 20 Hz inference. When deployed zero-shot on the hardware platform, this policy achieves an 80% success rate in vision-based autonomous landings. Together with the hardware specifications, we also open-source the system dynamics, the software for the photorealistic simulator, and the learning approach.

Submitted 2 May, 2025; originally announced May 2025.

arXiv:2505.00565 [pdf]

Differentiating anomalous and topological Hall effects using first-order reversal curve measurements

Authors: Gregory M. Stephen, Ryan T. Van Haren, Vinay Sharma, Lixuan Tai, Bingqian Dai, Hang Chi, Kang L. Wang, Aubrey T. Hanbicki, Adam L. Friedman

Abstract: Next generation magnetic memories rely on novel magnetic phases for information storage. Novel spin textures such as skyrmions provide one possible avenue forward due to their topological protection and controllability via electric fields. However, the common signature of these spin textures, the topological Hall effect (THE), can be mimicked by other trivial effects. Competing anomalous Hall effect (AHE) components can produce a peak in the Hall voltage similar to that of the THE, making clear identification of the THE difficult. By applying the first-order reversal curve (FORC) technique to the Hall effect in candidate topological Hall systems we can clearly distinguish between the THE and AHE. This technique allows for quantitative investigation of the THE and AHE in magnetic materials and heterostructures with topologically non-trivial spin textures. We demonstrate the technique and apply it to several examples.

Submitted 1 May, 2025; originally announced May 2025.

Comments: 10 pages, 4 figures

arXiv:2504.21741 [pdf, ps, other]

Asymptotic diameter of preferential attachment model

Authors: Hang Du, Shuyang Gong, Zhangsong Li, Haodong Zhu

Abstract: We study the asymptotic diameter of the preferential attachment model $\operatorname{PA}\!_n^{(m,δ)}$ with parameters $m \ge 2$ and $δ > 0$. Building on the recent work \cite{VZ25}, we prove that the diameter of $G_n \sim \operatorname{PA}\!_n^{(m,δ)}$ is $(1+o(1))\log_ν n$ with high probability, where $ν$ is the exponential growth rate of the local weak limit of $G_n$. Our result confirms the conjecture in \cite{VZ25} and closes the remaining gap in understanding the asymptotic diameter of preferential attachment graphs with general parameters $m \ge 1$ and $δ > -m$. Our proof follows a general recipe that relates the diameter of a random graph to its typical distance, which we expect to have applicability in a broader range of models.

Submitted 30 April, 2025; originally announced April 2025.

Comments: 11 pages

MSC Class: 05C80; 05C82

arXiv:2504.21622 [pdf, other]

Path Planning on Multi-level Point Cloud with a Weighted Traversability Graph

Authors: Yujie Tang, Quan Li, Hao Geng, Yangmin Xie, Hang Shi, Yusheng Yang

Abstract: This article proposes a new path planning method for addressing multi-level terrain situations. The proposed method includes innovations in three aspects: 1) the pre-processing of point cloud maps with a multi-level skip-list structure and data-slimming algorithm for well-organized and simplified map formalization and management, 2) the direct acquisition of local traversability indexes through vehicle and point cloud interaction analysis, which saves work in surface fitting, and 3) the assignment of traversability indexes on a multi-level connectivity graph to generate a weighted traversability graph for general search-based path planning. The A* algorithm is modified to utilize the traversability graph to generate a short and safe path. The effectiveness and reliability of the proposed method are verified through indoor and outdoor experiments conducted in various environments, including multi-floor buildings, woodland, and rugged mountainous regions. The results demonstrate that the proposed method can properly address 3D path planning problems for ground vehicles in a wide range of situations.

Submitted 30 April, 2025; originally announced April 2025.
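
Illustrative note on arXiv:2504.21622 above: the abstract describes running a modified A* search over a weighted traversability graph. The sketch below is not the authors' algorithm; it only shows one plausible way a traversability index could enter an A* edge cost. The cost formula (Euclidean edge length inflated by a hypothetical `penalty` factor for poorly traversable edges), the graph layout, and all names are assumptions made for illustration.

```python
import heapq
import math


def astar(coords, edges, start, goal, penalty=2.0):
    """A* over a graph whose edges carry a traversability index in (0, 1].

    coords: node -> (x, y, z) position
    edges:  node -> list of (neighbor, traversability)
    Assumed edge cost: length * (1 + penalty * (1 - traversability)).
    Since cost >= length, straight-line distance is a consistent heuristic.
    """
    def dist(a, b):
        return math.dist(coords[a], coords[b])

    open_heap = [(dist(start, goal), start)]
    g = {start: 0.0}          # best known cost from start to each node
    parent = {start: None}
    closed = set()
    while open_heap:
        _, node = heapq.heappop(open_heap)
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1], g[goal]
        if node in closed:
            continue
        closed.add(node)
        for nbr, trav in edges.get(node, []):
            new_g = g[node] + dist(node, nbr) * (1.0 + penalty * (1.0 - trav))
            if new_g < g.get(nbr, math.inf):
                g[nbr] = new_g
                parent[nbr] = node
                heapq.heappush(open_heap, (new_g + dist(nbr, goal), nbr))
    return None, math.inf


# Toy graph: the straight route A-B-D is shorter but B-D is hard to traverse.
coords = {"A": (0, 0, 0), "B": (1, 0, 0), "C": (1, 1, 0), "D": (2, 0, 0)}
edges = {"A": [("B", 1.0), ("C", 1.0)], "B": [("D", 0.2)], "C": [("D", 1.0)]}
path, cost = astar(coords, edges, "A", "D")
print(path, round(cost, 3))
```

With the penalty switched off (`penalty=0.0`) the same search returns the geometrically shorter route through B, which shows how the weighting trades path length against traversability in this toy setting.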

arXiv:2504.21055 [pdf, ps, other]

Modeling and Performance Analysis for Semantic Communications Based on Empirical Results

Authors: Shuai Ma, Bin Shen, Chuanhui Zhang, Youlong Wu, Hang Li, Shiyin Li, Guangming Shi, Naofal Al-Dhahir

Abstract: Due to the black-box characteristics of deep learning based semantic encoders and decoders, finding a tractable method for the performance analysis of semantic communications is a challenging problem. In this paper, we propose an Alpha-Beta-Gamma (ABG) formula to model the relationship between the end-to-end measurement and SNR, which can be applied for both image reconstruction tasks and inference tasks. Specifically, for image reconstruction tasks, the proposed ABG formula fits well the commonly used DL networks, such as SCUNet and Vision Transformer, for semantic encoding with the multi-scale structural similarity index measure (MS-SSIM). Furthermore, we find that the upper bound of the MS-SSIM depends on the number of quantized output bits of semantic encoders, and we also propose a closed-form expression to fit the relationship between the MS-SSIM and quantized output bits. To the best of our knowledge, this is the first theoretical expression between end-to-end performance metrics and SNR for semantic communications. Based on the proposed ABG formula, we investigate an adaptive power control scheme for semantic communications over random fading channels, which can effectively guarantee quality of service (QoS) for semantic communications, and then design the optimal power allocation scheme to maximize the energy efficiency of the semantic communication system.
Furthermore, by exploiting the bisection algorithm, we develop the power allocation scheme to maximize the minimum QoS of multiple users for OFDMA downlink semantic communication. Extensive simulations verify the effectiveness and superiority of the proposed ABG formula and power allocation schemes.

Submitted 29 April, 2025; originally announced April 2025.

arXiv:2504.21017 [pdf, ps, other]

ViQA-COVID: COVID-19 Machine Reading Comprehension Dataset for Vietnamese

Authors: Hai-Chung Nguyen-Phung, Ngoc C. Lê, Van-Chien Nguyen, Hang Thi Nguyen, Thuy Phuong Thi Nguyen

Abstract: Two years after its emergence, COVID-19 has negatively affected people and normal life around the world. As of May 2022, there were more than 522 million cases and six million deaths worldwide (including nearly ten million cases and over forty-three thousand deaths in Vietnam). Both the economy and society have been severely affected. The Omicron variant of COVID-19 has broken through countries' disease prevention measures and rapidly increased the number of infections. Resources for treatment and epidemic prevention are overloaded all over the world. The application of artificial intelligence (AI) to support people at this time is therefore extremely necessary. Many useful studies have applied AI to COVID-19 prevention, including studies on machine reading comprehension (MRC). With this in mind, we created ViQA-COVID, the first MRC dataset about COVID-19 for Vietnamese, which can be used to build models and systems that contribute to disease prevention. ViQA-COVID is also the first multi-span extraction MRC dataset for Vietnamese, and we hope that it can help promote MRC studies in Vietnamese and in multilingual settings.

Submitted 14 June, 2025; v1 submitted 21 April, 2025; originally announced April 2025.

Comments: 8 pages. Technical report

arXiv:2504.20681 [pdf, other]

Data Encryption Battlefield: A Deep Dive into the Dynamic Confrontations in Ransomware Attacks

Authors: Arash Mahboubi, Hamed Aboutorab, Seyit Camtepe, Hang Thanh Bui, Khanh Luong, Keyvan Ansari, Shenlu Wang, Bazara Barry

Abstract: In the rapidly evolving landscape of cybersecurity threats, ransomware represents a significant challenge. Attackers increasingly employ sophisticated encryption methods, such as entropy reduction through Base64 encoding, and partial or intermittent encryption to evade traditional detection methods. This study explores the dynamic battle between adversaries who continuously refine encryption strategies and defenders developing advanced countermeasures to protect vulnerable data. We investigate the application of online incremental machine learning algorithms designed to predict file encryption activities despite adversaries' evolving obfuscation techniques. Our analysis utilizes an extensive dataset of 32.6 GB, comprising 11,928 files across multiple formats, including Microsoft Word documents (doc), PowerPoint presentations (ppt), Excel spreadsheets (xlsx), image formats (jpg, jpeg, png, tif, gif), PDFs (pdf), audio (mp3), and video (mp4) files. These files were encrypted by 75 distinct ransomware families, facilitating a robust empirical evaluation of machine learning classifiers' effectiveness against diverse encryption tactics. Results highlight the Hoeffding Tree algorithm's superior incremental learning capability, particularly effective in detecting traditional and AES-Base64 encryption methods employed to lower entropy.
Conversely, the Random Forest classifier with warm-start functionality excels at identifying intermittent encryption methods, demonstrating the necessity of tailored machine learning solutions to counter sophisticated ransomware strategies.

Submitted 29 April, 2025; originally announced April 2025.

MSC Class: 68M25
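
Illustrative note on arXiv:2504.20681 above: the abstract mentions "entropy reduction through Base64 encoding". The sketch below is not taken from the paper; it only demonstrates the underlying effect with the Python standard library. Seeded pseudo-random bytes stand in for ciphertext (an assumption), and the helper name `shannon_entropy` is hypothetical. Because Base64 output draws on a 64-character alphabet, its byte-level entropy cannot exceed 6 bits per byte, whereas well-encrypted data approaches 8.

```python
import base64
import math
import random
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a non-empty byte string, in bits per byte."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Stand-in for ciphertext: seeded pseudo-random bytes (assumption).
# The length is a multiple of 3 so the Base64 output has no '=' padding.
rng = random.Random(0)
ciphertext_like = bytes(rng.getrandbits(8) for _ in range(3 * 2**14))
encoded = base64.b64encode(ciphertext_like)

raw_entropy = shannon_entropy(ciphertext_like)
b64_entropy = shannon_entropy(encoded)
print(f"raw bytes:      {raw_entropy:.3f} bits/byte")
print(f"Base64-encoded: {b64_entropy:.3f} bits/byte")
```

A detector that flags files purely on near-8-bit entropy would therefore miss Base64-wrapped ciphertext, which is consistent with the evasion tactic the abstract names.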

arXiv:2504.20468 [pdf, other]

Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception

Authors: Yuanchen Wu, Lu Zhang, Hang Yao, Junlong Du, Ke Yan, Shouhong Ding, Yunsheng Wu, Xiaoqiang Li

Abstract: Large Vision-Language Models (LVLMs) have achieved impressive results across various cross-modal tasks. However, hallucinations, i.e., the models generating counterfactual responses, remain a challenge. Though recent studies have attempted to alleviate object perception hallucinations, they focus on the models' response generation, overlooking the task question itself. This paper discusses the vulnerability of LVLMs in solving counterfactual presupposition questions (CPQs), where the models are prone to accept the presuppositions of counterfactual objects and produce severe hallucinatory responses. To this end, we introduce "Antidote", a unified, synthetic data-driven post-training framework for mitigating both types of hallucination above. It leverages synthetic data to incorporate factual priors into questions to achieve self-correction, and decouples the mitigation process into a preference optimization problem. Furthermore, we construct "CP-Bench", a novel benchmark to evaluate LVLMs' ability to correctly handle CPQs and produce factual responses. Applied to the LLaVA series, Antidote can simultaneously enhance performance on CP-Bench by over 50%, POPE by 1.8-3.3%, and CHAIR & SHR by 30-50%, all without relying on external supervision from stronger LVLMs or human feedback and without introducing noticeable catastrophic forgetting issues.

Submitted 7 May, 2025; v1 submitted 29 April, 2025; originally announced April 2025.

Comments: Accepted to CVPR 2025

arXiv:2504.19860 [pdf, other]

CoherenDream: Boosting Holistic Text Coherence in 3D Generation via Multimodal Large Language Models Feedback

Authors: Chenhan Jiang, Yihan Zeng, Hang Xu, Dit-Yan Yeung

Abstract: Score Distillation Sampling (SDS) has achieved remarkable success in text-to-3D content generation. However, SDS-based methods struggle to maintain semantic fidelity for user prompts, particularly when involving multiple objects with intricate interactions. While existing approaches often address 3D consistency through multiview diffusion model fine-tuning on 3D datasets, this strategy inadvertently exacerbates text-3D alignment degradation. The limitation stems from SDS's inherent accumulation of view-independent biases during optimization, which progressively diverges from the ideal text alignment direction. To alleviate this limitation, we propose a novel SDS objective, dubbed Textual Coherent Score Distillation (TCSD), which integrates alignment feedback from multimodal large language models (MLLMs). Our TCSD leverages cross-modal understanding capabilities of MLLMs to assess and guide the text-3D correspondence during the optimization. We further develop 3DLLaVA-CRITIC, a fine-tuned MLLM specialized for evaluating multiview text alignment in 3D generations. Additionally, we introduce an LLM-layout initialization that significantly accelerates optimization convergence through semantic-aware spatial configuration. Comprehensive evaluations demonstrate that our framework, CoherenDream, establishes state-of-the-art performance in text-aligned 3D generation across multiple benchmarks, including T$^3$Bench and TIFA subset. Qualitative results showcase the superior performance of CoherenDream in preserving textual consistency and semantic interactions.
As the first study to incorporate MLLMs into SDS optimization, we also conduct extensive ablation studies to explore optimal MLLM adaptations for 3D generation tasks.

Submitted 28 April, 2025; originally announced April 2025.

arXiv:2504.19478 [pdf, other]

CasaGPT: Cuboid Arrangement and Scene Assembly for Interior Design

Authors: Weitao Feng, Hang Zhou, Jing Liao, Li Cheng, Wenbo Zhou

Abstract: We present a novel approach for indoor scene synthesis, which learns to arrange decomposed cuboid primitives to represent 3D objects within a scene. Unlike conventional methods that use bounding boxes to determine the placement and scale of 3D objects, our approach leverages cuboids as a straightforward yet highly effective alternative for modeling objects. This allows for compact scene generation while minimizing object intersections. Our approach, coined CasaGPT for Cuboid Arrangement and Scene Assembly, employs an autoregressive model to sequentially arrange cuboids, producing physically plausible scenes. By applying rejection sampling during the fine-tuning stage to filter out scenes with object collisions, our model further reduces intersections and enhances scene quality. Additionally, we introduce a refined dataset, 3DFRONT-NC, which eliminates significant noise present in the original dataset, 3D-FRONT. Extensive experiments on the 3D-FRONT dataset as well as our dataset demonstrate that our approach consistently outperforms the state-of-the-art methods, enhancing the realism of generated scenes, and providing a promising direction for 3D scene synthesis.

Submitted 28 April, 2025; originally announced April 2025.

arXiv:2504.18842 [pdf]

A Microgravity Simulation Experimental Platform For Small Space Robots In Orbit

Authors: Hang Luo, Nanlin Zhou, Haoxiang Zhang, Kai Han, Ning Zhao, Zhiyuan Yang, Jian Qi, Sikai Zhao, Jie Zhao, Yanhe Zhu

Abstract: This study describes the development and validation of a novel microgravity experimental platform that is mainly applied to small robots such as modular self-reconfigurable robots. This platform mainly consists of an air supply system, a microporous platform and glass. By supplying air to the microporous platform to form an air film, the influence of the weight of the air foot and the ventilation hose of traditional air-float platforms on microgravity experiments is solved. The contribution of this work is to provide a platform with less external interference for microgravity simulation experiments on small robots.

Submitted 26 April, 2025; originally announced April 2025.

arXiv:2504.18782 [pdf, other]

CAMeL: Cross-modality Adaptive Meta-Learning for Text-based Person Retrieval

Authors: Hang Yu, Jiahao Wen, Zhedong Zheng

Abstract: Text-based person retrieval aims to identify specific individuals within an image database using textual descriptions. Due to the high cost of annotation and privacy protection, researchers resort to synthesized data for the paradigm of pretraining and fine-tuning. However, these generated data often exhibit domain biases in both images and textual annotations, which largely compromise the scalability of the pre-trained model. Therefore, we introduce a domain-agnostic pretraining framework based on Cross-modality Adaptive Meta-Learning (CAMeL) to enhance the model generalization capability during pretraining to facilitate the subsequent downstream tasks. In particular, we develop a series of tasks that reflect the diversity and complexity of real-world scenarios, and introduce a dynamic error sample memory unit to memorize the history for errors encountered within multiple tasks. To further ensure multi-task adaptation, we also adopt an adaptive dual-speed update strategy, balancing fast adaptation to new tasks and slow weight updates for historical tasks. Albeit simple, our proposed model not only surpasses existing state-of-the-art methods on real-world benchmarks, including CUHK-PEDES, ICFG-PEDES, and RSTPReid, but also showcases robustness and scalability in handling biased synthetic images and noisy text annotations. Our code is available at https://github.com/Jahawn-Wen/CAMeL-reID.

Submitted 25 April, 2025; originally announced April 2025.

arXiv:2504.18391 [pdf, other]

Fast Autoregressive Models for Continuous Latent Generation

Authors: Tiankai Hang, Jianmin Bao, Fangyun Wei, Dong Chen

Abstract: Autoregressive models have demonstrated remarkable success in sequential data generation, particularly in NLP, but their extension to continuous-domain image generation presents significant challenges. Recent work, the masked autoregressive model (MAR), bypasses quantization by modeling per-token distributions in continuous spaces using a diffusion head but suffers from slow inference due to the high computational cost of the iterative denoising process. To address this, we propose the Fast AutoRegressive model (FAR), a novel framework that replaces MAR's diffusion head with a lightweight shortcut head, enabling efficient few-step sampling while preserving autoregressive principles. Additionally, FAR seamlessly integrates with causal Transformers, extending them from discrete to continuous token generation without requiring architectural modifications. Experiments demonstrate that FAR achieves $2.3\times$ faster inference than MAR while maintaining competitive FID and IS scores. This work establishes the first efficient autoregressive paradigm for high-fidelity continuous-space image generation, bridging the critical gap between quality and scalability in visual autoregressive modeling.

Submitted 24 April, 2025; originally announced April 2025.

arXiv:2504.18005 [pdf, ps, other]

The equivalence between Einstein and Jordan frames: a study based on the inflationary magnetogenesis model

Authors: Hang Wang, Shuang Liu, Yu Li, Yao-chuan Wang

Abstract: The equivalence of the Jordan and Einstein frames has been a subject of considerable interest in the field. In this paper, within the context of $f(R)$ gravity, we explore the inflationary magnetogenesis model, focusing on the magnetic field energy density and its spectrum in both the Jordan and Einstein frames to elucidate the equivalence between these two reference frames. Our analysis reveals that during the inflationary epoch, while the magnetic field exhibits a scale-invariant spectrum in the Einstein frame, it demonstrates a blue spectrum in the Jordan frame. Additionally, we investigate the post-inflationary evolution of the magnetic field's energy density in both frames, uncovering that for scale-invariant spectra in the Einstein frame during inflation, the magnetic field transitions to a blue spectrum, whereas in the Jordan frame, it evolves into a red spectrum. We also establish the conditions under which both frames may exhibit scale-invariant spectra simultaneously during the inflationary period.

Submitted 24 April, 2025; originally announced April 2025.

Comments: 15 pages, no figure

Showing 101–150 of 3,624 results for author: Hang