-
CiteEval: Principle-Driven Citation Evaluation for Source Attribution
Authors:
Yumo Xu,
Peng Qi,
Jifan Chen,
Kunlun Liu,
Rujun Han,
Lan Liu,
Bonan Min,
Vittorio Castelli,
Arshit Gupta,
Zhiguo Wang
Abstract:
Citation quality is crucial in information-seeking systems, directly influencing trust and the effectiveness of information access. Current evaluation frameworks, both human and automatic, mainly rely on Natural Language Inference (NLI) to assess binary or ternary supportiveness from cited sources, which we argue is a suboptimal proxy for citation evaluation. In this work we introduce CiteEval, a citation evaluation framework driven by principles focusing on fine-grained citation assessment within a broad context, encompassing not only the cited sources but the full retrieval context, user query, and generated text. Guided by the proposed framework, we construct CiteBench, a multi-domain benchmark with high-quality human annotations on citation quality. To enable efficient evaluation, we further develop CiteEval-Auto, a suite of model-based metrics that exhibit strong correlation with human judgments. Experiments across diverse systems demonstrate CiteEval-Auto's superior ability to capture the multifaceted nature of citations compared to existing metrics, offering a principled and scalable approach to evaluate and improve model-generated citations.
Submitted 2 June, 2025;
originally announced June 2025.
-
STORM: Benchmarking Visual Rating of MLLMs with a Comprehensive Ordinal Regression Dataset
Authors:
Jinhong Wang,
Shuo Tong,
Jian liu,
Dongqi Tang,
Jintai Chen,
Haochao Ying,
Hongxia Xu,
Danny Chen,
Jian Wu
Abstract:
Visual rating is an essential capability of artificial intelligence (AI) for multi-dimensional quantification of visual content, primarily applied in ordinal regression (OR) tasks such as image quality assessment, facial age estimation, and medical image grading. However, current multi-modal large language models (MLLMs) underperform in visual rating and also suffer from a lack of relevant datasets and benchmarks. In this work, we collect and present STORM, a data collection and benchmark for Stimulating Trustworthy Ordinal Regression Ability of MLLMs for universal visual rating. STORM encompasses 14 ordinal regression datasets across five common visual rating domains, comprising 655K image-level pairs and the corresponding carefully curated VQAs. Importantly, we also propose a coarse-to-fine processing pipeline that dynamically considers label candidates and provides interpretable thoughts, providing MLLMs with a general and trustworthy ordinal thinking paradigm. This benchmark aims to evaluate the all-in-one and zero-shot performance of MLLMs in scenarios requiring understanding of the essential common ordinal relationships of rating labels. Extensive experiments demonstrate the effectiveness of our framework and shed light on better fine-tuning strategies. The STORM dataset, benchmark, code, and pre-trained models are released on the project page to support further research in this area: https://storm-bench.github.io/.
Submitted 2 June, 2025;
originally announced June 2025.
-
MLA-Trust: Benchmarking Trustworthiness of Multimodal LLM Agents in GUI Environments
Authors:
Xiao Yang,
Jiawei Chen,
Jun Luo,
Zhengwei Fang,
Yinpeng Dong,
Hang Su,
Jun Zhu
Abstract:
The emergence of multimodal LLM-based agents (MLAs) has transformed interaction paradigms by seamlessly integrating vision, language, action and dynamic environments, enabling unprecedented autonomous capabilities across GUI applications ranging from web automation to mobile systems. However, MLAs introduce critical trustworthiness challenges that extend far beyond traditional language models' limitations, as they can directly modify digital states and trigger irreversible real-world consequences. Existing benchmarks inadequately tackle these unique challenges posed by MLAs' actionable outputs, long-horizon uncertainty and multimodal attack vectors. In this paper, we introduce MLA-Trust, the first comprehensive and unified framework that evaluates the MLA trustworthiness across four principled dimensions: truthfulness, controllability, safety and privacy. We utilize websites and mobile applications as realistic testbeds, designing 34 high-risk interactive tasks and curating rich evaluation datasets. Large-scale experiments involving 13 state-of-the-art agents reveal previously unexplored trustworthiness vulnerabilities unique to multimodal interactive scenarios. For instance, proprietary and open-source GUI-interacting MLAs pose more severe trustworthiness risks than static MLLMs, particularly in high-stakes domains; the transition from static MLLMs into interactive MLAs considerably compromises trustworthiness, enabling harmful content generation in multi-step interactions that standalone MLLMs would typically prevent; multi-step execution, while enhancing the adaptability of MLAs, involves latent nonlinear risk accumulation across successive interactions, circumventing existing safeguards and resulting in unpredictable derived risks. Moreover, we present an extensible toolbox to facilitate continuous evaluation of MLA trustworthiness across diverse interactive environments.
Submitted 2 June, 2025;
originally announced June 2025.
-
hqQUBO: A Hybrid-querying Quantum Optimization Model Validated with 16-qubits on an Ion Trap Quantum Computer for Life Science Applications
Authors:
Rong Chen,
Quan-Xin Mei,
Wen-Ding Zhao,
Lin Yao,
Hao-Xiang Yang,
Shun-Yao Zhang,
Jiao Chen,
Hong-Lin Li
Abstract:
AlphaFold has achieved groundbreaking advancements in protein structure prediction, exerting profound influence across biology, medicine, and drug discovery. However, its reliance on multiple sequence alignment (MSA) is inherently time-consuming due to the NP-hard nature of constructing MSAs. Quantum computing emerges as a promising alternative to classical computing, offering the potential for exponential speedup and improved accuracy on such complex optimization challenges. This work bridges the gap between quantum computing and the MSA task: we compare classical and quantum computational scaling as the number of qubits increases, and assess the role of quantum entanglement in model performance. Furthermore, we propose a hybrid query encoding approach, hqQUBO, that avoids redundancy and thereby significantly reduces the required quantum resources to a scaling of $\mathcal{O}(NL)$. Additionally, VQE is coupled with the quenched CVaR scheme to enhance robustness and convergence. The integration of multiple strategies facilitates the robust deployment of the quantum algorithm from idealized simulators (on CPU and GPU) to real-world, noisy quantum devices (HYQ-A37). To the best of our knowledge, our work represents the largest-scale implementation of digital simulation, using up to 16 qubits on a trapped-ion quantum computer, for a life science problem, and it achieves state-of-the-art performance in both simulation and experimental results. Our work paves the way towards large-scale simulations of life science tasks on real quantum processors.
Submitted 2 June, 2025;
originally announced June 2025.
-
Learning of Population Dynamics: Inverse Optimization Meets JKO Scheme
Authors:
Mikhail Persiianov,
Jiawei Chen,
Petr Mokrov,
Alexander Tyurin,
Evgeny Burnaev,
Alexander Korotin
Abstract:
Learning population dynamics involves recovering the underlying process that governs particle evolution, given evolutionary snapshots of samples at discrete time points. Recent methods frame this as an energy minimization problem in probability space and leverage the celebrated JKO scheme for efficient time discretization. In this work, we introduce $\texttt{iJKOnet}$, an approach that combines the JKO framework with inverse optimization techniques to learn population dynamics. Our method relies on a conventional $\textit{end-to-end}$ adversarial training procedure and does not require restrictive architectural choices, e.g., input-convex neural networks. We establish theoretical guarantees for our methodology and demonstrate improved performance over prior JKO-based methods.
Submitted 2 June, 2025;
originally announced June 2025.
-
Non-conformality of large deviations of moving average of the random walk in strongly mixing environment
Authors:
Jiaming Chen
Abstract:
The quenched and annealed large deviations of the random walk in random environment are shown to conform on any compact set whenever the level of disorder is sufficiently low. In this work, we show that these two large deviations always disagree at some interior point of the natural domain of the random walk in strongly mixing environment, regardless of the level of disorder.
Submitted 2 June, 2025;
originally announced June 2025.
-
DeepVerse: 4D Autoregressive Video Generation as a World Model
Authors:
Junyi Chen,
Haoyi Zhu,
Xianglong He,
Yifan Wang,
Jianjun Zhou,
Wenzheng Chang,
Yang Zhou,
Zizun Li,
Zhoujie Fu,
Jiangmiao Pang,
Tong He
Abstract:
World models serve as essential building blocks toward Artificial General Intelligence (AGI), enabling intelligent agents to predict future states and plan actions by simulating complex physical interactions. However, existing interactive models primarily predict visual observations, thereby neglecting crucial hidden states like geometric structures and spatial coherence. This leads to rapid error accumulation and temporal inconsistency. To address these limitations, we introduce DeepVerse, a novel 4D interactive world model explicitly incorporating geometric predictions from previous timesteps into current predictions conditioned on actions. Experiments demonstrate that by incorporating explicit geometric constraints, DeepVerse captures richer spatio-temporal relationships and underlying physical dynamics. This capability significantly reduces drift and enhances temporal consistency, enabling the model to reliably generate extended future sequences and achieve substantial improvements in prediction accuracy, visual realism, and scene rationality. Furthermore, our method provides an effective solution for geometry-aware memory retrieval that preserves long-term spatial consistency. We validate the effectiveness of DeepVerse across diverse scenarios, establishing its capacity for high-fidelity, long-horizon predictions grounded in geometry-aware dynamics.
Submitted 1 June, 2025;
originally announced June 2025.
-
Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective
Authors:
Lei Lei,
Jie Gu,
Xiaokang Ma,
Chu Tang,
Jingmin Chen,
Tong Xu
Abstract:
Existing Multimodal Large Language Models (MLLMs) process a large number of visual tokens, leading to significant computational costs and inefficiency. Previous works generally assume that all visual tokens are necessary in the shallow layers of LLMs, and therefore token compression typically occurs in intermediate layers. In contrast, our study reveals an interesting insight: with proper selection, token compression is feasible at the input stage of the LLM with negligible performance loss. Specifically, we reveal that explainability methods can effectively evaluate the importance of each visual token with respect to the given instruction, which can in turn guide token compression. Furthermore, we propose to learn a mapping from the attention map of the first LLM layer to the explanation results, thereby avoiding the need for a full inference pass and facilitating practical deployment. Interestingly, this mapping can be learned using a simple and lightweight convolutional network, whose training is efficient and independent of MLLMs. Extensive experiments on 10 image and video benchmarks across three leading MLLMs (Qwen2-VL, LLaVA-OneVision, and VILA1.5) demonstrate the effectiveness of our approach, e.g., pruning 50% of visual tokens while retaining more than 96% of the original performance across all benchmarks for all three MLLMs. It also exhibits strong generalization, even when the number of tokens in inference far exceeds that used in training.
Submitted 1 June, 2025;
originally announced June 2025.
-
TRiMM: Transformer-Based Rich Motion Matching for Real-Time multi-modal Interaction in Digital Humans
Authors:
Yueqian Guo,
Tianzhao Li,
Xin Lyu,
Jiehaolin Chen,
Zhaohan Wang,
Sirui Xiao,
Yurun Chen,
Yezi He,
Helin Li,
Fan Zhang
Abstract:
Large Language Model (LLM)-driven digital humans have sparked a series of recent studies on co-speech gesture generation systems. However, existing approaches struggle with real-time synthesis and long-text comprehension. This paper introduces Transformer-Based Rich Motion Matching (TRiMM), a novel multi-modal framework for real-time 3D gesture generation. Our method incorporates three modules: 1) a cross-modal attention mechanism to achieve precise temporal alignment between speech and gestures; 2) a long-context autoregressive model with a sliding window mechanism for effective sequence modeling; 3) a large-scale gesture matching system that constructs an atomic action library and enables real-time retrieval. Additionally, we develop a lightweight pipeline implemented in the Unreal Engine for experimentation. Our approach achieves real-time inference at 120 fps and maintains a per-sentence latency of 0.15 seconds on consumer-grade GPUs (GeForce RTX 3060). Extensive subjective and objective evaluations on the ZEGGS and BEAT datasets demonstrate that our model outperforms current state-of-the-art methods. TRiMM enhances the speed of co-speech gesture generation while ensuring gesture quality, enabling LLM-driven digital humans to respond to speech in real time and synthesize corresponding gestures. Our code is available at https://github.com/teroon/TRiMM-Transformer-Based-Rich-Motion-Matching
Submitted 1 June, 2025;
originally announced June 2025.
-
Fighting Fire with Fire (F3): A Training-free and Efficient Visual Adversarial Example Purification Method in LVLMs
Authors:
Yudong Zhang,
Ruobing Xie,
Yiqing Huang,
Jiansheng Chen,
Xingwu Sun,
Zhanhui Kang,
Di Wang,
Yu Wang
Abstract:
Recent advances in large vision-language models (LVLMs) have showcased their remarkable capabilities across a wide range of multimodal vision-language tasks. However, these models remain vulnerable to visual adversarial attacks, which can substantially compromise their performance. Despite their potential impact, the development of effective methods for purifying such adversarial examples has received relatively limited attention. In this paper, we introduce F3, a novel adversarial purification framework that employs a counterintuitive "fighting fire with fire" strategy: intentionally introducing simple perturbations to adversarial examples to mitigate their harmful effects. Specifically, F3 leverages cross-modal attentions derived from randomly perturbed adversarial examples as reference targets. By injecting noise into these adversarial examples, F3 effectively refines their attention, resulting in cleaner and more reliable model outputs. Remarkably, this seemingly paradoxical approach of employing noise to counteract adversarial attacks yields impressive purification results. Furthermore, F3 offers several distinct advantages: it is training-free, straightforward to implement, and significantly more computationally efficient than existing purification methods. These attributes render F3 particularly suitable for large-scale industrial applications where both robust performance and operational efficiency are critical priorities. The code will be made publicly available.
Submitted 10 June, 2025; v1 submitted 1 June, 2025;
originally announced June 2025.
-
PseudoVC: Improving One-shot Voice Conversion with Pseudo Paired Data
Authors:
Songjun Cao,
Qinghua Wu,
Jie Chen,
Jin Li,
Long Ma
Abstract:
As parallel training data is scarce for one-shot voice conversion (VC) tasks, waveform reconstruction is typically performed by various VC systems. A typical one-shot VC system comprises a content encoder and a speaker encoder. However, two types of mismatches arise: one for the inputs to the content encoder during training and inference, and another for the inputs to the speaker encoder. To address these mismatches, we propose a novel VC training method called \textit{PseudoVC}. First, we introduce an information perturbation approach named \textit{Pseudo Conversion} to tackle the first mismatch problem. This approach leverages pretrained VC models to convert the source utterance into a perturbed utterance, which is fed into the content encoder during training. Second, we propose an approach termed \textit{Speaker Sampling} to resolve the second mismatch problem, which replaces the input to the speaker encoder with another utterance from the same speaker during training. Experimental results demonstrate that our proposed \textit{Pseudo Conversion} outperforms previous information perturbation methods, and the overall \textit{PseudoVC} method surpasses publicly available VC models. Audio examples are available.
Submitted 1 June, 2025;
originally announced June 2025.
-
Electrically tunable quantum interference of atomic spins on surfaces
Authors:
Hao Wang,
Jing Chen,
Peng Fan,
Yelko del Castillo,
Alejandro Ferrón,
Lili Jiang,
Zilong Wu,
Shijie Li,
Hong-Jun Gao,
Heng Fan,
Joaquín Fernández-Rossier,
Kai Yang
Abstract:
Controlling quantum interference near avoided energy-level crossings is crucial for fast and reliable coherent manipulation in quantum information processing. However, achieving tunable quantum interference in atomically-precise engineered structures remains challenging. Here, we demonstrate electrical control of quantum interference using atomic spins on an insulating film in a scanning tunneling microscope. Using bias voltages applied across the tunnel junction, we modulate the atomically-confined magnetic interaction between the probe tip and surface atoms with a strong electric field, and drive the spin state rapidly through the energy-level anticrossing. This all-electrical manipulation allows us to achieve Landau-Zener-Stückelberg-Majorana (LZSM) interferometry on both single spins and pairs of interacting spins. The LZSM pattern exhibits multiphoton resonances, and its asymmetry suggests that the spin dynamics is influenced by spin-transfer torque of tunneling electrons. Multi-level LZSM spectra measured on coupled spins with tunable interactions show distinct interference patterns depending on their many-body energy landscapes. These results open new avenues for all-electrical quantum manipulation in spin-based quantum processors in the strongly driven regime.
Submitted 1 June, 2025;
originally announced June 2025.
-
AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting
Authors:
Yuyuan Liu,
Yuanhong Chen,
Chong Wang,
Junlin Han,
Junde Wu,
Can Peng,
Jingkun Chen,
Yu Tian,
Gustavo Carneiro
Abstract:
Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches mainly follow two directions: (1) injecting adapters into the image encoder to receive audio signals, which incurs efficiency costs during prompt engineering, and (2) leveraging additional foundation models to generate visual prompts for the sounding objects, which are often imprecisely localised, leading to misguidance in SAM2. Moreover, these methods overlook the rich semantic interplay between hierarchical visual features and other modalities, resulting in suboptimal cross-modal fusion. In this work, we propose AuralSAM2, comprising the novel AuralFuser module, which externally attaches to SAM2 to integrate features from different modalities and generate feature-level prompts, guiding SAM2's decoder in segmenting sounding targets. Such integration is facilitated by a feature pyramid, further refining semantic understanding and enhancing object awareness in multimodal scenarios. Additionally, audio-guided contrastive learning is introduced to explicitly align audio and visual representations and to mitigate biases caused by dominant visual patterns. Results on public benchmarks show that our approach achieves remarkable improvements over previous methods in the field. Code is available at https://github.com/yyliu01/AuralSAM2.
Submitted 1 June, 2025;
originally announced June 2025.
-
Stability and rigidity results of space-like hypersurface in the Minkowski space
Authors:
Jianhua Chen,
Haiyun Deng,
Haiqin Xie,
Jiabin Yin
Abstract:
In this paper, we establish some rigidity theorems for space-like hypersurfaces in Minkowski space by using a Weinberger-type approach with P-functions and integral identities. Firstly, for space-like hypersurfaces $M$ represented as graphs $x_{n+1}=u(x)$ over a domain $\Omega\subset\mathbb R^n$, if the higher-order mean curvature ratio $\frac{H_{k}}{H_l}$ ($l<k$) is constant and the boundary $\partial M$ lies on a hyperplane intersecting with constant angles, then the hypersurface must be part of a hyperboloid. Secondly, for convex space-like hypersurfaces with boundaries on a hyperboloid or light cone, if the higher-order mean curvature ratio $\frac{H_{k}}{H_l}$ ($l<k$) is constant and the angle function between the normal vectors of the hypersurface and the hyperboloid (or the light cone) on the boundary is constant, then such hypersurfaces must be part of a hyperboloid. These results significantly extend Gao's previous work presented in \cite{Gao1,Gao2}. Furthermore, we derive two fundamental integral identities for constant mean curvature (CMC) graphical hypersurfaces $x_{n+1}=u(x)$, $x\in\Omega\subset\mathbb R^n$, whose boundary lies on a hyperplane. As applications, we obtain complete equivalence conditions for hyperboloid identification through curvature properties. We also establish a geometric stability estimate demonstrating that the square norm of the trace-free second fundamental form $\bar h$ of $M$ is quantitatively controlled by geometric quantities of $\partial\Omega$, as expressed by the inequality: $$ ||\bar h||_{L^2(\Omega)}\leq C(n,K)||H_{\partial\Omega}-H_0||_{L^1(\partial\Omega)}^{1/2}. $$ Here, $H_{\partial\Omega}$ is the mean curvature of $\partial\Omega$, $H_0$ is some reference constant and $C$ is a constant. Finally, analogous estimates are established.
Submitted 1 June, 2025;
originally announced June 2025.
-
$\lambda$ and $\rho$ Regge trajectories for the pentaquark $P_{cc\bar{c}bb}$ in the diquark-triquark picture
Authors:
He Song,
Xin-Ru Liu,
Jia-Qi Xie,
Jiao-Kai Chen
Abstract:
We propose the Regge trajectory relations for the fully heavy pentaquark $P_{cc\bar{c}bb}$ utilizing both diquark and triquark Regge trajectory relations. Using these new relations, we discuss four series of Regge trajectories: the $\rho_1$-, $\rho_2$-, $\lambda_1$-, and $\lambda_2$-trajectories. We provide rough estimates for the masses of the $\rho_1$-, $\rho_2$-, $\lambda_1$-, and $\lambda_2$-excited states. Except for the $\lambda_1$-trajectories, the complete forms of the other three series of Regge trajectories for the pentaquark $P_{cc\bar{c}bb}$ are lengthy and cumbersome. We show that the $\rho_1$-, $\rho_2$-, and $\lambda_2$-trajectories cannot be obtained by simply imitating the meson Regge trajectories because mesons have no substructures. To derive these trajectories, the pentaquark's structure and substructure should be taken into consideration. Otherwise, the $\rho_1$-, $\rho_2$-, and $\lambda_2$-trajectories must rely solely on fitting existing theoretical or future experimental data. The fundamental relationship between the slopes of the obtained trajectories and the constituents' masses and string tension would remain unclear, and the predictive power of the Regge trajectories would be compromised. Moreover, we show that the lengthy complete forms of the $\rho_1$-, $\rho_2$-, and $\lambda_2$-trajectories can be well approximated by simple fitted formulas. The four series of Regge trajectories for the pentaquark $P_{cc\bar{c}bb}$ all exhibit a behavior of $M{\sim}x^{2/3}$, where $x=n_{r_1},n_{r_2},l_1,l_2,N_{r_1},N_{r_2},L_1,L_2$. All four series of trajectories exhibit concave downward behavior in the $(M^2,\,x)$ plane.
Submitted 1 June, 2025;
originally announced June 2025.
-
Aligning VLM Assistants with Personalized Situated Cognition
Authors:
Yongqi Li,
Shen Zhou,
Xiaohu Li,
Xin Miao,
Jintao Wen,
Mayi Xu,
Jianhao Chen,
Birong Pan,
Hankun Kang,
Yuanyuan Zhu,
Ming Zhong,
Tieyun Qian
Abstract:
Vision-language models (VLMs) aligned with general human objectives, such as being harmless and hallucination-free, have become valuable assistants for humans in managing visual tasks. However, people with diverse backgrounds have different cognition even in the same situation. Consequently, they may have personalized expectations for VLM assistants. This highlights the urgent need to align VLM assistants with personalized situated cognition for real-world assistance. To study this problem, we first simplify it by characterizing individuals based on the sociological concept of Role-Set. Then, we propose to evaluate the individuals' actions to examine whether personalized alignment is achieved. Further, we construct a benchmark named PCogAlignBench, which includes 18k instances and 20 individuals with different Role-Sets. Finally, we present a framework called PCogAlign, which constructs a cognition-aware and action-based reward model for personalized alignment. Experimental results and human evaluations demonstrate the reliability of PCogAlignBench and the effectiveness of our proposed PCogAlign. We will open-source the constructed benchmark and code at https://github.com/NLPGM/PCogAlign.
Submitted 1 June, 2025;
originally announced June 2025.
-
Perspectives for hyperon and hypernuclei physics
Authors:
Jin-Hui Chen,
Li-Sheng Geng,
Emiko Hiyama,
Zhi-Wei Liu,
Josef Pochodzalla
Abstract:
Hypernuclei, nuclei containing one or more hyperons, serve as unique laboratories for probing non-perturbative quantum chromodynamics (QCD). Recent progress in hypernuclear physics, driven by advanced experimental techniques and theoretical innovations, is briefly reviewed with a focus on key findings and unresolved challenges, such as the precise determination of the hypertriton binding energy, investigations of charge symmetry breaking in mirror hypernuclei, and the search for exotic systems, including the neutral nn$Λ$ state. Experimental breakthroughs, including invariant-mass analyses and femtoscopy studies in heavy-ion collisions, as well as high-resolution $γ$-spectroscopy, have enabled precise studies of light hypernuclei and offered critical insights into the hyperon-nucleon interaction. Theoretical progress, including ab initio calculations based on chiral effective field theory and lattice QCD, has further enhanced our understanding of hyperon-nucleon and hyperon-hyperon interactions.
Submitted 1 June, 2025;
originally announced June 2025.
-
L3A: Label-Augmented Analytic Adaptation for Multi-Label Class Incremental Learning
Authors:
Xiang Zhang,
Run He,
Jiao Chen,
Di Fang,
Ming Li,
Ziqian Zeng,
Cen Chen,
Huiping Zhuang
Abstract:
Class-incremental learning (CIL) enables models to learn new classes continually without forgetting previously acquired knowledge. Multi-label CIL (MLCIL) extends CIL to a real-world scenario where each sample may belong to multiple classes, introducing several challenges: label absence, which leads to incomplete historical information due to missing labels, and class imbalance, which results in model bias toward majority classes. To address these challenges, we propose Label-Augmented Analytic Adaptation (L3A), an exemplar-free approach that does not store past samples. L3A integrates two key modules. The pseudo-label (PL) module implements label augmentation by generating pseudo-labels for current phase samples, addressing the label absence problem. The weighted analytic classifier (WAC) derives a closed-form solution for neural networks. It introduces sample-specific weights to adaptively balance the class contribution and mitigate class imbalance. Experiments on MS-COCO and PASCAL VOC datasets demonstrate that L3A outperforms existing methods in MLCIL tasks. Our code is available at https://github.com/scut-zx/L3A.
Submitted 31 May, 2025;
originally announced June 2025.
-
Video Signature: In-generation Watermarking for Latent Video Diffusion Models
Authors:
Yu Huang,
Junhao Chen,
Qi Zheng,
Hanqian Li,
Shuliang Liu,
Xuming Hu
Abstract:
The rapid development of Artificial Intelligence Generated Content (AIGC) has led to significant progress in video generation but also raises serious concerns about intellectual property protection and reliable content tracing. Watermarking is a widely adopted solution to this issue, but existing methods for video generation mainly follow a post-generation paradigm, which introduces additional computational overhead and often fails to effectively balance the trade-off between video quality and watermark extraction. To address these issues, we propose Video Signature (VIDSIG), an in-generation watermarking method for latent video diffusion models, which enables implicit and adaptive watermark integration during generation. Specifically, we achieve this by partially fine-tuning the latent decoder, where Perturbation-Aware Suppression (PAS) pre-identifies and freezes perceptually sensitive layers to preserve visual quality. Beyond spatial fidelity, we further enhance temporal consistency by introducing a lightweight Temporal Alignment module that guides the decoder to generate coherent frame sequences during fine-tuning. Experimental results show that VIDSIG achieves the best overall performance in watermark extraction, visual quality, and generation efficiency. It also demonstrates strong robustness against both spatial and temporal tampering, highlighting its practicality in real-world scenarios.
Submitted 31 May, 2025;
originally announced June 2025.
-
ARIA: Training Language Agents with Intention-Driven Reward Aggregation
Authors:
Ruihan Yang,
Yikai Zhang,
Aili Chen,
Xintao Wang,
Siyu Yuan,
Jiangjie Chen,
Deqing Yang,
Yanghua Xiao
Abstract:
Large language models (LLMs) have enabled agents to perform complex reasoning and decision-making through free-form language interactions. However, in open-ended language action environments (e.g., negotiation or question-asking games), the action space can be formulated as a joint distribution over tokens, resulting in an exponentially large action space. Sampling actions in such a space can lead to extreme reward sparsity, which brings large reward variance, hindering effective reinforcement learning (RL). To address this, we propose ARIA, a method that Aggregates Rewards in Intention space to enable efficient and effective language Agents training. ARIA aims to project natural language actions from the high-dimensional joint token distribution space into a low-dimensional intention space, where semantically similar actions are clustered and assigned shared rewards. This intention-aware reward aggregation reduces reward variance by densifying reward signals, fostering better policy optimization. Extensive experiments demonstrate that ARIA not only significantly reduces policy gradient variance, but also delivers substantial performance gains of an average of 9.95% across four downstream tasks, consistently outperforming offline and online RL baselines.
Submitted 4 June, 2025; v1 submitted 31 May, 2025;
originally announced June 2025.
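The ARIA abstract above describes clustering semantically similar actions and assigning them shared rewards. A minimal sketch of that aggregation step follows, under stated assumptions: the cluster ids are taken as given (the abstract does not say how the intention space is built), the shared reward is assumed to be the cluster mean, and the name `aggregate_rewards` is hypothetical rather than taken from the paper.

```python
from collections import defaultdict
from statistics import pvariance


def aggregate_rewards(cluster_ids, rewards):
    """Replace each action's reward with the mean reward of its cluster.

    Assumption: mean aggregation. The abstract only says that clustered
    actions are "assigned shared rewards".
    """
    if len(cluster_ids) != len(rewards):
        raise ValueError("cluster_ids and rewards must have the same length")

    totals = defaultdict(float)
    counts = defaultdict(int)
    for cluster, reward in zip(cluster_ids, rewards):
        totals[cluster] += reward
        counts[cluster] += 1

    # Every action in a cluster receives the same (shared) reward.
    return [totals[c] / counts[c] for c in cluster_ids]


# Hypothetical toy data: five sampled actions falling into two intention clusters.
raw = [1.0, 0.0, 0.0, 0.0, 3.0]
shared = aggregate_rewards([0, 0, 1, 1, 1], raw)

# Sparse rewards are spread over each cluster, so the variance shrinks here.
print(shared, pvariance(raw), pvariance(shared))
```

On this toy input the shared rewards have a lower variance than the raw ones, which illustrates the densification effect the abstract refers to; it is not a reproduction of the paper's results.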
-
The Security Threat of Compressed Projectors in Large Vision-Language Models
Authors:
Yudong Zhang,
Ruobing Xie,
Xingwu Sun,
Jiansheng Chen,
Zhanhui Kang,
Di Wang,
Yu Wang
Abstract:
The choice of a suitable visual language projector (VLP) is critical to the successful training of large visual language models (LVLMs). Mainstream VLPs can be broadly categorized into compressed and uncompressed projectors, each offering distinct advantages in performance and computational efficiency. However, their security implications have not been thoroughly examined. Our comprehensive evaluation reveals significant differences in their security profiles: compressed projectors exhibit substantial vulnerabilities, allowing adversaries to successfully compromise LVLMs even with minimal knowledge of structural information. In stark contrast, uncompressed projectors demonstrate robust security properties and do not introduce additional vulnerabilities. These findings provide critical guidance for researchers in selecting optimal VLPs that enhance the security and reliability of visual language models. The code will be released.
Submitted 31 May, 2025;
originally announced June 2025.
-
Regionalized Metric Framework: A Novel Approach for Evaluating Multimodal Multi-Objective Optimization Algorithms
Authors:
Jintai Chen,
Fangqing Liu,
Xueming Yan,
Han Huang
Abstract:
This study aims to improve the evaluation metric for multimodal multi-objective optimization problems using a Regionalized Metric Framework, which provides a certain boost to research in this field. Existing evaluation metrics usually use the reference set as the evaluation basis, which inevitably leads to reference set dependence. To address this problem, this study proposes an evaluation metric based on a Regionalized Metric Framework. The algorithm divides the set of solutions to be evaluated into three regions and evaluates each solution according to a scoring function unique to its region; these scores are combined to form the evaluation value of the solution set. To verify the feasibility of this method, a comparative experiment was conducted. The experimental results roughly follow the trend of existing indicators, and at the same time the metric can accurately judge the advantages and disadvantages of points equidistant from the reference set. Our method provides a new perspective for further research on evaluation metrics for multimodal multi-objective optimization algorithms.
Submitted 31 May, 2025;
originally announced June 2025.
-
MagiCodec: Simple Masked Gaussian-Injected Codec for High-Fidelity Reconstruction and Generation
Authors:
Yakun Song,
Jiawei Chen,
Xiaobin Zhuang,
Chenpeng Du,
Ziyang Ma,
Jian Wu,
Jian Cong,
Dongya Jia,
Zhuo Chen,
Yuping Wang,
Yuxuan Wang,
Xie Chen
Abstract:
Neural audio codecs have made significant strides in efficiently mapping raw audio waveforms into discrete token representations, which are foundational for contemporary audio generative models. However, most existing codecs are optimized primarily for reconstruction quality, often at the expense of the downstream modelability of the encoded tokens. Motivated by the need to overcome this bottleneck, we introduce MagiCodec, a novel single-layer, streaming Transformer-based audio codec. MagiCodec is designed with a multistage training pipeline that incorporates Gaussian noise injection and latent regularization, explicitly targeting the enhancement of semantic expressiveness in the generated codes while preserving high reconstruction fidelity. We analytically derive the effect of noise injection in the frequency domain, demonstrating its efficacy in attenuating high-frequency components and fostering robust tokenization. Extensive experimental evaluations show that MagiCodec surpasses state-of-the-art codecs in both reconstruction quality and downstream tasks. Notably, the tokens produced by MagiCodec exhibit Zipf-like distributions, as observed in natural languages, thereby improving compatibility with language-model-based generative architectures. The code and pre-trained models are available at https://github.com/Ereboas/MagiCodec.
Submitted 31 May, 2025;
originally announced June 2025.
-
Exploring the Performance of Perforated Backpropagation through Further Experiments
Authors:
Rorry Brenner,
Evan Davis,
Rushi Chaudhari,
Rowan Morse,
Jingyao Chen,
Xirui Liu,
Zhaoyi You,
Laurent Itti
Abstract:
Perforated Backpropagation is a neural network optimization technique based on a modern understanding of the computational importance of dendrites within biological neurons. This paper explores further experiments from the original publication, generated from a hackathon held at the Carnegie Mellon Swartz Center in February 2025. Students and local Pittsburgh ML practitioners were brought together to experiment with the Perforated Backpropagation algorithm on the datasets and models that they were using for their projects. Results showed that the system could enhance their projects, with up to 90% model compression without negative impact on accuracy, or up to 16% increased accuracy of their original models.
Submitted 30 May, 2025;
originally announced June 2025.
-
Heterogeneous Graph Backdoor Attack
Authors:
Jiawei Chen,
Lusi Li,
Daniel Takabi,
Masha Sosonkina,
Rui Ning
Abstract:
Heterogeneous Graph Neural Networks (HGNNs) excel in modeling complex, multi-typed relationships across diverse domains, yet their vulnerability to backdoor attacks remains unexplored. To address this gap, we conduct the first investigation into the susceptibility of HGNNs to existing graph backdoor attacks, revealing three critical issues: (1) high attack budget required for effective backdoor injection, (2) inefficient and unreliable backdoor activation, and (3) inaccurate attack effectiveness evaluation. To tackle these issues, we propose the Heterogeneous Graph Backdoor Attack (HGBA), the first backdoor attack specifically designed for HGNNs, introducing a novel relation-based trigger mechanism that establishes specific connections between a strategically selected trigger node and poisoned nodes via the backdoor metapath. HGBA achieves efficient and stealthy backdoor injection with minimal structural modifications and supports easy backdoor activation through two flexible strategies: Self-Node Attack and Indiscriminate Attack. Additionally, we improve the ASR measurement protocol, enabling a more accurate assessment of attack effectiveness. Extensive experiments demonstrate that HGBA far surpasses multiple state-of-the-art graph backdoor attacks in black-box settings, efficiently attacking HGNNs with low attack budgets. Ablation studies show that the strength of HGBA benefits from our trigger node selection method and backdoor metapath selection strategy. In addition, HGBA shows superior robustness against node feature perturbations and multiple types of existing graph backdoor defense mechanisms. Finally, extension experiments demonstrate that the relation-based trigger mechanism can effectively extend to tasks in homogeneous graph scenarios, thereby posing severe threats to broader security-critical domains.
Submitted 30 May, 2025;
originally announced June 2025.
-
Probabilistic intraday electricity price forecasting using generative machine learning
Authors:
Jieyu Chen,
Sebastian Lerch,
Melanie Schienle,
Tomasz Serafin,
Rafał Weron
Abstract:
The growing importance of intraday electricity trading in Europe calls for improved price forecasting and tailored decision-support tools. In this paper, we propose a novel generative neural network model to generate probabilistic path forecasts for intraday electricity prices and use them to construct effective trading strategies for Germany's continuous-time intraday market. Our method demonstrates competitive performance in terms of statistical evaluation metrics compared to two state-of-the-art statistical benchmark approaches. To further assess its economic value, we consider a realistic fixed-volume trading scenario and propose various strategies for placing market sell orders based on the path forecasts. Among the different trading strategies, the price paths generated by our generative model lead to higher profit gains than the benchmark methods. Our findings highlight the potential of generative machine learning tools in electricity price forecasting and underscore the importance of economic evaluation.
Submitted 28 May, 2025;
originally announced June 2025.
-
Decoupled Competitive Framework for Semi-supervised Medical Image Segmentation
Authors:
Jiahe Chen,
Jiahe Ying,
Shen Wang,
Jianwei Zheng
Abstract:
Confronting the critical challenge of insufficiently annotated samples in the medical domain, semi-supervised medical image segmentation (SSMIS) emerges as a promising solution. Specifically, most methodologies following the Mean Teacher (MT) or Dual Students (DS) architecture have achieved commendable results. However, to date, these approaches face a performance bottleneck due to two inherent limitations, e.g., the over-coupling problem within the MT structure owing to the employment of the exponential moving average (EMA) mechanism, as well as the severe cognitive bias between the two students of the DS structure, both of which potentially lead to reduced efficacy, or even eventual model collapse. To mitigate these issues, a Decoupled Competitive Framework (DCF) is elaborated in this work, which utilizes a straightforward competition mechanism for the update of EMA, effectively decoupling students and teachers in a dynamic manner. In addition, the seamless exchange of invaluable and precise insights is facilitated among students, guaranteeing a better learning paradigm. The DCF introduced undergoes rigorous validation on three publicly accessible datasets, which encompass both 2D and 3D datasets. The results demonstrate the superiority of our method over previous cutting-edge competitors. Code will be available at https://github.com/JiaheChen2002/DCF.
Submitted 30 May, 2025;
originally announced May 2025.
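For readers unfamiliar with the EMA mechanism that the DCF abstract above refers to, the following is a minimal sketch of the standard Mean Teacher weight update, in which the teacher's weights track a moving average of the student's. It does not attempt to reproduce DCF's competition mechanism, which the abstract does not specify; the function name `ema_update`, the flat list of weights, and the default `decay` value are all illustrative assumptions.

```python
def ema_update(teacher, student, decay=0.99):
    """Standard Mean Teacher update: teacher <- decay * teacher + (1 - decay) * student.

    Assumption: weights are flat lists of floats for illustration only;
    real models would apply this per parameter tensor.
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same number of weights")
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Hypothetical toy weights: with decay=0.5 the teacher moves halfway to the student.
teacher = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.5)
print(teacher)
```

Because the teacher is always a smoothed copy of a single student, the two are tightly coupled by construction, which is the over-coupling issue the abstract mentions.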
-
All-sky search for individual Primordial Black Hole bursts with LHAASO
Authors:
Zhen Cao,
F. Aharonian,
Y. X. Bai,
Y. W. Bao,
D. Bastieri,
X. J. Bi,
Y. J. Bi,
W. Bian,
A. V. Bukevich,
C. M. Cai,
W. Y. Cao,
Zhe Cao,
J. Chang,
J. F. Chang,
A. M. Chen,
E. S. Chen,
G. H. Chen,
H. X. Chen,
Liang Chen,
Long Chen,
M. J. Chen,
M. L. Chen,
Q. H. Chen,
S. Chen,
S. H. Chen
, et al. (293 additional authors not shown)
Abstract:
Primordial Black Holes (PBHs) are hypothetical black holes with a wide range of masses that formed in the early universe. As a result, they may play an important cosmological role and provide a unique probe of the early universe. A PBH with an initial mass of approximately $10^{15}$ g is expected to explode today in a final burst of Hawking radiation. In this work, we conduct an all-sky search for individual PBH burst events using the data collected from March 2021 to July 2024 by the Water Cherenkov Detector Array of the Large High Altitude Air Shower Observatory (LHAASO). Three PBH burst durations, 10 s, 20 s, and 100 s, are searched, with no significant PBH bursts observed. The upper limit on the local PBH burst rate density is set to be as low as 181 pc$^{-3}$ yr$^{-1}$ at 99% confidence level, representing the most stringent limit achieved to date.
Submitted 2 June, 2025; v1 submitted 30 May, 2025;
originally announced May 2025.
-
VUDG: A Dataset for Video Understanding Domain Generalization
Authors:
Ziyi Wang,
Zhi Gao,
Boxuan Yu,
Zirui Dai,
Yuxiang Song,
Qingyuan Lu,
Jin Chen,
Xinxiao Wu
Abstract:
Video understanding has made remarkable progress in recent years, largely driven by advances in deep models and the availability of large-scale annotated datasets. However, existing works typically ignore the inherent domain shifts encountered in real-world video applications, leaving domain generalization (DG) in video understanding underexplored. Hence, we propose Video Understanding Domain Generalization (VUDG), a novel dataset designed specifically for evaluating the DG performance in video understanding. VUDG contains videos from 11 distinct domains that cover three types of domain shifts, and maintains semantic similarity across different domains to ensure fair and meaningful evaluation. We propose a multi-expert progressive annotation framework to annotate each video with both multiple-choice and open-ended question-answer pairs. Extensive experiments on 9 representative large video-language models (LVLMs) and several traditional video question answering methods show that most models (including state-of-the-art LVLMs) suffer performance degradation under domain shifts. These results highlight the challenges posed by VUDG and the difference in the robustness of current models to data distribution shifts. We believe VUDG provides a valuable resource for promoting future research on domain generalization in video understanding.
Submitted 30 May, 2025;
originally announced May 2025.
-
Fine-Tune an SLM or Prompt an LLM? The Case of Generating Low-Code Workflows
Authors:
Orlando Marquez Ayala,
Patrice Bechard,
Emily Chen,
Maggie Baird,
Jingfei Chen
Abstract:
Large Language Models (LLMs) such as GPT-4o can handle a wide range of complex tasks with the right prompt. As per-token costs are reduced, the advantages of fine-tuning Small Language Models (SLMs) for real-world applications -- faster inference, lower costs -- may no longer be clear. In this work, we present evidence that, for domain-specific tasks that require structured outputs, SLMs still have a quality advantage. We compare fine-tuning an SLM against prompting LLMs on the task of generating low-code workflows in JSON form. We observe that while a good prompt can yield reasonable results, fine-tuning improves quality by 10% on average. We also perform systematic error analysis to reveal model limitations.
Submitted 29 May, 2025;
originally announced May 2025.
-
Pretraining Deformable Image Registration Networks with Random Images
Authors:
Junyu Chen,
Shuwen Wei,
Yihao Liu,
Aaron Carass,
Yong Du
Abstract:
Recent advances in deep learning-based medical image registration have shown that training deep neural networks (DNNs) does not necessarily require medical images. Previous work showed that DNNs trained on randomly generated images with carefully designed noise and contrast properties can still generalize well to unseen medical data. Building on this insight, we propose using registration between random images as a proxy task for pretraining a foundation model for image registration. Empirical results show that our pretraining strategy improves registration accuracy, reduces the amount of domain-specific data needed to achieve competitive performance, and accelerates convergence during downstream training, thereby enhancing computational efficiency.
Submitted 29 May, 2025;
originally announced May 2025.
-
Deep learning-derived arterial input function
Authors:
Junyu Chen,
Zirui Jiang,
Jennifer M. Coughlin,
Martin G. Pomper,
Yong Du
Abstract:
Dynamic positron emission tomography (PET) imaging combined with radiotracer kinetic modeling is a powerful technique for visualizing biological processes in the brain, offering valuable insights into brain functions and neurological disorders such as Alzheimer's and Parkinson's diseases. Accurate kinetic modeling relies heavily on the use of a metabolite-corrected arterial input function (AIF), which typically requires invasive and labor-intensive arterial blood sampling. While alternative non-invasive approaches have been proposed, they often compromise accuracy or still necessitate at least one invasive blood sampling. In this study, we present the deep learning-derived arterial input function (DLIF), a deep learning framework capable of estimating a metabolite-corrected AIF directly from dynamic PET image sequences without any blood sampling. We validated DLIF using existing dynamic PET patient data. We compared DLIF and resulting parametric maps against ground truth measurements. Our evaluation shows that DLIF achieves accurate and robust AIF estimation. By leveraging deep learning's ability to capture complex temporal dynamics and incorporating prior knowledge of typical AIF shapes through basis functions, DLIF provides a rapid, accurate, and entirely non-invasive alternative to traditional AIF measurement methods.
Submitted 29 May, 2025;
originally announced May 2025.
-
Beyond the LUMIR challenge: The pathway to foundational registration models
Authors:
Junyu Chen,
Shuwen Wei,
Joel Honkamaa,
Pekka Marttinen,
Hang Zhang,
Min Liu,
Yichao Zhou,
Zuopeng Tan,
Zhuoyuan Wang,
Yi Wang,
Hongchao Zhou,
Shunbo Hu,
Yi Zhang,
Qian Tao,
Lukas Förner,
Thomas Wendler,
Bailiang Jian,
Benedikt Wiestler,
Tim Hable,
Jin Kim,
Dan Ruan,
Frederic Madesta,
Thilo Sentker,
Wiebke Heyer,
Lianrui Zuo
, et al. (11 additional authors not shown)
Abstract:
Medical image challenges have played a transformative role in advancing the field, catalyzing algorithmic innovation and establishing new performance standards across diverse clinical applications. Image registration, a foundational task in neuroimaging pipelines, has similarly benefited from the Learn2Reg initiative. Building on this foundation, we introduce the Large-scale Unsupervised Brain MRI Image Registration (LUMIR) challenge, a next-generation benchmark designed to assess and advance unsupervised brain MRI registration. Distinct from prior challenges that leveraged anatomical label maps for supervision, LUMIR removes this dependency by providing over 4,000 preprocessed T1-weighted brain MRIs for training without any label maps, encouraging biologically plausible deformation modeling through self-supervision. In addition to evaluating performance on 590 held-out test subjects, LUMIR introduces a rigorous suite of zero-shot generalization tasks, spanning out-of-domain imaging modalities (e.g., FLAIR, T2-weighted, T2*-weighted), disease populations (e.g., Alzheimer's disease), acquisition protocols (e.g., 9.4T MRI), and species (e.g., macaque brains). A total of 1,158 subjects and over 4,000 image pairs were included for evaluation. Performance was assessed using both segmentation-based metrics (Dice coefficient, 95th percentile Hausdorff distance) and landmark-based registration accuracy (target registration error). Across both in-domain and zero-shot tasks, deep learning-based methods consistently achieved state-of-the-art accuracy while producing anatomically plausible deformation fields. The top-performing deep learning-based models demonstrated diffeomorphic properties and inverse consistency, outperforming several leading optimization-based methods, and showing strong robustness to most domain shifts, the exception being a drop in performance on out-of-domain contrasts.
Submitted 29 May, 2025;
originally announced May 2025.
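The LUMIR abstract above lists the Dice coefficient among its segmentation-based metrics. As a reminder of what that metric computes, here is a minimal sketch for a single label over two flattened label maps, using the standard definition 2|A ∩ B| / (|A| + |B|). The function name `dice_coefficient`, the flat-list input format, and the convention of returning 1.0 when the label is absent from both maps are assumptions made for illustration, not details of the challenge's evaluation code.

```python
def dice_coefficient(seg_a, seg_b, label):
    """Dice overlap of one label between two equally sized, flattened label maps."""
    if len(seg_a) != len(seg_b):
        raise ValueError("label maps must have the same number of voxels")

    # Voxel indices carrying the label in each map.
    voxels_a = {i for i, value in enumerate(seg_a) if value == label}
    voxels_b = {i for i, value in enumerate(seg_b) if value == label}

    if not voxels_a and not voxels_b:
        # Assumed convention: a label missing from both maps counts as perfect agreement.
        return 1.0
    return 2 * len(voxels_a & voxels_b) / (len(voxels_a) + len(voxels_b))


# Hypothetical toy maps: label 1 covers two voxels in one map and one in the other.
warped = [1, 1, 0, 0]
fixed = [1, 0, 0, 0]
print(dice_coefficient(warped, fixed, label=1))
```

In a registration setting, one label map would typically come from the warped moving image and the other from the fixed image, with the score averaged over anatomical labels.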
-
Sharp Concentration of Simple Random Tensors II: Asymmetry
Authors:
Jiaheng Chen,
Daniel Sanz-Alonso
Abstract:
This paper establishes sharp concentration inequalities for simple random tensors. Our theory unveils a phenomenon that arises only for asymmetric tensors of order $p \ge 3$: when the effective ranks of the covariances of the component random variables lie on both sides of a critical threshold, an additional logarithmic factor emerges that is not present in sharp bounds for symmetric tensors. To establish our results, we develop empirical process theory for products of $p$ different function classes evaluated at $p$ different random variables, extending generic chaining techniques for quadratic and product empirical processes to higher-order settings.
Submitted 29 May, 2025;
originally announced May 2025.
-
Electrical Detection of Single-Domain Néel Vector Reorientation across the Spin-Flop Transition in Cr2O3 Crystals
Authors:
Wei-Cheng Liao,
Haoyu Liu,
Weilun Tan,
Josiah Keagy,
Jia-mou Chen,
Jing Shi
Abstract:
Electrical transport measurements in heterostructures of antiferromagnetic Cr2O3 bulk crystals and a thin Pt layer exhibit sharp responses as the Néel vector of the Cr2O3 undergoes the spin-flop transition. This abrupt change can arise from several distinct mechanisms including magnetostriction, proximity-induced anomalous Hall, spin Hall anomalous Hall, and spin Hall planar Hall effects. While large Pt devices sensing multiple up/down domains can produce indistinguishable Hall signal jumps due to different initial Néel vector orientations, smaller Pt devices that sense single domains isolate the proximity-induced Hall signals. This allows direct electrical detection of Néel vector reorientation across the spin-flop transition in single-domain regions. Furthermore, the single-domain state can be prepared by magnetic field cooling or magnetoelectric cooling. We demonstrate a method to control and characterize nearly the full three-dimensional orientation of single-domain Néel vectors by exploiting Hall measurements and cooling techniques, which is crucial for future antiferromagnetic spintronic applications.
Submitted 29 May, 2025;
originally announced May 2025.
-
Improved Approximations for Hard Graph Problems using Predictions
Authors:
Anders Aamand,
Justin Y. Chen,
Siddharth Gollapudi,
Sandeep Silwal,
Hao Wu
Abstract:
We design improved approximation algorithms for NP-hard graph problems by incorporating predictions (e.g., learned from past data). Our prediction model builds upon and extends the $\varepsilon$-prediction framework by Cohen-Addad, d'Orsi, Gupta, Lee, and Panigrahi (NeurIPS 2024). We consider an edge-based version of this model, where each edge provides two bits of information, corresponding to predictions about whether each of its endpoints belongs to an optimal solution. Even with weak predictions where each bit is only $\varepsilon$-correlated with the true solution, this information allows us to break approximation barriers in the standard setting. We develop algorithms with improved approximation ratios for MaxCut, Vertex Cover, Set Cover, and Maximum Independent Set problems (among others). Across these problems, our algorithms share a unifying theme: we separately satisfy constraints related to high-degree vertices (using predictions) and low-degree vertices (without using predictions) and carefully combine the answers.
Submitted 29 May, 2025;
originally announced May 2025.
-
Lessons Learned: A Multi-Agent Framework for Code LLMs to Learn and Improve
Authors:
Yuanzhe Liu,
Ryan Deng,
Tim Kaler,
Xuhao Chen,
Charles E. Leiserson,
Yao Ma,
Jie Chen
Abstract:
Recent studies show that LLMs possess different skills and specialize in different tasks. In fact, we observe that their performance varies at several levels of granularity. For example, in the code optimization task, code LLMs excel at different optimization categories and no single model dominates the others. This observation prompts the question of how one leverages multiple LLM agents to solve a coding problem without knowing their complementary strengths a priori. We argue that a team of agents can learn from each other's successes and failures so as to improve their own performance. Thus, a lesson is the knowledge produced by an agent and passed on to other agents in the collective solution process. We propose a lesson-based collaboration framework, design the lesson solicitation--banking--selection mechanism, and demonstrate that a team of small LLMs with lessons learned can outperform a much larger LLM and other multi-LLM collaboration methods.
Submitted 29 May, 2025;
originally announced May 2025.
-
FinRipple: Aligning Large Language Models with Financial Market for Event Ripple Effect Awareness
Authors:
Yuanjian Xu,
Jianing Hao,
Kunsheng Tang,
Jingnan Chen,
Anxian Liu,
Peng Liu,
Guang Zhang
Abstract:
Financial markets exhibit complex dynamics where localized events trigger ripple effects across entities. Previous event studies, constrained by static single-company analyses and simplistic assumptions, fail to capture these ripple effects. While large language models (LLMs) offer emergent reasoning capabilities, their direct application falters due to structural market unawareness and limited capacity to analyze ripple effects. We propose FinRipple, an elegant framework that empowers LLMs with the ability to analyze ripple effects through financial theory-guided large-scale reinforcement learning. We begin by relaxing the assumptions of previous methods, incorporating a time-varying knowledge graph to accurately represent market structure. By seamlessly integrating classical asset pricing theory, we align the LLM with the market, enabling it to predict ripple effects. To the best of our knowledge, we are the first to provide a standardized definition of ripple effect prediction, a task that is extremely important yet unexplored in the financial domain. Extensive experiments demonstrate that FinRipple provides a promising solution to this task.
Submitted 28 May, 2025;
originally announced May 2025.
-
MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks
Authors:
Suhana Bedi,
Hejie Cui,
Miguel Fuentes,
Alyssa Unell,
Michael Wornow,
Juan M. Banda,
Nikesh Kotecha,
Timothy Keyes,
Yifan Mai,
Mert Oez,
Hao Qiu,
Shrey Jain,
Leonardo Schettini,
Mehr Kashyap,
Jason Alan Fries,
Akshay Swaminathan,
Philip Chung,
Fateme Nateghi,
Asad Aali,
Ashwin Nayak,
Shivam Vedak,
Sneha S. Jain,
Birju Patel,
Oluseyi Fayanju,
Shreya Shah
, et al. (56 additional authors not shown)
Abstract:
While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance for medical tasks with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcategories, and 121 tasks developed with 29 clinicians. Second, a comprehensive benchmark suite comprising 35 benchmarks (17 existing, 18 newly formulated) providing complete coverage of all categories and subcategories in the taxonomy. Third, a systematic comparison of LLMs with improved evaluation methods (using an LLM-jury) and a cost-performance analysis. Evaluation of 9 frontier LLMs, using the 35 benchmarks, revealed significant performance variation. Advanced reasoning models (DeepSeek R1: 66% win-rate; o3-mini: 64% win-rate) demonstrated superior performance, though Claude 3.5 Sonnet achieved comparable results at 40% lower estimated computational cost. On a normalized accuracy scale (0-1), most models performed strongly in Clinical Note Generation (0.73-0.85) and Patient Communication & Education (0.78-0.83), moderately in Medical Research Assistance (0.65-0.75), and generally lower in Clinical Decision Support (0.56-0.72) and Administration & Workflow (0.53-0.63). Our LLM-jury evaluation method achieved good agreement with clinician ratings (ICC = 0.47), surpassing both average clinician-clinician agreement (ICC = 0.43) and automated baselines including ROUGE-L (0.36) and BERTScore-F1 (0.44). Claude 3.5 Sonnet achieved comparable performance to top models at lower estimated cost. These findings highlight the importance of real-world, task-specific evaluation for medical use of LLMs and provide an open-source framework to enable this.
Submitted 2 June, 2025; v1 submitted 26 May, 2025;
originally announced May 2025.
-
Position Dependent Prediction Combination For Intra-Frame Video Coding
Authors:
Amir Said,
Xin Zhao,
Marta Karczewicz,
Jianle Chen,
Feng Zou
Abstract:
Intra-frame prediction in the High Efficiency Video Coding (HEVC) standard can be empirically improved by applying sets of recursive two-dimensional filters to the predicted values. However, this approach prevents (or significantly complicates) the parallel computation of pixel predictions. In this work we analyze why the recursive filters are effective, and use the results to derive sets of non-recursive predictors that have superior performance. We present an extension to HEVC intra prediction that combines values predicted using non-filtered and filtered (smoothed) reference samples, depending on the prediction mode and block size. Simulations using the HEVC common test conditions show that an average bit-rate reduction of 2.0% can be achieved compared to HEVC for All Intra (AI) configurations.
Submitted 29 May, 2025;
originally announced May 2025.
-
Are Reasoning Models More Prone to Hallucination?
Authors:
Zijun Yao,
Yantao Liu,
Yanxu Chen,
Jianhui Chen,
Junfeng Fang,
Lei Hou,
Juanzi Li,
Tat-Seng Chua
Abstract:
Recently evolved large reasoning models (LRMs) show powerful performance in solving complex tasks with long chain-of-thought (CoT) reasoning capability. As these LRMs are mostly developed by post-training on formal reasoning tasks, whether they generalize the reasoning capability to help reduce hallucination in fact-seeking tasks remains unclear and debated. For instance, DeepSeek-R1 reports increased performance on SimpleQA, a fact-seeking benchmark, while OpenAI-o3 observes even more severe hallucination. This discrepancy naturally raises the following research question: Are reasoning models more prone to hallucination? This paper addresses the question from three perspectives. (1) We first conduct a holistic evaluation of hallucination in LRMs. Our analysis reveals that LRMs that undergo a full post-training pipeline with cold-start supervised fine-tuning (SFT) and verifiable-reward RL generally alleviate their hallucination. In contrast, both distillation alone and RL training without cold-start fine-tuning introduce more nuanced hallucinations. (2) To explore why different post-training pipelines alter the impact on hallucination in LRMs, we conduct behavior analysis. We characterize two critical cognitive behaviors that directly affect the factuality of an LRM: Flaw Repetition, where the surface-level reasoning attempts repeatedly follow the same underlying flawed logic, and Think-Answer Mismatch, where the final answer fails to faithfully match the previous CoT process. (3) Further, we investigate the mechanism behind the hallucination of LRMs from the perspective of model uncertainty. We find that increased hallucination of LRMs is usually associated with the misalignment between model uncertainty and factual accuracy. Our work provides an initial understanding of hallucination in LRMs.
Submitted 29 May, 2025;
originally announced May 2025.
-
Prompting Whisper for Improved Verbatim Transcription and End-to-end Miscue Detection
Authors:
Griffin Dietz Smith,
Dianna Yee,
Jennifer King Chen,
Leah Findlater
Abstract:
Identifying mistakes (i.e., miscues) made while reading aloud is commonly approached post-hoc by comparing automatic speech recognition (ASR) transcriptions to the target reading text. However, post-hoc methods perform poorly when ASR inaccurately transcribes verbatim speech. To improve on current methods for reading error annotation, we propose a novel end-to-end architecture that incorporates the target reading text via prompting and is trained for both improved verbatim transcription and direct miscue detection. Our contributions include: first, demonstrating that incorporating reading text through prompting benefits verbatim transcription performance over fine-tuning, and second, showing that it is feasible to augment speech recognition tasks for end-to-end miscue detection. We conducted two case studies -- children's read-aloud and adult atypical speech -- and found that our proposed strategies improve verbatim transcription and miscue detection compared to the current state of the art.
Submitted 29 May, 2025;
originally announced May 2025.
-
Measurement of the Lund plane for light- and beauty-quark jets
Authors:
LHCb collaboration,
R. Aaij,
A. S. W. Abdelmotteleb,
C. Abellan Beteta,
F. Abudinén,
T. Ackernley,
A. A. Adefisoye,
B. Adeva,
M. Adinolfi,
P. Adlarson,
C. Agapopoulou,
C. A. Aidala,
Z. Ajaltouni,
S. Akar,
K. Akiba,
P. Albicocco,
J. Albrecht,
F. Alessio,
Z. Aliouche,
P. Alvarez Cartelle,
R. Amalric,
S. Amato,
J. L. Amey,
Y. Amhis,
L. An
, et al. (1133 additional authors not shown)
Abstract:
The substructure of jets in quantum chromodynamics (QCD) has garnered significant attention with the advent of infrared- and collinear-safe clustering algorithms and observables. A key question emerging from these studies is how in-jet emissions at soft and hard energy scales, across collinear and wide angles relative to the emitter, differ with the mass of the emitting parton. The Lund jet plane (LJP) is a perturbatively well-defined substructure observable that maps the radiation pattern of jets onto a plane, visually distinguishing emissions with different kinematic properties. Comparing the LJP for jets containing hadrons of low versus high mass enables the testing of QCD splitting functions from first-principles calculations across both soft and hard regimes and at different radiation angles. This article presents the first measurement of the LJP for light-quark-enriched and beauty-initiated jets at a center-of-mass energy of 13 TeV at LHCb. This marks the first direct observation of the dead-cone effect in beauty-quark jets, measured in the collinear region of the LJP.
Submitted 29 May, 2025;
originally announced May 2025.
-
Comparison of total $σ_k$-curvature
Authors:
Jiaqi Chen,
Yufei Shan,
Yinghui Ye
Abstract:
Volume comparison theorems are fundamental results in Riemannian geometry. In this article, we extend the volume comparison result in \cite{Besse2008} to the comparison of total $σ_l$-curvature with respect to $σ_k$-curvature ($l<k$). In particular, we prove that the comparison holds for metrics close to a strictly stable positive Einstein metric with $l<\frac{n}{2}$. As for negative Einstein metrics, we prove a similar comparison result provided certain assumptions on the sectional curvature hold for the manifold.
Submitted 9 June, 2025; v1 submitted 29 May, 2025;
originally announced May 2025.
-
Bridging the Gap Between Semantic and User Preference Spaces for Multi-modal Music Representation Learning
Authors:
Xiaofeng Pan,
Jing Chen,
Haitong Zhang,
Menglin Xing,
Jiayi Wei,
Xuefeng Mu,
Zhongqian Xie
Abstract:
Recent work on music representation learning mainly focuses on learning acoustic music representations from unlabeled audio, or further attempts to acquire multi-modal music representations from scarce annotated audio-text pairs. These approaches either ignore language semantics or rely on labeled audio datasets that are difficult and expensive to create. Moreover, merely modeling the semantic space usually fails to achieve satisfactory performance on music recommendation tasks since the user preference space is ignored. In this paper, we propose a novel Hierarchical Two-stage Contrastive Learning (HTCL) method that models similarity hierarchically, from the semantic perspective to the user perspective, to learn a comprehensive music representation bridging the gap between semantic and user preference spaces. We devise a scalable audio encoder and leverage a pre-trained BERT model as the text encoder to learn audio-text semantics via large-scale contrastive pre-training. Further, we explore a simple yet effective way to exploit interaction data from our online music platform to adapt the semantic space to the user preference space via contrastive fine-tuning, which differs from previous works that follow the idea of collaborative filtering. As a result, we obtain a powerful audio encoder that not only distills language semantics from the text encoder but also models similarity in the user preference space while preserving the integrity of the semantic space. Experimental results on both music semantic and recommendation tasks confirm the effectiveness of our method.
Submitted 29 May, 2025;
originally announced May 2025.
-
The Chemical Clock of High-mass Star-forming Regions: N2H+/CCS
Authors:
J. L. Chen,
J. S. Zhang,
J. X. Ge,
Y. X. Wang,
H. Z. Yu,
Y. P. Zou,
Y. T. Yan,
X. Y. Wang,
D. Y. Wei
Abstract:
Using the IRAM 30 m telescope, we present observations of the N2H+ J = 1-0 and CCS $J_N = 8_7$-$7_6$ and $7_7$-$6_6$ lines toward a large sample of ultracompact HII regions (UC HIIs). Among our 88 UC HIIs, 87 and 33 sources were detected in the N2H+ J = 1-0 and CCS $J_N = 8_7$-$7_6$ lines, respectively. For the CCS $7_7$-$6_6$ transition, we detected emission in 10 out of 82 targeted sources, all of which also exhibited emission in the CCS $J_N = 8_7$-$7_6$ line. Physical parameters are derived for our detections, including the optical depth and excitation temperature of N2H+, the rotational temperature of CCS and the column density. Combining our results with previous observational results for different stages of high-mass star-forming regions (HMSFRs), we found that the column density ratio N(N2H+)/N(CCS) increases from high-mass starless cores (HMSCs) through high-mass protostellar cores (HMPOs) to UC HIIs. This implies that N(N2H+)/N(CCS) can trace the evolution of HMSFRs. This is supported by our gas-grain chemical model, which shows that N(N2H+)/N(CCS) increases with the evolutionary age of HMSFRs. The temperature, density and chemical age were also constrained from our best-fit model at each stage. Thus, we propose N(N2H+)/N(CCS) as a reliable chemical clock of HMSFRs.
Submitted 29 May, 2025;
originally announced May 2025.
-
WTEFNet: Real-Time Low-Light Object Detection for Advanced Driver Assistance Systems
Authors:
Hao Wu,
Junzhou Chen,
Ronghui Zhang,
Nengchao Lyu,
Hongyu Hu,
Yanyong Guo,
Tony Z. Qiu
Abstract:
Object detection is a cornerstone of environmental perception in advanced driver assistance systems (ADAS). However, most existing methods rely on RGB cameras, which suffer from significant performance degradation under low-light conditions due to poor image quality. To address this challenge, we propose WTEFNet, a real-time object detection framework specifically designed for low-light scenarios, with strong adaptability to mainstream detectors. WTEFNet comprises three core modules: a Low-Light Enhancement (LLE) module, a Wavelet-based Feature Extraction (WFE) module, and an Adaptive Fusion Detection (AFFD) module. The LLE enhances dark regions while suppressing overexposed areas; the WFE applies multi-level discrete wavelet transforms to isolate high- and low-frequency components, enabling effective denoising and structural feature retention; the AFFD fuses semantic and illumination features for robust detection. To support training and evaluation, we introduce GSN, a manually annotated dataset covering both clear and rainy night-time scenes. Extensive experiments on BDD100K, SHIFT, nuScenes, and GSN demonstrate that WTEFNet achieves state-of-the-art accuracy under low-light conditions. Furthermore, deployment on an embedded platform (NVIDIA Jetson AGX Orin) confirms the framework's suitability for real-time ADAS applications.
Submitted 29 May, 2025; v1 submitted 29 May, 2025;
originally announced May 2025.
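The wavelet step described in the WTEFNet abstract, splitting a signal into low- and high-frequency components and suppressing noise, can be illustrated with a one-level, one-dimensional Haar transform. This is only a sketch under stated assumptions: the abstract does not name the wavelet family, the number of levels, or any thresholding rule, so the Haar filter, the hard threshold, and all function names here are hypothetical.

```python
import math

_SCALE = 1.0 / math.sqrt(2.0)


def haar_dwt_1d(signal):
    """One level of the Haar DWT: returns (approximation, detail) coefficients."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    approx = [(signal[i] + signal[i + 1]) * _SCALE for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * _SCALE for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt_1d(approx, detail):
    """Inverse of haar_dwt_1d: interleaves the reconstructed sample pairs."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) * _SCALE)
        out.append((a - d) * _SCALE)
    return out


def denoise(signal, threshold):
    """Zero small high-frequency coefficients (assumed hard threshold), then invert."""
    approx, detail = haar_dwt_1d(signal)
    kept = [d if abs(d) >= threshold else 0.0 for d in detail]
    return haar_idwt_1d(approx, kept)


# Hypothetical pixel row: small jitter in the first pair, a real edge in the second.
row = [4.0, 4.2, 9.0, 1.0]
smoothed = denoise(row, threshold=1.0)
```

With this threshold the small jitter between the first two samples is averaged away while the large jump between the last two is kept. A multi-level transform would apply `haar_dwt_1d` again to the approximation coefficients.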
-
Interpreting Chest X-rays Like a Radiologist: A Benchmark with Clinical Reasoning
Authors:
Jinquan Guan,
Qi Chen,
Lizhou Liang,
Yuhang Liu,
Vu Minh Hieu Phan,
Minh-Son To,
Jian Chen,
Yutong Xie
Abstract:
Artificial intelligence (AI)-based chest X-ray (CXR) interpretation assistants have demonstrated significant progress and are increasingly being applied in clinical settings. However, contemporary medical AI models often adhere to a simplistic input-to-output paradigm, directly processing an image and an instruction to generate a result, where the instructions may be integral to the model's architecture. This approach overlooks the modeling of the inherent diagnostic reasoning in chest X-ray interpretation. Such reasoning is typically sequential, where each interpretive stage considers the images, the current task, and the contextual information from previous stages. This oversight leads to several shortcomings, including misalignment with clinical scenarios, contextless reasoning, and untraceable errors. To fill this gap, we construct CXRTrek, a new multi-stage visual question answering (VQA) dataset for CXR interpretation. The dataset is designed to explicitly simulate the diagnostic reasoning process employed by radiologists in real-world clinical settings for the first time. CXRTrek covers 8 sequential diagnostic stages, comprising 428,966 samples and over 11 million question-answer (Q&A) pairs, with an average of 26.29 Q&A pairs per sample. Building on the CXRTrek dataset, we propose a new vision-language large model (VLLM), CXRTrekNet, specifically designed to incorporate the clinical reasoning flow into the VLLM framework. CXRTrekNet effectively models the dependencies between diagnostic stages and captures reasoning patterns within the radiological context. Trained on our dataset, the model consistently outperforms existing medical VLLMs on the CXRTrek benchmarks and demonstrates superior generalization across multiple tasks on five diverse external datasets. The dataset and model can be found in our repository (https://github.com/guanjinquan/CXRTrek).
Submitted 29 May, 2025;
originally announced May 2025.
-
Detecting Stealthy Backdoor Samples based on Intra-class Distance for Large Language Models
Authors:
Jinwen Chen,
Hainan Zhang,
Fei Sun,
Qinnan Zhang,
Sijia Wen,
Ziwei Wang,
Zhiming Zheng
Abstract:
Fine-tuning LLMs with datasets containing stealthy backdoors from publishers poses security risks to downstream applications. Mainstream detection methods either identify poisoned samples by analyzing the prediction probability of poisoned classification models or rely on a rewriting model to eliminate the stealthy triggers. However, the former cannot be applied to generation tasks, while the latter may degrade generation performance and introduce new triggers. Therefore, efficiently eliminating stealthy poisoned samples for LLMs remains an urgent problem. We observe that after applying TF-IDF clustering to the sample responses, there are notable differences in the intra-class distances between clean and poisoned samples. Poisoned samples tend to cluster closely because of their specific malicious outputs, whereas clean samples are more scattered due to their more varied responses. Thus, in this paper, we propose a stealthy backdoor sample detection method based on Reference-Filtration and Tfidf-Clustering mechanisms (RFTC). Specifically, we first compare the sample response with the reference model's outputs and consider the sample suspicious if there is a significant discrepancy. We then perform TF-IDF clustering on these suspicious samples to identify the true poisoned samples based on the intra-class distance. Experiments on two machine translation datasets and one QA dataset demonstrate that RFTC outperforms baselines in backdoor detection and model performance. Further analysis of different reference models also confirms the effectiveness of our Reference-Filtration.
Submitted 28 May, 2025;
originally announced May 2025.
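The intra-class distance observation in the abstract above can be illustrated with a small sketch. It is not the RFTC implementation: the whitespace tokenizer, the smoothed IDF formula, the cosine distance, and the hand-assigned groups (standing in for an actual clustering step) are all assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) with a smoothed IDF; tokenization is by whitespace."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            t: (c / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return vectors


def cosine_distance(u, v):
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    if nu == 0.0 or nv == 0.0:
        return 1.0
    return 1.0 - dot / (nu * nv)


def mean_intra_class_distance(vectors):
    """Average pairwise cosine distance within one group of vectors."""
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    if not pairs:
        return 0.0
    return sum(cosine_distance(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)


# Hypothetical responses: the first three share a fixed malicious output pattern.
responses = [
    "visit evil site now",
    "please visit evil site now",
    "visit evil site now today",
    "the cat sat on the mat",
    "stock prices rose sharply",
    "bonjour means hello in french",
]
vecs = tfidf_vectors(responses)
tight = mean_intra_class_distance(vecs[:3])   # poisoned-like group
loose = mean_intra_class_distance(vecs[3:])   # clean-like group
```

On this toy data the group with near-identical outputs has a much smaller mean intra-class distance than the varied group, which is the signal the abstract describes using to separate poisoned from clean samples.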
-
HyperMotion: DiT-Based Pose-Guided Human Image Animation of Complex Motions
Authors:
Shuolin Xu,
Siming Zheng,
Ziyi Wang,
HC Yu,
Jinwei Chen,
Huaqi Zhang,
Bo Li,
Peng-Tao Jiang
Abstract:
Recent advances in diffusion models have significantly improved conditional video generation, particularly in the pose-guided human image animation task. Although existing methods can generate high-fidelity, temporally consistent animation sequences for regular motions and static scenes, they show clear limitations on complex human body motions (Hypermotion) that are highly dynamic and non-standard, and a high-quality benchmark for evaluating complex human motion animation is lacking. To address these challenges, we introduce the Open-HyperMotionX Dataset and HyperMotionX Bench, which provide high-quality human pose annotations and curated video clips for evaluating and improving pose-guided human image animation models under complex human motion conditions. Furthermore, we propose a simple yet powerful DiT-based video generation baseline and design spatial low-frequency enhanced RoPE, a novel module that selectively enhances low-frequency spatial feature modeling by introducing learnable frequency scaling. Our method significantly improves structural stability and appearance consistency in highly dynamic human motion sequences. Extensive experiments demonstrate the effectiveness of our dataset and proposed approach in advancing the generation quality of complex human motion image animations. Code and dataset will be made publicly available.
Submitted 28 May, 2025;
originally announced May 2025.