-
NetPress: Dynamically Generated LLM Benchmarks for Network Applications
Authors:
Yajie Zhou,
Jiajun Ruan,
Eric S. Wang,
Sadjad Fouladi,
Francis Y. Yan,
Kevin Hsieh,
Zaoxing Liu
Abstract:
Despite growing interest in domain-specific benchmarking of large language models (LLMs) and agents, current evaluations remain limited to static, small-scale datasets, especially in high-stakes tasks like network operations that demand reliability for deployments. We present NetPress, an automated benchmark generation framework for evaluating LLM agents in network applications. NetPress introduces a unified abstraction with state and action, enabling dynamic generation of diverse query sets along with corresponding ground truths. At runtime, users can specify benchmark configurations to generate millions of queries on the fly. In addition to dynamic benchmark construction, NetPress integrates with network emulators to provide realistic environment feedback, supporting comprehensive evaluation across correctness, safety, and latency. We instantiate NetPress on three representative applications, revealing interesting fine-grained differences in agent behavior that static, correctness-only benchmarks often miss. NetPress moves LLM evaluation toward realistic, scalable testing in infrastructure-centric domains, helping close the gap between benchmark performance and real-world deployment readiness. Code is available at https://github.com/Froot-NetSys/NetPress.
Submitted 3 June, 2025;
originally announced June 2025.
-
Three-pion Bose-Einstein correlations measured in proton-proton collisions
Authors:
LHCb collaboration,
R. Aaij,
A. S. W. Abdelmotteleb,
C. Abellan Beteta,
F. Abudinén,
T. Ackernley,
A. A. Adefisoye,
B. Adeva,
M. Adinolfi,
P. Adlarson,
C. Agapopoulou,
C. A. Aidala,
Z. Ajaltouni,
S. Akar,
K. Akiba,
P. Albicocco,
J. Albrecht,
F. Alessio,
Z. Aliouche,
P. Alvarez Cartelle,
R. Amalric,
S. Amato,
J. L. Amey,
Y. Amhis,
L. An
, et al. (1125 additional authors not shown)
Abstract:
A study on the Bose-Einstein correlations for triplets of same-sign pions is presented. The analysis is performed using proton-proton collisions at a centre-of-mass energy of $\sqrt{s}$ = 7 TeV, recorded by the LHCb experiment, corresponding to an integrated luminosity of 1.0 fb$^{-1}$. For the first time, the results are interpreted in the core-halo model. The parameters of the model are determined in regions of charged-particle multiplicity. This measurement provides insight into the nature of hadronisation in terms of coherence, showing a coherent emission of pions.
Submitted 9 June, 2025; v1 submitted 3 June, 2025;
originally announced June 2025.
-
Measurement of the branching fractions of the Cabibbo-favored decays $Λ_{c}^{+}\toΛK_{S}^{0}K^{+}$ and $Λ_{c}^{+}\toΞ^{0}K_{S}^{0}π^{+}$ and search for $Λ_{c}^{+}\toΣ^{0} K_{S}^{0}K^{+}$
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere,
A. Brueggemann,
H. Cai
, et al. (660 additional authors not shown)
Abstract:
Based on $e^{+}e^{-}$ collision data corresponding to an integrated luminosity of about 4.5 fb$^{-1}$ collected at center-of-mass energies between 4599.53 MeV and 4698.82 MeV with the BESIII detector, the absolute branching fraction of the Cabibbo-favored decay $Λ_{c}^{+}\toΛK_{S}^{0}K^{+}$ is measured to be $(3.12\pm0.46\pm0.15)\times10^{-3}$. Combined with a previous measurement from the BESIII Collaboration, the branching fraction of the decay $Λ_{c}^{+}\toΛK_{S}^{0}K^{+}$ is calculated to be $(3.07\pm0.26\pm0.13)\times10^{-3}$. The decay $Λ_{c}^{+}\toΞ^{0}K_{S}^{0}π^{+}$ is observed for the first time with a statistical significance of $6.6σ$, and its branching fraction is determined to be $(3.70\pm0.60\pm0.21)\times10^{-3}$. In addition, a search for the decay $Λ_{c}^{+}\toΣ^{0} K_{S}^{0}K^{+}$ is performed and its branching fraction is determined to be $(0.80^{+0.28}_{-0.24}\pm0.16)\times10^{-3}$, corresponding to an upper limit of $1.28\times10^{-3}$ at $90\%$ confidence level. These measurements provide new information that can be used to distinguish between theoretical models.
Submitted 3 June, 2025;
originally announced June 2025.
-
MTL-KD: Multi-Task Learning Via Knowledge Distillation for Generalizable Neural Vehicle Routing Solver
Authors:
Yuepeng Zheng,
Fu Luo,
Zhenkun Wang,
Yaoxin Wu,
Yu Zhou
Abstract:
Multi-Task Learning (MTL) in Neural Combinatorial Optimization (NCO) is a promising approach to train a unified model capable of solving multiple Vehicle Routing Problem (VRP) variants. However, existing Reinforcement Learning (RL)-based multi-task methods can only train light decoder models on small-scale problems, exhibiting limited generalization ability when solving large-scale problems. To overcome this limitation, this work introduces a novel multi-task learning method driven by knowledge distillation (MTL-KD), which enables the efficient training of heavy decoder models with strong generalization ability. The proposed MTL-KD method transfers policy knowledge from multiple distinct RL-based single-task models to a single heavy decoder model, facilitating label-free training and effectively improving the model's generalization ability across diverse tasks. In addition, we introduce a flexible inference strategy termed Random Reordering Re-Construction (R3C), which is specifically adapted for diverse VRP tasks and further boosts the performance of the multi-task model. Experimental results on 6 seen and 10 unseen VRP variants with up to 1000 nodes indicate that our proposed method consistently achieves superior performance on both uniform and real-world benchmarks, demonstrating robust generalization abilities.
Submitted 14 June, 2025; v1 submitted 3 June, 2025;
originally announced June 2025.
-
NTIRE 2025 XGC Quality Assessment Challenge: Methods and Results
Authors:
Xiaohong Liu,
Xiongkuo Min,
Qiang Hu,
Xiaoyun Zhang,
Jie Guo,
Guangtao Zhai,
Shushi Wang,
Yingjie Zhou,
Lu Liu,
Jingxin Li,
Liu Yang,
Farong Wen,
Li Xu,
Yanwei Jiang,
Xilei Zhu,
Chunyi Li,
Zicheng Zhang,
Huiyu Duan,
Xiele Wu,
Yixuan Gao,
Yuqin Cao,
Jun Jia,
Wei Sun,
Jiezhang Cao,
Radu Timofte
, et al. (70 additional authors not shown)
Abstract:
This paper reports on the NTIRE 2025 XGC Quality Assessment Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. The challenge addresses a major problem in the field of video and talking head processing. It is divided into three tracks: user-generated video, AI-generated video, and talking head. The user-generated video track uses FineVD-GC, which contains 6,284 user-generated videos. This track has a total of 125 registered participants. A total of 242 submissions were received in the development phase and 136 in the test phase. Finally, 5 participating teams submitted their models and fact sheets. The AI-generated video track uses Q-Eval-Video, which contains 34,029 AI-generated videos (AIGVs) generated by 11 popular text-to-video (T2V) models. A total of 133 participants registered in this track. A total of 396 submissions were received in the development phase and 226 in the test phase. Finally, 6 participating teams submitted their models and fact sheets. The talking head track uses THQA-NTIRE, which contains 12,247 2D and 3D talking heads. A total of 89 participants registered in this track. A total of 225 submissions were received in the development phase and 118 in the test phase. Finally, 8 participating teams submitted their models and fact sheets. Each participating team in every track proposed a method that outperforms the baseline, contributing to progress in the fields covered by the three tracks.
Submitted 3 June, 2025;
originally announced June 2025.
-
UTCS: Effective Unsupervised Temporal Community Search with Pre-training of Temporal Dynamics and Subgraph Knowledge
Authors:
Yue Zhang,
Yankai Chen,
Yingli Zhou,
Yucan Guo,
Xiaolin Han,
Chenhao Ma
Abstract:
In many real-world applications, the evolving relationships between entities can be modeled as temporal graphs, where each edge has a timestamp representing the interaction time. As a fundamental problem in graph analysis, community search (CS) in temporal graphs has received growing attention, but existing approaches exhibit two major limitations: (1) Traditional methods typically require predefined subgraph structures, which are not always known in advance. (2) Learning-based methods struggle to capture temporal interaction information. To fill this research gap, in this paper, we propose an effective Unsupervised Temporal Community Search with pre-training of temporal dynamics and subgraph knowledge model (UTCS). UTCS contains two key stages: offline pre-training and online search. In the first stage, we introduce multiple learning objectives to facilitate the pre-training process in the unsupervised learning setting. In the second stage, we identify a candidate subgraph and compute community scores using the pre-trained node representations and a novel scoring mechanism to determine the final community members. Experiments on five real-world datasets demonstrate the effectiveness of UTCS.
Submitted 3 June, 2025;
originally announced June 2025.
-
SiamNAS: Siamese Surrogate Model for Dominance Relation Prediction in Multi-objective Neural Architecture Search
Authors:
Yuyang Zhou,
Ferrante Neri,
Yew-Soon Ong,
Ruibin Bai
Abstract:
Modern neural architecture search (NAS) is inherently multi-objective, balancing trade-offs such as accuracy, parameter count, and computational cost. This complexity makes NAS computationally expensive and nearly impossible to solve without efficient approximations. To address this, we propose a novel surrogate modelling approach that leverages an ensemble of Siamese network blocks to predict dominance relationships between candidate architectures. Lightweight and easy to train, the surrogate achieves 92% accuracy and replaces the crowding distance calculation in the survivor selection strategy with a heuristic rule based on model size. Integrated into a framework termed SiamNAS, this design eliminates costly evaluations during the search process. Experiments on NAS-Bench-201 demonstrate the framework's ability to identify Pareto-optimal solutions with significantly reduced computational costs. The proposed SiamNAS identified a final non-dominated set containing the best architecture in NAS-Bench-201 for CIFAR-10 and the second-best for ImageNet, in terms of test error rate, within 0.01 GPU days. This proof-of-concept study highlights the potential of the proposed Siamese network surrogate model to generalise to multi-tasking optimisation, enabling simultaneous optimisation across tasks. Additionally, it offers opportunities to extend the approach for generating Sets of Pareto Sets (SOS), providing diverse Pareto-optimal solutions for heterogeneous task settings.
Submitted 3 June, 2025;
originally announced June 2025.
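For readers unfamiliar with the term, the dominance relation that the surrogate is said to predict is the standard Pareto dominance between objective vectors. The following is a minimal sketch of that ground-truth relation only, not the SiamNAS surrogate or the authors' code. It assumes every objective is minimized, and the function names and candidate values are hypothetical, made up for illustration.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a Pareto-dominates b, with every objective minimized."""
    if len(a) != len(b):
        raise ValueError("objective vectors must have the same length")
    no_worse = all(x <= y for x, y in zip(a, b))        # a is never worse than b
    strictly_better = any(x < y for x, y in zip(a, b))  # and better somewhere
    return no_worse and strictly_better


def non_dominated(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Return the points that no other point dominates (input order kept)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (test error in %, parameters in millions) pairs, not from the paper.
candidates = [(6.0, 1.2), (5.5, 1.5), (6.5, 1.1), (6.2, 1.6)]
front = non_dominated(candidates)
print(front)
```

A learned surrogate of the kind described would replace the exact `dominates` call with a prediction made before the architectures are fully evaluated.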
-
V2X-UniPool: Unifying Multimodal Perception and Knowledge Reasoning for Autonomous Driving
Authors:
Xuewen Luo,
Fengze Yang,
Fan Ding,
Xiangbo Gao,
Shuo Xing,
Yang Zhou,
Zhengzhong Tu,
Chenxi Liu
Abstract:
Knowledge-driven autonomous driving systems (ADs) offer powerful reasoning capabilities, but face two critical challenges: limited perception due to the short-sightedness of single-vehicle sensors, and hallucination arising from the lack of real-time environmental grounding. To address these issues, this paper introduces V2X-UniPool, a unified framework that integrates multimodal Vehicle-to-Everything (V2X) data into a time-indexed and language-based knowledge pool. By leveraging a dual-query Retrieval-Augmented Generation (RAG) mechanism that retrieves both static and dynamic knowledge, our system enables ADs to perform accurate, temporally consistent reasoning over both the static environment and the dynamic traffic context. Experiments on a real-world cooperative driving dataset demonstrate that V2X-UniPool significantly enhances motion planning accuracy and reasoning capability. Remarkably, it enables even zero-shot vehicle-side models to achieve state-of-the-art performance, while simultaneously reducing transmission cost by over 99.9% compared to prior V2X methods.
Submitted 3 June, 2025;
originally announced June 2025.
-
Improved Measurements of $D^+ \to ηe^+ν_e$ and $D^+ \to ημ^+ν_μ$
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere,
A. Brueggemann,
H. Cai
, et al. (682 additional authors not shown)
Abstract:
Using 20.3 fb$^{-1}$ of $e^+e^-$ collision data collected at the center-of-mass energy of 3.773 GeV with the BESIII detector, we measure the branching fractions of $D^+\to ηe^+ν_e$ and $D^+\to ημ^+ν_μ$ to be $(9.75\pm0.29\pm0.28)\times10^{-4}$ and $(9.08\pm0.35\pm0.23)\times10^{-4}$, where the first and second uncertainties are statistical and systematic, respectively. From a simultaneous fit to their partial decay rates, we determine the product of the hadronic form factor $f^η_+(0)$ and the modulus of the $c\to d$ Cabibbo-Kobayashi-Maskawa matrix element $|V_{cd}|$ to be $f^η_+(0)|V_{cd}|=0.078\pm0.002\pm0.001$. Taking the $|V_{cd}|$ value from the Standard Model global fit as input, we obtain $f^η_+(0)=0.345\pm0.008\pm0.003$. The ratio between the measured branching fractions of $D^+\to ημ^+ν_μ$ and $D^+\to ηe^+ν_e$ is determined to be $0.93\pm0.05_{\rm stat.}\pm0.02_{\rm syst.}$, indicating no violation of lepton flavor universality.
Submitted 3 June, 2025;
originally announced June 2025.
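As a quick consistency check on the abstract above, the muon-to-electron ratio can be recomputed from the two quoted branching fractions. This is a minimal sketch, not the collaboration's procedure. It assumes the statistical uncertainties of the two channels are uncorrelated, which is an assumption made here and not a statement from the paper. The systematic term is left out because it cannot be reproduced this way: adding the quoted systematic uncertainties in quadrature would overestimate the published value. The function and variable names are illustrative.

```python
import math

# Central values and statistical uncertainties quoted in the abstract,
# in units of 1e-4.
BF_E, STAT_E = 9.75, 0.29    # D+ -> eta e+ nu_e
BF_MU, STAT_MU = 9.08, 0.35  # D+ -> eta mu+ nu_mu


def ratio_with_stat(num, num_err, den, den_err):
    """Return num/den and its uncertainty, assuming uncorrelated inputs."""
    ratio = num / den
    # Relative uncertainties add in quadrature for a quotient.
    rel_err = math.hypot(num_err / num, den_err / den)
    return ratio, ratio * rel_err


ratio, stat = ratio_with_stat(BF_MU, STAT_MU, BF_E, STAT_E)
print(f"R = {ratio:.2f} +/- {stat:.2f} (stat.)")
```

Under that assumption the sketch reproduces the quoted central value and statistical uncertainty, 0.93 and 0.05.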
-
Act Only When It Pays: Efficient Reinforcement Learning for LLM Reasoning via Selective Rollouts
Authors:
Haizhong Zheng,
Yang Zhou,
Brian R. Bartoldson,
Bhavya Kailkhura,
Fan Lai,
Jiawei Zhao,
Beidi Chen
Abstract:
Reinforcement learning, such as PPO and GRPO, has powered recent breakthroughs in LLM reasoning. Scaling rollout to sample more prompts enables models to selectively use higher-quality data for training, which can stabilize RL training and improve model performance. However, this comes at the cost of significant computational overhead. In this paper, we show that a substantial portion of this overhead can be avoided by skipping uninformative prompts before rollout. Our analysis of reward dynamics reveals a strong temporal consistency in prompt value: prompts that are uninformative in one epoch of training are likely to remain uninformative in future epochs. Based on these insights, we propose GRESO (GRPO with Efficient Selective Rollout), an online, lightweight pre-rollout filtering algorithm that predicts and skips uninformative prompts using reward training dynamics. By evaluating GRESO on a broad range of math reasoning benchmarks and models, such as Qwen2.5-Math-1.5B, DeepSeek-R1-Distill-Qwen-1.5B, and Qwen2.5-Math-7B, we show that GRESO achieves up to 2.4x wall-clock time speedup in rollout and up to 2.0x speedup in total training time without accuracy degradation.
Submitted 2 June, 2025;
originally announced June 2025.
-
Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains
Authors:
Juncheng Wu,
Sheng Liu,
Haoqin Tu,
Hang Yu,
Xiaoke Huang,
James Zou,
Cihang Xie,
Yuyin Zhou
Abstract:
Recent advances in reasoning-enhanced Large Language Models such as OpenAI-o1/3 and DeepSeek-R1 have significantly improved performance on complex tasks. However, the quality and transparency of their internal reasoning processes remain underexplored. This work moves beyond final-answer accuracy and investigates step-by-step reasoning in the medical and mathematical domains by explicitly decomposing the thinking trajectories into two parts: knowledge and reasoning. Specifically, we introduce a fine-grained evaluation framework that judges: (1) the correctness of knowledge used (measured by Knowledge Index (KI)) and (2) the quality of reasoning (measured by Information Gain (InfoGain)). Using this framework, we study R1-distilled and base Qwen models trained with supervised fine-tuning (SFT) and/or reinforcement learning (RL) in the medical and math domains. Three intriguing findings emerge: (1) The general reasoning abilities in R1-distilled models do not transfer effectively to the medical domain through either SFT or RL. (2) SFT raises final-answer accuracy in both domains, but often at the cost of reasoning quality: InfoGain drops by 38.9% on average compared with untrained models. In the medical domain, however, SFT remains crucial because domain knowledge is indispensable. (3) RL enhances medical reasoning by pruning inaccurate or irrelevant knowledge from reasoning paths, thereby improving both reasoning accuracy and knowledge correctness.
Submitted 2 June, 2025;
originally announced June 2025.
-
SMOTE-DP: Improving Privacy-Utility Tradeoff with Synthetic Data
Authors:
Yan Zhou,
Bradley Malin,
Murat Kantarcioglu
Abstract:
Privacy-preserving data publication, including synthetic data sharing, often involves trade-offs between privacy and utility. Synthetic data is generally more effective than data anonymization in balancing this trade-off, though not without its own challenges. Synthetic data produced by generative models trained on source data may inadvertently reveal information about outliers. Techniques specifically designed for preserving privacy, such as introducing noise to satisfy differential privacy, often incur unpredictable and significant losses in utility. In this work we show that, with the right mechanism of synthetic data generation, we can achieve strong privacy protection without significant utility loss. Synthetic data generators producing contracting data patterns, such as Synthetic Minority Over-sampling Technique (SMOTE), can enhance a differentially private data generator, leveraging the strengths of both. We prove in theory and through empirical demonstration that this SMOTE-DP technique can produce synthetic data that not only ensures robust privacy protection but also maintains utility in downstream learning tasks.
Submitted 2 June, 2025;
originally announced June 2025.
-
Some series connecting Fibonacci numbers to $π$
Authors:
Zhi-Wei Sun,
Yajun Zhou
Abstract:
Exploring the theory of Guillera--Rogers, we evaluate some infinite series whose summands are quadratic irrationals, in terms of $π$ and special values of Dirichlet $L$-functions. For example, we show that \[\sum_{k=1}^\infty\frac{3 \left(16 \sqrt{5}-35\right) k-4 \left(5 \sqrt{5}-11\right)}{k^{3}\binom{2k}{k}^3}\left(\frac{1+\sqrt{5}}{2} \right)^{8 k}=\frac{71π^{2}}{30}\]and\begin{align*}&\sum_{k=1}^\infty\frac{6 \left(17 \sqrt{7}+35\right) k- 35 \sqrt{7}-89}{k^{3}\binom{2k}{k}^3}\left(-2^{11}\right)^k\big(45-17\sqrt{7}\big)^{2k}\\={}&128\left[ 20L_{-8}(2)-7\sqrt{7}L_{-56}(2) \right],\end{align*}where the central binomial coefficients are given by $ \binom{2k}k:=\frac{(2k)!}{(k!)^{2}} $, and the special Dirichlet $L$-values $ L_d(2):= \sum_{k=1}^\infty\left( \frac{d}{k} \right)\frac1{k^2}$ are defined through the Kronecker symbol $ \left(\frac{d}{\cdot}\right)$.
Submitted 12 June, 2025; v1 submitted 2 June, 2025;
originally announced June 2025.
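The first identity quoted in the abstract above can be checked numerically in double precision. This is a minimal sketch of such a check and proves nothing. The function name and the choice of 150 terms are assumptions made here. The summand is accumulated through the ratio $\binom{2k}{k}/\binom{2k-2}{k-1} = 2(2k-1)/k$, which follows from the definition of the central binomial coefficient given in the abstract and avoids forming very large intermediate numbers.

```python
import math


def series_partial_sum(terms: int = 150) -> float:
    """Partial sum of the first series quoted in the abstract."""
    sqrt5 = math.sqrt(5.0)
    phi8 = ((1.0 + sqrt5) / 2.0) ** 8  # golden ratio to the 8th power
    a = 3.0 * (16.0 * sqrt5 - 35.0)    # coefficient of k in the numerator
    b = 4.0 * (5.0 * sqrt5 - 11.0)     # constant subtracted in the numerator
    weight = 1.0  # holds phi^(8k) / binom(2k, k)^3, starting from k = 0
    total = 0.0
    for k in range(1, terms + 1):
        # Move the weight from k-1 to k using the binomial ratio 2(2k-1)/k.
        weight *= phi8 / (2.0 * (2 * k - 1) / k) ** 3
        total += (a * k - b) * weight / k**3
    return total


lhs = series_partial_sum()
rhs = 71.0 * math.pi**2 / 30.0
print(lhs, rhs)
```

Successive terms shrink by a factor that approaches $φ^8/64 \approx 0.73$, so 150 terms leave a truncation error far below double-precision rounding.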
-
OmniV2V: Versatile Video Generation and Editing via Dynamic Content Manipulation
Authors:
Sen Liang,
Zhentao Yu,
Zhengguang Zhou,
Teng Hu,
Hongmei Wang,
Yi Chen,
Qin Lin,
Yuan Zhou,
Xin Li,
Qinglin Lu,
Zhibo Chen
Abstract:
The emergence of Diffusion Transformers (DiT) has brought significant advancements to video generation, especially in text-to-video and image-to-video tasks. Although video generation is widely applied in various fields, most existing models are limited to single scenarios and cannot perform diverse video generation and editing through dynamic content manipulation. We propose OmniV2V, a video model capable of generating and editing videos across different scenarios based on various operations, including: object movement, object addition, mask-guided video edit, try-on, inpainting, outpainting, human animation, and controllable character video synthesis. We explore a unified dynamic content manipulation injection module, which effectively integrates the requirements of the above tasks. In addition, we design a visual-text instruction module based on LLaVA, enabling the model to effectively understand the correspondence between visual content and instructions. Furthermore, we build a comprehensive multi-task data processing system. Since there is data overlap among various tasks, this system can efficiently provide data augmentation. Using this system, we construct a multi-type, multi-scenario OmniV2V dataset and its corresponding OmniV2V-Test benchmark. Extensive experiments show that OmniV2V works as well as, and sometimes better than, the best existing open-source and commercial models for many video generation and editing tasks.
Submitted 2 June, 2025;
originally announced June 2025.
-
Macroscopic entanglement of three magnon modes in three cavities via optical parametric amplifier
Authors:
Ying Zhou,
Guo-Qiang Zhang
Abstract:
We propose a scheme to generate bipartite and tripartite entanglements of three magnon modes in a three-cavity system using a nonlinear optical parametric amplifier (OPA). The three magnon modes in three YIG spheres are respectively placed inside three cavities near the maximum magnetic fields of the cavities and coupled to cavity modes via linear magnetic dipole interaction. Additionally, linear coupling interaction exists between two cavities. Using experimentally feasible parameters, we demonstrate that the OPA can prepare the three magnon modes in a steady-state entangled state, and that the bipartite and tripartite entanglements increase with the nonlinear interaction strength of the OPA. An alternative approach to enhance quantum entanglement involves multiplexed OPA inputs. By employing an individual OPA for each cavity, we observe a significant improvement in entanglement generation. All the entanglements are robust against bath temperature.
Submitted 14 June, 2025; v1 submitted 2 June, 2025;
originally announced June 2025.
-
Self-Challenging Language Model Agents
Authors:
Yifei Zhou,
Sergey Levine,
Jason Weston,
Xian Li,
Sainbayar Sukhbaatar
Abstract:
Large language models are quickly becoming the foundation for intelligent agents that are capable of using tools. However, training such agents is challenging because it requires human creation and annotation of a diverse set of tasks, tools, and evaluation criteria. In this paper, we propose the Self-Challenging framework for training an agent on high-quality tasks that are generated by itself. The agent first plays the role of challenger and generates a task after interacting with the given tools. The tasks take the form of a novel general class of problems termed Code-as-Task, which are defined by an instruction, a verification function, and solution and failure cases that serve as tests, making it possible to keep only high-quality tasks. The agent then takes an executor role and trains on those tasks with reinforcement learning using the evaluation feedback as a reward. Evaluation on two existing multi-turn tool-use agent benchmarks, M3ToolEval and TauBench, shows the Self-Challenging framework achieves over a two-fold improvement in Llama-3.1-8B-Instruct, despite using only self-generated training data.
Submitted 2 June, 2025;
originally announced June 2025.
-
A Graph Neural Network for the Era of Large Atomistic Models
Authors:
Duo Zhang,
Anyang Peng,
Chun Cai,
Wentao Li,
Yuanchang Zhou,
Jinzhe Zeng,
Mingyu Guo,
Chengqian Zhang,
Bowen Li,
Hong Jiang,
Tong Zhu,
Weile Jia,
Linfeng Zhang,
Han Wang
Abstract:
Foundation models, or large atomistic models (LAMs), aim to universally represent the ground-state potential energy surface (PES) of atomistic systems as defined by density functional theory (DFT). The scaling law is pivotal in the development of large models, suggesting that their generalizability in downstream tasks consistently improves with increased model size, expanded training datasets, and larger computational budgets. In this study, we present DPA3, a multi-layer graph neural network founded on line graph series (LiGS), designed explicitly for the era of LAMs. We demonstrate that the generalization error of the DPA3 model adheres to the scaling law. The scalability in the number of model parameters is attained by stacking additional layers within DPA3. Additionally, the model employs a dataset encoding mechanism that decouples the scaling of training data size from the model size within its multi-task training framework. When trained as problem-oriented potential energy models, the DPA3 model exhibits superior accuracy in the majority of benchmark cases, encompassing systems with diverse features, including molecules, bulk materials, surface and cluster catalysts, two-dimensional materials, and battery materials. When trained as a LAM on the OpenLAM-v1 dataset, the DPA-3.1-3M model exhibits state-of-the-art performance in the LAMBench benchmark suite for LAMs, demonstrating the lowest overall zero-shot generalization error across 17 downstream tasks from a broad spectrum of research domains. This performance suggests superior accuracy as an out-of-the-box potential model, requiring minimal fine-tuning data for downstream scientific applications.
Submitted 9 June, 2025; v1 submitted 2 June, 2025;
originally announced June 2025.
-
State Similarity in Modular Superconducting Quantum Processors with Classical Communications
Authors:
Bujiao Wu,
Changrong Xie,
Peng Mi,
Zhiyi Wu,
Zechen Guo,
Peisheng Huang,
Wenhui Huang,
Xuandong Sun,
Jiawei Zhang,
Libo Zhang,
Jiawei Qiu,
Xiayu Linpeng,
Ziyu Tao,
Ji Chu,
Ji Jiang,
Song Liu,
Jingjing Niu,
Yuxuan Zhou,
Yuxuan Du,
Wenhui Ren,
Youpeng Zhong,
Tongliang Liu,
Dapeng Yu
Abstract:
As quantum devices continue to scale, distributed quantum computing emerges as a promising strategy for executing large-scale tasks across modular quantum processors. A central challenge in this paradigm is verifying the correctness of computational outcomes when subcircuits are executed independently following circuit cutting. Here we propose a cross-platform fidelity estimation algorithm tailored for modular architectures. Our method achieves substantial reductions in sample complexity compared to previous approaches designed for single-processor systems. We experimentally implement the protocol on modular superconducting quantum processors with up to 6 qubits to verify the similarity of two 11-qubit GHZ states. Beyond verification, we show that our algorithm enables a federated quantum kernel method that preserves data privacy. As a proof of concept, we apply it to a 5-qubit quantum phase learning task using six 3-qubit modules, successfully extracting phase information with just eight training samples. These results establish a practical path for scalable verification and trustworthy quantum machine learning of modular quantum processors.
Submitted 11 June, 2025; v1 submitted 2 June, 2025;
originally announced June 2025.
-
STSA: Federated Class-Incremental Learning via Spatial-Temporal Statistics Aggregation
Authors:
Zenghao Guan,
Guojun Zhu,
Yucan Zhou,
Wu Liu,
Weiping Wang,
Jiebo Luo,
Xiaoyan Gu
Abstract:
Federated Class-Incremental Learning (FCIL) enables Class-Incremental Learning (CIL) from distributed data. Existing FCIL methods typically integrate old knowledge preservation into local client training. However, these methods cannot avoid spatial-temporal client drift caused by data heterogeneity and often incur significant computational and communication overhead, limiting practical deployment. To address these challenges simultaneously, we propose a novel approach, Spatial-Temporal Statistics Aggregation (STSA), which provides a unified framework to aggregate feature statistics both spatially (across clients) and temporally (across stages). The aggregated feature statistics are unaffected by data heterogeneity and can be used to update the classifier in closed form at each stage. Additionally, we introduce STSA-E, a communication-efficient variant with theoretical guarantees, achieving performance similar to STSA with much lower communication overhead. Extensive experiments on three widely used FCIL datasets, with varying degrees of data heterogeneity, show that our method outperforms state-of-the-art FCIL methods in terms of performance, flexibility, and both communication and computation efficiency.
Submitted 2 June, 2025;
originally announced June 2025.
-
ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding
Authors:
Yiyang Zhou,
Yangfan He,
Yaofeng Su,
Siwei Han,
Joel Jang,
Gedas Bertasius,
Mohit Bansal,
Huaxiu Yao
Abstract:
Video understanding is fundamental to tasks such as action recognition, video reasoning, and robotic control. Early video understanding methods based on large vision-language models (LVLMs) typically adopt a single-pass reasoning paradigm without dynamic feedback, limiting the model's capacity to self-correct and adapt in complex scenarios. Recent efforts have attempted to address this limitation by incorporating reward models and reinforcement learning to enhance reasoning, or by employing tool-agent frameworks. However, these approaches face several challenges, including high annotation costs, reward signals that fail to capture real-time reasoning states, and low inference efficiency. To overcome these issues, we propose ReAgent-V, a novel agentic video understanding framework that integrates efficient frame selection with real-time reward generation during inference. These reward signals not only guide iterative answer refinement through a multi-perspective reflection mechanism (adjusting predictions from conservative, neutral, and aggressive viewpoints) but also enable automatic filtering of high-quality data for supervised fine-tuning (SFT), direct preference optimization (DPO), and group relative policy optimization (GRPO). ReAgent-V is lightweight, modular, and extensible, supporting flexible tool integration tailored to diverse tasks. Extensive experiments on 12 datasets across three core applications (video understanding, video reasoning enhancement, and vision-language-action model alignment) demonstrate significant gains in generalization and reasoning, with improvements of up to 6.9%, 2.1%, and 9.8%, respectively, highlighting the effectiveness and versatility of the proposed framework.
Submitted 2 June, 2025;
originally announced June 2025.
-
MobCLIP: Learning General-purpose Geospatial Representation at Scale
Authors:
Ya Wen,
Jixuan Cai,
Qiyao Ma,
Linyan Li,
Xinhua Chen,
Chris Webster,
Yulun Zhou
Abstract:
Representation learning of geospatial locations remains a core challenge in achieving general geospatial intelligence. Current embedding methods often lack versatility, limiting their utility across diverse tasks in both human and natural domains. We present MobCLIP, the first nationwide general-purpose location encoder, integrating an unprecedented diversity of data modalities through effective and scalable multimodal fusion. Adopting a novel CLIP-based architecture, our framework aligns 100M+ POIs, nationwide remote sensing imagery, and structured demographic statistics with a billion-edge mobility graph. By tokenizing spatial locations into grid cells inspired by Vision Transformers, we establish a unified representation space bridging mobility patterns and multimodal features. To rigorously evaluate the general-purpose effectiveness of MobCLIP, we construct a benchmark dataset composed of 11 downstream prediction tasks across social, economic, and natural domains. Experiments show that MobCLIP, with four input modalities and a compact 128-dimensional representation space, outperforms state-of-the-art models in general-purpose predictive performance by an average of 35%. Thanks to the effective integration of human-centric modalities, the performance gain is particularly pronounced in human-centric tasks, such as the prediction of energy consumption (+260%), offline retail consumption amount (+98%), and crime cases (+95%). Echoing LLM scaling laws, we further demonstrate the scaling behavior in geospatial representation learning. We open-source code and pretrained models at: https://github.com/ylzhouchris/MobCLIP.
Submitted 3 June, 2025; v1 submitted 2 June, 2025;
originally announced June 2025.
-
Towards Efficient Few-shot Graph Neural Architecture Search via Partitioning Gradient Contribution
Authors:
Wenhao Song,
Xuan Wu,
Bo Yang,
You Zhou,
Yubin Xiao,
Yanchun Liang,
Hongwei Ge,
Heow Pueh Lee,
Chunguo Wu
Abstract:
To address the weight coupling problem, certain studies introduced few-shot Neural Architecture Search (NAS) methods, which partition the supernet into multiple sub-supernets. However, these methods often suffer from computational inefficiency and tend to provide suboptimal partitioning schemes. To address this problem more effectively, we analyze the weight coupling problem from a novel perspective, which primarily stems from distinct modules in succeeding layers imposing conflicting gradient directions on the preceding layer modules. Based on this perspective, we propose the Gradient Contribution (GC) method that efficiently computes the cosine similarity of gradient directions among modules by decomposing the Vector-Jacobian Product during supernet backpropagation. Subsequently, the modules with conflicting gradient directions are allocated to distinct sub-supernets while similar ones are grouped together. To assess the advantages of GC and address the limitations of existing Graph Neural Architecture Search methods, which are limited to searching a single type of Graph Neural Networks (Message Passing Neural Networks (MPNNs) or Graph Transformers (GTs)), we propose the Unified Graph Neural Architecture Search (UGAS) framework, which explores optimal combinations of MPNNs and GTs. The experimental results demonstrate that GC achieves state-of-the-art (SOTA) performance in supernet partitioning quality and time efficiency. In addition, the architectures searched by UGAS+GC outperform both the manually designed GNNs and those obtained by existing NAS methods. Finally, ablation studies further demonstrate the effectiveness of all proposed methods.
Submitted 20 June, 2025; v1 submitted 1 June, 2025;
originally announced June 2025.
-
DeepVerse: 4D Autoregressive Video Generation as a World Model
Authors:
Junyi Chen,
Haoyi Zhu,
Xianglong He,
Yifan Wang,
Jianjun Zhou,
Wenzheng Chang,
Yang Zhou,
Zizun Li,
Zhoujie Fu,
Jiangmiao Pang,
Tong He
Abstract:
World models serve as essential building blocks toward Artificial General Intelligence (AGI), enabling intelligent agents to predict future states and plan actions by simulating complex physical interactions. However, existing interactive models primarily predict visual observations, thereby neglecting crucial hidden states like geometric structures and spatial coherence. This leads to rapid error accumulation and temporal inconsistency. To address these limitations, we introduce DeepVerse, a novel 4D interactive world model explicitly incorporating geometric predictions from previous timesteps into current predictions conditioned on actions. Experiments demonstrate that by incorporating explicit geometric constraints, DeepVerse captures richer spatio-temporal relationships and underlying physical dynamics. This capability significantly reduces drift and enhances temporal consistency, enabling the model to reliably generate extended future sequences and achieve substantial improvements in prediction accuracy, visual realism, and scene rationality. Furthermore, our method provides an effective solution for geometry-aware memory retrieval, effectively preserving long-term spatial consistency. We validate the effectiveness of DeepVerse across diverse scenarios, establishing its capacity for high-fidelity, long-horizon predictions grounded in geometry-aware dynamics.
Submitted 1 June, 2025;
originally announced June 2025.
-
A Group-Wise Narrow Beam Design for Uplink Channel Estimation in Hybrid Beamforming Systems
Authors:
Yufan Zhou,
Yongbo Xiao,
An Liu
Abstract:
In this paper, we consider uplink channel estimation for massive multi-input multi-output (MIMO) systems with partially connected hybrid beamforming (PC-HBF) structures. Existing beam design and channel estimation schemes are usually based on ideal assumptions and require transmitting pilots across multiple timeslots, making them unsuitable for practical PC-HBF systems. To overcome these drawbacks, we propose a novel beam design and a corresponding channel estimation algorithm to achieve accurate and real-time uplink channel estimation. Firstly, we introduce a group-wise narrow beam design in the vertical dimension to suppress interference from undesired angular components and improve vertical angle estimation accuracy, which divides the columns of the uniform planar array (UPA) into groups and the vertical angle interval into sub-intervals. In this way, each group is assigned a narrow beam to cover one vertical angle sub-interval, and the set of narrow beams is designed based on filter design theory. Secondly, we optimize the antenna grouping pattern using the Estimation of Distribution Algorithm (EDA), balancing interference suppression and resolution capability in the horizontal dimension, leading to better horizontal angle estimation performance. Finally, we design a low-complexity group-wise subspace constrained variational Bayesian inference (GW-SC-VBI) algorithm to fully take advantage of the proposed beam design to achieve both low-complexity and high-accuracy channel estimation. Simulation results demonstrate that the proposed scheme achieves notable performance gains over baseline methods.
Submitted 1 June, 2025;
originally announced June 2025.
-
Improve MLLM Benchmark Efficiency through Interview
Authors:
Farong Wen,
Yijin Guo,
Junying Wang,
Jiaohao Xiao,
Yingjie Zhou,
Chunyi Li,
Zicheng Zhang,
Guangtao Zhai
Abstract:
The rapid development of Multimodal Large Language Models (MLLM) has led to a wide range of MLLM applications, and a number of benchmark datasets have sprung up in order to assess MLLM abilities. However, full-coverage Q&A testing on large-scale data is resource-intensive and time-consuming. To address this issue, we propose the MLLM Interview (MITV) strategy, which aims to quickly obtain MLLM performance metrics by asking fewer questions. First, we constructed the interview dataset, which was built on an existing MLLM assessment dataset, by adding difficulty labels based on the performance of some typical MLLMs on this dataset. Second, we propose an MLLM Interview strategy, which obtains an initial estimate of the large model's performance by quizzing it on a small number of topics and then continuously tries to test the model's limits. Through extensive experiments, the results show that the MITV strategy proposed in this paper performs well on MLLM benchmark datasets, and that it is able to evaluate model capability faster through a small number of questions and answers.
Submitted 1 June, 2025;
originally announced June 2025.
-
Breaking Latent Prior Bias in Detectors for Generalizable AIGC Image Detection
Authors:
Yue Zhou,
Xinan He,
KaiQing Lin,
Bin Fan,
Feng Ding,
Bin Li
Abstract:
Current AIGC detectors often achieve near-perfect accuracy on images produced by the same generator used for training but struggle to generalize to outputs from unseen generators. We trace this failure in part to latent prior bias: detectors learn shortcuts tied to patterns stemming from the initial noise vector rather than learning robust generative artifacts. To address this, we propose On-Manifold Adversarial Training (OMAT): by optimizing the initial latent noise of diffusion models under fixed conditioning, we generate on-manifold adversarial examples that remain on the generator's output manifold, unlike pixel-space attacks, which introduce off-manifold perturbations that the generator itself cannot reproduce and that can obscure the true discriminative artifacts. To test against state-of-the-art generative models, we introduce GenImage++, a test-only benchmark of outputs from advanced generators (Flux.1, SD3) with extended prompts and diverse styles. We apply our adversarial-training paradigm to ResNet50 and CLIP baselines and evaluate across existing AIGC forensic benchmarks and recent challenge datasets. Extensive experiments show that adversarially trained detectors significantly improve cross-generator performance without any network redesign. Our findings on latent prior bias offer valuable insights for future dataset construction and detector evaluation, guiding the development of more robust and generalizable AIGC forensic methodologies.
Submitted 1 June, 2025;
originally announced June 2025.
-
SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers
Authors:
Zhengcong Fei,
Hao Jiang,
Di Qiu,
Baoxuan Gu,
Youqiang Zhang,
Jiahua Wang,
Jialin Bai,
Debang Li,
Mingyuan Fan,
Guibin Chen,
Yahui Zhou
Abstract:
The generation and editing of audio-conditioned talking portraits guided by multimodal inputs, including text, images, and videos, remains underexplored. In this paper, we present SkyReels-Audio, a unified framework for synthesizing high-fidelity and temporally coherent talking portrait videos. Built upon pretrained video diffusion transformers, our framework supports infinite-length generation and editing, while enabling diverse and controllable conditioning through multimodal inputs. We employ a hybrid curriculum learning strategy to progressively align audio with facial motion, enabling fine-grained multimodal control over long video sequences. To enhance local facial coherence, we introduce a facial mask loss and an audio-guided classifier-free guidance mechanism. A sliding-window denoising approach further fuses latent representations across temporal segments, ensuring visual fidelity and temporal consistency across extended durations and diverse identities. More importantly, we construct a dedicated data pipeline for curating high-quality triplets consisting of synchronized audio, video, and textual descriptions. Comprehensive benchmark evaluations show that SkyReels-Audio achieves superior performance in lip-sync accuracy, identity consistency, and realistic facial dynamics, particularly under complex and challenging conditions.
Submitted 1 June, 2025;
originally announced June 2025.
-
Enhancing Clinical Multiple-Choice Questions Benchmarks with Knowledge Graph Guided Distractor Generation
Authors:
Running Yang,
Wenlong Deng,
Minghui Chen,
Yuyin Zhou,
Xiaoxiao Li
Abstract:
Clinical tasks such as diagnosis and treatment require strong decision-making abilities, highlighting the importance of rigorous evaluation benchmarks to assess the reliability of large language models (LLMs). In this work, we introduce a knowledge-guided data augmentation framework that enhances the difficulty of clinical multiple-choice question (MCQ) datasets by generating distractors (i.e., incorrect choices that are similar to the correct one and may confuse existing LLMs). Using our KG-based pipeline, the generated choices are both clinically plausible and deliberately misleading. Our approach involves multi-step, semantically informed walks on a medical knowledge graph to identify distractor paths (associations that are medically relevant but factually incorrect), which then guide the LLM in crafting more deceptive distractors. We apply the designed knowledge graph guided distractor generation (KGGDG) pipeline to six widely used medical QA benchmarks and show that it consistently reduces the accuracy of state-of-the-art LLMs. These findings establish KGGDG as a powerful tool for enabling more robust and diagnostic evaluations of medical LLMs.
Submitted 3 July, 2025; v1 submitted 31 May, 2025;
originally announced June 2025.
-
The Hidden Language of Harm: Examining the Role of Emojis in Harmful Online Communication and Content Moderation
Authors:
Yuhang Zhou,
Yimin Xiao,
Wei Ai,
Ge Gao
Abstract:
Social media platforms have become central to modern communication, yet they also harbor offensive content that challenges platform safety and inclusivity. While prior research has primarily focused on textual indicators of offense, the role of emojis, ubiquitous visual elements in online discourse, remains underexplored. Emojis, despite being rarely offensive in isolation, can acquire harmful meanings through symbolic associations, sarcasm, and contextual misuse. In this work, we systematically examine emoji contributions to offensive Twitter messages, analyzing their distribution across offense categories and how users exploit emoji ambiguity. To address this, we propose an LLM-powered, multi-step moderation pipeline that selectively replaces harmful emojis while preserving the tweet's semantic intent. Human evaluations confirm our approach effectively reduces perceived offensiveness without sacrificing meaning. Our analysis also reveals heterogeneous effects across offense types, offering nuanced insights for online communication and emoji moderation.
Submitted 31 May, 2025;
originally announced June 2025.
-
Reasoning Like an Economist: Post-Training on Economic Problems Induces Strategic Generalization in LLMs
Authors:
Yufa Zhou,
Shaobo Wang,
Xingyu Dong,
Xiangqi Jin,
Yifang Chen,
Yue Min,
Kexin Yang,
Xingzhang Ren,
Dayiheng Liu,
Linfeng Zhang
Abstract:
Directly training Large Language Models (LLMs) for Multi-Agent Systems (MAS) remains challenging due to intricate reward modeling, dynamic agent interactions, and demanding generalization requirements. This paper explores whether post-training techniques, specifically Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR), can effectively generalize to multi-agent scenarios. We use economic reasoning as a testbed, leveraging its strong foundations in mathematics and game theory, its demand for structured analytical reasoning, and its relevance to real-world applications such as market design, resource allocation, and policy analysis. We introduce Recon (Reasoning like an ECONomist), a 7B-parameter open-source LLM post-trained on a hand-curated dataset of 2,100 high-quality economic reasoning problems. Comprehensive evaluation on economic reasoning benchmarks and multi-agent games reveals clear improvements in structured reasoning and economic rationality. These results underscore the promise of domain-aligned post-training for enhancing reasoning and agent alignment, shedding light on the roles of SFT and RL in shaping model behavior. Code is available at https://github.com/MasterZhou1/Recon.
Submitted 31 May, 2025;
originally announced June 2025.
-
Decoupling Reasoning and Knowledge Injection for In-Context Knowledge Editing
Authors:
Changyue Wang,
Weihang Su,
Qingyao Ai,
Yujia Zhou,
Yiqun Liu
Abstract:
Knowledge editing aims to efficiently update Large Language Models (LLMs) by modifying specific knowledge without retraining the entire model. Among knowledge editing approaches, in-context editing (ICE) offers a lightweight solution by injecting new knowledge directly into the input context, leaving model parameters unchanged. However, existing ICE approaches do not explicitly separate the newly injected knowledge from the model's original reasoning process. This entanglement often results in conflicts between external updates and internal parametric knowledge, undermining the consistency and accuracy of the reasoning path. In this work, we conduct preliminary experiments to examine how parametric knowledge influences reasoning path planning. We find that the model's reasoning is tightly coupled with its internal knowledge, and that naively injecting new information without adapting the reasoning path often leads to performance degradation, particularly in multi-hop tasks. To this end, we propose DecKER, a novel ICE framework that decouples reasoning from knowledge editing by generating a masked reasoning path and then resolving knowledge edits via hybrid retrieval and model-based validation. Experiments on multi-hop QA benchmarks show that DecKER significantly outperforms existing ICE methods by mitigating knowledge conflicts and preserving reasoning consistency. Our code is available at: https://github.com/bebr2/DecKER.
Submitted 31 May, 2025;
originally announced June 2025.
-
Asteroseismology of the G8 subgiant beta Aquilae with SONG-Tenerife, SONG-Australia and TESS
Authors:
Hans Kjeldsen,
Timothy R. Bedding,
Yaguang Li,
Frank Grundahl,
Mads Fredslund Andersen,
Duncan J. Wright,
Jack Soutter,
Robert Wittenmyer,
Claudia Reyes,
Dennis Stello,
Courtney Crawford,
Yixiao Zhou,
Mathieu Clerte,
Pere L. Palle,
Sergio Simon-Diaz,
Joergen Christensen-Dalsgaard,
Rasmus Handberg,
Hasse Hansen,
Paul Heeren,
Jens Jessen-Hansen,
Mikkel N. Lund,
Mia S. Lundkvist,
Karsten Brogaard,
Rene Tronsgaard,
Jonatan Rudrasingam
, et al. (6 additional authors not shown)
Abstract:
We present time-series radial velocities of the G8 subgiant star beta Aql obtained in 2022 and 2023 using SONG-Tenerife and, for the first time, SONG-Australia. We also analyse a sector of TESS photometry that overlapped with the 2022 SONG data. The resulting power spectrum clearly shows solar-like oscillations centred at 430 muHz. The TESS light curve shows the oscillations at lower signal-to-noise, reflecting the fact that photometric measurements are much more affected by the granulation background than are radial velocities. The simultaneous observations in velocity and photometry represent the best such measurements for any star apart from the Sun. They allowed us to measure the ratio between the bolometric photometric amplitude and the velocity amplitude to be 26.6 +/- 3.1 ppm/(m/s). We measured this ratio for the Sun from published SOHO data to be 19.5 +/- 0.7 ppm/(m/s) and, after accounting for the difference in effective temperatures of beta Aql and the Sun, these values align with expectations. In both the Sun and beta Aql, the photometry-to-velocity ratio appears to be a function of frequency. We also measured the phase shift of the oscillations in beta Aql between SONG and TESS to be -113 +/- 7 deg, which agrees with the value for the Sun and also with a 3-D simulation of a star with similar properties to beta Aql. Importantly for exoplanet searches, we argue that simultaneous photometry can be used to predict the contribution of oscillations to radial velocities. We measured frequencies for 22 oscillation modes in beta Aql and carried out asteroseismic modelling, yielding an excellent fit to the frequencies. We derived accurate values for the mass and age, and were able to place quite strong constraints on the mixing-length parameter. Finally, we show that the oscillation properties of beta Aql are very similar to stars in the open cluster M67.
Submitted 16 June, 2025; v1 submitted 31 May, 2025;
originally announced June 2025.
-
Strain Enhanced Spin Readout Contrast in Silicon Carbide Membranes
Authors:
Haibo Hu,
Guodong Bian,
Ailun Yi,
Chunhui Jiang,
Junhua Tan,
Qi Luo,
Bo Liang,
Zhengtong Liu,
Xinfang Nie,
Dawei Lu,
Shumin Xiao,
Xin Ou,
Adam Gali,
Yu Zhou,
Qinghai Song
Abstract:
Quantum defects in solids have emerged as a transformative platform for advancing quantum technologies. A key requirement for these applications is achieving high-fidelity single-spin readout, particularly at room temperature for quantum biosensing. Here, we demonstrate through ab initio simulations of a primary quantum defect in 4H silicon carbide that strain is an effective control parameter for significantly enhancing readout contrast. We validate this principle experimentally by inducing local strain in silicon carbide-on-insulator membranes, achieving a readout contrast exceeding 60% while preserving the favorable coherence properties of single spins. Our findings establish strain engineering as a powerful and versatile strategy for optimizing coherent spin-photon interfaces in solid-state quantum systems.
Submitted 30 May, 2025;
originally announced June 2025.
-
An evaluation of LLMs for generating movie reviews: GPT-4o, Gemini-2.0 and DeepSeek-V3
Authors:
Brendan Sands,
Yining Wang,
Chenhao Xu,
Yuxuan Zhou,
Lai Wei,
Rohitash Chandra
Abstract:
Large language models (LLMs) have been prominent in various tasks, including text generation and summarisation. The applicability of LLMs to the generation of product reviews is gaining momentum, paving the way for the generation of movie reviews. In this study, we propose a framework that generates movie reviews using three LLMs (GPT-4o, DeepSeek-V3, and Gemini-2.0), and evaluate their performance by comparing the generated outputs with IMDb user reviews. We use movie subtitles and screenplays as input to the LLMs and investigate how they affect the quality of reviews generated. We review the LLM-based movie reviews in terms of vocabulary, sentiment polarity, similarity, and thematic consistency in comparison to IMDb user reviews. The results demonstrate that LLMs are capable of generating syntactically fluent and structurally complete movie reviews. Nevertheless, there is still a noticeable gap in emotional richness and stylistic coherence between LLM-generated and IMDb reviews, suggesting that further refinement is needed to improve the overall quality of movie review generation. We also conducted a survey-based analysis in which participants were asked to distinguish between LLM and IMDb user reviews. The results show that LLM-generated reviews are difficult to distinguish from IMDb user reviews. We found that DeepSeek-V3 produced the most balanced reviews, closely matching IMDb reviews. GPT-4o overemphasised positive emotions, while Gemini-2.0 captured negative emotions better but showed excessive emotional intensity.
Submitted 30 May, 2025;
originally announced June 2025.
-
Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation
Authors:
Yucheng Zhou,
Jiahao Yuan,
Qianning Wang
Abstract:
Recent advancements in text-to-image (T2I) generation have enabled models to produce high-quality images from textual descriptions. However, these models often struggle with complex instructions involving multiple objects, attributes, and spatial relationships. Existing benchmarks for evaluating T2I models primarily focus on general text-image alignment and fail to capture the nuanced requirements of complex, multi-faceted prompts. Given this gap, we introduce LongBench-T2I, a comprehensive benchmark specifically designed to evaluate T2I models under complex instructions. LongBench-T2I consists of 500 intricately designed prompts spanning nine diverse visual evaluation dimensions, enabling a thorough assessment of a model's ability to follow complex instructions. Beyond benchmarking, we propose an agent framework (Plan2Gen) that facilitates complex instruction-driven image generation without requiring additional model training. This framework integrates seamlessly with existing T2I models, using large language models to interpret and decompose complex prompts, thereby guiding the generation process more effectively. As existing evaluation metrics, such as CLIPScore, fail to adequately capture the nuances of complex instructions, we introduce an evaluation toolkit that automates the quality assessment of generated images using a set of multi-dimensional metrics. The data and code are released at https://github.com/yczhou001/LongBench-T2I.
Submitted 30 May, 2025;
originally announced May 2025.
-
AFLoRA: Adaptive Federated Fine-Tuning of Large Language Models with Resource-Aware Low-Rank Adaption
Authors:
Yajie Zhou,
Xiaoyi Pang,
Zhibo Wang
Abstract:
Federated fine-tuning has emerged as a promising approach to adapt foundation models to downstream tasks using decentralized data. However, real-world deployment remains challenging due to the high computational and communication demands of fine-tuning Large Language Models (LLMs) on clients with data and system resources that are heterogeneous and constrained. In such settings, the global model's performance is often bottlenecked by the weakest clients and further degraded by the non-IID nature of local data. Although existing methods leverage parameter-efficient techniques such as Low-Rank Adaptation (LoRA) to reduce communication and computation overhead, they often fail to simultaneously ensure accurate aggregation of low-rank updates and maintain low system costs, thereby hindering overall performance. To address these challenges, we propose AFLoRA, an adaptive and lightweight federated fine-tuning framework for LLMs. AFLoRA decouples shared and client-specific updates to reduce overhead and improve aggregation accuracy, incorporates diagonal matrix-based rank pruning to better utilize local resources, and employs rank-aware aggregation with public data refinement to strengthen generalization under data heterogeneity. Extensive experiments demonstrate that AFLoRA outperforms state-of-the-art methods in both accuracy and efficiency, providing a practical solution for efficient LLM adaptation in heterogeneous environments in the real world.
Submitted 30 May, 2025;
originally announced May 2025.
-
All-optical diode via nonreciprocal nonlinear absorption and interfacial charge transfer in two-dimensional van der Waals heterostructures
Authors:
Erkang Li,
Jinhong Liu,
Yanqing Ge,
Mingjian Shi,
Yijie Wang,
Chunhui Lu,
Yixuan Zhou,
Xinlong Xu
Abstract:
Nonreciprocity is fundamental to photonic and optoelectronic devices such as all-optical diodes for ultrafast optical signal processing. However, previous nonreciprocity is mainly based on linear optical response rather than nonlinear optical response based on recently developed two-dimensional (2D) van der Waals heterostructures. Herein, an all-optical diode prototype based on nonreciprocal nonlinear absorption and interfacial charge transfer is proposed and designed by both simulation and experiment based on ready van der Waals heterostructures. The giant saturable absorption from 2D MXenes (NbC) and reverse saturable absorption from 2D chalcogenides (GaS) play a synergistic role in the designed all-optical diodes, which are characterized by a femtosecond-laser-based Z-scan system. The comprehensive physical mechanism of this all-optical diode based on the 2D van der Waals NbC/GaS heterostructure, as designed by simulations, is consistent with experiments when both nonreciprocal nonlinear absorption and the interfacial effect are considered. This all-optical diode based on the 2D van der Waals heterostructure features simplicity, scalability, stability, ease of integration, and compatibility with complementary planar fabrication technology, which can further extend and miniaturize nonlinear photonic and optoelectronic devices.
Submitted 30 May, 2025;
originally announced May 2025.
-
Characterizing the limiting critical Potts measures on locally regular-tree-like expander graphs
Authors:
Hang Du,
Yanxin Zhou
Abstract:
For any integers $d,q\ge 3$, we consider the $q$-state ferromagnetic Potts model with an external field on a sequence of expander graphs that converges to the $d$-regular tree $\mathtt{T}_d$ in the Benjamini-Schramm sense. We show that along the critical line, any subsequential local weak limit of the Potts measures is a mixture of the free and wired Potts Gibbs measures on $\mathtt{T}_d$. Furthermore, we show the possibility of an arbitrary extent of strong phase coexistence: for any $\alpha\in [0,1]$, there exists a sequence of locally $\mathtt{T}_d$-like expander graphs $\{G_n\}$, such that the Potts measures on $\{G_n\}$ converge locally weakly to the $(\alpha,1-\alpha)$-mixture of the free and wired Potts Gibbs measures. Our result extends results of \cite{HJP23} which restrict to the zero-field case and also require $q$ to be sufficiently large relative to $d$, and results of \cite{BDS23} which restrict to the even $d$ case. We also confirm the phase coexistence prediction of \cite{BDS23}, asserting that the Potts local weak limit is a genuine mixture of the free and wired states in a generic setting. We further characterize the subsequential local weak limits of random cluster measures on such graph sequences, for any cluster parameter $q>2$ (not necessarily integer).
Submitted 30 May, 2025;
originally announced May 2025.
-
Beyond the LUMIR challenge: The pathway to foundational registration models
Authors:
Junyu Chen,
Shuwen Wei,
Joel Honkamaa,
Pekka Marttinen,
Hang Zhang,
Min Liu,
Yichao Zhou,
Zuopeng Tan,
Zhuoyuan Wang,
Yi Wang,
Hongchao Zhou,
Shunbo Hu,
Yi Zhang,
Qian Tao,
Lukas Förner,
Thomas Wendler,
Bailiang Jian,
Benedikt Wiestler,
Tim Hable,
Jin Kim,
Dan Ruan,
Frederic Madesta,
Thilo Sentker,
Wiebke Heyer,
Lianrui Zuo
, et al. (11 additional authors not shown)
Abstract:
Medical image challenges have played a transformative role in advancing the field, catalyzing algorithmic innovation and establishing new performance standards across diverse clinical applications. Image registration, a foundational task in neuroimaging pipelines, has similarly benefited from the Learn2Reg initiative. Building on this foundation, we introduce the Large-scale Unsupervised Brain MRI Image Registration (LUMIR) challenge, a next-generation benchmark designed to assess and advance unsupervised brain MRI registration. Distinct from prior challenges that leveraged anatomical label maps for supervision, LUMIR removes this dependency by providing over 4,000 preprocessed T1-weighted brain MRIs for training without any label maps, encouraging biologically plausible deformation modeling through self-supervision. In addition to evaluating performance on 590 held-out test subjects, LUMIR introduces a rigorous suite of zero-shot generalization tasks, spanning out-of-domain imaging modalities (e.g., FLAIR, T2-weighted, T2*-weighted), disease populations (e.g., Alzheimer's disease), acquisition protocols (e.g., 9.4T MRI), and species (e.g., macaque brains). A total of 1,158 subjects and over 4,000 image pairs were included for evaluation. Performance was assessed using both segmentation-based metrics (Dice coefficient, 95th percentile Hausdorff distance) and landmark-based registration accuracy (target registration error). Across both in-domain and zero-shot tasks, deep learning-based methods consistently achieved state-of-the-art accuracy while producing anatomically plausible deformation fields. The top-performing deep learning-based models demonstrated diffeomorphic properties and inverse consistency, outperforming several leading optimization-based methods, and showing strong robustness to most domain shifts, the exception being a drop in performance on out-of-domain contrasts.
Submitted 29 May, 2025;
originally announced May 2025.
-
R-KV: Redundancy-aware KV Cache Compression for Reasoning Models
Authors:
Zefan Cai,
Wen Xiao,
Hanshi Sun,
Cheng Luo,
Yikai Zhang,
Ke Wan,
Yucheng Li,
Yeyang Zhou,
Li-Wen Chang,
Jiuxiang Gu,
Zhen Dong,
Anima Anandkumar,
Abedelkadir Asi,
Junjie Hu
Abstract:
Reasoning models have demonstrated impressive performance in self-reflection and chain-of-thought reasoning. However, they often produce excessively long outputs, leading to prohibitively large key-value (KV) caches during inference. While chain-of-thought inference significantly improves performance on complex reasoning tasks, it can also lead to reasoning failures when deployed with existing KV cache compression approaches. To address this, we propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV), a novel method specifically targeting redundant tokens in reasoning models. Our method preserves nearly 100% of the full KV cache performance using only 10% of the KV cache, substantially outperforming existing KV cache baselines, which reach only 60% of the performance. Remarkably, R-KV even achieves 105% of full KV cache performance with 16% of the KV cache. This KV-cache reduction also leads to a 90% memory saving and a 6.6X throughput over standard chain-of-thought reasoning inference. Experimental results show that R-KV consistently outperforms existing KV cache compression baselines across two mathematical reasoning datasets.
Submitted 13 June, 2025; v1 submitted 29 May, 2025;
originally announced May 2025.
-
Nonlinear Oscillatory Response of Automated Vehicle Car-following: Theoretical Analysis with Traffic State and Control Input Limits
Authors:
Sixu Li,
Yang Zhou
Abstract:
This paper presents a framework grounded in the theory of describing function (DF) and incremental-input DF to theoretically analyze the nonlinear oscillatory response of automated vehicle (AV) car-following (CF) amidst traffic oscillations, considering the limits of traffic state and control input. While prevailing approaches largely ignore these limits (i.e., saturation of acceleration/deceleration and speed) and focus on linear string stability analysis, this framework establishes a basis for theoretically analyzing the frequency response of AV systems with nonlinearities imposed by these limits. To this end, trajectories of CF pairs are decomposed into nominal and oscillatory trajectories; subsequently, the controlled AV system is repositioned within the oscillatory trajectory coordinates. Built on this base, DFs are employed to approximate the frequency responses of nonlinear saturation components by using their first harmonic output, thereby capturing the associated amplification ratio and phase shift. Considering the closed-loop nature of AV control systems, where system states and control input mutually influence each other, amplification ratios and phase shifts are balanced within the loop to ensure consistency. This balancing process may render multiple solutions, hence the incremental-input DF is further applied to identify the reasonable ones. The proposed method is validated by estimations from Simulink, and further comparisons with prevailing methods are conducted. Results confirm the alignment of our framework with Simulink results and exhibit its superior accuracy in analysis compared to the prevailing methods. Furthermore, the framework proves valuable in string stability analysis, especially when conventional linear methods offer misleading insights.
Submitted 29 May, 2025;
originally announced May 2025.
-
OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation
Authors:
Mengkang Hu,
Yuhang Zhou,
Wendong Fan,
Yuzhou Nie,
Bowei Xia,
Tao Sun,
Ziyu Ye,
Zhaoxuan Jin,
Yingru Li,
Qiguang Chen,
Zeyu Zhang,
Yifeng Wang,
Qianshuo Ye,
Bernard Ghanem,
Ping Luo,
Guohao Li
Abstract:
Large Language Model (LLM)-based multi-agent systems show promise for automating real-world tasks but struggle to transfer across domains due to their domain-specific nature. Current approaches face two critical shortcomings: they require complete architectural redesign and full retraining of all components when applied to new domains. We introduce Workforce, a hierarchical multi-agent framework that decouples strategic planning from specialized execution through a modular architecture comprising: (i) a domain-agnostic Planner for task decomposition, (ii) a Coordinator for subtask management, and (iii) specialized Workers with domain-specific tool-calling capabilities. This decoupling enables cross-domain transferability during both inference and training phases: during inference, Workforce seamlessly adapts to new domains by adding or modifying worker agents; for training, we introduce Optimized Workforce Learning (OWL), which improves generalization across domains by optimizing a domain-agnostic planner with reinforcement learning from real-world feedback. To validate our approach, we evaluate Workforce on the GAIA benchmark, covering various realistic, multi-domain agentic tasks. Experimental results demonstrate Workforce achieves open-source state-of-the-art performance (69.70%), outperforming commercial systems like OpenAI's Deep Research by 2.34%. More notably, our OWL-trained 32B model achieves 52.73% accuracy (+16.37%) and demonstrates performance comparable to GPT-4o on challenging tasks. To summarize, by enabling scalable generalization and modular domain transfer, our work establishes a foundation for the next generation of general-purpose AI assistants.
Submitted 10 June, 2025; v1 submitted 29 May, 2025;
originally announced May 2025.
-
Towards Understanding The Calibration Benefits of Sharpness-Aware Minimization
Authors:
Chengli Tan,
Yubo Zhou,
Haishan Ye,
Guang Dai,
Junmin Liu,
Zengjie Song,
Jiangshe Zhang,
Zixiang Zhao,
Yunda Hao,
Yong Xu
Abstract:
Deep neural networks have been increasingly used in safety-critical applications such as medical diagnosis and autonomous driving. However, many studies suggest that they are prone to being poorly calibrated and have a propensity for overconfidence, which may have disastrous consequences. In this paper, we show that, unlike standard training such as stochastic gradient descent, the recently proposed sharpness-aware minimization (SAM) counteracts this tendency towards overconfidence. The theoretical analysis suggests that SAM allows us to learn models that are already well-calibrated by implicitly maximizing the entropy of the predictive distribution. Inspired by this finding, we further propose a variant of SAM, coined as CSAM, to ameliorate model calibration. Extensive experiments on various datasets, including ImageNet-1K, demonstrate the benefits of SAM in reducing calibration error. Meanwhile, CSAM performs even better than SAM and consistently achieves lower calibration error than other approaches.
Submitted 29 May, 2025;
originally announced May 2025.
-
ValueSim: Generating Backstories to Model Individual Value Systems
Authors:
Bangde Du,
Ziyi Ye,
Zhijing Wu,
Jankowska Monika,
Shuqi Zhu,
Qingyao Ai,
Yujia Zhou,
Yiqun Liu
Abstract:
As Large Language Models (LLMs) continue to exhibit increasingly human-like capabilities, aligning them with human values has become critically important. Contemporary advanced techniques, such as prompt learning and reinforcement learning, are being deployed to better align LLMs with human values. However, while these approaches address broad ethical considerations and helpfulness, they rarely focus on simulating individualized human value systems. To address this gap, we present ValueSim, a framework that simulates individual values through the generation of personal backstories reflecting past experiences and demographic information. ValueSim converts structured individual data into narrative backstories and employs a multi-module architecture inspired by the Cognitive-Affective Personality System to simulate individual values based on these narratives. Testing ValueSim on a self-constructed benchmark derived from the World Values Survey demonstrates an improvement in top-1 accuracy by over 10% compared to retrieval-augmented generation methods. Further analysis reveals that performance improves as additional user interaction history becomes available, indicating the model's ability to refine its persona simulation capabilities over time.
Submitted 5 June, 2025; v1 submitted 28 May, 2025;
originally announced May 2025.
-
SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models
Authors:
Zixiang Xu,
Yanbo Wang,
Yue Huang,
Jiayi Ye,
Haomin Zhuang,
Zirui Song,
Lang Gao,
Chenxi Wang,
Zhaorun Chen,
Yujun Zhou,
Sixian Li,
Wang Pan,
Yue Zhao,
Jieyu Zhao,
Xiangliang Zhang,
Xiuying Chen
Abstract:
Large language models (LLMs) are increasingly applied to socially grounded tasks, such as online community moderation, media content analysis, and social reasoning games. Success in these contexts depends on a model's social reasoning ability - the capacity to interpret social contexts, infer others' mental states, and assess the truthfulness of presented information. However, there is currently no systematic evaluation framework that comprehensively assesses the social reasoning capabilities of LLMs. Existing efforts often oversimplify real-world scenarios and consist of tasks that are too basic to challenge advanced models. To address this gap, we introduce SocialMaze, a new benchmark specifically designed to evaluate social reasoning. SocialMaze systematically incorporates three core challenges: deep reasoning, dynamic interaction, and information uncertainty. It provides six diverse tasks across three key settings: social reasoning games, daily-life interactions, and digital community platforms. Both automated and human validation are used to ensure data quality. Our evaluation reveals several key insights: models vary substantially in their ability to handle dynamic interactions and integrate temporally evolving information; models with strong chain-of-thought reasoning perform better on tasks requiring deeper inference beyond surface-level cues; and model reasoning degrades significantly under uncertainty. Furthermore, we show that targeted fine-tuning on curated reasoning examples can greatly improve model performance in complex social scenarios. The dataset is publicly available at: https://huggingface.co/datasets/MBZUAI/SocialMaze
Submitted 29 May, 2025;
originally announced May 2025.
-
Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition
Authors:
Yu Li,
Jin Jiang,
Jianhua Zhu,
Shuai Peng,
Baole Wei,
Yuxuan Zhou,
Liangcai Gao
Abstract:
Handwritten Mathematical Expression Recognition (HMER) remains a persistent challenge in Optical Character Recognition (OCR) due to the inherent freedom of symbol layout and variability in handwriting styles. Prior methods have faced performance bottlenecks, proposing isolated architectural modifications that are difficult to integrate coherently into a unified framework. Meanwhile, recent advances in pretrained vision-language models (VLMs) have demonstrated strong cross-task generalization, offering a promising foundation for developing unified solutions. In this paper, we introduce Uni-MuMER, which fully fine-tunes a VLM for the HMER task without modifying its architecture, effectively injecting domain-specific knowledge into a generalist framework. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions. Experiments on the CROHME and HME100K datasets show that Uni-MuMER achieves new state-of-the-art performance, surpassing the best lightweight specialized model SSAN by 16.31% and the top-performing VLM Gemini2.5-flash by 24.42% in the zero-shot setting. Our datasets, models, and code are open-sourced at: https://github.com/BFlameSwift/Uni-MuMER
Submitted 1 June, 2025; v1 submitted 29 May, 2025;
originally announced May 2025.
-
Measurement of the Lund plane for light- and beauty-quark jets
Authors:
LHCb collaboration,
R. Aaij,
A. S. W. Abdelmotteleb,
C. Abellan Beteta,
F. Abudinén,
T. Ackernley,
A. A. Adefisoye,
B. Adeva,
M. Adinolfi,
P. Adlarson,
C. Agapopoulou,
C. A. Aidala,
Z. Ajaltouni,
S. Akar,
K. Akiba,
P. Albicocco,
J. Albrecht,
F. Alessio,
Z. Aliouche,
P. Alvarez Cartelle,
R. Amalric,
S. Amato,
J. L. Amey,
Y. Amhis,
L. An
, et al. (1133 additional authors not shown)
Abstract:
The substructure of jets in quantum chromodynamics (QCD) has garnered significant attention with the advent of infrared- and collinear-safe clustering algorithms and observables. A key question emerging from these studies is how in-jet emissions at soft and hard energy scales, across collinear and wide angles relative to the emitter, differ with the mass of the emitting parton. The Lund jet plane (LJP) is a perturbatively well-defined substructure observable that maps the radiation pattern of jets onto a plane, visually distinguishing emissions with different kinematic properties. Comparing LJP for jets containing hadrons of low versus high mass enables the testing of QCD splitting functions from first-principles calculations across both soft and hard regimes and at different radiation angles. This article presents the first measurement of the LJP for light-quark-enriched and beauty-initiated jets at a center-of-mass energy of 13 TeV at LHCb. This marks the first direct observation of the dead-cone effect in beauty-quark jets, measured in the collinear region of the LJP.
Submitted 29 May, 2025;
originally announced May 2025.
-
MCTSr-Zero: Self-Reflective Psychological Counseling Dialogues Generation via Principles and Adaptive Exploration
Authors:
Hao Lu,
Yanchi Gu,
Haoyuan Huang,
Yulin Zhou,
Ningxin Zhu,
Chen Li
Abstract:
The integration of Monte Carlo Tree Search (MCTS) with Large Language Models (LLMs) has demonstrated significant success in structured, problem-oriented tasks. However, applying these methods to open-ended dialogues, such as those in psychological counseling, presents unique challenges. Unlike tasks with objective correctness, success in therapeutic conversations depends on subjective factors like empathetic engagement, ethical adherence, and alignment with human preferences, for which strict "correctness" criteria are ill-defined. Existing result-oriented MCTS approaches can therefore produce misaligned responses. To address this, we introduce MCTSr-Zero, an MCTS framework designed for open-ended, human-centric dialogues. Its core innovation is "domain alignment", which shifts the MCTS search objective from predefined end-states towards conversational trajectories that conform to target domain principles (e.g., empathy in counseling). Furthermore, MCTSr-Zero incorporates "Regeneration" and "Meta-Prompt Adaptation" mechanisms to substantially broaden exploration by allowing the MCTS to consider fundamentally different initial dialogue strategies. We evaluate MCTSr-Zero in psychological counseling by generating multi-turn dialogue data, which is used to fine-tune an LLM, PsyLLM. We also introduce PsyEval, a benchmark for assessing multi-turn psychological counseling dialogues. Experiments demonstrate that PsyLLM achieves state-of-the-art performance on PsyEval and other relevant metrics, validating MCTSr-Zero's effectiveness in generating high-quality, principle-aligned conversational data for human-centric domains and addressing the LLM challenge of consistently adhering to complex psychological standards.
Submitted 29 May, 2025;
originally announced May 2025.
-
Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism
Authors:
Jinhui Wei,
Ye Huang,
Yuhui Zhou,
Jiazhi Jiang,
Jiangsu Du,
Yutong Lu
Abstract:
In-situ LLM inference on end-user devices has gained significant interest due to its privacy benefits and reduced dependency on external infrastructure. However, as the decoding process is memory-bandwidth-bound, the diverse processing units in modern end-user devices cannot be fully exploited, resulting in slow LLM inference. This paper presents Ghidorah, an LLM inference system for end-user devices with a unified memory architecture. The key idea of Ghidorah can be summarized in two steps: 1) leveraging speculative decoding approaches to enhance parallelism, and 2) distributing workloads across multiple heterogeneous processing units to maximize computing power utilization. Ghidorah includes the hetero-core model parallelism (HCMP) architecture and the architecture-aware profiling (ARCA) approach. The HCMP architecture guides partitioning by leveraging the unified memory design of end-user devices and adapting to the hybrid computational demands of speculative decoding. The ARCA approach is used to determine the optimal speculative strategy and partitioning strategy, balancing acceptance rate with parallel capability to maximize the speedup. Additionally, we optimize sparse computation on ARM CPUs. Experimental results show that Ghidorah can achieve up to 7.6x speedup in the dominant LLM decoding phase compared to the sequential decoding approach on the NVIDIA Jetson NX.
Submitted 9 June, 2025; v1 submitted 29 May, 2025;
originally announced May 2025.
-
AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models
Authors:
Jinchuan Zhang,
Lu Yin,
Yan Zhou,
Songlin Hu
Abstract:
The acquisition of agentic capabilities has transformed LLMs from "knowledge providers" to "action executors", a trend that, while expanding LLMs' capability boundaries, significantly increases their susceptibility to malicious use. Previous work has shown that current LLM-based agents execute numerous malicious tasks even without being attacked, indicating a deficiency in agentic use safety alignment during the post-training phase. To address this gap, we propose AgentAlign, a novel framework that leverages abstract behavior chains as a medium for safety alignment data synthesis. By instantiating these behavior chains in simulated environments with diverse tool instances, our framework enables the generation of highly authentic and executable instructions while capturing complex multi-step dynamics. The framework further ensures model utility by proportionally synthesizing benign instructions through non-malicious interpretations of behavior chains, precisely calibrating the boundary between helpfulness and harmlessness. Evaluation results on AgentHarm demonstrate that fine-tuning three families of open-source models using our method substantially improves their safety (35.8% to 79.5% improvement) while minimally impacting or even positively enhancing their helpfulness, outperforming various prompting methods. The dataset and code have both been open-sourced.
Submitted 28 May, 2025;
originally announced May 2025.