-
FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression
Authors:
Jiayi Tian,
Ryan Solgi,
Jinming Lu,
Yifan Yang,
Hai Li,
Zheng Zhang
Abstract:
Large Language Models (LLMs) have enabled remarkable progress in natural language processing, yet their high computational and memory demands pose challenges for deployment in resource-constrained environments. Although recent low-rank decomposition methods offer a promising path for structural compression, they often suffer from accuracy degradation and expensive calibration procedures, and they result in inefficient model architectures that hinder real-world inference speedups. In this paper, we propose FLAT-LLM, a fast and accurate, training-free structural compression method based on fine-grained low-rank transformations in the activation space. Specifically, we reduce the hidden dimension by transforming the weights using truncated eigenvectors computed via head-wise Principal Component Analysis (PCA), and employ an importance-based metric to adaptively allocate ranks across decoders. FLAT-LLM achieves efficient and effective weight compression without recovery fine-tuning, and its calibration can be completed within a few minutes. Evaluated across 4 models and 11 datasets, FLAT-LLM outperforms structural pruning baselines in generalization and downstream performance, while delivering inference speedups over decomposition-based methods.
Submitted 29 May, 2025;
originally announced May 2025.
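The head-wise PCA step in the FLAT-LLM abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy shapes, the use of the uncentered second-moment matrix, and the choice to fold the basis into a value/output projection pair are all assumptions made for illustration.

```python
import numpy as np

def top_eigenvectors(acts, rank):
    """Top-`rank` eigenvectors of the (uncentered) second-moment matrix of `acts`."""
    second_moment = acts.T @ acts / acts.shape[0]
    _, eigvecs = np.linalg.eigh(second_moment)  # eigenvalues come back ascending
    return eigvecs[:, ::-1][:, :rank]           # reorder to descending, then truncate

def fold_basis(w_v, w_o, basis):
    """Absorb the truncated basis into two adjacent projections (hypothetical helper)."""
    return w_v @ basis, basis.T @ w_o

# Toy calibration data; all sizes are illustrative assumptions.
rng = np.random.default_rng(0)
d_model, d_head, rank, n_tokens = 16, 8, 4, 256
x = rng.normal(size=(n_tokens, d_model))
# Build a value projection of rank 4 so the head activations lie in a 4-dim subspace.
w_v = rng.normal(size=(d_model, rank)) @ rng.normal(size=(rank, d_head))
w_o = rng.normal(size=(d_head, d_model))

basis = top_eigenvectors(x @ w_v, rank)
w_v_small, w_o_small = fold_basis(w_v, w_o, basis)

full = x @ w_v @ w_o
compressed = x @ w_v_small @ w_o_small
max_err = float(np.max(np.abs(full - compressed)))
print(w_v_small.shape, w_o_small.shape, max_err)
```

Because the toy activations are exactly rank 4, the compressed pair reproduces the original output up to floating-point error; on real activations the truncation would discard the low-variance directions and introduce some error.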
-
EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast
Authors:
Shreeram Suresh Chandra,
Lucas Goncalves,
Junchen Lu,
Carlos Busso,
Berrak Sisman
Abstract:
Current emotion-based contrastive language-audio pretraining (CLAP) methods typically learn by naïvely aligning audio samples with corresponding text prompts. Consequently, this approach fails to capture the ordinal nature of emotions, hindering inter-emotion understanding and often resulting in a wide modality gap between the audio and text embeddings due to insufficient alignment. To handle these drawbacks, we introduce EmotionRankCLAP, a supervised contrastive learning approach that uses dimensional attributes of emotional speech and natural language prompts to jointly capture fine-grained emotion variations and improve cross-modal alignment. Our approach utilizes a Rank-N-Contrast objective to learn ordered relationships by contrasting samples based on their rankings in the valence-arousal space. EmotionRankCLAP outperforms existing emotion-CLAP methods in modeling emotion ordinality across modalities, measured via a cross-modal retrieval task.
Submitted 29 May, 2025;
originally announced May 2025.
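For readers unfamiliar with the Rank-N-Contrast objective named in the abstract above, here is a minimal single-modality sketch. For an anchor i and a sample j, every sample at least as far from i in label space as j is treated as a negative. The cosine similarity, the temperature value, the Euclidean distance in valence-arousal space, and the function name are assumptions; the cross-modal audio-text setup of EmotionRankCLAP is not reproduced here.

```python
import numpy as np

def rank_n_contrast_loss(embeddings, labels, temperature=0.5):
    """Rank-N-Contrast style loss over one batch (illustrative sketch).

    embeddings: (n, d) array of features.
    labels:     (n, k) array of continuous targets, e.g. valence-arousal pairs.
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    # Pairwise distances between samples in label space.
    label_dist = np.linalg.norm(labels[:, None, :] - labels[None, :, :], axis=-1)
    n = len(labels)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Denominator set: samples at least as far from anchor i as j is.
            mask = label_dist[i] >= label_dist[i, j]
            mask[i] = False
            log_denominator = np.log(np.sum(np.exp(sim[i, mask])))
            total += log_denominator - sim[i, j]
            count += 1
    return total / count

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 4))             # toy embeddings
va = rng.uniform(-1.0, 1.0, size=(6, 2))  # toy valence-arousal labels
loss = rank_n_contrast_loss(emb, va)
print(loss)
```

Each term is non-negative because sample j always belongs to its own denominator set, so the loss is bounded below by zero.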
-
Hybrid Learning for Cold-Start-Aware Microservice Scheduling in Dynamic Edge Environments
Authors:
Jingxi Lu,
Wenhao Li,
Jianxiong Guo,
Xingjian Ding,
Zhiqing Tang,
Tian Wang,
Weijia Jia
Abstract:
With the rapid growth of IoT devices and their diverse workloads, container-based microservices deployed at edge nodes have become a lightweight and scalable solution. However, existing microservice scheduling algorithms often assume static resource availability, which is unrealistic when multiple containers are assigned to an edge node. In addition, containers suffer from cold-start inefficiencies during early-stage training in currently popular reinforcement learning (RL) algorithms. In this paper, we propose a hybrid learning framework that combines offline imitation learning (IL) with online Soft Actor-Critic (SAC) optimization to enable cold-start-aware microservice scheduling with dynamic allocation of computing resources. We first formulate a delay-and-energy-aware scheduling problem and construct a rule-based expert to generate demonstration data for behavior cloning. Then, a GRU-enhanced policy network is designed to extract the correlation among multiple decisions by separately encoding slow-evolving node states and fast-changing microservice features, and an action selection mechanism is introduced to speed up convergence. Extensive experiments show that our method significantly accelerates convergence and achieves superior final performance. Compared with baselines, our algorithm improves the total objective by $50\%$ and convergence speed by $70\%$, and demonstrates the highest stability and robustness across various edge configurations.
Submitted 28 May, 2025;
originally announced May 2025.
-
Search for a dark baryon in the $Ξ^-\rightarrowπ^-+{\rm invisible}$ decay
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere,
A. Brueggemann,
H. Cai
, et al. (697 additional authors not shown)
Abstract:
A search for a dark baryon is performed for the first time in the two-body decay $Ξ^-\rightarrowπ^-+{\rm invisible}$ using $(10.087\pm0.044)\times10^{9}$ $J/ψ$ events collected at a center-of-mass energy of $\sqrt{s}=3.097\,\mbox{GeV}$ with the BESIII detector at the BEPCII collider. No significant signal is observed, and the 90% (95%) confidence level upper limits on the branching fraction $B(Ξ^-\rightarrowπ^-+{\rm invisible})$ are determined to be $4.2\times10^{-5}$ ($5.2\times10^{-5}$), $6.9\times10^{-5}$ ($8.4\times10^{-5}$), $6.5\times10^{-4}$ ($7.6\times10^{-4}$), $1.1\times10^{-4}$ ($1.3\times10^{-4}$) and $4.5\times10^{-5}$ ($5.5\times10^{-5}$), under the dark baryon mass hypotheses of 1.07$\,\mbox{GeV}/c^2$, 1.10$\,\mbox{GeV}/c^2$, $m_Λ$ (1.116$\,\mbox{GeV}/c^2$), 1.13$\,\mbox{GeV}/c^2$, and 1.16$\,\mbox{GeV}/c^2$, respectively. The constraints obtained on the Wilson coefficients $C_{u s, s}^L$ and $C_{u s, s}^R$ are more stringent than the previous limits derived from the LHC searches for the colored mediators.
Submitted 28 May, 2025;
originally announced May 2025.
-
CoderAgent: Simulating Student Behavior for Personalized Programming Learning with Large Language Models
Authors:
Yi Zhan,
Qi Liu,
Weibo Gao,
Zheng Zhang,
Tianfu Wang,
Shuanghong Shen,
Junyu Lu,
Zhenya Huang
Abstract:
Personalized programming tutoring, such as exercise recommendation, can enhance learners' efficiency, motivation, and outcomes, which is increasingly important in modern digital education. However, the lack of sufficient and high-quality programming data, combined with the mismatch between offline evaluation and real-world learning, hinders the practical deployment of such systems. To address this challenge, many approaches attempt to simulate learner practice data, yet they often overlook the fine-grained, iterative nature of programming learning, resulting in a lack of interpretability and granularity. To fill this gap, we propose an LLM-based agent, CoderAgent, to simulate students' programming processes in a fine-grained manner without relying on real data. Specifically, we equip each human learner with an intelligent agent, the core of which lies in capturing the cognitive states of the human programming practice process. Inspired by ACT-R, a cognitive architecture framework, we design the structure of CoderAgent to align with human cognitive architecture by focusing on the mastery of programming knowledge and the application of coding ability. Recognizing the inherent patterns in multi-layered cognitive reasoning, we introduce the Programming Tree of Thought (PTOT), which breaks down the process into four steps: why, how, where, and what. This approach enables a detailed analysis of iterative problem-solving strategies. Finally, experimental evaluations on real-world datasets demonstrate that CoderAgent provides interpretable insights into learning trajectories and achieves accurate simulations, paving the way for personalized programming education.
Submitted 26 May, 2025;
originally announced May 2025.
-
Reasoning LLMs are Wandering Solution Explorers
Authors:
Jiahao Lu,
Ziwei Xu,
Mohan Kankanhalli
Abstract:
Large Language Models (LLMs) have demonstrated impressive reasoning abilities through test-time computation (TTC) techniques such as chain-of-thought prompting and tree-based reasoning. However, we argue that current reasoning LLMs (RLLMs) lack the ability to systematically explore the solution space. This paper formalizes what constitutes systematic problem solving and identifies common failure modes that reveal reasoning LLMs to be wanderers rather than systematic explorers. Through qualitative and quantitative analysis across multiple state-of-the-art LLMs, we uncover persistent issues: invalid reasoning steps, redundant explorations, hallucinated or unfaithful conclusions, and so on. Our findings suggest that current models' performance can appear to be competent on simple tasks yet degrade sharply as complexity increases. Based on the findings, we advocate for new metrics and tools that evaluate not just final outputs but the structure of the reasoning process itself.
Submitted 26 May, 2025;
originally announced May 2025.
-
On Path to Multimodal Historical Reasoning: HistBench and HistAgent
Authors:
Jiahao Qiu,
Fulian Xiao,
Yimin Wang,
Yuchen Mao,
Yijia Chen,
Xinzhe Juan,
Siran Wang,
Xuan Qi,
Tongcheng Zhang,
Zixin Yao,
Jiacheng Guo,
Yifu Lu,
Charles Argon,
Jundi Cui,
Daixin Chen,
Junran Zhou,
Shuyao Zhou,
Zhanpeng Zhou,
Ling Yang,
Shilong Liu,
Hongru Wang,
Kaixuan Huang,
Xun Jiang,
Yuming Cao,
Yue Chen
, et al. (73 additional authors not shown)
Abstract:
Recent advances in large language models (LLMs) have led to remarkable progress across domains, yet their capabilities in the humanities, particularly history, remain underexplored. Historical reasoning poses unique challenges for AI, involving multimodal source interpretation, temporal inference, and cross-linguistic analysis. While general-purpose agents perform well on many existing benchmarks, they lack the domain-specific expertise required to engage with historical materials and questions. To address this gap, we introduce HistBench, a new benchmark of 414 high-quality questions designed to evaluate AI's capacity for historical reasoning and authored by more than 40 expert contributors. The tasks span a wide range of historical problems, from factual retrieval based on primary sources to interpretive analysis of manuscripts and images, to interdisciplinary challenges involving archaeology, linguistics, or cultural history. Furthermore, the benchmark dataset spans 29 ancient and modern languages and covers a wide range of historical periods and world regions. Having observed the poor performance of LLMs and other agents on HistBench, we further present HistAgent, a history-specific agent equipped with carefully designed tools for OCR, translation, archival search, and image understanding in history. On HistBench, HistAgent based on GPT-4o achieves an accuracy of 27.54% pass@1 and 36.47% pass@2, significantly outperforming LLMs with online search and generalist agents, including GPT-4o (18.60%), DeepSeek-R1 (14.49%) and Open Deep Research-smolagents (20.29% pass@1 and 25.12% pass@2). These results highlight the limitations of existing LLMs and generalist agents and demonstrate the advantages of HistAgent for historical reasoning.
Submitted 7 June, 2025; v1 submitted 26 May, 2025;
originally announced May 2025.
-
First measurement of $Σ^{+}n\rightarrowΛp$ and $Σ^{+}n\rightarrowΣ^{0}p$ cross-sections via $Σ^+$-nucleus scattering at an electron-positron collider
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere,
A. Brueggemann,
H. Cai
, et al. (680 additional authors not shown)
Abstract:
Using $(1.0087\pm0.0044)\times10^{10}$ $J/ψ$ events collected with the BESIII detector at the BEPCII storage ring, the reactions $Σ^{+}n\rightarrowΛp$ and $Σ^{+}n\rightarrowΣ^{0}p$ are studied, where the $Σ^{+}$ baryon is produced in the process $J/ψ\rightarrowΣ^{+}\barΣ^-$ and the neutron is a component of the $^9\rm{Be}$, $^{12}\rm{C}$ and $^{197}\rm{Au}$ nuclei in the beam pipe. Clear signals of these two reactions are observed for the first time. Their cross-sections are measured to be $σ(Σ^{+}+{^9\rm{Be}}\rightarrowΛ+p+{^8\rm{Be}})=(45.2\pm12.1_{\rm{stat}}\pm7.2_{\rm{sys}})$ mb and $σ(Σ^{+}+{^9\rm{Be}}\rightarrowΣ^{0}+p+{^8\rm{Be}})=(29.8\pm9.7_{\rm{stat}}\pm6.9_{\rm{sys}})$ mb for a $Σ^{+}$ average momentum of $0.992$ GeV/$c$, within a range of $\pm0.015$ GeV/$c$. This is the first study of $Σ^{+}$-nucleon scattering at an electron-positron collider.
Submitted 26 May, 2025;
originally announced May 2025.
-
A Lightweight Hybrid Dual Channel Speech Enhancement System under Low-SNR Conditions
Authors:
Zheng Wang,
Xiaobin Rong,
Yu Sun,
Tianchi Sun,
Zhibin Lin,
Jing Lu
Abstract:
Although deep learning based multi-channel speech enhancement has achieved significant advancements, its practical deployment is often limited by constrained computational resources, particularly in low signal-to-noise ratio (SNR) conditions. In this paper, we propose a lightweight hybrid dual-channel speech enhancement system that combines independent vector analysis (IVA) with a modified version of the dual-channel grouped temporal convolutional recurrent network (GTCRN). IVA functions as a coarse estimator, providing auxiliary information for both speech and noise, while the modified GTCRN further refines the speech quality. We investigate several modifications to ensure the comprehensive utilization of both original and auxiliary information. Experimental results demonstrate the effectiveness of the proposed system, achieving enhanced speech with minimal parameters and low computational complexity.
Submitted 26 May, 2025;
originally announced May 2025.
-
YOPO-Rally: A Sim-to-Real Single-Stage Planner for Off-Road Terrain
Authors:
Hongyu Cao,
Junjie Lu,
Xuewei Zhang,
Yulin Hui,
Zhiyu Li,
Bailing Tian
Abstract:
Off-road navigation remains challenging for autonomous robots due to the harsh terrain and clustered obstacles. In this letter, we extend the YOPO (You Only Plan Once) end-to-end navigation framework to off-road environments, explicitly focusing on forest terrains. The resulting system consists of a high-performance, multi-sensor supported off-road simulator YOPO-Sim, a zero-shot transfer sim-to-real planner YOPO-Rally, and an MPC controller. Built on the Unity engine, the simulator can generate randomized forest environments and export depth images and point cloud maps for expert demonstrations, providing competitive performance with mainstream simulators. Terrain Traversability Analysis (TTA) processes cost maps, generating expert trajectories represented as non-uniform cubic Hermite curves. The planner integrates TTA and pathfinding into a single neural network that takes the depth image, current velocity, and goal vector as inputs and outputs multiple trajectory candidates with costs. The planner is trained by behavior cloning in the simulator and deployed directly in the real world without fine-tuning. Finally, a series of simulated and real-world experiments is conducted to validate the performance of the proposed framework.
Submitted 24 May, 2025;
originally announced May 2025.
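The YOPO-Rally abstract above represents expert trajectories as non-uniform cubic Hermite curves. As a minimal sketch of that representation (one scalar coordinate only), the snippet below evaluates a piecewise cubic Hermite curve whose knots carry a time, a position, and a velocity, with unequal segment durations. The function name and the sample knots are hypothetical and do not come from the paper.

```python
from bisect import bisect_right

def hermite_eval(times, positions, velocities, t):
    """Evaluate a piecewise cubic Hermite curve at time t.

    times must be strictly increasing; segments may have different durations,
    which is what makes the curve non-uniform.
    """
    if not times[0] <= t <= times[-1]:
        raise ValueError("t lies outside the curve's time range")
    # Locate the segment [times[i], times[i + 1]] that contains t.
    i = min(bisect_right(times, t) - 1, len(times) - 2)
    dt = times[i + 1] - times[i]
    s = (t - times[i]) / dt  # normalized time within the segment, in [0, 1]
    # Standard cubic Hermite basis functions.
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return (h00 * positions[i] + h10 * dt * velocities[i]
            + h01 * positions[i + 1] + h11 * dt * velocities[i + 1])

# Hypothetical knots: a 1 s segment followed by a 2 s segment.
times = [0.0, 1.0, 3.0]
positions = [0.0, 1.0, 2.0]
velocities = [0.0, 0.0, 0.0]
print(hermite_eval(times, positions, velocities, 0.5))
```

Scaling the velocity terms by the segment duration `dt` keeps the tangents consistent when segments have different lengths; a 2D or 3D trajectory would apply the same evaluation per coordinate.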
-
TS-URGENet: A Three-stage Universal Robust and Generalizable Speech Enhancement Network
Authors:
Xiaobin Rong,
Dahan Wang,
Qinwen Hu,
Yushi Wang,
Yuxiang Hu,
Jing Lu
Abstract:
Universal speech enhancement aims to handle input speech with different distortions and input formats. To tackle this challenge, we present TS-URGENet, a Three-Stage Universal, Robust, and Generalizable speech Enhancement Network. To address various distortions, the proposed system employs a novel three-stage architecture consisting of a filling stage, a separation stage, and a restoration stage. The filling stage mitigates packet loss by preliminarily filling lost regions under noise interference, ensuring signal continuity. The separation stage suppresses noise, reverberation, and clipping distortion to improve speech clarity. Finally, the restoration stage compensates for bandwidth limitation, codec artifacts, and residual packet loss distortion, refining the overall speech quality. Our proposed TS-URGENet achieved outstanding performance in the Interspeech 2025 URGENT Challenge, ranking 2nd in Track 1.
Submitted 24 May, 2025;
originally announced May 2025.
-
Semantic Correspondence: Unified Benchmarking and a Strong Baseline
Authors:
Kaiyan Zhang,
Xinghui Li,
Jingyi Lu,
Kai Han
Abstract:
Establishing semantic correspondence is a challenging task in computer vision, aiming to match keypoints with the same semantic information across different images. Benefiting from the rapid development of deep learning, remarkable progress has been made over the past decade. However, a comprehensive review and analysis of this task remains absent. In this paper, we present the first extensive survey of semantic correspondence methods. We first propose a taxonomy to classify existing methods based on the type of their method designs. These methods are then categorized accordingly, and we provide a detailed analysis of each approach. Furthermore, we aggregate and summarize the results of methods in the literature across various benchmarks into a unified comparative table, with detailed configurations to highlight performance variations. Additionally, to provide a detailed understanding of existing methods for semantic matching, we conduct thorough controlled experiments to analyse the effectiveness of the components of different methods. Finally, we propose a simple yet effective baseline that achieves state-of-the-art performance on multiple benchmarks, providing a solid foundation for future research in this field. We hope this survey serves as a comprehensive reference and consolidated baseline for future development. Code is publicly available at: https://github.com/Visual-AI/Semantic-Correspondence.
Submitted 27 May, 2025; v1 submitted 23 May, 2025;
originally announced May 2025.
-
Measurement of branching fractions of $Λ_{c}^{+}$ decays to $Σ^{+} η$ and $Σ^{+} η'$
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
O. Afedulidis,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
I. Balossino,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere
, et al. (644 additional authors not shown)
Abstract:
By analyzing $e^+e^-$ collision data taken at center-of-mass energies $\sqrt{s} = 4.600 \sim 4.699$ $\mbox{GeV}$ with the BESIII detector at the BEPCII collider, corresponding to an integrated luminosity of $\rm 4.5~fb^{-1}$, we study the hadronic decays $Λ_{c}^{+} \rightarrow Σ^{+} η$ and $Λ_{c}^{+} \rightarrow Σ^{+} η^{\prime}$ using the single-tag method. The branching fraction ratio of $Λ_{c}^+ \rightarrow Σ^+ η$ relative to $Λ_{c}^+ \rightarrow Σ^+ π^0$ is determined to be $0.305 \pm 0.046_{\rm stat.} \pm 0.007_{\rm sys.}$, and that of $Λ_{c}^+ \rightarrow Σ^+ η'$ relative to $Λ_{c}^+ \rightarrow Σ^+ ω$ is $0.336 \pm 0.094_{\rm stat.} \pm 0.037_{\rm sys.}$. The ratio of $\frac{\mathcal{B}\left(Λ_{c}^{+} \rightarrow Σ^{+} η'\right)}{\mathcal{B}\left(Λ_{c}^{+} \rightarrow Σ^{+} η\right)} $ is determined to be $1.50\pm 0.48 \pm 0.17 \pm 0.21$, where the uncertainties are statistical, systematic, and from $\mathcal{B}\left(Λ_{c}^{+} \rightarrow Σ^{+} π^0\right) $ or $\mathcal{B}\left(Λ_{c}^{+} \rightarrow Σ^{+} ω\right) $, respectively. These results enrich our knowledge of charmed baryon decays.
Submitted 23 May, 2025;
originally announced May 2025.
-
ADLGen: Synthesizing Symbolic, Event-Triggered Sensor Sequences for Human Activity Modeling
Authors:
Weihang You,
Hanqi Jiang,
Zishuai Liu,
Zihang Xie,
Tianming Liu,
Jin Lu,
Fei Dou
Abstract:
Real-world collection of Activities of Daily Living (ADL) data is challenging due to privacy concerns, costly deployment and labeling, and the inherent sparsity and imbalance of human behavior. We present ADLGen, a generative framework specifically designed to synthesize realistic, event-triggered, and symbolic sensor sequences for ambient assistive environments. ADLGen integrates a decoder-only Transformer with sign-based symbolic temporal encoding, and a context- and layout-aware sampling mechanism to guide generation toward semantically rich and physically plausible sensor event sequences. To enhance semantic fidelity and correct structural inconsistencies, we further incorporate a large language model into an automatic generate-evaluate-refine loop, which verifies logical, behavioral, and temporal coherence and generates correction rules without manual intervention or environment-specific tuning. Through comprehensive experiments with novel evaluation metrics, ADLGen is shown to outperform baseline generators in statistical fidelity, semantic richness, and downstream activity recognition, offering a scalable and privacy-preserving solution for ADL data synthesis.
Submitted 23 May, 2025;
originally announced May 2025.
-
Towards Practical Defect-Focused Automated Code Review
Authors:
Junyi Lu,
Lili Jiang,
Xiaojia Li,
Jianbing Fang,
Fengjun Zhang,
Li Yang,
Chun Zuo
Abstract:
The complexity of code reviews has driven efforts to automate review comments, but prior approaches oversimplify this task by treating it as snippet-level code-to-text generation and relying on text similarity metrics like BLEU for evaluation. These methods overlook repository context, real-world merge request evaluation, and defect detection, limiting their practicality. To address these issues, we explore the full automation pipeline within the online recommendation service of a company with nearly 400 million daily active users, analyzing industry-grade C++ codebases comprising hundreds of thousands of lines of code. We identify four key challenges: 1) capturing relevant context, 2) improving key bug inclusion (KBI), 3) reducing false alarm rates (FAR), and 4) integrating human workflows. To tackle these, we propose 1) code slicing algorithms for context extraction, 2) a multi-role LLM framework for KBI, 3) a filtering mechanism for FAR reduction, and 4) a novel prompt design for better human interaction. Our approach, validated on real-world merge requests from historical fault reports, achieves a 2x improvement over standard LLMs and a 10x gain over previous baselines. While the presented results focus on C++, the underlying framework design leverages language-agnostic principles (e.g., AST-based analysis), suggesting potential for broader applicability.
Submitted 28 May, 2025; v1 submitted 23 May, 2025;
originally announced May 2025.
-
Graph Mamba for Efficient Whole Slide Image Understanding
Authors:
Jiaxuan Lu,
Junyan Shi,
Yuhui Lin,
Fang Yan,
Yue Gao,
Shaoting Zhang,
Xiaosong Wang
Abstract:
Whole Slide Images (WSIs) in histopathology present a significant challenge for large-scale medical image analysis due to their high resolution, large size, and complex tile relationships. Existing Multiple Instance Learning (MIL) methods, such as Graph Neural Networks (GNNs) and Transformer-based models, face limitations in scalability and computational cost. To bridge this gap, we propose the WSI-GMamba framework, which synergistically combines the relational modeling strengths of GNNs with the efficiency of Mamba, the State Space Model designed for sequence learning. The proposed GMamba block integrates Message Passing, Graph Scanning & Flattening, and feature aggregation via a Bidirectional State Space Model (Bi-SSM), achieving Transformer-level performance with 7× fewer FLOPs. By leveraging the complementary strengths of lightweight GNNs and Mamba, the WSI-GMamba framework delivers a scalable solution for large-scale WSI analysis, offering both high accuracy and computational efficiency for slide-level classification.
Submitted 23 May, 2025;
originally announced May 2025.
-
Unconventional tunnel magnetoresistance scaling with altermagnets
Authors:
Zongmeng Yang,
Xingyue Yang,
Jianhua Wang,
Rui Peng,
Lee Ching Hua,
Lay Kee Ang,
Jing Lu,
Yee Sin Ang,
Shibo Fang
Abstract:
In conventional magnetic tunnel junctions (MTJs), the tunnel magnetoresistance (TMR) typically increases with barrier thickness as electron transmission in the antiparallel configuration decays faster than that of the parallel configuration. In this work, we reveal an anomalous scaling effect in altermagnetic tunnel junctions (AMTJs), where the TMR decreases anomalously with increasing barrier thickness. The anomalous scaling originates from overlapping spin-split branches that form a transmission path which cannot be suppressed in the antiparallel state. This phenomenon is explained by a double-barrier model and is further demonstrated using ab initio quantum transport simulations in a 2D V2Te2O/Cr2Se2O/V2Te2O-based AMTJ, where the TMR anomalously decreases from 220% to 40% as the layer number of Cr2Se2O increases from 1 to 5. Our work identifies a peculiar, unexpected transport characteristic of AMTJs, providing a fundamental limit on AMTJ device design and illustrating the potential optimal design of AMTJs at the ultrascaled monolayer limit.
Submitted 22 May, 2025;
originally announced May 2025.
-
Non-Parametric Attenuation Curves in Local Star-Forming Galaxies: Geometry Effect, Dust Evolution, and ISS
Authors:
Jiafeng Lu,
Xi Kang,
Shiyin Shen,
Qi Zeng,
Shuai Feng
Abstract:
We introduce a non-parametric approach, the Stellar Population Synthesis with Equivalent Widths (SEW) method, to reconstruct spectral-resolution wavelength-dependent attenuation curves for 169,568 star-forming galaxies from the SDSS DR7. Composite attenuation curves, stacked across stellar mass and inclination bins, reveal systematic trends: higher stellar mass correlates with steeper attenuation slopes (lower $R_V$), while edge-on galaxies exhibit flatter curves due to geometric saturation effects. Radiative transfer modelling under a uniform dust-star mixture confirms that the observed slope evolution with inclination comes from the galaxy geometry; the slope evolution with stellar mass arises from intrinsic dust property variations, linked to mass-dependent grain processing mechanisms. Additionally, intermediate-scale structures (ISS) at 4870, 6370, and 7690 Å are tentatively detected. These findings underscore the interplay between dust geometry, grain evolution, and galactic environment, offering new insights into dust lifecycle models.
Submitted 22 May, 2025;
originally announced May 2025.
-
Recursive Offloading for LLM Serving in Multi-tier Networks
Authors:
Zhiyuan Wu,
Sheng Sun,
Yuwei Wang,
Min Liu,
Bo Gao,
Jinda Lu,
Zheming Yang,
Tian Wen
Abstract:
Heterogeneous device-edge-cloud computing infrastructures have become widely adopted by telecommunication operators and in Wide Area Networks (WANs), offering multi-tier computational support for emerging intelligent services. With the rapid proliferation of Large Language Model (LLM) services, efficiently coordinating inference tasks and reducing communication overhead within these multi-tier network architectures has become a critical deployment challenge. Existing LLM serving paradigms exhibit significant limitations: on-device deployment supports only lightweight LLMs due to hardware constraints, while cloud-centric deployment suffers from resource congestion and considerable prompt communication overhead caused by frequent service requests during peak periods. Although the model-cascading-based inference strategy adapts better to multi-tier networks, its reliance on fine-grained, manually adjusted thresholds makes it less responsive to dynamic network conditions and varying task complexities. To address these challenges, we propose RecServe, a recursive offloading framework tailored for LLM serving in multi-tier networks. RecServe integrates a task-specific hierarchical confidence evaluation mechanism that guides offloading decisions based on inferred task complexity in progressively scaled LLMs across device, edge, and cloud tiers. To further enable intelligent task routing across tiers, RecServe employs a sliding-window-based dynamic offloading strategy with quantile interpolation, enabling real-time tracking of historical confidence distributions and adaptive offloading threshold adjustments. Experiments on eight datasets demonstrate that RecServe outperforms CasServe in both service quality and communication efficiency, and reduces the communication burden by over 50% compared to centralized cloud-based serving.
Submitted 24 May, 2025; v1 submitted 22 May, 2025;
originally announced May 2025.
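The RecServe abstract above mentions a sliding-window offloading strategy with quantile interpolation. The following is only a minimal sketch of that general idea, not the authors' implementation: the class name `SlidingQuantileThreshold`, the window size, the quantile level, and the rule "offload when confidence falls below the threshold" are all illustrative assumptions.

```python
from collections import deque


class SlidingQuantileThreshold:
    """Hypothetical helper: tracks recent confidence scores in a fixed-size
    window and derives an offloading threshold as an interpolated quantile."""

    def __init__(self, window_size=100, quantile=0.3):
        if not 0.0 <= quantile <= 1.0:
            raise ValueError("quantile must lie in [0, 1]")
        # deque with maxlen drops the oldest score automatically
        self.window = deque(maxlen=window_size)
        self.quantile = quantile

    def observe(self, confidence):
        """Record the confidence score of a finished request."""
        self.window.append(confidence)

    def threshold(self):
        """Linearly interpolated quantile of the scores in the window."""
        if not self.window:
            return None
        data = sorted(self.window)
        pos = self.quantile * (len(data) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(data) - 1)
        frac = pos - lo
        return data[lo] + frac * (data[hi] - data[lo])

    def should_offload(self, confidence):
        """Assumed rule: offload to the next tier when confidence is below
        the current threshold. With an empty window, never offload."""
        t = self.threshold()
        return t is not None and confidence < t


tracker = SlidingQuantileThreshold(window_size=4, quantile=0.5)
for score in (0.2, 0.4, 0.6, 0.8):
    tracker.observe(score)
print(tracker.threshold(), tracker.should_offload(0.3))
```

Because the window is bounded, the threshold follows the recent confidence distribution rather than a manually fixed value, which is the property the abstract contrasts with hand-tuned cascading thresholds.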
-
NTIRE 2025 challenge on Text to Image Generation Model Quality Assessment
Authors:
Shuhao Han,
Haotian Fan,
Fangyuan Kong,
Wenjie Liao,
Chunle Guo,
Chongyi Li,
Radu Timofte,
Liang Li,
Tao Li,
Junhui Cui,
Yunqiu Wang,
Yang Tai,
Jingwei Sun,
Jianhui Sun,
Xinli Yue,
Tianyi Wang,
Huan Hou,
Junda Lu,
Xinyang Huang,
Zitang Zhou,
Zijian Zhang,
Xuhui Zheng,
Xuecheng Wu,
Chong Peng,
Xuezhi Cao
, et al. (90 additional authors not shown)
Abstract:
This paper reports on the NTIRE 2025 challenge on Text to Image (T2I) generation model quality assessment, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. The aim of this challenge is to address the fine-grained quality assessment of text-to-image generation models. This challenge evaluates text-to-image models from two aspects: image-text alignment and image structural distortion detection, and is divided into the alignment track and the structure track. The alignment track uses EvalMuse-40K, which contains around 40K AI-Generated Images (AIGIs) generated by 20 popular generative models. The alignment track has a total of 371 registered participants. A total of 1,883 submissions were received in the development phase, and 507 submissions were received in the test phase. Finally, 12 participating teams submitted their models and fact sheets. The structure track uses EvalMuse-Structure, which contains 10,000 AI-Generated Images (AIGIs) with corresponding structural distortion masks. A total of 211 participants registered in the structure track. A total of 1,155 submissions were received in the development phase, and 487 submissions were received in the test phase. Finally, 8 participating teams submitted their models and fact sheets. Almost all methods achieved better results than the baseline methods, and the winning methods in both tracks demonstrated superior prediction performance on T2I model quality assessment.
Submitted 22 May, 2025;
originally announced May 2025.
-
PMPO: Probabilistic Metric Prompt Optimization for Small and Large Language Models
Authors:
Chenzhuo Zhao,
Ziqian Liu,
Xingda Wang,
Junting Lu,
Chaoyi Ruan
Abstract:
Prompt optimization offers a practical and broadly applicable alternative to fine-tuning for improving large language model (LLM) performance. However, existing methods often rely on costly output generation, self-critiquing abilities, or human-annotated preferences, which limit their scalability, especially for smaller or non-instruction-tuned models. We introduce PMPO (Probabilistic Metric Prompt Optimization), a unified framework that refines prompts using token-level cross-entropy loss as a direct, lightweight evaluation signal. PMPO identifies low-quality prompt segments by masking and measuring their impact on loss, then rewrites and selects improved variants by minimizing loss over positive and negative examples. Unlike prior methods, it requires no output sampling or human evaluation during optimization, relying only on forward passes and log-likelihoods. PMPO supports both supervised and preference-based tasks through a closely aligned loss-based evaluation strategy. Experiments show that PMPO consistently outperforms prior methods across model sizes and tasks: it achieves the highest average accuracy on BBH, performs strongly on GSM8K and AQUA-RAT, and improves AlpacaEval 2.0 win rates by over 19 points. These results highlight PMPO's effectiveness, efficiency, and broad applicability.
Submitted 22 May, 2025;
originally announced May 2025.
-
Few-Shot Test-Time Optimization Without Retraining for Semiconductor Recipe Generation and Beyond
Authors:
Shangding Gu,
Donghao Ying,
Ming Jin,
Yu Joe Lu,
Jun Wang,
Javad Lavaei,
Costas Spanos
Abstract:
We introduce Model Feedback Learning (MFL), a novel test-time optimization framework for optimizing inputs to pre-trained AI models or deployed hardware systems without requiring any retraining of the models or modifications to the hardware. In contrast to existing methods that rely on adjusting model parameters, MFL leverages a lightweight reverse model to iteratively search for optimal inputs, enabling efficient adaptation to new objectives under deployment constraints. This framework is particularly advantageous in real-world settings, such as semiconductor manufacturing recipe generation, where modifying deployed systems is often infeasible or cost-prohibitive. We validate MFL on semiconductor plasma etching tasks, where it achieves target recipe generation in just five iterations, significantly outperforming both Bayesian optimization and human experts. Beyond semiconductor applications, MFL also demonstrates strong performance in chemical processes (e.g., chemical vapor deposition) and electronic systems (e.g., wire bonding), highlighting its broad applicability. Additionally, MFL incorporates stability-aware optimization, enhancing robustness to process variations and surpassing conventional supervised learning and random search methods in high-dimensional control settings. By enabling few-shot adaptation, MFL provides a scalable and efficient paradigm for deploying intelligent control in real-world environments.
Submitted 21 May, 2025;
originally announced May 2025.
-
Neural Quantum Digital Twins for Optimizing Quantum Annealing
Authors:
Jianlong Lu,
Hanqiu Peng,
Ying Chen
Abstract:
Quantum annealers have shown potential in addressing certain combinatorial optimization problems, though their performance is often limited by scalability and error rates. In this work, we propose a Neural Quantum Digital Twin (NQDT) framework that reconstructs the energy landscape of quantum many-body systems relevant to quantum annealing. The digital twin models both ground and excited state dynamics, enabling detailed simulation of the adiabatic evolution process. We benchmark NQDT on systems with known analytical solutions and demonstrate that it accurately captures key quantum phenomena, including quantum criticality and phase transitions. Leveraging this framework, one can identify optimal annealing schedules that minimize excitation-related errors. These findings highlight the utility of neural network-based digital twins as a diagnostic and optimization tool for improving the performance of quantum annealers.
Submitted 21 May, 2025;
originally announced May 2025.
-
Observation of $χ_{cJ}\to 3K_S^0K^\pmπ^\mp$
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere,
A. Brueggemann,
H. Cai
, et al. (678 additional authors not shown)
Abstract:
By analyzing $(2712.4\pm14.3)\times10^6$ $ψ(3686)$ events collected with the BESIII detector operating at the BEPCII collider, the decays $χ_{c0,1,2} \to 3K_S^0K^\pmπ^\mp$ are observed for the first time with statistical significances greater than $10σ$. The branching fractions of these decays are determined to be $\mathcal{B}(χ_{c0}\to 3K_S^0K^\pmπ^\mp )=(7.95\pm0.50\pm0.65)\times10^{-5},$ $\mathcal{B}(χ_{c1}\to 3K_S^0K^\pmπ^\mp)=(2.62\pm0.08\pm0.19)\times10^{-4},$ and $\mathcal{B}(χ_{c2}\to 3K_S^0K^\pmπ^\mp)=(1.72\pm0.07\pm0.15)\times10^{-4},$ where the first uncertainties are statistical and the second systematic.
Submitted 21 May, 2025;
originally announced May 2025.
-
Prolonged Reasoning Is Not All You Need: Certainty-Based Adaptive Routing for Efficient LLM/MLLM Reasoning
Authors:
Jinghui Lu,
Haiyang Yu,
Siliang Xu,
Shiwei Ran,
Guozhi Tang,
Siqi Wang,
Bin Shan,
Teng Fu,
Hao Feng,
Jingqun Tang,
Han Wang,
Can Huang
Abstract:
Recent advancements in reasoning have significantly enhanced the capabilities of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) across diverse tasks. However, excessive reliance on chain-of-thought (CoT) reasoning can impair model performance and produce unnecessarily long outputs, reducing efficiency. Our work reveals that prolonged reasoning does not universally improve accuracy and can even degrade performance on simpler tasks. To address this, we propose Certainty-based Adaptive Reasoning (CAR), a novel framework that dynamically switches between short answers and long-form reasoning based on the model's perplexity. CAR first generates a short answer and evaluates its perplexity, triggering reasoning only when the model exhibits low confidence (i.e., high perplexity). Experiments across diverse multimodal VQA/KIE benchmarks and text reasoning datasets show that CAR outperforms both short-answer and long-form reasoning approaches, striking an optimal balance between accuracy and efficiency.
Submitted 21 May, 2025;
originally announced May 2025.
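The CAR abstract above describes routing between a short answer and long-form reasoning according to the perplexity of the short answer. The sketch below illustrates that decision only. It assumes per-token natural-log probabilities are already available from the model, and the fixed `threshold` value and function names are hypothetical; the abstract does not say how the decision boundary is actually chosen.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities:
    exp of the negative mean log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def route(short_answer_logprobs, threshold=1.5):
    """Return 'short' to keep the short answer, or 'reason' to trigger
    long-form reasoning. The fixed threshold is an illustrative assumption."""
    if perplexity(short_answer_logprobs) > threshold:
        return "reason"  # high perplexity, i.e. low confidence
    return "short"       # low perplexity, i.e. high confidence


# Made-up token probabilities for two short answers.
confident = [math.log(0.9), math.log(0.95)]
unsure = [math.log(0.3), math.log(0.4)]
print(route(confident), route(unsure))
```

A fixed cut-off keeps the example small; any calibrated decision rule over the same perplexity value would slot into `route` in its place.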
-
Test of local realism via entangled $Λ\barΛ$ system
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
X. C. Ai,
R. Aliberti,
A. Amoroso,
M. R. An,
Q. An,
Y. Bai,
O. Bakina,
I. Balossino,
Y. Ban,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere,
A. Brueggemann
, et al. (597 additional authors not shown)
Abstract:
The non-locality of quantum correlations is a fundamental feature of quantum theory. The Bell inequality serves as a benchmark for distinguishing between predictions made by quantum theory and local hidden variable theory (LHVT). Recent advancements in photon-entanglement experiments have addressed potential loopholes and have observed significant violations of variants of the Bell inequality. However, examples of Bell inequality violations in high energy physics are scarce. In this study, we utilize $(10.087\pm0.044)\times10^{9}$ $J/ψ$ events collected with the BESIII detector at the BEPCII collider, performing non-local correlation tests using the entangled hyperon pairs. The massive-entangled $Λ\barΛ$ systems are formed and decay through strong and weak interactions, respectively. Through measurements of the angular distribution of $p\bar{p}$ in $J/ψ\to γη_c$ and subsequent $η_c\toΛ(pπ^-)\barΛ(\bar{p}π^{+})$ cascade decays, a significant violation of LHVT predictions is observed. The exclusion of LHVT is found to be statistically significant at a level exceeding $5.2σ$ in the testing of three Bell-like inequalities.
Submitted 20 May, 2025;
originally announced May 2025.
-
Self-Evolving Curriculum for LLM Reasoning
Authors:
Xiaoyin Chen,
Jiarui Lu,
Minsu Kim,
Dinghuai Zhang,
Jian Tang,
Alexandre Piché,
Nicolas Gontier,
Yoshua Bengio,
Ehsan Kamalloo
Abstract:
Reinforcement learning (RL) has proven effective for fine-tuning large language models (LLMs), significantly enhancing their reasoning abilities in domains such as mathematics and code generation. A crucial factor influencing RL fine-tuning success is the training curriculum: the order in which training problems are presented. While random curricula serve as common baselines, they remain suboptimal; manually designed curricula often rely heavily on heuristics, and online filtering methods can be computationally prohibitive. To address these limitations, we propose Self-Evolving Curriculum (SEC), an automatic curriculum learning method that learns a curriculum policy concurrently with the RL fine-tuning process. Our approach formulates curriculum selection as a non-stationary Multi-Armed Bandit problem, treating each problem category (e.g., difficulty level or problem type) as an individual arm. We leverage the absolute advantage from policy gradient methods as a proxy measure for immediate learning gain. At each training step, the curriculum policy selects categories to maximize this reward signal and is updated using the TD(0) method. Across three distinct reasoning domains: planning, inductive reasoning, and mathematics, our experiments demonstrate that SEC significantly improves models' reasoning capabilities, enabling better generalization to harder, out-of-distribution test problems. Additionally, our approach achieves better skill balance when fine-tuning simultaneously on multiple reasoning domains. These findings highlight SEC as a promising strategy for RL fine-tuning of LLMs.
Submitted 29 May, 2025; v1 submitted 20 May, 2025;
originally announced May 2025.
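The SEC abstract above frames curriculum selection as a non-stationary multi-armed bandit over problem categories, rewarded by the absolute advantage and updated with TD(0). The sketch below is a simplified illustration under stated assumptions, not the authors' code: the class name `BanditCurriculum`, the softmax (Boltzmann) sampling rule, the learning rate `alpha`, and the temperature are hypothetical choices, and the TD(0) update is written for the stateless bandit case, where it reduces to an exponential moving average of the reward.

```python
import math
import random


class BanditCurriculum:
    """Hypothetical curriculum policy: one bandit arm per problem category."""

    def __init__(self, categories, alpha=0.5, temperature=1.0, seed=0):
        self.q = {c: 0.0 for c in categories}  # estimated learning gain per arm
        self.alpha = alpha
        self.temperature = temperature
        self.rng = random.Random(seed)

    def probabilities(self):
        """Softmax over arm values (an assumed sampling rule)."""
        m = max(self.q.values())  # subtract the max for numerical stability
        weights = {c: math.exp((v - m) / self.temperature)
                   for c, v in self.q.items()}
        total = sum(weights.values())
        return {c: w / total for c, w in weights.items()}

    def sample(self):
        """Draw the category to train on next."""
        probs = self.probabilities()
        cats = list(probs)
        return self.rng.choices(cats, weights=[probs[c] for c in cats], k=1)[0]

    def update(self, category, advantages):
        """Reward is the mean absolute advantage of the sampled rollouts;
        the value update is Q <- Q + alpha * (reward - Q)."""
        reward = sum(abs(a) for a in advantages) / len(advantages)
        self.q[category] += self.alpha * (reward - self.q[category])
        return reward


curriculum = BanditCurriculum(["easy", "medium", "hard"])
chosen = curriculum.sample()
curriculum.update(chosen, advantages=[0.8, -0.6, 0.1])  # made-up advantages
print(chosen, curriculum.probabilities())
```

The constant step size `alpha` lets older rewards decay, which suits the non-stationary setting in which a category's learning gain changes as the model improves.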
-
UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation
Authors:
Rui Tian,
Mingfei Gao,
Mingze Xu,
Jiaming Hu,
Jiasen Lu,
Zuxuan Wu,
Yinfei Yang,
Afshin Dehghan
Abstract:
We introduce UniGen, a unified multimodal large language model (MLLM) capable of image understanding and generation. We study the full training pipeline of UniGen from a data-centric perspective, including multi-stage pre-training, supervised fine-tuning, and direct preference optimization. More importantly, we propose a new Chain-of-Thought Verification (CoT-V) strategy for test-time scaling, which significantly boosts UniGen's image generation quality using a simple Best-of-N test-time strategy. Specifically, CoT-V enables UniGen to act as both image generator and verifier at test time, assessing the semantic alignment between a text prompt and its generated image in a step-by-step CoT manner. Trained entirely on open-source datasets across all stages, UniGen achieves state-of-the-art performance on a range of image understanding and generation benchmarks, with a final score of 0.78 on GenEval and 85.19 on DPG-Bench. Through extensive ablation studies, our work provides actionable insights and addresses key challenges in the full life cycle of building unified MLLMs, contributing meaningful directions for future research.
Submitted 20 May, 2025;
originally announced May 2025.
-
Hunyuan-Game: Industrial-grade Intelligent Game Creation Model
Authors:
Ruihuang Li,
Caijin Zhou,
Shoujian Zheng,
Jianxiang Lu,
Jiabin Huang,
Comi Chen,
Junshu Tang,
Guangzheng Xu,
Jiale Tao,
Hongmei Wang,
Donghao Li,
Wenqing Yu,
Senbo Wang,
Zhimin Li,
Yetshuan Shi,
Haoyu Yang,
Yukun Wang,
Wenxun Dai,
Jiaqi Li,
Linqing Wang,
Qixun Wang,
Zhiyong Xu,
Yingfang Zhang,
Jiangfeng Xiong,
Weijie Kong
, et al. (33 additional authors not shown)
Abstract:
Intelligent game creation represents a transformative advancement in game development, utilizing generative artificial intelligence to dynamically generate and enhance game content. Despite notable progress in generative models, the comprehensive synthesis of high-quality game assets, including both images and videos, remains a challenging frontier. To create high-fidelity game content that simultaneously aligns with player preferences and significantly boosts designer efficiency, we present Hunyuan-Game, an innovative project designed to revolutionize intelligent game production. Hunyuan-Game encompasses two primary branches: image generation and video generation. The image generation component is built upon a vast dataset comprising billions of game images, leading to the development of a group of customized image generation models tailored for game scenarios: (1) General Text-to-Image Generation. (2) Game Visual Effects Generation, involving text-to-effect and reference image-based game visual effect generation. (3) Transparent Image Generation for characters, scenes, and game visual effects. (4) Game Character Generation based on sketches, black-and-white images, and white models. The video generation component is built upon a comprehensive dataset of millions of game and anime videos, leading to the development of five core algorithmic models, each targeting critical pain points in game development and having robust adaptation to diverse game video scenarios: (1) Image-to-Video Generation. (2) 360 A/T Pose Avatar Video Synthesis. (3) Dynamic Illustration Generation. (4) Generative Video Super-Resolution. (5) Interactive Game Video Generation. These image and video generation models not only exhibit high-level aesthetic expression but also deeply integrate domain-specific knowledge, establishing a systematic understanding of diverse game and anime art styles.
Submitted 28 May, 2025; v1 submitted 20 May, 2025;
originally announced May 2025.
-
Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting
Authors:
Hao Feng,
Shu Wei,
Xiang Fei,
Wei Shi,
Yingdong Han,
Lei Liao,
Jinghui Lu,
Binghong Wu,
Qi Liu,
Chunhui Lin,
Jingqun Tang,
Hao Liu,
Can Huang
Abstract:
Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Current approaches either assemble specialized expert models or directly generate page-level content autoregressively, facing integration overhead, efficiency bottlenecks, and layout structure degradation despite their decent performance. To address these limitations, we present Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting), a novel multimodal document image parsing model following an analyze-then-parse paradigm. In the first stage, Dolphin generates a sequence of layout elements in reading order. These heterogeneous elements, serving as anchors and coupled with task-specific prompts, are fed back to Dolphin for parallel content parsing in the second stage. To train Dolphin, we construct a large-scale dataset of over 30 million samples, covering multi-granularity parsing tasks. Through comprehensive evaluations on both prevalent benchmarks and self-constructed ones, Dolphin achieves state-of-the-art performance across diverse page-level and element-level settings, while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism. The code and pre-trained models are publicly available at https://github.com/ByteDance/Dolphin
Submitted 20 May, 2025;
originally announced May 2025.
-
Inverse-Designed Silicon Nitride Nanophotonics
Authors:
Toby Bi,
Shuangyou Zhang,
Egemen Bostan,
Danxian Liu,
Aditya Paul,
Olga Ohletz,
Irina Harder,
Yaojing Zhang,
Alekhya Ghosh,
Abdullah Alabbadi,
Masoud Kheyri,
Tianyi Zeng,
Jesse Lu,
Kiyoul Yang,
Pascal Del'Haye
Abstract:
Silicon nitride photonics has enabled integration of a variety of components for applications in linear and nonlinear optics, including telecommunications, optical clocks, astrocombs, bio-sensing, and LiDAR. With the advent of inverse design - where desired device performance is specified and closely achieved through iterative, gradient-based optimization - and the increasing availability of silicon nitride photonics via foundries, it is now feasible to expand the photonic design library beyond the limits of traditional approaches and unlock new functionalities. In this work, we present inverse-designed photonics on a silicon nitride platform and demonstrate both the design capabilities and experimental validation of manipulating light in wavelength and spatial mode dimensions to high-Q resonators with controllable wavelength range and dispersion. Furthermore, we use these inverse-designed structures to form optical cavities that hold promise for on-chip nonlinear and quantum optics experiments.
Submitted 19 May, 2025;
originally announced May 2025.
-
Rethinking Stateful Tool Use in Multi-Turn Dialogues: Benchmarks and Challenges
Authors:
Hongru Wang,
Wenyu Huang,
Yufei Wang,
Yuanhao Xi,
Jianqiao Lu,
Huan Zhang,
Nan Hu,
Zeming Liu,
Jeff Z. Pan,
Kam-Fai Wong
Abstract:
Existing benchmarks that assess Language Models (LMs) as Language Agents (LAs) for tool use primarily focus on stateless, single-turn interactions or partial evaluations, such as tool selection in a single turn, overlooking the inherent stateful nature of interactions in multi-turn applications. To fill this gap, we propose DialogTool, a multi-turn dialogue dataset with stateful tool interactions considering the whole life cycle of tool use, across six key tasks in three stages: 1) tool creation; 2) tool utilization: tool awareness, tool selection, tool execution; and 3) role-consistent response: response generation and role play. Furthermore, we build VirtualMobile, an embodied virtual mobile evaluation environment, to simulate API calls and assess the robustness of the created APIs (we use "tools" and "APIs" interchangeably in this paper, as there are no significant differences between them). Taking advantage of these artifacts, we conduct a comprehensive evaluation of 13 distinct open- and closed-source LLMs and provide detailed analysis at each stage, revealing that existing state-of-the-art LLMs still struggle to use tools over long horizons.
Submitted 19 May, 2025;
originally announced May 2025.
-
Partial Wave Analysis of $e^{+}e^{-} \rightarrow π^{+}π^{-}J/ψ$ and Cross Section Measurement of $e^{+}e^{-} \rightarrow π^{\pm}Z_{c}(3900)^{\mp}$ from 4.1271 to 4.3583 GeV
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
O. Afedulidis,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
I. Balossino,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere
, et al. (639 additional authors not shown)
Abstract:
Based on 12.0 $\mathrm{fb^{-1}}$ of $e^{+}e^{-}$ collision data samples collected by the BESIII detector at center-of-mass energies from 4.1271 to 4.3583 GeV, a partial wave analysis is performed for the process $e^{+}e^{-} \rightarrow π^{+}π^{-}J/ψ$. The cross sections for the sub processes ${e^{+}e^{-}\rightarrowπ^{+}Z_{c}(3900)^{-}+c.c.\rightarrowπ^{+}π^{-}J/ψ}$, $f_{0}(980)(\rightarrowπ^{+}π^{-})J/ψ$, and $(π^{+}π^{-})_{\rm{S\mbox{-}wave}} J/ψ$ are measured for the first time. The mass and width of the $Z_{c}(3900)^{\pm}$ are determined to be $3884.6\pm0.7\pm3.3$ MeV/$c^{2}$ and $37.2\pm1.3\pm6.6$ MeV, respectively. The first errors are statistical and the second systematic. The final state $(π^{+}π^{-})_{\rm{S\mbox{-}wave}} J/ψ$ dominates the process $e^{+}e^{-} \rightarrow π^{+}π^{-}J/ψ$. By analyzing the cross sections of $π^{\pm}Z_{c}(3900)^{\mp}$ and $f_{0}(980)J/ψ$, $Y(4220)$ has been observed. Its mass and width are determined to be $4225.8\pm4.2\pm3.1$ MeV/$c^{2}$ and $55.3\pm9.5\pm11.1$ MeV, respectively.
Submitted 19 May, 2025;
originally announced May 2025.
-
Walking the Tightrope: Disentangling Beneficial and Detrimental Drifts in Non-Stationary Custom-Tuning
Authors:
Xiaoyu Yang,
Jie Lu,
En Yu
Abstract:
This paper uncovers a critical yet overlooked phenomenon in multi-modal large language models (MLLMs): detrimental concept drift within chain-of-thought (CoT) reasoning during non-stationary reinforcement fine-tuning (RFT), where reasoning token distributions evolve unpredictably, thereby introducing significant biases in final predictions. To address this, we are the first to establish a theoretical bridge between concept drift theory and RFT processes by formalizing CoT's autoregressive token streams as non-stationary distributions undergoing arbitrary temporal shifts. Leveraging this framework, we propose a novel counterfact-aware RFT that systematically decouples beneficial distribution adaptation from harmful concept drift through concept graph-empowered LLM experts generating counterfactual reasoning trajectories. Our solution, Counterfactual Preference Optimization (CPO), enables stable RFT in non-stationary environments, particularly within the medical domain, through custom-tuning of counterfactual-aware preference alignment. Extensive experiments demonstrate superior robustness, generalization, and coordination within RFT. We also contribute a large-scale dataset, CXR-CounterFact (CCF), comprising 320,416 meticulously curated counterfactual reasoning trajectories derived from MIMIC-CXR. Our code and data are public.
Submitted 19 May, 2025;
originally announced May 2025.
-
Advancing Sequential Numerical Prediction in Autoregressive Models
Authors:
Xiang Fei,
Jinghui Lu,
Qi Sun,
Hao Feng,
Yanjie Wang,
Wei Shi,
An-Lan Wang,
Jingqun Tang,
Can Huang
Abstract:
Autoregressive models have become the de facto choice for sequence generation tasks, but standard approaches treat digits as independent tokens and apply cross-entropy loss, overlooking the coherent structure of numerical sequences. This paper introduces Numerical Token Integrity Loss (NTIL) to address this gap. NTIL operates at two levels: (1) token-level, where it extends the Earth Mover's Distance (EMD) to preserve ordinal relationships between numerical values, and (2) sequence-level, where it penalizes the overall discrepancy between the predicted and actual sequences. This dual approach improves numerical prediction and integrates effectively with LLMs/MLLMs. Extensive experiments show significant performance improvements with NTIL.
Submitted 28 May, 2025; v1 submitted 19 May, 2025;
originally announced May 2025.
-
Modular Symmetry with Weighton
Authors:
Gui-Jun Ding,
Stephen F. King,
Jun-Nan Lu,
Ming-Hua Weng
Abstract:
We systematically develop the weighton mechanism for natural quark and charged lepton mass hierarchies in the framework of modular symmetry with a single modulus field $τ$. The weighton $φ$ is defined as a complete singlet with unit modular weight, leading to fermion mass suppression by powers of $\tildeφ$, which is the vacuum expectation value of the field scaled by a flavour cut-off. Further mass and mixing angle suppression comes from powers of the small parameter, $q\equiv e^{i2πτ}$. Assuming some fields transform as triplets under the finite modular symmetry, with general assignments for the other fields, we perform a complete analysis for the levels $N=3, 4, 5$, expressing fermion masses and mixings in terms of powers of the small parameters $\tildeφ$ and $q$. We present two examples in detail, based on the modular group $T'$, close to the CP boundary of $τ$, which can address both fermion mass and mixing hierarchies using a weighton field.
Submitted 19 May, 2025;
originally announced May 2025.
-
Learning Robust Spectral Dynamics for Temporal Domain Generalization
Authors:
En Yu,
Jie Lu,
Xiaoyu Yang,
Guangquan Zhang,
Zhen Fang
Abstract:
Modern machine learning models struggle to maintain performance in dynamic environments where temporal distribution shifts, i.e., concept drift, are prevalent. Temporal Domain Generalization (TDG) seeks to enable model generalization across evolving domains, yet existing approaches typically assume smooth incremental changes, struggling with complex real-world drifts involving long-term structure (incremental evolution/periodicity) and local uncertainties. To overcome these limitations, we introduce FreKoo, which tackles these challenges via a novel frequency-domain analysis of parameter trajectories. It leverages the Fourier transform to disentangle parameter evolution into distinct spectral bands. Specifically, low-frequency components with dominant dynamics are learned and extrapolated using the Koopman operator, robustly capturing diverse drift patterns, including both incremental and periodic ones. Simultaneously, potentially disruptive high-frequency variations are smoothed via targeted temporal regularization, preventing overfitting to transient noise and domain uncertainties. In addition, this dual spectral strategy is rigorously grounded through theoretical analysis, providing stability guarantees for the Koopman prediction, a principled Bayesian justification for the high-frequency regularization, and culminating in a multiscale generalization bound connecting spectral dynamics to improved generalization. Extensive experiments demonstrate FreKoo's significant superiority over SOTA TDG approaches, particularly excelling in real-world streaming scenarios with complex drifts and uncertainties.
Submitted 18 May, 2025;
originally announced May 2025.
-
Observation of $χ_{cJ}(J=0,1,2)\rightarrow p\bar{p}ηη$
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere,
A. Brueggemann,
H. Cai
, et al. (678 additional authors not shown)
Abstract:
Using $(2712.4\pm14.3)\times10^6$ $ψ(3686)$ events collected by the BESIII detector operating at the BEPCII storage ring, the decays $χ_{cJ}(J=0,1,2)\rightarrow p\bar{p}ηη$ are observed for the first time through the radiative transition $ψ(3686)\toγχ_{cJ}$. The statistical significances for $χ_{cJ}$ signals are all larger than 5$σ$. The branching fractions of $χ_{c0,1,2}\to p\bar{p} ηη$ are determined to be $({5.75 \pm 0.59 \pm 0.42}) \times 10^{-5}$, $({1.40 \pm 0.33 \pm 0.17}) \times 10^{-5}$, and $({2.64 \pm 0.40 \pm 0.27}) \times 10^{-5}$, respectively, where the first uncertainties are statistical and the second systematic. No evident resonant structures are found in the $p\bar{p}$ and $pη/\bar{p}η$ systems.
Submitted 18 May, 2025;
originally announced May 2025.
-
Speeding up quantum Markov processes through lifting
Authors:
Bowen Li,
Jianfeng Lu
Abstract:
We generalize the concept of non-reversible lifts for reversible diffusion processes initiated by Eberle and Lorler (2024) to quantum Markov dynamics. The lifting operation, which naturally results in hypocoercive processes, can be formally interpreted as, though not restricted to, the reverse of the overdamped limit. We prove that the $L^2$ convergence rate of the lifted process is bounded above by the square root of the spectral gap of its overdamped dynamics, indicating that the lifting approach can at most achieve a transition from diffusive to ballistic mixing speeds. Further, using the variational hypocoercivity framework based on space-time Poincare inequalities, we derive a lower bound for the convergence rate of the lifted dynamics. These findings not only offer quantitative convergence guarantees for hypocoercive quantum Markov processes but also characterize the potential and limitations of accelerating the convergence through lifting. In addition, we develop an abstract lifting framework in the Hilbert space setting applicable to any symmetric contraction $C_0$-semigroup, thereby unifying the treatment of classical and quantum dynamics. As applications, we construct optimal lifts for various detailed balanced classical and quantum processes, including the symmetric random walk on a chain, the depolarizing semigroup, Schur multipliers, and quantum Markov semigroups on group von Neumann algebras.
Submitted 17 May, 2025;
originally announced May 2025.
-
Observation of an Altered $a_{0}(980)$ Line-shape in $D^{+} \rightarrow π^{+}ηη$ due to the Triangle Loop Rescattering Effect
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere,
A. Brueggemann,
H. Cai
, et al. (705 additional authors not shown)
Abstract:
Using 20.3~${\rm fb}^{-1}$ of $e^{+}e^{-}$ collision data taken with the BESIII detector at the center-of-mass energy 3.773~GeV, we report the first amplitude analysis of the hadronic decay $D^{+} \rightarrow π^{+}ηη$. The intermediate process $D^{+} \to a_{0}(980)^{+}η, a_{0}(980)^{+} \to π^{+}η$ is observed and is found to be the only component and its branching fraction is measured to be $(3.67\pm0.12_{\mathrm{stat.}}\pm 0.06_{\mathrm{syst.}})\times 10^{-3}$. Unlike the $a_{0}(980)$ line-shape observed in the decays of charmed mesons to $a_{0}(980)π$ and in the decay $D^{0} \to a_{0}(980)^{-}e^{+}ν_{e}$, where the low-mass side of the $a_0(980)$ is wider than the high-mass side, the $a_{0}(980)$ line-shape in $D^{+} \to a_{0}(980)^{+}η$ is found to be significantly altered, with the high-mass side being wider than the low-mass side. We establish that the $a_0(980)$ line-shape arises from the triangle loop rescattering of $D^+ \to \bar{K}_0^*(1430)^0K^+ \to a_0(980)^+ η$ and $D^+ \to K_0^*(1430)^+\bar{K}^0 \to a_0(980)^+ η$ with a significance of 5.8$σ$. This is the first experimental confirmation of the triangle loop rescattering effect.
Submitted 17 May, 2025;
originally announced May 2025.
-
Model Merging in Pre-training of Large Language Models
Authors:
Yunshui Li,
Yiyuan Ma,
Shen Yan,
Chaoyi Zhang,
Jing Liu,
Jianqiao Lu,
Ziwen Xu,
Mengzhao Chen,
Minrui Wang,
Shiyi Zhan,
Jin Ma,
Xunhao Lai,
Deyi Liu,
Yao Luo,
Xingyan Bin,
Hongbin Ren,
Mingji Han,
Wenhao Hao,
Bairen Yi,
LingJun Liu,
Bole Ma,
Xiaoying Jia,
Xun Zhou,
Siyuan Qiao,
Liang Xiang
, et al. (1 additional author not shown)
Abstract:
Model merging has emerged as a promising technique for enhancing large language models, though its application in large-scale pre-training remains relatively unexplored. In this paper, we present a comprehensive investigation of model merging techniques during the pre-training process. Through extensive experiments with both dense and Mixture-of-Experts (MoE) architectures ranging from millions to over 100 billion parameters, we demonstrate that merging checkpoints trained with constant learning rates not only achieves significant performance improvements but also enables accurate prediction of annealing behavior. These improvements lead to both more efficient model development and significantly lower training costs. Our detailed ablation studies on merging strategies and hyperparameters provide new insights into the underlying mechanisms while uncovering novel applications. Through comprehensive experimental analysis, we offer the open-source community practical pre-training guidelines for effective model merging.
Submitted 22 May, 2025; v1 submitted 17 May, 2025;
originally announced May 2025.
-
GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing
Authors:
Yusu Qian,
Jiasen Lu,
Tsu-Jui Fu,
Xinze Wang,
Chen Chen,
Yinfei Yang,
Wenze Hu,
Zhe Gan
Abstract:
Editing images using natural language instructions has become a natural and expressive way to modify visual content; yet, evaluating the performance of such models remains challenging. Existing evaluation approaches often rely on image-text similarity metrics like CLIP, which lack precision. In this work, we introduce a new benchmark designed to evaluate text-guided image editing models in a more grounded manner, along two critical dimensions: (i) functional correctness, assessed via automatically generated multiple-choice questions that verify whether the intended change was successfully applied; and (ii) image content preservation, which ensures that non-targeted regions of the image remain visually consistent using an object-aware masking technique and preservation scoring. The benchmark includes over 1000 high-quality editing examples across 20 diverse content categories, each annotated with detailed editing instructions, evaluation questions, and spatial object masks. We conduct a large-scale study comparing GPT-Image-1, the latest flagship in the text-guided image editing space, against several state-of-the-art editing models, and validate our automatic metrics against human ratings. Results show that GPT-Image-1 leads in instruction-following accuracy, but often over-modifies irrelevant image regions, highlighting a key trade-off in the current model behavior. GIE-Bench provides a scalable, reproducible framework for advancing more accurate evaluation of text-guided image editing.
Submitted 16 May, 2025;
originally announced May 2025.
-
A Multi-modal Fusion Network for Terrain Perception Based on Illumination Aware
Authors:
Rui Wang,
Shichun Yang,
Yuyi Chen,
Zhuoyang Li,
Zexiang Tong,
Jianyi Xu,
Jiayi Lu,
Xinjie Feng,
Yaoguang Cao
Abstract:
Road terrains play a crucial role in ensuring the driving safety of autonomous vehicles (AVs). However, existing sensors of AVs, including cameras and LiDARs, are susceptible to variations in lighting and weather conditions, making it challenging to achieve real-time perception of road conditions. In this paper, we propose an illumination-aware multi-modal fusion network (IMF), which leverages both exteroceptive and proprioceptive perception and optimizes the fusion process based on illumination features. We introduce an illumination-perception sub-network to accurately estimate illumination features. Moreover, we design a multi-modal fusion network that dynamically adjusts the weights of different modalities according to illumination features. We enhance the optimization process by pre-training the illumination-perception sub-network and incorporating illumination loss as one of the training constraints. Extensive experiments demonstrate that the IMF achieves superior performance compared to state-of-the-art methods. Comparisons with single-modality perception methods highlight the comprehensive advantages of multi-modal fusion in accurately perceiving road terrains under varying lighting conditions. Our dataset is available at: https://github.com/lindawang2016/IMF.
Submitted 16 May, 2025;
originally announced May 2025.
-
WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?
Authors:
An-Lan Wang,
Jingqun Tang,
Liao Lei,
Hao Feng,
Qi Liu,
Xiang Fei,
Jinghui Lu,
Han Wang,
Weiwei Liu,
Hao Liu,
Yuliang Liu,
Xiang Bai,
Can Huang
Abstract:
The rapid advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced capabilities in Document Understanding. However, prevailing benchmarks like DocVQA and ChartQA predominantly comprise scanned or digital documents, inadequately reflecting the intricate challenges posed by diverse real-world scenarios, such as variable illumination and physical distortions. This paper introduces WildDoc, the inaugural benchmark designed specifically for assessing document understanding in natural environments. WildDoc incorporates a diverse set of manually captured document images reflecting real-world conditions and leverages document sources from established benchmarks to facilitate comprehensive comparisons with digital or scanned documents. Further, to rigorously evaluate model robustness, each document is captured four times under different conditions. Evaluations of state-of-the-art MLLMs on WildDoc expose substantial performance declines and underscore the models' inadequate robustness compared to traditional benchmarks, highlighting the unique challenges posed by real-world document understanding. Our project homepage is available at https://bytedance.github.io/WildDoc.
Submitted 27 May, 2025; v1 submitted 16 May, 2025;
originally announced May 2025.
-
FRET: Feature Redundancy Elimination for Test Time Adaptation
Authors:
Linjing You,
Jiabao Lu,
Xiayuan Huang,
Xiangli Nie
Abstract:
Test-Time Adaptation (TTA) aims to enhance the generalization of deep learning models when faced with test data that exhibits distribution shifts from the training data. In this context, only a pre-trained model and unlabeled test data are available, making it particularly relevant for privacy-sensitive applications. In practice, we observe that feature redundancy in embeddings tends to increase as domain shifts intensify in TTA. However, existing TTA methods often overlook this redundancy, which can hinder the model's adaptability to new data. To address this issue, we introduce Feature Redundancy Elimination for Test-time Adaptation (FRET), a novel perspective for TTA. A straightforward approach (S-FRET) is to directly minimize the feature redundancy score as an optimization objective to improve adaptation. Despite its simplicity and effectiveness, S-FRET struggles with label shifts, limiting its robustness in real-world scenarios. To mitigate this limitation, we further propose Graph-based FRET (G-FRET), which integrates a Graph Convolutional Network (GCN) with contrastive learning. This design not only reduces feature redundancy but also enhances feature discriminability in both the representation and prediction layers. Extensive experiments across multiple model architectures, tasks, and datasets demonstrate the effectiveness of S-FRET and show that G-FRET achieves state-of-the-art performance. Further analysis reveals that G-FRET enables the model to extract non-redundant and highly discriminative features during inference, thereby facilitating more robust test-time adaptation.
Submitted 15 May, 2025;
originally announced May 2025.
-
Rejoining fragmented ancient bamboo slips with physics-driven deep learning
Authors:
Jinchi Zhu,
Zhou Zhao,
Hailong Lei,
Xiaoguang Wang,
Jialiang Lu,
Jing Li,
Qianqian Tang,
Jiachen Shen,
Gui-Song Xia,
Bo Du,
Yongchao Xu
Abstract:
Bamboo slips are a crucial medium for recording ancient civilizations in East Asia and offer invaluable archaeological insights for reconstructing the Silk Road and for studying material culture exchanges and global history. However, many excavated bamboo slips have been fragmented into thousands of irregular pieces, making their rejoining a vital yet challenging step for understanding their content. Here we introduce WisePanda, a physics-driven deep learning framework designed to rejoin fragmented bamboo slips. Based on the physics of fracture and material deterioration, WisePanda automatically generates synthetic training data that captures the physical properties of bamboo fragmentations. This approach enables the training of a matching network without requiring manually paired samples, providing ranked suggestions to facilitate the rejoining process. Compared to the leading curve matching method, WisePanda increases Top-50 matching accuracy from 36% to 52%. Archaeologists using WisePanda have experienced substantial efficiency improvements (approximately 20 times faster) when rejoining fragmented bamboo slips. This research demonstrates that incorporating physical principles into deep learning models can significantly enhance their performance, transforming how archaeologists restore and study fragmented artifacts. WisePanda provides a new paradigm for addressing data scarcity in ancient artifact restoration through physics-driven machine learning.
Submitted 13 May, 2025;
originally announced May 2025.
-
Universal enveloping H-pseudoalgebras of DGP pseudoalgebras
Authors:
Ying Chen,
Jiafeng Lü,
Jiaqun Wei
Abstract:
The notion of a Poisson $H$-pseudoalgebra generalizes that of a Poisson algebra in a pseudotensor category $\mathcal{M}^{\ast}(H)$. This paper introduces an analogue of the Poisson-Ore extension for Poisson $H$-pseudoalgebras. Placing Poisson $H$-pseudoalgebras in the differential graded setting leads to the notion of differential graded Poisson $H$-pseudoalgebras (DGP pseudoalgebras, for short). DGP pseudoalgebras satisfying certain compatibility conditions are proved to be closed under tensor product. Furthermore, the universal enveloping $H$-pseudoalgebras of DGP pseudoalgebras are constructed via a $\mathcal{P}$-triple. A unique differential graded pseudoalgebra homomorphism between a universal enveloping $H$-pseudoalgebra of a DGP pseudoalgebra and a $\mathcal{P}$-triple of a DGP pseudoalgebra is obtained.
Submitted 13 May, 2025;
originally announced May 2025.
-
Observational constraints on the Kerr and its several single-parameter modified spacetimes using quasi-periodic oscillation data
Authors:
Shining Yang,
Jianbo Lu,
Wenmei Li,
Mou Xu,
Jingyang Xu
Abstract:
This paper investigates the dynamical effects of particles moving in the Kerr spacetime and its nine single-parameter modified spacetimes, including Bardeen, Ayon-Beato and Garcia (ABG), Hayward, Kerr-Newman (KN), Kerr-Taub-NUT (KTN), Braneworld Kerr (BK), Kerr-MOG, Kerr-Sen, and Perfect Fluid Dark Matter (PFDM) black holes. Using quasi-periodic oscillation (QPO) observational data, we constrain the free parameters of the ten spacetimes through $χ^2$ analysis under the relativistic precession model of QPO. We constrain the modification parameters for the nine single-parameter modified spacetimes and provide the spin and mass ranges of three microquasars within the ten spacetime models (including Kerr) at the $68\%$ confidence level (CL). The results demonstrate that, at the $68 \%$ CL, the QPO data impose stringent constraints on the free parameters, as evidenced by the narrow confidence intervals. Among them, only the KN spacetime yields a modification parameter constraint spanning both negative and positive values (encompassing the Kerr case at zero). In contrast, all other tested geometries mandate positive-definite parameters at $68 \%$ CL, demonstrating statistical deviation from the Kerr solution. This highlights the significance of exploring modifications to the Kerr spacetime. Finally, we evaluate the spacetime models using the Bayes factor and the Akaike Information Criterion (AIC). Based on the current QPO observational data, the Bayes factor analysis indicates that the ABG, Hayward, KN, BK, and Kerr-MOG spacetimes have a slight advantage over the Kerr solution, while the Bardeen, KTN, Kerr-Sen, and PFDM spacetimes are somewhat inferior to the Kerr model. In contrast, the AIC analysis shows that the Kerr spacetime remains the optimal model under the current QPO data.
Submitted 13 May, 2025;
originally announced May 2025.
-
Decoupled Multimodal Prototypes for Visual Recognition with Missing Modalities
Authors:
Jueqing Lu,
Yuanyuan Qi,
Xiaohao Yang,
Shujie Zhou,
Lan Du
Abstract:
Multimodal learning enhances deep learning models by enabling them to perceive and understand information from multiple data modalities, such as visual and textual inputs. However, most existing approaches assume the availability of all modalities, an assumption that often fails in real-world applications. Recent works have introduced learnable missing-case-aware prompts to mitigate performance degradation caused by missing modalities while reducing the need for extensive model fine-tuning. Building upon the effectiveness of missing-case-aware handling for missing modalities, we propose a novel decoupled prototype-based output head, which leverages missing-case-aware class-wise prototypes tailored for each individual modality. This approach dynamically adapts to different missing modality scenarios and can be seamlessly integrated with existing prompt-based methods. Extensive experiments demonstrate that our proposed output head significantly improves performance across a wide range of missing-modality scenarios and varying missing rates.
Submitted 13 May, 2025;
originally announced May 2025.
-
Multiqubit coherence of mixed states near event horizon
Authors:
Wen-Mei Li,
Jianbo Lu,
Shu-Min Wu
Abstract:
We study physically accessible and inaccessible N-qubit coherence of the mixed Greenberger-Horne-Zeilinger (GHZ) and W states for bosonic and fermionic fields when any $n$ ($n<N$) qubits hover over the Schwarzschild black hole. We derive a comprehensive analytical expression for the coherence of mixed N-qubit systems, taking into account both accessible and inaccessible components in the curved spacetime background. Notably, as the number of qubits increases in the mixed W state, its coherence becomes more robust against the degrading effects of Hawking radiation, even as entanglement becomes more fragile. Moreover, with increasing Hawking temperature, W-state coherence surpasses that of the GHZ state, while the entanglement of the W state remains consistently weaker than that of the GHZ state. Interestingly, in Schwarzschild spacetime, fermionic fields exhibit stronger multiqubit entanglement, while bosonic fields show greater multiqubit coherence, revealing a fundamental contrast in their behavior under strong gravity. Our study reveals how Schwarzschild spacetime reshapes quantum resource trade-offs across states, statistics, and correlations, guiding relativistic quantum information tasks.
Submitted 12 May, 2025;
originally announced May 2025.