-
MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
Authors:
Sicong Leng,
Jing Wang,
Jiaxi Li,
Hao Zhang,
Zhiqiang Hu,
Boqiang Zhang,
Yuming Jiang,
Hang Zhang,
Xin Li,
Lidong Bing,
Deli Zhao,
Wei Lu,
Yu Rong,
Aixin Sun,
Shijian Lu
Abstract:
Large multimodal reasoning models have achieved rapid progress, but their advancement is constrained by two major limitations: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms in post-training. Group Relative Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone to gradient vanishing when reward variance is low, which weakens optimization signals and impairs convergence. This work makes three contributions: (1) We propose Variance-Aware Sampling (VAS), a data selection strategy guided by Variance Promotion Score (VPS) that combines outcome variance and trajectory diversity to promote reward variance and stabilize policy optimization. (2) We release large-scale, carefully curated resources containing ~1.6M long CoT cold-start data and ~15k RL QA pairs, designed to ensure quality, difficulty, and diversity, along with a fully reproducible end-to-end training codebase. (3) We open-source a family of multimodal reasoning models in multiple scales, establishing standardized baselines for the community. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS. Comprehensive ablation studies and analyses provide further insight into the contributions of each component. In addition, we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude, with VAS serving as a practical mechanism to realize this guarantee. Our code, data, and checkpoints are available at https://github.com/LengSicong/MMR1.
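A minimal sketch of why low reward variance weakens the GRPO signal, assuming the common formulation in which each sampled response's advantage is its reward standardized within its group. The function name and the `eps` stabilizer are illustrative assumptions, not taken from the MMR1 codebase, and the sketch does not implement VPS or VAS.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (GRPO-style advantage).

    `eps` is an assumed numerical stabilizer for the zero-variance case.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# A prompt with mixed outcomes yields non-zero advantages.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# A prompt where every rollout gets the same reward yields all-zero
# advantages, so it contributes no policy-gradient signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Prompts whose rollouts all succeed or all fail behave like `uniform`, which is the motivation the abstract gives for sampling prompts that promote reward variance.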
Submitted 25 September, 2025;
originally announced September 2025.
-
GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric Reasoning
Authors:
Guizhen Chen,
Weiwen Xu,
Hao Zhang,
Hou Pong Chan,
Deli Zhao,
Anh Tuan Luu,
Yu Rong
Abstract:
Recent advancements in reinforcement learning (RL) have enhanced the reasoning abilities of large language models (LLMs), yet the impact on multimodal LLMs (MLLMs) is limited. Particularly in vision-intensive tasks like geometric reasoning, MLLMs hallucinate frequently, leading to inaccurate reasoning. We attribute this to the perceptual bottleneck in MLLMs, which caps the benefits of reasoning training. To quantify this, we design a Geo-Perception Question-Answering (GeoPQA) benchmark, targeting basic geometric concepts and spatial relationships. Experiments on GeoPQA reveal significant shortcomings of MLLMs in visual perception, which constrain RL reward signals for effective training. To address this bottleneck, we propose a two-stage RL training framework by first enhancing the visual perception of geometric structures, then fostering reasoning capabilities. Applied to Qwen2.5-VL-3B-Instruct, our two-stage training improves geometric reasoning by 9.7% and geometric problem solving by 9.1%, compared to the direct reasoning training approach. Our method also generalizes to other vision-intensive domains like figure understanding, highlighting the importance of perceptual grounding in effective MLLM reasoning.
Submitted 22 September, 2025;
originally announced September 2025.
-
AudioGenie-Reasoner: A Training-Free Multi-Agent Framework for Coarse-to-Fine Audio Deep Reasoning
Authors:
Yan Rong,
Chenxing Li,
Dong Yu,
Li Liu
Abstract:
Audio deep reasoning is a challenging task that requires expert-level perception, multi-step logical inference, and the integration of contextual knowledge. However, existing models suffer from a gap between audio perception and reasoning abilities due to the lack of training data with explicit reasoning chains and the absence of mechanisms for active exploration and iterative refinement. To address these challenges, we propose AudioGenie-Reasoner (AGR), the first unified training-free multi-agent system that coordinates perception and reasoning over an evolving chain of textual evidence. Our key idea is a paradigm shift that recasts audio deep reasoning as a complex text understanding task, thereby unlocking the full potential of large language models. Specifically, the design of AGR mimics the human coarse-to-fine cognitive process. It first transforms the input audio into a coarse text-based document. Then, we design a novel proactive iterative document refinement loop, featuring tool-augmented routes and specialized agents, to continuously search for missing information and augment the evidence chain in a coarse-to-fine manner until sufficient question-related information is gathered for making final predictions. Experimental results show that AGR achieves state-of-the-art (SOTA) performance over existing open-source audio deep reasoning models across various benchmarks. The code will be made publicly available.
Submitted 21 September, 2025;
originally announced September 2025.
-
Investigation of hadronic cross sections of cosmic ray carbon and oxygen on BGO from 200 GeV to 10 TeV energy at the DAMPE experiment
Authors:
F. Alemanno,
Q. An,
P. Azzarello,
F. C. T. Barbato,
P. Bernardini,
X. J. Bi,
H. Boutin,
I. Cagnoli,
M. S. Cai,
E. Casilli,
E. Catanzani,
J. Chang,
D. Y. Chen,
J. L. Chen,
Z. F. Chen,
Z. X. Chen,
P. Coppin,
M. Y. Cui,
T. S. Cui,
Y. X. Cui,
I. De Mitri,
F. de Palma,
A. Di Giovanni,
T. K. Dong,
Z. X. Dong
, et al. (122 additional authors not shown)
Abstract:
The Dark Matter Particle Explorer (DAMPE) has made significant progress in measuring the fluxes of cosmic rays. These new measurements are pivotal in advancing our understanding of the origins and propagation mechanisms of cosmic rays. The bismuth germanium oxide (BGO) calorimeter plays a crucial role in these measurements, particularly in the precise determination of cosmic ray fluxes. However, for a calorimetric experiment like DAMPE, uncertainties in hadronic models persist as a major barrier to achieving more accurate measurements of the fluxes of cosmic ray nuclei. This study centers on the measurement of the inelastic hadronic cross sections of carbon and oxygen nuclei interacting with a BGO crystal target over an extensive energy range, spanning from 200 GeV to 10 TeV. For carbon nuclei interacting with the BGO target, the cross-section measurements achieve a total relative uncertainty of less than 10% below 8 TeV. For oxygen nuclei, the same level of precision is attained below 3 TeV. Additionally, we compare the experimental results with Geant4 and FLUKA simulations to validate the accuracy and consistency of these simulation tools. Through comprehensive analysis of the inelastic hadronic interaction cross sections, this research provides validation for the hadronic interaction models used in DAMPE's cosmic-ray flux measurements.
Submitted 21 September, 2025;
originally announced September 2025.
-
Distributed Coherent Beamforming at 60 GHz Enabled by Optically-Established Coherence
Authors:
Drake Silbernagel,
Yu Rong,
Isabella Lenz,
Prithvi Hemanth,
Carl Morgenstern,
Owen Ma,
Nolan Matthews,
Nader Zaki,
Kyle W. Martin,
John D. Elgin,
Jacob Holtom,
Daniel W. Bliss,
Kimberly Frey
Abstract:
We implement and experimentally demonstrate a 60 GHz distributed system leveraging an optical time synchronization system that provides precise time and frequency alignment between independent elements of the distributed mesh. Utilizing such accurate coherence, we perform receive beamforming with interference rejection and transmit nulling. In these configurations, the system achieves a coherent gain over an incoherent network of N nodes, significantly improving the relevant signal power ratios. Our system demonstrates extended array phase coherence times, enabling advanced techniques. Results from over-the-air experiments demonstrate a 14.3 dB signal-to-interference-plus-noise improvement in interference-laden scenarios with a contributing 13.5 dB null towards interference in receive beamforming. In transmit nulling, a signal-to-noise ratio (SNR) gain of 7.9 dB is measured towards an intended receiver while maintaining an SNR reduction of 8.9 dB at another receiver. These findings represent the use of distributed coherence in the V band without the use of GPS timing.
Submitted 17 September, 2025;
originally announced September 2025.
-
Scaling to Multimodal and Multichannel Heart Sound Classification: Fine-Tuning Wav2Vec 2.0 with Synthetic and Augmented Biosignals
Authors:
Milan Marocchi,
Matthew Fynn,
Kayapanda Mandana,
Yue Rong
Abstract:
Cardiovascular diseases (CVDs) are the leading cause of death worldwide, accounting for approximately 17.9 million deaths each year. Early detection is critical, creating a demand for accurate and inexpensive pre-screening methods. Deep learning has recently been applied to classify abnormal heart sounds indicative of CVDs using synchronised phonocardiogram (PCG) and electrocardiogram (ECG) signals, as well as multichannel PCG (mPCG). However, state-of-the-art architectures remain underutilised due to the limited availability of synchronised and multichannel datasets. Augmented datasets and pre-trained models provide a pathway to overcome these limitations, enabling transformer-based architectures to be trained effectively. This work combines traditional signal processing with denoising diffusion models, WaveGrad and DiffWave, to create an augmented dataset to fine-tune a Wav2Vec 2.0-based classifier on multimodal and multichannel heart sound datasets. The approach achieves state-of-the-art performance. On the Computing in Cardiology (CinC) 2016 dataset of single channel PCG, accuracy, unweighted average recall (UAR), sensitivity, specificity and Matthews correlation coefficient (MCC) reach 92.48%, 93.05%, 93.63%, 92.48%, 94.93% and 0.8283, respectively. Using the synchronised PCG and ECG signals of the training-a dataset from CinC, 93.14%, 92.21%, 94.35%, 90.10%, 95.12% and 0.8380 are achieved for accuracy, UAR, sensitivity, specificity and MCC, respectively. Using a wearable vest dataset consisting of mPCG data, the model achieves 77.13% accuracy, 74.25% UAR, 86.47% sensitivity, 62.04% specificity, and 0.5082 MCC. These results demonstrate the effectiveness of transformer-based models for CVD detection when supported by augmented datasets, highlighting their potential to advance multimodal and multichannel heart sound classification.
Submitted 25 September, 2025; v1 submitted 15 September, 2025;
originally announced September 2025.
-
Causal Emergence of Consciousness through Learned Multiscale Neural Dynamics in Mice
Authors:
Zhipeng Wang,
Yingqi Rong,
Kaiwei Liu,
Mingzhe Yang,
Jiang Zhang,
Jing He
Abstract:
Consciousness spans macroscopic experience and microscopic neuronal activity, yet linking these scales remains challenging. Prevailing theories, such as Integrated Information Theory, focus on a single scale, overlooking how causal power and its dynamics unfold across scales. Progress is constrained by scarce cross-scale data and difficulties in quantifying multiscale causality and dynamics. Here, we present a machine learning framework that infers multiscale causal variables and their dynamics from near-cellular-resolution calcium imaging in the mouse dorsal cortex. At lower levels, variables primarily aggregate input-driven information, whereas at higher levels they realize causality through metastable or saddle-point dynamics during wakefulness, collapsing into localized, stochastic dynamics under anesthesia. A one-dimensional top-level conscious variable captures the majority of causal power, yet variables across other scales also contribute substantially, giving rise to high emergent complexity in the conscious state. Together, these findings provide a multiscale causal framework that links neural activity to conscious states.
Submitted 13 September, 2025;
originally announced September 2025.
-
From Post To Personality: Harnessing LLMs for MBTI Prediction in Social Media
Authors:
Tian Ma,
Kaiyu Feng,
Yu Rong,
Kangfei Zhao
Abstract:
Personality prediction from social media posts is a critical task with diverse applications in psychology and sociology. The Myers-Briggs Type Indicator (MBTI), a popular personality inventory, has traditionally been predicted by machine learning (ML) and deep learning (DL) techniques. Recently, the success of Large Language Models (LLMs) has revealed their huge potential in understanding and inferring personality traits from social media content. However, directly exploiting LLMs for MBTI prediction faces two key challenges: the hallucination problem inherent in LLMs and the naturally imbalanced distribution of MBTI types in the population. In this paper, we propose PostToPersonality (PtoP), a novel LLM-based framework for MBTI prediction from social media posts of individuals. Specifically, PtoP leverages Retrieval-Augmented Generation with in-context learning to mitigate hallucination in LLMs. Furthermore, we fine-tune a pretrained LLM to improve model specification in MBTI understanding with synthetic minority oversampling, which mitigates class imbalance by generating synthetic samples. Experiments conducted on a real-world social media dataset demonstrate that PtoP achieves state-of-the-art performance compared with 10 ML and DL baselines.
Submitted 28 August, 2025;
originally announced September 2025.
-
Galaxy Group Spin Alignment with Cosmic Filament in the TNG Simulation
Authors:
Wei Wang,
Peng Wang,
Yu Rong,
Hao-da Wang,
Xiao-xiao Tang
Abstract:
We investigate the alignment between the spin vectors of galaxy groups and the axes of their nearest cosmic filaments using the TNG300-1 cosmological hydrodynamical simulation. By systematically analyzing a large sample of groups, we find a robust perpendicular alignment between group spin and filament orientation. Among all examined properties, only group mass and the distance to the nearest filament significantly affect the strength of this alignment: more massive groups and those closer to filaments exhibit a stronger perpendicular signal. In contrast, the alignment is largely insensitive to group richness, the stellar mass threshold used to select member galaxies, and redshift. We further quantify the bias introduced by using member galaxies as tracers of group spin, finding a typical misalignment angle of $\sim38^\circ$ between the spin measured from all dark matter particles and that inferred from member galaxies, independent of group richness or stellar mass cut. Our results provide a clear theoretical benchmark for interpreting observational measurements of spin-filament alignment and highlight the importance of considering group mass and environment. These findings help clarify the main factors influencing spin-filament alignment and provide useful context for future observational and theoretical studies of angular momentum in the cosmic web.
Submitted 27 September, 2025; v1 submitted 28 August, 2025;
originally announced August 2025.
-
AMAZe: A Multi-Agent Zero-shot Index Advisor for Relational Databases
Authors:
Zhaodonghui Li,
Haitao Yuan,
Jiachen Shi,
Hao Zhang,
Yu Rong,
Gao Cong
Abstract:
Index recommendation is one of the most important problems in database management system (DBMS) optimization. Given queries and certain index-related constraints, traditional methods rely on heuristic optimization or learning-based models to select effective indexes and improve query performance. However, heuristic optimization suffers from high computation time, and learning-based models lose generalisability due to training for different workloads and database schemas. With the recent rapid development of large language models (LLMs), methods using prompt tuning have been proposed to enhance the efficiency of index selection. However, such methods still cannot achieve state-of-the-art (SOTA) results, and preparing the index selection demonstrations is also resource-intensive. To address these issues, we propose AMAZe, a zero-shot LLM-based index advisor with a multi-agent framework. We decompose the index recommendation problem into sub-steps, including planning, selection, combination, revision, and reflection. A set of LLM-embedded agents is designed to handle each of the sub-steps. Our method utilizes high-level agents to control the index selection process and low-level agents to select and revise indexes. Through extensive experiments, we show that AMAZe not only achieves SOTA performance compared to the heuristic methods, but also outperforms learning-based and prompt-based methods with higher efficiency and better zero-shot inference ability.
Submitted 16 September, 2025; v1 submitted 21 August, 2025;
originally announced August 2025.
-
The Cosmic Dance: Observational Detection of Coherent Spin in Galaxy Clusters
Authors:
Xiao-xiao Tang,
Peng Wang,
Yu Rong,
Weiguang Cui
Abstract:
The spin of galaxy clusters encodes key information about their formation, dynamics, and the influence of large-scale structure. However, whether clusters possess statistically significant spin and how to measure it observationally remain open questions. Here, we present the first observational statistical detection of coherent spin in galaxy clusters, by using a sample of 2,170 systems with $M > 10^{14}\, M_\odot$ selected from a publicly available group catalog based on the SDSS galaxy. Cluster spin is quantified by identifying the orientation in the projected plane that maximizes the redshift difference ($ΔZ_{\rm max}$) between member galaxies in two regions divided by a trial axis. We find strong statistical evidence for coherent rotation, with the observed $ΔZ_{\rm max}$ distribution significantly exceeding randomized controls (nearly $\sim150σ$ confidence at $\sim380~\mathrm{km\,s}^{-1}$), especially in richer clusters ($N_{\mathrm{gal}} > 10$, up to $\sim300σ$). Stacked visualizations confirm the spatial segregation of redshifted and blueshifted galaxies across the rotation axis. The radial profile of the rotational velocity indicates that it increases as a function of radius. The cluster rotation speed increases with mass, from $\sim330~\mathrm{km\,s}^{-1}$ at $10^{14} M_\odot$ to $\sim800~\mathrm{km\,s}^{-1}$ at $10^{15} M_\odot$. Additionally, cluster spin tends to align parallel with the central galaxy spin and perpendicular to the nearest cosmic filament, particularly in richer systems. These results reveal significant coherent spin in galaxy clusters, shaped by both internal dynamics and large-scale structure.
Submitted 19 August, 2025;
originally announced August 2025.
-
Synthesizing Evidence: Data-Pooling as a Tool for Treatment Selection in Online Experiments
Authors:
Zhenkang Peng,
Chengzhang Li,
Ying Rong,
Renyu Zhang
Abstract:
Randomized experiments are the gold standard for causal inference but face significant challenges in business applications, including limited traffic allocation, the need for heterogeneous treatment effect estimation, and the complexity of managing overlapping experiments. These factors lead to high variability in treatment effect estimates, making data-driven policy roll-out difficult. To address these issues, we introduce the data pooling treatment roll-out (DPTR) framework, which enhances policy roll-out by pooling data across experiments rather than focusing narrowly on individual ones. DPTR can effectively accommodate both overlapping and non-overlapping traffic scenarios, regardless of linear or nonlinear model specifications. We demonstrate the framework's robustness through a three-pronged validation: (a) theoretical analysis shows that DPTR surpasses the traditional difference-in-mean and ordinary least squares methods under non-overlapping experiments, particularly when the number of experiments is large; (b) synthetic simulations confirm its adaptability in complex scenarios with overlapping traffic, rich covariates and nonlinear specifications; and (c) empirical applications to two experimental datasets from real-world platforms demonstrate its effectiveness in guiding customized policy roll-outs for subgroups within a single experiment, as well as in coordinating policy deployments across multiple experiments with overlapping scenarios. By reducing estimation variability to improve decision-making effectiveness, DPTR provides a scalable, practical solution for online platforms to better leverage their experimental data in today's increasingly complex business environments.
Submitted 15 August, 2025; v1 submitted 14 August, 2025;
originally announced August 2025.
-
VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning
Authors:
Ruifeng Yuan,
Chenghao Xiao,
Sicong Leng,
Jianyu Wang,
Long Li,
Weiwen Xu,
Hou Pong Chan,
Deli Zhao,
Tingyang Xu,
Zhongyu Wei,
Hao Zhang,
Yu Rong
Abstract:
Reinforcement learning has proven its effectiveness in enhancing the reasoning capabilities of large language models. Recent research efforts have progressively extended this paradigm to multimodal reasoning tasks. Due to the inherent complexity and diversity of multimodal tasks, especially in semantic content and problem formulations, existing models often exhibit unstable performance across various domains and difficulty levels. To address these limitations, we propose VL-Cogito, an advanced multimodal reasoning model trained via a novel multi-stage Progressive Curriculum Reinforcement Learning (PCuRL) framework. PCuRL systematically guides the model through tasks of gradually increasing difficulty, substantially improving its reasoning abilities across diverse multimodal contexts. The framework introduces two key innovations: (1) an online difficulty soft weighting mechanism, dynamically adjusting training difficulty across successive RL training stages; and (2) a dynamic length reward mechanism, which encourages the model to adaptively regulate its reasoning path length according to task complexity, thus balancing reasoning efficiency with correctness. Experimental evaluations demonstrate that VL-Cogito consistently matches or surpasses existing reasoning-oriented models across mainstream multimodal benchmarks spanning mathematics, science, logic, and general understanding, validating the effectiveness of our approach.
Submitted 31 July, 2025; v1 submitted 30 July, 2025;
originally announced July 2025.
-
Balancing Conservatism and Aggressiveness: Prototype-Affinity Hybrid Network for Few-Shot Segmentation
Authors:
Tianyu Zou,
Shengwu Xiong,
Ruilin Yao,
Yi Rong
Abstract:
This paper studies the few-shot segmentation (FSS) task, which aims to segment objects belonging to unseen categories in a query image by learning a model on a small number of well-annotated support samples. Our analysis of two mainstream FSS paradigms reveals that the predictions made by prototype learning methods are usually conservative, while those of affinity learning methods tend to be more aggressive. This observation motivates us to balance the conservative and aggressive information captured by these two types of FSS frameworks so as to improve the segmentation performance. To achieve this, we propose a Prototype-Affinity Hybrid Network (PAHNet), which introduces a Prototype-guided Feature Enhancement (PFE) module and an Attention Score Calibration (ASC) module in each attention block of an affinity learning model (called affinity learner). These two modules utilize the predictions generated by a pre-trained prototype learning model (called prototype predictor) to enhance the foreground information in support and query image representations and suppress the mismatched foreground-background (FG-BG) relationships between them, respectively. In this way, the aggressiveness of the affinity learner can be effectively mitigated, thereby eventually increasing the segmentation accuracy of our PAHNet method. Experimental results show that PAHNet outperforms most recently proposed methods across 1-shot and 5-shot settings on both PASCAL-5$^i$ and COCO-20$^i$ datasets, suggesting its effectiveness. The code is available at https://github.com/tianyu-zou/PAHNet.
Submitted 25 July, 2025;
originally announced July 2025.
-
Hierarchical Graph Information Bottleneck for Multi-Behavior Recommendation
Authors:
Hengyu Zhang,
Chunxu Shen,
Xiangguo Sun,
Jie Tan,
Yanchao Tan,
Yu Rong,
Hong Cheng,
Lingling Yi
Abstract:
In real-world recommendation scenarios, users typically engage with platforms through multiple types of behavioral interactions. Multi-behavior recommendation algorithms aim to leverage various auxiliary user behaviors to enhance prediction for target behaviors of primary interest (e.g., buy), thereby overcoming performance limitations caused by data sparsity in target behavior records. Current state-of-the-art approaches typically employ a hierarchical design following either cascading (e.g., view$\rightarrow$cart$\rightarrow$buy) or parallel (unified$\rightarrow$behavior$\rightarrow$specific components) paradigms to capture behavioral relationships. However, these methods still face two critical challenges: (1) severe distribution disparities across behaviors, and (2) negative transfer effects caused by noise in auxiliary behaviors. In this paper, we propose a novel model-agnostic Hierarchical Graph Information Bottleneck (HGIB) framework for multi-behavior recommendation to effectively address these challenges. Following information bottleneck principles, our framework optimizes the learning of compact yet sufficient representations that preserve essential information for target behavior prediction while eliminating task-irrelevant redundancies. To further mitigate interaction noise, we introduce a Graph Refinement Encoder (GRE) that dynamically prunes redundant edges through learnable edge dropout mechanisms. We conduct comprehensive experiments on three real-world public datasets, which demonstrate the superior effectiveness of our framework. Beyond these widely used datasets in the academic community, we further extend our evaluation to several real industrial scenarios and conduct online A/B testing, which again shows a significant improvement in multi-behavior recommendations. The source code of our proposed HGIB is available at https://github.com/zhy99426/HGIB.
△ Less
Submitted 21 July, 2025;
originally announced July 2025.
-
Light and heavy $Λ$ hyperclusters in nuclear matter with RMF models
Authors:
Cheng-Jun Xia,
Yu-Ting Rong,
Ting-Ting Sun
Abstract:
In the framework of RMF models, we investigate the properties of light and heavy $Λ$ hyperclusters immersed in nuclear matter at various densities $n_{\mathrm{gas}}$ and proton fractions $Y_p$. In particular, the (hyper)clusters are fixed by solving the Dirac equations imposing the Dirichlet-Neumann boundary condition, while the nuclear matter takes constant densities and is treated with the Thomas-Fermi approximation. The binding energies of (hyper)clusters decrease with the density of nuclear matter $n_{\mathrm{gas}}$, so that the (hyper)clusters eventually become unbound and melt in the presence of the nuclear medium, i.e., the Mott transition. For light clusters with proton numbers $N_p < 4$, with the addition of $Λ$ hyperons, the binding energies per baryon for $Λ$ hyperclusters become smaller and decrease faster with $n_{\mathrm{gas}}$ due to the weaker $N$-$Λ$ attraction. For heavy clusters with $N_p \geq 4$, on the contrary, the addition of $Λ$ hyperons increases the stability of (hyper)clusters so that the Mott transition density becomes larger, as nucleons occupy higher energy states while $Λ$ hyperons remain in the $1s_{1/2}$ orbital. The isovector effects on (hyper)clusters in the nuclear medium are also identified, where the binding energies for (hyper)clusters with $N_p> N_n$ ($N_p< N_n$) increase (decrease) with $Y_p$. For those predicted by nonlinear relativistic density functionals, light (hyper)clusters are destabilized drastically as $n_{\mathrm{gas}}$ increases, while the binding energies of heavier (hyper)clusters vary smoothly with $n_{\mathrm{gas}}$. By fitting the binding energy shifts to an analytical formula, the corresponding coefficients describing the in-medium properties of various (hyper)clusters are fixed, which should be useful for understanding the evolution of (hyper)clusters in both heavy-ion collisions and neutron stars.
Submitted 13 July, 2025;
originally announced July 2025.
-
DiffSpectra: Molecular Structure Elucidation from Spectra using Diffusion Models
Authors:
Liang Wang,
Yu Rong,
Tingyang Xu,
Zhenyi Zhong,
Zhiyuan Liu,
Pengju Wang,
Deli Zhao,
Qiang Liu,
Shu Wu,
Liang Wang
Abstract:
Molecular structure elucidation from spectra is a foundational problem in chemistry, with profound implications for compound identification, synthesis, and drug development. Traditional methods rely heavily on expert interpretation and lack scalability. Pioneering machine learning methods have introduced retrieval-based strategies, but their reliance on finite libraries limits generalization to novel molecules. Generative models offer a promising alternative, yet most adopt autoregressive SMILES-based architectures that overlook 3D geometry and struggle to integrate diverse spectral modalities. In this work, we present DiffSpectra, a generative framework that directly infers both 2D and 3D molecular structures from multi-modal spectral data using diffusion models. DiffSpectra formulates structure elucidation as a conditional generation process. Its denoising network is parameterized by Diffusion Molecule Transformer, an SE(3)-equivariant architecture that integrates topological and geometric information. Conditioning is provided by SpecFormer, a transformer-based spectral encoder that captures intra- and inter-spectral dependencies from multi-modal spectra. Extensive experiments demonstrate that DiffSpectra achieves high accuracy in structure elucidation, recovering exact structures with 16.01% top-1 accuracy and 96.86% top-20 accuracy through sampling. The model benefits significantly from 3D geometric modeling, SpecFormer pre-training, and multi-modal conditioning. These results highlight the effectiveness of spectrum-conditioned diffusion modeling in addressing the challenge of molecular structure elucidation. To our knowledge, DiffSpectra is the first framework to unify multi-modal spectral reasoning and joint 2D/3D generative modeling for de novo molecular structure elucidation.
Submitted 9 July, 2025;
originally announced July 2025.
-
PLACE: Prompt Learning for Attributed Community Search
Authors:
Shuheng Fang,
Kangfei Zhao,
Rener Zhang,
Yu Rong,
Jeffrey Xu Yu
Abstract:
In this paper, we propose PLACE (Prompt Learning for Attributed Community Search), an innovative graph prompt learning framework for ACS. Inspired by prompt-tuning in Natural Language Processing (NLP), where learnable prompt tokens are inserted to contextualize NLP queries, PLACE integrates structural and learnable prompt tokens into the graph as a query-dependent refinement mechanism, forming a prompt-augmented graph. Within this prompt-augmented graph structure, the learned prompt tokens serve as a bridge that strengthens connections between graph nodes for the query, enabling the GNN to more effectively identify patterns of structural cohesiveness and attribute similarity related to the specific query. We employ an alternating training paradigm to optimize both the prompt parameters and the GNN jointly. Moreover, we design a divide-and-conquer strategy to enhance scalability, enabling the model to handle million-scale graphs. Extensive experiments on 9 real-world graphs demonstrate the effectiveness of PLACE for three types of ACS queries, where PLACE achieves F1 scores that are on average 22% higher than the state of the art.
Submitted 7 July, 2025;
originally announced July 2025.
-
Aligning MLLM Benchmark With Human Preferences via Structural Equation Modeling
Authors:
Tianyu Zou,
Shengwu Xiong,
Ruilin Yao,
Jirui Huang,
Yi Rong,
Yaxiong Chen,
Shili Xiong,
Cong Wang
Abstract:
Evaluating multimodal large language models (MLLMs) remains a fundamental challenge due to a lack of structured, interpretable, and theoretically grounded benchmark designs. Existing benchmarks often adopt heuristic-based task groupings with unclear cognitive targets, resulting in overlapping abilities, redundant indicators, and limited diagnostic power. In this work, we propose a novel framework for aligning MLLM benchmarks based on Structural Equation Modeling (SEM) to analyze and quantify the internal validity, dimensional separability, and contribution of benchmark components. Motivated by the observed limitations of current designs, we further introduce a novel capability hierarchy grounded in Piaget's theory of cognitive development, dividing MLLM abilities into three hierarchical layers, i.e., Perception, Memory, and Reasoning. We reorganize existing MLLM benchmarks under the proposed framework and construct a new benchmark named Gold. Experimental results demonstrate that the proposed benchmark exhibits stronger interpretability, reduced indicator redundancy, and clearer cognitive consistency compared to existing approaches.
Submitted 13 June, 2025;
originally announced June 2025.
-
Constraints on $ΛN$ Effective Interactions from Mirror Hypernuclei in a Deformed Relativistic Hartree-Bogoliubov Model
Authors:
Yu-Ting Rong,
Dan Yang,
Cheng-Jun Xia,
Ting-Ting Sun
Abstract:
We investigate the ground-state properties of four mirror hypernuclei pairs--$^{10}_Λ$Be-$^{10}_Λ$B, $^{12}_Λ$B-$^{12}_Λ$C, $^{16}_Λ$N-$^{16}_Λ$O, and $^{40}_Λ$K-$^{40}_Λ$Ca--within the deformed relativistic Hartree-Bogoliubov framework, analyzing their connection to $ΛN$ effective interactions. Systematic calculations with eight distinct effective interactions reveal linear correlations between mirror hypernuclei in $Λ$ separation energies and charge radii. The charge symmetry breaking effects, quantified through $Λ$ separation energy differences, exhibit a positive correlation with the SU(3) flavor symmetry violation. We emphasize that constraints derived from $A=10$ and $A=12$ hypernuclear pairs must explicitly incorporate rotational energy correction effects. Precision measurements of the (near) spherical $A=16$ and $A=40$ mirror systems are proposed as critical benchmarks for refining the isospin part of the hyperon-nucleon interactions.
Submitted 16 June, 2025;
originally announced June 2025.
-
ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning
Authors:
Yu Sun,
Xingyu Qian,
Weiwen Xu,
Hao Zhang,
Chenghao Xiao,
Long Li,
Deli Zhao,
Wenbing Huang,
Tingyang Xu,
Qifeng Bai,
Yu Rong
Abstract:
Reasoning-based large language models have excelled in mathematics and programming, yet their potential in knowledge-intensive medical question answering remains underexplored and insufficiently validated in clinical contexts. To bridge this gap, we introduce ReasonMed, the largest medical reasoning dataset to date, comprising 370k high-quality examples distilled from 1.75 million initial reasoning paths generated by complementary LLMs and curated through a cost-efficient easy-medium-difficult (EMD) pipeline. ReasonMed is built through a multi-agent generation, verification, and refinement process, in which an Error Refiner improves reasoning paths by correcting error-prone steps identified by a verifier. Using ReasonMed, we investigate effective strategies for training medical reasoning models and find that integrating detailed CoT reasoning with concise answer summaries yields the most robust fine-tuning results. Models trained on ReasonMed set a new benchmark: ReasonMed-7B surpasses the prior best sub-10B models by 4.17% and even exceeds LLaMA3.1-70B on PubMedQA by 4.60%. When scaled to ReasonMed-14B, it remains highly competitive, underscoring consistent scaling potential. The codes and datasets are available at https://github.com/YuSun-Work/ReasonMed.
Submitted 22 September, 2025; v1 submitted 11 June, 2025;
originally announced June 2025.
-
The Internal Kinematics, Stellar Population, and Gas-phase Properties of The Pseudobulge in An Ultra-diffuse Galaxy: AGC721966
Authors:
Shihong Liu,
Yu Rong,
Huiyuan Wang,
Hong-Xin Zhang,
Tie Li,
Yao Yao,
Zhicheng He,
Teng Liu,
Enci Wang,
Cheng Cheng,
Xu Kong
Abstract:
Leveraging spectroscopic data from the Sloan Digital Sky Survey, we conduct a comprehensive analysis of the central stellar velocity dispersion, stellar population properties, star formation history, and gas-phase chemical abundances in AGC721966, a unique ultra-diffuse galaxy (UDG) harboring a pseudobulge. Our findings reveal that the pseudobulge formed in the early universe but underwent a recent episode of rejuvenated star formation. The system exhibits a mass-weighted (light-weighted) stellar population age of $τ_{\star}\sim 7.4\pm2.5$ ($2.9\pm1.5$)~Gyr, a stellar metallicity of [M/H]$\sim -0.62\pm0.26$ ($-0.55\pm0.20$), an $α$-element enhancement of [$α$/Fe]$\sim 0.36\pm0.09$ ($0.37\pm0.07$), and a gas-phase oxygen abundance of $12+\log(\mathrm{O/H})\sim 8.15\pm0.03$. The central stellar velocity dispersion is measured as $σ_{\rm c}\sim 57.9\pm15.7$~km/s. These results provide robust evidence supporting the early halo-halo merging formation scenario proposed by \cite{Rong25}, while unequivocally ruling out the ``failed'' $L^{\star}$ formation model, at least for AGC721966. Furthermore, through systematic application of the baryonic Tully-Fisher relation, we establish that these pseudobulge-hosting UDGs are neither misidentified nuclear star cluster-bearing dwarf galaxies nor bulge-dominated massive galaxies, thereby affirming their distinct evolutionary pathway.
Submitted 9 June, 2025;
originally announced June 2025.
-
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
Authors:
LASA Team,
Weiwen Xu,
Hou Pong Chan,
Long Li,
Mahani Aljunied,
Ruifeng Yuan,
Jianyu Wang,
Chenghao Xiao,
Guizhen Chen,
Chaoqun Liu,
Zhaodonghui Li,
Yu Sun,
Junao Shen,
Chaojun Wang,
Jie Tan,
Deli Zhao,
Tingyang Xu,
Hao Zhang,
Yu Rong
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in understanding common visual elements, largely due to their large-scale datasets and advanced training strategies. However, their effectiveness in medical applications remains limited due to the inherent discrepancies between data and tasks in medical scenarios and those in the general domain. Concretely, existing medical MLLMs face the following critical limitations: (1) limited coverage of medical knowledge beyond imaging, (2) heightened susceptibility to hallucinations due to suboptimal data curation processes, and (3) lack of reasoning capabilities tailored for complex medical scenarios. To address these challenges, we first propose a comprehensive data curation procedure that (1) efficiently acquires rich medical knowledge data not only from medical imaging but also from extensive medical texts and general-domain data; and (2) synthesizes accurate medical captions, visual question answering (VQA), and reasoning samples. As a result, we build a multimodal dataset enriched with extensive medical knowledge. Building on the curated data, we introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and progressively enhance its task-solving capabilities. In addition, we conduct a preliminary exploration of applying the reinforcement learning with verifiable rewards paradigm to enhance Lingshu's medical reasoning ability. We also develop MedEvalKit, a unified evaluation framework that consolidates leading multimodal and textual medical benchmarks for standardized, fair, and efficient model assessment. We evaluate the performance of Lingshu on three fundamental medical tasks: multimodal QA, text-based QA, and medical report generation. The results show that Lingshu consistently outperforms the existing open-source multimodal models on most tasks ...
Submitted 13 June, 2025; v1 submitted 8 June, 2025;
originally announced June 2025.
-
Disentangling Language and Culture for Evaluating Multilingual Large Language Models
Authors:
Jiahao Ying,
Wei Tang,
Yiran Zhao,
Yixin Cao,
Yu Rong,
Wenxuan Zhang
Abstract:
This paper introduces a Dual Evaluation Framework to comprehensively assess the multilingual capabilities of LLMs. By decomposing the evaluation along the dimensions of linguistic medium and cultural context, this framework enables a nuanced analysis of LLMs' ability to process questions within both native and cross-cultural contexts cross-lingually. Extensive evaluations are conducted on a wide range of models, revealing a notable "Cultural-Linguistic Synergy" phenomenon, where models exhibit better performance when questions are culturally aligned with the language. This phenomenon is further explored through interpretability probing, which shows that a higher proportion of specific neurons are activated in a language's cultural context. This activation proportion could serve as a potential indicator for evaluating multilingual performance during model training. Our findings challenge the prevailing notion that LLMs, primarily trained on English data, perform uniformly across languages and highlight the necessity of culturally and linguistically aware model evaluations. Our code can be found at https://yingjiahao14.github.io/Dual-Evaluation/.
Submitted 30 May, 2025;
originally announced May 2025.
-
HMAD: Advancing E2E Driving with Anchored Offset Proposals and Simulation-Supervised Multi-target Scoring
Authors:
Bin Wang,
Pingjun Li,
Jinkun Liu,
Jun Cheng,
Hailong Lei,
Yinze Rong,
Huan-ang Gao,
Kangliang Chen,
Xing Pan,
Weihao Gu
Abstract:
End-to-end autonomous driving faces persistent challenges in both generating diverse, rule-compliant trajectories and robustly selecting the optimal path from these options via learned, multi-faceted evaluation. To address these challenges, we introduce HMAD, a framework integrating a distinctive Bird's-Eye-View (BEV) based trajectory proposal mechanism with learned multi-criteria scoring. HMAD leverages BEVFormer and employs learnable anchored queries, initialized from a trajectory dictionary and refined via iterative offset decoding (inspired by DiffusionDrive), to produce numerous diverse and stable candidate trajectories. A key innovation, our simulation-supervised scorer module, then evaluates these proposals against critical metrics including no at-fault collisions, drivable area compliance, comfort, and overall driving quality (i.e., extended PDM score). Demonstrating its efficacy, HMAD achieves a 44.5% driving score on the CVPR 2025 private test set. This work highlights the benefits of effectively decoupling robust trajectory generation from comprehensive, safety-aware learned scoring for advanced autonomous driving.
Submitted 29 May, 2025;
originally announced May 2025.
-
Neutron Magic Numbers in $sd$ Shell from Nuclear Charge Radii within Neutron-Proton Correction around the Fermi Surface
Authors:
Yu-Ting Rong,
Ping-Mo Liu,
Dan Yang,
Rong An
Abstract:
Charge radii are sensitive indicators of nuclear structure phenomena throughout the whole nuclide chart. In particular, the shrunken trend of changes of charge radii along a long isotopic chain is intimately associated with the shell quenching effect. In this work, the systematic evolution of charge radii along the $Z=8$, $10$, $12$, $14$, $18$ isotopic chains is investigated with a relativistic Hartree-Bogoliubov model. An ansatz about neutron-proton correlation around the Fermi surface is considered for describing the abnormal behavior of nuclear charge radii. Our results show that the neutron-proton pairing corrections around the Fermi surface lead to a sudden strengthening of the charge radii of these isotopic chains at $N=8$, 20 and 28, reflecting the fact that this correction enhances the shell closure across $N=8$, 20 and 28. The reproduction of the $N=14$ charge radius in the Mg isotopes is affected by the way in which pairing correlations are handled: BCS theory overestimates the shell effect of $N=14$, whereas the Bogoliubov quasiparticle transformation suggests a stronger pairing correlation near the proton Fermi surface, which is more consistent with experimental results. An analysis of the deviations between the theoretical and available experimental data for the charge radii of the 24 selected even-even nuclei shows that the neutron-proton pairing correction around the Fermi surface improves the calculation of the charge radii using the meson-exchange effective interactions, but it does not significantly improve the results calculated with the density-dependent effective interactions.
Submitted 14 June, 2025; v1 submitted 28 May, 2025;
originally announced May 2025.
-
AudioGenie: A Training-Free Multi-Agent Framework for Diverse Multimodality-to-Multiaudio Generation
Authors:
Yan Rong,
Jinting Wang,
Guangzhi Lei,
Shan Yang,
Li Liu
Abstract:
Multimodality-to-Multiaudio (MM2MA) generation faces significant challenges in synthesizing diverse and contextually aligned audio types (e.g., sound effects, speech, music, and songs) from multimodal inputs (e.g., video, text, images), owing to the scarcity of high-quality paired datasets and the lack of robust multi-task learning frameworks. Recently, multi-agent systems have shown great potential in tackling the above issues. However, directly applying them to the MM2MA task presents three critical challenges: (1) inadequate fine-grained understanding of multimodal inputs (especially for video), (2) the inability of single models to handle diverse audio events, and (3) the absence of self-correction mechanisms for reliable outputs. To this end, we propose AudioGenie, a novel training-free multi-agent system featuring a dual-layer architecture with a generation team and a supervisor team. For the generation team, a fine-grained task decomposition and an adaptive Mixture-of-Experts (MoE) collaborative entity are designed for detailed, comprehensive multimodal understanding and dynamic model selection, and a trial-and-error iterative refinement module is designed for self-correction. The supervisor team ensures temporal-spatial consistency and verifies outputs through feedback loops. Moreover, we build MA-Bench, the first benchmark for MM2MA tasks, comprising 198 annotated videos with multi-type audios. Experiments demonstrate that our AudioGenie achieves state-of-the-art (SOTA) or comparable performance across 9 metrics in 8 tasks. A user study further validates the effectiveness of our method in terms of quality, accuracy, alignment, and aesthetics. The project website with audio samples can be found at https://audiogenie.github.io/.
Submitted 5 August, 2025; v1 submitted 28 May, 2025;
originally announced May 2025.
-
The Rise of Parameter Specialization for Knowledge Storage in Large Language Models
Authors:
Yihuai Hong,
Yiran Zhao,
Wei Tang,
Yang Deng,
Yu Rong,
Wenxuan Zhang
Abstract:
Over time, a growing wave of large language models from various series has been introduced to the community. Researchers are striving to maximize the performance of language models with constrained parameter sizes. However, from a microscopic perspective, there has been limited research on how to better store knowledge in model parameters, particularly within MLPs, to enable more effective utilization of this knowledge by the model. In this work, we analyze twenty publicly available open-source large language models to investigate the relationship between their strong performance and the way knowledge is stored in their corresponding MLP parameters. Our findings reveal that as language models become more advanced and demonstrate stronger knowledge capabilities, their parameters exhibit increased specialization. Specifically, parameters in the MLPs tend to be more focused on encoding similar types of knowledge. We experimentally validate that this specialized distribution of knowledge contributes to improving the efficiency of knowledge utilization in these models. Furthermore, by conducting causal training experiments, we confirm that this specialized knowledge distribution plays a critical role in improving the model's efficiency in leveraging stored knowledge.
Submitted 22 May, 2025;
originally announced May 2025.
-
Materials Generation in the Era of Artificial Intelligence: A Comprehensive Survey
Authors:
Zhixun Li,
Bin Cao,
Rui Jiao,
Liang Wang,
Ding Wang,
Yang Liu,
Dingshuo Chen,
Jia Li,
Qiang Liu,
Yu Rong,
Liang Wang,
Tong-yi Zhang,
Jeffrey Xu Yu
Abstract:
Materials are the foundation of modern society, underpinning advancements in energy, electronics, healthcare, transportation, and infrastructure. The ability to discover and design new materials with tailored properties is critical to solving some of the most pressing global challenges. In recent years, the growing availability of high-quality materials data combined with rapid advances in Artificial Intelligence (AI) has opened new opportunities for accelerating materials discovery. Data-driven generative models provide a powerful tool for materials design by directly creating novel materials that satisfy predefined property requirements. Despite the proliferation of related work, there remains a notable lack of up-to-date and systematic surveys in this area. To fill this gap, this paper provides a comprehensive overview of recent progress in AI-driven materials generation. We first organize various types of materials and illustrate multiple representations of crystalline materials. We then provide a detailed summary and taxonomy of current AI-driven materials generation approaches. Furthermore, we discuss the common evaluation metrics and summarize open-source codes and benchmark datasets. Finally, we conclude with potential future directions and challenges in this fast-growing field. The related sources can be found at https://github.com/ZhixunLEE/Awesome-AI-for-Materials-Generation.
Submitted 22 May, 2025;
originally announced May 2025.
-
STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs
Authors:
Zongzhao Li,
Zongyang Ma,
Mingze Li,
Songyou Li,
Yu Rong,
Tingyang Xu,
Ziqi Zhang,
Deli Zhao,
Wenbing Huang
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse tasks, yet they lag significantly behind humans in spatial reasoning. We investigate this gap through Transformation-Driven Visual Reasoning (TVR), a challenging task requiring identification of object transformations across images under varying viewpoints. While traditional Supervised Fine-Tuning (SFT) fails to generate coherent reasoning paths in cross-view settings, sparse-reward Reinforcement Learning (RL) suffers from inefficient exploration and slow convergence. To address these limitations, we propose STAR-R1, a novel framework that integrates a single-stage RL paradigm with a fine-grained reward mechanism tailored for TVR. Specifically, STAR-R1 rewards partial correctness while penalizing excessive enumeration and passive inaction, enabling efficient exploration and precise reasoning. Comprehensive evaluations demonstrate that STAR-R1 achieves state-of-the-art performance across all 11 metrics, outperforming SFT by 23% in cross-view scenarios. Further analysis reveals STAR-R1's anthropomorphic behavior and highlights its unique ability to compare all objects for improving spatial reasoning. Our work provides critical insights for advancing research on MLLMs and reasoning models. The codes, model weights, and data will be publicly available at https://github.com/zongzhao23/STAR-R1.
Submitted 10 July, 2025; v1 submitted 21 May, 2025;
originally announced May 2025.
-
Measurement of separate electron and positron spectra from 10 GeV to 20 GeV with the geomagnetic field on DAMPE
Authors:
DAMPE Collaboration,
F. Alemanno,
Q. An,
P. Azzarello,
F. C. T. Barbato,
P. Bernardini,
X. J. Bi,
H. Boutin,
I. Cagnoli,
M. S. Cai,
E. Casilli,
E. Catanzani,
J. Chang,
D. Y. Chen,
J. L. Chen,
Z. F. Chen,
Z. X. Chen,
P. Coppin,
M. Y. Cui,
T. S. Cui,
Y. X. Cui,
I. DeMitri,
F. dePalma,
A. DiGiovanni,
T. K. Dong
, et al. (127 additional authors not shown)
Abstract:
The cosmic-ray (CR) electrons and positrons in space are of great significance for studying the origin and propagation of cosmic-rays. The satellite-borne experiment DArk Matter Particle Explorer (DAMPE) has been used to measure the separate electron and positron spectra, as well as the positron fraction. In this work, the Earth's magnetic field is used to distinguish CR electrons and positrons, as the DAMPE detector does not carry an onboard magnet. The energy range for the measurements is from 10 to 20 GeV, being currently limited at high energy by the zenith pointing orientation of DAMPE. The results are consistent with previous measurements based on the magnetic spectrometer by AMS-02 and PAMELA, while the results of Fermi-LAT seem then to be systematically shifted to larger values.
Submitted 21 August, 2025; v1 submitted 9 May, 2025;
originally announced May 2025.
-
Flow Along the K-Amplitude for Generative Modeling
Authors:
Weitao Du,
Shuning Chang,
Jiasheng Tang,
Yu Rong,
Fan Wang,
Shengchao Liu
Abstract:
In this work, we propose a novel generative learning paradigm, K-Flow, an algorithm that flows along the $K$-amplitude. Here, $k$ is a scaling parameter that organizes frequency bands (or projected coefficients), and amplitude describes the norm of such projected coefficients. By incorporating the $K$-amplitude decomposition, K-Flow enables flow matching across the scaling parameter as time. We discuss three venues and six properties of K-Flow, from theoretical foundations, energy and temporal dynamics, and practical applications, respectively. Specifically, from the practical usage perspective, K-Flow allows steerable generation by controlling the information at different scales. To demonstrate the effectiveness of K-Flow, we conduct experiments on unconditional image generation, class-conditional image generation, and molecule assembly generation. Additionally, we conduct three ablation studies to demonstrate how K-Flow steers the scaling parameter to effectively control the resolution of image generation.
Submitted 27 April, 2025;
originally announced April 2025.
-
A negative stellar mass$-$gaseous metallicity gradient relation of dwarf galaxies modulated by stellar feedback
Authors:
Tie Li,
Hong-Xin Zhang,
Wenhe Lyu,
Yimeng Tang,
Yao Yao,
Enci Wang,
Yu Rong,
Guangwen Chen,
Xu Kong,
Fuyan Bian,
Qiusheng Gu,
J. Evelyn Johnston,
Xin Li,
Shude Mao,
Yong Shi,
Junfeng Wang,
Xin Wang,
Xiaoling Yu,
Zhiyuan Zheng
Abstract:
Baryonic cycling is reflected in the spatial distribution of metallicity within galaxies, yet gas-phase metallicity distribution and its connection with other properties of dwarf galaxies are largely unexplored. We present the first systematic study of radial gradients of gas-phase metallicities for a sample of 55 normal nearby star-forming dwarf galaxies (stellar mass $M_\star$ ranging from $10^7$ to $10^{9.5}\ M_\odot$), based on MUSE spectroscopic observations. We find that the metallicity gradient shows a significant negative correlation (correlation coefficient $r \approx -0.56$) with $\log M_\star$, in contrast to the flat or even positive correlation observed for higher-mass galaxies. This negative correlation is accompanied by a stronger central suppression of metallicity compared to the outskirts in lower-mass galaxies. Among the other explored galaxy properties (baryonic mass, star formation distribution, galaxy environment, regularity of the gaseous velocity field, and effective yield of metals $y_{\rm eff}$), only the velocity field regularity and $y_{\rm eff}$ show residual correlation with the metallicity gradient after controlling for $M_\star$, in the sense that galaxies with irregular velocity fields or lower $y_{\rm eff}$ tend to have less negative or more positive gradients. Particularly, a linear combination of $\log M_\star$ and $\log y_{\rm eff}$ significantly improves the correlation with metallicity gradient ($r \approx -0.68$) compared to $\log M_\star$ alone. The lack of correlation with environment disfavors gas accretion as a dominant factor. Our findings imply that metal mixing and transport processes, including but not limited to feedback-driven outflows, are more important than in-situ metal production in shaping the metallicity distribution of dwarf galaxies.
Submitted 24 April, 2025;
originally announced April 2025.
-
Analyzing LLMs' Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representations
Authors:
Chenghao Xiao,
Hou Pong Chan,
Hao Zhang,
Mahani Aljunied,
Lidong Bing,
Noura Al Moubayed,
Yu Rong
Abstract:
While understanding the knowledge boundaries of LLMs is crucial to prevent hallucination, research on the knowledge boundaries of LLMs has predominantly focused on English. In this work, we present the first study to analyze how LLMs recognize knowledge boundaries across different languages by probing their internal representations when processing known and unknown questions in multiple languages. Our empirical studies reveal three key findings: 1) LLMs' perceptions of knowledge boundaries are encoded in the middle to middle-upper layers across different languages. 2) Language differences in knowledge boundary perception follow a linear structure, which motivates our proposal of a training-free alignment method that effectively transfers knowledge boundary perception ability across languages, thereby helping reduce hallucination risk in low-resource languages; 3) Fine-tuning on bilingual question pair translation further enhances LLMs' recognition of knowledge boundaries across languages. Given the absence of standard testbeds for cross-lingual knowledge boundary analysis, we construct a multilingual evaluation suite comprising three representative types of knowledge boundary data. Our code and datasets are publicly available at https://github.com/DAMO-NLP-SG/LLM-Multilingual-Knowledge-Boundaries.
Submitted 24 June, 2025; v1 submitted 18 April, 2025;
originally announced April 2025.
-
Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Immersive Audiobook Generation
Authors:
Yan Rong,
Shan Yang,
Chenxing Li,
Dong Yu,
Li Liu
Abstract:
Audiobook generation aims to create rich, immersive listening experiences from multimodal inputs, but current approaches face three critical challenges: (1) the lack of synergistic generation of diverse audio types (e.g., speech, sound effects, and music) with precise temporal and semantic alignment; (2) the difficulty in conveying expressive, fine-grained emotions, which often results in machine-like vocal outputs; and (3) the absence of automated evaluation frameworks that align with human preferences for complex and diverse audio. To address these issues, we propose Dopamine Audiobook, a novel unified training-free multi-agent system, where a multimodal large language model (MLLM) serves two specialized roles (i.e., speech designer and audio designer) for emotional, human-like, and immersive audiobook generation and evaluation. Specifically, we firstly propose a flow-based, context-aware framework for diverse audio generation with word-level semantic and temporal alignment. To enhance expressiveness, we then design word-level paralinguistic augmentation, utterance-level prosody retrieval, and adaptive TTS model selection. Finally, for evaluation, we introduce a novel MLLM-based evaluation framework incorporating self-critique, perspective-taking, and psychological MagicEmo prompts to ensure human-aligned and self-aligned assessments. Experimental results demonstrate that our method achieves state-of-the-art (SOTA) performance on multiple metrics. Importantly, our evaluation framework shows better alignment with human preferences and transferability across audio tasks.
Submitted 12 August, 2025; v1 submitted 15 April, 2025;
originally announced April 2025.
-
LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding
Authors:
Ziyi Wang,
Haoran Wu,
Yiming Rong,
Deyang Jiang,
Yixin Zhang,
Yunlong Zhao,
Shuang Xu,
Bo XU
Abstract:
Long video understanding is a complex task that requires both spatial detail and temporal awareness. While Vision-Language Models (VLMs) obtain frame-level understanding capabilities through multi-frame input, they suffer from information loss due to the sparse sampling strategy. In contrast, Video Large Language Models (Video-LLMs) capture temporal relationships within visual features but are limited by the scarcity of high-quality video-text datasets. To transfer long video understanding capabilities to VLMs with minimal data and computational cost, we propose Lightweight Video Compression (LVC), a novel method featuring the Query-Attention Video Compression mechanism, which effectively tackles the sparse sampling problem in VLMs. By training only the alignment layer with 10k short video-text pairs, LVC significantly enhances the temporal reasoning abilities of VLMs. Extensive experiments show that LVC provides consistent performance improvements across various models, including the InternVL2 series and Phi-3.5-Vision. Notably, the InternVL2-40B-LVC achieves scores of 68.2 and 65.9 on the long video understanding benchmarks MLVU and Video-MME, respectively, with relative improvements of 14.6% and 7.7%. The enhanced models and code will be publicly available soon.
Submitted 9 April, 2025;
originally announced April 2025.
-
Unexpected clustering pattern in dwarf galaxies challenges formation models
Authors:
Ziwen Zhang,
Yangyao Chen,
Yu Rong,
Huiyuan Wang,
Houjun Mo,
Xiong Luo,
Hao Li
Abstract:
The galaxy correlation function serves as a fundamental tool for studying cosmology, galaxy formation, and the nature of dark matter. It is well established that more massive, redder and more compact galaxies tend to have stronger clustering in space. These results can be understood in terms of galaxy formation in Cold Dark Matter (CDM) halos of different mass and assembly history. Here, we report an unexpectedly strong large-scale clustering for isolated, diffuse and blue dwarf galaxies, comparable to that seen for massive galaxy groups but much stronger than that expected from their halo mass. Our analysis indicates that the strong clustering aligns with the halo assembly bias seen in simulations with the standard $Λ$CDM cosmology only if more diffuse dwarfs formed in low-mass halos of older ages. This pattern is not reproduced by existing models of galaxy evolution in a $Λ$CDM framework, and our finding provides new clues for the search of more viable models. Our results can be explained well by assuming self-interacting dark matter, suggesting that such a scenario should be considered seriously.
Submitted 7 April, 2025; v1 submitted 4 April, 2025;
originally announced April 2025.
-
A Speech Production Model for Radar: Connecting Speech Acoustics with Radar-Measured Vibrations
Authors:
Isabella Lenz,
Yu Rong,
Daniel Bliss,
Julie Liss,
Visar Berisha
Abstract:
Millimeter Wave (mmWave) radar has emerged as a promising modality for speech sensing, offering advantages over traditional microphones. Prior works have demonstrated that radar captures motion signals related to vocal vibrations, but there is a gap in the understanding of the analytical connection between radar-measured vibrations and acoustic speech signals. We establish a mathematical framework linking radar-captured neck vibrations to speech acoustics. We derive an analytical relationship between neck surface displacements and speech. We use data from 66 human participants and statistical spectral distance analysis to empirically assess the model. Our results show that the radar-measured signal aligns more closely with our model-filtered vibration signal derived from speech than with raw speech itself. These findings provide a foundation for improved radar-based speech processing for applications in speech enhancement, coding, surveillance, and authentication.
Submitted 19 March, 2025;
originally announced March 2025.
-
Adaptive Extensive Cancellation Algorithm and Harmonic Enhanced Heart Rate Estimation based on MMWave Radar
Authors:
Hui Tang,
Zhan Yang,
Yu Rong,
Li Chai
Abstract:
Heart rate (HR) monitoring is crucial for assessing physical fitness, cardiovascular health, and stress management. Millimeter-wave radar offers a promising noncontact solution for long-term monitoring. However, accurate HR estimation remains challenging in low signal-to-noise ratio (SNR) conditions. To deal with both respiration harmonics and intermodulation interference, this paper proposes a cancellation-before-estimation strategy. First, we present the adaptive extensive cancellation algorithm (ECA) to suppress respiration and its low-order harmonics. Then, we propose an adaptive harmonic enhanced trace (AHET) method to avoid intermodulation interference by refining the HR search region. Various experimental results validate the effectiveness of the proposed methods, demonstrating improvements in accuracy, robustness, and computational efficiency compared to conventional approaches based on the FMCW (Frequency Modulated Continuous Wave) system.
Submitted 10 March, 2025;
originally announced March 2025.
-
Orthogonal Alignment of Galaxy Group Angular Momentum with Cosmic Filament Spines: An Observational Study
Authors:
Yu Rong,
Peng Wang,
Xiao-xiao Tang
Abstract:
We investigate the alignment between the angular momenta of galaxy groups and the spines of their associated cosmic filaments. Our results demonstrate a significant tendency for these two orientations to be perpendicular, indicating that the rotation of a galaxy group does not originate from the spin of cosmic filaments. Instead, it is driven by the orbital angular momentum contributed by member galaxies as they accrete along the direction of the filament spines. Moreover, the strength of this perpendicular alignment signal varies with the richness of the galaxy groups, with the most pronounced alignment observed among the richest groups. This pronounced alignment is largely due to the more coherent spatial distribution of member galaxies in richer groups relative to the filament spines. Our study provides valuable insights into the mechanisms of angular momentum acquisition in galaxy groups from an observational standpoint.
Submitted 15 March, 2025; v1 submitted 7 March, 2025;
originally announced March 2025.
-
Frequency Autoregressive Image Generation with Continuous Tokens
Authors:
Hu Yu,
Hao Luo,
Hangjie Yuan,
Yu Rong,
Feng Zhao
Abstract:
Autoregressive (AR) models for image generation typically adopt a two-stage paradigm of vector quantization and raster-scan "next-token prediction", inspired by its great success in language modeling. However, due to the huge modality gap, image autoregressive models may require a systematic reevaluation from two perspectives: tokenizer format and regression direction. In this paper, we introduce the frequency progressive autoregressive (FAR) paradigm and instantiate FAR with the continuous tokenizer. Specifically, we identify spectral dependency as the desirable regression direction for FAR, wherein higher-frequency components build upon the lower-frequency ones to progressively construct a complete image. This design seamlessly fits the causality requirement for autoregressive models and preserves the unique spatial locality of image data. Besides, we delve into the integration of FAR and the continuous tokenizer, introducing a series of techniques to address optimization challenges and improve the efficiency of training and inference processes. We demonstrate the efficacy of FAR through comprehensive experiments on the ImageNet dataset and verify its potential on text-to-image generation.
Submitted 7 March, 2025;
originally announced March 2025.
-
InversionGNN: A Dual Path Network for Multi-Property Molecular Optimization
Authors:
Yifan Niu,
Ziqi Gao,
Tingyang Xu,
Yang Liu,
Yatao Bian,
Yu Rong,
Junzhou Huang,
Jia Li
Abstract:
Exploring chemical space to find novel molecules that simultaneously satisfy multiple properties is crucial in drug discovery. However, existing methods often struggle with trading off multiple properties due to the conflicting or correlated nature of chemical properties. To tackle this issue, we introduce InversionGNN framework, an effective yet sample-efficient dual-path graph neural network (GNN) for multi-objective drug discovery. In the direct prediction path of InversionGNN, we train the model for multi-property prediction to acquire knowledge of the optimal combination of functional groups. Then the learned chemical knowledge helps the inversion generation path to generate molecules with required properties. In order to decode the complex knowledge of multiple properties in the inversion path, we propose a gradient-based Pareto search method to balance conflicting properties and generate Pareto optimal molecules. Additionally, InversionGNN is able to search the full Pareto front approximately in discrete chemical space. Comprehensive experimental evaluations show that InversionGNN is both effective and sample-efficient in various discrete multi-objective settings including drug discovery.
Submitted 3 March, 2025;
originally announced March 2025.
-
Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers
Authors:
Yiran Zhao,
Chaoqun Liu,
Yue Deng,
Jiahao Ying,
Mahani Aljunied,
Zhaodonghui Li,
Lidong Bing,
Hou Pong Chan,
Yu Rong,
Deli Zhao,
Wenxuan Zhang
Abstract:
Large language models (LLMs) have revolutionized natural language processing (NLP), yet open-source multilingual LLMs remain scarce, with existing models often limited in language coverage. Such models typically prioritize well-resourced languages, while widely spoken but under-resourced languages are often overlooked. To address this disparity, we introduce $\texttt{Babel}$, an open multilingual LLM that covers the top 25 languages by number of speakers, supports over 90% of the global population, and includes many languages neglected by other open multilingual LLMs. Unlike traditional continued-pretraining approaches, Babel expands its parameter count through a layer extension technique that elevates Babel's performance ceiling. We introduce two variants: $\texttt{Babel-9B}$, designed for efficient inference and fine-tuning, and $\texttt{Babel-83B}$, which sets a new standard for open multilingual LLMs. Extensive evaluations on multilingual tasks demonstrate its superior performance compared to open LLMs of comparable size. In addition, using open-source supervised fine-tuning datasets, Babel achieves remarkable performance, with Babel-9B-Chat leading among 10B-sized LLMs and Babel-83B-Chat setting a new standard for multilingual tasks, reaching the same level as commercial models.
Submitted 2 March, 2025;
originally announced March 2025.
-
FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving
Authors:
Guizhen Chen,
Weiwen Xu,
Hao Zhang,
Hou Pong Chan,
Chaoqun Liu,
Lidong Bing,
Deli Zhao,
Anh Tuan Luu,
Yu Rong
Abstract:
Many challenging reasoning tasks require not just rapid, intuitive responses, but a more deliberate, multi-step approach. Recent progress in large language models (LLMs) highlights an important shift from the "System 1" way of quick reactions to the "System 2" style of reflection-and-correction problem solving. However, current benchmarks heavily rely on the final-answer accuracy, leaving much of a model's intermediate reasoning steps unexamined. This fails to assess the model's ability to reflect and rectify mistakes within the reasoning process. To bridge this gap, we introduce FINEREASON, a logic-puzzle benchmark for fine-grained evaluation of LLMs' reasoning capabilities. Each puzzle can be decomposed into atomic steps, making it ideal for rigorous validation of intermediate correctness. Building on this, we introduce two tasks: state checking, and state transition, for a comprehensive evaluation of how models assess the current situation and plan the next move. To support broader research, we also provide a puzzle training set aimed at enhancing performance on general mathematical tasks. We show that models trained on our state checking and transition data demonstrate gains in math reasoning by up to 5.1% on GSM8K.
Submitted 1 June, 2025; v1 submitted 27 February, 2025;
originally announced February 2025.
-
CirT: Global Subseasonal-to-Seasonal Forecasting with Geometry-inspired Transformer
Authors:
Yang Liu,
Zinan Zheng,
Jiashun Cheng,
Fugee Tsung,
Deli Zhao,
Yu Rong,
Jia Li
Abstract:
Accurate Subseasonal-to-Seasonal (S2S) climate forecasting is pivotal for decision-making including agriculture planning and disaster preparedness but is known to be challenging due to its chaotic nature. Although recent data-driven models have shown promising results, their performance is limited by inadequate consideration of geometric inductive biases. Usually, they treat the spherical weather data as planar images, resulting in an inaccurate representation of locations and spatial relations. In this work, we propose the geometric-inspired Circular Transformer (CirT) to model the cyclic characteristic of the graticule, consisting of two key designs: (1) Decomposing the weather data by latitude into circular patches that serve as input tokens to the Transformer; (2) Leveraging Fourier transform in self-attention to capture the global information and model the spatial periodicity. Extensive experiments on the Earth Reanalysis 5 (ERA5) reanalysis dataset demonstrate our model yields a significant improvement over the advanced data-driven models, including PanguWeather and GraphCast, as well as skillful ECMWF systems. Additionally, we empirically show the effectiveness of our model designs and high-quality prediction over spatial and temporal dimensions.
Submitted 26 February, 2025;
originally announced February 2025.
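The two design ideas named in the CirT abstract above (one circular patch per latitude, and a Fourier transform to capture periodicity along longitude) can be illustrated with a small sketch. This is not the CirT architecture: the function names are hypothetical, the grid is a plain list of lists with one row per latitude, and a naive discrete Fourier transform stands in for whatever transform the model actually uses inside self-attention.

```python
import cmath


def ring_spectrum(ring):
    """Naive DFT of one latitude ring (values ordered by longitude)."""
    n = len(ring)
    return [
        sum(ring[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
        for f in range(n)
    ]


def latitude_tokens(grid):
    """Treat each latitude row of the grid as one circular patch (token)
    and describe it by its Fourier coefficients along longitude."""
    return [ring_spectrum(row) for row in grid]


# A ring and the same ring rotated by one longitude step.
ring = [1.0, 2.0, 3.0, 4.0]
rotated = ring[-1:] + ring[:-1]

# Rotating a ring only changes the phase of each coefficient, so the
# magnitudes agree: the representation respects the circular geometry
# that a planar-image treatment would break at the left/right edge.
magnitudes = [abs(c) for c in ring_spectrum(ring)]
rotated_magnitudes = [abs(c) for c in ring_spectrum(rotated)]
```

The point of the example is the design choice rather than the arithmetic: because longitude wraps around, a per-latitude frequency representation has no artificial boundary.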
-
LUCAS: Layered Universal Codec Avatars
Authors:
Di Liu,
Teng Deng,
Giljoo Nam,
Yu Rong,
Stanislav Pidhorskyi,
Junxuan Li,
Jason Saragih,
Dimitris N. Metaxas,
Chen Cao
Abstract:
Photorealistic 3D head avatar reconstruction faces critical challenges in modeling dynamic face-hair interactions and achieving cross-identity generalization, particularly during expressions and head movements. We present LUCAS, a novel Universal Prior Model (UPM) for codec avatar modeling that disentangles face and hair through a layered representation. Unlike previous UPMs that treat hair as an integral part of the head, our approach separates the modeling of the hairless head and hair into distinct branches. LUCAS is the first to introduce a mesh-based UPM, facilitating real-time rendering on devices. Our layered representation also improves the anchor geometry for precise and visually appealing Gaussian renderings. Experimental results indicate that LUCAS outperforms existing single-mesh and Gaussian-based avatar models in both quantitative and qualitative assessments, including evaluations on held-out subjects in zero-shot driving scenarios. LUCAS demonstrates superior dynamic performance in managing head pose changes, expression transfer, and hairstyle variations, thereby advancing the state-of-the-art in 3D head avatar reconstruction.
Submitted 17 March, 2025; v1 submitted 26 February, 2025;
originally announced February 2025.
-
On the notion of Khovanov A-adequacy
Authors:
Lizzie Buchanan,
Huizheng Guo,
Gabriel Montoya-Vega,
Yongwu Rong,
Marithania Silvero
Abstract:
The concept of adequate links, introduced by Lickorish and Thistlethwaite as a generalization of alternating links, has recently gained interest among knot theorists in the context of Khovanov homology. Przytycki and Silvero introduced the more general concept of Khovanov adequacy: a diagram is Khovanov-adequate if its associated Khovanov chain complexes at both potential maximal and minimal quantum gradings have non-trivial homology. This article explores Khovanov adequacy within the framework of independence complexes and the calculation of the homotopy type of extreme Khovanov spectra.
Submitted 24 February, 2025;
originally announced February 2025.
-
A Survey of Graph Transformers: Architectures, Theories and Applications
Authors:
Chaohao Yuan,
Kangfei Zhao,
Ercan Engin Kuruoglu,
Liang Wang,
Tingyang Xu,
Wenbing Huang,
Deli Zhao,
Hong Cheng,
Yu Rong
Abstract:
Graph Transformers (GTs) have demonstrated a strong capability in modeling graph structures by addressing the intrinsic limitations of graph neural networks (GNNs), such as over-smoothing and over-squashing. Recent studies have proposed diverse architectures, enhanced explainability, and practical applications for Graph Transformers. In light of these rapid developments, we conduct a comprehensive review of Graph Transformers, covering their architectures, theoretical foundations, and applications. We categorize the architectures of Graph Transformers according to their strategies for processing structural information, including graph tokenization, positional encoding, structure-aware attention, and model ensemble. From the theoretical perspective, we examine the expressivity of Graph Transformers across the discussed architectures and contrast them with other advanced graph learning algorithms to discover the connections. We also summarize the practical applications where Graph Transformers have been utilized, such as molecule, protein, language, vision, traffic, brain, and material data. At the end of this survey, we discuss the current challenges and prospective directions in Graph Transformers for potential future research.
Submitted 27 February, 2025; v1 submitted 23 February, 2025;
originally announced February 2025.
-
MolSpectra: Pre-training 3D Molecular Representation with Multi-modal Energy Spectra
Authors:
Liang Wang,
Shaozhen Liu,
Yu Rong,
Deli Zhao,
Qiang Liu,
Shu Wu,
Liang Wang
Abstract:
Establishing the relationship between 3D structures and the energy states of molecular systems has proven to be a promising approach for learning 3D molecular representations. However, existing methods are limited to modeling the molecular energy states from classical mechanics. This limitation results in a significant oversight of quantum mechanical effects, such as quantized (discrete) energy level structures, which offer a more accurate estimation of molecular energy and can be experimentally measured through energy spectra. In this paper, we propose to utilize the energy spectra to enhance the pre-training of 3D molecular representations (MolSpectra), thereby infusing the knowledge of quantum mechanics into the molecular representations. Specifically, we propose SpecFormer, a multi-spectrum encoder for encoding molecular spectra via masked patch reconstruction. By further aligning outputs from the 3D encoder and spectrum encoder using a contrastive objective, we enhance the 3D encoder's understanding of molecules. Evaluations on public benchmarks reveal that our pre-trained representations surpass existing methods in predicting molecular properties and modeling dynamics.
Submitted 22 February, 2025;
originally announced February 2025.
-
Large Language-Geometry Model: When LLM meets Equivariance
Authors:
Zongzhao Li,
Jiacheng Cen,
Bing Su,
Wenbing Huang,
Tingyang Xu,
Yu Rong,
Deli Zhao
Abstract:
Accurately predicting 3D structures and dynamics of physical systems is crucial in scientific applications. Existing approaches that rely on geometric Graph Neural Networks (GNNs) effectively enforce $\mathrm{E}(3)$-equivariance, but they often fall short in leveraging extensive broader information. While directly applied Large Language Models (LLMs) can incorporate external knowledge, they lack the capability for spatial reasoning with guaranteed equivariance. In this paper, we propose EquiLLM, a novel framework for representing 3D physical systems that seamlessly integrates $\mathrm{E}(3)$-equivariance with LLM capabilities. Specifically, EquiLLM comprises four key components: geometry-aware prompting, an equivariant encoder, an LLM, and an equivariant adaptor. Essentially, the LLM guided by the instructive prompt serves as a sophisticated invariant feature processor, while 3D directional information is exclusively handled by the equivariant encoder and adaptor modules. Experimental results demonstrate that EquiLLM delivers significant improvements over previous methods across molecular dynamics simulation, human motion simulation, and antibody design, highlighting its promising generalizability.
Submitted 19 February, 2025; v1 submitted 16 February, 2025;
originally announced February 2025.