-
Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing
Authors:
Junfei Wu,
Jian Guan,
Kaituo Feng,
Qiang Liu,
Shu Wu,
Liang Wang,
Wei Wu,
Tieniu Tan
Abstract:
As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods primarily approach multimodal reasoning in a straightforward, text-centric manner, where both reasoning and answer derivation are conducted purely through text, with the only difference being the presence of multimodal input. As a result, these methods often encounter fundamental limitations in spatial reasoning tasks that demand precise geometric understanding and continuous spatial tracking, capabilities that humans achieve through mental visualization and manipulation. To address these limitations, we propose drawing to reason in space, a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. By equipping models with basic drawing operations, including annotating bounding boxes and drawing auxiliary lines, we empower them to express and analyze spatial relationships through direct visual manipulation, while avoiding the performance ceiling imposed by specialized perception tools in previous tool-integrated reasoning approaches. To cultivate this capability, we develop a three-stage training framework: cold-start training with synthetic data to establish basic drawing abilities, reflective rejection sampling to enhance self-reflection behaviors, and reinforcement learning to directly optimize for target rewards. Extensive experiments demonstrate that our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks, including maze navigation, static spatial reasoning, video-based reasoning, and multi-view-based reasoning tasks, with an average improvement of 18.4%.
Submitted 11 June, 2025;
originally announced June 2025.
-
VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks
Authors:
Xinlong Chen,
Yuanxing Zhang,
Yushuo Guan,
Bohan Zeng,
Yang Shi,
Sihan Yang,
Pengfei Wan,
Qiang Liu,
Liang Wang,
Tieniu Tan
Abstract:
Recent advancements in multimodal large language models have successfully extended the Reason-Then-Respond paradigm to image-based reasoning, yet video-based reasoning remains an underdeveloped frontier, primarily due to the scarcity of high-quality reasoning-oriented data and effective training methodologies. To bridge this gap, we introduce DarkEventInfer and MixVidQA, two novel datasets specifically designed to stimulate the model's advanced video understanding and reasoning abilities. DarkEventInfer presents videos with masked event segments, requiring models to infer the obscured content based on contextual video cues. MixVidQA, on the other hand, presents interleaved video sequences composed of two distinct clips, challenging models to isolate and reason about one while disregarding the other. Leveraging these carefully curated training samples together with reinforcement learning guided by diverse reward functions, we develop VersaVid-R1, the first versatile video understanding and reasoning model under the Reason-Then-Respond paradigm capable of handling multiple-choice and open-ended question answering, as well as video captioning tasks. Extensive experiments demonstrate that VersaVid-R1 significantly outperforms existing models across a broad spectrum of benchmarks, covering general video understanding, cognitive reasoning, and captioning tasks.
Submitted 9 June, 2025;
originally announced June 2025.
-
BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models
Authors:
Peiyan Li,
Yixiang Chen,
Hongtao Wu,
Xiao Ma,
Xiangnan Wu,
Yan Huang,
Liang Wang,
Tao Kong,
Tieniu Tan
Abstract:
Recently, leveraging pre-trained vision-language models (VLMs) for building vision-language-action (VLA) models has emerged as a promising approach to effective robot manipulation learning. However, only a few methods incorporate 3D signals into VLMs for action prediction, and they do not fully leverage the spatial structure inherent in 3D data, leading to low sample efficiency. In this paper, we introduce BridgeVLA, a novel 3D VLA model that (1) projects 3D inputs to multiple 2D images, ensuring input alignment with the VLM backbone, and (2) utilizes 2D heatmaps for action prediction, unifying the input and output spaces within a consistent 2D image space. In addition, we propose a scalable pre-training method that equips the VLM backbone with the capability to predict 2D heatmaps before downstream policy learning. Extensive experiments show that the proposed method learns 3D manipulation efficiently and effectively. BridgeVLA outperforms state-of-the-art baseline methods across three simulation benchmarks. In RLBench, it improves the average success rate from 81.4% to 88.2%. In COLOSSEUM, it demonstrates significantly better performance in challenging generalization settings, boosting the average success rate from 56.7% to 64.0%. In GemBench, it surpasses all compared baseline methods in terms of average success rate. In real-robot experiments, BridgeVLA outperforms a state-of-the-art baseline method by 32% on average. It generalizes robustly in multiple out-of-distribution settings, including visual disturbances and unseen instructions. Remarkably, it achieves a success rate of 96.8% on 10+ tasks with only 3 trajectories per task, highlighting its extraordinary sample efficiency. Project website: https://bridgevla.github.io/
Submitted 9 June, 2025;
originally announced June 2025.
-
Implicit Neural Representation-Based MRI Reconstruction Method with Sensitivity Map Constraints
Authors:
Lixuan Rao,
Xinlin Zhang,
Yiman Huang,
Tao Tan,
Tong Tong
Abstract:
Magnetic Resonance Imaging (MRI) is a widely utilized diagnostic tool in clinical settings, but its application is limited by the relatively long acquisition time. As a result, fast MRI reconstruction has become a significant area of research. In recent years, Implicit Neural Representation (INR), as a scan-specific method, has demonstrated outstanding performance in fast MRI reconstruction without fully-sampled images for training. High acceleration reconstruction poses a challenging problem, and a key component in achieving high-quality reconstruction with far less data is the accurate estimation of coil sensitivity maps. However, most INR-based methods apply regularization constraints solely to the generated images, while overlooking the characteristics of the coil sensitivity maps. To address this, this work proposes a joint coil sensitivity map and image estimation network, termed INR-CRISTAL. The proposed INR-CRISTAL introduces an extra sensitivity map regularization in the INR networks to make use of the smooth characteristics of the sensitivity maps. Experimental results show that INR-CRISTAL provides more accurate coil sensitivity estimates with fewer artifacts, and delivers superior reconstruction performance in terms of artifact removal and structure preservation. Moreover, INR-CRISTAL demonstrates stronger robustness to automatic calibration signals and the acceleration rate compared to existing methods.
Submitted 6 June, 2025;
originally announced June 2025.
-
Three Hot Jupiters transiting K-dwarfs with a significant heavy element mass
Authors:
Y. G. C. Frensch,
F. Bouchy,
G. Lo Curto,
S. Ulmer-Moll,
S. G. Sousa,
N. C. Santos,
K. G. Stassun,
C. N. Watkins,
H. Chakraborty,
K. Barkaoui,
M. Battley,
W. Ceva,
K. A. Collins,
T. Daylan,
P. Evans,
J. P. Faria,
C. Farret Jentink,
E. Fontanet,
E. Fridén,
G. Furesz,
M. Gillon,
N. Grieves,
C. Hellier,
E. Jehin,
J. M. Jenkins
, et al. (28 additional authors not shown)
Abstract:
Albeit at a lower frequency than around hotter stars, short-period gas giants around low-mass stars ($T_\mathrm{eff} < 4965$ K) do exist, despite predictions from planetary population synthesis models that such systems should be exceedingly rare. By combining data from TESS and ground-based follow-up observations, we seek to confirm and characterize giant planets transiting K dwarfs, particularly mid/late K dwarfs. Photometric data were obtained from the TESS mission, supplemented by ground-based imaging and photometric observations, as well as high-resolution spectroscopic data from the CORALIE spectrograph. Radial velocity (RV) measurements were analyzed to confirm the presence of companions. We report the confirmation and characterization of three giants transiting mid-K dwarfs. Within the TOI-2969 system, a giant planet of $1.16\pm 0.04\,M_\mathrm{Jup}$ and a radius of $1.10 \pm 0.08\,R_\mathrm{Jup}$ revolves around its K3V host in 1.82 days. The system of TOI-2989 contains a $3.0 \pm 0.2\,M_\mathrm{Jup}$ giant with a radius of $1.12 \pm 0.05\,R_\mathrm{Jup}$, which orbits its K4V host in 3.12 days. The K4V TOI-5300 hosts a giant of $0.6 \pm 0.1\,M_\mathrm{Jup}$ with a radius of $0.88 \pm 0.08\,R_\mathrm{Jup}$ and an orbital period of 2.3 days. The equilibrium temperatures of the companions range from 1001 to 1186 K, classifying them as Hot Jupiters. However, they do not present radius inflation. The estimated heavy element masses in their interior, inferred from the mass, radius, and evolutionary models, are $90 \pm 30\,M_\oplus$, $114 \pm 30\,M_\oplus$, and $84 \pm 21\,M_\oplus$, respectively. The heavy element masses are significantly higher than most reported heavy element masses for K-dwarf Hot Jupiters. These mass characterizations contribute to the poorly explored population of massive companions around low-mass stars.
Submitted 5 June, 2025;
originally announced June 2025.
-
Bridging Annotation Gaps: Transferring Labels to Align Object Detection Datasets
Authors:
Mikhail Kennerley,
Angelica Aviles-Rivero,
Carola-Bibiane Schönlieb,
Robby T. Tan
Abstract:
Combining multiple object detection datasets offers a path to improved generalisation but is hindered by inconsistencies in class semantics and bounding box annotations. Some methods to address this assume shared label taxonomies and address only spatial inconsistencies; others require manual relabelling, or produce a unified label space, which may be unsuitable when a fixed target label space is required. We propose Label-Aligned Transfer (LAT), a label transfer framework that systematically projects annotations from diverse source datasets into the label space of a target dataset. LAT begins by training dataset-specific detectors to generate pseudo-labels, which are then combined with ground-truth annotations via a Privileged Proposal Generator (PPG) that replaces the region proposal network in two-stage detectors. To further refine region features, a Semantic Feature Fusion (SFF) module injects class-aware context and features from overlapping proposals using a confidence-weighted attention mechanism. This pipeline preserves dataset-specific annotation granularity while enabling many-to-one label space transfer across heterogeneous datasets, resulting in a semantically and spatially aligned representation suitable for training a downstream detector. LAT thus jointly addresses both class-level misalignments and bounding box inconsistencies without relying on shared label spaces or manual annotations. Across multiple benchmarks, LAT demonstrates consistent improvements in target-domain detection performance, achieving gains of up to +4.8 AP over semi-supervised baselines.
Submitted 6 June, 2025; v1 submitted 5 June, 2025;
originally announced June 2025.
-
A Diffusion-Driven Temporal Super-Resolution and Spatial Consistency Enhancement Framework for 4D MRI imaging
Authors:
Xuanru Zhou,
Jiarun Liu,
Shoujun Yu,
Hao Yang,
Cheng Li,
Tao Tan,
Shanshan Wang
Abstract:
In medical imaging, 4D MRI enables dynamic 3D visualization, yet the trade-off between spatial and temporal resolution requires prolonged scan time that can compromise temporal fidelity--especially during rapid, large-amplitude motion. Traditional approaches typically rely on registration-based interpolation to generate intermediate frames. However, these methods struggle with large deformations, resulting in misregistration, artifacts, and diminished spatial consistency. To address these challenges, we propose TSSC-Net, a novel framework that generates intermediate frames while preserving spatial consistency. To improve temporal fidelity under fast motion, our diffusion-based temporal super-resolution network generates intermediate frames using the start and end frames as key references, achieving 6x temporal super-resolution in a single inference step. Additionally, we introduce a novel tri-directional Mamba-based module that leverages long-range contextual information to effectively resolve spatial inconsistencies arising from cross-slice misalignment, thereby enhancing volumetric coherence and correcting cross-slice errors. Extensive experiments were performed on the public ACDC cardiac MRI dataset and a real-world dynamic 4D knee joint dataset. The results demonstrate that TSSC-Net can generate high-resolution dynamic MRI from fast-motion data while preserving structural fidelity and spatial consistency.
Submitted 8 June, 2025; v1 submitted 4 June, 2025;
originally announced June 2025.
-
$^{229}$Th Nuclear Spectroscopy in an Opaque Material: Laser-Based Conversion Electron Mössbauer Spectroscopy of $^{229}$ThO$_2$
Authors:
Ricky Elwell,
James E. S. Terhune,
Christian Schneider,
Harry W. T. Morgan,
Hoang Bao Tran Tan,
Udeshika C. Perera,
Daniel A. Rehn,
Marisa C. Alfonso,
Lars von der Wense,
Benedict Seiferle,
Kevin Scharl,
Peter G. Thirolf,
Andrei Derevianko,
Eric R. Hudson
Abstract:
Here, we report the first demonstration of laser-induced conversion electron Mössbauer spectroscopy of the $^{229}$Th nuclear isomeric state, which provides the ability to probe the nuclear transition in a material that is opaque to light resonant with the nuclear transition. Specifically, we excite the nuclear transition in a thin ThO$_2$ sample whose band gap ($\sim$ 6 eV) is considerably smaller than the nuclear isomeric state energy (8.4 eV). As a result, the excited nucleus can quickly decay by internal conversion, resulting in the ejection of electrons from the surface. By collecting these conversion electrons, nuclear spectroscopy can be recorded. Unlike fluorescence spectroscopy, this technique is compatible with materials whose work function is less than the nuclear transition energy, opening a wider class of systems to study. Further, because ThO$_2$ can be made from spinless isotopes and the internal conversion decay process reduces the isomeric state lifetime to only $\sim$10 $\mu$s, allowing $\sim$10$^8$ relative reduction in clock interrogation time, a conversion-electron-based nuclear clock could lead to a $\sim$10$^4$ reduction in clock instability.
Submitted 3 June, 2025;
originally announced June 2025.
-
Engineering continuous-variable entanglement in mechanical oscillators with optimal control
Authors:
Maverick J. Millican,
Vassili G. Matsos,
Christophe H. Valahu,
Tomas Navickas,
Liam J. Bond,
Ting Rei Tan
Abstract:
We demonstrate an optimal quantum control strategy for the deterministic preparation of entangled harmonic oscillator states in trapped ions. The protocol employs dynamical phase modulation of laser-driven Jaynes-Cummings and anti-Jaynes-Cummings interactions. We prepare Two-Mode Squeezed Vacuum (TMSV) states in the mechanical motions of a trapped ion and characterize the states with phase-space tomography. First, we verify continuous-variable entanglement by measuring an Einstein-Podolsky-Rosen entanglement parameter of 0.0132(7), which is below the threshold of 0.25 for Reid's EPR criterion. Second, we perform a continuous-variable Bell test and find a violation of the Clauser-Horne-Shimony-Holt inequality, measuring 2.26(3), which is above the entanglement threshold of 2. We also demonstrate the flexibility of our method by preparing a non-Gaussian entangled oscillator state--a superposition of TMSV states.
Submitted 27 May, 2025;
originally announced May 2025.
-
REACT: Representation Extraction And Controllable Tuning to Overcome Overfitting in LLM Knowledge Editing
Authors:
Haitian Zhong,
Yuhuan Liu,
Ziyang Xu,
Guofan Liu,
Qiang Liu,
Shu Wu,
Zhe Zhao,
Liang Wang,
Tieniu Tan
Abstract:
Large language model editing methods frequently suffer from overfitting, wherein factual updates can propagate beyond their intended scope, overemphasizing the edited target even when it's contextually inappropriate. To address this challenge, we introduce REACT (Representation Extraction And Controllable Tuning), a unified two-phase framework designed for precise and controllable knowledge editing. In the initial phase, we utilize tailored stimuli to extract latent factual representations and apply Principal Component Analysis with a simple learnable linear transformation to compute a directional "belief shift" vector for each instance. In the second phase, we apply controllable perturbations to hidden states using the obtained vector with a magnitude scalar, gated by a pre-trained classifier that permits edits only when contextually necessary. Experiments on EVOKE benchmarks demonstrate that REACT significantly reduces overfitting across nearly all evaluation metrics, and experiments on COUNTERFACT and MQuAKE show that our method preserves balanced basic editing performance (reliability, locality, and generality) under diverse editing scenarios.
Submitted 24 May, 2025;
originally announced May 2025.
-
Mixture of Decoding: An Attention-Inspired Adaptive Decoding Strategy to Mitigate Hallucinations in Large Vision-Language Models
Authors:
Xinlong Chen,
Yuanxing Zhang,
Qiang Liu,
Junfei Wu,
Fuzheng Zhang,
Tieniu Tan
Abstract:
Large Vision-Language Models (LVLMs) have exhibited impressive capabilities across various visual tasks, yet they remain hindered by the persistent challenge of hallucinations. To address this critical issue, we propose Mixture of Decoding (MoD), a novel approach for hallucination mitigation that dynamically adapts decoding strategies by evaluating the correctness of the model's attention on image tokens. Specifically, MoD measures the consistency between outputs generated from the original image tokens and those derived from the model's attended image tokens, to judge whether that attention is correct. If the outputs are consistent, indicating correct attention, MoD employs a complementary strategy to amplify critical information. Conversely, if the outputs are inconsistent, suggesting erroneous attention, MoD utilizes a contrastive strategy to suppress misleading information. Extensive experiments demonstrate that MoD significantly outperforms existing decoding methods across multiple mainstream benchmarks, effectively mitigating hallucinations in LVLMs. The code is available at https://github.com/xlchen0205/MoD.
Submitted 10 June, 2025; v1 submitted 17 May, 2025;
originally announced May 2025.
-
Ancestry-Adjusted Polygenic Risk Scores for Predicting Obesity Risk in the Indonesian Population
Authors:
Jocelyn Verna Siswanto,
Belinda Mutiara,
Felicia Austin,
Jonathan Susanto,
Cathelyn Theophila Tan,
Restu Unggul Kresnadi,
Kezia Irene
Abstract:
Obesity prevalence in Indonesian adults increased from 10.5% in 2007 to 23.4% in 2023. Studies have shown that genetic predisposition significantly influences obesity susceptibility. To quantify this, polygenic risk scores (PRS) aggregate the effects of numerous genetic variants to assess genetic risk. However, 91% of genome-wide association studies (GWAS) involve European populations, limiting their applicability to Indonesians due to genetic diversity. This study aims to develop and validate an ancestry-adjusted PRS for obesity in the Indonesian population using a principal component analysis (PCA) method constructed from the 1000 Genomes Project data and our own genomic data from approximately 2,800 Indonesians. We calculate PRS for obesity across all ancestry groups, then determine the first four principal components using ancestry-informative SNPs and develop a linear regression model to predict PRS based on these principal components. The raw PRS is adjusted by subtracting the predicted score to obtain an ancestry-adjusted PRS for the Indonesian population. Our results indicate that the ancestry-adjusted PRS improves obesity risk prediction. Compared to the unadjusted PRS, the adjusted score improved classification performance with a 5% increase in area under the ROC curve (AUC). This approach underscores the importance of population-specific adjustments in genetic risk assessments to enable more effective personalized healthcare and targeted intervention strategies for diverse populations.
Submitted 16 May, 2025;
originally announced May 2025.
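The adjustment step described in the abstract above (regress the raw PRS on the first four principal components, then subtract the predicted score) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the helper names `fit_linear`, `predict`, and `ancestry_adjust`, the simulated PCs and scores, and the choice to fit and adjust on the same samples are all assumptions made for the example.

```python
import random

def fit_linear(X, y):
    """Ordinary least squares with an intercept, solved via the normal equations."""
    rows = [[1.0] + list(x) for x in X]          # prepend intercept column
    k = len(rows[0])
    # Augmented matrix [A^T A | A^T y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]           # [intercept, b1, ..., b4]

def predict(beta, x):
    """Predicted PRS from principal-component coordinates x."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

def ancestry_adjust(raw_prs, pcs, beta):
    """Adjusted PRS = raw PRS minus the part predicted from ancestry PCs."""
    return [s - predict(beta, x) for s, x in zip(raw_prs, pcs)]

# Synthetic stand-in data (hypothetical): 200 samples, 4 PCs, PRS confounded by ancestry.
random.seed(0)
pcs = [[random.gauss(0, 1) for _ in range(4)] for _ in range(200)]
raw = [1.0 + 2.0 * p[0] - 1.0 * p[1] + 0.5 * p[2] + random.gauss(0, 0.1) for p in pcs]

beta = fit_linear(pcs, raw)
adjusted = ancestry_adjust(raw, pcs, beta)
```

After adjustment, the scores are the regression residuals, so they have zero mean and are uncorrelated with the principal components used in the fit. In practice the regression would more likely be fitted on a reference panel and applied to the target cohort; the abstract does not specify this detail.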
-
Not All Documents Are What You Need for Extracting Instruction Tuning Data
Authors:
Chi Zhang,
Huaping Zhong,
Hongtao Li,
Chengliang Chai,
Jiawei Hong,
Yuhao Deng,
Jiacheng Wang,
Tian Tan,
Yizhou Yan,
Jiantao Qiu,
Ye Yuan,
Guoren Wang,
Conghui He,
Lei Cao
Abstract:
Instruction tuning improves the performance of large language models (LLMs), but it heavily relies on high-quality training data. Recently, LLMs have been used to synthesize instruction data using seed question-answer (QA) pairs. However, these synthesized instructions often lack diversity and tend to be similar to the input seeds, limiting their applicability in real-world scenarios. To address this, we propose extracting instruction tuning data from web corpora that contain rich and diverse knowledge. A naive solution is to retrieve domain-specific documents and extract all QA pairs from them, but this faces two key challenges: (1) extracting all QA pairs using LLMs is prohibitively expensive, and (2) many extracted QA pairs may be irrelevant to the downstream tasks, potentially degrading model performance. To tackle these issues, we introduce EQUAL, an effective and scalable data extraction framework that iteratively alternates between document selection and high-quality QA pair extraction to enhance instruction tuning. EQUAL first clusters the document corpus based on embeddings derived from contrastive learning, then uses a multi-armed bandit strategy to efficiently identify clusters that are likely to contain valuable QA pairs. This iterative approach significantly reduces computational cost while boosting model performance. Experiments on AutoMathText and StackOverflow across four downstream tasks show that EQUAL reduces computational costs by 5-10x and improves accuracy by 2.5 percent on LLaMA-3.1-8B and Mistral-7B.
Submitted 18 May, 2025;
originally announced May 2025.
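The abstract above says that clusters of documents are chosen with a multi-armed bandit strategy but does not name the algorithm. As a rough sketch of the idea, the example below uses the standard UCB1 rule, which is an assumption here and may differ from what EQUAL actually does. Each arm stands for a document cluster, and `reward_fn` is a hypothetical stand-in for "how useful were the QA pairs extracted from a document sampled from this cluster".

```python
import math

def ucb1_select(counts, means, t, c=2.0):
    """Pick an arm by UCB1: try every arm once, then maximise mean + exploration bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [m + math.sqrt(c * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)

def run_bandit(reward_fn, n_arms, rounds):
    """Run UCB1 for a fixed number of rounds; return pull counts and mean rewards per arm."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, means, t)
        reward = reward_fn(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean update
    return counts, means

# Hypothetical cluster qualities: cluster 1 yields the most useful QA pairs.
cluster_quality = [0.1, 0.9, 0.3]
counts, means = run_bandit(lambda arm: cluster_quality[arm], n_arms=3, rounds=300)
```

With these fixed rewards, most of the 300 pulls go to cluster 1, while the other clusters are sampled only often enough to rule them out. That is the property that lets a bandit avoid running expensive extraction on every document.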
-
Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory
Authors:
Yexiang Liu,
Zekun Li,
Zhi Fang,
Nan Xu,
Ran He,
Tieniu Tan
Abstract:
Recently, scaling test-time compute on Large Language Models (LLMs) has garnered wide attention. However, there has been limited investigation of how various reasoning prompting strategies perform as compute scales. In this paper, we focus on a standard and realistic scaling setting: majority voting. We systematically conduct experiments on 6 LLMs $\times$ 8 prompting strategies $\times$ 6 benchmarks. Experimental results consistently show that as the sampling time and computational overhead increase, complicated prompting strategies with superior initial performance gradually fall behind simple Chain-of-Thought. We analyze this phenomenon and provide theoretical proofs. Additionally, we propose a probabilistic method to efficiently predict scaling performance and identify the best prompting strategy under large sampling times, eliminating the need for resource-intensive inference processes in practical applications. Furthermore, we introduce two ways derived from our theoretical analysis to significantly improve the scaling performance. We hope that our research can prompt a re-examination of the role of complicated prompting, unleash the potential of simple prompting strategies, and provide new insights for enhancing test-time scaling performance. Code is available at https://github.com/MraDonkey/rethinking_prompting.
Submitted 4 June, 2025; v1 submitted 16 May, 2025;
originally announced May 2025.
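The majority-voting setting studied in the abstract above can be illustrated with a small sketch. This is not the paper's probabilistic method. It assumes a simplified case in which each sampled answer is independently either correct (with probability `p_correct`) or one single wrong answer, and the number of samples is odd. The function names are chosen for the example.

```python
from collections import Counter
from math import comb

def majority_vote(answers):
    """Return the most frequent answer among the sampled answers."""
    return Counter(answers).most_common(1)[0][0]

def majority_accuracy(p_correct, n_samples):
    """Probability that the correct answer wins a majority vote.

    Simplifying assumptions: samples are independent, there is only one
    competing wrong answer, and n_samples is odd (so no ties occur).
    """
    if n_samples % 2 == 0:
        raise ValueError("n_samples must be odd in this simplified model")
    need = n_samples // 2 + 1  # votes needed for a strict majority
    return sum(
        comb(n_samples, k) * p_correct**k * (1 - p_correct) ** (n_samples - k)
        for k in range(need, n_samples + 1)
    )

print(majority_vote(["4", "5", "4"]))          # 4
print(round(majority_accuracy(0.6, 3), 3))     # 0.648
print(round(majority_accuracy(0.4, 3), 3))     # 0.352
```

Under these assumptions, voting helps on a question only when the per-sample probability of the correct answer exceeds 0.5, and it hurts otherwise. A strategy with higher single-sample accuracy averaged over a benchmark can therefore still scale worse if its gains come from questions where it remains below that threshold.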
-
MambaControl: Anatomy Graph-Enhanced Mamba ControlNet with Fourier Refinement for Diffusion-Based Disease Trajectory Prediction
Authors:
Hao Yang,
Tao Tan,
Shuai Tan,
Weiqin Yang,
Kunyan Cai,
Calvin Chen,
Yue Sun
Abstract:
Modelling disease progression in precision medicine requires capturing complex spatio-temporal dynamics while preserving anatomical integrity. Existing methods often struggle with longitudinal dependencies and structural consistency in progressive disorders. To address these limitations, we introduce MambaControl, a novel framework that integrates selective state-space modelling with diffusion processes for high-fidelity prediction of medical image trajectories. To better capture subtle structural changes over time while maintaining anatomical consistency, MambaControl combines Mamba-based long-range modelling with graph-guided anatomical control to more effectively represent anatomical correlations. Furthermore, we introduce Fourier-enhanced spectral graph representations to capture spatial coherence and multiscale detail, enabling MambaControl to achieve state-of-the-art performance in Alzheimer's disease prediction. Quantitative and regional evaluations demonstrate improved progression prediction quality and anatomical fidelity, highlighting its potential for personalised prognosis and clinical decision support.
Submitted 15 May, 2025;
originally announced May 2025.
-
DESI DR1 Lyα 1D power spectrum: The Fast Fourier Transform estimator measurement
Authors:
Corentin Ravoux,
Marie-Lynn Abdul-Karim,
Jean-Marc Le Goff,
Eric Armengaud,
Jessica N. Aguilar,
Steven Ahlen,
Stephen Bailey,
Davide Bianchi,
Allyson Brodzeller,
David Brooks,
Jonás Chaves-Montero,
Todd Claybaugh,
Andrei Cuceu,
Roger de Belsunce,
Axel de la Macorra,
Arjun Dey,
Zhejie Ding,
Peter Doel,
Simone Ferraro,
Andreu Font-Ribera,
Jaime E. Forero-Romero,
Enrique Gaztañaga,
Naim Göksel Karaçaylı,
Satya Gontcho A Gontcho,
Gaston Gutierrez
, et al. (42 additional authors not shown)
Abstract:
We present the one-dimensional Lyman-alpha forest power spectrum measurement derived from the data release 1 (DR1) of the Dark Energy Spectroscopic Instrument (DESI). The measurement of the Lyman-alpha forest power spectrum along the line of sight from high-redshift quasar spectra provides information on the shape of the linear matter power spectrum, neutrino masses, and the properties of dark matter. In this work, we use a Fast Fourier Transform (FFT)-based estimator, which is validated on synthetic data in a companion paper. Compared to the FFT measurement performed on the DESI early data release, we improve the noise characterization with a cross-exposure estimator and test the robustness of our measurement using various data splits. We also refine the estimation of the uncertainties and now present an estimator for the covariance matrix of the measurement. Furthermore, we compare our results to previous high-resolution and eBOSS measurements. In another companion paper, we present the same DR1 measurement using the Quadratic Maximum Likelihood Estimator (QMLE). These two measurements are consistent with each other and constitute the most precise one-dimensional power spectrum measurement to date, while being in good agreement with results from the DESI early data release.
Submitted 14 May, 2025;
originally announced May 2025.
-
BioVFM-21M: Benchmarking and Scaling Self-Supervised Vision Foundation Models for Biomedical Image Analysis
Authors:
Jiarun Liu,
Hong-Yu Zhou,
Weijian Huang,
Hao Yang,
Dongning Song,
Tao Tan,
Yong Liang,
Shanshan Wang
Abstract:
Scaling up model and data size has demonstrated impressive performance improvement over a wide range of tasks. Despite extensive studies on scaling behaviors for general-purpose tasks, medical images exhibit substantial differences from natural data. The key factors in developing medical vision foundation models at scale remain unclear due to the absence of an extensive understanding of scaling behavior in the medical domain. In this paper, we explore the scaling behavior across model sizes, training algorithms, data sizes, and imaging modalities in developing scalable medical vision foundation models by self-supervised learning. To support scalable pretraining, we introduce BioVFM-21M, a large-scale biomedical image dataset encompassing a wide range of biomedical image modalities and anatomies. We observe that scaling up does provide benefits, but that these benefits vary across tasks. Additional analysis reveals several factors correlated with scaling benefits. Finally, we propose BioVFM, a large-scale medical vision foundation model pretrained on 21 million biomedical images, which outperforms the previous state-of-the-art foundation models across 12 medical benchmarks. Our results highlight that while scaling up is beneficial for pursuing better performance, task characteristics, data diversity, pretraining methods, and computational efficiency remain critical considerations for developing scalable medical foundation models.
Submitted 14 May, 2025;
originally announced May 2025.
-
DESI DR1 Ly$α$ 1D power spectrum: The optimal estimator measurement
Authors:
N. G. Karaçaylı,
P. Martini,
J. Aguilar,
S. Ahlen,
E. Armengaud,
S. Bailey,
A. Bault,
D. Bianchi,
A. Brodzeller,
D. Brooks,
J. Chaves-Montero,
T. Claybaugh,
A. Cuceu,
A. de la Macorra,
A. Dey,
B. Dey,
P. Doel,
S. Ferraro,
A. Font-Ribera,
J. E. Forero-Romero,
E. Gaztañaga,
S. Gontcho A Gontcho,
G. Gutierrez,
J. Guy,
C. Hahn
, et al. (39 additional authors not shown)
Abstract:
The one-dimensional power spectrum $P_{\mathrm{1D}}$ of Ly$α$ forest offers rich insights into cosmological and astrophysical parameters, including constraints on the sum of neutrino masses, warm dark matter models, and the thermal state of the intergalactic medium. We present the measurement of $P_{\mathrm{1D}}$ using the optimal quadratic maximum likelihood estimator applied to over 300,000 Ly$α$ quasars from Data Release 1 (DR1) of the Dark Energy Spectroscopic Instrument (DESI) survey. This sample represents the largest to date for $P_{\mathrm{1D}}$ measurements and is larger than the Extended Baryon Oscillation Spectroscopic Survey (eBOSS) by a factor of 1.7. We conduct a meticulous investigation of instrumental and analysis systematics and quantify their impact on $P_{\mathrm{1D}}$. This includes the development of a cross-exposure estimator that eliminates the need to model the pipeline noise and has strong potential for future $P_{\mathrm{1D}}$ measurements. We also present new insights into metal contamination through the 1D correlation function. Using a fitting function we measure the evolution of the Ly$α$ forest bias with high precision: $b_F(z) = (-0.218\pm0.002)\times((1 + z) / 4)^{2.96\pm0.06}$. In a companion validation paper, we substantially extend our previous suite of CCD image simulations to quantify the pipeline's exquisite performance accurately. In another companion paper, we present DR1 $P_{\mathrm{1D}}$ measurements using the Fast Fourier Transform (FFT) approach to power spectrum estimation. These two measurements are consistent with each other and constitute the most precise $P_{\mathrm{1D}}$ measurement to date, while being in good agreement with results from the DESI early data release.
Submitted 12 May, 2025;
originally announced May 2025.
-
Empowering the Grid: Collaborative Edge Artificial Intelligence for Decentralized Energy Systems
Authors:
Eddie de Paula Jr,
Niel Bunda,
Hezerul Abdul Karim,
Nouar AlDahoul,
Myles Joshua Toledo Tan
Abstract:
This paper examines how decentralized energy systems can be enhanced using collaborative Edge Artificial Intelligence. Decentralized grids use local renewable sources to reduce transmission losses and improve energy security. Edge AI enables real-time, privacy-preserving data processing at the network edge. Techniques such as federated learning and distributed control improve demand response, equipment maintenance, and energy optimization. The paper discusses key challenges including data privacy, scalability, and interoperability, and suggests solutions such as blockchain integration and adaptive architectures. Examples from virtual power plants and smart grids highlight the potential of these technologies. The paper calls for increased investment, policy support, and collaboration to advance sustainable energy systems.
Submitted 11 May, 2025;
originally announced May 2025.
-
PhysLLM: Harnessing Large Language Models for Cross-Modal Remote Physiological Sensing
Authors:
Yiping Xie,
Bo Zhao,
Mingtong Dai,
Jian-Ping Zhou,
Yue Sun,
Tao Tan,
Weicheng Xie,
Linlin Shen,
Zitong Yu
Abstract:
Remote photoplethysmography (rPPG) enables non-contact physiological measurement but remains highly susceptible to illumination changes, motion artifacts, and limited temporal modeling. Large Language Models (LLMs) excel at capturing long-range dependencies, offering a potential solution, but they struggle with the continuous, noise-sensitive nature of rPPG signals due to their text-centric design. To bridge this gap, we introduce PhysLLM, a collaborative optimization framework that synergizes LLMs with domain-specific rPPG components. Specifically, the Text Prototype Guidance (TPG) strategy is proposed to establish cross-modal alignment by projecting hemodynamic features into LLM-interpretable semantic space, effectively bridging the representational gap between physiological signals and linguistic tokens. In addition, a novel Dual-Domain Stationary (DDS) Algorithm is proposed for resolving signal instability through adaptive time-frequency domain feature re-weighting. Finally, rPPG task-specific cues systematically inject physiological priors through physiological statistics, environmental contextual answering, and task description, leveraging cross-modal learning to integrate both visual and textual information and enabling dynamic adaptation to challenging scenarios such as variable illumination and subject movements. Evaluated on four benchmark datasets, PhysLLM achieves state-of-the-art accuracy and robustness, demonstrating superior generalization across lighting variations and motion scenarios.
Submitted 6 May, 2025;
originally announced May 2025.
-
Using Active Learning to Improve Quasar Identification for the DESI Spectra Processing Pipeline
Authors:
Dylan Green,
David Kirkby,
J. Aguilar,
S. Ahlen,
D. M. Alexander,
E. Armengaud,
S. Bailey,
A. Bault,
D. Bianchi,
A. Brodzeller,
D. Brooks,
T. Claybaugh,
R. de Belsunce,
A. de la Macorra,
P. Doel,
V. A. Fawcett,
S. Ferraro,
A. Font-Ribera,
J. E. Forero-Romero,
E. Gaztañaga,
S. Gontcho A Gontcho,
G. Gutierrez,
M. Ishak,
S. Juneau,
R. Kehoe
, et al. (29 additional authors not shown)
Abstract:
The Dark Energy Spectroscopic Instrument (DESI) survey uses an automatic spectral classification pipeline to classify spectra. QuasarNET is a convolutional neural network used as part of this pipeline, originally trained using data from the Baryon Oscillation Spectroscopic Survey (BOSS). In this paper we implement an active learning algorithm to optimally select spectra to use for training a new version of the QuasarNET weights file using only DESI data, specifically to improve classification accuracy. This active learning algorithm includes a novel outlier rejection step using a Self-Organizing Map to ensure we label spectra representative of the larger quasar sample observed in DESI. We perform two iterations of the active learning pipeline, assembling a final dataset of 5600 labeled spectra, a small subset of the approximately 1.3 million quasar targets in DESI's Data Release 1. When splitting the spectra into training and validation subsets, we meet or exceed the previously trained weights file in completeness and purity calculated on the validation dataset with less than one tenth of the amount of training data. The new weights also more consistently classify objects in the same way when used on unlabeled data compared to the old weights file. In the process of improving QuasarNET's classification accuracy we discovered a systemic error in QuasarNET's redshift estimation and used our findings to improve our understanding of QuasarNET's redshifts.
Submitted 2 May, 2025;
originally announced May 2025.
-
Separation and Definability in Fragments of Two-Variable First-Order Logic with Counting
Authors:
Louwe Kuijer,
Tony Tan,
Frank Wolter,
Michael Zakharyaschev
Abstract:
For fragments L of first-order logic (FO) with counting quantifiers, we consider the definability problem, which asks whether a given L-formula can be equivalently expressed by a formula in some fragment of L without counting, and the more general separation problem asking whether two mutually exclusive L-formulas can be separated in some counting-free fragment of L. We show that separation is undecidable for the two-variable fragment of FO extended with counting quantifiers and for the graded modal logic with inverse, nominals and universal modality. On the other hand, if inverse or nominals are dropped, separation becomes coNExpTime- or 2ExpTime-complete, depending on whether the universal modality is present. In contrast, definability can often be reduced in polynomial time to validity in L. We also consider uniform separation and show that it often behaves similarly to definability.
Submitted 30 April, 2025; v1 submitted 29 April, 2025;
originally announced April 2025.
-
Smallest Intersecting and Enclosing Balls
Authors:
Jiaqi Zheng,
Tiow-Seng Tan
Abstract:
We study the smallest intersecting and enclosing ball problems in Euclidean spaces for input objects that are compact and convex. They link and unify many problems in computational geometry and machine learning. We show that both problems can be modeled as zero-sum games, and propose an approximation algorithm for the former. Specifically, the algorithm produces the first results in high-dimensional spaces for various input objects such as convex polytopes, balls, ellipsoids, etc.
Submitted 25 April, 2025;
originally announced April 2025.
-
Rotational ultrasound and photoacoustic tomography of the human body
Authors:
Yang Zhang,
Shuai Na,
Jonathan J. Russin,
Karteekeya Sastry,
Li Lin,
Junfu Zheng,
Yilin Luo,
Xin Tong,
Yujin An,
Peng Hu,
Konstantin Maslov,
Tze-Woei Tan,
Charles Y. Liu,
Lihong V. Wang
Abstract:
Imaging the human body's morphological and angiographic information is essential for diagnosing, monitoring, and treating medical conditions. Ultrasonography performs the morphological assessment of the soft tissue based on acoustic impedance variations, whereas photoacoustic tomography (PAT) can visualize blood vessels based on intrinsic hemoglobin absorption. Three-dimensional (3D) panoramic imaging of the vasculature is generally not practical in conventional ultrasonography with limited field-of-view (FOV) probes, and PAT does not provide sufficient scattering-based soft tissue morphological contrast. Complementing each other, fast panoramic rotational ultrasound tomography (RUST) and PAT are integrated for hybrid rotational ultrasound and photoacoustic tomography (RUS-PAT), which obtains 3D ultrasound structural and PAT angiographic images of the human body quasi-simultaneously. The RUST functionality is achieved in a cost-effective manner using a single-element ultrasonic transducer for ultrasound transmission and rotating arc-shaped arrays for 3D panoramic detection. RUST is superior to conventional ultrasonography, which either has a limited FOV with a linear array or is high-cost with a hemispherical array that requires both transmission and receiving. By switching the acoustic source to a light source, the system is conveniently converted to PAT mode to acquire angiographic images in the same region. Using RUS-PAT, we have successfully imaged the human head, breast, hand, and foot with a 10 cm diameter FOV, submillimeter isotropic resolution, and 10 s imaging time for each modality. The 3D RUS-PAT is a powerful tool for high-speed, 3D, dual-contrast imaging of the human body with potential for rapid clinical translation.
Submitted 22 April, 2025;
originally announced April 2025.
-
Scaling and Beyond: Advancing Spatial Reasoning in MLLMs Requires New Recipes
Authors:
Huanyu Zhang,
Chengzu Li,
Wenshan Wu,
Shaoguang Mao,
Yifan Zhang,
Haochen Tian,
Ivan Vulić,
Zhang Zhang,
Liang Wang,
Tieniu Tan,
Furu Wei
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in general vision-language tasks. However, recent studies have exposed critical limitations in their spatial reasoning capabilities. This deficiency in spatial reasoning significantly constrains MLLMs' ability to interact effectively with the physical world, thereby limiting their broader applications. We argue that spatial reasoning capabilities will not naturally emerge from merely scaling existing architectures and training methodologies. Instead, this challenge demands dedicated attention to fundamental modifications in the current MLLM development approach. In this position paper, we first establish a comprehensive framework for spatial reasoning within the context of MLLMs. We then elaborate on its pivotal role in real-world applications. Through systematic analysis, we examine how individual components of the current methodology, from training data to reasoning mechanisms, influence spatial reasoning capabilities. This examination reveals critical limitations while simultaneously identifying promising avenues for advancement. Our work aims to direct the AI research community's attention toward these crucial yet underexplored aspects. By highlighting these challenges and opportunities, we seek to catalyze progress toward achieving human-like spatial reasoning capabilities in MLLMs.
Submitted 3 June, 2025; v1 submitted 21 April, 2025;
originally announced April 2025.
-
NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results
Authors:
Xin Li,
Yeying Jin,
Xin Jin,
Zongwei Wu,
Bingchen Li,
Yufei Wang,
Wenhan Yang,
Yu Li,
Zhibo Chen,
Bihan Wen,
Robby T. Tan,
Radu Timofte,
Qiyu Rong,
Hongyuan Jing,
Mengmeng Zhang,
Jinglong Li,
Xiangyu Lu,
Yi Ren,
Yuting Liu,
Meng Zhang,
Xiang Chen,
Qiyuan Guan,
Jiangxin Dong,
Jinshan Pan,
Conglin Gou
, et al. (112 additional authors not shown)
Abstract:
This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which were developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, including day raindrop-focused, day background-focused, night raindrop-focused, and night background-focused degradations. This dataset is divided into three subsets for competition: 14,139 images for training, 240 images for validation, and 731 images for testing. The primary objective of this challenge is to establish a new and powerful benchmark for the task of removing raindrops under varying lighting and focus conditions. A total of 361 participants took part in the competition, with 32 teams submitting valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the Raindrop Clarity dataset. The project can be found at https://lixinustc.github.io/CVPR-NTIRE2025-RainDrop-Competition.github.io/.
Submitted 19 April, 2025; v1 submitted 17 April, 2025;
originally announced April 2025.
-
From Gaze to Insight: Bridging Human Visual Attention and Vision Language Model Explanation for Weakly-Supervised Medical Image Segmentation
Authors:
Jingkun Chen,
Haoran Duan,
Xiao Zhang,
Boyan Gao,
Tao Tan,
Vicente Grau,
Jungong Han
Abstract:
Medical image segmentation remains challenging due to the high cost of pixel-level annotations for training. In the context of weak supervision, clinician gaze data captures regions of diagnostic interest; however, its sparsity limits its use for segmentation. In contrast, vision-language models (VLMs) provide semantic context through textual descriptions but lack the explanation precision required. Recognizing that neither source alone suffices, we propose a teacher-student framework that integrates both gaze and language supervision, leveraging their complementary strengths. Our key insight is that gaze data indicates where clinicians focus during diagnosis, while VLMs explain why those regions are significant. To implement this, the teacher model first learns from gaze points enhanced by VLM-generated descriptions of lesion morphology, establishing a foundation for guiding the student model. The teacher then directs the student through three strategies: (1) Multi-scale feature alignment to fuse visual cues with textual semantics; (2) Confidence-weighted consistency constraints to focus on reliable predictions; (3) Adaptive masking to limit error propagation in uncertain areas. Experiments on the Kvasir-SEG, NCI-ISBI, and ISIC datasets show that our method achieves Dice scores of 80.78%, 80.53%, and 84.22%, respectively, improving by 3-5% over gaze baselines without increasing the annotation burden. By preserving correlations among predictions, gaze data, and lesion descriptions, our framework also maintains clinical interpretability. This work illustrates how integrating human visual attention with AI-generated semantic context can effectively overcome the limitations of individual weak supervision signals, thereby advancing the development of deployable, annotation-efficient medical AI systems. Code is available at: https://github.com/jingkunchen/FGI.git.
Submitted 15 April, 2025;
originally announced April 2025.
-
Interference-caged quantum many-body scars: the Fock space topological localization and interference zeros
Authors:
Tao-Lin Tan,
Yi-Ping Huang
Abstract:
We propose a general mechanism for realizing athermal finite-energy-density eigenstates -- termed interference-caged quantum many-body scars (ICQMBS) -- which originate from exact many-body destructive interference on the Fock space graph. These eigenstates are strictly localized to specific subsets of vertices, analogous to compact localized states in flat-band systems. Central to our framework is a connection between interference zeros and graph automorphisms, which classify vertices according to the graph's local topology. This connection enables the construction of a new class of topological ICQMBS, whose robustness arises from the local topology of the Fock space graph rather than from conventional conservation laws or dynamical constraints. We demonstrate the effectiveness of this framework by developing a graph-theory-based search algorithm, which identifies ICQMBS in both a one-dimensional spin-1 XY model and two-dimensional quantum link models across distinct gauge sectors. In particular, we discover the proposed topological ICQMBS in the two-dimensional quantum link model and provide an intuitive explanation for previously observed order-by-disorder phenomena in Hilbert space. Our results reveal an unexpected synergy between graph theory, flat-band physics, and quantum many-body dynamics, offering new insights into the structure and stability of nonthermal eigenstates.
Submitted 10 April, 2025;
originally announced April 2025.
-
An Intelligent and Privacy-Preserving Digital Twin Model for Aging-in-Place
Authors:
Yongjie Wang,
Jonathan Cyril Leung,
Ming Chen,
Zhiwei Zeng,
Benny Toh Hsiang Tan,
Yang Qiu,
Zhiqi Shen
Abstract:
The population of older adults is steadily increasing, with a strong preference for aging-in-place rather than moving to care facilities. Consequently, supporting this growing demographic has become a significant global challenge. However, facilitating successful aging-in-place is challenging, requiring consideration of multiple factors such as data privacy, health status monitoring, and living environments to improve health outcomes. In this paper, we propose an unobtrusive sensor system designed for installation in older adults' homes. Using data from the sensors, our system constructs a digital twin, a virtual representation of events and activities that occurred in the home. The system uses neural network models and decision rules to capture residents' activities and living environments. This digital twin enables continuous health monitoring by providing actionable insights into residents' well-being. Our system is designed to be low-cost and privacy-preserving, with the aim of providing green and safe monitoring for the health of older adults. We have successfully deployed our system in two homes over a time period of two months, and our findings demonstrate the feasibility and effectiveness of digital twin technology in supporting independent living for older adults. This study highlights that our system could revolutionize elder care by enabling personalized interventions, such as lifestyle adjustments, medical treatments, or modifications to the residential environment, to enhance health outcomes.
Submitted 4 April, 2025;
originally announced April 2025.
-
MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
Authors:
Wulin Xie,
Yi-Fan Zhang,
Chaoyou Fu,
Yang Shi,
Bingyan Nie,
Hongkai Chen,
Zhang Zhang,
Liang Wang,
Tieniu Tan
Abstract:
Existing MLLM benchmarks face significant challenges in evaluating Unified MLLMs (U-MLLMs) due to: 1) lack of standardized benchmarks for traditional tasks, leading to inconsistent comparisons; 2) absence of benchmarks for mixed-modality generation, which fails to assess multimodal reasoning capabilities. We present a comprehensive evaluation framework designed to systematically assess U-MLLMs. Our benchmark includes: 1. Standardized Traditional Task Evaluation. We sample from 12 datasets, covering 10 tasks with 30 subtasks, ensuring consistent and fair comparisons across studies. 2. Unified Task Assessment. We introduce five novel tasks testing multimodal reasoning, including image editing, commonsense QA with image generation, and geometric reasoning. 3. Comprehensive Model Benchmarking. We evaluate 12 leading U-MLLMs, such as Janus-Pro, EMU3, VILA-U, and Gemini2-flash, alongside specialized understanding (e.g., Claude-3.5-Sonnet) and generation models (e.g., DALL-E-3). Our findings reveal substantial performance gaps in existing U-MLLMs, highlighting the need for more robust models capable of handling mixed-modality tasks effectively. The code and evaluation data can be found at https://mme-unify.github.io/.
Submitted 7 April, 2025; v1 submitted 4 April, 2025;
originally announced April 2025.
-
Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation
Authors:
Xin Zhang,
Robby T. Tan
Abstract:
Vision Foundation Models (VFMs) and Vision-Language Models (VLMs) have gained traction in Domain Generalized Semantic Segmentation (DGSS) due to their strong generalization capabilities. However, existing DGSS methods often rely exclusively on either VFMs or VLMs, overlooking their complementary strengths. VFMs (e.g., DINOv2) excel at capturing fine-grained features, while VLMs (e.g., CLIP) provide robust text alignment but struggle with coarse granularity. Despite their complementary strengths, effectively integrating VFMs and VLMs with attention mechanisms is challenging, as the increased patch tokens complicate long-sequence modeling. To address this, we propose MFuser, a novel Mamba-based fusion framework that efficiently combines the strengths of VFMs and VLMs while maintaining linear scalability in sequence length. MFuser consists of two key components: MVFuser, which acts as a co-adapter to jointly fine-tune the two models by capturing both sequential and spatial dynamics; and MTEnhancer, a hybrid attention-Mamba module that refines text embeddings by incorporating image priors. Our approach achieves precise feature locality and strong text alignment without incurring significant computational overhead. Extensive experiments demonstrate that MFuser significantly outperforms state-of-the-art DGSS methods, achieving 68.20 mIoU on synthetic-to-real and 71.87 mIoU on real-to-real benchmarks. The code is available at https://github.com/devinxzhang/MFuser.
Submitted 15 April, 2025; v1 submitted 4 April, 2025;
originally announced April 2025.
-
Enhanced ECG Arrhythmia Detection Accuracy by Optimizing Divergence-Based Data Fusion
Authors:
Baozhuo Su,
Qingli Dou,
Kang Liu,
Zhengxian Qu,
Jerry Deng,
Ting Tan,
Yanan Gu
Abstract:
AI computation in healthcare faces significant challenges when clinical datasets are limited and heterogeneous. Integrating datasets from multiple sources and different equipment is critical for effective AI computation but is complicated by their diversity, complexity, and lack of representativeness, so multiple datasets often need to be joined for analysis. The currently used method is fusion after normalization, but it can introduce redundant information, decreasing the signal-to-noise ratio and reducing classification accuracy. To tackle this issue, we propose a feature-based fusion algorithm utilizing Kernel Density Estimation (KDE) and Kullback-Leibler (KL) divergence. Our approach first preprocesses the extracted features and estimates their continuous distributions, then employs gradient descent to identify the optimal linear parameters that minimize the KL divergence between the feature distributions. We evaluate the method on our in-house datasets, consisting of ECG signals collected from 2000 healthy and 2000 diseased individuals with different equipment, and verify it on the publicly available PTB-XL dataset, which contains 21,837 ECG recordings from 18,885 patients. We employ a Light Gradient Boosting Machine (LGBM) model for the binary classification. The results demonstrate that the proposed fusion method significantly enhances feature-based classification accuracy for abnormal ECG cases in the merged datasets, compared to the normalization method. This data fusion strategy provides a new approach to processing heterogeneous datasets for optimal AI computation results.
Submitted 19 March, 2025;
originally announced April 2025.
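The fusion step described in the abstract above (KDE on the features, then gradient descent on linear parameters that minimize the KL divergence) can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything here is an assumption: a single 1-D feature, a fixed-bandwidth Gaussian KDE, a linear map a * x + b, finite-difference gradients, and the hypothetical function names shown. It is not the authors' code.

```python
import numpy as np


def gaussian_kde_on_grid(samples, grid, bandwidth):
    """Evaluate a fixed-bandwidth Gaussian KDE of `samples` at each grid point."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    norm = len(samples) * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm


def kl_divergence(p, q, dx, eps=1e-12):
    """Riemann-sum approximation of KL(p || q) on a uniform grid."""
    p = p + eps
    q = q + eps
    return float(np.sum(p * np.log(p / q)) * dx)


def fit_linear_alignment(ref, other, steps=200, lr=0.05, h=1e-3, bandwidth=0.3):
    """Gradient descent on (a, b) so that a * other + b matches `ref` in KL.

    Returns ((a, b), kl_before, kl_after). Illustrative only: the gradient is
    a central finite difference and all hyperparameters are assumed values.
    """
    lo = min(ref.min(), other.min()) - 1.0
    hi = max(ref.max(), other.max()) + 1.0
    grid = np.linspace(lo, hi, 200)
    dx = grid[1] - grid[0]
    p_ref = gaussian_kde_on_grid(ref, grid, bandwidth)

    def loss(params):
        a, b = params
        q = gaussian_kde_on_grid(a * other + b, grid, bandwidth)
        return kl_divergence(p_ref, q, dx)

    params = np.array([1.0, 0.0])  # start from the identity map
    kl_before = loss(params)
    for _ in range(steps):
        grad = np.zeros(2)
        for i in range(2):
            delta = np.zeros(2)
            delta[i] = h
            grad[i] = (loss(params + delta) - loss(params - delta)) / (2.0 * h)
        params -= lr * grad
    return (params[0], params[1]), kl_before, loss(params)
```

Calling `fit_linear_alignment(ref, other)` on two NumPy arrays holding the same feature from two devices returns the fitted `(a, b)` together with the KL divergence before and after alignment. A practical pipeline would repeat this per feature and would likely use analytic gradients.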
-
MMTL-UniAD: A Unified Framework for Multimodal and Multi-Task Learning in Assistive Driving Perception
Authors:
Wenzhuo Liu,
Wenshuo Wang,
Yicheng Qiao,
Qiannan Guo,
Jiayin Zhu,
Pengfei Li,
Zilong Chen,
Huiming Yang,
Zhiwei Li,
Lening Wang,
Tiao Tan,
Huaping Liu
Abstract:
Advanced driver assistance systems require a comprehensive understanding of the driver's mental/physical state and traffic context, but existing works often neglect the potential benefits of joint learning between these tasks. This paper proposes MMTL-UniAD, a unified multi-modal multi-task learning framework that simultaneously recognizes driver behavior (e.g., looking around, talking), driver emotion (e.g., anxiety, happiness), vehicle behavior (e.g., parking, turning), and traffic context (e.g., traffic jam, traffic smooth). A key challenge is avoiding negative transfer between tasks, which can impair learning performance. To address this, we introduce two key components into the framework: one is the multi-axis region attention network to extract global context-sensitive features, and the other is the dual-branch multimodal embedding to learn multimodal embeddings from both task-shared and task-specific features. The former uses a multi-attention mechanism to extract task-relevant features, mitigating negative transfer caused by task-unrelated features. The latter employs a dual-branch structure to adaptively adjust task-shared and task-specific parameters, enhancing cross-task knowledge transfer while reducing task conflicts. We assess MMTL-UniAD on the AIDE dataset, using a series of ablation studies, and show that it outperforms state-of-the-art methods across all four tasks. The code is available at https://github.com/Wenzhuo-Liu/MMTL-UniAD.
Submitted 3 April, 2025;
originally announced April 2025.
-
AdaMHF: Adaptive Multimodal Hierarchical Fusion for Survival Prediction
Authors:
Shuaiyu Zhang,
Xun Lin,
Rongxiang Zhang,
Yu Bai,
Yong Xu,
Tao Tan,
Xunbin Zheng,
Zitong Yu
Abstract:
The integration of pathologic images and genomic data for survival analysis has gained increasing attention with advances in multimodal learning. However, current methods often ignore biological characteristics, such as heterogeneity and sparsity, both within and across modalities, ultimately limiting their adaptability to clinical practice. To address these challenges, we propose AdaMHF: Adaptive Multimodal Hierarchical Fusion, a framework designed for efficient, comprehensive, and tailored feature extraction and fusion. AdaMHF is specifically adapted to the uniqueness of medical data, enabling accurate predictions with minimal resource consumption, even under challenging scenarios with missing modalities. Initially, AdaMHF employs an experts expansion and residual structure to activate specialized experts for extracting heterogeneous and sparse features. Extracted tokens undergo refinement via selection and aggregation, reducing the weight of non-dominant features while preserving comprehensive information. Subsequently, the encoded features are hierarchically fused, allowing multi-grained interactions across modalities to be captured. Furthermore, we introduce a survival prediction benchmark designed to resolve scenarios with missing modalities, mirroring real-world clinical conditions. Extensive experiments on TCGA datasets demonstrate that AdaMHF surpasses current state-of-the-art (SOTA) methods, showcasing exceptional performance in both complete and incomplete modality settings.
Submitted 26 March, 2025;
originally announced March 2025.
-
Synthetic-to-Real Self-supervised Robust Depth Estimation via Learning with Motion and Structure Priors
Authors:
Weilong Yan,
Ming Li,
Haipeng Li,
Shuwei Shao,
Robby T. Tan
Abstract:
Self-supervised depth estimation from monocular cameras in diverse outdoor conditions, such as daytime, rain, and nighttime, is challenging due to the difficulty of learning universal representations and the severe lack of labeled real-world adverse data. Previous methods either rely on synthetic inputs and pseudo-depth labels or directly apply daytime strategies to adverse conditions, leading to suboptimal results. In this paper, we present the first synthetic-to-real robust depth estimation framework, incorporating motion and structure priors to capture real-world knowledge effectively. In the synthetic adaptation, we transfer motion-structure knowledge inside cost volumes for a more robust representation, using a frozen daytime model to train a depth estimator in synthetic adverse conditions. In the real adaptation, which aims to close synthetic-to-real gaps, models trained earlier identify the weather-insensitive regions with a designed consistency-reweighting strategy to emphasize valid pseudo-labels. We introduce a new regularization by gathering explicit depth distributions to constrain the model when facing real-world data. Experiments show that our method outperforms the state-of-the-art across diverse conditions in multi-frame and single-frame evaluations. We achieve improvements of 7.5% and 4.3% in AbsRel and RMSE on average for nuScenes and Robotcar datasets (daytime, nighttime, rain). In zero-shot evaluation on DrivingStereo (rain, fog), our method generalizes better than previous ones.
Submitted 26 March, 2025;
originally announced March 2025.
-
TOI-2005b: An Eccentric Warm Jupiter in Spin-Orbit Alignment
Authors:
Allyson Bieryla,
Jiayin Dong,
George Zhou,
Jason D. Eastman,
L. C. Mayorga,
David W. Latham,
Brad Carter,
Chelsea X. Huang,
Samuel N. Quinn,
Karen A. Collins,
Lyu Abe,
Yuri Beletsky,
Rafael Brahm,
Nicole D. Colón,
Zahra Ensak,
Tristan Guillot,
Thomas Henning,
Melissa J. Hobson,
Keith Horne,
Jon M. Jenkins,
Matías I. Jones,
Andrés Jordán,
David Osip,
George R. Ricker,
Joseph E. Rodriguez
, et al. (14 additional authors not shown)
Abstract:
We report the discovery and characterization of TOI-2005b, a warm Jupiter on an eccentric (e~0.59), 17.3-day orbit around a V_mag = 9.867 rapidly rotating F-star. The object was detected as a candidate by TESS and the planetary nature of TOI-2005b was then confirmed via a series of ground-based photometric, spectroscopic, and diffraction-limited imaging observations. The planet was found to reside in a low sky-projected stellar obliquity orbit (lambda = 4.8 degrees) via a transit spectroscopic observation using the Magellan MIKE spectrograph. TOI-2005b is one of a few planets known to have a low-obliquity, high-eccentricity orbit, which may be the result of high-eccentricity coplanar migration. The planet has a periastron equilibrium temperature of ~ 2100 K, similar to some highly irradiated hot Jupiters where atomic metal species have been detected in transmission spectroscopy, and varies by almost 1000 K during its orbit. Future observations of the atmosphere of TOI-2005b can inform us about its radiative timescales thanks to the rapid heating and cooling of the planet.
Submitted 25 March, 2025;
originally announced March 2025.
-
3DSwapping: Texture Swapping For 3D Object From Single Reference Image
Authors:
Xiao Cao,
Beibei Lin,
Bo Wang,
Zhiyong Huang,
Robby T. Tan
Abstract:
3D texture swapping allows for the customization of 3D object textures, enabling efficient and versatile visual transformations in 3D editing. While no dedicated method exists, adapted 2D editing and text-driven 3D editing approaches can serve this purpose. However, 2D editing requires frame-by-frame manipulation, causing inconsistencies across views, while text-driven 3D editing struggles to preserve texture characteristics from reference images. To tackle these challenges, we introduce 3DSwapping, a 3D texture swapping method that integrates: 1) progressive generation, 2) view-consistency gradient guidance, and 3) prompt-tuned gradient guidance. To ensure view consistency, our progressive generation process starts by editing a single reference image and gradually propagates the edits to adjacent views. Our view-consistency gradient guidance further reinforces consistency by conditioning the generation model on feature differences between consistent and inconsistent outputs. To preserve texture characteristics, we introduce prompt-tuning-based gradient guidance, which learns a token that precisely captures the difference between the reference image and the 3D object. This token then guides the editing process, ensuring more consistent texture preservation across views. Overall, 3DSwapping integrates these novel strategies to achieve higher-fidelity texture transfer while preserving structural coherence across multiple viewpoints. Extensive qualitative and quantitative evaluations confirm that our three novel components enable convincing and effective 2D texture swapping for 3D objects. Code will be available upon acceptance.
Submitted 24 March, 2025;
originally announced March 2025.
-
FetalFlex: Anatomy-Guided Diffusion Model for Flexible Control on Fetal Ultrasound Image Synthesis
Authors:
Yaofei Duan,
Tao Tan,
Zhiyuan Zhu,
Yuhao Huang,
Yuanji Zhang,
Rui Gao,
Patrick Cheong-Iao Pang,
Xinru Gao,
Guowei Tao,
Xiang Cong,
Zhou Li,
Lianying Liang,
Guangzhi He,
Linliang Yin,
Xuedong Deng,
Xin Yang,
Dong Ni
Abstract:
Fetal ultrasound (US) examinations require the acquisition of multiple planes, each providing unique diagnostic information to evaluate fetal development and screen for congenital anomalies. However, obtaining a comprehensive, multi-plane annotated fetal US dataset remains challenging, particularly for rare or complex anomalies owing to their low incidence and numerous subtypes. This poses difficulties in training novice radiologists and developing robust AI models, especially for detecting abnormal fetuses. In this study, we introduce a Flexible Fetal US image generation framework (FetalFlex) to address these challenges, which leverages anatomical structures and multimodal information to enable controllable synthesis of fetal US images across diverse planes. Specifically, FetalFlex incorporates a pre-alignment module to enhance controllability and introduces a repaint strategy to ensure consistent texture and appearance. Moreover, a two-stage adaptive sampling strategy is developed to progressively refine image quality from coarse to fine levels. We believe that FetalFlex is the first method capable of generating both in-distribution normal and out-of-distribution abnormal fetal US images, without requiring any abnormal data. Experiments on multi-center datasets demonstrate that FetalFlex achieves state-of-the-art performance across multiple image quality metrics. A reader study further confirms the close alignment of the generated results with expert visual assessments. Furthermore, synthetic images generated by FetalFlex significantly improve the performance of six typical deep models in downstream classification and anomaly detection tasks. Lastly, FetalFlex's anatomy-level controllable generation offers a unique advantage for anomaly simulation and creating paired or counterfactual data at the pixel level. The demo is available at: https://dyf1023.github.io/FetalFlex/.
Submitted 19 March, 2025;
originally announced March 2025.
-
Data Release 1 of the Dark Energy Spectroscopic Instrument
Authors:
DESI Collaboration,
M. Abdul-Karim,
A. G. Adame,
D. Aguado,
J. Aguilar,
S. Ahlen,
S. Alam,
G. Aldering,
D. M. Alexander,
R. Alfarsy,
L. Allen,
C. Allende Prieto,
O. Alves,
A. Anand,
U. Andrade,
E. Armengaud,
S. Avila,
A. Aviles,
H. Awan,
S. Bailey,
A. Baleato Lizancos,
O. Ballester,
A. Bault,
J. Bautista,
S. BenZvi
, et al. (253 additional authors not shown)
Abstract:
In 2021 May the Dark Energy Spectroscopic Instrument (DESI) collaboration began a 5-year spectroscopic redshift survey to produce a detailed map of the evolving three-dimensional structure of the universe between $z=0$ and $z\approx4$. DESI's principal scientific objectives are to place precise constraints on the equation of state of dark energy, the gravitationally driven growth of large-scale structure, and the sum of the neutrino masses, and to explore the observational signatures of primordial inflation. We present DESI Data Release 1 (DR1), which consists of all data acquired during the first 13 months of the DESI main survey, as well as a uniform reprocessing of the DESI Survey Validation data which was previously made public in the DESI Early Data Release. The DR1 main survey includes high-confidence redshifts for 18.7M objects, of which 13.1M are spectroscopically classified as galaxies, 1.6M as quasars, and 4M as stars, making DR1 the largest sample of extragalactic redshifts ever assembled. We summarize the DR1 observations, the spectroscopic data-reduction pipeline and data products, large-scale structure catalogs, and value-added catalogs, and describe how to access and interact with the data. In addition to fulfilling its core cosmological objectives with unprecedented precision, we expect DR1 to enable a wide range of transformational astrophysical studies and discoveries.
Submitted 18 March, 2025;
originally announced March 2025.
-
Constraints on Neutrino Physics from DESI DR2 BAO and DR1 Full Shape
Authors:
W. Elbers,
A. Aviles,
H. E. Noriega,
D. Chebat,
A. Menegas,
C. S. Frenk,
C. Garcia-Quintero,
D. Gonzalez,
M. Ishak,
O. Lahav,
K. Naidoo,
G. Niz,
C. Yèche,
M. Abdul-Karim,
S. Ahlen,
O. Alves,
U. Andrade,
E. Armengaud,
S. BenZvi,
D. Bianchi,
S. Brieden,
A. Brodzeller,
D. Brooks,
E. Burtin,
R. Calderon
, et al. (94 additional authors not shown)
Abstract:
The Dark Energy Spectroscopic Instrument (DESI) Collaboration has obtained robust measurements of baryon acoustic oscillations (BAO) in the redshift range, $0.1 < z < 4.2$, based on the Lyman-$α$ forest and galaxies from Data Release 2 (DR2). We combine these measurements with external cosmic microwave background (CMB) data from Planck and ACT to place our tightest constraints yet on the sum of neutrino masses. Assuming the cosmological $Λ$CDM model and three degenerate neutrino states, we find $\sum m_ν<0.0642$ eV (95%). When accounting for neutrino oscillation constraints, we find a preference for the normal mass ordering and an upper bound of $m_l < 0.023$ eV (95%) on the lightest neutrino mass. However, we determine using frequentist and Bayesian methods that our constraints are in moderate tension with the lower limits derived from neutrino oscillations. Correcting for the physical boundary at zero mass, we report a 95% Feldman-Cousins upper bound of $\sum m_ν<0.053$ eV, breaching the lower limit from neutrino oscillations. Considering a more general Bayesian analysis with an effective cosmological neutrino mass parameter, $\sum m_{ν,\mathrm{eff}}$, that allows for negative energy densities and removes unsatisfactory prior weight effects, we derive constraints that are in $3σ$ tension with the same oscillation limit. In the absence of unknown systematics, this finding could be interpreted as a hint of new physics not necessarily related to neutrinos. The preference of DESI and CMB data for an evolving dark energy model offers one possible solution. In the $w_0w_a$CDM model, we find $\sum m_ν<0.163$ eV (95%), resolving the neutrino tension. [Abridged]
Submitted 3 April, 2025; v1 submitted 18 March, 2025;
originally announced March 2025.
-
Extended Dark Energy analysis using DESI DR2 BAO measurements
Authors:
K. Lodha,
R. Calderon,
W. L. Matthewson,
A. Shafieloo,
M. Ishak,
J. Pan,
C. Garcia-Quintero,
D. Huterer,
G. Valogiannis,
L. A. Ureña-López,
N. V. Kamble,
D. Parkinson,
A. G. Kim,
G. B. Zhao,
J. L. Cervantes-Cota,
J. Rohlf,
F. Lozano-Rodríguez,
J. O. Román-Herrera,
M. Abdul-Karim,
J. Aguilar,
S. Ahlen,
O. Alves,
U. Andrade,
E. Armengaud,
A. Aviles
, et al. (100 additional authors not shown)
Abstract:
We conduct an extended analysis of dark energy constraints, in support of the findings of the DESI DR2 cosmology key paper, including DESI data, Planck CMB observations, and three different supernova compilations. Using a broad range of parametric and non-parametric methods, we explore the dark energy phenomenology and find consistent trends across all approaches, in good agreement with the $w_0w_a$CDM key paper results. Even with the additional flexibility introduced by non-parametric approaches, such as binning and Gaussian Processes, we find that extending $Λ$CDM to include a two-parameter $w(z)$ is sufficient to capture the trends present in the data. Finally, we examine three dark energy classes with distinct dynamics, including quintessence scenarios satisfying $w \geq -1$, to explore what underlying physics can explain such deviations. The current data indicate a clear preference for models that feature a phantom crossing; although alternatives lacking this feature are disfavored, they cannot yet be ruled out. Our analysis confirms that the evidence for dynamical dark energy, particularly at low redshift ($z \lesssim 0.3$), is robust and stable under different modeling choices.
Submitted 3 April, 2025; v1 submitted 18 March, 2025;
originally announced March 2025.
-
Validation of the DESI DR2 Measurements of Baryon Acoustic Oscillations from Galaxies and Quasars
Authors:
U. Andrade,
E. Paillas,
J. Mena-Fernández,
Q. Li,
A. J. Ross,
S. Nadathur,
M. Rashkovetskyi,
A. Pérez-Fernández,
H. Seo,
N. Sanders,
O. Alves,
X. Chen,
N. Deiosso,
A. de Mattia,
M. White,
M. Abdul-Karim,
S. Ahlen,
E. Armengaud,
A. Aviles,
D. Bianchi,
S. Brieden,
A. Brodzeller,
D. Brooks,
E. Burtin,
R. Calderon
, et al. (94 additional authors not shown)
Abstract:
The Dark Energy Spectroscopic Instrument (DESI) data release 2 (DR2) galaxy and quasar clustering data represents a significant expansion of data from DR1, providing improved statistical precision in BAO constraints across multiple tracers, including bright galaxies (BGS), luminous red galaxies (LRGs), emission line galaxies (ELGs), and quasars (QSOs). In this paper, we validate the BAO analysis of DR2. We present the results of robustness tests on the blinded DR2 data and, after unblinding, consistency checks on the unblinded DR2 data. All results are compared to those obtained from a suite of mock catalogs that replicate the selection and clustering properties of the DR2 sample. We confirm the consistency of DR2 BAO measurements with DR1 while achieving a reduction in statistical uncertainties due to the increased survey volume and completeness. We assess the impact of analysis choices, including different data vectors (correlation function vs. power spectrum), modeling approaches and systematics treatments, and the assumption of a Gaussian likelihood, finding that our BAO constraints are stable across these variations and assumptions with a few minor refinements to the baseline setup of the DR1 BAO analysis. We summarize a series of pre-unblinding tests that confirmed the readiness of our analysis pipeline, the final systematic errors, and the DR2 BAO analysis baseline. The successful completion of these tests led to the unblinding of the DR2 BAO measurements, ultimately leading to the DESI DR2 cosmological analysis, with their implications for the expansion history of the Universe and the nature of dark energy presented in the DESI key paper.
Submitted 27 March, 2025; v1 submitted 18 March, 2025;
originally announced March 2025.
-
Validation of the DESI DR2 Ly$α$ BAO analysis using synthetic datasets
Authors:
L. Casas,
H. K. Herrera-Alcantar,
J. Chaves-Montero,
A. Cuceu,
A. Font-Ribera,
M. Lokken,
M. Abdul-Karim,
C. Ramírez-Pérez,
J. Aguilar,
S. Ahlen,
U. Andrade,
E. Armengaud,
A. Aviles,
S. Bailey,
S. BenZvi,
D. Bianchi,
A. Brodzeller,
D. Brooks,
R. Canning,
A. Carnero Rosell,
M. Charles,
E. Chaussidon,
T. Claybaugh,
K. S. Dawson,
A. de la Macorra
, et al. (73 additional authors not shown)
Abstract:
The second data release (DR2) of the Dark Energy Spectroscopic Instrument (DESI), containing data from the first three years of observations, doubles the number of Lyman-$α$ (Ly$α$) forest spectra in DR1 and provides the largest dataset of its kind. To ensure a robust validation of the Baryon Acoustic Oscillation (BAO) analysis using Ly$α$ forests, we have made significant updates compared to DR1 to both the mocks and the analysis framework used in the validation. In particular, we present CoLoRe-QL, a new set of Ly$α$ mocks that use a quasi-linear input power spectrum to incorporate the non-linear broadening of the BAO peak. We have also increased the number of realisations used in the validation to 400, compared to the 150 realisations used in DR1. Finally, we present a detailed study of the impact of quasar redshift errors on the BAO measurement, and we compare different strategies to mask Damped Lyman-$α$ Absorbers (DLAs) in our spectra. The BAO measurement from the Ly$α$ dataset of DESI DR2 is presented in a companion publication.
Submitted 18 March, 2025;
originally announced March 2025.
-
Construction of the Damped Ly$α$ Absorber Catalog for DESI DR2 Ly$α$ BAO
Authors:
A. Brodzeller,
M. Wolfson,
D. M. Santos,
M. Ho,
T. Tan,
M. M. Pieri,
A. Cuceu,
M. Abdul-Karim,
J. Aguilar,
S. Ahlen,
A. Anand,
U. Andrade,
E. Armengaud,
A. Aviles,
S. Bailey,
A. Bault,
D. Bianchi,
D. Brooks,
R. Canning,
L. Casas,
M. Charles,
E. Chaussidon,
J. Chaves-Montero,
D. Chebat,
T. Claybaugh
, et al. (74 additional authors not shown)
Abstract:
We present the Damped Ly$α$ Toolkit for automated detection and characterization of Damped Ly$α$ absorbers (DLAs) in quasar spectra. Our method uses quasar spectral templates with and without absorption from intervening DLAs to reconstruct observed quasar forest regions. The best-fitting model determines whether a DLA is present while estimating the redshift and HI column density. With an optimized quality cut on detection significance ($Δχ_{r}^2>0.03$), the technique achieves an estimated 80% purity and 79% completeness when evaluated on simulated spectra with S/N $>2$ that are free of broad absorption lines (BAL). We provide a catalog containing candidate DLAs from the DLA Toolkit detected in DESI DR1 quasar spectra, of which 21,719 were found in S/N $>2$ spectra with predicted $\log_{10} (N_{\mathrm{HI}}) > 20.3$ and detection significance $Δχ_{r}^2 >0.03$. We compare the Damped Ly$α$ Toolkit to two alternative DLA finders based on a convolutional neural network (CNN) and Gaussian process (GP) models. We present a strategy for combining these three techniques to produce a high-fidelity DLA catalog from DESI DR2 for the Ly$α$ forest baryon acoustic oscillation measurement. The combined catalog contains 41,152 candidate DLAs with $\log_{10} (N_{\mathrm{HI}}) > 20.3$ from quasar spectra with S/N $>2$. We estimate this sample to be approximately 85% pure and 79% complete when BAL quasars are excluded.
Submitted 9 June, 2025; v1 submitted 18 March, 2025;
originally announced March 2025.
-
DESI DR2 Results I: Baryon Acoustic Oscillations from the Lyman Alpha Forest
Authors:
DESI Collaboration,
M. Abdul-Karim,
J. Aguilar,
S. Ahlen,
C. Allende Prieto,
O. Alves,
A. Anand,
U. Andrade,
E. Armengaud,
A. Aviles,
S. Bailey,
A. Bault,
S. BenZvi,
D. Bianchi,
C. Blake,
A. Brodzeller,
D. Brooks,
E. Buckley-Geer,
E. Burtin,
R. Calderon,
R. Canning,
A. Carnero Rosell,
P. Carrilho,
L. Casas,
F. J. Castander
, et al. (124 additional authors not shown)
Abstract:
We present the Baryon Acoustic Oscillation (BAO) measurements with the Lyman-alpha (LyA) forest from the second data release (DR2) of the Dark Energy Spectroscopic Instrument (DESI) survey. Our BAO measurements include both the auto-correlation of the LyA forest absorption observed in the spectra of high-redshift quasars and the cross-correlation of the absorption with the quasar positions. The total sample size is approximately a factor of two larger than the DR1 dataset, with forest measurements in over 820,000 quasar spectra and the positions of over 1.2 million quasars. We describe several significant improvements to our analysis in this paper, and two supporting papers describe improvements to the synthetic datasets that we use for validation and how we identify damped LyA absorbers. Our main result is that we have measured the BAO scale with a statistical precision of 1.1% along and 1.3% transverse to the line of sight, for a combined precision of 0.65% on the isotropic BAO scale at $z_{eff} = 2.33$. This excellent precision, combined with recent theoretical studies of the BAO shift due to nonlinear growth, motivated us to include a systematic error term in LyA BAO analysis for the first time. We measure the ratios $D_H(z_{eff})/r_d = 8.632 \pm 0.098 \pm 0.026$ and $D_M(z_{eff})/r_d = 38.99 \pm 0.52 \pm 0.12$, where $D_H = c/H(z)$ is the Hubble distance, $D_M$ is the transverse comoving distance, $r_d$ is the sound horizon at the drag epoch, and we quote both the statistical and the theoretical systematic uncertainty. The companion paper presents the BAO measurements at lower redshifts from the same dataset and the cosmological interpretation.
Submitted 26 March, 2025; v1 submitted 18 March, 2025;
originally announced March 2025.
-
DESI DR2 Results II: Measurements of Baryon Acoustic Oscillations and Cosmological Constraints
Authors:
DESI Collaboration,
M. Abdul-Karim,
J. Aguilar,
S. Ahlen,
S. Alam,
L. Allen,
C. Allende Prieto,
O. Alves,
A. Anand,
U. Andrade,
E. Armengaud,
A. Aviles,
S. Bailey,
C. Baltay,
P. Bansal,
A. Bault,
J. Behera,
S. BenZvi,
D. Bianchi,
C. Blake,
S. Brieden,
A. Brodzeller,
D. Brooks,
E. Buckley-Geer,
E. Burtin
, et al. (162 additional authors not shown)
Abstract:
We present baryon acoustic oscillation (BAO) measurements from more than 14 million galaxies and quasars drawn from the Dark Energy Spectroscopic Instrument (DESI) Data Release 2 (DR2), based on three years of operation. For cosmology inference, these galaxy measurements are combined with DESI Lyman-$α$ forest BAO results presented in a companion paper. The DR2 BAO results are consistent with DESI DR1 and SDSS, and their distance-redshift relationship matches those from recent compilations of supernovae (SNe) over the same redshift range. The results are well described by a flat $Λ$CDM model, but the parameters preferred by BAO are in mild, $2.3σ$ tension with those determined from the cosmic microwave background (CMB), although the DESI results are consistent with the acoustic angular scale $θ_*$ that is well-measured by Planck. This tension is alleviated by dark energy with a time-evolving equation of state parametrized by $w_0$ and $w_a$, which provides a better fit to the data, with a favored solution in the quadrant with $w_0>-1$ and $w_a<0$. This solution is preferred over $Λ$CDM at $3.1σ$ for the combination of DESI BAO and CMB data. When also including SNe, the preference for a dynamical dark energy model over $Λ$CDM ranges from $2.8-4.2σ$ depending on which SNe sample is used. We present evidence from other data combinations which also favor the same behavior at high significance. From the combination of DESI and CMB we derive 95% upper limits on the sum of neutrino masses, finding $\sum m_ν<0.064$ eV assuming $Λ$CDM and $\sum m_ν<0.16$ eV in the $w_0w_a$ model. Unless there is an unknown systematic error associated with one or more datasets, it is clear that $Λ$CDM is being challenged by the combination of DESI BAO with other measurements and that dynamical dark energy offers a possible solution.
Submitted 26 March, 2025; v1 submitted 18 March, 2025;
originally announced March 2025.
-
Aligning Multimodal LLM with Human Preference: A Survey
Authors:
Tao Yu,
Yi-Fan Zhang,
Chaoyou Fu,
Junkang Wu,
Jinda Lu,
Kun Wang,
Xingyu Lu,
Yunhang Shen,
Guibin Zhang,
Dingjie Song,
Yibo Yan,
Tianlong Xu,
Qingsong Wen,
Zhang Zhang,
Yan Huang,
Liang Wang,
Tieniu Tan
Abstract:
Large language models (LLMs) can handle a wide variety of general tasks with simple prompts, without the need for task-specific training. Multimodal Large Language Models (MLLMs), built upon LLMs, have demonstrated impressive potential in tackling complex tasks involving visual, auditory, and textual data. However, critical issues related to truthfulness, safety, o1-like reasoning, and alignment with human preference remain insufficiently addressed. This gap has spurred the emergence of various alignment algorithms, each targeting different application scenarios and optimization goals. Recent studies have shown that alignment algorithms are a powerful approach to resolving the aforementioned challenges. In this paper, we aim to provide a comprehensive and systematic review of alignment algorithms for MLLMs. Specifically, we explore four key aspects: (1) the application scenarios covered by alignment algorithms, including general image understanding, multi-image, video, and audio, and extended multimodal applications; (2) the core factors in constructing alignment datasets, including data sources, model responses, and preference annotations; (3) the benchmarks used to evaluate alignment algorithms; and (4) a discussion of potential future directions for the development of alignment algorithms. This work seeks to help researchers organize current advancements in the field and inspire better alignment methods. The project page of this paper is available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Alignment.
Submitted 23 March, 2025; v1 submitted 18 March, 2025;
originally announced March 2025.
-
A spinless crystal for a high-performance solid-state $^{229}$Th nuclear clock
Authors:
Harry W. T. Morgan,
James E. S. Terhune,
Ricky Elwell,
Hoang Bao Tran Tan,
Udeshika C. Perera,
Andrei Derevianko,
Eric R. Hudson,
Anastassia N. Alexandrova
Abstract:
Solid-state $^{229}$Th nuclear clocks require a host material whose band gap is larger than the 8.4 eV nuclear transition energy. As such, excitation of the $^{229}$Th nuclear state has so far only been demonstrated in metal fluorides, specifically CaF$_2$, LiSrAlF$_6$, and ThF$_4$, where the large electronegativity of the halogen leads to sufficient band gaps. However, it is expected that the nuclear magnetic moment of the fluorine gives rise to a leading-order broadening mechanism that limits the clock stability. Here, we use concepts of molecular design to identify a polyatomic anion, SO$_4^{2-}$, that is both nuclear-spin free and of sufficient electron affinity to result in a high-band-gap metal sulfate system. Using state-of-the-art calculations, we find that the band gap of Th(SO$_4$)$_2$ is approximately 9 eV, large enough for direct laser excitation of $^{229}$Th. Low concentrations of $^{229}$Th in the otherwise spinless $^{232}$Th(SO$_4$)$_2$ crystal mitigate $^{229}$Th-$^{229}$Th interactions. Furthermore, the introduction of $^{229}$Th neither modifies the material band gap nor introduces electronic states associated with nuclear quenching. By removing one of the primary sources of nuclear line broadening in the crystal, the nuclear magnetic dipole-dipole interaction, a nuclear clock with instability as low as $σ = 4.6\times10^{-23}/\sqrt{τ}$, where $τ$ is the averaging time, may be realized. This is roughly six orders of magnitude lower than previously thought possible.
Submitted 14 March, 2025;
originally announced March 2025.
-
Exploring the best way for UAV visual localization under Low-altitude Multi-view Observation Condition: a Benchmark
Authors:
Yibin Ye,
Xichao Teng,
Shuo Chen,
Zhang Li,
Leqi Liu,
Qifeng Yu,
Tao Tan
Abstract:
Absolute Visual Localization (AVL) enables an Unmanned Aerial Vehicle (UAV) to determine its position in GNSS-denied environments by establishing geometric relationships between UAV images and geo-tagged reference maps. While many previous works have achieved AVL with image retrieval and matching techniques, research on low-altitude multi-view scenarios remains limited. The low-altitude multi-view condition presents greater challenges due to extreme viewpoint changes. To explore the best UAV AVL approach under this condition, we propose this benchmark. First, we construct a large-scale low-altitude multi-view dataset called AnyVisLoc. The dataset includes 18,000 images captured at multiple scenes and altitudes, along with 2.5D reference maps containing aerial photogrammetry maps and historical satellite maps. Second, we propose a unified framework that integrates state-of-the-art AVL approaches and comprehensively tests their performance. The best combined method is chosen as the baseline, and the key factors that influence localization accuracy are thoroughly analyzed based on it. This baseline achieves 74.1% localization accuracy within 5 m under low-altitude, multi-view conditions. In addition, we introduce a novel retrieval metric called PDM@K to better align with the characteristics of the UAV AVL task. Overall, this benchmark reveals the challenges of low-altitude, multi-view UAV AVL and provides valuable guidance for future research. The dataset and code are available at https://github.com/UAV-AVL/Benchmark
Submitted 11 March, 2025;
originally announced March 2025.
-
MoEdit: On Learning Quantity Perception for Multi-object Image Editing
Authors:
Yanfeng Li,
Kahou Chan,
Yue Sun,
Chantong Lam,
Tong Tong,
Zitong Yu,
Keren Fu,
Xiaohong Liu,
Tao Tan
Abstract:
Multi-object images are prevalent in various real-world scenarios, including augmented reality, advertisement design, and medical imaging. Efficient and precise editing of these images is critical for these applications. With the advent of Stable Diffusion (SD), high-quality image generation and editing have entered a new era. However, existing methods often struggle to consider each object both individually and as part of the whole image during editing, both of which are crucial for ensuring consistent quantity perception, resulting in suboptimal perceptual performance. To address these challenges, we propose MoEdit, an auxiliary-free multi-object image editing framework. MoEdit facilitates high-quality multi-object image editing in terms of style transfer, object reinvention, and background regeneration, while ensuring consistent quantity perception between inputs and outputs, even with a large number of objects. To achieve this, we introduce the Feature Compensation (FeCom) module, which ensures the distinction and separability of each object attribute by minimizing the in-between interlacing. Additionally, we present the Quantity Attention (QTTN) module, which perceives and preserves quantity consistency through effective control during editing, without relying on auxiliary tools. By leveraging the SD model, MoEdit enables customized preservation and modification of specific concepts in inputs with high quality. Experimental results demonstrate that MoEdit achieves state-of-the-art (SOTA) performance in multi-object image editing. Data and code will be available at https://github.com/Tear-kitty/MoEdit.
Submitted 13 March, 2025;
originally announced March 2025.