-
Impact of pion tensor force on alpha clustering in $^{20}$Ne
Authors:
Zhao Jing Chen,
Bao Yuan Sun
Abstract:
Nuclear clustering, as a quantum phase transition phenomenon governed by strong interactions, exhibits characteristics that are highly sensitive to the specific features of nuclear forces. Here, we examine how nuclear deformation and tensor forces influence $α$-cluster formation in light nuclei. The axially deformed relativistic Hartree-Fock-Bogoliubov model is used to investigate the clustering structure of the $^{20}$Ne nucleus, in both the ground state and the superdeformed prolate excited state. The nuclear binding energies and the canonical single-particle levels are obtained at different quadrupole deformations, and the role of the tensor force embedded in the Fock diagram of the $π$-pseudovector ($π$-PV) coupling is revealed. It is shown that the level branches from the degenerate spherical orbits in the deformed prolate case are enlarged owing to the extra contribution from the pion-exchange tensor force. Correspondingly, the excitation energy of this superdeformed prolate state is reduced by the noncentral tensor interaction, leading to a predicted value that is much closer to the reference threshold for the $2α$ decay mode of $^{20}$Ne. Possible $α$-clustering configurations in $^{20}$Ne are then characterized by examining the nucleonic localization function. Although the contribution to the ground state is relatively small, the density profile and nucleonic localization of the superdeformed prolate excited state are significantly changed by the pion tensor force, as further evidenced by characterizing the level mixing in the spherical basis components. The results reveal the extra role of the tensor force, correlated with the evolution of single-particle levels with nuclear deformation, in the formation and stability of nuclear clustering.
Submitted 20 June, 2025;
originally announced June 2025.
-
Multimodal Fused Learning for Solving the Generalized Traveling Salesman Problem in Robotic Task Planning
Authors:
Jiaqi Chen,
Mingfeng Fan,
Xuefeng Zhang,
Jingsong Liang,
Yuhong Cao,
Guohua Wu,
Guillaume Adrien Sartoretti
Abstract:
Effective and efficient task planning is essential for mobile robots, especially in applications like warehouse retrieval and environmental monitoring. These tasks often involve selecting one location from each of several target clusters, forming a Generalized Traveling Salesman Problem (GTSP) that remains challenging to solve both accurately and efficiently. To address this, we propose a Multimodal Fused Learning (MMFL) framework that leverages both graph and image-based representations to capture complementary aspects of the problem, and learns a policy capable of generating high-quality task planning schemes in real time. Specifically, we first introduce a coordinate-based image builder that transforms GTSP instances into spatially informative representations. We then design an adaptive resolution scaling strategy to enhance adaptability across different problem scales, and develop a multimodal fusion module with dedicated bottlenecks that enables effective integration of geometric and spatial features. Extensive experiments show that our MMFL approach significantly outperforms state-of-the-art methods across various GTSP instances while maintaining the computational efficiency required for real-time robotic applications. Physical robot tests further validate its practical effectiveness in real-world scenarios.
Submitted 20 June, 2025;
originally announced June 2025.
-
Parton Distributions on a Quantum Computer
Authors:
Jiunn-Wei Chen,
Yu-Ting Chen,
Ghanashyam Meher
Abstract:
We perform the first quantum computation of a parton distribution function (PDF) on a real quantum device by calculating the PDF of the lightest positronium in the Schwinger model with IBM quantum computers. The calculation uses 10 qubits for staggered fermions at five spatial sites and one ancillary qubit. The most critical and challenging step is to reduce the two-qubit gate depth to around 500 so that sensible results start to emerge. The central values of the resulting lightcone correlators agree very well with the classical simulator result, although the error is still large. Compared with classical approaches, quantum computation has the advantage that the accessible range of the parton momentum fraction $x$ is not limited by renormalon ambiguity, and that it avoids the difficulty of accessing non-valence partons. A PDF calculation with 3+1 dimensional QCD near $x=0$ or $x=1$ will be a clear demonstration of the quantum advantage on a problem with great scientific impact.
Submitted 20 June, 2025;
originally announced June 2025.
-
Omnidirectionally manipulated skyrmions in an orientationally chiral system
Authors:
Jiahao Chen,
Wentao Tang,
Xingzhou Tang,
Yang Ding,
Jie Ni,
Yuxi Chen,
Bingxiang Li,
Rui Zhang,
Juan de Pablo,
Yanqing Lu
Abstract:
Skyrmions, originally from condensed matter physics, have been widely explored in various physical systems, including soft matter. A crucial challenge in manipulating topological solitary waves like skyrmions is controlling their flow on demand. Here, we steer skyrmions in an arbitrary direction in a chiral liquid crystal system by adjusting the bias of the applied alternating current electric field. Specifically, the velocity, including both moving direction and speed, can be continuously changed. The motion control of skyrmions originates from the symmetry breaking of the topological structure induced by the flexoelectric polarization effect. The omnidirectional control of topological solitons opens new avenues in light-steering and racetrack memories.
Submitted 20 June, 2025;
originally announced June 2025.
-
Polar solitons in a nonpolar chiral soft matter system
Authors:
Jiahao Chen,
Xingzhou Tang,
Yang Ding,
Susanta Chakraborty,
Satoshi Aya,
Bingxiang Li,
Yanqing Lu
Abstract:
Polar solitons, i.e., solitonic waves accompanying asymmetry of geometry or phase, have garnered attention in polar systems, such as ferroelectric or magnetoelectric materials, where they play a critical role in topological transitions and nonreciprocal responses to external fields. A key question is whether such polar solitons can emerge in nonpolar systems, where intrinsic polarity is absent. Here, we demonstrate an unprecedented polar soliton with nematic order in a nonpolar and chiral liquid crystal system by applying an alternating electric field. The soliton is corn-kernel-shaped, displaying a pair of oppositely charged topological defects at its two ends. While head-to-head collision between the solitons leads to repulsion, head-to-tail collision attracts the solitons into a single polar soliton. A rich variety of solitonic kinetics, such as rectilinear translation and circulation motions, can be activated by controlling the voltage and frequency of an electric field. Simulations reveal that the formation of the polar solitons is achieved through balancing the electric and nematic elastic energies, while the flexoelectric effect drives their rotational behaviors. The discovery of polar solitons in nonpolar systems expands the understanding of topological solitons, opening new avenues for dynamic control in soft matter systems, with potential applications in nonreciprocal responsive materials and topological information storage.
Submitted 20 June, 2025;
originally announced June 2025.
-
Fast solvers for Tokamak fluid models with PETSC -- Part I
Authors:
Mark F. Adams,
Jin Chen,
Benjamin Sturdevant
Abstract:
This work begins the development of fast, scalable solvers for scientific and engineering-relevant magnetohydrodynamics (MHD) models of tokamaks using multigrid methods. These tokamak models are characterized by a distinguished direction in the toroidal coordinate that is partially aligned with the magnetic guide field, which dominates the plasma dynamics. All tokamak models exploit this structure. For example, NIMROD (https://nimrodteam.org) uses $2D$, unstructured, high-order finite elements in the poloidal plane with Fourier modes in the toroidal coordinate, and the $3D$, extended MHD code M3D-C1 (https://m3dc1.pppl.gov) uses $2D$, unstructured $C^1$ elements in the poloidal plane with cubic Hermite functions in the toroidal direction. This structure suggests addressing the toroidal coordinate first, which NIMROD does at the formulation level, whereas the M3D-C1 approach leaves it in the algebraic system to be solved at each time step in an implicit time integrator. This work addresses the toroidal coordinate in the M3D-C1 velocity solve by adding semi-coarsening multigrid to the existing block Jacobi solver in PETSC (Portable, Extensible Toolkit for Scientific Computation, https://petsc.org). This requires little new code and allows for smaller Jacobi subdomains that are better suited to contemporary, highly parallel hardware. Competitive performance of this new solver configuration is demonstrated on a self-consistent runaway electron model of a SPARC (https://cfs.energy/technology/sparc) disruption, and the next steps in the development of this new approach are outlined.
Submitted 5 July, 2025; v1 submitted 19 June, 2025;
originally announced June 2025.
-
AGC-Drive: A Large-Scale Dataset for Real-World Aerial-Ground Collaboration in Driving Scenarios
Authors:
Yunhao Hou,
Bochao Zou,
Min Zhang,
Ran Chen,
Shangdong Yang,
Yanmei Zhang,
Junbao Zhuo,
Siheng Chen,
Jiansheng Chen,
Huimin Ma
Abstract:
By sharing information across multiple agents, collaborative perception helps autonomous vehicles mitigate occlusions and improve overall perception accuracy. Most previous work focuses on vehicle-to-vehicle and vehicle-to-infrastructure collaboration, with limited attention to the aerial perspectives provided by UAVs, which uniquely offer dynamic, top-down views to alleviate occlusions and monitor large-scale interactive environments. A major reason for this is the lack of high-quality datasets for aerial-ground collaborative scenarios. To bridge this gap, we present AGC-Drive, the first large-scale real-world dataset for Aerial-Ground Cooperative 3D perception. The data collection platform consists of two vehicles, each equipped with five cameras and one LiDAR sensor, and one UAV carrying a forward-facing camera and a LiDAR sensor, enabling comprehensive multi-view and multi-agent perception. Consisting of approximately 120K LiDAR frames and 440K images, the dataset covers 14 diverse real-world driving scenarios, including urban roundabouts, highway tunnels, and on/off ramps. Notably, 19.5% of the data comprises dynamic interaction events, including vehicle cut-ins, cut-outs, and frequent lane changes. AGC-Drive contains 400 scenes, each with approximately 100 frames and fully annotated 3D bounding boxes covering 13 object categories. We provide benchmarks for two 3D perception tasks: vehicle-to-vehicle collaborative perception and vehicle-to-UAV collaborative perception. Additionally, we release an open-source toolkit, including spatiotemporal alignment verification tools, multi-agent visualization systems, and collaborative annotation utilities. The dataset and code are available at https://github.com/PercepX/AGC-Drive.
Submitted 19 June, 2025;
originally announced June 2025.
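The headline counts in the AGC-Drive abstract can be cross-checked against the stated sensor layout. The sketch below is only an arithmetic sanity check, assuming that every sensor records once per frame in every scene and that each scene has exactly 100 frames (the abstract says "approximately"). The constant and function names are illustrative and do not come from the dataset's toolkit.

```python
# Sensor layout as described in the abstract.
VEHICLES = 2
CAMERAS_PER_VEHICLE = 5
LIDARS_PER_VEHICLE = 1
UAVS = 1
CAMERAS_PER_UAV = 1
LIDARS_PER_UAV = 1

SCENES = 400
FRAMES_PER_SCENE = 100  # assumption: "approximately 100" treated as exactly 100


def sensor_totals():
    """Return (lidar_frames, images), assuming every sensor fires once per frame."""
    timesteps = SCENES * FRAMES_PER_SCENE
    lidars = VEHICLES * LIDARS_PER_VEHICLE + UAVS * LIDARS_PER_UAV
    cameras = VEHICLES * CAMERAS_PER_VEHICLE + UAVS * CAMERAS_PER_UAV
    return timesteps * lidars, timesteps * cameras


if __name__ == "__main__":
    lidar_frames, images = sensor_totals()
    print(f"LiDAR frames: {lidar_frames}, images: {images}")
```

Under these assumptions the three LiDAR sensors and eleven cameras reproduce the quoted figures of roughly 120K LiDAR frames and 440K images.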
-
FastInit: Fast Noise Initialization for Temporally Consistent Video Generation
Authors:
Chengyu Bai,
Yuming Li,
Zhongyu Zhao,
Jintao Chen,
Peidong Jia,
Qi She,
Ming Lu,
Shanghang Zhang
Abstract:
Video generation has made significant strides with the development of diffusion models; however, achieving high temporal consistency remains a challenging task. Recently, FreeInit identified a training-inference gap and introduced a method to iteratively refine the initial noise during inference. However, iterative refinement significantly increases the computational cost associated with video generation. In this paper, we introduce FastInit, a fast noise initialization method that eliminates the need for iterative refinement. FastInit learns a Video Noise Prediction Network (VNPNet) that takes random noise and a text prompt as input, generating refined noise in a single forward pass. Therefore, FastInit greatly enhances the efficiency of video generation while achieving high temporal consistency across frames. To train the VNPNet, we create a large-scale dataset consisting of pairs of text prompts, random noise, and refined noise. Extensive experiments with various text-to-video models show that our method consistently improves the quality and temporal consistency of the generated videos. FastInit not only provides a substantial improvement in video generation but also offers a practical solution that can be applied directly during inference. The code and dataset will be released.
Submitted 19 June, 2025;
originally announced June 2025.
-
STAR-Pose: Efficient Low-Resolution Video Human Pose Estimation via Spatial-Temporal Adaptive Super-Resolution
Authors:
Yucheng Jin,
Jinyan Chen,
Ziyue He,
Baojun Han,
Furan An
Abstract:
Human pose estimation in low-resolution videos presents a fundamental challenge in computer vision. Conventional methods either assume high-quality inputs or employ computationally expensive cascaded processing, which limits their deployment in resource-constrained environments. We propose STAR-Pose, a spatial-temporal adaptive super-resolution framework specifically designed for video-based human pose estimation. Our method features a novel spatial-temporal Transformer with LeakyReLU-modified linear attention, which efficiently captures long-range temporal dependencies. It is complemented by an adaptive fusion module that integrates a parallel CNN branch for local texture enhancement. We also design a pose-aware compound loss to achieve task-oriented super-resolution. This loss guides the network to reconstruct structural features that are most beneficial for keypoint localization, rather than optimizing purely for visual quality. Extensive experiments on several mainstream video HPE datasets demonstrate that STAR-Pose outperforms existing approaches. It achieves up to 5.2% mAP improvement under extremely low-resolution (64x48) conditions while delivering 2.8x to 4.4x faster inference than cascaded approaches.
Submitted 19 June, 2025;
originally announced June 2025.
-
DynScaling: Efficient Verifier-free Inference Scaling via Dynamic and Integrated Sampling
Authors:
Fei Wang,
Xingchen Wan,
Ruoxi Sun,
Jiefeng Chen,
Sercan Ö. Arık
Abstract:
Inference-time scaling has proven effective in boosting large language model (LLM) performance through increased test-time computation. Yet, its practical application is often hindered by reliance on external verifiers or a lack of optimization for realistic computational constraints. We propose DynScaling, which addresses these limitations through two primary innovations: an integrated parallel-sequential sampling strategy and a bandit-based dynamic budget allocation framework. The integrated sampling strategy unifies parallel and sequential sampling by constructing synthetic sequential reasoning chains from initially independent parallel responses, promoting diverse and coherent reasoning trajectories. The dynamic budget allocation framework formulates the allocation of computational resources as a multi-armed bandit problem, adaptively distributing the inference budget across queries based on the uncertainty of previously sampled responses, thereby maximizing computational efficiency. By combining these components, DynScaling effectively improves LLM performance under practical resource constraints without the need for external verifiers. Experimental results demonstrate that DynScaling consistently surpasses existing verifier-free inference scaling baselines in both task performance and computational cost.
Submitted 19 June, 2025;
originally announced June 2025.
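The idea of treating per-query budget allocation as a multi-armed bandit driven by response uncertainty can be illustrated with a toy example. The sketch below is not the DynScaling algorithm: it substitutes a generic UCB-style rule, uses answer disagreement as a stand-in uncertainty measure, and assumes a hypothetical `sample_fn(query, i)` callback that returns the i-th sampled answer for a query. All names and the exploration constant `c` are assumptions made for illustration.

```python
import math
from collections import Counter


def disagreement(answers):
    """Uncertainty proxy: 1 minus the share of the most common answer."""
    if not answers:
        return 1.0
    top_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - top_count / len(answers)


def allocate(queries, sample_fn, budget, c=0.5):
    """Spend `budget` samples across `queries`, one sample at a time.

    Each query is a bandit arm. The next sample goes to the query with the
    highest disagreement plus a UCB-style exploration bonus.
    Assumes budget >= len(queries), since every query gets one initial sample.
    """
    answers = {q: [sample_fn(q, 0)] for q in queries}
    spent = len(queries)
    while spent < budget:
        def score(q):
            n = len(answers[q])
            return disagreement(answers[q]) + c * math.sqrt(math.log(spent) / n)

        chosen = max(queries, key=score)
        answers[chosen].append(sample_fn(chosen, len(answers[chosen])))
        spent += 1
    return answers


def toy_sampler(query, i):
    """Hypothetical sampler: 'easy' is always consistent, 'hard' alternates."""
    return "a" if query == "easy" else str(i % 2)


if __name__ == "__main__":
    result = allocate(["easy", "hard"], toy_sampler, budget=12)
    print({q: len(a) for q, a in result.items()})
```

In this toy setting most of the budget flows to the query whose sampled answers keep disagreeing, which is the qualitative behavior the abstract describes. No external verifier is involved because the score depends only on the samples themselves.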
-
SimuPanel: A Novel Immersive Multi-Agent System to Simulate Interactive Expert Panel Discussion
Authors:
Xiangyang He,
Jiale Li,
Jiahao Chen,
Yang Yang,
Mingming Fan
Abstract:
Panel discussion allows the audience to learn different perspectives through interactive discussions among experts moderated by a host and a Q&A session with the audience. Despite its benefits, panel discussion in the real world is inaccessible to many who do not have the privilege to participate due to geographical, financial, and time constraints. We present SimuPanel, which simulates panel discussions among academic experts through LLM-based multi-agent interaction. It enables users to define topics of interest for the panel, observe the expert discussion, engage in Q&A, and take notes. SimuPanel employs a host-expert architecture where each panel member is simulated by an agent with specialized expertise, and the panel is visualized in an immersive 3D environment to enhance engagement. Traditional dialogue generation struggles to capture the depth and interactivity of real-world panel discussions. To address this limitation, we propose a novel multi-agent interaction framework that simulates authentic panel dynamics by modeling reasoning strategies and personas of experts grounded in multimedia sources. This framework enables agents to dynamically recall and contribute to the discussion based on past experiences from diverse perspectives. Our technical evaluation and the user study with university students show that SimuPanel was able to simulate more in-depth discussions and engage participants to interact with and reflect on the discussions. As a first step in this direction, we offer design implications for future avenues to improve and harness the power of panel discussion for multimedia learning.
Submitted 19 June, 2025;
originally announced June 2025.
-
Data-Agnostic Cardinality Learning from Imperfect Workloads
Authors:
Peizhi Wu,
Rong Kang,
Tieying Zhang,
Jianjun Chen,
Ryan Marcus,
Zachary G. Ives
Abstract:
Cardinality estimation (CardEst) is a critical aspect of query optimization. Traditionally, it leverages statistics built directly over the data. However, organizational policies (e.g., regulatory compliance) may restrict global data access. Fortunately, query-driven cardinality estimation can learn CardEst models using query workloads. However, existing query-driven models often require access to data or summaries for best performance, and they assume perfect training workloads with complete and balanced join templates (or join graphs). Such assumptions rarely hold in real-world scenarios, in which join templates are incomplete and imbalanced. We present GRASP, a data-agnostic cardinality learning system designed to work under these real-world constraints. GRASP's compositional design generalizes to unseen join templates and is robust to join template imbalance. It also introduces a new per-table CardEst model that handles value distribution shifts for range predicates, and a novel learned count sketch model that captures join correlations across base relations. Across three database instances, we demonstrate that GRASP consistently outperforms existing query-driven models on imperfect workloads, both in terms of estimation accuracy and query latency. Remarkably, GRASP achieves performance comparable to, or even surpassing, traditional approaches built over the underlying data on the complex CEB-IMDb-full benchmark -- despite operating without any data access and using only 10% of all possible join templates.
Submitted 18 June, 2025;
originally announced June 2025.
-
MoiréXNet: Adaptive Multi-Scale Demoiréing with Linear Attention Test-Time Training and Truncated Flow Matching Prior
Authors:
Liangyan Li,
Yimo Ning,
Kevin Le,
Wei Dong,
Yunzhe Li,
Jun Chen,
Xiaohong Liu
Abstract:
This paper introduces a novel framework for image and video demoiréing by integrating Maximum A Posteriori (MAP) estimation with advanced deep learning techniques. Demoiréing addresses inherently nonlinear degradation processes, which pose significant challenges for existing methods.
Traditional supervised learning approaches either fail to remove moiré patterns completely or produce overly smooth results. This stems from constrained model capacity and scarce training data, which inadequately represent the clean image distribution and hinder accurate reconstruction of ground-truth images. While generative models excel in image restoration for linear degradations, they struggle with nonlinear cases such as demoiréing and often introduce artifacts.
To address these limitations, we propose a hybrid MAP-based framework that integrates two complementary components. The first is a supervised learning model enhanced with efficient linear attention Test-Time Training (TTT) modules, which directly learn nonlinear mappings for RAW-to-sRGB demoiréing. The second is a Truncated Flow Matching Prior (TFMP) that further refines the outputs by aligning them with the clean image distribution, effectively restoring high-frequency details and suppressing artifacts. These two components combine the computational efficiency of linear attention with the refinement abilities of generative models, resulting in improved restoration performance.
Submitted 18 June, 2025;
originally announced June 2025.
-
PNCS: Power-Norm Cosine Similarity for Diverse Client Selection in Federated Learning
Authors:
Liangyan Li,
Yangyi Liu,
Yimo Ning,
Stefano Rini,
Jun Chen
Abstract:
Federated Learning (FL) has emerged as a powerful paradigm for leveraging diverse datasets from multiple sources while preserving data privacy by avoiding centralized storage. However, many existing approaches fail to account for the intricate gradient correlations between remote clients, a limitation that becomes especially problematic in data heterogeneity scenarios. In this work, we propose a novel FL framework utilizing Power-Norm Cosine Similarity (PNCS) to improve client selection for model aggregation. By capturing higher-order gradient moments, PNCS addresses non-IID data challenges, enhancing convergence speed and accuracy. Additionally, we introduce a simple algorithm ensuring diverse client selection through a selection history queue. Experiments with a VGG16 model across varied data partitions demonstrate consistent improvements over state-of-the-art methods.
Submitted 18 June, 2025;
originally announced June 2025.
-
UniMate: A Unified Model for Mechanical Metamaterial Generation, Property Prediction, and Condition Confirmation
Authors:
Wangzhi Zhan,
Jianpeng Chen,
Dongqi Fu,
Dawei Zhou
Abstract:
Metamaterials are artificial materials designed to exhibit properties not found in nature, such as ultra-stiffness and negative material indices. In mechanical metamaterial design, three key modalities are typically involved, i.e., 3D topology, density condition, and mechanical property. Complex real-world application scenarios place demanding requirements on machine learning models to consider all three modalities together. However, a comprehensive literature review indicates that most existing works consider only two modalities, e.g., predicting mechanical properties given the 3D topology or generating 3D topology given the required properties. There is therefore still a significant gap before state-of-the-art machine learning models capture all three modalities together. Hence, we propose a unified model named UNIMATE, which consists of a modality alignment module and a synergetic diffusion generation module. Experiments indicate that UNIMATE outperforms the other baseline models in the topology generation, property prediction, and condition confirmation tasks by up to 80.2%, 5.1%, and 50.2%, respectively. We open-source our proposed UNIMATE model and corresponding results at https://github.com/wzhan24/UniMate.
Submitted 5 June, 2025;
originally announced June 2025.
-
MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement
Authors:
Jaehyun Nam,
Jinsung Yoon,
Jiefeng Chen,
Jinwoo Shin,
Sercan Ö. Arık,
Tomas Pfister
Abstract:
Agents based on large language models (LLMs) for machine learning engineering (MLE) can automatically implement ML models via code generation. However, existing approaches to building such agents often rely heavily on inherent LLM knowledge and employ coarse exploration strategies that modify the entire code structure at once. This limits their ability to select effective task-specific models and perform deep exploration within specific components, such as experimenting extensively with feature engineering options. To overcome these limitations, we propose MLE-STAR, a novel approach to building MLE agents. MLE-STAR first leverages external knowledge by using a search engine to retrieve effective models from the web, forming an initial solution, then iteratively refines it by exploring various strategies targeting specific ML components. This exploration is guided by ablation studies analyzing the impact of individual code blocks. Furthermore, we introduce a novel ensembling method using an effective strategy suggested by MLE-STAR. Our experimental results show that MLE-STAR achieves medals in 44% of the Kaggle competitions on the MLE-bench, significantly outperforming the best alternative.
Submitted 27 May, 2025;
originally announced June 2025.
-
Measurements of the absolute branching fractions of the doubly Cabibbo-suppressed decays $D^+\to K^+π^0$, $D^+\to K^+η$ and $D^+\to K^+η^{\prime}$
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere,
A. Brueggemann,
H. Cai
, et al. (697 additional authors not shown)
Abstract:
Using $20.3\,\rm fb^{-1}$ of $e^+e^-$ collision data collected at a center-of-mass energy of 3.773\,GeV with the BESIII detector, we present improved measurements of the absolute branching fractions of the doubly Cabibbo-suppressed decays $D^+\to K^+π^0$, $D^+\to K^+η$ and $ D^+ \to K^+ η^{\prime}$ with the double-tag method. The statistical significance of each signal decay exceeds $10σ$. The branching fractions are determined to be ${\mathcal B}(D^+\to K^+ π^0) = (1.45 \pm 0.06 \pm 0.06)\times 10^{-4}$, ${\mathcal B}(D^+\to K^+ η) = (1.17 \pm 0.10 \pm 0.03)\times 10^{-4}$ and ${\mathcal B}(D^+\to K^+ η^{\prime}) = (1.88 \pm 0.15 \pm 0.06)\times 10^{-4}$, where the first uncertainties are statistical and the second systematic. These results are consistent with the world average values but with significantly improved precision.
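As a reading aid, the statistical and systematic uncertainties quoted above can be folded into a single total uncertainty. The sketch below assumes the two are independent and adds them in quadrature, a common convention that the abstract itself does not state; the function name `total_uncertainty` is hypothetical.

```python
import math


def total_uncertainty(stat: float, syst: float) -> float:
    """Combine two uncertainties in quadrature (assumes they are independent)."""
    return math.hypot(stat, syst)


# Central values and uncertainties in units of 1e-4, copied from the abstract:
# (value, statistical, systematic)
results = {
    "K+ pi0": (1.45, 0.06, 0.06),
    "K+ eta": (1.17, 0.10, 0.03),
    "K+ eta'": (1.88, 0.15, 0.06),
}

for mode, (value, stat, syst) in results.items():
    total = total_uncertainty(stat, syst)
    print(f"B(D+ -> {mode}) = ({value:.2f} +/- {total:.2f}) x 1e-4")
```

Quadrature addition is only appropriate when the two error sources are uncorrelated; the collaboration's own combination, if any, may differ.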
Submitted 18 June, 2025;
originally announced June 2025.
-
NTIRE 2025 Image Shadow Removal Challenge Report
Authors:
Florin-Alexandru Vasluianu,
Tim Seizinger,
Zhuyun Zhou,
Cailian Chen,
Zongwei Wu,
Radu Timofte,
Mingjia Li,
Jin Hu,
Hainuo Wang,
Hengxing Liu,
Jiarui Wang,
Qiming Hu,
Xiaojie Guo,
Xin Lu,
Jiarong Yang,
Yuanfei Bao,
Anya Hu,
Zihao Fan,
Kunyu Wang,
Jie Xiao,
Xi Wang,
Xueyang Fu,
Zheng-Jun Zha,
Yu-Fan Lin,
Chia-Ming Lee
, et al. (57 additional authors not shown)
Abstract:
This work examines the findings of the NTIRE 2025 Shadow Removal Challenge. A total of 306 participants registered, with 17 teams successfully submitting their solutions during the final evaluation phase. As in the last two editions, this challenge had two evaluation tracks: one focusing on reconstruction fidelity and the other on visual perception through a user study. Both tracks were evaluated with images from the WSRD+ dataset, simulating interactions between self- and cast-shadows with a large number of diverse objects, textures, and materials.
Submitted 18 June, 2025;
originally announced June 2025.
-
Model Predictive Path-Following Control for a Quadrotor
Authors:
David Leprich,
Mario Rosenfelder,
Mario Hermle,
Jingshan Chen,
Peter Eberhard
Abstract:
Automating drone-assisted processes is a complex task. Many solutions rely on trajectory generation and tracking; in contrast, path-following control is a particularly promising approach, offering an intuitive and natural way to automate tasks for drones and other vehicles. While different solutions to the path-following problem have been proposed, most of them lack the capability to explicitly handle state and input constraints, are formulated in a conservative two-stage approach, or are only applicable to linear systems. To address these challenges, the paper builds upon a Model Predictive Control-based path-following framework and extends its application to the Crazyflie quadrotor, which is investigated in hardware experiments. A cascaded control structure with an underlying attitude controller is incorporated into the Model Predictive Path-Following Control formulation to meet the challenging real-time demands of quadrotor control. The effectiveness of the proposed method is demonstrated through real-world experiments, representing, to the best of the authors' knowledge, a novel application of this MPC-based path-following approach to the quadrotor. Additionally, as an extension to the original method, a corridor path-following approach is presented that allows deviations from the path in cases where precisely following the path might be overly restrictive.
Submitted 18 June, 2025;
originally announced June 2025.
-
RA-NeRF: Robust Neural Radiance Field Reconstruction with Accurate Camera Pose Estimation under Complex Trajectories
Authors:
Qingsong Yan,
Qiang Wang,
Kaiyong Zhao,
Jie Chen,
Bo Li,
Xiaowen Chu,
Fei Deng
Abstract:
Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have emerged as powerful tools for 3D reconstruction and SLAM tasks. However, their performance depends heavily on accurate camera pose priors. Existing approaches attempt to address this issue by introducing external constraints but fall short of achieving satisfactory accuracy, particularly when camera trajectories are complex. In this paper, we propose a novel method, RA-NeRF, capable of predicting highly accurate camera poses even with complex camera trajectories. Following the incremental pipeline, RA-NeRF reconstructs the scene using NeRF with photometric consistency and incorporates flow-driven pose regulation to enhance robustness during initialization and localization. Additionally, RA-NeRF employs an implicit pose filter to capture the camera movement pattern and eliminate the noise for pose estimation. To validate our method, we conduct extensive experiments on the Tanks&Temple dataset for standard evaluation, as well as the NeRFBuster dataset, which presents challenging camera pose trajectories. On both datasets, RA-NeRF achieves state-of-the-art results in both camera pose estimation and visual quality, demonstrating its effectiveness and robustness in scene reconstruction under complex pose trajectories.
Submitted 24 June, 2025; v1 submitted 18 June, 2025;
originally announced June 2025.
-
Advancing Loss Functions in Recommender Systems: A Comparative Study with a Rényi Divergence-Based Solution
Authors:
Shengjia Zhang,
Jiawei Chen,
Changdong Li,
Sheng Zhou,
Qihao Shi,
Yan Feng,
Chun Chen,
Can Wang
Abstract:
Loss functions play a pivotal role in optimizing recommendation models. Among various loss functions, Softmax Loss (SL) and Cosine Contrastive Loss (CCL) are particularly effective. Their theoretical connections and differences warrant in-depth exploration. This work conducts comprehensive analyses of these losses, yielding significant insights: 1) Common strengths -- both can be viewed as augmentations of traditional losses with Distributionally Robust Optimization (DRO), enhancing robustness to distributional shifts; 2) Respective limitations -- stemming from their use of different distribution distance metrics in DRO, SL exhibits high sensitivity to false negative instances, whereas CCL suffers from low data utilization. To address these limitations, this work proposes a new loss function, DrRL, which generalizes SL and CCL by leveraging Rényi divergence in DRO. DrRL incorporates the advantageous structures of both SL and CCL, and can be shown to effectively mitigate their limitations. Extensive experiments have been conducted to validate the superiority of DrRL in both recommendation accuracy and robustness.
Submitted 17 June, 2025;
originally announced June 2025.
-
3D Vision-tactile Reconstruction from Infrared and Visible Images for Robotic Fine-grained Tactile Perception
Authors:
Yuankai Lin,
Xiaofan Lu,
Jiahui Chen,
Hua Yang
Abstract:
To achieve human-like haptic perception in anthropomorphic grippers, the compliant sensing surfaces of vision-tactile sensors (VTSs) must evolve from conventional planar configurations to biomimetically curved topographies with continuous surface gradients. However, planar VTSs face challenges when extended to curved surfaces, including insufficient lighting of surfaces, blurring in reconstruction, and complex spatial boundary conditions for surface structures. With the end goal of constructing a human-like fingertip, our research (i) develops GelSplitter3D by expanding imaging channels with a prism and a near-infrared (NIR) camera, (ii) proposes a photometric stereo neural network with a CAD-based normal ground truth generation method to calibrate tactile geometry, and (iii) devises a normal integration method with boundary constraints from depth prior information to correct the cumulative error of surface integrals. We demonstrate better tactile sensing performance, a 40% improvement in normal estimation accuracy, and the benefits of sensor shapes in grasping and manipulation tasks.
Submitted 17 June, 2025;
originally announced June 2025.
-
Truncated Proximal Policy Optimization
Authors:
Tiantian Fan,
Lingjun Liu,
Yu Yue,
Jiaze Chen,
Chengyi Wang,
Qiying Yu,
Chi Zhang,
Zhiqi Lin,
Ruofei Zhu,
Yufeng Yuan,
Xiaochen Zuo,
Bole Ma,
Mofan Zhang,
Gaohong Liu,
Ru Zhang,
Haotian Zhou,
Cong Xie,
Ruidong Zhu,
Zhi Zhang,
Xin Liu,
Mingxuan Wang,
Lin Yan,
Yonghui Wu
Abstract:
Recently, test-time scaling Large Language Models (LLMs) have demonstrated exceptional reasoning capabilities across scientific and professional tasks by generating long chains-of-thought (CoT). As a crucial component for developing these reasoning models, reinforcement learning (RL), exemplified by Proximal Policy Optimization (PPO) and its variants, allows models to learn through trial and error. However, PPO can be time-consuming due to its inherent on-policy nature, which is further exacerbated by increasing response lengths. In this work, we propose Truncated Proximal Policy Optimization (T-PPO), a novel extension to PPO that improves training efficiency by streamlining policy updates and length-restricted response generation. T-PPO mitigates the issue of low hardware utilization, an inherent drawback of fully synchronized long-generation procedures, where resources often sit idle while waiting for complete rollouts. Our contributions are twofold. First, we propose Extended Generalized Advantage Estimation (EGAE) for advantage estimation derived from incomplete responses while maintaining the integrity of policy learning. Second, we devise a computationally optimized mechanism that allows for the independent optimization of the policy and value models. By selectively filtering prompt and truncated tokens, this mechanism reduces redundant computations and accelerates the training process without sacrificing convergence performance. We demonstrate the effectiveness and efficacy of T-PPO on AIME 2024 with a 32B base model. The experimental results show that T-PPO improves the training efficiency of reasoning LLMs by up to 2.5x and outperforms its existing competitors.
Submitted 17 June, 2025;
originally announced June 2025.
-
Interpolation-based reproducing kernel particle method
Authors:
Jennifer E. Fromm,
John A. Evans,
J. S. Chen
Abstract:
Meshfree methods, including the reproducing kernel particle method (RKPM), have been widely used within the computational mechanics community to model physical phenomena in materials undergoing large deformations or extreme topology changes. RKPM shape functions and their derivatives cannot be accurately integrated with the Gauss-quadrature methods widely employed for the finite element method (FEM) and typically require sophisticated nodal integration techniques, preventing them from easily being implemented in existing FEM software. Interpolation-based methods have been developed to address similar problems with isogeometric and immersed boundary methods, allowing these techniques to be implemented within open-source finite element software. With interpolation-based methods, background basis functions are represented as linear combinations of Lagrange polynomial foreground basis functions defined upon a boundary-conforming foreground mesh. This work extends the applications of interpolation-based methods to implement RKPM within open-source finite element software. Interpolation-based RKPM is applied to several PDEs, and error convergence rates are equivalent to classic RKPM integrated using high-order Gauss-quadrature schemes. The interpolation-based method is able to exploit the continuity of the RKPM basis to solve higher-order PDEs, demonstrated through the biharmonic problem. The method is extended to multi-material problems through Heaviside enrichment schemes, using local foreground refinement to reduce geometric integration error and achieve high-order accuracy. The computational cost of interpolation-based RKPM is similar to the smoothed gradient nodal integration schemes, offering significant savings over Gauss-quadrature-based meshfree methods while enabling easy implementation within existing finite element software.
Submitted 17 June, 2025;
originally announced June 2025.
-
YOLOv11-RGBT: Towards a Comprehensive Single-Stage Multispectral Object Detection Framework
Authors:
Dahang Wan,
Rongsheng Lu,
Yang Fang,
Xianli Lang,
Shuangbao Shu,
Jingjing Chen,
Siyuan Shen,
Ting Xu,
Zecong Ye
Abstract:
Multispectral object detection, which integrates information from multiple bands, can enhance detection accuracy and environmental adaptability, holding great application potential across various fields. Although existing methods have made progress in cross-modal interaction, low-light conditions, and model lightweighting, challenges remain, such as the lack of a unified single-stage framework, difficulty in balancing performance and fusion strategy, and unreasonable modality weight allocation. To address these, based on the YOLOv11 framework, we present YOLOv11-RGBT, a new comprehensive multimodal object detection framework. We designed six multispectral fusion modes and successfully applied them to models from YOLOv3 to YOLOv12 and RT-DETR. After reevaluating the importance of the two modalities, we proposed a P3 mid-fusion strategy and a multispectral controllable fine-tuning (MCF) strategy for multispectral models. These improvements optimize feature fusion, reduce redundancy and mismatches, and boost overall model performance. Experiments show our framework excels on three major open-source multispectral object detection datasets, including LLVIP and FLIR. In particular, the multispectral controllable fine-tuning strategy significantly enhanced model adaptability and robustness. On the FLIR dataset, it consistently improved YOLOv11 models' mAP by 3.41%-5.65%, reaching a maximum of 47.61%, verifying the effectiveness of the framework and strategies. The code is available at: https://github.com/wandahangFY/YOLOv11-RGBT.
Submitted 18 June, 2025; v1 submitted 17 June, 2025;
originally announced June 2025.
-
DreamLight: Towards Harmonious and Consistent Image Relighting
Authors:
Yong Liu,
Wenpeng Xiao,
Qianqian Wang,
Junlin Chen,
Shiyin Wang,
Yitong Wang,
Xinglong Wu,
Yansong Tang
Abstract:
We introduce a model named DreamLight for universal image relighting in this work, which can seamlessly composite subjects into a new background while maintaining aesthetic uniformity in terms of lighting and color tone. The background can be specified by natural images (image-based relighting) or generated from unlimited text prompts (text-based relighting). Existing studies primarily focus on image-based relighting, with scant exploration of text-based scenarios. Some works employ intricate disentanglement pipeline designs relying on environment maps to provide relevant information, which incur the expensive data cost required for intrinsic decomposition and light sources. Other methods treat this task as an image translation problem and perform pixel-level transformation with an autoencoder architecture. While these methods have achieved decent harmonization effects, they struggle to generate realistic and natural light interaction effects between the foreground and background. To alleviate these challenges, we reorganize the input data into a unified format and leverage the semantic prior provided by the pretrained diffusion model to facilitate the generation of natural results. Moreover, we propose a Position-Guided Light Adapter (PGLA) that condenses light information from different directions in the background into designed light query embeddings, and modulates the foreground with direction-biased masked attention. In addition, we present a post-processing module named Spectral Foreground Fixer (SFF) to adaptively reorganize different frequency components of the subject and relighted background, which helps enhance the consistency of foreground appearance. Extensive comparisons and a user study demonstrate that our DreamLight achieves remarkable relighting performance.
Submitted 17 June, 2025;
originally announced June 2025.
-
GUI-Robust: A Comprehensive Dataset for Testing GUI Agent Robustness in Real-World Anomalies
Authors:
Jingqi Yang,
Zhilong Song,
Jiawei Chen,
Mingli Song,
Sheng Zhou,
Linjun Sun,
Xiaogang Ouyang,
Chun Chen,
Can Wang
Abstract:
The development of high-quality datasets is crucial for benchmarking and advancing research in Graphical User Interface (GUI) agents. Despite their importance, existing datasets are often constructed under idealized conditions, overlooking the diverse anomalies frequently encountered in real-world deployments. To address this limitation, we introduce GUI-Robust, a novel dataset designed for comprehensive GUI agent evaluation, explicitly incorporating seven common types of anomalies observed in everyday GUI interactions. Furthermore, we propose a semi-automated dataset construction paradigm that collects user action sequences from natural interactions via RPA tools and then generates corresponding step and task descriptions for these actions with the assistance of MLLMs. This paradigm reduces annotation time cost by a factor of over 19. Finally, we assess state-of-the-art GUI agents using the GUI-Robust dataset, revealing their substantial performance degradation in abnormal scenarios. We anticipate that our work will highlight the importance of robustness in GUI agents and inspire more future research in this direction. The dataset and code are available at https://github.com/chessbean1/GUI-Robust.
Submitted 17 June, 2025;
originally announced June 2025.
-
Compositional Attribute Imbalance in Vision Datasets
Authors:
Jiayi Chen,
Yanbiao Ma,
Andi Zhang,
Weidong Tang,
Wei Dai,
Bowei Liu
Abstract:
Visual attribute imbalance is a common yet underexplored issue in image classification, significantly impacting model performance and generalization. In this work, we first define the first-level and second-level attributes of images and then introduce a CLIP-based framework to construct a visual attribute dictionary, enabling automatic evaluation of image attributes. By systematically analyzing both single-attribute imbalance and compositional attribute imbalance, we reveal how the rarity of attributes affects model performance. To tackle these challenges, we propose adjusting the sampling probability of samples based on the rarity of their compositional attributes. This strategy is further integrated with various data augmentation techniques (such as CutMix, Fmix, and SaliencyMix) to enhance the model's ability to represent rare attributes. Extensive experiments on benchmark datasets demonstrate that our method effectively mitigates attribute imbalance, thereby improving the robustness and fairness of deep neural networks. Our research highlights the importance of modeling visual attribute distributions and provides a scalable solution for long-tail image classification tasks.
Submitted 17 June, 2025;
originally announced June 2025.
-
Measurements of the Diffuse Interstellar Bands at 5780, 5797, and 6614 Å in the Hot Stellar Spectra of the LAMOST LRS DR10
Authors:
Xiao-Xiao Ma,
A-Li Luo,
Jian-Jun Chen,
Jing Chen,
Jun-Chao Liang
Abstract:
Diffuse Interstellar Bands (DIBs) are crucial tracers of the interstellar medium (ISM), yet their carriers remain poorly understood. While large-scale surveys have advanced DIB studies in cool stellar spectra, measurements in hot stellar spectra are still limited. Using 287 277 high signal-to-noise (S/N > 50) hot stellar spectra from the tenth data release of the Large Sky Area Multi-Object Fiber Spectroscopic Telescope low-resolution spectroscopic survey (LAMOST LRS DR10), we systematically measured the three prominent optical DIBs at 5780, 5797, and 6614 Å. We published three catalogs containing 285 103, 279 195, and 281 146 valid measurements for the DIBs at 5780, 5797, and 6614 Å, respectively. Among them, 112 479, 25 232, and 71 048 are high-quality samples after rigorous quality control. To our knowledge, these are the largest hot-star DIB datasets in the northern sky. The catalogs provide spectral metadata, added astrometric information, DIB profiles, and quality metrics. Our methodology and open-source pipeline ensure reproducibility, while the scale and precision of the data support future statistical studies. We anticipate that these catalogs will highlight LAMOST's role in advancing DIB research and deepening our understanding of the ISM.
Submitted 17 June, 2025;
originally announced June 2025.
-
From Black Boxes to Transparent Minds: Evaluating and Enhancing the Theory of Mind in Multimodal Large Language Models
Authors:
Xinyang Li,
Siqi Liu,
Bochao Zou,
Jiansheng Chen,
Huimin Ma
Abstract:
As large language models evolve, there is growing anticipation that they will emulate human-like Theory of Mind (ToM) to assist with routine tasks. However, existing methods for evaluating machine ToM focus primarily on unimodal models and largely treat these models as black boxes, lacking an interpretative exploration of their internal mechanisms. In response, this study adopts an approach based on internal mechanisms to provide an interpretability-driven assessment of ToM in multimodal large language models (MLLMs). Specifically, we first construct a multimodal ToM test dataset, GridToM, which incorporates diverse belief testing tasks and perceptual information from multiple perspectives. Next, our analysis shows that attention heads in multimodal large models can distinguish cognitive information across perspectives, providing evidence of ToM capabilities. Furthermore, we present a lightweight, training-free approach that significantly enhances the model's exhibited ToM by adjusting in the direction of the attention head.
Submitted 17 June, 2025;
originally announced June 2025.
-
VideoMAR: Autoregressive Video Generation with Continuous Tokens
Authors:
Hu Yu,
Biao Gong,
Hangjie Yuan,
DanDan Zheng,
Weilong Chai,
Jingdong Chen,
Kecheng Zheng,
Feng Zhao
Abstract:
Mask-based autoregressive models have demonstrated promising image generation capability in continuous space. However, their potential for video generation remains under-explored. In this paper, we propose VideoMAR, a concise and efficient decoder-only autoregressive image-to-video model with continuous tokens, combining temporal frame-by-frame and spatial masked generation. We first identify temporal causality and spatial bi-directionality as the first principle of video AR models, and propose the next-frame diffusion loss for the integration of mask and video generation. Besides, the huge cost and difficulty of long-sequence autoregressive modeling is a basic but crucial issue. To this end, we propose temporal short-to-long curriculum learning and spatial progressive resolution training, and employ a progressive temperature strategy at inference time to mitigate the accumulation error. Furthermore, VideoMAR carries over several unique capacities of language models to video generation. It is inherently efficient due to simultaneous temporal-wise KV cache and spatial-wise parallel generation, and is capable of spatial and temporal extrapolation via 3D rotary embeddings. On the VBench-I2V benchmark, VideoMAR surpasses the previous state-of-the-art (Cosmos I2V) while requiring significantly fewer parameters (9.3%), training data (0.5%), and GPU resources (0.2%).
Submitted 18 June, 2025; v1 submitted 17 June, 2025;
originally announced June 2025.
-
Toward a Graph Foundation Model: Pre-Training Transformers With Random Walks
Authors:
Ziyuan Tang,
Jie Chen
Abstract:
A foundation model like GPT elicits many emergent abilities, owing to the pre-training with broad inclusion of data and the use of the powerful Transformer architecture. While foundation models in natural languages are prevalent, can we build similar models for graphs? This paper describes an approach toward a graph foundation model that is pre-trained with diverse graph datasets by adapting the Transformer backbone. A central challenge toward this end is how a sequence model encodes graphs of varying sizes and from different domains. We propose representing a node as multiple random walks, such that the Transformer can extract node representations from sequences, which in turn form edge and graph representations. We develop a novel context prediction loss for these random walks and theoretically analyze their expressive power in distinguishing neighborhoods and graphs. We also demonstrate the pre-training of our model and its adaptation to downstream tasks, showcasing its potential as a foundation for processing and reasoning with graph-structured data.
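The idea of representing a node as multiple random walks can be illustrated with a minimal sketch. It assumes an unweighted graph stored as an adjacency dictionary and uniform neighbor sampling; the function `sample_walks`, its parameters, and the toy graph are hypothetical and not taken from the paper, whose actual walk scheme and tokenization may differ.

```python
import random


def sample_walks(adj, start, num_walks, walk_length, seed=0):
    """Sample `num_walks` uniform random walks of up to `walk_length` nodes from `start`."""
    rng = random.Random(seed)  # seeded for reproducibility
    walks = []
    for _ in range(num_walks):
        walk = [start]
        while len(walk) < walk_length:
            neighbors = adj[walk[-1]]
            if not neighbors:  # dead end: stop this walk early
                break
            walk.append(rng.choice(neighbors))
        walks.append(walk)
    return walks


# Toy undirected graph as an adjacency dictionary (illustrative only).
toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

# Each walk is a node sequence that a sequence model could consume.
walks = sample_walks(toy_graph, start=0, num_walks=4, walk_length=5)
for walk in walks:
    print(walk)
```

Each walk is a plain sequence of node identifiers, which is the kind of input a sequence model can consume; how such sequences are embedded and pooled into node, edge, and graph representations is left to the paper.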
Submitted 16 June, 2025;
originally announced June 2025.
-
Bridging Unsupervised and Semi-Supervised Anomaly Detection: A Theoretically-Grounded and Practical Framework with Synthetic Anomalies
Authors:
Matthew Lau,
Tian-Yi Zhou,
Xiangchi Yuan,
Jizhou Chen,
Wenke Lee,
Xiaoming Huo
Abstract:
Anomaly detection (AD) is a critical task across domains such as cybersecurity and healthcare. In the unsupervised setting, an effective and theoretically-grounded principle is to train classifiers to distinguish normal data from (synthetic) anomalies. We extend this principle to semi-supervised AD, where training data also include a limited labeled subset of the anomalies that may be present at test time. We propose a theoretically-grounded and empirically effective framework for semi-supervised AD that combines known and synthetic anomalies during training. To analyze semi-supervised AD, we introduce the first mathematical formulation of semi-supervised AD, which generalizes unsupervised AD. Here, we show that synthetic anomalies enable (i) better anomaly modeling in low-density regions and (ii) optimal convergence guarantees for neural network classifiers -- the first theoretical result for semi-supervised AD. We empirically validate our framework on five diverse benchmarks, observing consistent performance gains. These improvements also extend beyond our theoretical framework to other classification-based AD methods, validating the generalizability of the synthetic anomaly principle in AD.
Submitted 16 June, 2025;
originally announced June 2025.
-
Dynamical quantum phase transition with divergent multipartite entanglement
Authors:
Jie Chen,
Ricardo Costa de Almeida,
Hendrik Weimer
Abstract:
We investigate the nonequilibrium quench dynamics of the one-dimensional transverse-field Ising model in both integrable and nonintegrable regimes. In particular, we report on a novel type of dynamical quantum phase transition (DQPT) that is characterized by a divergent multipartite entanglement at critical times in the post-quench dynamics. We quantify the multipartite entanglement of the state by the quantum Fisher information and demonstrate that the DQPT belongs to a different universality class than the ground-state phase transition. Furthermore, we perform a spectral analysis of the DQPT and demonstrate that it is a genuine nonequilibrium transition arising from the constructive interference of excited states of the system during the many-body dynamics. Finally, we discuss potential experimental realizations in Rydberg platforms as well as applications in the context of quantum metrology.
Submitted 16 June, 2025;
originally announced June 2025.
-
XGraphRAG: Interactive Visual Analysis for Graph-based Retrieval-Augmented Generation
Authors:
Ke Wang,
Bo Pan,
Yingchaojie Feng,
Yuwei Wu,
Jieyi Chen,
Minfeng Zhu,
Wei Chen
Abstract:
Graph-based Retrieval-Augmented Generation (RAG) has shown great capability in enhancing the answers of Large Language Models (LLMs) with an external knowledge base. Compared to traditional RAG, it introduces a graph as an intermediate representation to better capture structured relational knowledge in the corpus, improving the precision and comprehensiveness of generation results. However, developers usually face challenges in analyzing the effectiveness of GraphRAG on their dataset due to GraphRAG's complex information processing pipeline and the overwhelming number of LLM invocations involved during graph construction and query, which limits GraphRAG's interpretability and accessibility. This research proposes a visual analysis framework that helps RAG developers identify critical recalls of GraphRAG and trace these recalls through the GraphRAG pipeline. Based on this framework, we develop XGraphRAG, a prototype system incorporating a set of interactive visualizations to facilitate users' analysis process, supporting the collection of failure cases and the identification of improvement opportunities. Our evaluation demonstrates the effectiveness and usability of our approach. Our work is open-sourced and available at https://github.com/Gk0Wk/XGraphRAG.
Submitted 10 June, 2025;
originally announced June 2025.
-
CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding
Authors:
Wenxuan Song,
Jiayi Chen,
Pengxiang Ding,
Yuxin Huang,
Han Zhao,
Donglin Wang,
Haoang Li
Abstract:
In recent years, Vision-Language-Action (VLA) models have become a vital research direction in robotics due to their impressive multimodal understanding and generalization capabilities. Despite the progress, their practical deployment is severely constrained by inference speed bottlenecks, particularly in high-frequency and dexterous manipulation tasks. While recent studies have explored Jacobi decoding as a more efficient alternative to traditional autoregressive decoding, its practical benefits are marginal due to the lengthy iterations. To address it, we introduce consistency distillation training to predict multiple correct action tokens in each iteration, thereby achieving acceleration. Besides, we design mixed-label supervision to mitigate the error accumulation during distillation. Although distillation brings acceptable speedup, we identify that certain inefficient iterations remain a critical bottleneck. To tackle this, we propose an early-exit decoding strategy that moderately relaxes convergence conditions, which further improves average inference efficiency. Experimental results show that the proposed method achieves more than 4 times inference acceleration across different baselines while maintaining high task success rates in both simulated and real-world robot tasks. These experiments validate that our approach provides an efficient and general paradigm for accelerating multimodal decision-making in robotics. Our project page is available at https://irpn-eai.github.io/CEED-VLA/.
Submitted 16 June, 2025;
originally announced June 2025.
-
OneRec Technical Report
Authors:
Guorui Zhou,
Jiaxin Deng,
Jinghao Zhang,
Kuo Cai,
Lejian Ren,
Qiang Luo,
Qianqian Wang,
Qigen Hu,
Rui Huang,
Shiyao Wang,
Weifeng Ding,
Wuchao Li,
Xinchen Luo,
Xingmei Wang,
Zexuan Cheng,
Zixing Zhang,
Bin Zhang,
Boxuan Wang,
Chaoyi Ma,
Chengru Song,
Chenhui Wang,
Di Wang,
Dongxue Meng,
Fan Yang,
Fangyu Zhang
, et al. (40 additional authors not shown)
Abstract:
Recommender systems have been widely used in various large-scale user-oriented platforms for many years. However, compared to the rapid developments in the AI community, recommendation systems have not achieved a breakthrough in recent years. For instance, they still rely on a multi-stage cascaded architecture rather than an end-to-end approach, leading to computational fragmentation and optimization inconsistencies, and hindering the effective application of key breakthrough technologies from the AI community in recommendation scenarios.
To address these issues, we propose OneRec, which reshapes the recommendation system through an end-to-end generative approach and achieves promising results. Firstly, we have enhanced the computational FLOPs of the current recommendation model by 10 $\times$ and have identified the scaling laws for recommendations within certain boundaries. Secondly, reinforcement learning techniques, previously difficult to apply for optimizing recommendations, show significant potential in this framework. Lastly, through infrastructure optimizations, we have achieved 23.7% and 28.8% Model FLOPs Utilization (MFU) on flagship GPUs during training and inference, respectively, aligning closely with the LLM community. This architecture significantly reduces communication and storage overhead, resulting in operating expense that is only 10.6% of traditional recommendation pipelines. Deployed in Kuaishou/Kuaishou Lite APP, it handles 25% of total queries per second, enhancing overall App Stay Time by 0.54% and 1.24%, respectively. Additionally, we have observed significant increases in metrics such as 7-day Lifetime, which is a crucial indicator of recommendation experience. We also provide practical lessons and insights derived from developing, optimizing, and maintaining a production-scale recommendation system with significant real-world impact.
Submitted 16 June, 2025;
originally announced June 2025.
-
Seismic Acoustic Impedance Inversion Framework Based on Conditional Latent Generative Diffusion Model
Authors:
Jie Chen,
Hongling Chen,
Jinghuai Gao,
Chuangji Meng,
Tao Yang,
XinXin Liang
Abstract:
Seismic acoustic impedance plays a crucial role in lithological identification and subsurface structure interpretation. However, due to the inherently ill-posed nature of the inversion problem, directly estimating impedance from post-stack seismic data remains highly challenging. Recently, diffusion models have shown great potential in addressing such inverse problems due to their strong prior learning and generative capabilities. Nevertheless, most existing methods operate in the pixel domain and require multiple iterations, limiting their applicability to field data. To alleviate these limitations, we propose a novel seismic acoustic impedance inversion framework based on a conditional latent generative diffusion model, where the inversion process is performed in latent space. To avoid introducing additional training overhead when embedding conditional inputs, we integrate a lightweight wavelet-based module into the framework to project seismic data and reuse an encoder trained on impedance to embed low-frequency impedance into the latent space. Furthermore, we propose a model-driven sampling strategy during the inversion process of this framework to enhance accuracy and reduce the number of required diffusion steps. Numerical experiments on a synthetic model demonstrate that the proposed method achieves high inversion accuracy and strong generalization capability within only a few diffusion steps. Moreover, application to field data reveals enhanced geological detail and higher consistency with well-log measurements, validating the effectiveness and practicality of the proposed approach.
Submitted 16 June, 2025;
originally announced June 2025.
-
Self-Supervised Enhancement for Depth from a Lightweight ToF Sensor with Monocular Images
Authors:
Laiyan Ding,
Hualie Jiang,
Jiwei Chen,
Rui Huang
Abstract:
Depth map enhancement using paired high-resolution RGB images offers a cost-effective solution for improving low-resolution depth data from lightweight ToF sensors. Nevertheless, naively adopting a depth estimation pipeline to fuse the two modalities requires ground-truth depth maps for supervision. To address this, we propose a self-supervised learning framework, SelfToF, which generates detailed and scale-aware depth maps. Starting from an image-based self-supervised depth estimation pipeline, we add low-resolution depth as inputs, design a new depth consistency loss, propose a scale-recovery module, and finally obtain a large performance boost. Furthermore, since the ToF signal sparsity varies in real-world applications, we upgrade SelfToF to SelfToF* with submanifold convolution and guided feature fusion. Consequently, SelfToF* maintains robust performance across varying sparsity levels in ToF data. Overall, our proposed method is both efficient and effective, as verified by extensive experiments on the NYU and ScanNet datasets. The code is available at https://github.com/denyingmxd/selftof.
Submitted 17 June, 2025; v1 submitted 16 June, 2025;
originally announced June 2025.
-
VLM-SFD: VLM-Assisted Siamese Flow Diffusion Framework for Dual-Arm Cooperative Manipulation
Authors:
Jiaming Chen,
Yiyu Jiang,
Aoshen Huang,
Yang Li,
Wei Pan
Abstract:
Dual-arm cooperative manipulation holds great promise for tackling complex real-world tasks that demand seamless coordination and adaptive dynamics. Despite substantial progress in learning-based motion planning, most approaches struggle to generalize across diverse manipulation tasks and adapt to dynamic, unstructured environments, particularly in scenarios involving interactions between two objects such as assembly, tool use, and bimanual grasping. To address these challenges, we introduce a novel VLM-Assisted Siamese Flow Diffusion (VLM-SFD) framework for efficient imitation learning in dual-arm cooperative manipulation. The proposed VLM-SFD framework exhibits outstanding adaptability, significantly enhancing the ability to rapidly adapt and generalize to diverse real-world tasks from only a minimal number of human demonstrations. Specifically, we propose a Siamese Flow Diffusion Network (SFDNet) that employs a dual-encoder-decoder Siamese architecture to embed two target objects into a shared latent space, while a diffusion-based process, conditioned on task instructions, generates two-stream object-centric motion flows that guide dual-arm coordination. We further design a dynamic task assignment strategy that seamlessly maps the predicted 2D motion flows into 3D space and incorporates a pre-trained vision-language model (VLM) to adaptively assign the optimal motion to each robotic arm over time. Experiments validate the effectiveness of the proposed method, demonstrating its ability to generalize to diverse manipulation tasks while maintaining high efficiency and adaptability. The code and demo videos are publicly available on our project website https://sites.google.com/view/vlm-sfd/.
Submitted 16 June, 2025;
originally announced June 2025.
-
ACM tilting bundles on a Geigle-Lenzing projective plane of type $(2,2,2,p)$
Authors:
Jianmin Chen,
Shiquan Ruan,
Weikang Weng
Abstract:
Let $\mathbb{X}$ be a Geigle-Lenzing projective plane of type $(2,2,2,p)$ and $\mathsf{coh} \, \mathbb{X}$ the category of coherent sheaves on $\mathbb{X}$. This paper is devoted to the study of ACM tilting bundles over $\mathbb{X}$, that is, tilting objects in the derived category $\mathsf{D}^{\rm b}(\mathsf{coh} \, \mathbb{X})$ that are also ACM bundles. We show that a tilting bundle consisting of line bundles is the $2$-canonical tilting bundle up to degree shift. We also provide a procedure to construct ACM tilting bundles, which give a rich source of (almost) $2$-representation infinite algebras. As an application, we give a classification result for ACM tilting bundles.
Submitted 16 June, 2025;
originally announced June 2025.
-
Measurement of the $Ω_c^0$ and $Ξ_c^0$ baryon lifetimes using hadronic $b$-baryon decays
Authors:
LHCb collaboration,
R. Aaij,
A. S. W. Abdelmotteleb,
C. Abellan Beteta,
F. Abudinén,
T. Ackernley,
A. A. Adefisoye,
B. Adeva,
M. Adinolfi,
P. Adlarson,
C. Agapopoulou,
C. A. Aidala,
Z. Ajaltouni,
S. Akar,
K. Akiba,
P. Albicocco,
J. Albrecht,
F. Alessio,
M. Alexander,
Z. Aliouche,
P. Alvarez Cartelle,
R. Amalric,
S. Amato,
J. L. Amey,
Y. Amhis
, et al. (1141 additional authors not shown)
Abstract:
The lifetimes of the $Ω_c^0$ and $Ξ_c^0$ baryons are measured using a $pp$ collision dataset collected by the LHCb experiment, corresponding to an integrated luminosity of $9~\rm{fb^{-1}}$. The charm baryons are produced in the fully reconstructed decay chains $Ω_b^- \rightarrow Ω_c^0 (\rightarrow pK^-K^-π^+)~π^-$ and $Ξ_b^- \rightarrow Ξ_c^0 (\rightarrow pK^-K^-π^+)~π^-$. The measurement uses topologically and kinematically similar $B^- \rightarrow D^0(\rightarrow K^-K^+π^-π^+)~π^-$ decays for normalisation. The measured lifetimes are
$τ_{Ω_c^0} = 276.3 \pm 19.4~\rm{(stat)} \pm 1.8~\rm{(syst)} \pm 0.7~(τ_{D^0})~\rm{fs}$,
$τ_{Ξ_c^0} = 149.2 \pm ~\,2.5~\rm{(stat)} \pm 0.9~\rm{(syst)} \pm 0.4~(τ_{D^0})~\rm{fs}$,
where the first uncertainty is statistical, the second systematic and the third due to the uncertainty of the $D^0$ lifetime. These results are consistent with previous measurements performed by the LHCb experiment.
Submitted 16 June, 2025;
originally announced June 2025.
-
A Game-Theoretic Negotiation Framework for Cross-Cultural Consensus in LLMs
Authors:
Guoxi Zhang,
Jiawei Chen,
Tianzhuo Yang,
Jiaming Ji,
Yaodong Yang,
Juntao Dai
Abstract:
The increasing prevalence of large language models (LLMs) is influencing global value systems. However, these models frequently exhibit a pronounced WEIRD (Western, Educated, Industrialized, Rich, Democratic) cultural bias due to a lack of attention to minority values. This monocultural perspective may reinforce dominant values and marginalize diverse cultural viewpoints, posing challenges for the development of equitable and inclusive AI systems. In this work, we introduce a systematic framework designed to promote fair and robust cross-cultural consensus among LLMs. We model consensus as a Nash Equilibrium and employ a game-theoretic negotiation method based on Policy-Space Response Oracles (PSRO) to simulate an organized cross-cultural negotiation process. To evaluate this approach, we construct regional cultural agents using data transformed from the World Values Survey (WVS). Beyond the conventional model-level evaluation method, we further propose two quantitative metrics, Perplexity-based Acceptance and Values Self-Consistency, to assess consensus outcomes. Experimental results indicate that our approach generates consensus of higher quality while ensuring more balanced compromise compared to baselines. Overall, it mitigates WEIRD bias by guiding agents toward convergence through fair and gradual negotiation steps.
Submitted 16 June, 2025;
originally announced June 2025.
-
DVP-MVS++: Synergize Depth-Normal-Edge and Harmonized Visibility Prior for Multi-View Stereo
Authors:
Zhenlong Yuan,
Dapeng Zhang,
Zehao Li,
Chengxuan Qian,
Jianing Chen,
Yinda Chen,
Kehua Chen,
Tianlu Mao,
Zhaoxin Li,
Hao Jiang,
Zhaoqi Wang
Abstract:
Recently, patch deformation-based methods have demonstrated significant effectiveness in multi-view stereo due to their incorporation of deformable and expandable perception for reconstructing textureless areas. However, these methods generally focus on identifying reliable pixel correlations to mitigate matching ambiguity of patch deformation, while neglecting the deformation instability caused by edge-skipping and visibility occlusions, which may cause estimation deviations. To address these issues, we propose DVP-MVS++, an innovative approach that synergizes both depth-normal-edge aligned and harmonized cross-view priors for robust and visibility-aware patch deformation. Specifically, to avoid edge-skipping, we first apply DepthPro, Metric3Dv2 and the Roberts operator to generate coarse depth maps, normal maps and edge maps, respectively. These maps are then aligned via an erosion-dilation strategy to produce fine-grained homogeneous boundaries for facilitating robust patch deformation. Moreover, we reformulate view selection weights as visibility maps, and then implement both an enhanced cross-view depth reprojection and an area-maximization strategy to help reliably restore visible areas and effectively balance deformed patches, thus acquiring harmonized cross-view priors for visibility-aware patch deformation. Additionally, we obtain geometry consistency by adopting both aggregated normals via view selection and projection depth differences via epipolar lines, and then employ SHIQ for highlight correction to enable geometry consistency with highlight-aware perception, thus improving reconstruction quality during the propagation and refinement stages. Evaluation results on the ETH3D, Tanks & Temples and Strecha datasets demonstrate the state-of-the-art performance and robust generalization capability of our proposed method.
Submitted 16 June, 2025;
originally announced June 2025.
-
Reconfigurable Digital RRAM Logic Enables In-Situ Pruning and Learning for Edge AI
Authors:
Songqi Wang,
Yue Zhang,
Jia Chen,
Xinyuan Zhang,
Yi Li,
Ning Lin,
Yangu He,
Jichang Yang,
Yingjie Yu,
Yi Li,
Zhongrui Wang,
Xiaojuan Qi,
Han Wang
Abstract:
The human brain simultaneously optimizes synaptic weights and topology by growing, pruning, and strengthening synapses while performing all computation entirely in memory. In contrast, modern artificial-intelligence systems separate weight optimization from topology optimization and depend on energy-intensive von Neumann architectures. Here, we present a software-hardware co-design that bridges this gap. On the algorithmic side, we introduce a real-time dynamic weight-pruning strategy that monitors weight similarity during training and removes redundancies on the fly, reducing operations by 26.80% on MNIST and 59.94% on ModelNet10 without sacrificing accuracy (91.44% and 77.75%, respectively). On the hardware side, we fabricate a reconfigurable, fully digital compute-in-memory (CIM) chip based on 180 nm one-transistor-one-resistor (1T1R) RRAM arrays. Each array embeds flexible Boolean logic (NAND, AND, XOR, OR), enabling both convolution and similarity evaluation inside memory and eliminating all ADC/DAC overhead. The digital design achieves zero bit-error, reduces silicon area by 72.30% and overall energy by 57.26% compared to analogue RRAM CIM, and lowers energy by 75.61% and 86.53% on MNIST and ModelNet10, respectively, relative to an NVIDIA RTX 4090. Together, our co-design establishes a scalable brain-inspired paradigm for adaptive, energy-efficient edge intelligence in the future.
Submitted 16 June, 2025;
originally announced June 2025.
-
Implementing van der Waals forces for polytope particles in DEM simulations of clay
Authors:
Dominik Krengel,
Jian Chen,
Zhipeng Yu,
Hans-Georg Matuttis,
Takashi Matsushima
Abstract:
Clay minerals are non-spherical nano-scale particles that usually form flocculated, house-of-cards-like structures under the influence of inter-molecular forces. Numerical modeling of clays is still in its infancy as the required inter-particle forces are available only for spherical particles. A polytope approach would allow shape-accurate forces and torques while simultaneously being more performant. The Anandarajah solution provides an analytical formulation of van der Waals forces for cuboid particles, but in its original form it is not suitable for implementation in DEM simulations. In this work, we discuss the changes necessary for a functional implementation of the Anandarajah solution in a DEM simulation of rectangular particles and their extension to cuboid particles.
Submitted 15 June, 2025;
originally announced June 2025.
-
AFBS: Buffer Gradient Selection in Semi-asynchronous Federated Learning
Authors:
Chaoyi Lu,
Yiding Sun,
Jinqian Chen,
Zhichuan Yang,
Jiangming Pan,
Jihua Zhu
Abstract:
Asynchronous federated learning (AFL) accelerates training by eliminating the need to wait for stragglers, but its asynchronous nature introduces gradient staleness, where outdated gradients degrade performance. Existing solutions address this issue with gradient buffers, forming a semi-asynchronous framework. However, this approach struggles when buffers accumulate numerous stale gradients, as blindly aggregating all gradients can harm training. To address this, we propose AFBS (Asynchronous FL Buffer Selection), the first algorithm to perform gradient selection within buffers while ensuring privacy protection. Specifically, each client sends its label distribution matrix, encrypted by random projection, to the server before training, and the server clusters clients based on it. During training, the server scores and selects gradients within each cluster based on their informational value, discarding low-value gradients to enhance semi-asynchronous federated learning. Extensive experiments in highly heterogeneous system and data environments demonstrate AFBS's superior performance compared to state-of-the-art methods. Notably, on the most challenging task, CIFAR-100, AFBS improves accuracy by up to 4.8% over the previous best algorithm and reduces the time to reach target accuracy by 75%.
Submitted 23 June, 2025; v1 submitted 15 June, 2025;
originally announced June 2025.
-
Combining Self-attention and Dilation Convolutional for Semantic Segmentation of Coal Maceral Groups
Authors:
Zhenghao Xi,
Zhengnan Lv,
Yang Zheng,
Xiang Liu,
Zhuang Yu,
Junran Chen,
Jing Hu,
Yaqi Liu
Abstract:
The segmentation of coal maceral groups can be described as semantic segmentation of coal maceral group images, which is of great significance for studying the chemical properties of coal. Generally, existing semantic segmentation models of coal maceral groups stack parameters to achieve higher accuracy. This increases computational requirements and reduces model training efficiency. At the same time, because sampling coal maceral group images is specialized and diverse, obtaining enough samples for model training takes a long time and requires professional personnel. To address these issues, we develop an IoT-based DA-VIT parallel network model. With this model, we can continuously broaden the dataset through IoT and achieve sustained improvement in the accuracy of coal maceral group segmentation. Besides, we decouple the parallel network from the backbone network to ensure the normal use of the backbone network during model data updates. Secondly, the DCSA mechanism of DA-VIT is introduced to enhance the local feature information of coal microscopic images. DCSA decomposes the large kernels of convolutional attention into multiple scales and reduces the number of parameters by 81.18%. Finally, we perform comparison and ablation experiments between DA-VIT and state-of-the-art methods across many evaluation metrics. Experimental results show that DA-VIT-Base achieves 92.14% pixel accuracy and 63.18% mIoU. The Params and FLOPs of DA-VIT-Tiny are 4.95M and 8.99G, respectively. On all evaluation metrics, the proposed DA-VIT outperforms other state-of-the-art methods.
Submitted 15 June, 2025;
originally announced June 2025.
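The abstract above reports pixel accuracy and mIoU. As a reading aid, here is a minimal sketch of how these two metrics are commonly computed from a confusion matrix. The function names and the toy labels are illustrative assumptions and are not taken from the paper, whose evaluation code is not shown in this listing.

```python
# Sketch only: standard definitions of pixel accuracy and mean IoU.
# All names here are hypothetical; this is not the authors' evaluation code.

def confusion_matrix(pred, target, num_classes):
    """Count pixels by (true class, predicted class) over flat label lists."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        cm[t][p] += 1
    return cm

def pixel_accuracy(cm):
    """Fraction of pixels whose predicted class equals the true class."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total if total else 0.0

def mean_iou(cm):
    """Average of per-class TP / (TP + FP + FN), skipping absent classes."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # true c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp   # predicted c, true otherwise
        denom = tp + fp + fn
        if denom:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0

# Toy example with two classes and four pixels.
cm = confusion_matrix(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
pa = pixel_accuracy(cm)
miou = mean_iou(cm)
```

Because mIoU averages over classes while pixel accuracy averages over pixels, a model can score high on the first and noticeably lower on the second when some classes are rare, which is consistent with the gap between the two figures quoted in the abstract.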
-
Characterization of fiberwise bimeromorphism and specialization of bimeromorphic types I: the non-negative Kodaira dimension case
Authors:
Jian Chen,
Sheng Rao,
I-Hsun Tsai
Abstract:
Inspired by the recent works of M. Kontsevich--Y. Tschinkel and J. Nicaise--J. C. Ottem on specialization of birational types for smooth families (in the scheme category) and J. Kollár's work on fiberwise bimeromorphism, we focus on characterizing the fiberwise bimeromorphism and utilizing the characterization to investigate the specialization of bimeromorphic types for non-smooth families in the complex analytic setting. We provide some criteria for a bimeromorphic map between two families over the same base to be fiberwise bimeromorphic. By combining these criteria with ideas by D. Mumford--U. Persson and T. de Fernex--D. Fusi, as well as K. Timmerscheidt's approach via the relative Barlet cycle space theory, we establish the specialization of bimeromorphic types for locally Moishezon families with fibers having only canonical singularities and being of non-negative Kodaira dimension. These specialization results can easily lead to criteria for locally strongly bimeromorphic isotriviality. Throughout this paper, we unveil the connections among the four classical topics in bimeromorphic geometry: the deformation behavior of plurigenera (or even $1$-genus), fiberwise bimeromorphism, specialization of bimeromorphic types, and the bimeromorphic version of the deformation rigidity.
Submitted 14 June, 2025;
originally announced June 2025.
-
Exploring Audio Cues for Enhanced Test-Time Video Model Adaptation
Authors:
Runhao Zeng,
Qi Deng,
Ronghao Zhang,
Shuaicheng Niu,
Jian Chen,
Xiping Hu,
Victor C. M. Leung
Abstract:
Test-time adaptation (TTA) aims to boost the generalization capability of a trained model by conducting self-/unsupervised learning during the testing phase. While most existing TTA methods for video primarily utilize visual supervisory signals, they often overlook the potential contribution of inherent audio data. To address this gap, we propose a novel approach that incorporates audio information into video TTA. Our method capitalizes on the rich semantic content of audio to generate audio-assisted pseudo-labels, a new concept in the context of video TTA. Specifically, we propose an audio-to-video label mapping method by first employing pre-trained audio models to classify audio signals extracted from videos and then mapping the audio-based predictions to video label spaces through large language models, thereby establishing a connection between the audio categories and video labels. To effectively leverage the generated pseudo-labels, we present a flexible adaptation cycle that determines the optimal number of adaptation iterations for each sample, based on changes in loss and consistency across different views. This enables a customized adaptation process for each sample. Experimental results on two widely used datasets (UCF101-C and Kinetics-Sounds-C), as well as on two newly constructed audio-video TTA datasets (AVE-C and AVMIT-C) with various corruption types, demonstrate the superiority of our approach. Our method consistently improves adaptation performance across different video classification models and represents a significant step forward in integrating audio information into video TTA. Code: https://github.com/keikeiqi/Audio-Assisted-TTA.
Submitted 14 June, 2025;
originally announced June 2025.
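The abstract above describes an adaptation cycle that picks the number of adaptation iterations per sample from changes in loss and consistency across views. The following is a rough sketch of one way such a stopping rule could look. The function `adaptive_steps`, the callback `step_fn`, the threshold `loss_tol`, and the specific rule (stop once all views agree and the loss has stopped changing) are assumptions made for illustration; the authors' actual criterion is in their linked repository and may differ.

```python
# Sketch only: a hypothetical per-sample stopping rule, not the paper's method.

def adaptive_steps(step_fn, max_steps=5, loss_tol=1e-2):
    """Run adaptation steps on one sample until a stopping rule fires.

    step_fn: hypothetical callback that performs one adaptation step and
             returns (loss, view_preds), where view_preds holds the predicted
             label for each augmented view of the sample.
    Returns the number of steps actually taken.
    """
    prev_loss = None
    for step in range(1, max_steps + 1):
        loss, view_preds = step_fn()
        # Assumed consistency check: every view predicts the same label.
        consistent = len(set(view_preds)) == 1
        # Assumed convergence check: loss changed by less than loss_tol.
        converged = prev_loss is not None and abs(prev_loss - loss) < loss_tol
        if consistent and converged:
            return step
        prev_loss = loss
    return max_steps

# Toy usage with a scripted sequence of (loss, per-view predictions).
script = iter([
    (1.0, (0, 1)),    # views disagree
    (0.5, (1, 1)),    # views agree, loss still changing
    (0.495, (1, 1)),  # views agree, loss has plateaued
    (0.49, (1, 1)),
])
steps_taken = adaptive_steps(lambda: next(script))
```

In this toy run the loop stops at the third step, so easy samples would receive few updates while harder ones use the full budget of `max_steps`.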