Search | arXiv e-print repository

Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods

Authors: Zhaiming Shen, Alexander Hsu, Rongjie Lai, Wenjing Liao

Abstract: While in-context learning (ICL) has achieved remarkable success in natural language and vision domains, its theoretical understanding--particularly in the context of structured geometric data--remains unexplored. In this work, we initiate a theoretical study of ICL for regression of Hölder functions on manifolds. By establishing a novel connection between the attention mechanism and classical kern… ▽ More While in-context learning (ICL) has achieved remarkable success in natural language and vision domains, its theoretical understanding--particularly in the context of structured geometric data--remains unexplored. In this work, we initiate a theoretical study of ICL for regression of Hölder functions on manifolds. By establishing a novel connection between the attention mechanism and classical kernel methods, we derive generalization error bounds in terms of the prompt length and the number of training tasks. When a sufficient number of training tasks are observed, transformers give rise to the minimax regression rate of Hölder functions on manifolds, which scales exponentially with the intrinsic dimension of the manifold, rather than the ambient space dimension. Our result also characterizes how the generalization error scales with the number of training tasks, shedding light on the complexity of transformers as in-context algorithm learners. Our findings provide foundational insights into the role of geometry in ICL and novels tools to study ICL of nonlinear models. △ Less

Submitted 12 June, 2025; originally announced June 2025.

arXiv:2506.10712 [pdf, ps, other]

Uncertainty-Masked Bernoulli Diffusion for Camouflaged Object Detection Refinement

Authors: Yuqi Shen, Fengyang Xiao, Sujie Hu, Youwei Pang, Yifan Pu, Chengyu Fang, Xiu Li, Chunming He

Abstract: Camouflaged Object Detection (COD) presents inherent challenges due to the subtle visual differences between targets and their backgrounds. While existing methods have made notable progress, there remains significant potential for post-processing refinement that has yet to be fully explored. To address this limitation, we propose the Uncertainty-Masked Bernoulli Diffusion (UMBD) model, the first g… ▽ More Camouflaged Object Detection (COD) presents inherent challenges due to the subtle visual differences between targets and their backgrounds. While existing methods have made notable progress, there remains significant potential for post-processing refinement that has yet to be fully explored. To address this limitation, we propose the Uncertainty-Masked Bernoulli Diffusion (UMBD) model, the first generative refinement framework specifically designed for COD. UMBD introduces an uncertainty-guided masking mechanism that selectively applies Bernoulli diffusion to residual regions with poor segmentation quality, enabling targeted refinement while preserving correctly segmented areas. To support this process, we design the Hybrid Uncertainty Quantification Network (HUQNet), which employs a multi-branch architecture and fuses uncertainty from multiple sources to improve estimation accuracy. This enables adaptive guidance during the generative sampling process. The proposed UMBD framework can be seamlessly integrated with a wide range of existing Encoder-Decoder-based COD models, combining their discriminative capabilities with the generative advantages of diffusion-based refinement. Extensive experiments across multiple COD benchmarks demonstrate consistent performance improvements, achieving average gains of 5.5% in MAE and 3.2% in weighted F-measure with only modest computational overhead. Code will be released. △ Less

Submitted 12 June, 2025; originally announced June 2025.

Comments: 16 pages, 7 figures

arXiv:2506.10675 [pdf, ps, other]

ConStyX: Content Style Augmentation for Generalizable Medical Image Segmentation

Authors: Xi Chen, Zhiqiang Shen, Peng Cao, Jinzhu Yang, Osmar R. Zaiane

Abstract: Medical images are usually collected from multiple domains, leading to domain shifts that impair the performance of medical image segmentation models. Domain Generalization (DG) aims to address this issue by training a robust model with strong generalizability. Recently, numerous domain randomization-based DG methods have been proposed. However, these methods suffer from the following limitations:… ▽ More Medical images are usually collected from multiple domains, leading to domain shifts that impair the performance of medical image segmentation models. Domain Generalization (DG) aims to address this issue by training a robust model with strong generalizability. Recently, numerous domain randomization-based DG methods have been proposed. However, these methods suffer from the following limitations: 1) constrained efficiency of domain randomization due to their exclusive dependence on image style perturbation, and 2) neglect of the adverse effects of over-augmented images on model training. To address these issues, we propose a novel domain randomization-based DG method, called content style augmentation (ConStyX), for generalizable medical image segmentation. Specifically, ConStyX 1) augments the content and style of training data, allowing the augmented training data to better cover a wider range of data domains, and 2) leverages well-augmented features while mitigating the negative effects of over-augmented features during model training. Extensive experiments across multiple domains demonstrate that our ConStyX achieves superior generalization performance. The code is available at https://github.com/jwxsp1/ConStyX. △ Less

Submitted 12 June, 2025; originally announced June 2025.

arXiv:2506.10648 [pdf, ps, other]

Vortex-magnetic competition and regime transitions in antiparallel flux tubes

Authors: Weiyu Shen, Rodolfo Ostilla-Mónico, Xiaojue Zhu

Abstract: Vortex-magnetic interactions shape magnetohydrodynamic (MHD) turbulence, influencing energy transfer in astrophysical, geophysical, and industrial systems. On the Sun, granular-scale vortex flows couple strongly with magnetic fields, channelling energy into the corona. At high Reynolds numbers, vorticity and magnetic fields are nearly frozen into the charged fluid, and MHD flows emerge from the in… ▽ More Vortex-magnetic interactions shape magnetohydrodynamic (MHD) turbulence, influencing energy transfer in astrophysical, geophysical, and industrial systems. On the Sun, granular-scale vortex flows couple strongly with magnetic fields, channelling energy into the corona. At high Reynolds numbers, vorticity and magnetic fields are nearly frozen into the charged fluid, and MHD flows emerge from the interplay between vortex dynamics and Lorentz forces. To probe this competition in a controlled setting, we revisit the canonical problem of two antiparallel flux tubes. By varying the magnetic flux threading each tube--and thus sweeping the interaction parameter $N_i$, which gauges Lorentz-to-inertial force balance--we uncover three distinct regimes: vortex-dominated joint reconnection, instability-triggered cascade, and Lorentz-induced vortex disruption. At low $N_i$, classical vortex dynamics dominate, driving joint vortex-magnetic reconnection and amplifying magnetic energy via a dynamo effect. At moderate $N_i$, the system oscillates between vorticity-driven attraction and magnetic damping, triggering instabilities and nonlinear interactions that spawn secondary filaments and drive an energy cascade. At high $N_i$, Lorentz forces suppress vortex interactions, aligning the tubes axially while disrupting vortex cores and rapidly converting magnetic to kinetic energy. These findings reveal how the inertial-Lorentz balance governs energy transfer and coherent structure formation in MHD turbulence, offering insight into vortex-magnetic coevolution in astrophysical plasmas. △ Less

Submitted 12 June, 2025; originally announced June 2025.

arXiv:2506.10520 [pdf, ps, other]

Macro Graph of Experts for Billion-Scale Multi-Task Recommendation

Authors: Hongyu Yao, Zijin Hong, Hao Chen, Yuanchen Bei, Zhiqing Li, Qijie Shen, Zuobin Ying, Huan Gong, Feiran Huang

Abstract: Graph-based multi-task learning at billion-scale presents a significant challenge, as different tasks correspond to distinct billion-scale graphs. Traditional multi-task learning methods often neglect these graph structures, relying solely on individual user and item embeddings. However, disregarding graph structures overlooks substantial potential for improving performance. In this paper, we intr… ▽ More Graph-based multi-task learning at billion-scale presents a significant challenge, as different tasks correspond to distinct billion-scale graphs. Traditional multi-task learning methods often neglect these graph structures, relying solely on individual user and item embeddings. However, disregarding graph structures overlooks substantial potential for improving performance. In this paper, we introduce the Macro Graph of Expert (MGOE) framework, the first approach capable of leveraging macro graph embeddings to capture task-specific macro features while modeling the correlations between task-specific experts. Specifically, we propose the concept of a Macro Graph Bottom, which, for the first time, enables multi-task learning models to incorporate graph information effectively. We design the Macro Prediction Tower to dynamically integrate macro knowledge across tasks. MGOE has been deployed at scale, powering multi-task learning for the homepage of a leading billion-scale recommender system. Extensive offline experiments conducted on three public benchmark datasets demonstrate its superiority over state-of-the-art multi-task learning methods, establishing MGOE as a breakthrough in multi-task graph-based recommendation. Furthermore, online A/B tests confirm the superiority of MGOE in billion-scale recommender systems. △ Less

Submitted 12 June, 2025; originally announced June 2025.

arXiv:2506.10472 [pdf, ps, other]

Disentangling Electronic and Ionic Nonlinear Polarization Effects in the THz Kerr Response of LaAlO$_{3}$

Authors: Chao Shen, Maximilian Frenzel, Sebastian F. Maehrlein, Zhanybek Alpichshev

Abstract: Nonlinear responses to intense terahertz (THz) fields provide unique insights into complex dynamics of contemporary material systems. However, the interpretation of the obtained data, in particular, distinguishing genuine ionic oscillations from the instantaneous electronic responses in THz Kerr effect remains challenging. Here, we combine two-dimensional Terahertz Kerr effect (2D-TKE) spectroscop… ▽ More Nonlinear responses to intense terahertz (THz) fields provide unique insights into complex dynamics of contemporary material systems. However, the interpretation of the obtained data, in particular, distinguishing genuine ionic oscillations from the instantaneous electronic responses in THz Kerr effect remains challenging. Here, we combine two-dimensional Terahertz Kerr effect (2D-TKE) spectroscopy experiments and their modeling to unravel complex THz-induced temporal oscillations in twinned LaAlO$_3$ crystals at low temperatures. We identify the 1.1 THz mode as $E_g$ Raman phonon, while 0.86 THz and 0.36 THz signals are due to spurious effects resulting from the co- and counter-propagation of THz and optical probe pulses in birefringent twin domains. Furthermore, we determine that the $E_g$ mode is excited via a two-photon process, whereas THz pulse reflections at the sample surface produce a temporal response that can mimic anharmonic phonon coupling. Our findings highlight the importance of propagation effects in nonlinear THz experiments and provide a refined framework for interpreting THz polarization dynamics in birefringent crystals. △ Less

Submitted 12 June, 2025; originally announced June 2025.

arXiv:2506.10468 [pdf, ps, other]

Low-Barrier Dataset Collection with Real Human Body for Interactive Per-Garment Virtual Try-On

Authors: Zaiqiang Wu, Yechen Li, Jingyuan Liu, Yuki Shibata, Takayuki Hori, I-Chao Shen, Takeo Igarashi

Abstract: Existing image-based virtual try-on methods are often limited to the front view and lack real-time performance. While per-garment virtual try-on methods have tackled these issues by capturing per-garment datasets and training per-garment neural networks, they still encounter practical limitations: (1) the robotic mannequin used to capture per-garment datasets is prohibitively expensive for widespr… ▽ More Existing image-based virtual try-on methods are often limited to the front view and lack real-time performance. While per-garment virtual try-on methods have tackled these issues by capturing per-garment datasets and training per-garment neural networks, they still encounter practical limitations: (1) the robotic mannequin used to capture per-garment datasets is prohibitively expensive for widespread adoption and fails to accurately replicate natural human body deformation; (2) the synthesized garments often misalign with the human body. To address these challenges, we propose a low-barrier approach for collecting per-garment datasets using real human bodies, eliminating the necessity for a customized robotic mannequin. We also introduce a hybrid person representation that enhances the existing intermediate representation with a simplified DensePose map. This ensures accurate alignment of synthesized garment images with the human body and enables human-garment interaction without the need for customized wearable devices. We performed qualitative and quantitative evaluations against other state-of-the-art image-based virtual try-on methods and conducted ablation studies to demonstrate the superiority of our method regarding image quality and temporal consistency. Finally, our user study results indicated that most participants found our virtual try-on system helpful for making garment purchasing decisions. △ Less

Submitted 12 June, 2025; originally announced June 2025.

arXiv:2506.10465 [pdf, ps, other]

MedSeg-R: Reasoning Segmentation in Medical Images with Multimodal Large Language Models

Authors: Yu Huang, Zelin Peng, Yichen Zhao, Piao Yang, Xiaokang Yang, Wei Shen

Abstract: Medical image segmentation is crucial for clinical diagnosis, yet existing models are limited by their reliance on explicit human instructions and lack the active reasoning capabilities to understand complex clinical questions. While recent advancements in multimodal large language models (MLLMs) have improved medical question-answering (QA) tasks, most methods struggle to generate precise segment… ▽ More Medical image segmentation is crucial for clinical diagnosis, yet existing models are limited by their reliance on explicit human instructions and lack the active reasoning capabilities to understand complex clinical questions. While recent advancements in multimodal large language models (MLLMs) have improved medical question-answering (QA) tasks, most methods struggle to generate precise segmentation masks, limiting their application in automatic medical diagnosis. In this paper, we introduce medical image reasoning segmentation, a novel task that aims to generate segmentation masks based on complex and implicit medical instructions. To address this, we propose MedSeg-R, an end-to-end framework that leverages the reasoning abilities of MLLMs to interpret clinical questions while also capable of producing corresponding precise segmentation masks for medical images. It is built on two core components: 1) a global context understanding module that interprets images and comprehends complex medical instructions to generate multi-modal intermediate tokens, and 2) a pixel-level grounding module that decodes these tokens to produce precise segmentation masks and textual responses. Furthermore, we introduce MedSeg-QA, a large-scale dataset tailored for the medical image reasoning segmentation task. It includes over 10,000 image-mask pairs and multi-turn conversations, automatically annotated using large language models and refined through physician reviews. Experiments show MedSeg-R's superior performance across several benchmarks, achieving high segmentation accuracy and enabling interpretable textual analysis of medical images. △ Less

Submitted 12 June, 2025; originally announced June 2025.

Comments: †: Equal contribution

arXiv:2506.10365 [pdf]

AutoGEEval++: A Multi-Level and Multi-Geospatial-Modality Automated Evaluation Framework for Large Language Models in Geospatial Code Generation on Google Earth Engine

Authors: Shuyang Hou, Zhangxiao Shen, Huayi Wu, Haoyue Jiao, Ziqi Liu, Lutong Xie, Chang Liu, Jianyuan Liang, Yaxian Qing, Xiaopu Zhang, Dehua Peng, Zhipeng Gui, Xuefeng Guan

Abstract: Geospatial code generation is becoming a key frontier in integrating artificial intelligence with geo-scientific analysis, yet standardised automated evaluation tools for this task remain absent. This study presents AutoGEEval++, an enhanced framework building on AutoGEEval, and the first automated assessment system for large language models (LLMs) generating geospatial code on Google Earth Engine… ▽ More Geospatial code generation is becoming a key frontier in integrating artificial intelligence with geo-scientific analysis, yet standardised automated evaluation tools for this task remain absent. This study presents AutoGEEval++, an enhanced framework building on AutoGEEval, and the first automated assessment system for large language models (LLMs) generating geospatial code on Google Earth Engine (GEE). It supports diverse data modalities and varying task complexities. Built on the GEE Python API, AutoGEEval++ features a benchmark dataset-AutoGEEval++-Bench-with 6,365 test cases across 26 data types and three task categories: unit, combo, and theme tests. It includes a submission programme and a judge module to realise an end-to-end automated evaluation pipeline from code generation to execution-based validation. The framework adopts multi-dimensional metrics-accuracy, resource usage, run-time efficiency, and error types-balancing hallucination control and efficiency, and enabling boundary testing and error pattern analysis. Using AutoGEEval++, we evaluate 24 state-of-the-art LLMs (as of June 2025), including general-purpose, reasoning-enhanced, code-centric, and geoscience-specific models. Results reveal clear performance, stability, and error differences across task types, model designs, and deployment settings, confirming AutoGEEval++'s practical value and scalability in vertical-domain code generation. This work establishes the first standardised evaluation protocol and foundational benchmark for GEE-based LLM code generation, providing a unified basis for performance comparison and a methodological framework for systematic, domain-specific code evaluation. △ Less

Submitted 12 June, 2025; originally announced June 2025.

arXiv:2506.10337 [pdf, ps, other]

GeoCAD: Local Geometry-Controllable CAD Generation

Authors: Zhanwei Zhang, Kaiyuan Liu, Junjie Liu, Wenxiao Wang, Binbin Lin, Liang Xie, Chen Shen, Deng Cai

Abstract: Local geometry-controllable computer-aided design (CAD) generation aims to modify local parts of CAD models automatically, enhancing design efficiency. It also ensures that the shapes of newly generated local parts follow user-specific geometric instructions (e.g., an isosceles right triangle or a rectangle with one corner cut off). However, existing methods encounter challenges in achieving this… ▽ More Local geometry-controllable computer-aided design (CAD) generation aims to modify local parts of CAD models automatically, enhancing design efficiency. It also ensures that the shapes of newly generated local parts follow user-specific geometric instructions (e.g., an isosceles right triangle or a rectangle with one corner cut off). However, existing methods encounter challenges in achieving this goal. Specifically, they either lack the ability to follow textual instructions or are unable to focus on the local parts. To address this limitation, we introduce GeoCAD, a user-friendly and local geometry-controllable CAD generation method. Specifically, we first propose a complementary captioning strategy to generate geometric instructions for local parts. This strategy involves vertex-based and VLLM-based captioning for systematically annotating simple and complex parts, respectively. In this way, we caption $\sim$221k different local parts in total. In the training stage, given a CAD model, we randomly mask a local part. Then, using its geometric instruction and the remaining parts as input, we prompt large language models (LLMs) to predict the masked part. During inference, users can specify any local part for modification while adhering to a variety of predefined geometric instructions. Extensive experiments demonstrate the effectiveness of GeoCAD in generation quality, validity and text-to-CAD consistency. Code will be available at https://github.com/Zhanwei-Z/GeoCAD. △ Less

Submitted 12 June, 2025; originally announced June 2025.

Comments: 18 pages, 12 figures

arXiv:2506.10333 [pdf, ps, other]

Nonlinear Néel Spin-Orbit Torque in Centrosymmetric Antiferromagnets

Authors: Jin Cao, Weikang Wu, Huiying Liu, Shen Lai, Cong Xiao, X. C. Xie, Shengyuan A. Yang

Abstract: Electric control of Néel vector is a central task of antiferromagnetic (AFM) spintronics. The major scheme so far relies on the linear Néel torque, which however is restricted to AFMs with broken inversion symmetry. Here, we propose a nonlinear Néel spin-orbit torque, uniquely enabling electric control in the vast class of centrosymmetric AFMs, where the existing scheme fails. Importantly, its int… ▽ More Electric control of Néel vector is a central task of antiferromagnetic (AFM) spintronics. The major scheme so far relies on the linear Néel torque, which however is restricted to AFMs with broken inversion symmetry. Here, we propose a nonlinear Néel spin-orbit torque, uniquely enabling electric control in the vast class of centrosymmetric AFMs, where the existing scheme fails. Importantly, its intrinsic component, rooted in sublattice-resolved band quantum geometry, offers two additional advantages: It operates also in $\mathcal{PT}$-symmetric AFM insulators, where linear torque is forbidden; and it has anti-damping character, making it more efficient in driving magnetic dynamics. Combined with first-principles calculations, we predict large effect in MnRh and MnBi$_{2}$Te$_{4}$, which can be readily detected in experiment. Our work unveils a new fundamental effect, offers a new strategy of electric control in AFM systems beyond the existing paradigm, and opens the door to the field of nonlinear AFM spintronics. △ Less

Submitted 11 June, 2025; originally announced June 2025.

Comments: 6 pages, 4 figures and 1 table

arXiv:2506.10329 [pdf, ps, other]

Context-Adaptive Graph Neural Networks for Next POI Recommendation

Authors: Yu Lei, Limin Shen, Zhu Sun, Tiantian He, Yew-Soon Ong

Abstract: Next Point-of-Interest (POI) recommendation is a critical task in location-based services, aiming to predict users' next visits based on their check-in histories. While many existing methods leverage Graph Neural Networks (GNNs) to incorporate collaborative information and improve recommendation accuracy, most of them model each type of context using separate graphs, treating different factors in… ▽ More Next Point-of-Interest (POI) recommendation is a critical task in location-based services, aiming to predict users' next visits based on their check-in histories. While many existing methods leverage Graph Neural Networks (GNNs) to incorporate collaborative information and improve recommendation accuracy, most of them model each type of context using separate graphs, treating different factors in isolation. This limits their ability to model the co-influence of multiple contextual factors on user transitions during message propagation, resulting in suboptimal attention weights and recommendation performance. Furthermore, they often prioritize sequential components as the primary predictor, potentially undermining the semantic and structural information encoded in the POI embeddings learned by GNNs. To address these limitations, we propose a Context-Adaptive Graph Neural Networks (CAGNN) for next POI recommendation, which dynamically adjusts attention weights using edge-specific contextual factors and enables mutual enhancement between graph-based and sequential components. Specifically, CAGNN introduces (1) a context-adaptive attention mechanism that jointly incorporates different types of contextual factors into the attention computation during graph propagation, enabling the model to dynamically capture collaborative and context-dependent transition patterns; (2) a graph-sequential mutual enhancement module, which aligns the outputs of the graph- and sequential-based modules via the KL divergence, enabling mutual enhancement of both components. Experimental results on three real-world datasets demonstrate that CAGNN consistently outperforms state-of-the-art methods. Meanwhile, theoretical guarantees are provided that our context-adaptive attention mechanism improves the expressiveness of POI representations. △ Less

Submitted 11 June, 2025; originally announced June 2025.

Comments: 12 pages, 6 figures

arXiv:2506.10316 [pdf, ps, other]

Search for sub-GeV invisible particles in inclusive decays of $J/ψ$ to $φ$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai , et al. (704 additional authors not shown)

Abstract: A search for an invisible particle, $X$, with a mass between 0 and 0.96 $\textrm{GeV}/\textit{c}^{2}$, is performed in the process $J/ψ\rightarrowφ+ X$ using $(8774.0\pm39.4)\times10^{6}$ $J/ψ$ events collected with the BESIII detector from 2017 to 2019. The $φ$ meson is fully reconstructed and an efficient veto of photons, neutral and charged hadrons up to twice the $K_L^0$ mass is applied to the… ▽ More A search for an invisible particle, $X$, with a mass between 0 and 0.96 $\textrm{GeV}/\textit{c}^{2}$, is performed in the process $J/ψ\rightarrowφ+ X$ using $(8774.0\pm39.4)\times10^{6}$ $J/ψ$ events collected with the BESIII detector from 2017 to 2019. The $φ$ meson is fully reconstructed and an efficient veto of photons, neutral and charged hadrons up to twice the $K_L^0$ mass is applied to the rest of the events, and the recoil mass against the $φ$ is obtained precisely from the kinematic constraint in the event. No significant signal is observed in the investigated region and the upper limit on the inclusive branching fraction of $J/ψ\rightarrowφ+ X$ is determined to be $7.5\times10^{-8}$ at 90% confidence level. Upper limits at a 90% confidence level are also given for this branching fraction as a function of the invisible particle mass, varying from $9\times10^{-9}$ to $4\times10^{-8}$ over the investigated mass range. Additionally, a 90% confidence level upper limit on the branching fraction of $η\rightarrow \rm{invisible}$ is determined to $2.6\times10^{-5}$, which improves the previous best results by more than four times. The analysis technique in this work offers a clean window to search for sub-GeV invisible particles, which can be adapted for other $J/ψ$ decays and direct $e^+e^-$ annihilation experiments in future studies, and improve the sensitivity by orders of magnitude. △ Less

Submitted 11 June, 2025; originally announced June 2025.

Comments: 10 pages, 3 figures

arXiv:2506.10282 [pdf, ps, other]

Graph-MLLM: Harnessing Multimodal Large Language Models for Multimodal Graph Learning

Authors: Jiajin Liu, Dongzhe Fan, Jiacheng Shen, Chuanhao Ji, Daochen Zha, Qiaoyu Tan

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in representing and understanding diverse modalities. However, they typically focus on modality alignment in a pairwise manner while overlooking structural relationships across data points. Integrating multimodality with structured graph information (i.e., multimodal graphs, MMGs) is essential for real-world applica… ▽ More Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in representing and understanding diverse modalities. However, they typically focus on modality alignment in a pairwise manner while overlooking structural relationships across data points. Integrating multimodality with structured graph information (i.e., multimodal graphs, MMGs) is essential for real-world applications such as social networks, healthcare, and recommendation systems. Existing MMG learning methods fall into three paradigms based on how they leverage MLLMs: Encoder, Aligner, and Predictor. MLLM-as-Encoder focuses on enhancing graph neural networks (GNNs) via multimodal feature fusion; MLLM-as-Aligner aligns multimodal attributes in language or hidden space to enable LLM-based graph reasoning; MLLM-as-Predictor treats MLLMs as standalone reasoners with in-context learning or fine-tuning. Despite their advances, the MMG field lacks a unified benchmark to fairly evaluate across these approaches, making it unclear what progress has been made. To bridge this gap, we present Graph-MLLM, a comprehensive benchmark for multimodal graph learning by systematically evaluating these three paradigms across six datasets with different domains. Through extensive experiments, we observe that jointly considering the visual and textual attributes of the nodes benefits graph learning, even when using pre-trained text-to-image alignment models (e.g., CLIP) as encoders. We also find that converting visual attributes into textual descriptions further improves performance compared to directly using visual inputs. Moreover, we observe that fine-tuning MLLMs on specific MMGs can achieve state-of-the-art results in most scenarios, even without explicit graph structure information. We hope that our open-sourced library will facilitate rapid, equitable evaluation and inspire further innovative research in this field. △ Less

Submitted 11 June, 2025; originally announced June 2025.

Comments: 16 pages, 4 figures

arXiv:2506.10264 [pdf, ps, other]

WGSR-Bench: Wargame-based Game-theoretic Strategic Reasoning Benchmark for Large Language Models

Authors: Qiyue Yin, Pei Xu, Qiaozhe Li, Shengda Liu, Shengqi Shen, Tong Wang, Yihong Han, Xiaonan Zhao, Likun Yang, Shiyue Cao, Shiyu Qiu, Yuxuan Liu, Shizhao Yu, Lei Cui, Chengxin Yan, Jie Sun, Xiangquan Tang, Kaiqi Huang

Abstract: Recent breakthroughs in Large Language Models (LLMs) have led to a qualitative leap in artificial intelligence' s performance on reasoning tasks, particularly demonstrating remarkable capabilities in mathematical, symbolic, and commonsense reasoning. However, as a critical component of advanced human cognition, strategic reasoning, i.e., the ability to assess multi-agent behaviors in dynamic envir… ▽ More Recent breakthroughs in Large Language Models (LLMs) have led to a qualitative leap in artificial intelligence' s performance on reasoning tasks, particularly demonstrating remarkable capabilities in mathematical, symbolic, and commonsense reasoning. However, as a critical component of advanced human cognition, strategic reasoning, i.e., the ability to assess multi-agent behaviors in dynamic environments, formulate action plans, and adapt strategies, has yet to be systematically evaluated or modeled. To address this gap, this paper introduces WGSR-Bench, the first strategy reasoning benchmark for LLMs using wargame as its evaluation environment. Wargame, a quintessential high-complexity strategic scenario, integrates environmental uncertainty, adversarial dynamics, and non-unique strategic choices, making it an effective testbed for assessing LLMs' capabilities in multi-agent decision-making, intent inference, and counterfactual reasoning. WGSR-Bench designs test samples around three core tasks, i.e., Environmental situation awareness, Opponent risk modeling and Policy generation, which serve as the core S-POE architecture, to systematically assess main abilities of strategic reasoning. Finally, an LLM-based wargame agent is designed to integrate these parts for a comprehensive strategy reasoning assessment. With WGSR-Bench, we hope to assess the strengths and limitations of state-of-the-art LLMs in game-theoretic strategic reasoning and to advance research in large model-driven strategic intelligence. △ Less

Submitted 11 June, 2025; originally announced June 2025.

Comments: 15 pages, 17 figures

arXiv:2506.10235 [pdf, ps, other]

LaMAGIC2: Advanced Circuit Formulations for Language Model-Based Analog Topology Generation

Authors: Chen-Chia Chang, Wan-Hsuan Lin, Yikang Shen, Yiran Chen, Xin Zhang

Abstract: Automation of analog topology design is crucial due to customized requirements of modern applications with heavily manual engineering efforts. The state-of-the-art work applies a sequence-to-sequence approach and supervised finetuning on language models to generate topologies given user specifications. However, its circuit formulation is inefficient due to O(|V |2) token length and suffers from lo… ▽ More Automation of analog topology design is crucial due to customized requirements of modern applications with heavily manual engineering efforts. The state-of-the-art work applies a sequence-to-sequence approach and supervised finetuning on language models to generate topologies given user specifications. However, its circuit formulation is inefficient due to O(|V |2) token length and suffers from low precision sensitivity to numeric inputs. In this work, we introduce LaMAGIC2, a succinct float-input canonical formulation with identifier (SFCI) for language model-based analog topology generation. SFCI addresses these challenges by improving component-type recognition through identifier-based representations, reducing token length complexity to O(|V |), and enhancing numeric precision sensitivity for better performance under tight tolerances. Our experiments demonstrate that LaMAGIC2 achieves 34% higher success rates under a tight tolerance of 0.01 and 10X lower MSEs compared to a prior method. LaMAGIC2 also exhibits better transferability for circuits with more vertices with up to 58.5% improvement. These advancements establish LaMAGIC2 as a robust framework for analog topology generation. △ Less

Submitted 11 June, 2025; originally announced June 2025.

Comments: Accepted at 42nd International Conference on Machine Learning (ICML) 2025

arXiv:2506.10220 [pdf, ps, other]

Kinematic Confirmation of a Remarkable Linear Trail of Galaxies in the NGC 1052 Field, Consistent with Formation in a High-Speed Bullet Dwarf Collision

Authors: Michael A. Keim, Pieter van Dokkum, Zili Shen, Harrison Souchereau, Imad Pasha, Shany Danieli, Roberto Abraham, Aaron J. Romanowsky, Yimeng Tang

Abstract: A unique linear trail of diffuse galaxies was recently identified in the NGC 1052 field. This trail includes the remarkable, ultra-diffuse galaxies DF2 and DF4 which lack dark matter and host unusually luminous globular clusters. It has been proposed that the trail formed via a high-speed collision between two gas-rich dwarf galaxies. This scenario predicts that the trail galaxies are kinematicall… ▽ More A unique linear trail of diffuse galaxies was recently identified in the NGC 1052 field. This trail includes the remarkable, ultra-diffuse galaxies DF2 and DF4 which lack dark matter and host unusually luminous globular clusters. It has been proposed that the trail formed via a high-speed collision between two gas-rich dwarf galaxies. This scenario predicts that the trail galaxies are kinematically connected and follow a specific trend in radial velocity as a function of position, based on the known velocities and positions of DF2 and DF4. To test this hypothesis, we measured radial velocities for seven additional galaxies on the trail. While the galaxies' low surface brightnesses presented observational challenges, we employ several methods to obtain measurements for galaxies with effective surface brightnesses up to 28.6 mag arcsec$^{-2}$, including a narrow slit placed over globular clusters and a novel wide slit mode on Keck/LRIS, as well as a 'light bucket' mode on Keck/KCWI. We find that five of our seven targets follow the precise velocity trend predicted by DF2 and DF4, to a degree with just a 2% chance of randomly occurring. Moreover, the trail galaxies' radial velocities are significantly higher than those of the NGC 1052 group, setting it apart as a separate, kinematically connected system. Our findings support the theory that this trail of galaxies, including DF2 and DF4, formed together in a single event. A 'bullet dwarf' collision remains the only known explanation for all the unusual properties of DF2, DF4, and the associated trail of galaxies. △ Less

Submitted 11 June, 2025; originally announced June 2025.

Comments: Accepted for publication in ApJ

arXiv:2506.09807 [pdf, ps, other]

doi 10.1109/TIFS.2025.3570118

Physical Layer-Based Device Fingerprinting for Wireless Security: From Theory to Practice

Authors: Junqing Zhang, Francesco Ardizzon, Mattia Piana, Guanxiong Shen, Stefano Tomasin

Abstract: The identification of the devices from which a message is received is part of security mechanisms to ensure authentication in wireless communications. Conventional authentication approaches are cryptography-based, which, however, are usually computationally expensive and not adequate in the Internet of Things (IoT), where devices tend to be low-cost and with limited resources. This paper provides… ▽ More The identification of the devices from which a message is received is part of security mechanisms to ensure authentication in wireless communications. Conventional authentication approaches are cryptography-based, which, however, are usually computationally expensive and not adequate in the Internet of Things (IoT), where devices tend to be low-cost and with limited resources. This paper provides a comprehensive survey of physical layer-based device fingerprinting, which is an emerging device authentication for wireless security. In particular, this article focuses on hardware impairment-based identity authentication and channel features-based authentication. They are passive techniques that are readily applicable to legacy IoT devices. Their intrinsic hardware and channel features, algorithm design methodologies, application scenarios, and key research questions are extensively reviewed here. The remaining research challenges are discussed, and future work is suggested that can further enhance the physical layer-based device fingerprinting. △ Less

Submitted 11 June, 2025; originally announced June 2025.

arXiv:2506.09665 [pdf, ps, other]

VideoMat: Extracting PBR Materials from Video Diffusion Models

Authors: Jacob Munkberg, Zian Wang, Ruofan Liang, Tianchang Shen, Jon Hasselgren

Abstract: We leverage finetuned video diffusion models, intrinsic decomposition of videos, and physically-based differentiable rendering to generate high quality materials for 3D models given a text prompt or a single image. We condition a video diffusion model to respect the input geometry and lighting condition. This model produces multiple views of a given 3D model with coherent material properties. Seco… ▽ More We leverage finetuned video diffusion models, intrinsic decomposition of videos, and physically-based differentiable rendering to generate high quality materials for 3D models given a text prompt or a single image. We condition a video diffusion model to respect the input geometry and lighting condition. This model produces multiple views of a given 3D model with coherent material properties. Secondly, we use a recent model to extract intrinsics (base color, roughness, metallic) from the generated video. Finally, we use the intrinsics alongside the generated video in a differentiable path tracer to robustly extract PBR materials directly compatible with common content creation tools. △ Less

Submitted 11 June, 2025; originally announced June 2025.

arXiv:2506.09656 [pdf, ps, other]

Application-Driven Value Alignment in Agentic AI Systems: Survey and Perspectives

Authors: Wei Zeng, Hengshu Zhu, Chuan Qin, Han Wu, Yihang Cheng, Sirui Zhang, Xiaowei Jin, Yinuo Shen, Zhenxing Wang, Feimin Zhong, Hui Xiong

Abstract: The ongoing evolution of AI paradigms has propelled AI research into the Agentic AI stage. Consequently, the focus of research has shifted from single agents and simple applications towards multi-agent autonomous decision-making and task collaboration in complex environments. As Large Language Models (LLMs) advance, their applications become more diverse and complex, leading to increasingly situat… ▽ More The ongoing evolution of AI paradigms has propelled AI research into the Agentic AI stage. Consequently, the focus of research has shifted from single agents and simple applications towards multi-agent autonomous decision-making and task collaboration in complex environments. As Large Language Models (LLMs) advance, their applications become more diverse and complex, leading to increasingly situational and systemic risks. This has brought significant attention to value alignment for AI agents, which aims to ensure that an agent's goals, preferences, and behaviors align with human values and societal norms. This paper reviews value alignment in agent systems within specific application scenarios. It integrates the advancements in AI driven by large models with the demands of social governance. Our review covers value principles, agent system application scenarios, and agent value alignment evaluation. Specifically, value principles are organized hierarchically from a top-down perspective, encompassing macro, meso, and micro levels. Agent system application scenarios are categorized and reviewed from a general-to-specific viewpoint. Agent value alignment evaluation systematically examines datasets for value alignment assessment and relevant value alignment methods. Additionally, we delve into value coordination among multiple agents within agent systems. Finally, we propose several potential research directions in this field. △ Less

Submitted 11 June, 2025; originally announced June 2025.

arXiv:2506.09645 [pdf, ps, other]

Learning Efficient and Generalizable Graph Retriever for Knowledge-Graph Question Answering

Authors: Tianjun Yao, Haoxuan Li, Zhiqiang Shen, Pan Li, Tongliang Liu, Kun Zhang

Abstract: Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains, but their reliability is hindered by the outdated knowledge and hallucinations. Retrieval-Augmented Generation mitigates these issues by grounding LLMs with external knowledge; however, most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning. Knowledg… ▽ More Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains, but their reliability is hindered by the outdated knowledge and hallucinations. Retrieval-Augmented Generation mitigates these issues by grounding LLMs with external knowledge; however, most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning. Knowledge graphs, which represent facts as relational triples, offer a more structured and compact alternative. Recent studies have explored integrating knowledge graphs with LLMs for knowledge graph question answering (KGQA), with a significant proportion adopting the retrieve-then-reasoning paradigm. In this framework, graph-based retrievers have demonstrated strong empirical performance, yet they still face challenges in generalization ability. In this work, we propose RAPL, a novel framework for efficient and effective graph retrieval in KGQA. RAPL addresses these limitations through three aspects: (1) a two-stage labeling strategy that combines heuristic signals with parametric models to provide causally grounded supervision; (2) a model-agnostic graph transformation approach to capture both intra- and inter-triple interactions, thereby enhancing representational capacity; and (3) a path-based reasoning strategy that facilitates learning from the injected rational knowledge, and supports downstream reasoner through structured inputs. Empirically, RAPL outperforms state-of-the-art methods by $2.66\%-20.34\%$, and significantly reduces the performance gap between smaller and more powerful LLM-based reasoners, as well as the gap under cross-dataset settings, highlighting its superior retrieval capability and generalizability. Codes are available at: https://github.com/tianyao-aka/RAPL. △ Less

Submitted 11 June, 2025; originally announced June 2025.

Comments: 32 pages, 28 figures

ACM Class: I.2.6

arXiv:2506.09386 [pdf, ps, other]

Search for the charmonium weak decays $J/ψ\to D_{s}^{-}ρ^{+}+c.c.$ and $J/ψ\to D_{s}^{-}π^{+}+c.c.$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai , et al. (705 additional authors not shown)

Abstract: Based on $(10087\pm44)\times 10^6$ $J/ψ$ events recorded with the BESIII detector, we search for the rare charmonium weak decays $J/ψ\to D_{s}^{-}ρ^{+}+c.c.$ and $J/ψ\to D_{s}^{-}π^{+}+c.c.$ No signal is observed, and upper limits on the branching fractions at the $90\%$ confidence level are set as $\mathcal{B}(J/ψ\to D_{s}^{-}ρ^{+}+c.c.)<8.0\times10^{-7}$ and… ▽ More Based on $(10087\pm44)\times 10^6$ $J/ψ$ events recorded with the BESIII detector, we search for the rare charmonium weak decays $J/ψ\to D_{s}^{-}ρ^{+}+c.c.$ and $J/ψ\to D_{s}^{-}π^{+}+c.c.$ No signal is observed, and upper limits on the branching fractions at the $90\%$ confidence level are set as $\mathcal{B}(J/ψ\to D_{s}^{-}ρ^{+}+c.c.)<8.0\times10^{-7}$ and $\mathcal{B}(J/ψ\to D_{s}^{-}π^{+}+c.c.)<4.1\times10^{-7}$. Our results provide the most stringent experimental constraints on these decays. △ Less

Submitted 11 June, 2025; originally announced June 2025.

Comments: 18 pages, 3 figures

arXiv:2506.09369 [pdf, ps, other]

ScaleLSD: Scalable Deep Line Segment Detection Streamlined

Authors: Zeran Ke, Bin Tan, Xianwei Zheng, Yujun Shen, Tianfu Wu, Nan Xue

Abstract: This paper studies the problem of Line Segment Detection (LSD) for the characterization of line geometry in images, with the aim of learning a domain-agnostic robust LSD model that works well for any natural images. With the focus of scalable self-supervised learning of LSD, we revisit and streamline the fundamental designs of (deep and non-deep) LSD approaches to have a high-performing and effici… ▽ More This paper studies the problem of Line Segment Detection (LSD) for the characterization of line geometry in images, with the aim of learning a domain-agnostic robust LSD model that works well for any natural images. With the focus of scalable self-supervised learning of LSD, we revisit and streamline the fundamental designs of (deep and non-deep) LSD approaches to have a high-performing and efficient LSD learner, dubbed as ScaleLSD, for the curation of line geometry at scale from over 10M unlabeled real-world images. Our ScaleLSD works very well to detect much more number of line segments from any natural images even than the pioneered non-deep LSD approach, having a more complete and accurate geometric characterization of images using line segments. Experimentally, our proposed ScaleLSD is comprehensively testified under zero-shot protocols in detection performance, single-view 3D geometry estimation, two-view line segment matching, and multiview 3D line mapping, all with excellent performance obtained. Based on the thorough evaluation, our ScaleLSD is observed to be the first deep approach that outperforms the pioneered non-deep LSD in all aspects we have tested, significantly expanding and reinforcing the versatility of the line geometry of images. Code and Models are available at https://github.com/ant-research/scalelsd △ Less

Submitted 10 June, 2025; originally announced June 2025.

Comments: accepted to CVPR 2025; 17 pages, appendices included

arXiv:2506.09351 [pdf, ps, other]

DIVE into MoE: Diversity-Enhanced Reconstruction of Large Language Models from Dense into Mixture-of-Experts

Authors: Yuchen Feng, Bowen Shen, Naibin Gu, Jiaxuan Zhao, Peng Fu, Zheng Lin, Weiping Wang

Abstract: Large language models (LLMs) with the Mixture-of-Experts (MoE) architecture achieve high cost-efficiency by selectively activating a subset of the parameters. Despite the inference efficiency of MoE LLMs, the training of extensive experts from scratch incurs substantial overhead, whereas reconstructing a dense LLM into an MoE LLM significantly reduces the training budget. However, existing reconst… ▽ More Large language models (LLMs) with the Mixture-of-Experts (MoE) architecture achieve high cost-efficiency by selectively activating a subset of the parameters. Despite the inference efficiency of MoE LLMs, the training of extensive experts from scratch incurs substantial overhead, whereas reconstructing a dense LLM into an MoE LLM significantly reduces the training budget. However, existing reconstruction methods often overlook the diversity among experts, leading to potential redundancy. In this paper, we come up with the observation that a specific LLM exhibits notable diversity after being pruned on different calibration datasets, based on which we present a Diversity-Enhanced reconstruction method named DIVE. The recipe of DIVE includes domain affinity mining, pruning-based expert reconstruction, and efficient retraining. Specifically, the reconstruction includes pruning and reassembly of the feed-forward network (FFN) module. After reconstruction, we efficiently retrain the model on routers, experts and normalization modules. We implement DIVE on Llama-style LLMs with open-source training corpora. Experiments show that DIVE achieves training efficiency with minimal accuracy trade-offs, outperforming existing pruning and MoE reconstruction methods with the same number of activated parameters. △ Less

Submitted 10 June, 2025; originally announced June 2025.

Comments: ACL 2025

arXiv:2506.09344 [pdf, ps, other]

Ming-Omni: A Unified Multimodal Model for Perception and Generation

Authors: Inclusion AI, Biao Gong, Cheng Zou, Chuanyang Zheng, Chunluan Zhou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianxin Sun, Jiajia Liu, Jianjiang Zhu, Jun Peng, Kaixiang Ji, Kaiyou Song, Kaimeng Ren, Libin Wang, Lixiang Ru, Lele Xie, Longhua Tan , et al. (33 additional authors not shown)

Abstract: We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single… ▽ More We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, thereby facilitating diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-Omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which also allow the model to engage in context-aware chatting, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results showcase Ming-Omni offers a powerful solution for unified perception and generation across all modalities. Notably, our proposed Ming-Omni is the first open-source model we are aware of to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community. △ Less

Submitted 10 June, 2025; originally announced June 2025.

Comments: 18 pages,8 figures

arXiv:2506.09260 [pdf, ps, other]

ThinkQE: Query Expansion via an Evolving Thinking Process

Authors: Yibin Lei, Tao Shen, Andrew Yates

Abstract: Effective query expansion for web search benefits from promoting both exploration and result diversity to capture multiple interpretations and facets of a query. While recent LLM-based methods have improved retrieval performance and demonstrate strong domain generalization without additional training, they often generate narrowly focused expansions that overlook these desiderata. We propose ThinkQ… ▽ More Effective query expansion for web search benefits from promoting both exploration and result diversity to capture multiple interpretations and facets of a query. While recent LLM-based methods have improved retrieval performance and demonstrate strong domain generalization without additional training, they often generate narrowly focused expansions that overlook these desiderata. We propose ThinkQE, a test-time query expansion framework addressing this limitation through two key components: a thinking-based expansion process that encourages deeper and comprehensive semantic exploration, and a corpus-interaction strategy that iteratively refines expansions using retrieval feedback from the corpus. Experiments on diverse web search benchmarks (DL19, DL20, and BRIGHT) show ThinkQE consistently outperforms prior approaches, including training-intensive dense retrievers and rerankers. △ Less

Submitted 10 June, 2025; originally announced June 2025.

arXiv:2506.09182 [pdf, ps, other]

Towards Full-Scenario Safety Evaluation of Automated Vehicles: A Volume-Based Method

Authors: Hang Zhou, Chengyuan Ma, Shiyu Shen, Xiaopeng Li

Abstract: With the rapid development of automated vehicles (AVs) in recent years, commercially available AVs are increasingly demonstrating high-level automation capabilities. However, most existing AV safety evaluation methods are primarily designed for simple maneuvers such as car-following and lane-changing. While suitable for basic tests, these methods are insufficient for assessing high-level automatio… ▽ More With the rapid development of automated vehicles (AVs) in recent years, commercially available AVs are increasingly demonstrating high-level automation capabilities. However, most existing AV safety evaluation methods are primarily designed for simple maneuvers such as car-following and lane-changing. While suitable for basic tests, these methods are insufficient for assessing high-level automation functions deployed in more complex environments. First, these methods typically use crash rate as the evaluation metric, whose accuracy heavily depends on the quality and completeness of naturalistic driving environment data used to estimate scenario probabilities. Such data is often difficult and expensive to collect. Second, when applied to diverse scenarios, these methods suffer from the curse of dimensionality, making large-scale evaluation computationally intractable. To address these challenges, this paper proposes a novel framework for full-scenario AV safety evaluation. A unified model is first introduced to standardize the representation of diverse driving scenarios. This modeling approach constrains the dimension of most scenarios to a regular highway setting with three lanes and six surrounding background vehicles, significantly reducing dimensionality. To further avoid the limitations of probability-based method, we propose a volume-based evaluation method that quantifies the proportion of risky scenarios within the entire scenario space. For car-following scenarios, we prove that the set of safe scenarios is convex under specific settings, enabling exact volume computation. Experimental results validate the effectiveness of the proposed volume-based method using both AV behavior models from existing literature and six production AV models calibrated from field-test trajectory data in the Ultra-AV dataset. Code and data will be made publicly available upon acceptance of this paper. △ Less

Submitted 10 June, 2025; originally announced June 2025.

Comments: NA

arXiv:2506.09042 [pdf, ps, other]

Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models

Authors: Xuanchi Ren, Yifan Lu, Tianshi Cao, Ruiyuan Gao, Shengyu Huang, Amirmojtaba Sabour, Tianchang Shen, Tobias Pfaff, Jay Zhangjie Wu, Runjian Chen, Seung Wook Kim, Jun Gao, Laura Leal-Taixe, Mike Chen, Sanja Fidler, Huan Ling

Abstract: Collecting and annotating real-world data for safety-critical physical AI systems, such as Autonomous Vehicle (AV), is time-consuming and costly. It is especially challenging to capture rare edge cases, which play a critical role in training and testing of an AV system. To address this challenge, we introduce the Cosmos-Drive-Dreams - a synthetic data generation (SDG) pipeline that aims to generat… ▽ More Collecting and annotating real-world data for safety-critical physical AI systems, such as Autonomous Vehicle (AV), is time-consuming and costly. It is especially challenging to capture rare edge cases, which play a critical role in training and testing of an AV system. To address this challenge, we introduce the Cosmos-Drive-Dreams - a synthetic data generation (SDG) pipeline that aims to generate challenging scenarios to facilitate downstream tasks such as perception and driving policy training. Powering this pipeline is Cosmos-Drive, a suite of models specialized from NVIDIA Cosmos world foundation model for the driving domain and are capable of controllable, high-fidelity, multi-view, and spatiotemporally consistent driving video generation. We showcase the utility of these models by applying Cosmos-Drive-Dreams to scale the quantity and diversity of driving datasets with high-fidelity and challenging scenarios. Experimentally, we demonstrate that our generated data helps in mitigating long-tail distribution problems and enhances generalization in downstream tasks such as 3D lane detection, 3D object detection and driving policy learning. We open source our pipeline toolkit, dataset and model weights through the NVIDIA's Cosmos platform. Project page: https://research.nvidia.com/labs/toronto-ai/cosmos_drive_dreams △ Less

Submitted 11 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

Comments: Only the core contributors are listed. The full list of contributors can be found in Appendix A of this paper

arXiv:2506.08989 [pdf, ps, other]

SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning

Authors: Xiao Liang, Zhong-Zhi Li, Yeyun Gong, Yang Wang, Hengyuan Zhang, Yelong Shen, Ying Nian Wu, Weizhu Chen

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for training large language models (LLMs) on complex reasoning tasks, such as mathematical problem solving. A prerequisite for the scalability of RLVR is a high-quality problem set with precise and verifiable answers. However, the scarcity of well-crafted human-labeled math problems and limited-verification answers in exist… ▽ More Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for training large language models (LLMs) on complex reasoning tasks, such as mathematical problem solving. A prerequisite for the scalability of RLVR is a high-quality problem set with precise and verifiable answers. However, the scarcity of well-crafted human-labeled math problems and limited-verification answers in existing distillation-oriented synthetic datasets limit their effectiveness in RL. Additionally, most problem synthesis strategies indiscriminately expand the problem set without considering the model's capabilities, leading to low efficiency in generating useful questions. To mitigate this issue, we introduce a Self-aware Weakness-driven problem Synthesis framework (SwS) that systematically identifies model deficiencies and leverages them for problem augmentation. Specifically, we define weaknesses as questions that the model consistently fails to learn through its iterative sampling during RL training. We then extract the core concepts from these failure cases and synthesize new problems to strengthen the model's weak areas in subsequent augmented training, enabling it to focus on and gradually overcome its weaknesses. Without relying on external knowledge distillation, our framework enables robust generalization byempowering the model to self-identify and address its weaknesses in RL, yielding average performance gains of 10.0% and 7.7% on 7B and 32B models across eight mainstream reasoning benchmarks. △ Less

Submitted 10 June, 2025; originally announced June 2025.

Comments: Reinforcement Learning; Large Language Models; LLM Reasoning

arXiv:2506.08979 [pdf, other]

Rethinking Range-View LiDAR Segmentation in Adverse Weather

Authors: Longyu Yang, Ping Hu, Lu Zhang, Jun Liu, Yap-Peng Tan, Heng Tao Shen, Xiaofeng Zhu

Abstract: LiDAR segmentation has emerged as an important task to enrich multimedia experiences and analysis. Range-view-based methods have gained popularity due to their high computational efficiency and compatibility with real-time deployment. However, their generalized performance under adverse weather conditions remains underexplored, limiting their reliability in real-world environments. In this work, w… ▽ More LiDAR segmentation has emerged as an important task to enrich multimedia experiences and analysis. Range-view-based methods have gained popularity due to their high computational efficiency and compatibility with real-time deployment. However, their generalized performance under adverse weather conditions remains underexplored, limiting their reliability in real-world environments. In this work, we identify and analyze the unique challenges that affect the generalization of range-view LiDAR segmentation in severe weather. To address these challenges, we propose a modular and lightweight framework that enhances robustness without altering the core architecture of existing models. Our method reformulates the initial stem block of standard range-view networks into two branches to process geometric attributes and reflectance intensity separately. Specifically, a Geometric Abnormality Suppression (GAS) module reduces the influence of weather-induced spatial noise, and a Reflectance Distortion Calibration (RDC) module corrects reflectance distortions through memory-guided adaptive instance normalization. The processed features are then fused and passed to the original segmentation pipeline. Extensive experiments on different benchmarks and baseline models demonstrate that our approach significantly improves generalization to adverse weather with minimal inference overhead, offering a practical and effective solution for real-world LiDAR segmentation. △ Less

Submitted 10 June, 2025; originally announced June 2025.

arXiv:2506.08887 [pdf, ps, other]

DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval

Authors: Leqi Shen, Guoqiang Gong, Tianxiang Hao, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, Jungong Han, Guiguang Ding

Abstract: The parameter-efficient adaptation of the image-text pretraining model CLIP for video-text retrieval is a prominent area of research. While CLIP is focused on image-level vision-language matching, video-text retrieval demands comprehensive understanding at the video level. Three key discrepancies emerge in the transfer from image-level to video-level: vision, language, and alignment. However, exis… ▽ More The parameter-efficient adaptation of the image-text pretraining model CLIP for video-text retrieval is a prominent area of research. While CLIP is focused on image-level vision-language matching, video-text retrieval demands comprehensive understanding at the video level. Three key discrepancies emerge in the transfer from image-level to video-level: vision, language, and alignment. However, existing methods mainly focus on vision while neglecting language and alignment. In this paper, we propose Discrepancy Reduction in Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all three discrepancies. Specifically, we introduce Image-Video Features Fusion to integrate image-level and video-level features, effectively tackling both vision and language discrepancies. Additionally, we generate pseudo image captions to learn fine-grained image-level alignment. To mitigate alignment discrepancies, we propose Image-to-Video Alignment Distillation, which leverages image-level alignment knowledge to enhance video-level alignment. Extensive experiments demonstrate the superiority of our DiscoVLA. In particular, on MSRVTT with CLIP (ViT-B/16), DiscoVLA outperforms previous methods by 1.5% in R@1, reaching a final score of 50.5% R@1. The code is available at https://github.com/LunarShen/DsicoVLA. △ Less

Submitted 10 June, 2025; originally announced June 2025.

Comments: CVPR 2025

arXiv:2506.08873 [pdf]

Identifying vortex lattice in type-II superconductors via the dynamic magnetostrictive effect

Authors: Peipei Lu, Mengju Yuan, Jing Zhang, Qiang Gao, Shuang Liu, Yugang Zhang, Shipeng Shen, Long Zhang, Jun Lu, Xiaoyuan Zhou, Mingquan He, Aifeng Wang, Yang Li, Wenshan Hong, Shiliang Li, Huiqian Luo, Xingjiang Zhou, Xianhui Chen, Young Sun, Yisheng Chai

Abstract: In type-I superconductors, zero electrical resistivity and perfect diamagnetism define two fundamental criteria for superconducting behavior. In contrast, type-II superconductors exhibit more complex mixed-state physics, where magnetic flux penetrates the material above the lower critical field Hc1 in the form of quantized vortices, each carrying a single flux quantum. These vortices form a two-di… ▽ More In type-I superconductors, zero electrical resistivity and perfect diamagnetism define two fundamental criteria for superconducting behavior. In contrast, type-II superconductors exhibit more complex mixed-state physics, where magnetic flux penetrates the material above the lower critical field Hc1 in the form of quantized vortices, each carrying a single flux quantum. These vortices form a two-dimensional lattice which persists up to another irreversible field (Hirr) and then melts into a dissipative liquid phase. The vortex lattice is fundamental to the magnetic and electrical properties of type-II superconductors, a third definitive criterion-beyond resistivity and magnetization-for identifying this phase has remained elusive. Here, we report the discovery of a dynamic magnetostrictive effect, wherein the geometry of the superconductor oscillates only under an applied alternating magnetic field due to the disturbance of the vortex lattice. This effect is detected by a thin piezoelectric transducer, which converts the excited geometric deformation into an in-phase ac voltage. Notably, we find a direct and nearly linear relationship between the signal amplitude and the vortex density in lattice across several representative type-II superconductors. In the vortex liquid phase above Hirr, the signal amplitude rapidly decays to zero near the upper critical field (Hc2), accompanied by a pronounced out-of-phase component due to enhanced dissipation. This dynamic magnetostrictive effect not only reveals an unexplored magnetoelastic property of the vortex lattice but also establishes a fundamental criterion for identifying the type-II superconductors. △ Less

Submitted 10 June, 2025; originally announced June 2025.

Comments: 27 pages, 8 figures, submitted

arXiv:2506.08863 [pdf, ps, other]

Responses of a Coronal Hole to a Fast Flare-Driven Coronal Wave

Authors: Xiaofan Zhang, Huadong Chen, Guiping Zhou, Li Feng, Yang Su, Jinhan Guo, Leping Li, Wei Lin, Suli Ma, Yuandeng Shen, Ruisheng Zheng, Suo Liu, Xianyong Bai, Yuanyong Deng, Jingxiu Wang

Abstract: Coronal waves, significant solar phenomena, act as diagnostic tools for scientists studying solar atmosphere properties. Here, we present a novel observation detailing how a coronal wave event, associated with an X5.0 class flare, influenced the properties of an adjacent coronal hole through interaction. The coronal wave was observed in both extreme ultraviolet observations from the Atmospheric Im… ▽ More Coronal waves, significant solar phenomena, act as diagnostic tools for scientists studying solar atmosphere properties. Here, we present a novel observation detailing how a coronal wave event, associated with an X5.0 class flare, influenced the properties of an adjacent coronal hole through interaction. The coronal wave was observed in both extreme ultraviolet observations from the Atmospheric Imaging Assembly aboard the Solar Dynamics Observatory and Lyman-alpha observations from the Solar Disk Imager aboard the Advanced Space-based Solar Observatory. Utilizing the method of differential emission measure, we found that as the coronal wave passed through, the adjacent coronal hole experienced an increase in temperature from 1.31 to 1.43 MK and a rise in density from $\sim$1.62$\times10^{8}$ to 1.76$\times10^{8}$ cm$^{-3}$ within the rising period of $\sim$7 minutes. Subsequently, after the wave passed, the entire coronal hole transitioned to a new state with a slight temperature increase and a 14$\%$ decrease in density, with more pronounced changes observed at the coronal hole's boundary. Taking into account the impacts of radiative loss and heat conduction, the coronal wave was estimated to provide an average energy of 2.2$\times10^{8}$ erg cm$^{-2}$ to the coronal hole during the short rising period. This study highlights the identification of the coronal wave in both extreme ultraviolet and Lyman-alpha observations, shedding light on the significant energy input, particularly within the coronal hole. These findings provide new insights into better understanding kinematics of fast coronal waves, energy transfer processes open versus closed magnetic topologies, and the possible acceleration of solar winds. △ Less

Submitted 10 June, 2025; originally announced June 2025.

Comments: 10 pages, 5 figures, Accepted in ApJL

arXiv:2506.08640 [pdf, ps, other]

Orientation Matters: Making 3D Generative Models Orientation-Aligned

Authors: Yichong Lu, Yuzhuo Tian, Zijin Jiang, Yikun Zhao, Yuanbo Yang, Hao Ouyang, Haoji Hu, Huimin Yu, Yujun Shen, Yiyi Liao

Abstract: Humans intuitively perceive object shape and orientation from a single image, guided by strong priors about canonical poses. However, existing 3D generative models often produce misaligned results due to inconsistent training data, limiting their usability in downstream tasks. To address this gap, we introduce the task of orientation-aligned 3D object generation: producing 3D objects from single i… ▽ More Humans intuitively perceive object shape and orientation from a single image, guided by strong priors about canonical poses. However, existing 3D generative models often produce misaligned results due to inconsistent training data, limiting their usability in downstream tasks. To address this gap, we introduce the task of orientation-aligned 3D object generation: producing 3D objects from single images with consistent orientations across categories. To facilitate this, we construct Objaverse-OA, a dataset of 14,832 orientation-aligned 3D models spanning 1,008 categories. Leveraging Objaverse-OA, we fine-tune two representative 3D generative models based on multi-view diffusion and 3D variational autoencoder frameworks to produce aligned objects that generalize well to unseen objects across various categories. Experimental results demonstrate the superiority of our method over post-hoc alignment approaches. Furthermore, we showcase downstream applications enabled by our aligned object generation, including zero-shot object orientation estimation via analysis-by-synthesis and efficient arrow-based object rotation manipulation. △ Less

Submitted 10 June, 2025; originally announced June 2025.

Comments: Project Page: https://xdimlab.github.io/Orientation_Matters

arXiv:2506.08632 [pdf, other]

RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping

Authors: Yang Bai, Liudi Yang, George Eskandar, Fengyi Shen, Dong Chen, Mohammad Altillawi, Ziyuan Liu, Gitta Kutyniok

Abstract: Recent advancements in generative models have revolutionized video synthesis and editing. However, the scarcity of diverse, high-quality datasets continues to hinder video-conditioned robotic learning, limiting cross-platform generalization. In this work, we address the challenge of swapping a robotic arm in one video with another: a key step for crossembodiment learning. Unlike previous methods t… ▽ More Recent advancements in generative models have revolutionized video synthesis and editing. However, the scarcity of diverse, high-quality datasets continues to hinder video-conditioned robotic learning, limiting cross-platform generalization. In this work, we address the challenge of swapping a robotic arm in one video with another: a key step for crossembodiment learning. Unlike previous methods that depend on paired video demonstrations in the same environmental settings, our proposed framework, RoboSwap, operates on unpaired data from diverse environments, alleviating the data collection needs. RoboSwap introduces a novel video editing pipeline integrating both GANs and diffusion models, combining their isolated advantages. Specifically, we segment robotic arms from their backgrounds and train an unpaired GAN model to translate one robotic arm to another. The translated arm is blended with the original video background and refined with a diffusion model to enhance coherence, motion realism and object interaction. The GAN and diffusion stages are trained independently. Our experiments demonstrate that RoboSwap outperforms state-of-the-art video and image editing models on three benchmarks in terms of both structural coherence and motion consistency, thereby offering a robust solution for generating reliable, cross-embodiment data in robotic learning. △ Less

Submitted 10 June, 2025; originally announced June 2025.

arXiv:2506.08624 [pdf, ps, other]

Measurement of $ψ(2S)$ to $J/ψ$ cross-section ratio as function of multiplicity in $p$Pb collisions at$\sqrt{s_{NN}} = 8.16$ TeV

Authors: LHCb collaboration, R. Aaij, A. S. W. Abdelmotteleb, C. Abellan Beteta, F. Abudinén, T. Ackernley, A. A. Adefisoye, B. Adeva, M. Adinolfi, P. Adlarson, C. Agapopoulou, C. A. Aidala, Z. Ajaltouni, S. Akar, K. Akiba, P. Albicocco, J. Albrecht, F. Alessio, Z. Aliouche, P. Alvarez Cartelle, R. Amalric, S. Amato, J. L. Amey, Y. Amhis, L. An , et al. (1137 additional authors not shown)

Abstract: The production ratio of $ψ(2S)$ to $J/ψ$ charmonium states is presented as a function of multiplicity in proton-lead collisions at a centre-of-mass energy of $\sqrt{s_{NN}}=8.16$ TeV, for both prompt and nonprompt sources. The total luminosity recorded by the LHCb experiment corresponds to 13.6 $pb^{-1}$ for $p$Pb collisions and 20.8 $pb^{-1}$ for Pb$p$ collisions, where the first particle indicat… ▽ More The production ratio of $ψ(2S)$ to $J/ψ$ charmonium states is presented as a function of multiplicity in proton-lead collisions at a centre-of-mass energy of $\sqrt{s_{NN}}=8.16$ TeV, for both prompt and nonprompt sources. The total luminosity recorded by the LHCb experiment corresponds to 13.6 $pb^{-1}$ for $p$Pb collisions and 20.8 $pb^{-1}$ for Pb$p$ collisions, where the first particle indicates the forward direction of the detector. Measurements are performed in the dimuon final state at forward (backward) centre-of-mass rapidity $1.5<y^*<4.0$ ($-5.0<y^*<-2.5$) for $p$Pb (Pb$p$) collisions.A multiplicity dependence of the prompt production ratio is observed in $p$Pb collisions, whereas no dependence is found in nonprompt production, nor in either prompt or nonprompt production in Pb$p$ collisions. These results suggest that in the Pb-going direction additional suppression mechanisms beyond comover effects may be present, possibly related to the formation of quark-gluon plasma. This highlights a transition from small to large collision systems and provides important insight into the suppression of charmonia in proton-nucleus collisions. △ Less

Submitted 12 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

Comments: All figures and tables, along with machine-readable versions and any supplementary material and additional information, are available at https://lbfence.cern.ch/alcm/public/analysis/full-details/4177/ (LHCb public pages)

Report number: LHCb-PAPER-2025-011, CERN-EP-2025-114

arXiv:2506.08591 [pdf, ps, other]

Diversity-Guided MLP Reduction for Efficient Large Vision Transformers

Authors: Chengchao Shen, Hourun Zhu, Gongfan Fang, Jianxin Wang, Xinchao Wang

Abstract: Transformer models achieve excellent scaling property, where the performance is improved with the increment of model capacity. However, large-scale model parameters lead to an unaffordable cost of computing and memory. We analyze popular transformer architectures and find that multilayer perceptron (MLP) modules take up the majority of model parameters. To this end, we focus on the recoverability… ▽ More Transformer models achieve excellent scaling property, where the performance is improved with the increment of model capacity. However, large-scale model parameters lead to an unaffordable cost of computing and memory. We analyze popular transformer architectures and find that multilayer perceptron (MLP) modules take up the majority of model parameters. To this end, we focus on the recoverability of the compressed models and propose a Diversity-Guided MLP Reduction (DGMR) method to significantly reduce the parameters of large vision transformers with only negligible performance degradation. Specifically, we conduct a Gram-Schmidt weight pruning strategy to eliminate redundant neurons of MLP hidden layer, while preserving weight diversity for better performance recover during distillation. Compared to the model trained from scratch, our pruned model only requires 0.06\% data of LAION-2B (for the training of large vision transformers) without labels (ImageNet-1K) to recover the original performance. Experimental results on several state-of-the-art large vision transformers demonstrate that our method achieves a more than 57.0\% parameter and FLOPs reduction in a near lossless manner. Notably, for EVA-CLIP-E (4.4B), our method accomplishes a 71.5\% parameter and FLOPs reduction without performance degradation. The source code and trained weights are available at https://github.com/visresearch/DGMR. △ Less

Submitted 10 June, 2025; originally announced June 2025.

arXiv:2506.08576 [pdf, ps, other]

Measurement of the $η$ transition form factor through $η' \rightarrow π^+π^-η$ decay

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai , et al. (680 additional authors not shown)

Abstract: Based on a sample of $(1.0087\pm0.0044)\times10^{10}$ $J/ψ$ events collected at BESIII, the transition form factor of the $η$ meson is extracted by analyzing $J/ψ\toγη',~η'\toπ^+π^-η,~η\toγl^+l^-$ ($l$=$e$, $μ$) events. The measured slope of the transition form factor is $Λ^{-2}=1.645\pm0.093_{\rm stat.}\pm {0.024_{\rm sys.}}$ (GeV/$c^2$)$^{-2}$ for the di-electron channel and… ▽ More Based on a sample of $(1.0087\pm0.0044)\times10^{10}$ $J/ψ$ events collected at BESIII, the transition form factor of the $η$ meson is extracted by analyzing $J/ψ\toγη',~η'\toπ^+π^-η,~η\toγl^+l^-$ ($l$=$e$, $μ$) events. The measured slope of the transition form factor is $Λ^{-2}=1.645\pm0.093_{\rm stat.}\pm {0.024_{\rm sys.}}$ (GeV/$c^2$)$^{-2}$ for the di-electron channel and $Λ^{-2}=1.645\pm0.343_{\rm stat.}\pm0.017_{\rm sys.}$ (GeV/$c^2$)$^{-2}$ for the di-muon channel. The branching fractions for $η\rightarrowγe^+e^-$ and $η\rightarrowγμ^+μ^-$ are measured to be $\mathcal{B}(η\toγe^+e^-)=(6.79\pm0.04_{\rm stat.}\pm0.36_{\rm sys.})\times 10^{-3}$ and $\mathcal{B}(η\toγμ^+μ^-)=(2.97\pm0.11_{\rm stat.}\pm0.07_{\rm sys.})\times 10^{-4}$. By combining with the results based on the $J/ψ\toγη,~η\toγe^+e^-$ events from the previous BESIII measurement, we determine $Λ^{-2}=1.707\pm0.076_{\rm stat.}\pm0.029_{\rm sys.}$ (GeV/$c^2$)$^{-2}$ and $\mathcal{B}(η\toγe^+e^-)=(6.93\pm0.28_{\rm tot.})\times 10^{-3}$. In addition, we search for the dark photon ($A'$) using the combined events. No significant signal is observed, and the upper limits on $\mathcal{B}(η\toγA',~A'\to e^+e^-)$ are set at 90\% confidence level for different $A'$ mass hypotheses. △ Less

Submitted 10 June, 2025; originally announced June 2025.

arXiv:2506.08470 [pdf, ps, other]

MARMOT: Masked Autoencoder for Modeling Transient Imaging

Authors: Siyuan Shen, Ziheng Wang, Xingyue Peng, Suan Xia, Ruiqian Li, Shiying Li, Jingyi Yu

Abstract: Pretrained models have demonstrated impressive success in many modalities such as language and vision. Recent works facilitate the pretraining paradigm in imaging research. Transients are a novel modality, which are captured for an object as photon counts versus arrival times using a precisely time-resolved sensor. In particular for non-line-of-sight (NLOS) scenarios, transients of hidden objects… ▽ More Pretrained models have demonstrated impressive success in many modalities such as language and vision. Recent works facilitate the pretraining paradigm in imaging research. Transients are a novel modality, which are captured for an object as photon counts versus arrival times using a precisely time-resolved sensor. In particular for non-line-of-sight (NLOS) scenarios, transients of hidden objects are measured beyond the sensor's direct line of sight. Using NLOS transients, the majority of previous works optimize volume density or surfaces to reconstruct the hidden objects and do not transfer priors learned from datasets. In this work, we present a masked autoencoder for modeling transient imaging, or MARMOT, to facilitate NLOS applications. Our MARMOT is a self-supervised model pretrianed on massive and diverse NLOS transient datasets. Using a Transformer-based encoder-decoder, MARMOT learns features from partially masked transients via a scanning pattern mask (SPM), where the unmasked subset is functionally equivalent to arbitrary sampling, and predicts full measurements. Pretrained on TransVerse-a synthesized transient dataset of 500K 3D models-MARMOT adapts to downstream imaging tasks using direct feature transfer or decoder finetuning. Comprehensive experiments are carried out in comparisons with state-of-the-art methods. Quantitative and qualitative results demonstrate the efficiency of our MARMOT. △ Less

Submitted 10 June, 2025; originally announced June 2025.

arXiv:2506.08421 [pdf]

High-precision Beam Optics Calculation of the HIAF-BRing Using Measured Fields

Authors: Ke Wang, Li-Na Sheng, Geng Wang, Wei-Ping Chai, You-Jin Yuan, Jian-Cheng Yang, Guo-Dong Shen, Liang Lu

Abstract: The construction of the High Intensity heavy ion Accelerator Facility (HIAF) has been completed, with current efforts focused on subsystem commissioning. Beam commissioning is scheduled for autumn 2025, marking a critical milestone in HIAF's operational readiness. This paper presents high-precision optics calculations for the Booster Ring (BRing) of HIAF, a key component for achieving stable heavy… ▽ More The construction of the High Intensity heavy ion Accelerator Facility (HIAF) has been completed, with current efforts focused on subsystem commissioning. Beam commissioning is scheduled for autumn 2025, marking a critical milestone in HIAF's operational readiness. This paper presents high-precision optics calculations for the Booster Ring (BRing) of HIAF, a key component for achieving stable heavy-ion beam acceleration. Leveraging high-precision magnetic field data, each magnet is divided into hundreds of segmentations, thus establishing a high-precision sliced optics model for BRing. Detailed calculations of BRing's optics are presented in this work. Critical parameters including tunes and betatron functions of the lattice based on the measured magnetic fields and those of the ideal lattice have been compared. The results highlight the impact of realistic magnetic field on beam dynamics and provide essential insights for accelerator tuning and optimization. These findings serve as a fundamental reference for BRing's commissioning and long-term operation, ensuring beam stability and performance reproducibility in HIAF. △ Less

Submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.08390 [pdf, ps, other]

On Reasoning Strength Planning in Large Reasoning Models

Authors: Leheng Sheng, An Zhang, Zijian Wu, Weixiang Zhao, Changshuo Shen, Yi Zhang, Xiang Wang, Tat-Seng Chua

Abstract: Recent studies empirically reveal that large reasoning models (LRMs) can automatically allocate more reasoning strengths (i.e., the number of reasoning tokens) for harder problems, exhibiting difficulty-awareness for better task performance. While this automatic reasoning strength allocation phenomenon has been widely observed, its underlying mechanism remains largely unexplored. To this end, we p… ▽ More Recent studies empirically reveal that large reasoning models (LRMs) can automatically allocate more reasoning strengths (i.e., the number of reasoning tokens) for harder problems, exhibiting difficulty-awareness for better task performance. While this automatic reasoning strength allocation phenomenon has been widely observed, its underlying mechanism remains largely unexplored. To this end, we provide explanations for this phenomenon from the perspective of model activations. We find evidence that LRMs pre-plan the reasoning strengths in their activations even before generation, with this reasoning strength causally controlled by the magnitude of a pre-allocated directional vector. Specifically, we show that the number of reasoning tokens is predictable solely based on the question activations using linear probes, indicating that LRMs estimate the required reasoning strength in advance. We then uncover that LRMs encode this reasoning strength through a pre-allocated directional vector embedded in the activations of the model, where the vector's magnitude modulates the reasoning strength. Subtracting this vector can lead to reduced reasoning token number and performance, while adding this vector can lead to increased reasoning token number and even improved performance. We further reveal that this direction vector consistently yields positive reasoning length prediction, and it modifies the logits of end-of-reasoning token </think> to affect the reasoning length. Finally, we demonstrate two potential applications of our findings: overthinking behavior detection and enabling efficient reasoning on simple problems. Our work provides new insights into the internal mechanisms of reasoning in LRMs and offers practical tools for controlling their reasoning behaviors. Our code is available at https://github.com/AlphaLab-USTC/LRM-plans-CoT. △ Less

Submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.08371 [pdf, ps, other]

Mitigating Posterior Salience Attenuation in Long-Context LLMs with Positional Contrastive Decoding

Authors: Zikai Xiao, Ziyang Wang, Wen Ma, Yan Zhang, Wei Shen, Yan Wang, Luqi Gong, Zuozhu Liu

Abstract: While Large Language Models (LLMs) support long contexts, they struggle with performance degradation within the context window. Current solutions incur prohibitive training costs, leaving statistical behaviors and cost-effective approaches underexplored. From the decoding perspective, we identify the Posterior Salience Attenuation (PSA) phenomenon, where the salience ratio correlates with long-tex… ▽ More While Large Language Models (LLMs) support long contexts, they struggle with performance degradation within the context window. Current solutions incur prohibitive training costs, leaving statistical behaviors and cost-effective approaches underexplored. From the decoding perspective, we identify the Posterior Salience Attenuation (PSA) phenomenon, where the salience ratio correlates with long-text performance degradation. Notably, despite the attenuation, gold tokens still occupy high-ranking positions in the decoding space. Motivated by it, we propose the training-free Positional Contrastive Decoding (PCD) that contrasts the logits derived from long-aware attention with those from designed local-aware attention, enabling the model to focus on the gains introduced by large-scale short-to-long training. Through the analysis of long-term decay simulation, we demonstrate that PCD effectively alleviates attention score degradation. Experimental results show that PCD achieves state-of-the-art performance on long-context benchmarks. △ Less

Submitted 10 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.08368 [pdf, ps, other]

Prospects for Time-Domain and Multi-Messenger Science with eXTP

Authors: Shu-Xu Yi, Wen Zhao, Ren-Xin Xu, Xue-Feng Wu, Giulia Stratta, Simone Dall'Osso, Yan-Jun Xu, Andrea Santangelo, Silvia Zane, Shuang-Nan Zhang, Hua Feng, Huan Yang, Junjie Mao, Junqiang Ge, Lijing Shao, Mi-Xiang Lan, He Gao, Lin Lin, Ning Jiang, Qingwen Wu, Tong Liu, Yun-Wei Yu, Xiang-Yu Wang, Jin Zhang, Dafne Guetta , et al. (49 additional authors not shown)

Abstract: In this new era of time-domain and multi-messenger astronomy, various new transients and new phenomena are constantly being discovered thanks to the rapid advances in observations, which provide the excellent opportunity to study the physics in the extreme environments. The enhanced X-ray Timing and Polarimetry mission (eXTP), planned to be launched in 2030, has several key advantages, including a… ▽ More In this new era of time-domain and multi-messenger astronomy, various new transients and new phenomena are constantly being discovered thanks to the rapid advances in observations, which provide the excellent opportunity to study the physics in the extreme environments. The enhanced X-ray Timing and Polarimetry mission (eXTP), planned to be launched in 2030, has several key advantages, including advanced polarimetry, high sensitivity & large effective area, and wide energy range coverage, which make it a groundbreaking project in high-energy astrophysics. In this article, we briefly introduce the potential time-domain and multi-messenger targets for eXTP, including gravitational-wave (GW) counterparts, gamma-ray bursts (GRBs), magnetars and fast radio bursts (FRBs), tidal disruption events (TDEs), supernovae, high energy neutrinos and TeV active galactic nucleus (AGNs), and so on. We discuss the advantages of future eXTP observations for detecting these sources, their detection capabilities, the abilities to distinguish theoretical models, and their applications in gravity and cosmology. △ Less

Submitted 9 June, 2025; originally announced June 2025.

Comments: Submitted to the SCIENCE CHINA Physics, Mechanics & Astronomy

arXiv:2506.08352 [pdf, ps, other]

Reinforcement Fine-Tuning for Reasoning towards Multi-Step Multi-Source Search in Large Language Models

Authors: Wentao Shi, Yiqing Shen

Abstract: Large language models (LLMs) can face factual limitations when responding to time-sensitive queries about recent events that arise after their knowledge thresholds in the training corpus. Existing search-augmented approaches fall into two categories, each with distinct limitations: multi-agent search frameworks incur substantial computational overhead by separating search planning and response syn… ▽ More Large language models (LLMs) can face factual limitations when responding to time-sensitive queries about recent events that arise after their knowledge thresholds in the training corpus. Existing search-augmented approaches fall into two categories, each with distinct limitations: multi-agent search frameworks incur substantial computational overhead by separating search planning and response synthesis across multiple LLMs, while single-LLM tool-calling methods restrict themselves to sequential planned, single-query searches from sole search sources. We present Reasoning-Search (R-Search), a single-LLM search framework that unifies multi-step planning, multi-source search execution, and answer synthesis within one coherent inference process. Innovatively, it structure the output into four explicitly defined components, including reasoning steps that guide the search process (<think>), a natural-language directed acyclic graph that represents the search plans with respect to diverse sources (<search>), retrieved results from executing the search plans (<result>), and synthesized final answers (<answer>). To enable effective generation of these structured outputs, we propose a specialized Reinforcement Fine-Tuning (ReFT) method based on GRPO, together with a multi-component reward function that optimizes LLM's answer correctness, structural validity of the generated DAG, and adherence to the defined output format. Experimental evaluation on FinSearchBench-24, SearchExpertBench-25, and seven Q and A benchmarks demonstrates that R-Search outperforms state-of-the-art methods, while achieving substantial efficiency gains through 70% reduction in context token usage and approximately 50% decrease in execution latency. Code is available at https://github.com/wentao0429/Reasoning-search. △ Less

Submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.08326 [pdf, other]

Graph Prompting for Graph Learning Models: Recent Advances and Future Directions

Authors: Xingbo Fu, Zehong Wang, Zihan Chen, Jiazheng Li, Yaochen Zhu, Zhenyu Lei, Cong Shen, Yanfang Ye, Chuxu Zhang, Jundong Li

Abstract: Graph learning models have demonstrated great prowess in learning expressive representations from large-scale graph data in a wide variety of real-world scenarios. As a prevalent strategy for training powerful graph learning models, the "pre-training, adaptation" scheme first pre-trains graph learning models on unlabeled graph data in a self-supervised manner and then adapts them to specific downs… ▽ More Graph learning models have demonstrated great prowess in learning expressive representations from large-scale graph data in a wide variety of real-world scenarios. As a prevalent strategy for training powerful graph learning models, the "pre-training, adaptation" scheme first pre-trains graph learning models on unlabeled graph data in a self-supervised manner and then adapts them to specific downstream tasks. During the adaptation phase, graph prompting emerges as a promising approach that learns trainable prompts while keeping the pre-trained graph learning models unchanged. In this paper, we present a systematic review of recent advancements in graph prompting. First, we introduce representative graph pre-training methods that serve as the foundation step of graph prompting. Next, we review mainstream techniques in graph prompting and elaborate on how they design learnable prompts for graph prompting. Furthermore, we summarize the real-world applications of graph prompting from different domains. Finally, we discuss several open challenges in existing studies with promising future directions in this field. △ Less

Submitted 9 June, 2025; originally announced June 2025.

Comments: Accepted by KDD 2025 Tutorial/Survey Track

arXiv:2506.08319 [pdf, ps, other]

DEKC: Data-Enable Control for Tethered Space Robot Deployment in the Presence of Uncertainty via Koopman Operator Theory

Authors: Ao Jin, Qinyi Wang, Sijie Wen, Ya Liu, Ganghui Shen, Panfeng Huang, Fan Zhang

Abstract: This work focuses the deployment of tethered space robot in the presence of unknown uncertainty. A data-enable framework called DEKC which contains offline training part and online execution part is proposed to deploy tethered space robot in the presence of uncertainty. The main idea of this work is modeling the unknown uncertainty as a dynamical system, which enables high accuracy and convergence… ▽ More This work focuses the deployment of tethered space robot in the presence of unknown uncertainty. A data-enable framework called DEKC which contains offline training part and online execution part is proposed to deploy tethered space robot in the presence of uncertainty. The main idea of this work is modeling the unknown uncertainty as a dynamical system, which enables high accuracy and convergence of capturing uncertainty. The core part of proposed framework is a proxy model of uncertainty, which is derived from data-driven Koopman theory and is separated with controller design. In the offline stage, the lifting functions associated with Koopman operator are parameterized with deep neural networks. Then by solving an optimization problem, the lifting functions are learned from sampling data. In the online execution stage, the proxy model cooperates the learned lifting functions obtained in the offline phase to capture the unknown uncertainty. Then the output of proxy model is compensated to the baseline controller such that the effect of uncertainty can be attenuated or even eliminated. Furthermore, considering some scenarios in which the performance of proxy model may weaken, a receding-horizon scheme is proposed to update the proxy model online. Finally, the extensive numerical simulations demonstrate the effectiveness of our proposed framework. The implementation of proposed DEKC framework is publicly available at https://github.com/NPU-RCIR/DEKC.git. △ Less

Submitted 9 June, 2025; originally announced June 2025.

Comments: 12 pages

arXiv:2506.08228 [pdf, ps, other]

Scaling Laws of Motion Forecasting and Planning -- A Technical Report

Authors: Mustafa Baniodeh, Kratarth Goel, Scott Ettinger, Carlos Fuertes, Ari Seff, Tim Shen, Cole Gulino, Chenjie Yang, Ghassen Jerfel, Dokook Choe, Rui Wang, Vinutha Kallem, Sergio Casas, Rami Al-Rfou, Benjamin Sapp, Dragomir Anguelov

Abstract: We study the empirical scaling laws of a family of encoder-decoder autoregressive transformer models on the task of joint motion forecasting and planning in the autonomous driving domain. Using a 500 thousand hours driving dataset, we demonstrate that, similar to language modeling, model performance improves as a power-law function of the total compute budget, and we observe a strong correlation b… ▽ More We study the empirical scaling laws of a family of encoder-decoder autoregressive transformer models on the task of joint motion forecasting and planning in the autonomous driving domain. Using a 500 thousand hours driving dataset, we demonstrate that, similar to language modeling, model performance improves as a power-law function of the total compute budget, and we observe a strong correlation between model training loss and model evaluation metrics. Most interestingly, closed-loop metrics also improve with scaling, which has important implications for the suitability of open-loop metrics for model development and hill climbing. We also study the optimal scaling of the number of transformer parameters and the training data size for a training compute-optimal model. We find that as the training compute budget grows, optimal scaling requires increasing the model size 1.5x as fast as the dataset size. We also study inference-time compute scaling, where we observe that sampling and clustering the output of smaller models makes them competitive with larger models, up to a crossover point beyond which a larger models becomes more inference-compute efficient. Overall, our experimental results demonstrate that optimizing the training and inference-time scaling properties of motion forecasting and planning models is a key lever for improving their performance to address a wide variety of driving scenarios. Finally, we briefly study the utility of training on general logged driving data of other agents to improve the performance of the ego-agent, an important research area to address the scarcity of robotics data for large capacity models training. △ Less

Submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.08123 [pdf, ps, other]

QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA

Authors: Jacob Dineen, Aswin RRV, Qin Liu, Zhikun Xu, Xiao Ye, Ming Shen, Zhaonan Li, Shijie Lu, Chitta Baral, Muhao Chen, Ben Zhou

Abstract: Alignment of large language models with explicit principles (such as helpfulness, honesty, and harmlessness) is crucial for ensuring safe and reliable AI systems. However, standard reward-based alignment methods typically collapse diverse feedback into a single scalar reward, entangling multiple objectives into one opaque training signal, which hinders interpretability. In this work, we introduce… ▽ More Alignment of large language models with explicit principles (such as helpfulness, honesty, and harmlessness) is crucial for ensuring safe and reliable AI systems. However, standard reward-based alignment methods typically collapse diverse feedback into a single scalar reward, entangling multiple objectives into one opaque training signal, which hinders interpretability. In this work, we introduce QA-LIGN, an automatic symbolic reward decomposition approach that preserves the structure of each constitutional principle within the reward mechanism. Instead of training a black-box reward model that outputs a monolithic score, QA-LIGN formulates principle-specific evaluation questions and derives separate reward components for each principle, making it a drop-in reward model replacement. Experiments aligning an uncensored large language model with a set of constitutional principles demonstrate that QA-LIGN offers greater transparency and adaptability in the alignment process. At the same time, our approach achieves performance on par with or better than a DPO baseline. Overall, these results represent a step toward more interpretable and controllable alignment of language models, achieved without sacrificing end-task performance. △ Less

Submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.08105 [pdf, ps, other]

Probing the Strong Gravity Region of Black Holes with eXTP

Authors: Qingcui Bu, Cosimo Bambi, Lijun Gou, Yanjun Xu, Phil Uttley, Alessandra De Rosa, Andrea Santangelo, Silvia Zane, Hua Feng, Shuang-Nan Zhang, Chichuan Jin, Haiwu Pan, Xinwen Shu, Francesco Ursini, Yanan Wang, Jianfeng Wu, Bei You, Yefei Yuan, Wenda Zhang, Stefano Bianchi, Lixin Dai, Tiziana Di Salvo, Michal Dovciak, Yuan Feng, Hengxiao Guo , et al. (18 additional authors not shown)

Abstract: We present the novel capabilities of the enhanced X-ray Timing and Polarimetry (eXTP) mission to study the strong gravity region around stellar-mass black holes in X-ray binary systems and supermassive black holes in active galactic nuclei. eXTP can combine X-ray spectral, timing, and polarimetric techniques to study the accretion process near black holes, measure black hole masses and spins, and… ▽ More We present the novel capabilities of the enhanced X-ray Timing and Polarimetry (eXTP) mission to study the strong gravity region around stellar-mass black holes in X-ray binary systems and supermassive black holes in active galactic nuclei. eXTP can combine X-ray spectral, timing, and polarimetric techniques to study the accretion process near black holes, measure black hole masses and spins, and test Einstein's theory of General Relativity in the strong field regime. We show how eXTP can improve the current measurements of black holes of existing X-ray missions and we discuss the scientific questions that can be addressed. △ Less

Submitted 9 June, 2025; originally announced June 2025.

Comments: submitted to the SCIENCE CHINA Physics, Mechanics & Astronomy

arXiv:2506.08104 [pdf, ps, other]

Dense Matter in Neutron Stars with eXTP

Authors: Ang Li, Anna L. Watts, Guobao Zhang, Sebastien Guillot, Yanjun Xu, Andrea Santangelo, Silvia Zane, Hua Feng, Shuang-Nan Zhang, Mingyu Ge, Liqiang Qi, Tuomo Salmi, Bas Dorsman, Zhiqiang Miao, Zhonghao Tu, Yuri Cavecchi, Xia Zhou, Xiaoping Zheng, Weihua Wang, Quan Cheng, Xuezhi Liu, Yining Wei, Wei Wang, Yujing Xu, Shanshan Weng , et al. (58 additional authors not shown)

Abstract: In this White Paper, we present the potential of the enhanced X-ray Timing and Polarimetry (eXTP) mission to constrain the equation of state of dense matter in neutron stars, exploring regimes not directly accessible to terrestrial experiments. By observing a diverse population of neutron stars - including isolated objects, X-ray bursters, and accreting systems - eXTP's unique combination of timin… ▽ More In this White Paper, we present the potential of the enhanced X-ray Timing and Polarimetry (eXTP) mission to constrain the equation of state of dense matter in neutron stars, exploring regimes not directly accessible to terrestrial experiments. By observing a diverse population of neutron stars - including isolated objects, X-ray bursters, and accreting systems - eXTP's unique combination of timing, spectroscopy, and polarimetry enables high-precision measurements of compactness, spin, surface temperature, polarimetric signals, and timing irregularity. These multifaceted observations, combined with advances in theoretical modeling, pave the way toward a comprehensive description of the properties and phases of dense matter from the crust to the core of neutron stars. Under development by an international Consortium led by the Institute of High Energy Physics of the Chinese Academy of Sciences, the eXTP mission is planned to be launched in early 2030. △ Less

Submitted 9 June, 2025; originally announced June 2025.

Comments: submitted to the SCIENCE CHINA Physics, Mechanics & Astronomy

Showing 1–50 of 18,777 results for author: Shen