
HiLa: Hierarchical Vision-Language Collaboration for Cancer Survival Prediction

Authors: Jiaqi Cui, Lu Wen, Yuchen Fei, Bo Liu, Luping Zhou, Dinggang Shen, Yan Wang

Abstract: Survival prediction using whole-slide images (WSIs) is crucial in cancer research. Despite notable success, existing approaches are limited by their reliance on sparse slide-level labels, which hinders the learning of discriminative representations from gigapixel WSIs. Recently, vision-language (VL) models, which incorporate additional language supervision, have emerged as a promising solution. However, VL-based survival prediction remains largely unexplored due to two key challenges. First, current methods often rely on only one simple language prompt and basic cosine similarity, which fails to learn fine-grained associations between multi-faceted linguistic information and visual features within WSIs, resulting in inadequate vision-language alignment. Second, these methods primarily exploit patch-level information, overlooking the intrinsic hierarchy of WSIs and their interactions, causing ineffective modeling of hierarchical interactions. To tackle these problems, we propose a novel Hierarchical vision-Language collaboration (HiLa) framework for improved survival prediction. Specifically, HiLa employs pretrained feature extractors to generate hierarchical visual features from WSIs at both patch and region levels. At each level, a series of language prompts describing various survival-related attributes are constructed and aligned with visual features via Optimal Prompt Learning (OPL). This approach enables the comprehensive learning of discriminative visual features corresponding to different survival-related attributes from prompts, thereby improving vision-language alignment. Furthermore, we introduce two modules, i.e., Cross-Level Propagation (CLP) and Mutual Contrastive Learning (MCL), to maximize hierarchical cooperation by promoting interactions and consistency between patch and region levels. Experiments on three TCGA datasets demonstrate our SOTA performance.

Submitted 6 July, 2025; originally announced July 2025.

Comments: Accepted by MICCAI 2025

arXiv:2506.22586 [pdf, ps, other]

Sensitivity of nEXO to $^{136}$Xe Charged-Current Interactions: Background-free Searches for Solar Neutrinos and Fermionic Dark Matter

Authors: G. Richardson, B. G. Lenardo, D. Gallacher, R. Saldanha, P. Acharya, S. Al Kharusi, A. Amy, E. Angelico, A. Anker, I. J. Arnquist, A. Atencio, J. Bane, V. Belov, E. P. Bernard, T. Bhatta, A. Bolotnikov, J. Breslin, P. A. Breur, J. P. Brodsky, S. Bron, E. Brown, T. Brunner, B. Burnell, E. Caden, G. F. Cao, et al. (113 additional authors not shown)

Abstract: We study the sensitivity of nEXO to solar neutrino charged-current interactions, $ν_e + ^{136}$Xe$\rightarrow ^{136}$Cs$^* + e^-$, as well as analogous interactions predicted by models of fermionic dark matter. Due to the recently observed low-lying isomeric states of $^{136}$Cs, these interactions will create a time-delayed coincident signal observable in the scintillation channel. Here we develop a detailed Monte Carlo of scintillation emission, propagation, and detection in the nEXO detector to model these signals under different assumptions about the timing resolution of the photosensor readout. We show this correlated signal can be used to achieve background discrimination on the order of $10^{-9}$, enabling nEXO to make background-free measurements of solar neutrinos above the reaction threshold of 0.668 MeV. We project that nEXO could measure the flux of CNO solar neutrinos with a statistical uncertainty of 25%, thus contributing a novel and competitive measurement towards addressing the solar metallicity problem. Additionally, nEXO could measure the mean energy of the $^7$Be neutrinos with a precision of $σ\leq 1.5$ keV and could determine the survival probability of $^{7}$Be and $pep$ solar $ν_e$ with precision comparable to state-of-the-art. These quantities are sensitive to the Sun's core temperature and to non-standard neutrino interactions, respectively. Furthermore, the strong background suppression would allow nEXO to search for charged-current interactions of fermionic dark matter in the mass range $m_χ$ = $0.668$-$7$ MeV with a sensitivity up to three orders of magnitude better than current limits.

Submitted 27 June, 2025; originally announced June 2025.

arXiv:2506.21786 [pdf, ps, other]

Estimating Average Causal Effects with Incomplete Exposure and Confounders

Authors: Lan Wen, Glen McGee

Abstract: Standard methods for estimating average causal effects require complete observations of the exposure and confounders. In observational studies, however, missing data are ubiquitous. Motivated by a study on the effect of prescription opioids on mortality, we propose methods for estimating average causal effects when exposures and potential confounders may be missing. We consider missingness at random and additionally propose several specific missing not at random (MNAR) assumptions. Under our proposed MNAR assumptions, we show that the average causal effects are identified from the observed data and derive corresponding influence functions in a nonparametric model, which form the basis of our proposed estimators. Our simulations show that standard multiple imputation techniques paired with a complete data estimator are unbiased when data are missing at random (MAR) but can be biased otherwise. For each of the MNAR assumptions, we instead propose doubly robust targeted maximum likelihood estimators (TMLE), allowing misspecification of either (i) the outcome models or (ii) the exposure and missingness models. The proposed methods are suitable for any outcome type, and we apply them to a motivating study that examines the effect of prescription opioid usage on all-cause mortality using data from the National Health and Nutrition Examination Survey (NHANES).

Submitted 26 June, 2025; originally announced June 2025.

arXiv:2506.21250 [pdf, ps, other]

ACTLLM: Action Consistency Tuned Large Language Model

Authors: Jing Bi, Lianggong Bruce Wen, Zhang Liu, Chenliang Xu

Abstract: This paper introduces ACTLLM (Action Consistency Tuned Large Language Model), a novel approach for robot manipulation in dynamic environments. Traditional vision-based systems often struggle to learn visual representations that excel in both task execution and spatial reasoning, thereby limiting their adaptability in dynamic environments. ACTLLM addresses these challenges by harnessing language to craft structured scene descriptors, providing a uniform interface for both spatial understanding and task performance through flexible language instructions. Moreover, we introduce a novel action consistency constraint that aligns visual perception with corresponding actions, thereby enhancing the learning of actionable visual representations. Additionally, we have reformulated the Markov decision process for manipulation tasks into a multi-turn visual dialogue framework. This approach enables the modeling of long-term task execution with enhanced contextual relevance derived from the history of task execution. During our evaluation, ACTLLM excels in diverse scenarios, proving its effectiveness on challenging vision-based robot manipulation tasks.

Submitted 26 June, 2025; originally announced June 2025.

arXiv:2506.19826 [pdf, ps, other]

doi 10.3847/2041-8213/ade66d

Evolution of Cluster Alignments as Evidence of Large-scale Structure Formation in the Universe

Authors: Michael J. West, Roberto De Propris, Maret Einasto, Z. L. Wen, J. L. Han

Abstract: The universe's large-scale structure forms a vast, interconnected network of filaments, sheets, and voids known as the cosmic web. For decades, astronomers have observed that the orientations of neighboring galaxy clusters within these elongated structures are often aligned over separations of tens of Mpc. Using the largest available catalog of galaxy clusters, we show for the first time that cluster orientations are correlated over even larger scales, up to 200-300 comoving Mpc, and such alignments are seen to redshifts of at least z = 1. Comparison with numerical simulations suggests that coherent structures on similar scales may be expected in LCDM models.

Submitted 24 June, 2025; originally announced June 2025.

Comments: 13 pages, 7 figures. Accepted for publication in ApJ Letters

arXiv:2506.18800 [pdf, ps, other]

Electromagnetic polarizabilities of the spin-$\frac{3}{2}$ baryons in heavy baryon chiral perturbation theory

Authors: Liang-Zhen Wen, Yan-Ke Chen, Lu Meng, Shi-Lin Zhu

Abstract: We employ Heavy Baryon Chiral Perturbation Theory (HB$χ$PT), a non-relativistic effective field theory that treats baryons as heavy static sources, to calculate the electromagnetic polarizabilities of spin-3/2 baryons in two sectors: the light-flavor decuplet baryons and singly heavy sextet baryons. We derive the analytical expressions up to $\mathcal{O}\left(p^3\right)$. Our results indicate that the long-range chiral corrections provide substantial contributions to the polarizabilities. In addition, magnetic dipole (M1) transitions of the baryons can significantly affect the magnetic polarizabilities and may even reverse their signs. For the decuplet baryons, the $Δ^+$ and $Δ^0$ exhibit the largest electric polarizabilities. Their values, $α_E(Δ^+) = (17.4 \pm 9.5)\times 10^{-4} \, \mathrm{fm}^3$ and $α_E(Δ^0) = (16.9 \pm 9.2)\times 10^{-4} \, \mathrm{fm}^3$, significantly exceed those typically observed for nucleons. Meanwhile, the electric polarizabilities of spin-3/2 singly heavy baryons are comparable to those of their spin-1/2 partners.

Submitted 25 June, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

Comments: 22 pages, 1 figure, 9 tables. Comments are welcome. arXiv admin note: substantial text overlap with arXiv:2412.02297

arXiv:2506.16749 [pdf, ps, other]

Giant Magneto-Optical Effects in Two-Dimensional Flat-Band Antiferromagnets

Authors: Ping Yang, Wanxiang Feng, Siyuan Liu, Shan Guan, Liwei Wen, Wei Jiang, Gui-Bin Liu, Yugui Yao

Abstract: In this work, we reveal giant magneto-optical responses in two-dimensional (2D) antiferromagnets with nearly flat electronic bands, based on first-principles calculations and group-theoretical analysis. We identify a record-large second-order magneto-optical Schafer-Hubert (SH) effect, featuring a polarization rotation angle of 28 degrees, in monolayer antiferromagnetic RuOCl2, driven by flat-band-enhanced interband optical transitions. Both the valence and conduction bands exhibit pronounced directional flatness, giving rise to highly anisotropic optical absorption and broadband hyperbolic frequency windows spanning the entire visible spectrum. This anisotropy leads to an exceptionally strong linear dichroism (LD) reaching 50%, far exceeding values reported in other 2D magnetic systems. Remarkably, the giant SH effect and LD appear at distinct photon energies, reflecting a momentum-direction-dependent crossover between flat and dispersive bands. Both responses are further amplified with increasing RuOCl2 film thickness. Our results establish flat-band antiferromagnets as a fertile platform for realizing giant nonlinear magneto-optical effects and open new avenues for 2D opto-spintronic device applications.

Submitted 20 June, 2025; originally announced June 2025.

Comments: 6 pages, 3 figures

arXiv:2506.07971 [pdf, ps, other]

CyberV: Cybernetics for Test-time Scaling in Video Understanding

Authors: Jiahao Meng, Shuyang Sun, Yue Tan, Lu Qi, Yunhai Tong, Xiangtai Li, Longyin Wen

Abstract: Current Multimodal Large Language Models (MLLMs) may struggle with understanding long or complex videos due to computational demands at test time, lack of robustness, and limited accuracy, primarily stemming from their feed-forward processing nature. These limitations could be more severe for models with fewer parameters. To address these limitations, we propose a novel framework inspired by cybernetic principles, redesigning video MLLMs as adaptive systems capable of self-monitoring, self-correction, and dynamic resource allocation during inference. Our approach, CyberV, introduces a cybernetic loop consisting of an MLLM Inference System, a Sensor, and a Controller. Specifically, the sensor monitors forward processes of the MLLM and collects intermediate interpretations, such as attention drift, then the controller determines when and how to trigger self-correction and generate feedback to guide the next round. This test-time adaptive scaling framework enhances frozen MLLMs without requiring retraining or additional components. Experiments demonstrate significant improvements: CyberV boosts Qwen2.5-VL-7B by 8.3% and InternVL3-8B by 5.5% on VideoMMMU, surpassing the competitive proprietary model GPT-4o. When applied to Qwen2.5-VL-72B, it yields a 10.0% improvement, achieving performance even comparable to human experts. Furthermore, our method demonstrates consistent gains on general-purpose benchmarks, such as VideoMME and WorldSense, highlighting its effectiveness and generalization capabilities in making MLLMs more robust and accurate for dynamic video understanding. The code is released at https://github.com/marinero4972/CyberV.

Submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.00783 [pdf, other]

KG-TRACES: Enhancing Large Language Models with Knowledge Graph-constrained Trajectory Reasoning and Attribution Supervision

Authors: Rong Wu, Pinlong Cai, Jianbiao Mei, Licheng Wen, Tao Hu, Xuemeng Yang, Daocheng Fu, Botian Shi

Abstract: Large language models (LLMs) have made remarkable strides in various natural language processing tasks, but their performance on complex reasoning problems remains hindered by a lack of explainability and trustworthiness. This issue, often manifesting as hallucinations or unattributable reasoning processes, limits their applicability in complex reasoning scenarios. To address this, we propose Knowledge Graph-constrained Trajectory Reasoning Attribution and Chain Explanation Supervision (KG-TRACES), a novel framework that enhances the reasoning ability of LLMs through explicit supervision over reasoning paths and processes. KG-TRACES jointly supervises the model to: (1) predict symbolic relation paths, (2) predict full triple-level reasoning paths, and (3) generate attribution-aware reasoning processes grounded in the reasoning paths. At inference time, the model adapts to both KG-available and KG-unavailable scenarios, retrieving reasoning paths from a KG when possible or predicting plausible reasoning paths with only intrinsic knowledge when not. This design enables the model to reason in an explainable and source-attributable pattern. Through extensive experiments on complex reasoning tasks, we demonstrate that KG-TRACES significantly outperforms existing SOTA: it improves Hits@1 by 1.6% and F1 by 4.7% on WebQSP, and achieves improvements of 4.8% in Hits@1 and 2.1% in F1 on CWQ. Moreover, we show its transferability to specialized domains such as medicine. By visualizing the intermediate steps of reasoning processes, we further show that the explicit supervision introduced by KG-TRACES leads to more stable and goal-directed reasoning processes, aligning closely with correct answers. Code is available at https://github.com/Edaizi/KG-TRACES.

Submitted 31 May, 2025; originally announced June 2025.

Comments: 23 pages, 13 figures

arXiv:2505.21027 [pdf, ps, other]

TabAttackBench: A Benchmark for Adversarial Attacks on Tabular Data

Authors: Zhipeng He, Chun Ouyang, Lijie Wen, Cong Liu, Catarina Moreira

Abstract: Adversarial attacks pose a significant threat to machine learning models by inducing incorrect predictions through imperceptible perturbations to input data. While these attacks have been extensively studied in unstructured data like images, their application to tabular data presents new challenges. These challenges arise from the inherent heterogeneity and complex feature interdependencies in tabular data, which differ significantly from those in image data. To address these differences, it is crucial to consider imperceptibility as a key criterion specific to tabular data. Most current research focuses primarily on achieving effective adversarial attacks, often overlooking the importance of maintaining imperceptibility. To address this gap, we propose a new benchmark for adversarial attacks on tabular data that evaluates both effectiveness and imperceptibility. In this study, we assess the effectiveness and imperceptibility of five adversarial attacks across four models using eleven tabular datasets, including both mixed and numerical-only datasets. Our analysis explores how these factors interact and influence the overall performance of the attacks. We also compare the results across different dataset types to understand the broader implications of these findings. The findings from this benchmark provide valuable insights for improving the design of adversarial attack algorithms, thereby advancing the field of adversarial machine learning on tabular data.

Submitted 27 May, 2025; originally announced May 2025.

Comments: 63 pages, 22 figures, 6 tables

arXiv:2505.16582 [pdf, ps, other]

O$^2$-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering

Authors: Jianbiao Mei, Tao Hu, Daocheng Fu, Licheng Wen, Xuemeng Yang, Rong Wu, Pinlong Cai, Xinyu Cai, Xing Gao, Yu Yang, Chengjun Xie, Botian Shi, Yong Liu, Yu Qiao

Abstract: Large Language Models (LLMs), despite their advancements, are fundamentally limited by their static parametric knowledge, hindering performance on tasks requiring open-domain up-to-date information. While enabling LLMs to interact with external knowledge environments is a promising solution, current efforts primarily address closed-end problems. Open-ended questions, which are characterized by lacking a standard answer or providing non-unique and diverse answers, remain underexplored. To bridge this gap, we present O$^2$-Searcher, a novel search agent leveraging reinforcement learning to effectively tackle both open-ended and closed-ended questions in the open domain. O$^2$-Searcher leverages an efficient, locally simulated search environment for dynamic knowledge acquisition, effectively decoupling the external world knowledge from the model's sophisticated reasoning processes. It employs a unified training mechanism with meticulously designed reward functions, enabling the agent to identify problem types and adapt different answer generation strategies. Furthermore, to evaluate performance on complex open-ended tasks, we construct O$^2$-QA, a high-quality benchmark featuring 300 manually curated, multi-domain open-ended questions with associated web page caches. Extensive experiments show that O$^2$-Searcher, using only a 3B model, significantly surpasses leading LLM agents on O$^2$-QA. It also achieves SOTA results on various closed-ended QA benchmarks against similarly-sized models, while performing on par with much larger ones.

Submitted 26 May, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

Comments: 25 pages, 9 figures

arXiv:2505.12627 [pdf, ps, other]

Efficient Heuristics Generation for Solving Combinatorial Optimization Problems Using Large Language Models

Authors: Xuan Wu, Di Wang, Chunguo Wu, Lijie Wen, Chunyan Miao, Yubin Xiao, You Zhou

Abstract: Recent studies exploited Large Language Models (LLMs) to autonomously generate heuristics for solving Combinatorial Optimization Problems (COPs), by prompting LLMs to first provide search directions and then derive heuristics accordingly. However, the absence of task-specific knowledge in prompts often leads LLMs to provide unspecific search directions, obstructing the derivation of well-performing heuristics. Moreover, evaluating the derived heuristics remains resource-intensive, especially for semantically equivalent ones, whose evaluation often incurs avoidable resource expenditure. To enable LLMs to provide specific search directions, we propose the Hercules algorithm, which leverages our designed Core Abstraction Prompting (CAP) method to abstract the core components from elite heuristics and incorporate them as prior knowledge in prompts. We theoretically prove the effectiveness of CAP in reducing unspecificity and provide empirical results in this work. To reduce computing resources required for evaluating the derived heuristics, we propose few-shot Performance Prediction Prompting (PPP), a first-of-its-kind method for the Heuristic Generation (HG) task. PPP leverages LLMs to predict the fitness values of newly derived heuristics by analyzing their semantic similarity to previously evaluated ones. We further develop two tailored mechanisms for PPP to enhance predictive accuracy and determine unreliable predictions, respectively. The use of PPP makes Hercules more resource-efficient and we name this variant Hercules-P. Extensive experiments across four HG tasks, five COPs, and eight LLMs demonstrate that Hercules outperforms the state-of-the-art LLM-based HG algorithms, while Hercules-P excels at minimizing required computing resources. In addition, we illustrate the effectiveness of CAP, PPP, and the other proposed mechanisms by conducting relevant ablation studies.

Submitted 11 June, 2025; v1 submitted 18 May, 2025; originally announced May 2025.

Comments: Accepted by SIGKDD 2025

arXiv:2505.02500 [pdf, other]

Automating Automotive Software Development: A Synergy of Generative AI and Formal Methods

Authors: Fengjunjie Pan, Yinglei Song, Long Wen, Nenad Petrovic, Krzysztof Lebioda, Alois Knoll

Abstract: As the automotive industry shifts its focus toward software-defined vehicles, the need for faster and more reliable software development continues to grow. However, traditional methods show their limitations. The rise of Generative Artificial Intelligence (GenAI), particularly Large Language Models (LLMs), introduces new opportunities to automate automotive software development tasks such as requirement analysis and code generation. However, due to the complexity of automotive systems, where software components must interact with each other seamlessly, challenges remain in software integration and system-level validation. In this paper, we propose to combine GenAI with model-driven engineering to automate automotive software development. Our approach uses LLMs to convert free-text requirements into event chain descriptions and to generate platform-independent software components that realize the required functionality. At the same time, formal models are created based on event chain descriptions to support system validation and the generation of integration code for integrating generated software components in the whole vehicle system through middleware. This approach increases development automation while enabling formal analysis to improve system reliability. As a proof of concept, we used GPT-4o to implement our method and tested it in the CARLA simulation environment with ROS2 middleware. We evaluated the system in a simple Autonomous Emergency Braking scenario.

Submitted 5 May, 2025; originally announced May 2025.

arXiv:2505.02370 [pdf, other]

SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing

Authors: Ming Li, Xin Gu, Fan Chen, Xiaoying Xing, Longyin Wen, Chen Chen, Sijie Zhu

Abstract: Due to the challenges of manually collecting accurate editing data, existing datasets are typically constructed using various automated methods, leading to noisy supervision signals caused by the mismatch between editing instructions and original-edited image pairs. Recent efforts attempt to improve editing models through generating higher-quality edited images, pre-training on recognition tasks, or introducing vision-language models (VLMs) but fail to resolve this fundamental issue. In this paper, we offer a novel solution by constructing more effective editing instructions for given image pairs. This includes rectifying the editing instructions to better align with the original-edited image pairs and using contrastive editing instructions to further enhance their effectiveness. Specifically, we find that editing models exhibit specific generation attributes at different inference steps, independent of the text. Based on these prior attributes, we define a unified guide for VLMs to rectify editing instructions. However, there are some challenging editing scenarios that cannot be resolved solely with rectified instructions. To this end, we further construct contrastive supervision signals with positive and negative instructions and introduce them into the model training using triplet loss, thereby further facilitating supervision effectiveness. Our method does not require the VLM modules or pre-training tasks used in previous work, offering a more direct and efficient way to provide better supervision signals, and providing a novel, simple, and effective solution for instruction-based image editing. Results on multiple benchmarks demonstrate that our method significantly outperforms existing approaches. Compared with previous SOTA SmartEdit, we achieve 9.19% improvements on the Real-Edit benchmark with 30x less training data and 13x smaller model size.

Submitted 5 May, 2025; originally announced May 2025.

Comments: Code, Data and Models are available at: https://github.com/bytedance/SuperEdit

arXiv:2505.00359 [pdf, other]

TNStream: Applying Tightest Neighbors to Micro-Clusters to Define Multi-Density Clusters in Streaming Data

Authors: Qifen Zeng, Haomin Bao, Yuanzhuo Hu, Zirui Zhang, Yuheng Zheng, Luosheng Wen

Abstract: In data stream clustering, systematic theory remains relatively scarce. Recently, density-based methods have gained attention. However, existing algorithms struggle to simultaneously handle arbitrarily shaped, multi-density, high-dimensional data while maintaining strong outlier resistance. Clustering quality deteriorates significantly when data density varies in complex ways. This paper proposes a clustering algorithm based on the novel concept of Tightest Neighbors and introduces a data stream clustering theory based on the Skeleton Set. Based on these theories, this paper develops a new method, TNStream, a fully online algorithm. The algorithm adaptively determines the clustering radius based on local similarity, summarizing the evolution of multi-density data streams in micro-clusters. It then applies a Tightest Neighbors-based clustering algorithm to form final clusters. To improve efficiency in high-dimensional cases, Locality-Sensitive Hashing (LSH) is employed to structure micro-clusters, addressing the challenge of storing k-nearest neighbors. TNStream is evaluated on various synthetic and real-world datasets using different clustering metrics. Experimental results demonstrate its effectiveness in improving clustering quality for multi-density data and validate the proposed data stream clustering theory.

Submitted 1 May, 2025; originally announced May 2025.

Comments: 21 pages, 9 figures, 8 tables, under review at Expert Systems with Applications (ESWA)

MSC Class: 68T05; 68W20 ACM Class: H.2.8; I.5.3

arXiv:2505.00063 [pdf, other]

GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling

Authors: Siqi Li, Yufan Shen, Xiangnan Chen, Jiayi Chen, Hengwei Ju, Haodong Duan, Song Mao, Hongbin Zhou, Bo Zhang, Bin Fu, Pinlong Cai, Licheng Wen, Botian Shi, Yong Liu, Xinyu Cai, Yu Qiao

Abstract: The rapid advancement of multimodal large language models (MLLMs) has profoundly impacted the document domain, creating a wide array of application scenarios. This progress highlights the need for a comprehensive benchmark to evaluate these models' capabilities across various document-specific tasks. However, existing benchmarks often fail to locate specific model weaknesses or guide systematic improvements. To bridge this gap, we introduce a General Document Intelligence Benchmark (GDI-Bench), featuring 2.3k images across 9 key scenarios and 19 document-specific tasks. By decoupling visual complexity and reasoning complexity, the GDI-Bench structures graded tasks that allow performance assessment by difficulty, aiding in model weakness identification and optimization guidance. We evaluate various open-source and closed-source models on GDI-Bench, conducting decoupled analyses in the visual and reasoning domains, revealing their strengths and weaknesses. To address the diverse tasks and domains in the GDI-Bench, we propose a GDI-Model that mitigates catastrophic forgetting during the supervised fine-tuning (SFT) process through an intelligence-preserving training strategy, thereby reinforcing the inherent weaknesses of the base model. Our model achieves state-of-the-art performance on previous benchmarks and the GDI-Bench. Both our benchmark and models are or will be open-sourced on https://huggingface.co/GDIBench.

Submitted 22 May, 2025; v1 submitted 30 April, 2025; originally announced May 2025.

arXiv:2504.15681 [pdf, other]

Vidi: Large Multimodal Models for Video Understanding and Editing

Authors: Vidi Team, Celong Liu, Chia-Wen Kuo, Dawei Du, Fan Chen, Guang Chen, Jiamin Yuan, Lingxi Zhang, Lu Guo, Lusha Li, Longyin Wen, Qingyu Chen, Rachel Deng, Sijie Zhu, Stuart Siew, Tong Jin, Wei Lu, Wen Zhong, Xiaohui Shen, Xin Gu, Xing Mei, Xueqiong Qu

Abstract: Humans naturally share information with those they are connected to, and video has become one of the dominant mediums for communication and expression on the Internet. To support the creation of high-quality large-scale video content, a modern pipeline requires a comprehensive understanding of both the raw input materials (e.g., the unedited footage captured by cameras) and the editing components (e.g., visual effects). In video editing scenarios, models must process multiple modalities (e.g., vision, audio, text) with strong background knowledge and handle flexible input lengths (e.g., hour-long raw videos), which poses significant challenges for traditional models. In this report, we introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understanding and editing scenarios. The first release focuses on temporal retrieval, i.e., identifying the time ranges within the input videos corresponding to a given text query, which plays a critical role in intelligent editing. The model is capable of processing hour-long videos with strong temporal understanding capability, e.g., retrieving time ranges for certain queries. To support a comprehensive evaluation in real-world scenarios, we also present the VUE-TR benchmark, which introduces five key advancements: 1) Video duration: significantly longer than videos of existing temporal retrieval datasets, 2) Audio support: includes audio-based queries, 3) Query format: diverse query lengths/formats, 4) Annotation quality: ground-truth time ranges are manually annotated, 5) Evaluation metric: a refined IoU metric to support evaluation over multiple time ranges. Remarkably, Vidi significantly outperforms leading proprietary models, e.g., GPT-4o and Gemini, on the temporal retrieval task, indicating its superiority in video editing scenarios.

Submitted 24 April, 2025; v1 submitted 22 April, 2025; originally announced April 2025.

arXiv:2504.15464 [pdf, other]

Ultra-sensitive radon assay using an electrostatic chamber in a recirculating system

Authors: nEXO Collaboration, A. Anker, P. A. Breur, B. Mong, P. Acharya, A. Amy, E. Angelico, I. J. Arnquist, A. Atencio, J. Bane, V. Belov, E. P. Bernard, T. Bhatta, A. Bolotnikov, J. Breslin, J. P. Brodsky, S. Bron, E. Brown, T. Brunner, B. Burnell, E. Caden, L. Q. Cao, G. F. Cao, D. Cesmecioglu, D. Chernyak, et al. (116 additional authors not shown)

Abstract: Rare event searches such as neutrinoless double beta decay and Weakly Interacting Massive Particle detection require ultra-low background detectors. Radon contamination is a significant challenge for these experiments, which employ highly sensitive radon assay techniques to identify and select low-emission materials. This work presents the development of ultra-sensitive electrostatic chamber (ESC) instruments designed to measure radon emanation in a recirculating gas loop, for future lower background experiments. Unlike traditional methods that separate emanation and detection steps, this system allows continuous radon transport and detection. This is made possible with a custom-built recirculation pump. A Python-based analysis framework, PyDAn, was developed to process and fit time-dependent radon decay data. Radon emanation rates are given for various materials measured with this instrument. A radon source of known activity provides an absolute calibration, enabling statistically-limited minimal detectable activities of 20 $μ$Bq. These devices are powerful tools for screening materials in the development of low-background particle physics experiments.

Submitted 24 April, 2025; v1 submitted 21 April, 2025; originally announced April 2025.

Comments: 14 pages, 9 figures, 1 table

arXiv:2504.09665 [pdf, ps, other]

CLEAR-KGQA: Clarification-Enhanced Ambiguity Resolution for Knowledge Graph Question Answering

Authors: Liqiang Wen, Guanming Xiong, Tong Mo, Bing Li, Weiping Li, Wen Zhao

Abstract: This study addresses the challenge of ambiguity in knowledge graph question answering (KGQA). While recent KGQA systems have made significant progress, particularly with the integration of large language models (LLMs), they typically assume user queries are unambiguous, an assumption that rarely holds in real-world applications. To address these limitations, we propose a novel framework that dynamically handles both entity ambiguity (e.g., distinguishing between entities with similar names) and intent ambiguity (e.g., clarifying different interpretations of user queries) through interactive clarification. Our approach employs a Bayesian inference mechanism to quantify query ambiguity and guide LLMs in determining when and how to request clarification from users within a multi-turn dialogue framework. We further develop a two-agent interaction framework where an LLM-based user simulator enables iterative refinement of logical forms through simulated user feedback. Experimental results on the WebQSP and CWQ datasets demonstrate that our method significantly improves performance by effectively resolving semantic ambiguities. Additionally, we contribute a refined dataset of disambiguated queries, derived from interaction histories, to facilitate future research in this direction.

Submitted 13 April, 2025; originally announced April 2025.

Comments: This work has been accepted by the IJCNN 2025 main track

arXiv:2504.07369 [pdf]

Ultrahigh room-temperature hole conductivity in a perovskite cuprate with vanishing electron-correlation

Authors: Meng Wang, Jianbing Zhang, Liang Si, Sijie Wu, Caiyong Li, Wenfeng Wu, Xiaodong Zhang, Cong Li, Lu Wang, Fachao Li, Lingzhi Wen, Yang Liu, Jinling Zhou, Masahiro Sawada, Nianpeng Lu, Qing He, Peng Gao, Tian Liang, Shuyun Zhou, Yeliang Wang, Fumitaka Kagawa, Pu Yu

Abstract: Electron-correlated two-dimensional (2D) cuprates have been extensively studied since the discovery of high-Tc superconductivity; in contrast, the three-dimensional (3D) counterpart perovskite cuprates remain largely unexplored due to their chemical instability and synthesis challenges. Herein, we develop an efficient two-step approach that combines symmetry-selective growth and topotactic oxidization to synthesize high-quality perovskite LaCuO3 films, and furthermore reveal its exotic electronic states. The compressively strained LaCuO3 films exhibit an unexpected ultrahigh p-type conductivity of ~1.5 × 10^5 S/cm with a hole mobility of ~30 cm^2 V^-1 s^-1 at room temperature. X-ray absorption spectra and first-principles calculations unveil a ligand-hole state of p-d hybridization with degenerate e_g orbitals and light effective mass, indicating nearly vanishing electron correlation. These features contrast sharply with 2D cuprates and offer physical insights into the design of high-performance electronic devices.

Submitted 9 April, 2025; originally announced April 2025.

Comments: 5 figures

arXiv:2504.07089 [pdf, ps, other]

OmniCaptioner: One Captioner to Rule Them All

Authors: Yiting Lu, Jiakang Yuan, Zhen Li, Shitian Zhao, Qi Qin, Xinyue Li, Le Zhuo, Licheng Wen, Dongyang Liu, Yuewen Cao, Xiangchao Yan, Xin Li, Tianshuo Peng, Shufei Zhang, Botian Shi, Tao Chen, Zhibo Chen, Lei Bai, Peng Gao, Bo Zhang

Abstract: We propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains. Unlike prior methods limited to specific image types (e.g., natural images or geometric visuals), our framework provides a unified solution for captioning natural images, visual text (e.g., posters, UIs, textbooks), and structured visuals (e.g., documents, tables, charts). By converting low-level pixel information into semantically rich textual representations, our framework bridges the gap between visual and textual modalities. Our results highlight three key advantages: (i) Enhanced Visual Reasoning with LLMs, where long-context captions of visual modalities empower LLMs, particularly the DeepSeek-R1 series, to reason effectively in multimodal scenarios; (ii) Improved Image Generation, where detailed captions improve tasks like text-to-image generation and image transformation; and (iii) Efficient Supervised Fine-Tuning (SFT), which enables faster convergence with less data. We believe the versatility and adaptability of OmniCaptioner can offer a new perspective for bridging the gap between language and visual modalities.

Submitted 2 June, 2025; v1 submitted 9 April, 2025; originally announced April 2025.

Comments: More visualizations on Homepage: https://alpha-innovator.github.io/OmniCaptioner-project-page and Official code: https://github.com/Alpha-Innovator/OmniCaptioner

arXiv:2504.03294 [pdf, ps, other]

Relativistic dynamics of charmonia in strong magnetic fields

Authors: Liuyuan Wen, Meijian Li, Yiyu Zhou, Yang Li, James P. Vary

Abstract: We investigate the properties of charmonium systems in strong external magnetic fields using a relativistic light-front Hamiltonian approach within the Basis Light-Front Quantization (BLFQ) framework. By solving the eigenvalue problem for the invariant mass squared operator with confinement potentials and one-gluon-exchange interactions, we obtain the mass spectrum and wave functions under varying magnetic fields. Our results reveal significant spectral modifications via the Zeeman effect, including $η_c$-$J/ψ$ mixing and magnetic sublevel splitting. Momentum density analysis demonstrates wave function deformation, with transverse momentum broadening and longitudinal narrowing under strong fields, alongside structural shifts in parton distributions such as double-hump profiles in excited states. Relativistic corrections and center-of-mass coupling critically drive these dynamics, highlighting the necessity of a relativistic framework for QCD bound states in extreme magnetic environments.

Submitted 18 June, 2025; v1 submitted 4 April, 2025; originally announced April 2025.

Comments: 24 pages, 9 figures. Added the derivation of the quantum many-body Hamiltonian from the minimally coupled Lagrangian in Appendix A. To appear in Phys. Rev. D

arXiv:2504.03151 [pdf, other]

Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1)

Authors: Jing Bi, Susan Liang, Xiaofei Zhou, Pinxin Liu, Junjia Guo, Yunlong Tang, Luchuan Song, Chao Huang, Guangyu Sun, Jinxi He, Jiarui Wu, Shu Yang, Daoan Zhang, Chen Chen, Lianggong Bruce Wen, Zhang Liu, Jiebo Luo, Chenliang Xu

Abstract: Reasoning is central to human intelligence, enabling structured problem-solving across diverse tasks. Recent advances in large language models (LLMs) have greatly enhanced their reasoning abilities in arithmetic, commonsense, and symbolic domains. However, effectively extending these capabilities into multimodal contexts, where models must integrate both visual and textual inputs, continues to be a significant challenge. Multimodal reasoning introduces complexities, such as handling conflicting information across modalities, which require models to adopt advanced interpretative strategies. Addressing these challenges involves not only sophisticated algorithms but also robust methodologies for evaluating reasoning accuracy and coherence. This paper offers a concise yet insightful overview of reasoning techniques in both textual and multimodal LLMs. Through a thorough and up-to-date comparison, we clearly formulate core reasoning challenges and opportunities, highlighting practical methods for post-training optimization and test-time inference. Our work provides valuable insights and guidance, bridging theoretical frameworks and practical implementations, and sets clear directions for future research.

Submitted 4 April, 2025; originally announced April 2025.

arXiv:2504.00679 [pdf, other]

QUEST: A Quantized Energy-Aware SNN Training Framework for Multi-State Neuromorphic Devices

Authors: Sai Li, Linliang Chen, Yihao Zhang, Zhongkui Zhang, Ao Du, Biao Pan, Zhaohao Wang, Lianggong Wen, Weisheng Zhao

Abstract: Neuromorphic devices, leveraging novel physical phenomena, offer a promising path toward energy-efficient hardware beyond CMOS technology by emulating brain-inspired computation. However, their progress is often limited to proof-of-concept studies due to the lack of flexible spiking neural network (SNN) algorithm frameworks tailored to device-specific characteristics, posing a significant challenge to scalability and practical deployment. To address this, we propose QUEST, a unified co-design framework that directly trains SNN for emerging devices featuring multilevel resistances. With Skyrmionic Magnetic Tunnel Junction (Sk-MTJ) as a case study, experimental results on the CIFAR-10 dataset demonstrate the framework's ability to enable scalable on-device SNN training with minimal energy consumption during both feedforward and backpropagation. By introducing device mapping pattern and activation operation sparsity, QUEST achieves effective trade-offs among high accuracy (89.6%), low bit precision (2-bit), and energy efficiency (93 times improvement over the ANNs). QUEST offers practical design guidelines for both the device and algorithm communities, providing insights to build energy-efficient and large-scale neuromorphic systems.

Submitted 1 April, 2025; originally announced April 2025.

arXiv:2503.22587 [pdf, other]

LLM-enabled Instance Model Generation

Authors: Fengjunjie Pan, Nenad Petrovic, Vahid Zolfaghari, Long Wen, Alois Knoll

Abstract: In the domain of model-based engineering, models are essential components that enable system design and analysis. Traditionally, the creation of these models has been a manual process requiring not only deep modeling expertise but also substantial domain knowledge of target systems. With the rapid advancement of generative artificial intelligence, large language models (LLMs) show potential for automating model generation. This work explores the generation of instance models using LLMs, focusing specifically on producing XMI-based instance models from Ecore metamodels and natural language specifications. We observe that current LLMs struggle to directly generate valid XMI models. To address this, we propose a two-step approach: first, using LLMs to produce a simplified structured output containing all necessary instance model information, namely a conceptual instance model, and then compiling this intermediate representation into a valid XMI file. The conceptual instance model is format-independent, allowing it to be transformed into various modeling formats via different compilers. The feasibility of the proposed method has been demonstrated using several LLMs, including GPT-4o, o1-preview, Llama 3.1 (8B and 70B). Results show that the proposed method significantly improves the usability of LLMs for instance model generation tasks. Notably, the smaller open-source model, Llama 3.1 70B, demonstrated performance comparable to proprietary GPT models within the proposed framework.

Submitted 28 March, 2025; originally announced March 2025.

arXiv:2503.21699 [pdf, other]

MAVERIX: Multimodal Audio-Visual Evaluation Reasoning IndeX

Authors: Liuyue Xie, George Z. Wei, Avik Kuthiala, Ce Zheng, Ananya Bal, Mosam Dabhi, Liting Wen, Taru Rustagi, Ethan Lai, Sushil Khyalia, Rohan Choudhury, Morteza Ziyadi, Xu Zhang, Hao Yang, László A. Jeni

Abstract: Frontier models have either been language-only or have primarily focused on vision and language modalities. Although recent advancements in models with vision and audio understanding capabilities have shown substantial progress, the field lacks a standardized evaluation framework for thoroughly assessing their cross-modality perception performance. We introduce MAVERIX (Multimodal Audio-Visual Evaluation Reasoning IndeX), a novel benchmark with 700 videos and 2,556 questions explicitly designed to evaluate multimodal models through tasks that necessitate close integration of video and audio information. MAVERIX uniquely provides models with audiovisual tasks, closely mimicking the multimodal perceptual experiences available to humans during inference and decision-making processes. To our knowledge, MAVERIX is the first benchmark aimed explicitly at assessing comprehensive audiovisual integration. Experiments with state-of-the-art models, including Gemini 1.5 Pro and o1, show performance approaching human levels (around 70% accuracy), while human experts reach near-ceiling performance (95.1%). With standardized evaluation protocols, a rigorously annotated pipeline, and a public toolkit, MAVERIX establishes a challenging testbed for advancing audiovisual multimodal intelligence.

Submitted 27 March, 2025; originally announced March 2025.

arXiv:2503.21353 [pdf, ps, other]

Neutrino type identification for atmospheric neutrinos in a large homogeneous liquid scintillation detector

Authors: Jiaxi Liu, Fanrui Zeng, Hongyue Duyang, Wanlei Guo, Xinhai He, Teng Li, Zhen Liu, Wuming Luo, Wing Yan Ma, Xiaohan Tan, Liangjian Wen, Zekun Yang, Yongpeng Zhang

Abstract: Atmospheric neutrino oscillations are important to the study of neutrino properties, including the neutrino mass ordering problem. A good capability to identify neutrinos' flavor and neutrinos against antineutrinos is crucial in such measurements. In this paper, we present a machine-learning-based approach for identifying atmospheric neutrino events in a large homogeneous liquid scintillator detector. This method identifies features of PMT waveforms that reflect event topologies and uses them as input to machine learning models. In addition, neutron-capture information is utilized to achieve neutrino versus antineutrino discrimination. Preliminary performances based on Monte Carlo simulations are presented, which demonstrate such a detector's potential in future measurements of atmospheric neutrinos such as the one planned for the JUNO experiment.

Submitted 13 June, 2025; v1 submitted 27 March, 2025; originally announced March 2025.

arXiv:2503.13891 [pdf, other]

Where do Large Vision-Language Models Look at when Answering Questions?

Authors: Xiaoying Xing, Chia-Wen Kuo, Li Fuxin, Yulei Niu, Fan Chen, Ming Li, Ying Wu, Longyin Wen, Sijie Zhu

Abstract: Large Vision-Language Models (LVLMs) have shown promising performance in vision-language understanding and reasoning tasks. However, their visual understanding behaviors remain underexplored. A fundamental question arises: to what extent do LVLMs rely on visual input, and which image regions contribute to their responses? It is non-trivial to interpret the free-form generation of LVLMs due to their complicated visual architecture (e.g., multiple encoders and multi-resolution) and variable-length outputs. In this paper, we extend existing heatmap visualization methods (e.g., iGOS++) to support LVLMs for open-ended visual question answering. We propose a method to select visually relevant tokens that reflect the relevance between generated answers and input image. Furthermore, we conduct a comprehensive analysis of state-of-the-art LVLMs on benchmarks designed to require visual information to answer. Our findings offer several insights into LVLM behavior, including the relationship between focus region and answer correctness, differences in visual attention across architectures, and the impact of LLM scale on visual understanding. The code and data are available at https://github.com/bytedance/LVLM_Interpretation.

Submitted 18 March, 2025; originally announced March 2025.

arXiv:2503.11938 [pdf, ps, other]

doi 10.1103/mvqk-n377

The ${φNN,J/ψNN,η_c NN}$ systems based on HAL QCD interactions

Authors: Liang-Zhen Wen, Yao Ma, Lu Meng, Shi-Lin Zhu

Abstract: We investigate the existence of bound states and resonances in the ${φNN, J/ψNN, η_c NN}$ systems using HAL QCD interactions for ${φN, J/ψN}$, and ${η_c N}$. We employ the Gaussian expansion method to solve the complex-scaled Schrödinger equation and find no resonances or bound states in the ${J/ψNN}$ and ${η_c NN}$ systems. We estimate the interaction between charmonium and nuclei, concluding that the $J/ψ$ or $η_c$ is likely to bind with ${}^3\mathrm{H}$, ${}^3\mathrm{He}$, ${}^4\mathrm{He}$, and heavier nuclei. For the $φNN$ system, the lattice QCD $φN\left({ }^2 S_{1 / 2}\right)$ interaction is absent. We combine the $φp$ correlation function analysis and HAL QCD results in Model A. We assume the spin-spin interactions for $J/ψN$ and $φN$ systems are inversely proportional to their masses in Model B. Model A predicts a stronger $φN({}^2 S_{1/2})$ interaction and permits a two-body bound state, whereas Model B suggests the interaction is attractive but too weak to form a bound state. Both models predict bound states for the $I(J^P) = 0(0^-)$ and $0(1^-)$ $φNN$ systems. In Model A, these states are deeply bound with binding energies exceeding 15 MeV and remain existent when considering parameter uncertainties. In contrast, these states are very loosely bound in Model B, with binding energies below 1 MeV and an existent probability of about 60% when parameter uncertainties are considered. In both models, there exist very loosely bound $I(J^P) = 0(2^-)$ three-body states which resemble a $φ$-d atom with the $φ$ meson surrounding the deuteron, but their existences are sensitive to parameter uncertainties. No bound states or resonances are found in the isovector $I(J^P) = 1(1^-)$ $φNN$ system.

Submitted 2 July, 2025; v1 submitted 14 March, 2025; originally announced March 2025.

Comments: 16 pages, 9 figures. Comments are welcome

Journal ref: Phys. Rev. D 111, 114004 (2025)

arXiv:2503.10460 [pdf, other]

Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond

Authors: Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang

Abstract: This paper introduces Light-R1, an open-source suite for training long reasoning models using reproducible and cost-effective methodology. Given the proprietary nature of data used in the DeepSeek-R1 series, we develop an alternative approach leveraging exclusively public data and models. Our curriculum training progressively increases data difficulty, combined with multi-staged post-training. Our Light-R1-32B model, trained from Qwen2.5-32B-Instruct, outperforms DeepSeek-R1-Distill-Qwen-32B in math reasoning. Experimental results show that this curriculum approach becomes more effective when distinct, diverse datasets are available for different training stages: fine-tuning DeepSeek-R1-Distilled models (pre-tuned by DeepSeek team on proprietary data) with 3,000 challenging examples from our curriculum dataset yielded state-of-the-art 7B and 14B models, while the 32B model, Light-R1-32B-DS performed comparably to QwQ-32B and DeepSeek-R1. Furthermore, we extend our work by applying GRPO on long reasoning models. Our final Light-R1-14B-DS achieves SOTA performance among 14B models in math, with AIME24 & 25 scores of 74.0 and 60.2 respectively, surpassing many 32B models and DeepSeek-R1-Distill-Llama-70B. Despite math-focused training, Light-R1-14B-DS demonstrates strong cross-domain generalization. Light-R1 represents a significant advancement in making sophisticated reasoning models more accessible and implementable in real-world applications. Our models, training data and code have been made available at https://github.com/Qihoo360/Light-R1.

Submitted 28 May, 2025; v1 submitted 13 March, 2025; originally announced March 2025.

Comments: v4: ACL'25 industry track camera ready; v3: minor modifications; v2: better writing & format for later submission; all release at https://github.com/Qihoo360/Light-R1

arXiv:2503.05180 [pdf, other]

Safety-Critical Traffic Simulation with Adversarial Transfer of Driving Intentions

Authors: Zherui Huang, Xing Gao, Guanjie Zheng, Licheng Wen, Xuemeng Yang, Xiao Sun

Abstract: Traffic simulation, complementing real-world data with a long-tail distribution, allows for effective evaluation and enhancement of the ability of autonomous vehicles to handle accident-prone scenarios. Simulating such safety-critical scenarios is nontrivial, however, from log data that are typically regular scenarios, especially in consideration of dynamic adversarial interactions between the future motions of autonomous vehicles and surrounding traffic participants. To address it, this paper proposes an innovative and efficient strategy, termed IntSim, that explicitly decouples the driving intentions of surrounding actors from their motion planning for realistic and efficient safety-critical simulation. We formulate the adversarial transfer of driving intention as an optimization problem, facilitating extensive exploration of diverse attack behaviors and efficient solution convergence. Simultaneously, intention-conditioned motion planning benefits from powerful deep models and large-scale real-world data, permitting the simulation of realistic motion behaviors for actors. Specially, through adapting driving intentions based on environments, IntSim facilitates the flexible realization of dynamic adversarial interactions with autonomous vehicles. Finally, extensive open-loop and closed-loop experiments on real-world datasets, including nuScenes and Waymo, demonstrate that the proposed IntSim achieves state-of-the-art performance in simulating realistic safety-critical scenarios and further improves planners in handling such scenarios.

Submitted 7 March, 2025; originally announced March 2025.

Comments: Accepted by ICRA 2025

arXiv:2503.04636 [pdf, other]

Mark Your LLM: Detecting the Misuse of Open-Source Large Language Models via Watermarking

Authors: Yijie Xu, Aiwei Liu, Xuming Hu, Lijie Wen, Hui Xiong

Abstract: As open-source large language models (LLMs) like Llama3 become more capable, it is crucial to develop watermarking techniques to detect their potential misuse. Existing watermarking methods either add watermarks during LLM inference, which is unsuitable for open-source LLMs, or primarily target classification LLMs rather than recent generative LLMs. Adapting these watermarks to open-source LLMs for misuse detection remains an open challenge. This work defines two misuse scenarios for open-source LLMs: intellectual property (IP) violation and LLM Usage Violation. Then, we explore the application of inference-time watermark distillation and backdoor watermarking in these contexts. We propose comprehensive evaluation methods to assess the impact of various real-world further fine-tuning scenarios on watermarks and the effect of these watermarks on LLM performance. Our experiments reveal that backdoor watermarking could effectively detect IP Violation, while inference-time watermark distillation is applicable in both scenarios but less robust to further fine-tuning and has a more significant impact on LLM performance compared to backdoor watermarking. Exploring more advanced watermarking methods for open-source LLMs to detect their misuse should be an important future direction.

Submitted 15 March, 2025; v1 submitted 6 March, 2025; originally announced March 2025.

Comments: Accepted by the ICLR 2025 Workshop on GenAI Watermarking

arXiv:2503.00968 [pdf, other]

Simulation of the Background from $^{13}$C$(α, n)^{16}$O Reaction in the JUNO Scintillator

Authors: JUNO Collaboration, Thomas Adam, Kai Adamowicz, Shakeel Ahmad, Rizwan Ahmed, Sebastiano Aiello, Fengpeng An, Costas Andreopoulos, Giuseppe Andronico, Nikolay Anfimov, Vito Antonelli, Tatiana Antoshkina, João Pedro Athayde Marcondes de André, Didier Auguste, Weidong Bai, Nikita Balashov, Andrea Barresi, Davide Basilico, Eric Baussan, Marco Beretta, Antonio Bergnoli, Nikita Bessonov, Daniel Bick, Lukas Bieger, Svetlana Biktemerova, et al. (608 additional authors not shown)

Abstract: Large-scale organic liquid scintillator detectors are highly efficient in the detection of MeV-scale electron antineutrinos. These signal events can be detected through inverse beta decay on protons, which produce a positron accompanied by a neutron. A noteworthy background for antineutrinos coming from nuclear power reactors and from the depths of the Earth (geoneutrinos) is generated by ($α, n$) reactions. In organic liquid scintillator detectors, $α$ particles emitted from intrinsic contaminants such as $^{238}$U, $^{232}$Th, and $^{210}$Pb/$^{210}$Po, can be captured on $^{13}$C nuclei, followed by the emission of a MeV-scale neutron. Three distinct interaction mechanisms can produce prompt energy depositions preceding the delayed neutron capture, leading to a pair of events correlated in space and time within the detector. Thus, ($α, n$) reactions represent an indistinguishable background in liquid scintillator-based antineutrino detectors, where their expected rate and energy spectrum are typically evaluated via Monte Carlo simulations. This work presents results from the open-source SaG4n software, used to calculate the expected energy depositions from the neutron and any associated de-excitation products. Also simulated is a detailed detector response to these interactions, using a dedicated Geant4-based simulation software from the JUNO experiment.
An expected measurable $^{13}$C$(α, n)^{16}$O event rate and reconstructed prompt energy spectrum with associated uncertainties, are presented in the context of JUNO, however, the methods and results are applicable and relevant to other organic liquid scintillator neutrino detectors.

Submitted 2 May, 2025; v1 submitted 2 March, 2025; originally announced March 2025.

Comments: 25 pages, 14 figures, 4 tables

arXiv:2502.17852 [pdf, other]

Sketch-1-to-3: One Single Sketch to 3D Detailed Face Reconstruction

Authors: Liting Wen, Zimo Yang, Xianlin Zhang, Chi Ding, Yue Zhang, Mingdao Wang, Xueming Li

Abstract: 3D face reconstruction from a single sketch is a critical yet underexplored task with significant practical applications. The primary challenges stem from the substantial modality gap between 2D sketches and 3D facial structures, including: (1) accurately extracting facial keypoints from 2D sketches; (2) preserving diverse facial expressions and fine-grained texture details; and (3) training a high-performing model with limited data. In this paper, we propose Sketch-1-to-3, a novel framework for realistic 3D face reconstruction from a single sketch, to address these challenges. Specifically, we first introduce the Geometric Contour and Texture Detail (GCTD) module, which enhances the extraction of geometric contours and texture details from facial sketches. Additionally, we design a deep learning architecture with a domain adaptation module and a tailored loss function to align sketches with the 3D facial space, enabling high-fidelity expression and texture reconstruction. To facilitate evaluation and further research, we construct SketchFaces, a real hand-drawn facial sketch dataset, and Syn-SketchFaces, a synthetic facial sketch dataset. Extensive experiments demonstrate that Sketch-1-to-3 achieves state-of-the-art performance in sketch-based 3D face reconstruction.

Submitted 24 February, 2025; originally announced February 2025.

arXiv:2502.13367 [pdf, ps, other]

Asymptotic Freedom of Two Heavy Impurities in a Bose-Einstein Condensate

Authors: Dong-Chen Zheng, Lin Wen, Renyuan Liao

Abstract: We consider two heavy impurities immersed in a Bose-Einstein condensate, and calculate the self-energy using the Wilsonian renormalization. The polaron energy, quasiparticle residue and damping rate are extracted from the self-energy. We demonstrate that various effective potentials emerge from the polaron energy under the specific conditions. In the limit of large separation between the impurities, the polaron spectrum converges to the results for a single impurity, exhibiting an attractive-repulsive crossover across the Feshbach resonance. The boundary of this crossover is identified through the analysis of the damping rate. We highlight that repulsive-dominant polarons can exist as long as the impurities are sufficiently close, even when the impurity-boson interactions are attractive. Additionally, we observe that the two impurities become asymptotically free in the repulsive polaron regime. These results are verifiable and offer a fresh perspective on the interaction dynamics between two polarons.

Submitted 3 March, 2025; v1 submitted 18 February, 2025; originally announced February 2025.

Comments: 7 pages, 5 figures

arXiv:2502.11598 [pdf, other]

Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation?

Authors: Leyi Pan, Aiwei Liu, Shiyu Huang, Yijian Lu, Xuming Hu, Lijie Wen, Irwin King, Philip S. Yu

Abstract: The radioactive nature of Large Language Model (LLM) watermarking enables the detection of watermarks inherited by student models when trained on the outputs of watermarked teacher models, making it a promising tool for preventing unauthorized knowledge distillation. However, the robustness of watermark radioactivity against adversarial actors remains largely unexplored. In this paper, we investigate whether student models can acquire the capabilities of teacher models through knowledge distillation while avoiding watermark inheritance. We propose two categories of watermark removal approaches: pre-distillation removal through untargeted and targeted training data paraphrasing (UP and TP), and post-distillation removal through inference-time watermark neutralization (WN). Extensive experiments across multiple model pairs, watermarking schemes and hyper-parameter settings demonstrate that both TP and WN thoroughly eliminate inherited watermarks, with WN achieving this while maintaining knowledge transfer efficiency and low computational overhead. Given the ongoing deployment of watermarking techniques in production LLMs, these findings emphasize the urgent need for more robust defense strategies. Our code is available at https://github.com/THU-BPM/Watermark-Radioactivity-Attack.

Submitted 24 May, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

Comments: Accepted by ACL 2025 (Main)

MSC Class: 68T50 ACM Class: I.2.7

arXiv:2502.09269 [pdf, other]

Memory-based Ensemble Learning in CMR Semantic Segmentation

Authors: Yiwei Liu, Ziyi Wu, Liang Zhong, Lingyi Wen, Yuankai Wu

Abstract: Existing models typically segment either the entire 3D frame or 2D slices independently to derive clinical functional metrics from ventricular segmentation in cardiac cine sequences. While performing well overall, they struggle at the end slices. To address this, we leverage spatial continuity to extract global uncertainty from segmentation variance and use it as memory in our ensemble learning method, Streaming, for classifier weighting, balancing overall and end-slice performance. Additionally, we introduce the End Coefficient (EC) to quantify end-slice accuracy. Experiments on ACDC and M&Ms datasets show that our framework achieves near-state-of-the-art Dice Similarity Coefficient (DSC) and outperforms all models on end-slice performance, improving patient-specific segmentation accuracy.

Submitted 17 February, 2025; v1 submitted 13 February, 2025; originally announced February 2025.

arXiv:2502.09170 [pdf, other]

LimSim Series: An Autonomous Driving Simulation Platform for Validation and Enhancement

Authors: Daocheng Fu, Naiting Zhong, Xu Han, Pinlong Cai, Licheng Wen, Song Mao, Botian Shi, Yu Qiao

Abstract: Closed-loop simulation environments play a crucial role in the validation and enhancement of autonomous driving systems (ADS). However, certain challenges warrant significant attention, including balancing simulation accuracy with duration, reconciling functionality with practicality, and establishing comprehensive evaluation mechanisms. This paper addresses these challenges by introducing the LimSim Series, a comprehensive simulation platform designed to support the rapid deployment and efficient iteration of ADS. The LimSim Series integrates multi-type information from road networks, employs human-like decision-making and planning algorithms for background vehicles, and introduces the concept of the Area of Interest (AoI) to optimize computational resources. The platform offers a variety of baseline algorithms and user-friendly interfaces, facilitating flexible validation of multiple technical pipelines. Additionally, the LimSim Series incorporates multi-dimensional evaluation metrics, delivering thorough insights into system performance, thus enabling researchers to promptly identify issues for further improvements. Experiments demonstrate that the LimSim Series is compatible with modular, end-to-end, and VLM-based knowledge-driven systems. It can assist in the iteration and updating of ADS by evaluating performance across various scenarios. The code of the LimSim Series is released at: https://github.com/PJLab-ADG/LimSim.

Submitted 13 February, 2025; originally announced February 2025.

arXiv:2502.06950 [pdf, other]

Cryoscope: A Cryogenic Infrared Survey Telescope in Antarctica

Authors: Mansi M. Kasliwal, Nicholas Earley, Roger Smith, Tristan Guillot, Tony Travouillon, Jason Fucik, Lyu Abe, Timothee Greffe, Abdelkrim Agabi, Michael C. B. Ashley, Amaury H. M. J. Triaud, Samaporn Tinyanont, Sarah Antier, Philippe Bendjoya, Rohan Bhattarai, Rob Bertz, James Brugger, Artem Burdanov, Ilaria Caiazzo, Benoit Carry, Luca Casagrande, Brad Cenko, Jeff Cooke, Kishalay De, Richard Dekany, et al. (36 additional authors not shown)

Abstract: We present Cryoscope--a new 50 deg$^2$ field-of-view, 1.2 m aperture, $K_{dark}$ survey telescope to be located at Dome C, Antarctica. Cryoscope has an innovative optical-thermal design wherein the entire telescope is cryogenically cooled. Cryoscope also explores new detector technology to cost-effectively tile the full focal plane. Leveraging the dark Antarctic sky and minimizing telescope thermal emission, Cryoscope achieves unprecedented deep, wide, fast and red observations, matching and exceeding volumetric survey speeds from the Ultraviolet Explorer, Vera Rubin Observatory, Nancy Grace Roman Space Telescope, SPHEREx, and NEO Surveyor. By providing coverage beyond wavelengths of 2 $μ$m, we aim to create the most comprehensive dynamic movie of the most obscured reaches of the Universe. Cryoscope will be a dedicated discovery engine for electromagnetic emission from coalescing compact binaries, Earth-like exoplanets orbiting cold stars, and multiple facets of time-domain, stellar and solar system science. In this paper, we describe the scientific drivers and technical innovations for this new discovery engine operating in the $K_{dark}$ passband, why we choose to deploy it in Antarctica, and the status of a fifth-scale prototype designed as a Pathfinder to retire technological risks prior to full-scale implementation. We plan to deploy the Cryoscope Pathfinder to Dome C in December 2026 and the full-scale telescope by 2030.

Submitted 21 March, 2025; v1 submitted 10 February, 2025; originally announced February 2025.

Comments: 40 pages, 19 figures, 4 tables; accepted for publication in PASP on 2025-03-21

arXiv:2502.01906 [pdf, other]

Rethinking Homogeneity of Vision and Text Tokens in Large Vision-and-Language Models

Authors: Chia-Wen Kuo, Sijie Zhu, Fan Chen, Xiaohui Shen, Longyin Wen

Abstract: Large vision-and-language models (LVLMs) typically treat visual and textual embeddings as homogeneous inputs to a large language model (LLM). However, these inputs are inherently different: visual inputs are multi-dimensional and contextually rich, often pre-encoded by models like CLIP, while textual inputs lack this structure. In this paper, we propose Decomposed Attention (D-Attn), a novel method that processes visual and textual embeddings differently by decomposing the 1-D causal self-attention in LVLMs. After the attention decomposition, D-Attn diagonalizes visual-to-visual self-attention, reducing computation from $\mathcal{O}(|V|^2)$ to $\mathcal{O}(|V|)$ for $|V|$ visual embeddings without compromising performance. Moreover, D-Attn debiases positional encodings in textual-to-visual cross-attention, further enhancing visual understanding. Finally, we introduce an $α$-weighting strategy to merge visual and textual information, maximally preserving the pre-trained LLM's capabilities with minimal modifications. Extensive experiments and rigorous analyses validate the effectiveness of D-Attn, demonstrating significant improvements on multiple image benchmarks while significantly reducing computational costs. Code, data, and models will be publicly available.

Submitted 3 February, 2025; originally announced February 2025.

arXiv:2502.01141 [pdf, other]

Beyond Yes or No: Predictive Compliance Monitoring Approaches for Quantifying the Magnitude of Compliance Violations

Authors: Qian Chen, Stefanie Rinderle-Ma, Lijie Wen

Abstract: Most existing process compliance monitoring approaches detect compliance violations in an ex post manner. Only predicate prediction focuses on predicting them. However, predicate prediction provides a binary yes/no notion of compliance, lacking the ability to measure to which extent an ongoing process instance deviates from the desired state as specified in constraints. Here, being able to quantify the magnitude of violation would provide organizations with deeper insights into their operational performance, enabling informed decision making to reduce or mitigate the risk of non-compliance. Thus, we propose two predictive compliance monitoring approaches to close this research gap. The first approach reformulates the binary classification problem as a hybrid task that considers both classification and regression, while the second employs a multi-task learning method to explicitly predict the compliance status and the magnitude of violation for deviant cases simultaneously. In this work, we focus on temporal constraints as they are significant in almost any application domain, e.g., health care. The evaluation on synthetic and real-world event logs demonstrates that our approaches are capable of quantifying the magnitude of violations while maintaining comparable performance for compliance predictions achieved by state-of-the-art approaches.

Submitted 3 February, 2025; originally announced February 2025.

arXiv:2501.12583 [pdf, other]

Chasing price drains liquidity

Authors: Yizhou Cao, Yepeng Ding, Ruichao Jiang, Long Wen

Abstract: Assuming that the price in a Uniswap v3 style Automated Market Maker (AMM) follows a Geometric Brownian Motion (GBM), we prove that the strategy that adjusts the position of liquidity to track the current price leads to a deterministic and exponentially fast decay of liquidity. Next, assuming that there is a Centralized Exchange (CEX), in which the price follows a GBM and the AMM price mean reverts to the CEX price, we show numerically that the same strategy still leads to decay. Last, we propose a strategy that increases the liquidity even without compounding fees earned through liquidity provision.

Submitted 21 January, 2025; originally announced January 2025.

arXiv:2501.08168 [pdf, other]

LeapVAD: A Leap in Autonomous Driving via Cognitive Perception and Dual-Process Thinking

Authors: Yukai Ma, Tiantian Wei, Naiting Zhong, Jianbiao Mei, Tao Hu, Licheng Wen, Xuemeng Yang, Botian Shi, Yong Liu

Abstract: While autonomous driving technology has made remarkable strides, data-driven approaches still struggle with complex scenarios due to their limited reasoning capabilities. Meanwhile, knowledge-driven autonomous driving systems have evolved considerably with the popularization of visual language models. In this paper, we propose LeapVAD, a novel method based on cognitive perception and dual-process thinking. Our approach implements a human-attentional mechanism to identify and focus on critical traffic elements that influence driving decisions. By characterizing these objects through comprehensive attributes - including appearance, motion patterns, and associated risks - LeapVAD achieves more effective environmental representation and streamlines the decision-making process. Furthermore, LeapVAD incorporates an innovative dual-process decision-making module miming the human-driving learning process. The system consists of an Analytic Process (System-II) that accumulates driving experience through logical reasoning and a Heuristic Process (System-I) that refines this knowledge via fine-tuning and few-shot learning. LeapVAD also includes reflective mechanisms and a growing memory bank, enabling it to learn from past mistakes and continuously improve its performance in a closed-loop environment. To enhance efficiency, we develop a scene encoder network that generates compact scene representations for rapid retrieval of relevant driving experiences.
Extensive evaluations conducted on two leading autonomous driving simulators, CARLA and DriveArena, demonstrate that LeapVAD achieves superior performance compared to camera-only approaches despite limited training data. Comprehensive ablation studies further emphasize its effectiveness in continuous learning and domain adaptation. Project page: https://pjlab-adg.github.io/LeapVAD/.

Submitted 14 January, 2025; originally announced January 2025.

arXiv:2501.06555 [pdf, ps, other]

Chiral supersolid and dissipative time crystal in Rydberg-dressed Bose-Einstein condensates with Raman-induced spin-orbit coupling

Authors: Xianghua Su, Xiping Fu, Yang He, Ying Shang, Kaiyuan Ji, Linghua Wen

Abstract: Spin-orbit coupling (SOC) is one of the key factors that affect the chiral symmetry of matter by causing the spatial symmetry breaking of the system. We find that Raman-induced SOC can induce a chiral supersolid phase with a helical antiskyrmion lattice in balanced Rydberg-dressed two-component Bose-Einstein condensates (BECs) in a harmonic trap by modulating the Raman coupling strength, strong contrast with the mirror symmetric supersolid phase containing skyrmion-antiskyrmion lattice pair for the case of Rashba SOC. Two ground-state phase diagrams are presented as a function of the Rydberg interaction strength and the SOC strength, as well as that of the Rydberg interaction strength and the Raman coupling strength, respectively. It is shown that the interplay among Raman-induced SOC, soft-core long-range Rydberg interactions, and contact interactions favors rich ground-state structures including half-quantum vortex phase, stripe supersolid phase, toroidal stripe phase with a central Anderson-Toulouse coreless vortex, checkerboard supersolid phase, mirror symmetric supersolid phase, chiral supersolid phase and standing-wave supersolid phase. In addition, the effects of rotation and in-plane quadrupole magnetic field on the ground state of the system are analyzed. In these two cases, the chiral supersolid phase is broken and the ground state tends to form a miscible phase.
Furthermore, the stability and superfluid properties of the two-component BECs with Raman-induced SOC and Rydberg interactions in free space are revealed by solving the Bogoliubov-de Gennes equation. Finally, we demonstrate that when the initial state is a chiral supersolid phase the rotating harmonic trapped system sustains dissipative continuous time crystal by studying the rotational dynamic behaviors of the system.

Submitted 11 January, 2025; originally announced January 2025.

Comments: 13 pages, 5 figures

arXiv:2501.03580 [pdf]

BASIC: Semi-supervised Multi-organ Segmentation with Balanced Subclass Regularization and Semantic-conflict Penalty

Authors: Zhenghao Feng, Lu Wen, Yuanyuan Xu, Binyu Yan, Xi Wu, Jiliu Zhou, Yan Wang

Abstract: Semi-supervised learning (SSL) has shown notable potential in relieving the heavy demand of dense prediction tasks on large-scale well-annotated datasets, especially for the challenging multi-organ segmentation (MoS). However, the prevailing class-imbalance problem in MoS caused by the substantial variations in organ size exacerbates the learning difficulty of the SSL network. To address this issue, in this paper, we propose an innovative semi-supervised network with BAlanced Subclass regularIzation and semantic-Conflict penalty mechanism (BASIC) to effectively learn the unbiased knowledge for semi-supervised MoS. Concretely, we construct a novel auxiliary subclass segmentation (SCS) task based on priorly generated balanced subclasses, thus deeply excavating the unbiased information for the main MoS task with the fashion of multi-task learning. Additionally, based on a mean teacher framework, we elaborately design a balanced subclass regularization to utilize the teacher predictions of SCS task to supervise the student predictions of MoS task, thus effectively transferring unbiased knowledge to the MoS subnetwork and alleviating the influence of the class-imbalance problem. Considering the similar semantic information inside the subclasses and their corresponding original classes (i.e., parent classes), we devise a semantic-conflict penalty mechanism to give heavier punishments to the conflicting SCS predictions with wrong parent classes and provide a more accurate constraint to the MoS predictions.
Extensive experiments conducted on two publicly available datasets, i.e., the WORD dataset and the MICCAI FLARE 2022 dataset, have verified the superior performance of our proposed BASIC compared to other state-of-the-art methods.

Submitted 7 January, 2025; originally announced January 2025.

arXiv:2501.01495 [pdf, other]

doi 10.3847/1538-4357/adb3a0

Search for continuous gravitational waves from known pulsars in the first part of the fourth LIGO-Virgo-KAGRA observing run

Authors: The LIGO Scientific Collaboration, the Virgo Collaboration, the KAGRA Collaboration, A. G. Abac, R. Abbott, I. Abouelfettouh, F. Acernese, K. Ackley, S. Adhicary, N. Adhikari, R. X. Adhikari, V. K. Adkins, D. Agarwal, M. Agathos, M. Aghaei Abchouyeh, O. D. Aguiar, I. Aguilar, L. Aiello, A. Ain, P. Ajith, T. Akutsu, S. Albanesi, R. A. Alfaidi, A. Al-Jodah, C. Alléné, et al. (1794 additional authors not shown)

Abstract: Continuous gravitational waves (CWs) emission from neutron stars carries information about their internal structure and equation of state, and it can provide tests of General Relativity. We present a search for CWs from a set of 45 known pulsars in the first part of the fourth LIGO-Virgo-KAGRA observing run, known as O4a. We conducted a targeted search for each pulsar using three independent analysis methods considering the single-harmonic and the dual-harmonic emission models. We find no evidence of a CW signal in O4a data for both models and set upper limits on the signal amplitude and on the ellipticity, which quantifies the asymmetry in the neutron star mass distribution. For the single-harmonic emission model, 29 targets have the upper limit on the amplitude below the theoretical spin-down limit. The lowest upper limit on the amplitude is $6.4\!\times\!10^{-27}$ for the young energetic pulsar J0537-6910, while the lowest constraint on the ellipticity is $8.8\!\times\!10^{-9}$ for the bright nearby millisecond pulsar J0437-4715. Additionally, for a subset of 16 targets we performed a narrowband search that is more robust regarding the emission model, with no evidence of a signal. We also found no evidence of non-standard polarizations as predicted by the Brans-Dicke theory.

Submitted 2 January, 2025; originally announced January 2025.

Comments: main paper: 12 pages, 6 figures, 4 tables

Report number: LIGO-P2400315

Journal ref: Astrophys.J. 983 (2025) 2, 99

arXiv:2501.00929 [pdf]

doi 10.1126/sciadv.adq7445

Gradient polaritonic surface with space-variant switchable light-matter interactions in 2D moire superlattices

Authors: Zhen-Bing Dai, Hua Fan, Vyacheslav Semenenko, Xinyu Lv, Lu Wen, Zhen Zhang, Shijie Fang, Vasili Perebeinos, Yue Zhao, Zhiqiang Li

Abstract: Polaritons in two-dimensional (2D) materials provide unique opportunities for controlling light at nanoscales. Tailoring these polaritons via gradient polaritonic surfaces with space-variant response can enable versatile light-matter interaction platforms with advanced functionalities. However, experimental progress has been hampered by the optical losses and poor light confinement of conventionally used artificial nanostructures. Here, we demonstrate natural gradient polaritonic surfaces based on superlattices of solitons (localized structural deformations) in a prototypical moire system, twisted bilayer graphene on boron nitride. We demonstrate on-off switching and continuous modulation of local polariton-soliton interactions, which results from marked modifications of topological and conventional soliton states through variation of local strain direction. Furthermore, we reveal the capability of these structures to spatially modify the near-field profile, phase, and propagation direction of polaritons in record-small footprints, enabling generation and electrical switching of directional polaritons. Our findings open up new avenues toward nanoscale manipulation of light-matter interactions and spatial polariton engineering through gradient moire superlattices.

Submitted 1 January, 2025; originally announced January 2025.

Comments: 18 pages, 4 figures

Journal ref: Science Advances 10, eadq7445 (2024)

arXiv:2501.00871 [pdf, other]

doi 10.1103/PhysRevD.111.073001

Trilepton and tetralepton bound and resonant states: the QED counterpart of multiquark states

Authors: Yao Ma, Lu Meng, Liang-Zhen Wen, Shi-Lin Zhu

Abstract: This work presents the first prediction of tetralepton resonant states containing muons, extending beyond the simplest tetralepton system, dipositronium ($\mathrm{Ps}_2$). With the rapid advancements in experimental facilities, the production and study of these intriguing states may be within reach. We perform a comprehensive analysis of S-wave trilepton and tetralepton systems within the framework of a QED Coulomb potential. We employ the Gaussian expansion method to solve the three- or four-body Schrödinger equation and utilize the complex scaling method to identify resonant states. We uncover a series of bound and resonant states in the trilepton systems $e^+e^+e^-$, $μ^+μ^+μ^-$, $e^+e^+μ^-$, and $μ^+μ^+e^-$, as well as the tetralepton systems $e^+e^+e^-e^-$, $μ^+μ^+μ^-μ^-$, and $μ^+μ^+e^-e^-$. The energies of these states range from $-30$ eV to $-1$ eV below the total mass of three or four leptons, with their widths varying from less than $0.01$ eV to approximately $0.07$ eV. Additionally, we calculate the spin configurations and root mean square radii of these states, providing insight into their spatial structures. No bound or resonant states are found in the trilepton $e^+μ^+e^-$, $μ^+e^+μ^-$ systems, nor in the tetralepton $μ^+e^+μ^-e^-$ system. A comparison with fully heavy tetraquark systems reveals that the additional color degree of freedom in QCD results in the absence of low-energy bound and resonant states. However, this extra degree of freedom allows for a broader range of $J^{PC}$ quantum numbers to produce resonant states, highlighting the rich complexity of QCD systems.

Submitted 4 April, 2025; v1 submitted 1 January, 2025; originally announced January 2025.

Comments: 14 pages, 11 figures. Comments are welcome

Journal ref: Phys. Rev. D 111, 073001 (2025)

arXiv:2501.00746 [pdf, other]

doi 10.1103/PhysRevLett.134.201802

Comprehensive Measurement of the Reactor Antineutrino Spectrum and Flux at Daya Bay

Authors: F. P. An, W. D. Bai, A. B. Balantekin, M. Bishai, S. Blyth, G. F. Cao, J. Cao, J. F. Chang, Y. Chang, H. S. Chen, H. Y. Chen, S. M. Chen, Y. Chen, Y. X. Chen, Z. Y. Chen, J. Cheng, J. Cheng, Y. -C. Cheng, Z. K. Cheng, J. J. Cherwinka, M. C. Chu, J. P. Cummings, O. Dalager, F. S. Deng, X. Y. Ding, et al. (177 additional authors not shown)

Abstract: This Letter reports the precise measurement of reactor antineutrino spectrum and flux based on the full data set of 4.7 million inverse-beta-decay (IBD) candidates collected at Daya Bay near detectors. Expressed in terms of the IBD yield per fission, the antineutrino spectra from all reactor fissile isotopes and the specific $\mathrm{^{235}U}$ and $\mathrm{^{239}Pu}$ isotopes are measured with 1.3$\%$, 3$\%$ and 8$\%$ uncertainties respectively near the 3 MeV spectrum peak in reconstructed energy, reaching the best precision in the world. The total antineutrino flux and isotopic $\mathrm{^{235}U}$ and $\mathrm{^{239}Pu}$ fluxes are precisely measured to be $5.84\pm0.07$, $6.16\pm0.12$ and $4.16\pm0.21$ in units of $10^{-43} \mathrm{cm^2/fission}$. These measurements are compared with the Huber-Mueller (HM) model, the reevaluated conversion model based on the Kurchatov Institute (KI) measurement and the latest Summation Model (SM2023). The Daya Bay flux shows good consistency with KI and SM2023 models, but disagrees with HM model. The Daya Bay spectrum, however, disagrees with all model predictions.

Submitted 22 May, 2025; v1 submitted 1 January, 2025; originally announced January 2025.

arXiv:2412.18108 [pdf, other]

Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach

Authors: Jing Bi, Junjia Guo, Yunlong Tang, Lianggong Bruce Wen, Zhang Liu, Chenliang Xu

Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated remarkable progress in visual understanding. This impressive leap raises a compelling question: how can language models, initially trained solely on linguistic data, effectively interpret and process visual content? This paper aims to address this question with systematic investigation across 4 model families and 4 model scales, uncovering a unique class of attention heads that focus specifically on visual content. Our analysis reveals a strong correlation between the behavior of these attention heads, the distribution of attention weights, and their concentration on visual tokens within the input. These findings enhance our understanding of how LLMs adapt to multimodal tasks, demonstrating their potential to bridge the gap between textual and visual understanding. This work paves the way for the development of AI systems capable of engaging with diverse modalities.

Submitted 23 December, 2024; originally announced December 2024.

Journal ref: CVPR 2025 (IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025)

Showing 1–50 of 898 results for author: Wen, L