-
Real-time Terrain Analysis for Off-road Autonomous Vehicles
Authors:
Edwina Lewis,
Aditya Parameshwaran,
Laura Redmond,
Yue Wang
Abstract:
This research addresses critical autonomous vehicle control challenges arising from road roughness variation, which induces course deviations and potential loss of road contact during steering operations. We present a novel real-time road roughness estimation system employing Bayesian calibration methodology that processes axle accelerations to predict terrain roughness with quantifiable confidence measures. The technical framework integrates a Gaussian process surrogate model with a simulated half-vehicle model, systematically processing vehicle velocity and road surface roughness parameters to generate corresponding axle acceleration responses. The Bayesian calibration routine performs inverse estimation of road roughness from observed accelerations and velocities, yielding posterior distributions that quantify prediction uncertainty for adaptive risk management. Training data generation utilizes Latin Hypercube sampling across comprehensive velocity and roughness parameter spaces, while the calibrated model integrates seamlessly with a Simplex controller architecture to dynamically adjust velocity limits based on real-time roughness predictions. Experimental validation on stochastically generated surfaces featuring varying roughness regions demonstrates robust real-time characterization capabilities, with the integrated Simplex control strategy effectively enhancing autonomous vehicle operational safety through proactive surface condition response. This innovative Bayesian framework establishes a comprehensive foundation for mitigating roughness-related operational risks while simultaneously improving efficiency and safety margins in autonomous vehicle systems.
Submitted 26 June, 2025;
originally announced June 2025.
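Note on the entry above: the abstract says training data are generated with Latin Hypercube sampling over velocity and roughness. The following is a minimal, standard-library sketch of that sampling idea only, not the authors' code. The function name `latin_hypercube`, the sample count, and the parameter ranges are illustrative assumptions that do not come from the paper.

```python
import random

def latin_hypercube(n_samples, bounds, seed=0):
    """Return n_samples points; every dimension is cut into n_samples
    equal-width strata and each stratum is used exactly once."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One uniform draw inside each stratum of the unit interval.
        strata = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # Shuffle so strata are paired randomly across dimensions.
        rng.shuffle(strata)
        columns.append([low + u * (high - low) for u in strata])
    # Transpose per-dimension columns into sample tuples.
    return list(zip(*columns))

# Hypothetical ranges: velocity in m/s and a unitless roughness index.
samples = latin_hypercube(8, [(1.0, 15.0), (0.0, 1.0)])
for velocity, roughness in samples:
    print(f"{velocity:6.2f}  {roughness:5.3f}")
```

Compared with plain uniform sampling, this guarantees that each one-dimensional range is covered evenly even with few samples, which is why it is a common choice for building surrogate-model training sets.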
-
Dynamic Focusing to Suppress Emittance Transfer in Crab-Crossing Flat Beam Collisions
Authors:
Derong Xu,
J Scott Berg,
Michael M Blaskiewicz,
Yue Hao,
Yun Luo,
Christoph Montag,
Sergei Nagaitsev,
Boris Podobedov,
Vadim Ptitsyn,
Ferdinand Willeke,
Binping Xiao
Abstract:
Flat hadron beam collisions, though expected to enhance peak luminosity by about an order of magnitude, have not yet been demonstrated. Our study reveals a critical limitation: realistic fluctuations, when amplified by synchro-betatron resonance, lead to transverse emittance transfer in flat-beam collisions. Using beam-beam simulations based on Electron-Ion Collider design parameters, we show that this effect leads to vertical emittance growth, which can distort the flat-beam profile and degrade luminosity. We propose a dynamic focusing scheme that combines sextupoles with crab cavities to suppress the hourglass-induced resonance. This approach increases tolerance to fluctuations and improves the robustness of flat-beam collisions. This practical mitigation facilitates the adoption of flat-beam collisions in next-generation lepton-hadron colliders.
Submitted 26 June, 2025;
originally announced June 2025.
-
STEP Planner: Constructing cross-hierarchical subgoal tree as an embodied long-horizon task planner
Authors:
Zhou Tianxing,
Wang Zhirui,
Ao Haojia,
Chen Guangyan,
Xing Boyang,
Cheng Jingwen,
Yang Yi,
Yue Yufeng
Abstract:
The ability to perform reliable long-horizon task planning is crucial for deploying robots in real-world environments. However, directly employing Large Language Models (LLMs) as action sequence generators often results in low success rates due to their limited reasoning ability for long-horizon embodied tasks. In the STEP framework, we construct a subgoal tree through a pair of closed-loop models: a subgoal decomposition model and a leaf node termination model. Within this framework, we develop a hierarchical tree structure that spans from coarse to fine resolutions. The subgoal decomposition model leverages a foundation LLM to break down complex goals into manageable subgoals, thereby spanning the subgoal tree. The leaf node termination model provides real-time feedback based on environmental states, determining when to terminate the tree spanning and ensuring each leaf node can be directly converted into a primitive action. Experiments conducted in both the VirtualHome WAH-NL benchmark and on real robots demonstrate that STEP achieves long-horizon embodied task completion with success rates up to 34% (WAH-NL) and 25% (real robot), outperforming SOTA methods.
Submitted 26 June, 2025;
originally announced June 2025.
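Note on the entry above: the abstract describes a subgoal tree grown by a decomposition model and stopped by a leaf-termination model. The sketch below illustrates only that control flow under strong simplifying assumptions: a lookup table (`RULES`) stands in for the LLM decomposition model and a set-membership check (`PRIMITIVES`) stands in for the termination model. All names and goals here are hypothetical and are not taken from the paper.

```python
def expand(goal, decompose, is_leaf):
    """Depth-first expansion of a subgoal tree.
    Returns the leaf subgoals in execution order."""
    if is_leaf(goal):
        # Termination check says this node maps to a primitive action.
        return [goal]
    leaves = []
    for subgoal in decompose(goal):
        leaves.extend(expand(subgoal, decompose, is_leaf))
    return leaves

# Hypothetical stand-in for the decomposition model.
RULES = {
    "serve coffee": ["get mug", "fill mug"],
    "get mug": ["walk to cabinet", "grab mug"],
    "fill mug": ["walk to coffee maker", "pour coffee"],
}
# Hypothetical stand-in for the leaf-termination model.
PRIMITIVES = {"walk to cabinet", "grab mug", "walk to coffee maker", "pour coffee"}

plan = expand("serve coffee", lambda g: RULES[g], lambda g: g in PRIMITIVES)
print(plan)
```

In the paper's setting the termination decision depends on environment state, whereas this toy version uses a fixed set, so it should be read as a structural illustration only.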
-
Statistical Strong Lensing as a Test of Conformal Gravity
Authors:
Li-Xue Yue,
Da-Ming Chen
Abstract:
As an alternative gravitational theory to General Relativity (GR), Conformal Gravity (CG) can be verified through astronomical observations. Currently, Mannheim and Kazanas have provided vacuum solutions for cosmological and local gravitational systems, and these solutions may resolve the dark matter and dark energy issues encountered in GR, making them particularly valuable. For static, spherically symmetric systems, CG predicts an additional linear potential generated by luminous matter in addition to the conventional Newtonian potential. This extra potential is expected to account for the observations of galaxies and galaxy clusters without the need for dark matter. It is characterized by the parameter $γ^*$, which corresponds to the linear potential generated by a unit of solar mass, and it is thus a universal constant. The value of $γ^*$ was determined by fitting the rotation curve data of spiral galaxies. These predictions of CG should also be verified by the observations of strong gravitational lensing. In this study, building upon previous research, we tested CG via strong lensing statistics. We used a well-defined sample that consisted of both galaxies and galaxy clusters. This allowed us to test CG through statistical strong lensing in a way similar to the conventional approach in GR. As anticipated, our results were consistent with previous studies, namely that the fitted $γ^*$ is much larger than that from rotation curves. Intriguingly, we further discovered that, in order to fit the strong lensing data of another sample, the value of $γ^*$ cannot be a constant, as is required in CG. Instead, we derived a formula for $γ^*$ as a function of the stellar mass $M_*$ of the galaxies or galaxy clusters. It was found that $γ^*$ decreases as $M_*$ increases.
Submitted 26 June, 2025;
originally announced June 2025.
-
POLAR: A Pessimistic Model-based Policy Learning Algorithm for Dynamic Treatment Regimes
Authors:
Ruijia Zhang,
Zhengling Qi,
Yue Wu,
Xiangyu Zhang,
Yanxun Xu
Abstract:
Dynamic treatment regimes (DTRs) provide a principled framework for optimizing sequential decision-making in domains where decisions must adapt over time in response to individual trajectories, such as healthcare, education, and digital interventions. However, existing statistical methods often rely on strong positivity assumptions and lack robustness under partial data coverage, while offline reinforcement learning approaches typically focus on average training performance, lack statistical guarantees, and require solving complex optimization problems. To address these challenges, we propose POLAR, a novel pessimistic model-based policy learning algorithm for offline DTR optimization. POLAR estimates the transition dynamics from offline data and quantifies uncertainty for each history-action pair. A pessimistic penalty is then incorporated into the reward function to discourage actions with high uncertainty. Unlike many existing methods that focus on average training performance, POLAR directly targets the suboptimality of the final learned policy and offers theoretical guarantees, without relying on computationally intensive minimax or constrained optimization procedures. To the best of our knowledge, POLAR is the first model-based DTR method to provide both statistical and computational guarantees, including finite-sample bounds on policy suboptimality. Empirical results on both synthetic data and the MIMIC-III dataset demonstrate that POLAR outperforms state-of-the-art methods and yields near-optimal, history-aware treatment strategies.
Submitted 25 June, 2025;
originally announced June 2025.
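Note on the entry above: the abstract says a pessimistic penalty is added to the reward to discourage history-action pairs with high uncertainty. The sketch below shows that general idea only. The count-based uncertainty proxy `1 / sqrt(count)`, the weight `c`, and the toy data are assumptions made for illustration; the abstract does not specify POLAR's actual uncertainty quantifier, and this is not the authors' algorithm.

```python
import math
from collections import Counter

def penalized_reward(reward, count, c=1.0):
    """Estimated reward minus an uncertainty penalty that shrinks
    as the (history, action) pair appears more often in the data."""
    uncertainty = 1.0 / math.sqrt(max(count, 1))  # assumed proxy
    return reward - c * uncertainty

# Toy offline dataset of (history, action) pairs: "a" is well covered,
# "b" was observed only once.
data = [("h0", "a")] * 4 + [("h0", "b")]
counts = Counter(data)

# Hypothetical estimated rewards: "b" looks better on paper.
rewards = {("h0", "a"): 1.0, ("h0", "b"): 1.2}

scores = {pair: penalized_reward(rewards[pair], counts[pair]) for pair in rewards}
best = max(scores, key=scores.get)
print(scores, best)
```

With these numbers the poorly covered action loses its apparent advantage once the penalty is applied, which is the behavior pessimism is meant to produce under partial data coverage.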
-
Semantic-enhanced Modality-asymmetric Retrieval for Online E-commerce Search
Authors:
Zhigong Zhou,
Ning Ding,
Xiaochuan Fan,
Yue Shang,
Yiming Qiu,
Jingwei Zhuo,
Zhiwei Ge,
Songlin Wang,
Lin Liu,
Sulong Xu,
Han Zhang
Abstract:
Semantic retrieval, which retrieves semantically matched items given a textual query, has been an essential component for enhancing system effectiveness in e-commerce search. In this paper, we study the multimodal retrieval problem, where the visual information (e.g., images) of an item is leveraged as a supplement to textual information to enrich item representation and further improve retrieval performance. Though learning from cross-modality data has been studied extensively in tasks such as visual question answering or media summarization, multimodal retrieval remains a non-trivial and unsolved problem, especially in the asymmetric scenario where the query is unimodal while the item is multimodal. In this paper, we propose a novel model named SMAR, which stands for Semantic-enhanced Modality-Asymmetric Retrieval, to tackle the problem of modality fusion and alignment in this kind of asymmetric scenario. Extensive experimental results on an industrial dataset show that the proposed model outperforms baseline models significantly in retrieval accuracy. We have open sourced our industrial dataset for the sake of reproducibility and future research.
Submitted 25 June, 2025;
originally announced June 2025.
-
The role of preprints in open science: Accelerating knowledge transfer from science to technology
Authors:
Zhiqi Wang,
Yue Chen,
Chun Yang
Abstract:
Preprints have become increasingly essential in the landscape of open science, facilitating not only the exchange of knowledge within the scientific community but also bridging the gap between science and technology. However, the impact of preprints on technological innovation, given their unreviewed nature, remains unclear. This study fills this gap by conducting a comprehensive scientometric analysis of patent citations to bioRxiv preprints submitted between 2013 and 2021, measuring and assessing the contribution of preprints in accelerating knowledge transfer from science to technology. Our findings reveal a growing trend of patent citations to bioRxiv preprints, with a notable surge in 2020, primarily driven by the COVID-19 pandemic. Preprints play a critical role in accelerating innovation: they not only expedite the dissemination of scientific knowledge into technological innovation but also enhance the visibility of early research results in the patenting process, while journals remain essential for academic rigor and reliability. The substantial number of post-online-publication patent citations highlights the critical role of the open science model, particularly the "open access" effect of preprints, in amplifying the impact of science on technological innovation. This study provides empirical evidence that open science policies encouraging the early sharing of research outputs, such as preprints, contribute to more efficient linkage between science and technology, suggesting an acceleration in the pace of innovation, higher innovation quality, and economic benefits.
Submitted 26 June, 2025; v1 submitted 25 June, 2025;
originally announced June 2025.
-
SEED: A Structural Encoder for Embedding-Driven Decoding in Time Series Prediction with LLMs
Authors:
Fengze Li,
Yue Wang,
Yangle Liu,
Ming Huang,
Dou Hong,
Jieming Ma
Abstract:
Multivariate time series forecasting requires models to simultaneously capture variable-wise structural dependencies and generalize across diverse tasks. While structural encoders are effective in modeling feature interactions, they lack the capacity to support semantic-level reasoning or task adaptation. Conversely, large language models (LLMs) possess strong generalization capabilities but remain incompatible with raw time series inputs. This gap limits the development of unified, transferable prediction systems. Therefore, we introduce SEED, a structural encoder for embedding-driven decoding, which integrates four stages: a token-aware encoder for patch extraction, a projection module that aligns patches with language model embeddings, a semantic reprogramming mechanism that maps patches to task-aware prototypes, and a frozen language model for prediction. This modular architecture decouples representation learning from inference, enabling efficient alignment between numerical patterns and semantic reasoning. Empirical results demonstrate that the proposed method achieves consistent improvements over strong baselines, and comparative studies on various datasets confirm SEED's role in addressing the structural-semantic modeling gap.
Submitted 25 June, 2025;
originally announced June 2025.
-
QCD Axion Domain Walls from Super-Cooling First Order Phase Transition
Authors:
Kun-Feng Lyu,
Yue Zhao
Abstract:
The QCD axion is a well-motivated hypothetical particle beyond the Standard Model (SM) and a compelling dark matter candidate. Its relic abundance is highly sensitive to the thermal history of the universe when the temperature is around the QCD confinement scale. Meanwhile, the NANOGrav Collaboration has reported evidence for a stochastic gravitational wave background, which could originate from a supercooled first-order phase transition (FOPT) with a nucleation temperature around the O(MeV-GeV) scale. We explore how such an FOPT might alter the evolution of the QCD axion. Our findings suggest that it could induce the axion to go through a short stage of mini kinetic misalignment. Moreover, in some parameter regime, the formation of QCD axion domain walls becomes generically expected. This has intriguing implications for both the existence of the QCD axion and the FOPT interpretation of the NANOGrav signal.
Submitted 24 June, 2025;
originally announced June 2025.
-
UltraAD: Fine-Grained Ultrasound Anomaly Classification via Few-Shot CLIP Adaptation
Authors:
Yue Zhou,
Yuan Bi,
Wenjuan Tong,
Wei Wang,
Nassir Navab,
Zhongliang Jiang
Abstract:
Precise anomaly detection in medical images is critical for clinical decision-making. While recent unsupervised or semi-supervised anomaly detection methods trained on large-scale normal data show promising results, they lack fine-grained differentiation, such as benign vs. malignant tumors. Additionally, ultrasound (US) imaging is highly sensitive to devices and acquisition parameter variations, creating significant domain gaps in the resulting US images. To address these challenges, we propose UltraAD, a vision-language model (VLM)-based approach that leverages few-shot US examples for generalized anomaly localization and fine-grained classification. To enhance localization performance, the image-level token of query visual prototypes is first fused with learnable text embeddings. This image-informed prompt feature is then further integrated with patch-level tokens, refining local representations for improved accuracy. For fine-grained classification, a memory bank is constructed from few-shot image samples and corresponding text descriptions that capture anatomical and abnormality-specific features. During training, the stored text embeddings remain frozen, while image features are adapted to better align with medical data. UltraAD has been extensively evaluated on three breast US datasets, outperforming state-of-the-art methods in both lesion localization and fine-grained medical classification. The code will be released upon acceptance.
Submitted 24 June, 2025;
originally announced June 2025.
-
PrivacyXray: Detecting Privacy Breaches in LLMs through Semantic Consistency and Probability Certainty
Authors:
Jinwen He,
Yiyang Lu,
Zijin Lin,
Kai Chen,
Yue Zhao
Abstract:
Large Language Models (LLMs) are widely used in sensitive domains, including healthcare, finance, and legal services, raising concerns about potential private information leaks during inference. Privacy extraction attacks, such as jailbreaking, expose vulnerabilities in LLMs by crafting inputs that force the models to output sensitive information. However, these attacks cannot verify whether the extracted private information is accurate, as no public datasets exist for cross-validation, leaving a critical gap in private information detection during inference. To address this, we propose PrivacyXray, a novel framework detecting privacy breaches by analyzing LLM inner states. Our analysis reveals that LLMs exhibit higher semantic coherence and probabilistic certainty when generating correct private outputs. Based on this, PrivacyXray detects privacy breaches using four metrics: intra-layer and inter-layer semantic similarity, token-level and sentence-level probability distributions. PrivacyXray addresses critical challenges in private information detection by overcoming the lack of open-source private datasets and eliminating reliance on external data for validation. It achieves this through the synthesis of realistic private data and a detection mechanism based on the inner states of LLMs. Experiments show that PrivacyXray achieves consistent performance, with an average accuracy of 92.69% across five LLMs. Compared to state-of-the-art methods, PrivacyXray achieves significant improvements, with an average accuracy increase of 20.06%, highlighting its stability and practical utility in real-world applications.
Submitted 24 June, 2025;
originally announced June 2025.
-
HMSViT: A Hierarchical Masked Self-Supervised Vision Transformer for Corneal Nerve Segmentation and Diabetic Neuropathy Diagnosis
Authors:
Xin Zhang,
Liangxiu Han,
Yue Shi,
Yanlin Zheng,
Alam Uazman,
Maryam Ferdousi,
Rayaz Malik
Abstract:
Diabetic Peripheral Neuropathy (DPN) affects nearly half of diabetes patients, requiring early detection. Corneal Confocal Microscopy (CCM) enables non-invasive diagnosis, but automated methods suffer from inefficient feature extraction, reliance on handcrafted priors, and data limitations. We propose HMSViT, a novel Hierarchical Masked Self-Supervised Vision Transformer designed for corneal nerve segmentation and DPN diagnosis. Unlike existing methods, HMSViT employs pooling-based hierarchical and dual attention mechanisms with absolute positional encoding, enabling efficient multi-scale feature extraction by capturing fine-grained local details in early layers and integrating global context in deeper layers, all at a lower computational cost. A block-masked self-supervised learning framework is designed for HMSViT that reduces reliance on labelled data and enhances feature robustness, while a multi-scale decoder is used for segmentation and classification by fusing hierarchical features. Experiments on clinical CCM datasets show that HMSViT achieves state-of-the-art performance, with 61.34% mIoU for nerve segmentation and 70.40% diagnostic accuracy, outperforming leading hierarchical models like the Swin Transformer and HiViT by margins of up to 6.39% in segmentation accuracy while using fewer parameters. Detailed ablation studies further reveal that integrating block-masked SSL with hierarchical multi-scale feature extraction substantially enhances performance compared to conventional supervised training. Overall, these comprehensive experiments confirm that HMSViT delivers excellent, robust, and clinically viable results, demonstrating its potential for scalable deployment in real-world diagnostic applications.
Submitted 24 June, 2025;
originally announced June 2025.
-
From High-SNR Radar Signal to ECG: A Transfer Learning Model with Cardio-Focusing Algorithm for Scenarios with Limited Data
Authors:
Yuanyuan Zhang,
Haocheng Zhao,
Sijie Xiong,
Rui Yang,
Eng Gee Lim,
Yutao Yue
Abstract:
Electrocardiogram (ECG), as a crucial fine-grained cardiac feature, has been successfully recovered from radar signals in the literature, but the performance heavily relies on high-quality radar signals and numerous radar-ECG pairs for training, restricting the applications in new scenarios due to data scarcity. Therefore, this work focuses on radar-based ECG recovery in new scenarios with limited data and proposes a cardio-focusing and -tracking (CFT) algorithm to precisely track the cardiac location to ensure an efficient acquisition of high-quality radar signals. Furthermore, a transfer learning model (RFcardi) is proposed to extract cardio-related information from the radar signal without ECG ground truth based on the intrinsic sparsity of cardiac features, and only a few synchronous radar-ECG pairs are required to fine-tune the pre-trained model for the ECG recovery. The experimental results reveal that the proposed CFT can dynamically identify the cardiac location, and the RFcardi model can effectively generate faithful ECG recoveries after using a small number of radar-ECG pairs for training. The code and dataset will be available after publication.
Submitted 24 June, 2025;
originally announced June 2025.
-
Memory-Augmented Incomplete Multimodal Survival Prediction via Cross-Slide and Gene-Attentive Hypergraph Learning
Authors:
Mingcheng Qu,
Guang Yang,
Donglin Di,
Yue Gao,
Tonghua Su,
Yang Song,
Lei Fan
Abstract:
Multimodal pathology-genomic analysis is critical for cancer survival prediction. However, existing approaches predominantly integrate formalin-fixed paraffin-embedded (FFPE) slides with genomic data, while neglecting the availability of other preservation slides, such as Fresh Frozen (FF) slides. Moreover, as the high-resolution spatial nature of pathology data tends to dominate the cross-modality fusion process, it hinders effective multimodal fusion and leads to modality imbalance challenges between pathology and genomics. These methods also typically require complete data modalities, limiting their clinical applicability with incomplete modalities, such as missing either pathology or genomic data. In this paper, we propose a multimodal survival prediction framework that leverages hypergraph learning to effectively integrate multi-WSI information and cross-modality interactions between pathology slides and genomics data while addressing modality imbalance. In addition, we introduce a memory mechanism that stores previously learned paired pathology-genomic features and dynamically compensates for incomplete modalities. Experiments on five TCGA datasets demonstrate that our model outperforms advanced methods by over 2.3% in C-Index. Under incomplete modality scenarios, our approach surpasses pathology-only (3.3%) and gene-only models (7.9%). Code: https://github.com/MCPathology/M2Surv
Submitted 24 June, 2025;
originally announced June 2025.
-
Da Yu: Towards USV-Based Image Captioning for Waterway Surveillance and Scene Understanding
Authors:
Runwei Guan,
Ningwei Ouyang,
Tianhao Xu,
Shaofeng Liang,
Wei Dai,
Yafeng Sun,
Shang Gao,
Songning Lai,
Shanliang Yao,
Xuming Hu,
Ryan Wen Liu,
Yutao Yue,
Hui Xiong
Abstract:
Automated waterway environment perception is crucial for enabling unmanned surface vessels (USVs) to understand their surroundings and make informed decisions. Most existing waterway perception models primarily focus on instance-level object perception paradigms (e.g., detection, segmentation). However, due to the complexity of waterway environments, current perception datasets and models fail to achieve global semantic understanding of waterways, limiting large-scale monitoring and structured log generation. With the advancement of vision-language models (VLMs), we leverage image captioning to introduce WaterCaption, the first captioning dataset specifically designed for waterway environments. WaterCaption focuses on fine-grained, multi-region long-text descriptions, providing a new research direction for visual geo-understanding and spatial scene cognition. Specifically, it includes 20.2k image-text pairs with a vocabulary size of 1.8 million. Additionally, we propose Da Yu, an edge-deployable multi-modal large language model for USVs, which includes a novel vision-to-language projector called Nano Transformer Adaptor (NTA). NTA effectively balances computational efficiency with the capacity for both global and fine-grained local modeling of visual features, thereby significantly enhancing the model's ability to generate long-form textual outputs. Da Yu achieves an optimal balance between performance and efficiency, surpassing state-of-the-art models on WaterCaption and several other captioning benchmarks.
Submitted 23 June, 2025;
originally announced June 2025.
-
MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models
Authors:
Yinan Xia,
Yilei Jiang,
Yingshui Tan,
Xiaoyong Zhu,
Xiangyu Yue,
Bo Zheng
Abstract:
Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning tasks through enhanced chain-of-thought capabilities. However, this advancement also introduces novel safety risks, as these models become increasingly vulnerable to harmful multimodal prompts that can trigger unethical or unsafe behaviors. Existing safety alignment approaches, primarily designed for unimodal l…
▽ More
Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning tasks through enhanced chain-of-thought capabilities. However, this advancement also introduces novel safety risks, as these models become increasingly vulnerable to harmful multimodal prompts that can trigger unethical or unsafe behaviors. Existing safety alignment approaches, primarily designed for unimodal language models, fall short in addressing the complex and nuanced threats posed by multimodal inputs. Moreover, current safety datasets lack the fine-grained, policy-grounded reasoning required to robustly align reasoning-capable VLMs. In this work, we introduce {MSR-Align}, a high-quality Multimodal Safety Reasoning dataset tailored to bridge this gap. MSR-Align supports fine-grained, deliberative reasoning over standardized safety policies across both vision and text modalities. Our data generation pipeline emphasizes multimodal diversity, policy-grounded reasoning, and rigorous quality filtering using strong multimodal judges. Extensive experiments demonstrate that fine-tuning VLMs on MSR-Align substantially improves robustness against both textual and vision-language jailbreak attacks, while preserving or enhancing general reasoning performance. MSR-Align provides a scalable and effective foundation for advancing the safety alignment of reasoning-capable VLMs. Our dataset is made publicly available at https://huggingface.co/datasets/Leigest/MSR-Align.
Submitted 23 June, 2025;
originally announced June 2025.
-
The MOTIF Hand: A Robotic Hand for Multimodal Observations with Thermal, Inertial, and Force Sensors
Authors:
Hanyang Zhou,
Haozhe Lou,
Wenhao Liu,
Enyu Zhao,
Yue Wang,
Daniel Seita
Abstract:
Advancing dexterous manipulation with multi-fingered robotic hands requires rich sensory capabilities, while existing designs lack onboard thermal and torque sensing. In this work, we propose the MOTIF hand, a novel multimodal and versatile robotic hand that extends the LEAP hand by integrating: (i) dense tactile information across the fingers, (ii) a depth sensor, (iii) a thermal camera, (iv) IMU sensors, and (v) a visual sensor. The MOTIF hand is designed to be relatively low-cost (under 4000 USD) and easily reproducible. We validate our hand design through experiments that leverage its multimodal sensing for two representative tasks. First, we integrate thermal sensing into 3D reconstruction to guide temperature-aware, safe grasping. Second, we show how our hand can distinguish objects with identical appearance but different masses, a capability beyond methods that use vision only.
Submitted 23 June, 2025;
originally announced June 2025.
-
SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications
Authors:
Jinyang Li,
Xiaolong Li,
Ge Qu,
Per Jacobsson,
Bowen Qin,
Binyuan Hui,
Shuzheng Si,
Nan Huo,
Xiaohan Xu,
Yue Zhang,
Ziwei Tang,
Yuanshuai Li,
Florensia Widjaja,
Xintong Zhu,
Feige Zhou,
Yongfeng Huang,
Yannis Papakonstantinou,
Fatma Ozcan,
Chenhao Ma,
Reynold Cheng
Abstract:
Resolution of complex SQL issues persists as a significant bottleneck in real-world database applications. Current Large Language Models (LLMs), while adept at text-to-SQL translation, have not been rigorously evaluated on the more challenging task of debugging SQL issues. To address this gap, we introduce BIRD-CRITIC, a new SQL issue debugging benchmark comprising 530 PostgreSQL tasks (BIRD-CRITIC-PG) and 570 multi-dialect tasks (BIRD-CRITIC-Multi), distilled from authentic user issues and replayed within new environments to facilitate rigorous evaluation. Baseline evaluations underscore the task's complexity, with the leading reasoning model O3-Mini achieving only a 38.87% success rate on BIRD-CRITIC-PG and 33.33% on BIRD-CRITIC-Multi. Meanwhile, advancing open-source models for database tasks is crucial for empowering local development while safeguarding data privacy. Therefore, we present Six-Gym (Sql-fIX-Gym), a training environment for elevating open-source model capabilities for SQL issue debugging. This environment leverages the SQL-Rewind strategy, which automatically generates executable issue-solution datasets by reverse-engineering issues from verified SQLs. However, popular trajectory-based fine-tuning methods do not explore substantial supervisory signals. We further propose f-Plan Boosting, which extracts high-level debugging plans from SQL solutions, enabling teacher LLMs to produce 73.7% more successful trajectories for training. We integrate these components into an open-source agent, Bird-Fixer. Based on Qwen-2.5-Coder-14B, Bird-Fixer achieves a 38.11% success rate on BIRD-CRITIC-PG and 29.65% on BIRD-CRITIC-Multi, surpassing leading proprietary models such as Claude-3.7-Sonnet and GPT-4.1, marking a significant step toward democratizing sophisticated SQL-debugging capabilities. The leaderboard and source code are available at https://bird-critic.github.io/
Submitted 23 June, 2025;
originally announced June 2025.
-
Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
Authors:
Jiaming Han,
Hao Chen,
Yang Zhao,
Hanyu Wang,
Qi Zhao,
Ziyan Yang,
Hao He,
Xiangyu Yue,
Lu Jiang
Abstract:
This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary. By integrating vision and text into a unified space with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input and output through a shared interface, without the need for modality-specific designs. Additionally, we propose scale-adaptive encoding and decoding to balance efficiency and visual detail, along with a generative de-tokenizer to produce high-fidelity visual outputs. To address diverse decoding needs, we utilize two complementary de-tokenizers: a fast autoregressive model and a diffusion-based model. To enhance modality fusion, we investigate advanced pre-training tasks, demonstrating improvements in both visual understanding and generation. Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency. Code, models, and data are available at https://tar.csuhan.com
Submitted 23 June, 2025;
originally announced June 2025.
-
USVTrack: USV-Based 4D Radar-Camera Tracking Dataset for Autonomous Driving in Inland Waterways
Authors:
Shanliang Yao,
Runwei Guan,
Yi Ni,
Sen Xu,
Yong Yue,
Xiaohui Zhu,
Ryan Wen Liu
Abstract:
Object tracking in inland waterways plays a crucial role in safe and cost-effective applications, including waterborne transportation, sightseeing tours, environmental monitoring and surface rescue. Our Unmanned Surface Vehicle (USV), equipped with a 4D radar, a monocular camera, a GPS, and an IMU, delivers robust tracking capabilities in complex waterborne environments. By leveraging these sensors, our USV collected comprehensive object tracking data, which we present as USVTrack, the first 4D radar-camera tracking dataset tailored for autonomous driving in new generation waterborne transportation systems. Our USVTrack dataset presents rich scenarios, featuring diverse waterways, varying times of day, and multiple weather and lighting conditions. Moreover, we present a simple but effective radar-camera matching method, termed RCM, which can be plugged into popular two-stage association trackers. Experimental results utilizing RCM demonstrate the effectiveness of the radar-camera matching in improving object tracking accuracy and reliability for autonomous driving in waterborne environments. The USVTrack dataset is publicly available at https://usvtrack.github.io.
Submitted 23 June, 2025;
originally announced June 2025.
-
Airalogy: AI-empowered universal data digitization for research automation
Authors:
Zijie Yang,
Qiji Zhou,
Fang Guo,
Sijie Zhang,
Yexun Xi,
Jinglei Nie,
Yudian Zhu,
Liping Huang,
Chou Wu,
Yonghe Xia,
Xiaoyu Ma,
Yingming Pu,
Panzhong Lu,
Junshu Pan,
Mingtao Chen,
Tiannan Guo,
Yanmei Dou,
Hongyu Chen,
Anping Zeng,
Jiaxing Huang,
Tian Xu,
Yue Zhang
Abstract:
Research data are the foundation of Artificial Intelligence (AI)-driven science, yet current AI applications remain limited to a few fields with readily available, well-structured, digitized datasets. Achieving comprehensive AI empowerment across multiple disciplines is still out of reach. Present-day research data collection is often fragmented, lacking unified standards, inefficiently managed, and difficult to share. Creating a single platform for standardized data digitization needs to overcome the inherent challenge of balancing between universality (supporting the diverse, ever-evolving needs of various disciplines) and standardization (enforcing consistent formats to fully enable AI). No existing platform accommodates both facets. Building a truly multidisciplinary platform requires integrating scientific domain knowledge with sophisticated computing skills. Researchers often lack the computational expertise to design customized and standardized data recording methods, whereas platform developers rarely grasp the intricate needs of multiple scientific domains. These gaps impede research data standardization and hamper AI-driven progress. In this study, we address these challenges by developing Airalogy (https://airalogy.com), the world's first AI- and community-driven platform that balances universality and standardization for digitizing research data across multiple disciplines. Airalogy represents entire research workflows using customizable, standardized data records and offers an advanced AI research copilot for intelligent Q&A, automated data entry, analysis, and research automation. Already deployed in laboratories across all four schools of Westlake University, Airalogy has the potential to accelerate and automate scientific innovation in universities, industry, and the global research community, ultimately benefiting humanity as a whole.
Submitted 23 June, 2025;
originally announced June 2025.
-
MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis
Authors:
Yuting Zhang,
Kaishen Yuan,
Hao Lu,
Yutao Yue,
Jintai Chen,
Kaishun Wu
Abstract:
Accurate and interpretable multi-disease diagnosis remains a critical challenge in medical research, particularly when leveraging heterogeneous multimodal medical data. Current approaches often rely on single-modal data, limiting their ability to comprehensively understand complex diseases. To address this, we propose MedTVT-R1, a novel Multimodal Large Language Model (MLLM) framework designed to integrate clinical multimodal data for reasoning and diagnosing multiple diseases. We construct MedTVT-QA, a curated instruction dataset that provides question-answer pairs for physiological-level interpretations and disease-level diagnoses with a Chain of Evidence approach. MedTVT-R1 incorporates a modality perception layer to capture inter-modal dependencies and adaptively weight modality contributions. Additionally, we employ Group Relative Policy Optimization (GRPO)-based Reinforcement Fine-Tuning with a Jaccard Reward function to enhance diagnostic reasoning. Experimental results demonstrate MedTVT-R1's superiority in multimodal feature utilization and multi-disease diagnosis, offering significant potential for clinical applications such as diagnostic report generation and comorbidity reasoning. The dataset and code are available at https://github.com/keke-nice/MedTVT-R1.
Submitted 23 June, 2025;
originally announced June 2025.
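As a rough illustration of the Jaccard reward mentioned in the MedTVT-R1 abstract above, the sketch below scores a predicted set of diagnoses against a reference set using the standard Jaccard index. The function name `jaccard_reward`, the example label strings, and the convention for two empty sets are assumptions made for illustration; the abstract does not state how the authors compute or scale the reward.

```python
def jaccard_reward(predicted, reference):
    """Jaccard index |P & R| / |P | R| between two collections of labels.

    Hypothetical helper: duplicates are ignored because both inputs are
    converted to sets before comparison.
    """
    pred_set, ref_set = set(predicted), set(reference)
    union = pred_set | ref_set
    if not union:
        # Both sets empty: treated as perfect agreement (an assumption of this sketch).
        return 1.0
    return len(pred_set & ref_set) / len(union)


# Example with made-up labels: one shared diagnosis out of three distinct ones.
reward = jaccard_reward(["diabetes", "hypertension"], ["hypertension", "anemia"])
```

A set-overlap score of this kind gives partial credit when only some of the predicted diseases are correct, which is one plausible reason to prefer it over exact-match scoring in a multi-label setting.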
-
Doping-induced Polyamorphic Transitions in Fluorite Oxides
Authors:
Hao Yang,
Qiaotong Luan,
Qing Zhang,
Yuhao Yue,
Yawen Xu,
Xiaohui Liu,
Zheng Wen,
Zhaoru Sun
Abstract:
Fluorite oxides such as HfO$_2$ exhibit rich and tunable phase behavior, making them promising candidates for next generation electronic devices. A key challenge is to design amorphous HfO$_2$-based high-$k$ materials with both structural and performance stability. Here, using molecular dynamics simulations supported by experimental measurements, we reveal that Ba doping stimulates a polyamorphic transition in HfO$_2$, yielding a semi-ordered amorphous (SA) phase characterized by disordered oxygens embedded within an ordered metal sublattice. We find that this phase arises from degenerate short-range symmetry breaking modes, consistent with Pauling's parsimony rule. Notably, the SA structure is thermodynamically stable and displays a wider bandgap and higher dielectric constant than conventional random-packing amorphous structure, owing to suppressed subgap states and increased Born effective charges. We further demonstrate that this structural motif generalizes to Ba-, Sr-, and Ca-doped HfO$_2$ and ZrO$_2$, establishing a broadly applicable strategy for designing high-performance amorphous dielectrics.
Submitted 23 June, 2025;
originally announced June 2025.
-
Programmable electro-optic frequency comb empowers integrated parallel convolution processing
Authors:
Jinze He,
Junzhe Qiang,
Yiying Dong,
Jingyi Wang,
Tian Dong,
Gongcheng Yue,
Rongjin Zhuang,
Mingze Lv,
Siyuan Yu,
Zhongjin Lin,
Xinlun Cai,
Yuanmu Yang,
Guanhao Wu,
Yang Li
Abstract:
Integrated photonic convolution processors make optical neural networks (ONNs) a transformative solution for artificial intelligence applications such as machine vision. To enhance the parallelism, throughput, and energy efficiency of ONNs, wavelength multiplexing is widely applied. However, it often encounters the challenges of low compactness, limited scalability, and high weight reconstruction latency. Here, we proposed and demonstrated an integrated photonic processing unit with a parallel convolution computing speed of 1.62 trillion operations per second (TOPS) and a weight reconstruction speed exceeding 38 GHz. This processing unit simultaneously achieves, for the first time, multi-wavelength generation and weight mapping via a single programmable electro-optic (EO) frequency comb, featuring unprecedented compactness, device-footprint independent scalability, and near-unity optical power conversion efficiency (conversion efficiency from input optical power to output weighted comb lines). To demonstrate the reconfigurability and functionality of this processing unit, we implemented image edge detection and object classification based on EO combs obtained using the particle swarm algorithm and an EO comb neural network training framework, respectively. Our programmable EO comb-based processing framework establishes a new paradigm towards the development of low-latency monolithic photonic processors, promising real-time in-sensor learning for autonomous vehicles, intelligent robotics, and drones.
Submitted 23 June, 2025;
originally announced June 2025.
-
Learning Causal Graphs at Scale: A Foundation Model Approach
Authors:
Naiyu Yin,
Tian Gao,
Yue Yu
Abstract:
Due to its human-interpretability and invariance properties, the Directed Acyclic Graph (DAG) has been a foundational tool across various areas of AI research, leading to significant advancements. However, DAG learning remains highly challenging, due to its super-exponential growth in computational cost and identifiability issues, particularly in small-sample regimes. To address these two challenges, in this work we leverage the recent success of linear transformers and develop a foundation model approach for discovering multiple order-consistent DAGs across tasks. In particular, we propose Attention-DAG (ADAG), a novel attention-mechanism-based architecture for learning multiple linear Structural Equation Models (SEMs). ADAG learns the mapping from observed data to both graph structure and parameters via a nonlinear attention-based kernel, enabling efficient multi-task estimation of the underlying linear SEMs. By formulating the learning process across multiple tasks as a continuous optimization problem, the pre-trained ADAG model captures the common structural properties as a shared low-dimensional prior, thereby reducing the ill-posedness of downstream DAG learning tasks in small-sample regimes. We evaluate our proposed approach on benchmark synthetic datasets and find that ADAG achieves substantial improvements in both DAG learning accuracy and zero-shot inference efficiency. To the best of our knowledge, this is the first practical approach for pre-training a foundation model specifically designed for DAG learning, representing a step toward more efficient and generalizable downstream applications in causal discovery.
Submitted 23 June, 2025;
originally announced June 2025.
-
Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with Reinforcement Learning
Authors:
Yue Li,
Meng Tian,
Dechang Zhu,
Jiangtong Zhu,
Zhenyu Lin,
Zhiwei Xiong,
Xinhai Zhao
Abstract:
Large vision-language models (VLMs) for autonomous driving (AD) are evolving beyond perception and cognition tasks toward motion planning. However, we identify two critical challenges in this direction: (1) VLMs tend to learn shortcuts by relying heavily on history input information, achieving seemingly strong planning results without genuinely understanding the visual inputs; and (2) the chain-of-thought (COT) reasoning processes are always misaligned with the motion planning outcomes, and how to effectively leverage the complex reasoning capability to enhance planning remains largely underexplored. In this paper, we start from a small-scale domain-specific VLM and propose Drive-R1, designed to bridge scenario reasoning and motion planning for AD. Drive-R1 first undergoes supervised fine-tuning on an elaborate dataset containing both long and short COT data. Drive-R1 is encouraged to reason step-by-step from visual input to final planning decisions. Subsequently, Drive-R1 is trained within a reinforcement learning framework that incentivizes the discovery of reasoning paths that are more informative for planning, guided by rewards based on predicted trajectories and meta actions. Experimental evaluations on the nuScenes and DriveLM-nuScenes benchmarks demonstrate that Drive-R1 achieves superior performance compared to existing state-of-the-art VLMs. We believe that Drive-R1 presents a promising direction for bridging reasoning and planning in AD, offering methodological insights for future research and applications.
Submitted 22 June, 2025;
originally announced June 2025.
-
The mechanism of tornadogenesis from the perspective of vortex tubes
Authors:
Peng Yue,
Y. Charles Li,
Jiamin Dang,
Leigh Orf,
Grace Yan
Abstract:
In this paper, we propose a new theory on tornadogenesis from the perspective of vortex tubes based on Kelvin-Helmholtz Theorems. When the pressure difference between the lowest pressure line from the wall cloud down to the ground and its surroundings is large enough, the increase of vorticity inside the squeezed vortex tube can reach the tornado level, and thus a tornado is born. When the pressure difference increases, the tornado strength increases. When the pressure difference decreases, the tornado strength decreases. The decay of tornadoes is caused by the decreasing pressure difference. This is our theory of the entire tornado lifespan.
Submitted 22 June, 2025;
originally announced June 2025.
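The tornadogenesis abstract above appeals to vorticity increasing inside a squeezed vortex tube. One textbook reading of that statement, assuming an idealized inviscid tube with uniform vorticity across its cross-section, is that circulation (vorticity times cross-sectional area) stays constant, so shrinking the area raises the vorticity in proportion. The sketch below encodes only that idealization; the function name and the numbers are hypothetical and are not taken from the paper.

```python
def squeezed_vorticity(omega_initial, area_initial, area_final):
    """Vorticity after a vortex tube's cross-section changes, assuming
    conserved circulation: omega_initial * area_initial == omega_final * area_final.

    Idealized sketch (inviscid flow, uniform vorticity over the cross-section).
    """
    if area_initial <= 0 or area_final <= 0:
        raise ValueError("cross-sectional areas must be positive")
    return omega_initial * area_initial / area_final


# Made-up illustration: squeezing the tube to 1/100 of its area
# multiplies the vorticity by 100 under this assumption.
omega_after = squeezed_vorticity(omega_initial=0.01, area_initial=100.0, area_final=1.0)
```

How the pressure difference described in the abstract sets the amount of squeezing is the substance of the authors' theory and is not modeled here.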
-
MiCo: Multiple Instance Learning with Context-Aware Clustering for Whole Slide Image Analysis
Authors:
Junjian Li,
Hulin Kuang,
Jin Liu,
Hailin Yue,
Mengshen He,
Jianxin Wang
Abstract:
Multiple instance learning (MIL) has shown significant promise in histopathology whole slide image (WSI) analysis for cancer diagnosis and prognosis. However, the inherent spatial heterogeneity of WSIs presents critical challenges, as morphologically similar tissue types are often dispersed across distant anatomical regions. Conventional MIL methods struggle to model these scattered tissue distributions and capture cross-regional spatial interactions effectively. To address these limitations, we propose a novel Multiple instance learning framework with Context-Aware Clustering (MiCo), designed to enhance cross-regional intra-tissue correlations and strengthen inter-tissue semantic associations in WSIs. MiCo begins by clustering instances to distill discriminative morphological patterns, with cluster centroids serving as semantic anchors. To enhance cross-regional intra-tissue correlations, MiCo employs a Cluster Route module, which dynamically links instances of the same tissue type across distant regions via feature similarity. These semantic anchors act as contextual hubs, propagating semantic relationships to refine instance-level representations. To eliminate semantic fragmentation and strengthen inter-tissue semantic associations, MiCo integrates a Cluster Reducer module, which consolidates redundant anchors while enhancing information exchange between distinct semantic groups. Extensive experiments on two challenging tasks across nine large-scale public cancer datasets demonstrate the effectiveness of MiCo, showcasing its superiority over state-of-the-art methods. The code is available at https://github.com/junjianli106/MiCo.
Submitted 25 June, 2025; v1 submitted 22 June, 2025;
originally announced June 2025.
-
Learning from the Storm: A Multivariate Machine Learning Approach to Predicting Hurricane-Induced Economic Losses
Authors:
Bolin Shen,
Eren Erman Ozguven,
Yue Zhao,
Guang Wang,
Yiqun Xie,
Yushun Dong
Abstract:
Florida is particularly vulnerable to hurricanes, which frequently cause substantial economic losses. While prior studies have explored specific contributors to hurricane-induced damage, few have developed a unified framework capable of integrating a broader range of influencing factors to comprehensively assess the sources of economic loss. In this study, we propose a comprehensive modeling framework that categorizes contributing factors into three key components: (1) hurricane characteristics, (2) water-related environmental factors, and (3) socioeconomic factors of affected areas. By integrating multi-source data and aggregating all variables at the finer spatial granularity of the ZIP Code Tabulation Area (ZCTA) level, we employ machine learning models to predict economic loss, using insurance claims as indicators of incurred damage. Beyond accurate loss prediction, our approach facilitates a systematic assessment of the relative importance of each component, providing practical guidance for disaster mitigation, risk assessment, and the development of adaptive urban strategies in coastal and storm-exposed areas. Our code is now available at: https://github.com/LabRAI/Hurricane-Induced-Economic-Loss-Prediction
Submitted 22 June, 2025;
originally announced June 2025.
-
Inverse Chance Constrained Optimal Power Flow
Authors:
Shenglu Wang,
Kairui Feng,
Mengqi Xue,
Yue Song
Abstract:
The chance constrained optimal power flow (CC-OPF) essentially finds the low-cost generation dispatch scheme ensuring operational constraints are met with a specified probability, termed the security level. While the security level is a crucial input parameter, how it shapes the CC-OPF feasibility boundary has not been revealed. Changing the security level from a parameter to a decision variable, this letter proposes the inverse CC-OPF that seeks the highest feasible security level supported by the system. To efficiently solve this problem, we design a Newton-Raphson-like iteration algorithm leveraging the duality-based sensitivity analysis of an associated surrogate problem. Numerical experiments validate the proposed approach, revealing complex feasibility boundaries for security levels that underscore the importance of coordinating security levels across multiple chance constraints.
Submitted 22 June, 2025;
originally announced June 2025.
-
YOLOv13: Real-Time Object Detection with Hypergraph-Enhanced Adaptive Visual Perception
Authors:
Mengqi Lei,
Siqi Li,
Yihong Wu,
Han Hu,
You Zhou,
Xinhu Zheng,
Guiguang Ding,
Shaoyi Du,
Zongze Wu,
Yue Gao
Abstract:
The YOLO series models reign supreme in real-time object detection due to their superior accuracy and computational efficiency. However, both the convolutional architectures of YOLO11 and earlier versions and the area-based self-attention mechanism introduced in YOLOv12 are limited to local information aggregation and pairwise correlation modeling, lacking the capability to capture global multi-to-multi high-order correlations, which limits detection performance in complex scenarios. In this paper, we propose YOLOv13, an accurate and lightweight object detector. To address the above-mentioned challenges, we propose a Hypergraph-based Adaptive Correlation Enhancement (HyperACE) mechanism that adaptively exploits latent high-order correlations and overcomes the limitation of previous methods that are restricted to pairwise correlation modeling based on hypergraph computation, achieving efficient global cross-location and cross-scale feature fusion and enhancement. Subsequently, we propose a Full-Pipeline Aggregation-and-Distribution (FullPAD) paradigm based on HyperACE, which effectively achieves fine-grained information flow and representation synergy within the entire network by distributing correlation-enhanced features to the full pipeline. Finally, we propose to leverage depthwise separable convolutions to replace vanilla large-kernel convolutions, and design a series of blocks that significantly reduce parameters and computational complexity without sacrificing performance. We conduct extensive experiments on the widely used MS COCO benchmark, and the experimental results demonstrate that our method achieves state-of-the-art performance with fewer parameters and FLOPs. Specifically, our YOLOv13-N improves mAP by 3.0% over YOLO11-N and by 1.5% over YOLOv12-N. The code and models of our YOLOv13 model are available at: https://github.com/iMoonLab/yolov13.
Submitted 21 June, 2025;
originally announced June 2025.
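The YOLOv13 abstract above says that replacing vanilla large-kernel convolutions with depthwise separable convolutions reduces parameters. A back-of-the-envelope count shows why, assuming the usual decomposition into a per-channel k x k depthwise convolution followed by a 1 x 1 pointwise convolution, with biases ignored. The helper names and the 64-channel, 7 x 7 example are illustrative assumptions, not figures from the paper.

```python
def vanilla_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv plus a 1 x 1 pointwise conv (bias ignored)."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 64 -> 64 channels with a 7 x 7 kernel.
vanilla = vanilla_conv_params(64, 64, 7)
separable = depthwise_separable_params(64, 64, 7)
reduction = vanilla / separable
```

The saving grows with kernel size, because the k x k term is no longer multiplied by the number of output channels; this is consistent with the abstract's focus on large-kernel layers.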
-
OpusLM: A Family of Open Unified Speech Language Models
Authors:
Jinchuan Tian,
William Chen,
Yifan Peng,
Jiatong Shi,
Siddhant Arora,
Shikhar Bharadwaj,
Takashi Maekaku,
Yusuke Shinohara,
Keita Goto,
Xiang Yue,
Huck Yang,
Shinji Watanabe
Abstract:
This paper presents Open Unified Speech Language Models (OpusLMs), a family of open foundational speech language models (SpeechLMs) up to 7B. Initialized from decoder-only text language models, the OpusLMs are continuously pre-trained on 213K hours of speech-text pairs and 292B text-only tokens. We demonstrate that our OpusLMs achieve performance comparable (or even superior) to existing SpeechLMs in speech recognition, speech synthesis, and text-only capabilities. Technically, this paper articulates our SpeechLM designs on tokenization, multi-stream language models, and multi-stage training strategies. We experimentally demonstrate the importance of model size scaling and the effect of annealing data selection. The OpusLMs are all built from publicly available materials and are fully transparent models. We release our code, data, checkpoints, and training logs to facilitate open SpeechLM research.
Submitted 21 June, 2025;
originally announced June 2025.
-
TyphoFormer: Language-Augmented Transformer for Accurate Typhoon Track Forecasting
Authors:
Lincan Li,
Eren Erman Ozguven,
Yue Zhao,
Guang Wang,
Yiqun Xie,
Yushun Dong
Abstract:
Accurate typhoon track forecasting is crucial for early system warning and disaster response. While Transformer-based models have demonstrated strong performance in modeling the temporal dynamics of dense trajectories of humans and vehicles in smart cities, they usually lack access to broader contextual knowledge that enhances the forecasting reliability of sparse meteorological trajectories, such as typhoon tracks. To address this challenge, we propose TyphoFormer, a novel framework that incorporates natural language descriptions as auxiliary prompts to improve typhoon trajectory forecasting. For each time step, we use a Large Language Model (LLM) to generate concise textual descriptions based on the numerical attributes recorded in the North Atlantic hurricane database. The language descriptions capture high-level meteorological semantics and are embedded as auxiliary special tokens prepended to the numerical time series input. By integrating both textual and sequential information within a unified Transformer encoder, TyphoFormer enables the model to leverage contextual cues that are otherwise inaccessible through numerical features alone. Extensive experiments are conducted on the HURDAT2 benchmark, and the results show that TyphoFormer consistently outperforms other state-of-the-art baseline methods, particularly under challenging scenarios involving nonlinear path shifts and limited historical observations.
Submitted 21 June, 2025;
originally announced June 2025.
-
HalluRNN: Mitigating Hallucinations via Recurrent Cross-Layer Reasoning in Large Vision-Language Models
Authors:
Le Yu,
Kaishen Wang,
Jianlong Xiong,
Yue Cao,
Tao He
Abstract:
Though Large Vision-Language Models (LVLMs) have achieved remarkable performance across various tasks, they are still prone to hallucinations, i.e., generating outputs that are textually plausible but visually ungrounded. While prior approaches generally address this issue through data-centric fine-tuning or innovative decoding strategies, these methods often require substantial resources or task-specific configurations. In this work, we introduce an architecture-level solution, HalluRNN, which enhances model stability through recurrent cross-layer reasoning. Specifically, we propose a novel Dual-Gated Depth Propagation Unit (DG-DPU) module, which is shared across layers and recurrently refines hidden states. This allows for the adaptive propagation of information throughout the model, enforces consistency across layers, and mitigates hallucinations caused by representational drift. By fine-tuning only the DG-DPU module, HalluRNN achieves strong and robust performance across multiple benchmarks.
Submitted 21 June, 2025;
originally announced June 2025.
-
Dynamics of Multiphase Carbon in the Turbulent Circumgalactic Medium
Authors:
Yue Hu,
Evan Scannapieco,
Edward Buie II,
Siyao Xu,
Samuel T Sebastian,
Om Biswal
Abstract:
The circumgalactic medium (CGM) plays a crucial role in regulating material and energy exchange between galaxies and their environments. The best means of observing this medium is through absorption-line spectroscopy, but we have yet to develop a consistent physical model that fully explains these results. Here we investigate the impact of turbulence and non-equilibrium chemistry on the properties of the CGM, using three-dimensional hydrodynamic simulations that include the impact of an ionizing background. Increasing turbulence enhances small-scale density fluctuations, shifting the kinetic energy spectra from Kolmogorov to Burgers scaling. This is indicative of shock-dominated dissipation, which plays a critical role in driving carbon ionization and shaping the multiphase structure of the medium. At the same time, the presence of background radiation significantly alters the ionization balance, increasing the prevalence of C\textsc{ii} and C\textsc{iv}. Thus, turbulence and the background radiation have complementary roles: turbulence governs the spatial distribution and facilitates the formation of ionized species, whereas the background radiation modifies the overall ionization equilibrium, setting the observed distribution of multiphase carbon.
Submitted 20 June, 2025;
originally announced June 2025.
-
LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research
Authors:
Shuo Yan,
Ruochen Li,
Ziming Luo,
Zimu Wang,
Daoyang Li,
Liqiang Jing,
Kaiyu He,
Peilin Wu,
George Michalopoulos,
Yue Zhang,
Ziyang Zhang,
Mian Zhang,
Zhiyu Chen,
Xinya Du
Abstract:
Large language model (LLM) agents have demonstrated remarkable potential in advancing scientific discovery. However, their capability in the fundamental yet crucial task of reproducing code from research papers, especially in the NLP domain, remains underexplored. This task includes unique complex reasoning challenges in the intellectual synthesis of abstract concepts and the comprehension of code repositories with interdependent files. Motivated by this gap, we present LMR-BENCH, a benchmark designed to systematically evaluate the capability of LLM agents on code reproduction from Language Modeling Research. It consists of 28 code reproduction tasks derived from 23 research papers published in top-tier NLP venues over the past five years, spanning nine fundamental categories. Models are provided with a research paper, a code repository containing one or more masked functions, and instructions for implementing these functions. We conduct extensive experiments in standard prompting and LLM agent settings with state-of-the-art LLMs, evaluating the accuracy of unit tests and performing LLM-based evaluation of code correctness. Experimental results reveal that even the most advanced models still exhibit persistent limitations in scientific reasoning and code synthesis, highlighting critical gaps in LLM agents' ability to autonomously reproduce scientific research.
Submitted 19 June, 2025;
originally announced June 2025.
-
Can Large Language Models Be Trusted Paper Reviewers? A Feasibility Study
Authors:
Chuanlei Li,
Xu Hu,
Minghui Xu,
Kun Li,
Yue Zhang,
Xiuzhen Cheng
Abstract:
Academic paper review typically requires substantial time, expertise, and human resources. Large Language Models (LLMs) present a promising method for automating the review process due to their extensive training data, broad knowledge base, and relatively low usage cost. This work explores the feasibility of using LLMs for academic paper review by proposing an automated review system. The system integrates Retrieval Augmented Generation (RAG), the AutoGen multi-agent system, and Chain-of-Thought prompting to support tasks such as format checking, standardized evaluation, comment generation, and scoring. Experiments conducted on 290 submissions from the WASA 2024 conference using GPT-4o show that LLM-based review significantly reduces review time (average 2.48 hours) and cost (average $104.28 USD). However, the similarity between LLM-selected papers and actual accepted papers remains low (average 38.6%), indicating issues such as hallucination, lack of independent judgment, and retrieval preferences. Therefore, it is recommended to use LLMs as assistive tools to support human reviewers, rather than to replace them.
Submitted 18 June, 2025;
originally announced June 2025.
-
No Free Lunch: Rethinking Internal Feedback for LLM Reasoning
Authors:
Yanzhi Zhang,
Zhaoxi Zhang,
Haoxiang Guan,
Yilin Cheng,
Yitong Duan,
Chen Wang,
Yue Wang,
Shuxin Zheng,
Jiyan He
Abstract:
Reinforcement learning has emerged as a powerful paradigm for post-training large language models (LLMs) to improve reasoning. Approaches like Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) have shown strong results, but they require extensive external supervision. We investigate an alternative class of methods, Reinforcement Learning from Internal Feedback (RLIF), which relies solely on intrinsic model-derived signals instead of external rewards. In particular, we leverage unsupervised reward proxies such as token-level entropy, trajectory-level entropy, and self-certainty. Our theoretical analysis shows these internal objectives are partially equivalent, and we empirically evaluate various RLIF strategies on challenging math reasoning benchmarks. Experimental results demonstrate that RLIF can boost the reasoning performance of base LLMs in the early phase of training, matching or surpassing RLVR techniques on these tasks. However, as training progresses, performance degrades, even falling below that of the model before training. Moreover, we find that RLIF yields little improvement for instruction-tuned models, indicating diminishing returns of intrinsic feedback once an LLM is already instruction-tuned. We further analyze this limitation by mixing model weights and explain the reasons for RLIF's training behaviors, providing practical guidelines for integrating internal feedback signals into LLM training. We hope our analysis of internal feedback will inform more principled and effective strategies for LLM post-training.
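As a rough illustration of the token-level entropy proxy named in the abstract, the sketch below scores a generated sequence by the negative mean entropy of its per-token output distributions, so that more confident generations receive a higher intrinsic reward. The function names (`token_entropy`, `entropy_reward`) and the choice of negative mean entropy are assumptions made for illustration; the abstract does not specify the paper's exact reward definition.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of the distribution implied by one token's logits."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_reward(step_logits):
    """Hypothetical intrinsic reward: negative mean token-level entropy.

    step_logits holds one list of logits per generated token.
    Lower entropy (a more confident model) gives a higher reward.
    """
    entropies = [token_entropy(logits) for logits in step_logits]
    return -sum(entropies) / len(entropies)

if __name__ == "__main__":
    confident = [[10.0, 0.0, 0.0], [8.0, 0.0, 0.0]]
    uncertain = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]]
    print(entropy_reward(confident) > entropy_reward(uncertain))
```

Because such a reward depends only on the model's own confidence, it can be maximized without the answers being correct, which is one plausible reading of why the abstract reports degradation later in training.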
Submitted 25 June, 2025; v1 submitted 20 June, 2025;
originally announced June 2025.
-
MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation
Authors:
Shoubin Yu,
Yue Zhang,
Ziyang Wang,
Jaehong Yoon,
Mohit Bansal
Abstract:
Combining pre-trained expert models offers substantial potential for scalable multimodal reasoning, but building a unified framework remains challenging due to the increasing diversity of input modalities and task complexity. For instance, medical diagnosis requires precise reasoning over structured clinical tables, while financial forecasting depends on interpreting plot-based data to make informed predictions. To tackle this challenge, we introduce MEXA, a training-free framework that performs modality- and task-aware aggregation of multiple expert models to enable effective multimodal reasoning across diverse and distinct domains. MEXA dynamically selects expert models based on the input modality and the task-specific reasoning demands (i.e., skills). Each expert model, specialized in a modality task pair, generates interpretable textual reasoning outputs. MEXA then aggregates and reasons over these outputs using a Large Reasoning Model (LRM) to produce the final answer. This modular design allows flexible and transparent multimodal reasoning across diverse domains without additional training overhead. We extensively evaluate our approach on diverse multimodal benchmarks, including Video Reasoning, Audio Reasoning, 3D Understanding, and Medical QA. MEXA consistently delivers performance improvements over strong multimodal baselines, highlighting the effectiveness and broad applicability of our expert-driven selection and aggregation in diverse multimodal reasoning tasks.
Submitted 20 June, 2025;
originally announced June 2025.
-
Quantum Optimization for Software Engineering: A Survey
Authors:
Man Zhang,
Yuechen Li,
Tao Yue,
Kai-Yuan Cai
Abstract:
Quantum computing, particularly in the area of quantum optimization, is steadily progressing toward practical applications, supported by an expanding range of hardware platforms and simulators. While Software Engineering (SE) optimization has a strong foundation, which is exemplified by the active Search-Based Software Engineering (SBSE) community and numerous classical optimization methods, the growing complexity of modern software systems and their engineering processes demands innovative solutions. This Systematic Literature Review (SLR) focuses specifically on studying the literature that applies quantum or quantum-inspired algorithms to solve classical SE optimization problems. We examine 77 primary studies selected from an initial pool of 2083 publications obtained through systematic searches of six digital databases using carefully crafted search strings. Our findings reveal concentrated research efforts in areas such as SE operations and software testing, while exposing significant gaps across other SE activities. Additionally, the SLR uncovers relevant works published outside traditional SE venues, underscoring the necessity of this comprehensive review. Overall, our study provides a broad overview of the research landscape, empowering the SBSE community to leverage quantum advancements in addressing next-generation SE challenges.
Submitted 20 June, 2025;
originally announced June 2025.
-
LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation
Authors:
Tongtian Yue,
Longteng Guo,
Yepeng Tang,
Zijia Zhao,
Xinxin Zhu,
Hua Huang,
Jing Liu
Abstract:
Despite the impressive advancements of Large Vision-Language Models (LVLMs), existing approaches suffer from a fundamental bottleneck: inefficient visual-language integration. Current methods either disrupt the model's inherent structure or introduce severe long-context computational burden, severely limiting scalability and efficiency. In this paper, we rethink multimodal integration and present LaVi, a novel LVLM that enables seamless and efficient vision-language fusion through internal feature modulation within the Large Language Models (LLMs). Unlike dominant LVLMs that rely on visual token concatenation, LaVi bypasses long-context expansion by introducing a lightweight and adaptive transformation, which incorporates visual context by injecting token-wise vision-conditioned deltas into the affine parameters of layer normalization. This mechanism directly modulates linguistic hidden states based on visual input, ensuring precise vision-language alignment while preserving the LLM's linguistic priors and drastically reducing computational costs. Extensive evaluations across 15 image and video benchmarks demonstrate that LaVi not only achieves state-of-the-art multimodal performance but also dramatically enhances efficiency. Compared to LLaVA-OV-7B, LaVi reduces FLOPs by 94.0%, improves inference speed by 3.1 times, and cuts memory usage in half - establishing LaVi as a scalable and practical solution for real-time multimodal reasoning. The code and models will be released soon.
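The modulation described in the abstract, adding vision-conditioned deltas to the affine parameters of layer normalization, can be sketched for a single token as follows. This is a minimal pure-Python illustration under stated assumptions: the names `layer_norm` and `modulated_layer_norm` and the delta arguments are hypothetical, and how LaVi actually computes the deltas from visual features is not given in the abstract.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Standard layer normalization of one token's hidden vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

def modulated_layer_norm(x, gamma, beta, d_gamma, d_beta, eps=1e-5):
    """Layer norm whose scale and shift are offset by per-token deltas.

    d_gamma and d_beta stand in for vision-conditioned deltas (assumed
    to be produced elsewhere from visual features). With zero deltas
    this reduces to the ordinary layer norm.
    """
    scale = [g + dg for g, dg in zip(gamma, d_gamma)]
    shift = [b + db for b, db in zip(beta, d_beta)]
    return layer_norm(x, scale, shift, eps)

if __name__ == "__main__":
    hidden = [1.0, 2.0, 3.0, 4.0]
    gamma, beta = [1.0] * 4, [0.0] * 4
    print(modulated_layer_norm(hidden, gamma, beta, [0.1] * 4, [0.5] * 4))
```

In this formulation the visual signal changes only the normalization parameters, so the sequence length is unchanged, which is consistent with the abstract's claim of avoiding long-context expansion from visual token concatenation.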
Submitted 19 June, 2025;
originally announced June 2025.
-
DepthVanish: Optimizing Adversarial Interval Structures for Stereo-Depth-Invisible Patches
Authors:
Yun Xing,
Yue Cao,
Nhat Chung,
Jie Zhang,
Ivor Tsang,
Ming-Ming Cheng,
Yang Liu,
Lei Ma,
Qing Guo
Abstract:
Stereo Depth estimation is a critical task in autonomous driving and robotics, where inaccuracies (such as misidentifying nearby objects as distant) can lead to dangerous situations. Adversarial attacks against stereo depth estimation can help reveal vulnerabilities before deployment. Previous work has shown that repeating optimized textures can effectively mislead stereo depth estimation in digital settings. However, our research reveals that these naively repeated texture structures perform poorly in physical-world implementations, i.e., when deployed as patches, limiting their practical utility for testing stereo depth estimation systems. In this work, for the first time, we discover that introducing regular intervals between repeated textures, creating a striped structure, significantly enhances the patch attack effectiveness. Through extensive experimentation, we analyze how variations of this novel structure influence the performance. Based on these insights, we develop a novel stereo depth attack that jointly optimizes both the striped structure and texture elements. Our generated adversarial patches can be inserted into any scenes and successfully attack state-of-the-art stereo depth estimation methods, i.e., RAFT-Stereo and STTR. Most critically, our patch can also attack commercial RGB-D cameras (Intel RealSense) in real-world conditions, demonstrating their practical relevance for security assessment of stereo systems.
Submitted 19 June, 2025;
originally announced June 2025.
-
A Simple Contrastive Framework Of Item Tokenization For Generative Recommendation
Authors:
Penglong Zhai,
Yifang Yuan,
Fanyi Di,
Jie Li,
Yue Liu,
Chen Li,
Jie Huang,
Sicong Wang,
Yao Xu,
Xin Li
Abstract:
Generative retrieval-based recommendation has emerged as a promising paradigm aiming at directly generating the identifiers of the target candidates. However, in large-scale recommendation systems, this approach becomes increasingly cumbersome due to the redundancy and sheer scale of the token space. To overcome these limitations, recent research has explored the use of semantic tokens as an alternative to ID tokens, typically leveraging reconstruction-based strategies, like RQ-VAE, to quantize content embeddings and significantly reduce the embedding size. However, reconstructive quantization aims for the precise reconstruction of each item embedding independently, which conflicts with the goal of generative retrieval tasks, which focus more on differentiating among items. Moreover, multi-modal side information of items, such as descriptive text and images or geographical knowledge in location-based recommendation services, has been shown to be effective in improving recommendations by providing richer contexts for interactions. Nevertheless, effectively integrating such complementary knowledge into existing generative recommendation frameworks remains challenging. To overcome these challenges, we propose a novel unsupervised deep quantization approach based exclusively on contrastive learning, named SimCIT (a Simple Contrastive Item Tokenization framework). Specifically, different from existing reconstruction-based strategies, SimCIT proposes to use a learnable residual quantization module to align with the signals from different modalities of the items, which combines multi-modal knowledge alignment and semantic tokenization in a mutually beneficial contrastive learning framework. Extensive experiments across public datasets and a large-scale industrial dataset from various domains demonstrate SimCIT's effectiveness in LLM-based generative recommendation.
Submitted 19 June, 2025;
originally announced June 2025.
-
Category-based Galaxy Image Generation via Diffusion Models
Authors:
Xingzhong Fan,
Hongming Tang,
Yue Zeng,
M. B. N. Kouwenhoven,
Guangquan Zeng
Abstract:
Conventional galaxy generation methods rely on semi-analytical models and hydrodynamic simulations, which are highly dependent on physical assumptions and parameter tuning. In contrast, data-driven generative models do not have explicit physical parameters pre-determined, and instead learn them efficiently from observational data, making them an alternative approach to galaxy generation. Among these, diffusion models outperform Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) in quality and diversity. Incorporating physical prior knowledge into these models can further enhance their capabilities. In this work, we present GalCatDiff, the first framework in astronomy to leverage both galaxy image features and astrophysical properties in the network design of diffusion models. GalCatDiff incorporates an enhanced U-Net and a novel block entitled Astro-RAB (Residual Attention Block), which dynamically combines attention mechanisms with convolution operations to ensure global consistency and local feature fidelity. Moreover, GalCatDiff uses category embeddings for class-specific galaxy generation, avoiding the high computational costs of training separate models for each category. Our experimental results demonstrate that GalCatDiff significantly outperforms existing methods in terms of the consistency of sample color and size distributions, and the generated galaxies are both visually realistic and physically consistent. This framework will enhance the reliability of galaxy simulations and can potentially serve as a data augmentor to support future galaxy classification algorithm development.
Submitted 19 June, 2025;
originally announced June 2025.
-
From Coarse to Continuous: Progressive Refinement Implicit Neural Representation for Motion-Robust Anisotropic MRI Reconstruction
Authors:
Zhenxuan Zhang,
Lipei Zhang,
Yanqi Cheng,
Zi Wang,
Fanwen Wang,
Haosen Zhang,
Yue Yang,
Yinzhe Wu,
Jiahao Huang,
Angelica I Aviles-Rivero,
Zhifan Gao,
Guang Yang,
Peter J. Lally
Abstract:
In motion-robust magnetic resonance imaging (MRI), slice-to-volume reconstruction is critical for recovering anatomically consistent 3D brain volumes from 2D slices, especially under accelerated acquisitions or patient motion. However, this task remains challenging due to hierarchical structural disruptions. It includes local detail loss from k-space undersampling, global structural aliasing caused by motion, and volumetric anisotropy. Therefore, we propose a progressive refinement implicit neural representation (PR-INR) framework. Our PR-INR unifies motion correction, structural refinement, and volumetric synthesis within a geometry-aware coordinate space. Specifically, a motion-aware diffusion module is first employed to generate coarse volumetric reconstructions that suppress motion artifacts and preserve global anatomical structures. Then, we introduce an implicit detail restoration module that performs residual refinement by aligning spatial coordinates with visual features. It corrects local structures and enhances boundary precision. Further, a voxel continuous-aware representation module represents the image as a continuous function over 3D coordinates. It enables accurate inter-slice completion and high-frequency detail recovery. We evaluate PR-INR on five public MRI datasets under various motion conditions (3% and 5% displacement), undersampling rates (4x and 8x) and slice resolutions (scale = 5). Experimental results demonstrate that PR-INR outperforms state-of-the-art methods in both quantitative reconstruction metrics and visual quality. It further shows generalization and robustness across diverse unseen domains.
Submitted 24 June, 2025; v1 submitted 19 June, 2025;
originally announced June 2025.
-
Finite Thickness Effects on Metallization Vs. Chiral Majorana Fermions
Authors:
Xin Yue,
Guo-Jian Qiao,
C. P. Sun
Abstract:
In heterostructures composed of quantum anomalous Hall insulators and \textit{s}-wave superconductors (SCs), metallization hinders the identification of chiral Majorana fermions (CMFs). In this Letter, we study how the thickness of SC affects the competition between metallization and CMFs by a holistic approach previously developed for hybrid nanowire systems [Phys. Rev. Lett. 133, 266605 (2024)]. We predict three types of structures that vary with thickness of SC: (i) Periodic structure of metallization. For thin SCs ($\sim$10\,nm), the metallization region exhibits oscillations as the thickness of SC changes, with the oscillation period corresponding to the Fermi wavelength of SC. (ii) Periodic structure of CMFs. For intermediate thicknesses ($\sim$100\,nm), the window width for observing CMFs exhibits a periodic behavior, oscillating with the same period. (iii) Stable structure of CMFs. For thick SCs ($\sim$1000\,nm), the behavior of CMFs becomes uniform as the thickness varies. Optimizing the thickness of SC may thus improve data quality and provide clearer evidence for CMFs.
Submitted 19 June, 2025;
originally announced June 2025.
-
Aptamer-protein interaction prediction model based on transformer
Authors:
Zhichao Yan,
Yue Kang,
Buyong Ma
Abstract:
Aptamers are single-stranded DNA/RNAs or short peptides with unique tertiary structures that selectively bind to specific targets. They have great potential in the detection and medical fields. Here, we present SelfTrans-Ensemble, a deep learning model that integrates sequence information models and structural information models to extract multi-scale features for predicting aptamer-protein interactions (APIs). The model employs two pre-trained models, ProtBert and RNA-FM, to encode protein and aptamer sequences, along with features generated from primary sequence and secondary structural information. To address the imbalance in the aptamer dataset, we incorporated short RNA-protein interaction data in the training set. This resulted in a training accuracy of 98.9% and a test accuracy of 88.0%, demonstrating the model's effectiveness in accurately predicting APIs. Additionally, analysis using molecular simulation indicated that SelfTrans-Ensemble is sensitive to aptamer sequence mutations. We anticipate that SelfTrans-Ensemble can offer a more efficient and rapid process for aptamer screening.
Submitted 19 June, 2025;
originally announced June 2025.
-
DualTHOR: A Dual-Arm Humanoid Simulation Platform for Contingency-Aware Planning
Authors:
Boyu Li,
Siyuan He,
Hang Xu,
Haoqi Yuan,
Yu Zang,
Liwei Hu,
Junpeng Yue,
Zhenxiong Jiang,
Pengbo Hu,
Börje F. Karlsson,
Yehui Tang,
Zongqing Lu
Abstract:
Developing embodied agents capable of performing complex interactive tasks in real-world scenarios remains a fundamental challenge in embodied AI. Although recent advances in simulation platforms have greatly enhanced task diversity to train embodied Vision Language Models (VLMs), most platforms rely on simplified robot morphologies and bypass the stochastic nature of low-level execution, which limits their transferability to real-world robots. To address these issues, we present a physics-based simulation platform DualTHOR for complex dual-arm humanoid robots, built upon an extended version of AI2-THOR. Our simulator includes real-world robot assets, a task suite for dual-arm collaboration, and inverse kinematics solvers for humanoid robots. We also introduce a contingency mechanism that incorporates potential failures through physics-based low-level execution, bridging the gap to real-world scenarios. Our simulator enables a more comprehensive evaluation of the robustness and generalization of VLMs in household environments. Extensive evaluations reveal that current VLMs struggle with dual-arm coordination and exhibit limited robustness in realistic environments with contingencies, highlighting the importance of using our simulator to develop more capable VLMs for embodied tasks. The code is available at https://github.com/ds199895/DualTHOR.git.
Submitted 19 June, 2025;
originally announced June 2025.
-
Measurements of the absolute branching fractions of the doubly Cabibbo-suppressed decays $D^+\to K^+π^0$, $D^+\to K^+η$ and $D^+\to K^+η^{\prime}$
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere,
A. Brueggemann,
H. Cai,
et al. (697 additional authors not shown)
Abstract:
Using $20.3\,\rm fb^{-1}$ of $e^+e^-$ collision data collected at a center-of-mass energy of 3.773 GeV with the BESIII detector, we present improved measurements of the absolute branching fractions of the doubly Cabibbo-suppressed decays $D^+\to K^+π^0$, $D^+\to K^+η$ and $D^+\to K^+η^{\prime}$ with the double-tag method. The statistical significance of each signal decay exceeds $10σ$. The branching fractions are determined to be ${\mathcal B}(D^+\to K^+ π^0) = (1.45 \pm 0.06 \pm 0.06)\times 10^{-4}$, ${\mathcal B}(D^+\to K^+ η) = (1.17 \pm 0.10 \pm 0.03)\times 10^{-4}$ and ${\mathcal B}(D^+\to K^+ η^{\prime}) = (1.88 \pm 0.15 \pm 0.06)\times 10^{-4}$, where the first uncertainties are statistical and the second systematic. These results are consistent with the world average values but with significantly improved precision.
Submitted 18 June, 2025;
originally announced June 2025.
-
Symmetry in Multi-Qubit Correlated Noise Errors Enhances Surface Code Thresholds
Authors:
SiYing Wang,
Yue Yan,
ZhiXin Xia,
Xiang-Bin Wang
Abstract:
Surface codes are promising for practical quantum error correction due to their high threshold and experimental feasibility. However, their performance under realistic noise conditions, particularly those involving correlated errors, requires further investigation. In this study, we investigate the impact of correlated errors on the error threshold. In particular, we focus on several distinct types of correlated errors that could potentially arise from next-nearest-neighbor (NNN) coupling in quantum systems. We present the analytical threshold of the surface code under these types of correlated noise, and find that errors correlated along straight lines possess a crucial symmetry, resulting in higher thresholds than other types of correlated errors. This deepens our insight into the threshold of the surface code and hence facilitates a more robust design of quantum circuits with a higher noise threshold.
Submitted 18 June, 2025;
originally announced June 2025.