Search | arXiv e-print repository

doi 10.4271/2025-01-8343

Real-time Terrain Analysis for Off-road Autonomous Vehicles

Authors: Edwina Lewis, Aditya Parameshwaran, Laura Redmond, Yue Wang

Abstract: This research addresses critical autonomous vehicle control challenges arising from road roughness variation, which induces course deviations and potential loss of road contact during steering operations. We present a novel real-time road roughness estimation system employing Bayesian calibration methodology that processes axle accelerations to predict terrain roughness with quantifiable confidence measures. The technical framework integrates a Gaussian process surrogate model with a simulated half-vehicle model, systematically processing vehicle velocity and road surface roughness parameters to generate corresponding axle acceleration responses. The Bayesian calibration routine performs inverse estimation of road roughness from observed accelerations and velocities, yielding posterior distributions that quantify prediction uncertainty for adaptive risk management. Training data generation utilizes Latin Hypercube sampling across comprehensive velocity and roughness parameter spaces, while the calibrated model integrates seamlessly with a Simplex controller architecture to dynamically adjust velocity limits based on real-time roughness predictions. Experimental validation on stochastically generated surfaces featuring varying roughness regions demonstrates robust real-time characterization capabilities, with the integrated Simplex control strategy effectively enhancing autonomous vehicle operational safety through proactive surface condition response.
This innovative Bayesian framework establishes a comprehensive foundation for mitigating roughness-related operational risks while simultaneously improving efficiency and safety margins in autonomous vehicle systems.

Submitted 26 June, 2025; originally announced June 2025.

Journal ref: SAE Technical Papers 2025-01-8343
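
Illustrative note: the abstract above mentions Latin Hypercube sampling over velocity and roughness but gives no details. The following is a minimal sketch of the general technique only, not the authors' code; the function name, the parameter ranges, and the sample count are all assumptions made for illustration.

```python
import random


def latin_hypercube(n_samples, bounds, seed=0):
    """Draw n_samples points so that, in every dimension, each of the
    n_samples equal-width strata contains exactly one point."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One uniformly random point inside each stratum of the unit interval.
        unit = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # Shuffle so strata are paired at random across dimensions.
        rng.shuffle(unit)
        columns.append([low + u * (high - low) for u in unit])
    return list(zip(*columns))


# Hypothetical ranges: velocity in m/s and a unitless roughness coefficient.
samples = latin_hypercube(8, [(1.0, 20.0), (0.0, 1.0)])
for velocity, roughness in samples:
    print(f"velocity={velocity:6.2f}  roughness={roughness:5.3f}")
```

Compared with plain random sampling, this stratification guarantees that every part of each parameter range is represented even with few samples, which is the usual reason for choosing it when generating surrogate-model training data.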

arXiv:2506.21289 [pdf, ps, other]

Dynamic Focusing to Suppress Emittance Transfer in Crab-Crossing Flat Beam Collisions

Authors: Derong Xu, J Scott Berg, Michael M Blaskiewicz, Yue Hao, Yun Luo, Christoph Montag, Sergei Nagaitsev, Boris Podobedov, Vadim Ptitsyn, Ferdinand Willeke, Binping Xiao

Abstract: Flat hadron beam collisions, though expected to enhance peak luminosity by about an order of magnitude, have not yet been demonstrated. Our study reveals a critical limitation: realistic fluctuations, when amplified by synchro-betatron resonance, lead to transverse emittance transfer in flat-beam collisions. Using beam-beam simulations based on Electron-Ion Collider design parameters, we show that this effect leads to vertical emittance growth, which can distort the flat-beam profile and degrade luminosity. We propose a dynamic focusing scheme that combines sextupoles with crab cavities to suppress the hourglass-induced resonance. This approach increases tolerance to fluctuations and improves the robustness of flat-beam collisions. This practical mitigation facilitates the adoption of flat-beam collisions in next-generation lepton-hadron colliders.

Submitted 26 June, 2025; originally announced June 2025.

Comments: 5 figures

arXiv:2506.21030 [pdf, ps, other]

STEP Planner: Constructing cross-hierarchical subgoal tree as an embodied long-horizon task planner

Authors: Zhou Tianxing, Wang Zhirui, Ao Haojia, Chen Guangyan, Xing Boyang, Cheng Jingwen, Yang Yi, Yue Yufeng

Abstract: The ability to perform reliable long-horizon task planning is crucial for deploying robots in real-world environments. However, directly employing Large Language Models (LLMs) as action sequence generators often results in low success rates due to their limited reasoning ability for long-horizon embodied tasks. In the STEP framework, we construct a subgoal tree through a pair of closed-loop models: a subgoal decomposition model and a leaf node termination model. Within this framework, we develop a hierarchical tree structure that spans from coarse to fine resolutions. The subgoal decomposition model leverages a foundation LLM to break down complex goals into manageable subgoals, thereby spanning the subgoal tree. The leaf node termination model provides real-time feedback based on environmental states, determining when to terminate the tree spanning and ensuring each leaf node can be directly converted into a primitive action. Experiments conducted in both the VirtualHome WAH-NL benchmark and on real robots demonstrate that STEP achieves long-horizon embodied task completion with success rates up to 34% (WAH-NL) and 25% (real robot), outperforming SOTA methods.

Submitted 26 June, 2025; originally announced June 2025.

arXiv:2506.21019 [pdf, ps, other]

doi 10.3390/universe11060178

Statistical Strong Lensing as a Test of Conformal Gravity

Authors: Li-Xue Yue, Da-Ming Chen

Abstract: As an alternative gravitational theory to General Relativity (GR), Conformal Gravity (CG) can be verified through astronomical observations. Currently, Mannheim and Kazanas have provided vacuum solutions for cosmological and local gravitational systems, and these solutions may resolve the dark matter and dark energy issues encountered in GR, making them particularly valuable. For static, spherically symmetric systems, CG predicts an additional linear potential generated by luminous matter in addition to the conventional Newtonian potential. This extra potential is expected to account for the observations of galaxies and galaxy clusters without the need for dark matter. It is characterized by the parameter $γ^*$, which corresponds to the linear potential generated by the unit of the solar mass, and it is thus a universal constant. The value of $γ^*$ was determined by fitting the rotation curve data of spiral galaxies. These predictions of CG should also be verified by the observations of strong gravitational lensing. In this study, building upon the previous research, we tested CG via strong lensing statistics. We used a well-defined sample that consisted of both galaxies and galaxy clusters. This allowed us to test CG through statistical strong lensing in a way similar to the conventional approach in GR. As anticipated, our results were consistent with previous studies, namely that the fitted $γ^*$ is much larger than that from rotation curves.
Intriguingly, we further discovered that, in order to fit the strong lensing data of another sample, the value of $γ^*$ cannot be a constant, as is required in CG. Instead, we derived a formula for $γ^*$ as a function of the stellar mass $M_*$ of the galaxies or galaxy clusters. It was found that $γ^*$ decreases as $M_*$ increases.

Submitted 26 June, 2025; originally announced June 2025.

Comments: 19 pages, 3 figures, 2 tables, Published in Universe Journal

Journal ref: Universe 2025, 11(6), 178

arXiv:2506.20406 [pdf, ps, other]

POLAR: A Pessimistic Model-based Policy Learning Algorithm for Dynamic Treatment Regimes

Authors: Ruijia Zhang, Zhengling Qi, Yue Wu, Xiangyu Zhang, Yanxun Xu

Abstract: Dynamic treatment regimes (DTRs) provide a principled framework for optimizing sequential decision-making in domains where decisions must adapt over time in response to individual trajectories, such as healthcare, education, and digital interventions. However, existing statistical methods often rely on strong positivity assumptions and lack robustness under partial data coverage, while offline reinforcement learning approaches typically focus on average training performance, lack statistical guarantees, and require solving complex optimization problems. To address these challenges, we propose POLAR, a novel pessimistic model-based policy learning algorithm for offline DTR optimization. POLAR estimates the transition dynamics from offline data and quantifies uncertainty for each history-action pair. A pessimistic penalty is then incorporated into the reward function to discourage actions with high uncertainty. Unlike many existing methods that focus on average training performance, POLAR directly targets the suboptimality of the final learned policy and offers theoretical guarantees, without relying on computationally intensive minimax or constrained optimization procedures. To the best of our knowledge, POLAR is the first model-based DTR method to provide both statistical and computational guarantees, including finite-sample bounds on policy suboptimality. Empirical results on both synthetic data and the MIMIC-III dataset demonstrate that POLAR outperforms state-of-the-art methods and yields near-optimal, history-aware treatment strategies.

Submitted 25 June, 2025; originally announced June 2025.
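
Illustrative note: the abstract above describes subtracting a pessimistic penalty from the reward for history-action pairs with high uncertainty, without saying how the uncertainty is quantified. The sketch below shows the general idea only. Using the spread of an ensemble of reward predictions as the uncertainty measure is an assumption made here, not POLAR's stated method, and all names and numbers are hypothetical.

```python
from statistics import mean, pstdev


def pessimistic_reward(reward_predictions, penalty_weight=1.0):
    """Mean predicted reward minus a penalty proportional to the
    disagreement (population standard deviation) among the predictions."""
    return mean(reward_predictions) - penalty_weight * pstdev(reward_predictions)


# Hypothetical ensemble predictions for two candidate actions.
candidates = {
    "well_covered": [1.0, 1.1, 0.9, 1.0],     # models agree: ample offline data
    "poorly_covered": [2.5, 0.2, 1.8, -0.1],  # models disagree: little data support
}

# Choose the action with the highest penalized reward.
best = max(candidates, key=lambda name: pessimistic_reward(candidates[name]))
print(best)
```

In this toy case the poorly covered action has the higher mean prediction, but the penalty for disagreement makes the well-covered action the preferred one, which is the behavior a pessimistic penalty is meant to produce.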

arXiv:2506.20330 [pdf, ps, other]

doi 10.1145/3539618.3591863

Semantic-enhanced Modality-asymmetric Retrieval for Online E-commerce Search

Authors: Zhigong Zhou, Ning Ding, Xiaochuan Fan, Yue Shang, Yiming Qiu, Jingwei Zhuo, Zhiwei Ge, Songlin Wang, Lin Liu, Sulong Xu, Han Zhang

Abstract: Semantic retrieval, which retrieves semantically matched items given a textual query, has been an essential component to enhance system effectiveness in e-commerce search. In this paper, we study the multimodal retrieval problem, where the visual information (e.g., image) of an item is leveraged as a supplement to textual information to enrich item representation and further improve retrieval performance. Though learning from cross-modality data has been studied extensively in tasks such as visual question answering or media summarization, multimodal retrieval remains a non-trivial and unsolved problem, especially in the asymmetric scenario where the query is unimodal while the item is multimodal. In this paper, we propose a novel model named SMAR, which stands for Semantic-enhanced Modality-Asymmetric Retrieval, to tackle the problem of modality fusion and alignment in this kind of asymmetric scenario. Extensive experimental results on an industrial dataset show that the proposed model outperforms baseline models significantly in retrieval accuracy. We have open sourced our industrial dataset for the sake of reproducibility and future research works.

Submitted 25 June, 2025; originally announced June 2025.

Comments: Published in SIGIR 2023

arXiv:2506.20225 [pdf]

doi 10.1016/j.joi.2025.101663

The role of preprints in open science: Accelerating knowledge transfer from science to technology

Authors: Zhiqi Wang, Yue Chen, Chun Yang

Abstract: Preprints have become increasingly essential in the landscape of open science, facilitating not only the exchange of knowledge within the scientific community but also bridging the gap between science and technology. However, the impact of preprints on technological innovation, given their unreviewed nature, remains unclear. This study fills this gap by conducting a comprehensive scientometric analysis of patent citations to bioRxiv preprints submitted between 2013 and 2021, measuring and assessing the contribution of preprints in accelerating knowledge transfer from science to technology. Our findings reveal a growing trend of patent citations to bioRxiv preprints, with a notable surge in 2020, primarily driven by the COVID-19 pandemic. Preprints play a critical role in accelerating innovation: they not only expedite the dissemination of scientific knowledge into technological innovation but also enhance the visibility of early research results in the patenting process, while journals remain essential for academic rigor and reliability. The substantial number of post-online-publication patent citations highlights the critical role of the open science model, particularly the "open access" effect of preprints, in amplifying the impact of science on technological innovation.
This study provides empirical evidence that open science policies encouraging the early sharing of research outputs, such as preprints, contribute to more efficient linkage between science and technology, suggesting an acceleration in the pace of innovation, higher innovation quality, and economic benefits.

Submitted 26 June, 2025; v1 submitted 25 June, 2025; originally announced June 2025.

Comments: Accepted manuscript for publication in Journal of Informetrics. The final version is available at DOI:10.1016/j.joi.2025.101663

Journal ref: Journal of Informetrics (2025)

arXiv:2506.20167 [pdf, ps, other]

SEED: A Structural Encoder for Embedding-Driven Decoding in Time Series Prediction with LLMs

Authors: Fengze Li, Yue Wang, Yangle Liu, Ming Huang, Dou Hong, Jieming Ma

Abstract: Multivariate time series forecasting requires models to simultaneously capture variable-wise structural dependencies and generalize across diverse tasks. While structural encoders are effective in modeling feature interactions, they lack the capacity to support semantic-level reasoning or task adaptation. Conversely, large language models (LLMs) possess strong generalization capabilities but remain incompatible with raw time series inputs. This gap limits the development of unified, transferable prediction systems. Therefore, we introduce SEED, a structural encoder for embedding-driven decoding, which integrates four stages: a token-aware encoder for patch extraction, a projection module that aligns patches with language model embeddings, a semantic reprogramming mechanism that maps patches to task-aware prototypes, and a frozen language model for prediction. This modular architecture decouples representation learning from inference, enabling efficient alignment between numerical patterns and semantic reasoning. Empirical results demonstrate that the proposed method achieves consistent improvements over strong baselines, and comparative studies on various datasets confirm SEED's role in addressing the structural-semantic modeling gap.

Submitted 25 June, 2025; originally announced June 2025.

arXiv:2506.19918 [pdf, ps, other]

QCD Axion Domain Walls from Super-Cooling First Order Phase Transition

Authors: Kun-Feng Lyu, Yue Zhao

Abstract: The QCD axion is a well-motivated hypothetical particle beyond the Standard Model (SM) and a compelling dark matter candidate. Its relic abundance is highly sensitive to the thermal history of the universe when the temperature is around the QCD confinement scale. Meanwhile, the NANOGrav Collaboration has reported evidence for a stochastic gravitational wave background, which could originate from a supercooled first-order phase transition (FOPT) with a nucleation temperature around the O(MeV-GeV) scale. We explore how such an FOPT might alter the evolution of the QCD axion. Our findings suggest that it could induce the axion to go through a short stage of mini kinetic misalignment. Moreover, in some parameter regime, the formation of QCD axion domain walls becomes generically expected. This has intriguing implications for both the existence of the QCD axion and the FOPT interpretation of the NANOGrav signal.

Submitted 24 June, 2025; originally announced June 2025.

Comments: 7 pages, 6 figures

arXiv:2506.19694 [pdf, ps, other]

UltraAD: Fine-Grained Ultrasound Anomaly Classification via Few-Shot CLIP Adaptation

Authors: Yue Zhou, Yuan Bi, Wenjuan Tong, Wei Wang, Nassir Navab, Zhongliang Jiang

Abstract: Precise anomaly detection in medical images is critical for clinical decision-making. While recent unsupervised or semi-supervised anomaly detection methods trained on large-scale normal data show promising results, they lack fine-grained differentiation, such as benign vs. malignant tumors. Additionally, ultrasound (US) imaging is highly sensitive to devices and acquisition parameter variations, creating significant domain gaps in the resulting US images. To address these challenges, we propose UltraAD, a vision-language model (VLM)-based approach that leverages few-shot US examples for generalized anomaly localization and fine-grained classification. To enhance localization performance, the image-level token of query visual prototypes is first fused with learnable text embeddings. This image-informed prompt feature is then further integrated with patch-level tokens, refining local representations for improved accuracy. For fine-grained classification, a memory bank is constructed from few-shot image samples and corresponding text descriptions that capture anatomical and abnormality-specific features. During training, the stored text embeddings remain frozen, while image features are adapted to better align with medical data. UltraAD has been extensively evaluated on three breast US datasets, outperforming state-of-the-art methods in both lesion localization and fine-grained medical classification. The code will be released upon acceptance.

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.19563 [pdf, ps, other]

PrivacyXray: Detecting Privacy Breaches in LLMs through Semantic Consistency and Probability Certainty

Authors: Jinwen He, Yiyang Lu, Zijin Lin, Kai Chen, Yue Zhao

Abstract: Large Language Models (LLMs) are widely used in sensitive domains, including healthcare, finance, and legal services, raising concerns about potential private information leaks during inference. Privacy extraction attacks, such as jailbreaking, expose vulnerabilities in LLMs by crafting inputs that force the models to output sensitive information. However, these attacks cannot verify whether the extracted private information is accurate, as no public datasets exist for cross-validation, leaving a critical gap in private information detection during inference. To address this, we propose PrivacyXray, a novel framework detecting privacy breaches by analyzing LLM inner states. Our analysis reveals that LLMs exhibit higher semantic coherence and probabilistic certainty when generating correct private outputs. Based on this, PrivacyXray detects privacy breaches using four metrics: intra-layer and inter-layer semantic similarity, token-level and sentence-level probability distributions. PrivacyXray addresses critical challenges in private information detection by overcoming the lack of open-source private datasets and eliminating reliance on external data for validation. It achieves this through the synthesis of realistic private data and a detection mechanism based on the inner states of LLMs. Experiments show that PrivacyXray achieves consistent performance, with an average accuracy of 92.69% across five LLMs.
Compared to state-of-the-art methods, PrivacyXray achieves significant improvements, with an average accuracy increase of 20.06%, highlighting its stability and practical utility in real-world applications.

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.19474 [pdf, ps, other]

HMSViT: A Hierarchical Masked Self-Supervised Vision Transformer for Corneal Nerve Segmentation and Diabetic Neuropathy Diagnosis

Authors: Xin Zhang, Liangxiu Han, Yue Shi, Yanlin Zheng, Alam Uazman, Maryam Ferdousi, Rayaz Malik

Abstract: Diabetic Peripheral Neuropathy (DPN) affects nearly half of diabetes patients, requiring early detection. Corneal Confocal Microscopy (CCM) enables non-invasive diagnosis, but automated methods suffer from inefficient feature extraction, reliance on handcrafted priors, and data limitations. We propose HMSViT, a novel Hierarchical Masked Self-Supervised Vision Transformer designed for corneal nerve segmentation and DPN diagnosis. Unlike existing methods, HMSViT employs pooling-based hierarchical and dual attention mechanisms with absolute positional encoding, enabling efficient multi-scale feature extraction by capturing fine-grained local details in early layers and integrating global context in deeper layers, all at a lower computational cost. A block-masked self-supervised learning framework is designed for the HMSViT that reduces reliance on labelled data, enhancing feature robustness, while a multi-scale decoder is used for segmentation and classification by fusing hierarchical features. Experiments on clinical CCM datasets showed HMSViT achieves state-of-the-art performance, with 61.34% mIoU for nerve segmentation and 70.40% diagnostic accuracy, outperforming leading hierarchical models like the Swin Transformer and HiViT by margins of up to 6.39% in segmentation accuracy while using fewer parameters. Detailed ablation studies further reveal that integrating block-masked SSL with hierarchical multi-scale feature extraction substantially enhances performance compared to conventional supervised training.
Overall, these comprehensive experiments confirm that HMSViT delivers excellent, robust, and clinically viable results, demonstrating its potential for scalable deployment in real-world diagnostic applications.

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.19358 [pdf, ps, other]

From High-SNR Radar Signal to ECG: A Transfer Learning Model with Cardio-Focusing Algorithm for Scenarios with Limited Data

Authors: Yuanyuan Zhang, Haocheng Zhao, Sijie Xiong, Rui Yang, Eng Gee Lim, Yutao Yue

Abstract: Electrocardiogram (ECG), as a crucial fine-grained cardiac feature, has been successfully recovered from radar signals in the literature, but the performance heavily relies on the high-quality radar signal and numerous radar-ECG pairs for training, restricting the applications in new scenarios due to data scarcity. Therefore, this work will focus on radar-based ECG recovery in new scenarios with limited data and propose a cardio-focusing and -tracking (CFT) algorithm to precisely track the cardiac location to ensure an efficient acquisition of high-quality radar signals. Furthermore, a transfer learning model (RFcardi) is proposed to extract cardio-related information from the radar signal without ECG ground truth based on the intrinsic sparsity of cardiac features, and only a few synchronous radar-ECG pairs are required to fine-tune the pre-trained model for the ECG recovery. The experimental results reveal that the proposed CFT can dynamically identify the cardiac location, and the RFcardi model can effectively generate faithful ECG recoveries after using a small number of radar-ECG pairs for training. The code and dataset will be available after publication.

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.19324 [pdf, ps, other]

Memory-Augmented Incomplete Multimodal Survival Prediction via Cross-Slide and Gene-Attentive Hypergraph Learning

Authors: Mingcheng Qu, Guang Yang, Donglin Di, Yue Gao, Tonghua Su, Yang Song, Lei Fan

Abstract: Multimodal pathology-genomic analysis is critical for cancer survival prediction. However, existing approaches predominantly integrate formalin-fixed paraffin-embedded (FFPE) slides with genomic data, while neglecting the availability of other preservation slides, such as Fresh Frozen (FF) slides. Moreover, as the high-resolution spatial nature of pathology data tends to dominate the cross-modality fusion process, it hinders effective multimodal fusion and leads to modality imbalance challenges between pathology and genomics. These methods also typically require complete data modalities, limiting their clinical applicability with incomplete modalities, such as missing either pathology or genomic data. In this paper, we propose a multimodal survival prediction framework that leverages hypergraph learning to effectively integrate multi-WSI information and cross-modality interactions between pathology slides and genomics data while addressing modality imbalance. In addition, we introduce a memory mechanism that stores previously learned paired pathology-genomic features and dynamically compensates for incomplete modalities. Experiments on five TCGA datasets demonstrate that our model outperforms advanced methods by over 2.3% in C-Index. Under incomplete modality scenarios, our approach surpasses pathology-only (3.3%) and gene-only models (7.9%). Code: https://github.com/MCPathology/M2Surv

Submitted 24 June, 2025; originally announced June 2025.

Comments: Accepted by MICCAI 2025. Code: https://github.com/MCPathology/M2Surv
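
Illustrative note: the abstract above reports gains in C-Index, the concordance index commonly used to score survival models. The sketch below is a minimal version of Harrell's concordance index for readers unfamiliar with the metric. It is not taken from the M2Surv repository, and the toy data are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: among comparable pairs, the fraction in which the
    subject who failed earlier was assigned the higher predicted risk.
    Ties in predicted risk count as half-concordant."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has an observed event
            # (not censored) strictly before subject j's recorded time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical toy cohort: follow-up times, event flags (0 = censored), risk scores.
times = [5, 8, 12, 20]
events = [1, 0, 1, 1]
risks = [0.9, 0.4, 0.2, 0.5]
c_index = concordance_index(times, events, risks)
print(c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so differences of a few percentage points between models, as reported above, are differences in how often pairs of patients are ordered correctly.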

arXiv:2506.19288 [pdf, ps, other]

Da Yu: Towards USV-Based Image Captioning for Waterway Surveillance and Scene Understanding

Authors: Runwei Guan, Ningwei Ouyang, Tianhao Xu, Shaofeng Liang, Wei Dai, Yafeng Sun, Shang Gao, Songning Lai, Shanliang Yao, Xuming Hu, Ryan Wen Liu, Yutao Yue, Hui Xiong

Abstract: Automated waterway environment perception is crucial for enabling unmanned surface vessels (USVs) to understand their surroundings and make informed decisions. Most existing waterway perception models primarily focus on instance-level object perception paradigms (e.g., detection, segmentation). However, due to the complexity of waterway environments, current perception datasets and models fail to achieve global semantic understanding of waterways, limiting large-scale monitoring and structured log generation. With the advancement of vision-language models (VLMs), we leverage image captioning to introduce WaterCaption, the first captioning dataset specifically designed for waterway environments. WaterCaption focuses on fine-grained, multi-region long-text descriptions, providing a new research direction for visual geo-understanding and spatial scene cognition. Specifically, it includes 20.2k image-text pairs with a vocabulary size of 1.8 million. Additionally, we propose Da Yu, an edge-deployable multi-modal large language model for USVs, where we propose a novel vision-to-language projector called Nano Transformer Adaptor (NTA). NTA effectively balances computational efficiency with the capacity for both global and fine-grained local modeling of visual features, thereby significantly enhancing the model's ability to generate long-form textual outputs. Da Yu achieves an optimal balance between performance and efficiency, surpassing state-of-the-art models on WaterCaption and several other captioning benchmarks.