DMF2Mel: A Dynamic Multiscale Fusion Network for EEG-Driven Mel Spectrogram Reconstruction

Authors: Cunhang Fan, Sheng Zhang, Jingjing Zhang, Enrui Liu, Xinhui Li, Minggang Zhao, Zhao Lv

Abstract: Decoding speech from brain signals is a challenging research problem. Although existing technologies have made progress in reconstructing the mel spectrograms of auditory stimuli at the word or letter level, there remain core challenges in the precise reconstruction of minute-level continuous imagined speech: traditional models struggle to balance the efficiency of temporal dependency modeling and information retention in long-sequence decoding. To address this issue, this paper proposes the Dynamic Multiscale Fusion Network (DMF2Mel), which consists of four core components: the Dynamic Contrastive Feature Aggregation Module (DC-FAM), the Hierarchical Attention-Guided Multi-Scale Network (HAMS-Net), the SplineMap attention mechanism, and the bidirectional state space module (convMamba). Specifically, the DC-FAM separates speech-related "foreground features" from noisy "background features" through local convolution and global attention mechanisms, effectively suppressing interference and enhancing the representation of transient signals. HAMS-Net, based on the U-Net framework, achieves cross-scale fusion of high-level semantics and low-level details. The SplineMap attention mechanism integrates the Adaptive Gated Kolmogorov-Arnold Network (AGKAN) to combine global context modeling with spline-based local fitting. The convMamba captures long-range temporal dependencies with linear complexity and enhances nonlinear dynamic modeling capabilities.
Results on the SparrKULee dataset show that DMF2Mel achieves a Pearson correlation coefficient of 0.074 in mel spectrogram reconstruction for known subjects (a 48% improvement over the baseline) and 0.048 for unknown subjects (a 35% improvement over the baseline). Code is available at: https://github.com/fchest/DMF2Mel.
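For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a Pearson correlation between a reconstructed and a target mel spectrogram might be computed. It is not taken from the DMF2Mel code: the function names, the choice to average the coefficient over mel bands, and the toy numbers are all assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def mean_band_correlation(predicted, target):
    """Average the per-band Pearson correlation over all mel bands.

    Both arguments are lists of bands; each band is a list of frame values.
    Averaging over bands is an assumption, not the paper's stated protocol.
    """
    scores = [pearson(p, t) for p, t in zip(predicted, target)]
    return sum(scores) / len(scores)

# Toy data: 2 mel bands x 4 time frames (made-up numbers).
target = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 2.0, 1.0]]
predicted = [[2.0, 4.0, 6.0, 8.0], [1.0, 2.0, 1.0, 2.0]]
score = mean_band_correlation(predicted, target)
```

In the toy data the first band is perfectly correlated and the second perfectly anti-correlated, so the band-averaged score cancels out; real EEG-to-mel scores are typically small positive values like those quoted in the abstract.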

Submitted 10 July, 2025; originally announced July 2025.

Comments: Accepted by ACM MM 2025

arXiv:2507.06971 [pdf, ps, other]

Hallucinating 360°: Panoramic Street-View Generation via Local Scenes Diffusion and Probabilistic Prompting

Authors: Fei Teng, Kai Luo, Sheng Wu, Siyu Li, Pujun Guo, Jiale Wei, Kunyu Peng, Jiaming Zhang, Kailun Yang

Abstract: Panoramic perception holds significant potential for autonomous driving, enabling vehicles to acquire a comprehensive 360° surround view in a single shot. However, autonomous driving is a data-driven task. Complete panoramic data acquisition requires complex sampling systems and annotation pipelines, which are time-consuming and labor-intensive. Although existing street view generation models have demonstrated strong data regeneration capabilities, they can only learn from the fixed data distribution of existing datasets and cannot achieve high-quality, controllable panoramic generation. In this paper, we propose the first panoramic generation method Percep360 for autonomous driving. Percep360 enables coherent generation of panoramic data with control signals based on the stitched panoramic data. Percep360 focuses on two key aspects: coherence and controllability. Specifically, to overcome the inherent information loss caused by the pinhole sampling process, we propose the Local Scenes Diffusion Method (LSDM). LSDM reformulates the panorama generation as a spatially continuous diffusion process, bridging the gaps between different data distributions. Additionally, to achieve the controllable generation of panoramic images, we propose a Probabilistic Prompting Method (PPM). PPM dynamically selects the most relevant control cues, enabling controllable panoramic image generation.
We evaluate the effectiveness of the generated images from three perspectives: image quality assessment (i.e., no-reference and with reference), controllability, and their utility in real-world Bird's Eye View (BEV) segmentation. Notably, the generated data consistently outperforms the original stitched images in no-reference quality metrics and enhances downstream perception models. The source code will be publicly available at https://github.com/Bryant-Teng/Percep360.

Submitted 9 July, 2025; v1 submitted 9 July, 2025; originally announced July 2025.

Comments: The source code will be publicly available at https://github.com/Bryant-Teng/Percep360

arXiv:2507.05656 [pdf, ps, other]

ADPv2: A Hierarchical Histological Tissue Type-Annotated Dataset for Potential Biomarker Discovery of Colorectal Disease

Authors: Zhiyuan Yang, Kai Li, Sophia Ghamoshi Ramandi, Patricia Brassard, Hakim Khellaf, Vincent Quoc-Huy Trinh, Jennifer Zhang, Lina Chen, Corwyn Rowsell, Sonal Varma, Kostas Plataniotis, Mahdi S. Hosseini

Abstract: Computational pathology (CoPath) leverages histopathology images to enhance diagnostic precision and reproducibility in clinical pathology. However, publicly available datasets for CoPath that are annotated with extensive histological tissue type (HTT) taxonomies at a granular level remain scarce due to the significant expertise and high annotation costs required. Existing datasets, such as the Atlas of Digital Pathology (ADP), address this by offering diverse HTT annotations generalized to multiple organs, but limit the capability for in-depth studies on specific organ diseases. Building upon this foundation, we introduce ADPv2, a novel dataset focused on gastrointestinal histopathology. Our dataset comprises 20,004 image patches derived from healthy colon biopsy slides, annotated according to a hierarchical taxonomy of 32 distinct HTTs across 3 levels. Furthermore, we train a multilabel representation learning model following a two-stage training procedure on our ADPv2 dataset. We leverage the VMamba architecture and achieve a mean average precision (mAP) of 0.88 in multilabel classification of colon HTTs. Finally, we show that our dataset is capable of an organ-specific in-depth study for potential biomarker discovery by analyzing the model's prediction behavior on tissues affected by different colon diseases, which reveals statistical patterns that confirm the two pathological pathways of colon cancer development. Our dataset is publicly available at https://zenodo.org/records/15307021

Submitted 9 July, 2025; v1 submitted 8 July, 2025; originally announced July 2025.

ACM Class: I.2.10; I.2.1

arXiv:2507.05451 [pdf]

Self-supervised Deep Learning for Denoising in Ultrasound Microvascular Imaging

Authors: Lijie Huang, Jingyi Yin, Jingke Zhang, U-Wai Lok, Ryan M. DeRuiter, Jieyang Jin, Kate M. Knoll, Kendra E. Petersen, James D. Krier, Xiang-yang Zhu, Gina K. Hesley, Kathryn A. Robinson, Andrew J. Bentall, Thomas D. Atwell, Andrew D. Rule, Lilach O. Lerman, Shigao Chen, Chengwu Huang

Abstract: Ultrasound microvascular imaging (UMI) is often hindered by low signal-to-noise ratio (SNR), especially in contrast-free or deep tissue scenarios, which impairs subsequent vascular quantification and reliable disease diagnosis. To address this challenge, we propose Half-Angle-to-Half-Angle (HA2HA), a self-supervised denoising framework specifically designed for UMI. HA2HA constructs training pairs from complementary angular subsets of beamformed radio-frequency (RF) blood flow data, across which vascular signals remain consistent while noise varies. HA2HA was trained using in-vivo contrast-free pig kidney data and validated across diverse datasets, including contrast-free and contrast-enhanced data from pig kidneys, as well as human liver and kidney. An improvement exceeding 15 dB in both contrast-to-noise ratio (CNR) and SNR was observed, indicating a substantial enhancement in image quality. In addition to power Doppler imaging, denoising directly in the RF domain is also beneficial for other downstream processing such as color Doppler imaging (CDI). CDI results of human liver derived from the HA2HA-denoised signals exhibited improved microvascular flow visualization, with a suppressed noisy background. HA2HA offers a label-free, generalizable, and clinically applicable solution for robust vascular imaging in both contrast-free and contrast-enhanced UMI.

Submitted 7 July, 2025; originally announced July 2025.

Comments: 12 pages, 10 figures. Supplementary materials are available at https://zenodo.org/records/15832003

arXiv:2507.05177 [pdf, ps, other]

OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model

Authors: Chen Wang, Tianyu Peng, Wen Yang, Yinan Bai, Guangfu Wang, Jun Lin, Lanpeng Jia, Lingxiang Wu, Jinqiao Wang, Chengqing Zong, Jiajun Zhang

Abstract: Empathetic interaction is a cornerstone of human-machine communication, due to the need for understanding speech enriched with paralinguistic cues and generating emotional and expressive responses. However, the most powerful empathetic LSLMs are increasingly closed off, leaving the crucial details about the architecture, data and development opaque to researchers. Given the critical need for transparent research into LSLMs and empathetic behavior, we present OpenS2S, a fully open-source, transparent and end-to-end LSLM designed to enable empathetic speech interactions. Based on our empathetic speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved decoding architecture to achieve low-latency speech generation. To facilitate end-to-end training, OpenS2S incorporates an automated data construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues at low cost. By leveraging large language models to generate empathetic content and controllable text-to-speech systems to introduce speaker and emotional variation, we construct a scalable training corpus with rich paralinguistic diversity and minimal human supervision. We release the fully open-source OpenS2S model, including the dataset, model weights, and pre-training and fine-tuning code, to empower the broader research community and accelerate innovation in empathetic speech systems. The project webpage can be accessed at https://casia-lm.github.io/OpenS2S

Submitted 8 July, 2025; v1 submitted 7 July, 2025; originally announced July 2025.

Comments: Technical Report

arXiv:2507.04821 [pdf, ps, other]

Force-IMU Fusion-Based Sensing Acupuncture Needle and Quantitative Analysis System for Acupuncture Manipulations

Authors: Peng Tian, Kang Yu, Tianyun Jiang, Yuqi Wang, Haiying Zhang, Hao Yang, Yunfeng Wang, Jun Zhang, Shuo Gao, Junhong Gao

Abstract: Acupuncture, one of the key therapeutic methods in Traditional Chinese Medicine (TCM), has been widely adopted in various clinical fields. Quantitative research on acupuncture manipulation parameters is critical to achieve standardized techniques. However, quantitative mechanical detection of acupuncture parameters remains limited. This study establishes a kinematic and dynamic model of acupuncture, identifying key parameters such as lifting-thrusting force, acceleration, velocity, displacement, as well as twirling-rotating angular velocity and angle. To measure these critical parameters, we propose a quantitative system comprising a sensing needle equipped with a force sensor and an inertial measurement unit (IMU), as well as an external camera module to capture image information. By fusing visual and IMU data, we accurately identify the stationary or motion states of the needle, enabling segmented computation of lifting-thrusting velocity and displacement. The experimental results demonstrate that the sensing needle achieves comprehensive detection with high precision, featuring a nonlinearity error of 0.45% in force measurement and an RMSE of 1.2 mm in displacement. The extracted parameters provide an objective description of the operational characteristics and motion patterns of the four basic acupuncture manipulations. These findings provide valuable tools and methods for research in acupuncture standardization.

Submitted 7 July, 2025; originally announced July 2025.

Comments: This work has been submitted to the IEEE for possible publication

arXiv:2507.04284 [pdf, ps, other]

High-Availability Integrity Monitoring for Multi-Constellation GNSS Navigation with Non-Gaussian Errors

Authors: Penggao Yan, Ronghe Jin, Junyi Zhang, Cheng-Wei Wang, Li-Ta Hsu

Abstract: Global navigation satellite systems (GNSS) are essential for aviation, requiring strict integrity monitoring to alert users to hazardously misleading information. Conventional receiver autonomous integrity monitoring (RAIM) and advanced RAIM (ARAIM) rely heavily on Gaussian models in bounding nominal errors, which can be overly conservative with real-world non-Gaussian errors with heavy tails, such as the satellite clock and orbit errors. This paper proposes an extended jackknife detector capable of detecting multiple simultaneous faults with non-Gaussian nominal errors. Furthermore, an integrity monitoring algorithm, jackknife ARAIM, is developed by systematically exploiting the properties of the jackknife detector in the range domain. A tight bound of the integrity risk is derived by quantifying the impacts of hypothetical fault vectors on the position solution. The proposed method is examined in worldwide simulations, with the nominal measurement error simulated based on authentic experimental data, which yields findings that differ from those in existing research. In a setting of a single Global Positioning System (GPS) constellation, the proposed method reduces the 99.5 percentile vertical protection level (VPL) 45m, where the VPL of the baseline ARAIM is larger than 50m in most user locations.
For dual-constellation (GPS-Galileo) settings, baseline ARAIM suffers VPL inflation over 60m due to the over-conservatism induced by the heavy-tailed Galileo signal-in-space range errors, whereas the proposed jackknife ARAIM retains VPL below 40m, achieving over 92% normal operations for a 35m Vertical Alert Limit. These improvements have promising potential to support localizer performance with vertical guidance (LPV) with a decision height of 200 ft, enhancing integrity and availability for multi-constellation GNSS applications.

Submitted 6 July, 2025; originally announced July 2025.

Comments: Submitted to IEEE Transactions on Instrumentation and Measurement

arXiv:2507.03315 [pdf, ps, other]

Towards Interpretable PolSAR Image Classification: Polarimetric Scattering Mechanism Informed Concept Bottleneck and Kolmogorov-Arnold Network

Authors: Jinqi Zhang, Fangzhou Han, Di Zhuang, Lamei Zhang, Bin Zou, Li Yuan

Abstract: In recent years, Deep Learning (DL) based methods have received extensive attention in the field of PolSAR image classification and have shown excellent performance. However, due to the "black-box" nature of DL methods, the interpretation of the high-dimensional features extracted and the backtracking of the decision-making process based on the features are still unresolved problems. In this study, we first highlight this issue and attempt to achieve the interpretability analysis of DL-based PolSAR image classification technology with the help of Polarimetric Target Decomposition (PTD), a feature extraction method related to the scattering mechanism unique to the PolSAR image processing field. In our work, by constructing the polarimetric conceptual labels and a novel structure named Parallel Concept Bottleneck Networks (PaCBM), the uninterpretable high-dimensional features are transformed into human-comprehensible concepts based on physically verifiable polarimetric scattering mechanisms. Then, the Kolmogorov-Arnold Network (KAN) is used to replace the Multi-Layer Perceptron (MLP) to achieve a more concise and understandable mapping process between layers and further enhanced non-linear modeling ability.
The experimental results on several PolSAR datasets show that the features can be conceptualized while achieving satisfactory accuracy through the proposed pipeline, and that the analytical function for predicting category labels from conceptual labels can be obtained by combining spline functions, thus promoting the research on the interpretability of the DL-based PolSAR image classification model.

Submitted 4 July, 2025; originally announced July 2025.

arXiv:2507.02437 [pdf, ps, other]

F^2TTA: Free-Form Test-Time Adaptation on Cross-Domain Medical Image Classification via Image-Level Disentangled Prompt Tuning

Authors: Wei Li, Jingyang Zhang, Lihao Liu, Guoan Wang, Junjun He, Yang Chen, Lixu Gu

Abstract: Test-Time Adaptation (TTA) has emerged as a promising solution for adapting a source model to unseen medical sites using unlabeled test data, due to the high cost of data annotation. Existing TTA methods consider scenarios where data from one or multiple domains arrives in complete domain units. However, in clinical practice, data usually arrives in domain fragments of arbitrary lengths and in random arrival orders, due to resource constraints and patient variability. This paper investigates a practical Free-Form Test-Time Adaptation (F$^{2}$TTA) task, where a source model is adapted to such free-form domain fragments, with shifts occurring between fragments unpredictably. In this setting, these shifts could distort the adaptation process. To address this problem, we propose a novel Image-level Disentangled Prompt Tuning (I-DiPT) framework. I-DiPT employs an image-invariant prompt to explore domain-invariant representations for mitigating the unpredictable shifts, and an image-specific prompt to adapt the source model to each test image from the incoming fragments. The prompts may suffer from insufficient knowledge representation since only one image is available for training. To overcome this limitation, we first introduce Uncertainty-oriented Masking (UoM), which encourages the prompts to extract sufficient information from the incoming image via masked consistency learning driven by the uncertainty of the source model representations.
Then, we further propose a Parallel Graph Distillation (PGD) method that reuses knowledge from historical image-specific and image-invariant prompts through parallel graph networks. Experiments on breast cancer and glaucoma classification demonstrate the superiority of our method over existing TTA approaches in F$^{2}$TTA. Code is available at https://github.com/mar-cry/F2TTA.

Submitted 3 July, 2025; originally announced July 2025.

Comments: This paper has been submitted to relevant journals

arXiv:2507.01876 [pdf, ps, other]

Joint Power Control and Precoding for Cell-Free Massive MIMO Systems With Sparse Multi-Dimensional Graph Neural Networks

Authors: Yukun Ma, Jiayi Zhang, Ziheng Liu, Guowei Shi, Bo Ai

Abstract: Cell-free massive multiple-input multiple-output (CF mMIMO) has emerged as a prominent candidate for future networks due to its ability to significantly enhance spectral efficiency by eliminating inter-cell interference. However, its practical deployment faces considerable challenges, such as high computational complexity and the optimization of its complex processing. To address these challenges, this correspondence proposes a framework based on a sparse multi-dimensional graph neural network (SP-MDGNN), which sparsifies the connections between access points (APs) and user equipments (UEs) to significantly reduce computational complexity while maintaining high performance. In addition, the weighted minimum mean square error (WMMSE) algorithm is introduced as a comparative method to further analyze the trade-off between performance and complexity. Simulation results demonstrate that the sparse method achieves an optimal balance between performance and complexity, significantly reducing the computational complexity of the original MDGNN method while incurring only a slight performance degradation, providing insights for the practical deployment of CF mMIMO systems in large-scale networks.

Submitted 2 July, 2025; originally announced July 2025.

Comments: 5 pages, 5 figures

arXiv:2507.01323 [pdf, ps, other]

SWinMamba: Serpentine Window State Space Model for Vascular Segmentation

Authors: Rongchang Zhao, Huanchi Liu, Jian Zhang

Abstract: Vascular segmentation in medical images is crucial for disease diagnosis and surgical navigation. However, the segmented vascular structure is often discontinuous due to its slender nature and inadequate prior modeling. In this paper, we propose a novel Serpentine Window Mamba (SWinMamba) to achieve accurate vascular segmentation. The proposed SWinMamba innovatively models the continuity of slender vascular structures by incorporating serpentine window sequences into bidirectional state space models. The serpentine window sequences enable efficient feature capturing by adaptively guiding global visual context modeling to the vascular structure. Specifically, the Serpentine Window Tokenizer (SWToken) adaptively splits the input image using overlapping serpentine window sequences, enabling flexible receptive fields (RFs) for vascular structure modeling. The Bidirectional Aggregation Module (BAM) integrates coherent local features in the RFs for vascular continuity representation. In addition, dual-domain learning with Spatial-Frequency Fusion Unit (SFFU) is designed to enhance the feature representation of vascular structure. Extensive experiments on three challenging datasets demonstrate that the proposed SWinMamba achieves superior performance with complete and connected vessels.

Submitted 1 July, 2025; originally announced July 2025.

arXiv:2507.01055 [pdf, ps, other]

Prompt Mechanisms in Medical Imaging: A Comprehensive Survey

Authors: Hao Yang, Xinlong Liang, Zhang Li, Yue Sun, Zheyu Hu, Xinghe Xie, Behdad Dashtbozorg, Jincheng Huang, Shiwei Zhu, Luyi Han, Jiong Zhang, Shanshan Wang, Ritse Mann, Qifeng Yu, Tao Tan

Abstract: Deep learning offers transformative potential in medical imaging, yet its clinical adoption is frequently hampered by challenges such as data scarcity, distribution shifts, and the need for robust task generalization. Prompt-based methodologies have emerged as a pivotal strategy to guide deep learning models, providing flexible, domain-specific adaptations that significantly enhance model performance and adaptability without extensive retraining. This systematic review critically examines the burgeoning landscape of prompt engineering in medical imaging. We dissect diverse prompt modalities, including textual instructions, visual prompts, and learnable embeddings, and analyze their integration for core tasks such as image generation, segmentation, and classification. Our synthesis reveals how these mechanisms improve task-specific outcomes by enhancing accuracy, robustness, and data efficiency and reducing reliance on manual feature engineering while fostering greater model interpretability by making the model's guidance explicit. Despite substantial advancements, we identify persistent challenges, particularly in prompt design optimization, data heterogeneity, and ensuring scalability for clinical deployment. Finally, this review outlines promising future trajectories, including advanced multimodal prompting and robust clinical integration, underscoring the critical role of prompt-driven AI in accelerating the revolution of diagnostics and personalized treatment planning in medicine.

Submitted 27 June, 2025; originally announced July 2025.

arXiv:2507.00660 [pdf, ps, other]

MTCNet: Motion and Topology Consistency Guided Learning for Mitral Valve Segmentation in 4D Ultrasound

Authors: Rusi Chen, Yuanting Yang, Jiezhi Yao, Hongning Song, Ji Zhang, Yongsong Zhou, Yuhao Huang, Ronghao Yang, Dan Jia, Yuhan Zhang, Xing Tao, Haoran Dou, Qing Zhou, Xin Yang, Dong Ni

Abstract: Mitral regurgitation is one of the most prevalent cardiac disorders. Four-dimensional (4D) ultrasound has emerged as the primary imaging modality for assessing dynamic valvular morphology. However, 4D mitral valve (MV) analysis remains challenging due to limited phase annotations, severe motion artifacts, and poor imaging quality. Yet, the absence of inter-phase dependency in existing methods hinders 4D MV analysis. To bridge this gap, we propose a Motion-Topology guided consistency network (MTCNet) for accurate 4D MV ultrasound segmentation in semi-supervised learning (SSL). MTCNet requires only sparse end-diastolic and end-systolic annotations. First, we design a cross-phase motion-guided consistency learning strategy, utilizing a bi-directional attention memory bank to propagate spatio-temporal features. This enables MTCNet to achieve excellent performance both per- and inter-phase. Second, we devise a novel topology-guided correlation regularization that explores physical prior knowledge to maintain anatomical plausibility. Therefore, MTCNet can effectively leverage structural correspondence between labeled and unlabeled phases. Extensive evaluations on the first largest 4D MV dataset, with 1408 phases from 160 patients, show that MTCNet achieves superior cross-phase consistency compared to other advanced methods (Dice: 87.30%, HD: 1.75mm). Both the code and the dataset are available at https://github.com/crs524/MTCNet.

Submitted 3 July, 2025; v1 submitted 1 July, 2025; originally announced July 2025.

Comments: Accepted by MICCAI 2025

arXiv:2507.00452 [pdf]

The impact of the following vehicles behaviors on the car following behaviors of the ego-vehicle

Authors: Yang Liu, Jiahao Zhang, Yuxuan Ouyang, Huan Yu, Dengbo He

Abstract: Among all types of crashes, rear-end crashes dominate, which are closely related to the car-following (CF) behaviors. Traditional CF behavior models focused on the influence of the vehicle in front, but usually ignored the peer pressure from the surrounding road users, including the following vehicle (FV). Based on an open dataset, the highD dataset, we investigated whether the FV's states can affect the CF behavior of the ego-vehicle in CF events. Two types of CF events were extracted from the highD database: tailgated events, where the time headway between the FV and the ego-vehicle (i.e., time gap) was smaller than 1 second, and gapped events, where the time gap was larger than 3 seconds. Dynamic time warping was used to extract CF pairs with similar speed profiles of the leading vehicle (LV). Statistical analyses were conducted to compare the CF-performance metrics in tailgated and gapped events. Then, inverse reinforcement learning was used to recover the reward function of the ego-vehicle drivers in different CF events. The results showed that the ego-driver would adjust their CF behavior in response to the pressure from a tailgating FV, by maintaining a closer distance to the LV, but at the same time, driving more cautiously. Further, drivers were still able to adjust their CF strategies based on the speed of traffic flow and the distance to the LV, even when being tailgated. These findings provide insights regarding more accurate modelling of traffic flow by considering the peer pressure from surrounding road users.
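The two event types defined in the abstract can be illustrated with a short sketch. Only the 1-second and 3-second thresholds come from the abstract; the function names, the "excluded" label, and the definition of the time gap as the distance between the two vehicles divided by the following vehicle's speed are assumptions, and the paper's actual extraction procedure may differ.

```python
TAILGATED_MAX_S = 1.0  # threshold from the abstract: time gap smaller than 1 s
GAPPED_MIN_S = 3.0     # threshold from the abstract: time gap larger than 3 s

def time_gap(gap_m, follower_speed_mps):
    """Time gap in seconds, assumed here to be distance over follower speed."""
    if follower_speed_mps <= 0:
        return float("inf")  # a stationary follower never closes the gap
    return gap_m / follower_speed_mps

def classify_event(gap_m, follower_speed_mps):
    """Label a car-following sample as tailgated, gapped, or excluded."""
    gap_s = time_gap(gap_m, follower_speed_mps)
    if gap_s < TAILGATED_MAX_S:
        return "tailgated"
    if gap_s > GAPPED_MIN_S:
        return "gapped"
    return "excluded"  # hypothetical label for gaps between 1 s and 3 s

# Made-up samples: (distance to the following vehicle in m, its speed in m/s).
samples = [(20.0, 30.0), (100.0, 25.0), (50.0, 25.0)]
labels = [classify_event(gap, speed) for gap, speed in samples]
```

Samples with a time gap between the two thresholds fall into neither group, which keeps the tailgated and gapped conditions clearly separated for comparison.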

Submitted 1 July, 2025; originally announced July 2025.

arXiv:2506.23495 [pdf, ps, other]

Far-Field vs. Near-Field Propagation Channels: Key Differences and Impact on 6G XL-MIMO Performance Evaluation

Authors: Zihang Ding, Jianhua Zhang, Changsheng You, Pan Tang, Hongbo Xing, Zhiqiang Yuan, Jie Meng, Guangyi Liu

Abstract: Extremely large-scale multiple-input multiple-output (XL-MIMO) is regarded as a promising technology for next-generation communication systems. However, this will expand the near-field (NF) range, rendering more users more likely to be located in the NF region. In this paper, we aim to answer two questions: What are the new characteristics of the NF channel? Is it necessary to develop new transceiver techniques to maintain system performance within the NF region? To this end, we first review current NF channel models and analyze the differences between the existing 3GPP TR 38.901 channel model and the NF channel model, including the spherical wavefront and spatially non-stationarity. Then, we provide examples on how these differences affect the XL-MIMO system performance in terms of beamforming gain and achievable rate. Simulation results demonstrate that, when using far-field (FF) technique under the NF channel, the maximum normalized beam gain loss is less than 3 dB for most users in the NF region defined by Rayleigh distance. Moreover, the achievable rate loss of beam training is less than 3% compared to that realized by NF technique. Finally, we demonstrate the necessity of employing NF transceiver techniques based on simulation results.

Submitted 29 June, 2025; originally announced June 2025.

Comments: 13 pages, 8 figures, 2 tables, 52 references. Note: This article has been submitted to China Communications and is currently under review

arXiv:2506.22467 [pdf]

SegmentAnyMuscle: A universal muscle segmentation model across different locations in MRI

Authors: Roy Colglazier, Jisoo Lee, Haoyu Dong, Hanxue Gu, Yaqian Chen, Joseph Cao, Zafer Yildiz, Zhonghao Liu, Nicholas Konz, Jichen Yang, Jikai Zhang, Yuwen Chen, Lin Li, Adrian Camarena, Maciej A. Mazurowski

Abstract: The quantity and quality of muscles are increasingly recognized as important predictors of health outcomes. While MRI offers a valuable modality for such assessments, obtaining precise quantitative measurements of musculature remains challenging. This study aimed to develop a publicly available model for muscle segmentation in MRIs and demonstrate its applicability across various anatomical locations and imaging sequences. A total of 362 MRIs from 160 patients at a single tertiary center (Duke University Health System, 2016-2020) were included, with 316 MRIs from 114 patients used for model development. The model was tested on two separate sets: one with 28 MRIs representing common sequence types, achieving an average Dice Similarity Coefficient (DSC) of 88.45%, and another with 18 MRIs featuring less frequent sequences and abnormalities such as muscular atrophy, hardware, and significant noise, achieving 86.21% DSC. These results demonstrate the feasibility of a fully automated deep learning algorithm for segmenting muscles on MRI across diverse settings. The public release of this model enables consistent, reproducible research into the relationship between musculature and health.

Submitted 18 June, 2025; originally announced June 2025.

Comments: 24 pages, 6 figures

arXiv:2506.22012 [pdf, ps, other]

Noise-Inspired Diffusion Model for Generalizable Low-Dose CT Reconstruction

Authors: Qi Gao, Zhihao Chen, Dong Zeng, Junping Zhang, Jianhua Ma, Hongming Shan

Abstract: The generalization of deep learning-based low-dose computed tomography (CT) reconstruction models to doses unseen in the training data is important and remains challenging. Previous efforts heavily rely on paired data to improve the generalization performance and robustness through collecting either diverse CT data for re-training or a few test data for fine-tuning. Recently, diffusion models have shown promising and generalizable performance in low-dose CT (LDCT) reconstruction, however, they may produce unrealistic structures due to the CT image noise deviating from Gaussian distribution and imprecise prior information from the guidance of noisy LDCT images. In this paper, we propose a noise-inspired diffusion model for generalizable LDCT reconstruction, termed NEED, which tailors diffusion models for noise characteristics of each domain. First, we propose a novel shifted Poisson diffusion model to denoise projection data, which aligns the diffusion process with the noise model in pre-log LDCT projections. Second, we devise a doubly guided diffusion model to refine reconstructed images, which leverages LDCT images and initial reconstructions to more accurately locate prior information and enhance reconstruction fidelity. By cascading these two diffusion models for dual-domain reconstruction, our NEED requires only normal-dose data for training and can be effectively extended to various unseen dose levels during testing via a time step matching strategy. Extensive qualitative, quantitative, and segmentation-based evaluations on two datasets demonstrate that our NEED consistently outperforms state-of-the-art methods in reconstruction and generalization performance. Source code is made available at https://github.com/qgao21/NEED.

Submitted 27 June, 2025; originally announced June 2025.

Comments: Accepted for publication in Medical Image Analysis, 2025

arXiv:2506.21198 [pdf, ps, other]

Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation

Authors: Yihong Cao, Jiaming Zhang, Xu Zheng, Hao Shi, Kunyu Peng, Hang Liu, Kailun Yang, Hui Zhang

Abstract: Panoramic image processing is essential for omni-context perception, yet faces constraints like distortions, perspective occlusions, and limited annotations. Previous unsupervised domain adaptation methods transfer knowledge from labeled pinhole data to unlabeled panoramic images, but they require access to source pinhole data. To address these, we introduce a more practical task, i.e., Source-Free Occlusion-Aware Seamless Segmentation (SFOASS), and propose its first solution, called UNconstrained Learning Omni-Context Knowledge (UNLOCK). Specifically, UNLOCK includes two key modules: Omni Pseudo-Labeling Learning and Amodal-Driven Context Learning. While adapting without relying on source data or target labels, this framework enhances models to achieve segmentation with 360° viewpoint coverage and occlusion-aware reasoning. Furthermore, we benchmark the proposed SFOASS task through both real-to-real and synthetic-to-real adaptation settings. Experimental results show that our source-free method achieves performance comparable to source-dependent methods, yielding state-of-the-art scores of 10.9 in mAAP and 11.6 in mAP, along with an absolute improvement of +4.3 in mAPQ over the source-only method. All data and code will be made publicly available at https://github.com/yihong-97/UNLOCK.

Submitted 26 June, 2025; originally announced June 2025.

Comments: Accepted to ICCV 2025. All data and code will be made publicly available at https://github.com/yihong-97/UNLOCK

arXiv:2506.17887 [pdf, ps, other]

Near-Field Propagation and Spatial Non-Stationarity Channel Model for 6-24 GHz (FR3) Extremely Large-Scale MIMO: Adopted by 3GPP for 6G

Authors: Huixin Xu, Jianhua Zhang, Pan Tang, Hongbo Xing, Haiyang Miao, Nan Zhang, Jian Li, Jianming Wu, Wenfei Yang, Zhening Zhang, Wei Jiang, Zijian He, Afshin Haghighat, Qixing Wang, Guangyi Liu

Abstract: Next generation cellular deployments are expected to exploit the 6-24 GHz frequency range 3 (FR3) and extremely large-scale multiple-input multiple-output (XL-MIMO) to enable ultra-high data rates and reliability. However, the significantly enlarged antenna apertures and higher carrier frequencies render the far-field and spatial stationarity assumptions in the existing 3rd generation partnership project (3GPP) channel models invalid, giving rise to new features such as near-field propagation and spatial non-stationarity (SNS). Despite extensive prior research, incorporating these new features within the standardized channel modeling framework remains an open issue. To address this, this paper presents a channel modeling framework for XL-MIMO systems that incorporates both near-field and SNS features, adopted by 3GPP. For the near-field propagation feature, the framework models the distances from the base station (BS) and user equipment to the spherical-wave sources associated with clusters. These distances are used to characterize element-wise variations of path parameters, such as nonlinear changes in phase and angle. To capture the effect of SNS at the BS side, a stochastic-based approach is proposed to model SNS caused by incomplete scattering, by establishing power attenuation factors from visibility probability and visibility region to characterize antenna element-wise path power variation. In addition, a physical blocker-based approach is introduced to model SNS effects caused by partial blockage. Finally, a simulation framework for near-field and SNS is developed within the structure of the existing 3GPP channel model. Performance evaluations demonstrate that the near-field model captures higher channel capacity potential compared to the far-field model. Coupling loss results indicate that SNS leads to more pronounced propagation fading relative to the spatial stationary model.

Submitted 21 June, 2025; originally announced June 2025.

arXiv:2506.15972 [pdf, ps, other]

Theoretical Analysis of Near-Field MIMO Channel Capacity and Mid-Band Experimental Validation

Authors: Haiyang Miao, Jianhua Zhang, Pan Tang, Heng Wang, Lei Tian, Guangyi Liu

Abstract: With the increase of multiple-input-multiple-output (MIMO) array size and carrier frequency, near-field MIMO communications will become crucial in 6G wireless networks. Due to the increase of MIMO near-field range, the research of near-field MIMO capacity has aroused wide interest. In this paper, we focus on the theoretical analysis and empirical study of near-field MIMO capacity. First, the near-field channel model is characterized from the electromagnetic information perspective. Second, with the uniform planar array (UPA), the channel capacity based on effective degree of freedom (EDoF) is analyzed theoretically, and the closed-form analytical expressions are derived in detail. Finally, based on the numerical verification of near-field channel measurement experiment at 13 GHz band, we reveal that the channel capacity of UPA-type MIMO systems decreases continuously with the communication distance increasing. It can be observed that the near-field channel capacity gain is relatively obvious when large-scale MIMO is adopted at both receiving and transmitter ends, but the near-field channel capacity gain may be limited in the actual communication system with the small antenna array at receiving end. This work will give some reference to the near-field communication systems.

Submitted 18 June, 2025; originally announced June 2025.

arXiv:2506.14165 [pdf, ps, other]

A Comprehensive Survey on Underwater Acoustic Target Positioning and Tracking: Progress, Challenges, and Perspectives

Authors: Zhong Yang, Zhengqiu Zhu, Yong Zhao, Yonglin Tian, Changjun Fan, Runkang Guo, Wenhao Lu, Jingwei Ge, Bin Chen, Yin Zhang, Guohua Wu, Rui Wang, Gyorgy Eigner, Guangquan Cheng, Jincai Huang, Zhong Liu, Jun Zhang, Imre J. Rudas, Fei-Yue Wang

Abstract: Underwater target tracking technology plays a pivotal role in marine resource exploration, environmental monitoring, and national defense security. Given that acoustic waves represent an effective medium for long-distance transmission in aquatic environments, underwater acoustic target tracking has become a prominent research area of underwater communications and networking. Existing literature reviews often offer a narrow perspective or inadequately address the paradigm shifts driven by emerging technologies like deep learning and reinforcement learning. To address these gaps, this work presents a systematic survey of this field and introduces an innovative multidimensional taxonomy framework based on target scale, sensor perception modes, and sensor collaboration patterns. Within this framework, we comprehensively survey the literature (more than 180 publications) over the period 2016-2025, spanning from the theoretical foundations to diverse algorithmic approaches in underwater acoustic target tracking. Particularly, we emphasize the transformative potential and recent advancements of machine learning techniques, including deep learning and reinforcement learning, in enhancing the performance and adaptability of underwater tracking systems. Finally, this survey concludes by identifying key challenges in the field and proposing future avenues based on emerging technologies such as federated learning, blockchain, embodied intelligence, and large models.

Submitted 16 June, 2025; originally announced June 2025.

arXiv:2506.13443 [pdf]

PRO: Projection Domain Synthesis for CT Imaging

Authors: Kang Chen, Bin Huang, Xuebin Yang, Junyan Zhang, Qiegen Liu

Abstract: Synthesizing high quality CT projection data remains a significant challenge due to the limited availability of annotated data and the complex nature of CT imaging. In this work, we present PRO, a projection domain synthesis foundation model for CT imaging. To the best of our knowledge, this is the first study that performs CT synthesis in the projection domain. Unlike previous approaches that operate in the image domain, PRO learns rich structural representations from raw projection data and leverages anatomical text prompts for controllable synthesis. This projection domain strategy enables more faithful modeling of underlying imaging physics and anatomical structures. Moreover, PRO functions as a foundation model, capable of generalizing across diverse downstream tasks by adjusting its generative behavior via prompt inputs. Experimental results demonstrated that incorporating our synthesized data significantly improves performance across multiple downstream tasks, including low-dose and sparse-view reconstruction. These findings underscore the versatility and scalability of PRO in data generation for various CT applications. These results highlight the potential of projection domain synthesis as a powerful tool for data augmentation and robust CT imaging. Our source code is publicly available at: https://github.com/yqx7150/PRO.

Submitted 18 June, 2025; v1 submitted 16 June, 2025; originally announced June 2025.

arXiv:2506.12544 [pdf, ps, other]

Constrained Diffusers for Safe Planning and Control

Authors: Jichen Zhang, Liqun Zhao, Antonis Papachristodoulou, Jack Umenberger

Abstract: Diffusion models have shown remarkable potential in planning and control tasks due to their ability to represent multimodal distributions over actions and trajectories. However, ensuring safety under constraints remains a critical challenge for diffusion models. This paper proposes Constrained Diffusers, a novel framework that incorporates constraints into pre-trained diffusion models without retraining or architectural modifications. Inspired by constrained optimization, we apply a constrained Langevin sampling mechanism for the reverse diffusion process that jointly optimizes the trajectory and realizes constraint satisfaction through three iterative algorithms: projected method, primal-dual method and augmented Lagrangian approaches. In addition, we incorporate discrete control barrier functions as constraints for constrained diffusers to guarantee safety in online implementation. Experiments in Maze2D, locomotion, and pybullet ball running tasks demonstrate that our proposed methods achieve constraint satisfaction with less computation time, and are competitive to existing methods in environments with static and time-varying constraints.

Submitted 14 June, 2025; originally announced June 2025.

Comments: 12 pages, 5 figures

arXiv:2506.12073 [pdf, ps, other]

Seamless Dysfluent Speech Text Alignment for Disordered Speech Analysis

Authors: Zongli Ye, Jiachen Lian, Xuanru Zhou, Jinming Zhang, Haodong Li, Shuhe Li, Chenxu Guo, Anaisha Das, Peter Park, Zoe Ezzes, Jet Vonk, Brittany Morin, Rian Bogley, Lisa Wauters, Zachary Miller, Maria Gorno-Tempini, Gopala Anumanchipalli

Abstract: Accurate alignment of dysfluent speech with intended text is crucial for automating the diagnosis of neurodegenerative speech disorders. Traditional methods often fail to model phoneme similarities effectively, limiting their performance. In this work, we propose Neural LCS, a novel approach for dysfluent text-text and speech-text alignment. Neural LCS addresses key challenges, including partial alignment and context-aware similarity mapping, by leveraging robust phoneme-level modeling. We evaluate our method on a large-scale simulated dataset, generated using advanced data simulation techniques, and real PPA data. Neural LCS significantly outperforms state-of-the-art models in both alignment accuracy and dysfluent speech segmentation. Our results demonstrate the potential of Neural LCS to enhance automated systems for diagnosing and analyzing speech disorders, offering a more accurate and linguistically grounded solution for dysfluent speech alignment.

Submitted 4 June, 2025; originally announced June 2025.

Comments: Accepted for Interspeech2025

arXiv:2506.11514 [pdf, ps, other]

Efficient Speech Enhancement via Embeddings from Pre-trained Generative Audioencoders

Authors: Xingwei Sun, Heinrich Dinkel, Yadong Niu, Linzhang Wang, Junbo Zhang, Jian Luan

Abstract: Recent research has delved into speech enhancement (SE) approaches that leverage audio embeddings from pre-trained models, diverging from time-frequency masking or signal prediction techniques. This paper introduces an efficient and extensible SE method. Our approach involves initially extracting audio embeddings from noisy speech using a pre-trained audioencoder, which are then denoised by a compact encoder network. Subsequently, a vocoder synthesizes the clean speech from denoised embeddings. An ablation study substantiates the parameter efficiency of the denoise encoder with a pre-trained audioencoder and vocoder. Experimental results on both speech enhancement and speaker fidelity demonstrate that our generative audioencoder-based SE system outperforms models utilizing discriminative audioencoders. Furthermore, subjective listening tests validate that our proposed system surpasses an existing state-of-the-art SE model in terms of perceptual quality.

Submitted 13 June, 2025; originally announced June 2025.

Comments: Accepted by Interspeech 2025

arXiv:2506.11350 [pdf, ps, other]

GLAP: General contrastive audio-text pretraining across domains and languages

Authors: Heinrich Dinkel, Zhiyong Yan, Tianzi Wang, Yongqing Wang, Xingwei Sun, Yadong Niu, Jizhong Liu, Gang Li, Junbo Zhang, Jian Luan

Abstract: Contrastive Language Audio Pretraining (CLAP) is a widely-used method to bridge the gap between audio and text domains. Current CLAP methods enable sound and music retrieval in English, ignoring multilingual spoken content. To address this, we introduce general language audio pretraining (GLAP), which expands CLAP with multilingual and multi-domain abilities. GLAP demonstrates its versatility by achieving competitive performance on standard audio-text retrieval benchmarks like Clotho and AudioCaps, while significantly surpassing existing methods in speech retrieval and classification tasks. Additionally, GLAP achieves strong results on widely used sound-event zero-shot benchmarks, while simultaneously outperforming previous methods on speech content benchmarks. Further keyword spotting evaluations across 50 languages emphasize GLAP's advanced multilingual capabilities. Finally, multilingual sound and music understanding is evaluated across four languages. Checkpoints and Source: https://github.com/xiaomi-research/dasheng-glap.

Submitted 12 June, 2025; originally announced June 2025.

arXiv:2506.10813 [pdf, ps, other]

Unsupervised Deformable Image Registration with Structural Nonparametric Smoothing

Authors: Hang Zhang, Xiang Chen, Renjiu Hu, Rongguang Wang, Jinwei Zhang, Min Liu, Yaonan Wang, Gaolei Li, Xinxing Cheng, Jinming Duan

Abstract: Learning-based deformable image registration (DIR) accelerates alignment by amortizing traditional optimization via neural networks. Label supervision further enhances accuracy, enabling efficient and precise nonlinear alignment of unseen scans. However, images with sparse features amid large smooth regions, such as retinal vessels, introduce aperture and large-displacement challenges that unsupervised DIR methods struggle to address. This limitation occurs because neural networks predict deformation fields in a single forward pass, leaving fields unconstrained post-training and shifting the regularization burden entirely to network weights. To address these issues, we introduce SmoothProper, a plug-and-play neural module enforcing smoothness and promoting message passing within the network's forward pass. By integrating a duality-based optimization layer with tailored interaction terms, SmoothProper efficiently propagates flow signals across spatial locations, enforces smoothness, and preserves structural consistency. It is model-agnostic, seamlessly integrates into existing registration frameworks with minimal parameter overhead, and eliminates regularizer hyperparameter tuning. Preliminary results on a retinal vessel dataset exhibiting aperture and large-displacement challenges demonstrate our method reduces registration error to 1.88 pixels on 2912x2912 images, marking the first unsupervised DIR approach to effectively address both challenges. The source code will be available at https://github.com/tinymilky/SmoothProper.

Submitted 12 June, 2025; originally announced June 2025.

Comments: Accepted for publication at Information Processing in Medical Imaging (IPMI) 2025

arXiv:2506.10207 [pdf, ps, other]

FedMLAC: Mutual Learning Driven Heterogeneous Federated Audio Classification

Authors: Jun Bai, Rajib Rana, Di Wu, Youyang Qu, Xiaohui Tao, Ji Zhang

Abstract: Federated Learning (FL) provides a privacy-preserving paradigm for training audio classification (AC) models across distributed clients without sharing raw data. However, Federated Audio Classification (FedAC) faces three critical challenges that substantially hinder performance: data heterogeneity, model heterogeneity, and data poisoning. While prior works have attempted to address these issues, they are typically treated independently, lacking a unified and robust solution suited to real-world federated audio scenarios. To bridge this gap, we propose FedMLAC, a unified mutual learning framework designed to simultaneously tackle these challenges in FedAC. Specifically, FedMLAC introduces a dual-model architecture on each client, comprising a personalized local AC model and a lightweight, globally shared Plug-in model. Through bidirectional knowledge distillation, the Plug-in model enables global knowledge transfer while adapting to client-specific data distributions, thus supporting both generalization and personalization. To further enhance robustness against corrupted audio data, we develop a Layer-wise Pruning Aggregation (LPA) strategy that filters unreliable Plug-in model updates based on parameter deviations during server-side aggregation. Extensive experiments on four diverse audio classification benchmarks, spanning both speech and non-speech tasks, demonstrate that FedMLAC consistently outperforms existing state-of-the-art methods in terms of classification accuracy and robustness to noisy data.

Submitted 11 June, 2025; originally announced June 2025.

Comments: initial version

arXiv:2506.09807 [pdf, ps, other]

doi 10.1109/TIFS.2025.3570118

Physical Layer-Based Device Fingerprinting for Wireless Security: From Theory to Practice

Authors: Junqing Zhang, Francesco Ardizzon, Mattia Piana, Guanxiong Shen, Stefano Tomasin

Abstract: The identification of the devices from which a message is received is part of security mechanisms to ensure authentication in wireless communications. Conventional authentication approaches are cryptography-based, which, however, are usually computationally expensive and not adequate in the Internet of Things (IoT), where devices tend to be low-cost and with limited resources. This paper provides a comprehensive survey of physical layer-based device fingerprinting, which is an emerging device authentication for wireless security. In particular, this article focuses on hardware impairment-based identity authentication and channel features-based authentication. They are passive techniques that are readily applicable to legacy IoT devices. Their intrinsic hardware and channel features, algorithm design methodologies, application scenarios, and key research questions are extensively reviewed here. The remaining research challenges are discussed, and future work is suggested that can further enhance the physical layer-based device fingerprinting.

Submitted 11 June, 2025; originally announced June 2025.

arXiv:2506.08404 [pdf, ps, other]

Compact Amplified Laser Power Stabilization Using Robust Active Disturbance Rejection Control with Sensor Noise Decoupling

Authors: Yanpei Shi, Jingxuan Zhang, Zhuo Shi, Chenyao Zhang, Yuze Guo, Rui Feng

Abstract: Laser power instability, encompassing random jitter and slow drift, severely limits the performance of optically pumped magnetometers (OPMs) in detecting ultra-weak magnetic fields, especially in large-scale OPM arrays for magnetoencephalography. Although a unified amplified laser (AL) architecture improves integration, fluctuations in the pump beam progressively degrade performance across all channels, exacerbated by environmental disturbances and system uncertainties. To address this challenge, this paper presents a compact AL power stabilization approach based on an innovative dual-loop active disturbance rejection control (DLADRC) strategy, while integrating a comprehensive quantitative stability analysis through novel exponential decay estimates for extended state observers (ESOs) and control error dynamics. As validated through physical experimental results, the proposed method significantly improves AL's long-term stability with sensor noise decoupling, achieving an over 85.7% reduction in 1-hour power instability and a tenfold decrease in Allan variance for correlation times of 10^2 s to 10^3 s, compared to standard ADRC. Crucially, the strategy demonstrates robust effectiveness across diverse operating scenarios, enabling AL-based OPM systems to achieve their full potential in high-sensitivity biomagnetic field detection.

Submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.07599 [pdf, ps, other]

Flexible MIMO for Future Wireless Communications: Which Flexibilities are Possible?

Authors: Zhe Wang, Jiayi Zhang, Bokai Xu, Wenhui Yi, Emil Björnson, Bo Ai

Abstract: To enable next-generation wireless communication networks with modest spectrum availability, multiple-input multiple-output (MIMO) technology needs to undergo further evolution. In this paper, we introduce a promising next-generation wireless communication concept: flexible MIMO technology, that is, MIMO technology with flexible physical configurations and integrated applications. We categorize twelve representative flexible MIMO technologies into three major classifications: flexible deployment characteristics-based, flexible geometry characteristics-based, and flexible real-time modifications-based. Then, we provide a comprehensive overview of their fundamental characteristics, potential, and challenges. Furthermore, we demonstrate three vital enablers for the flexible MIMO technology, including efficient channel state information (CSI) acquisition schemes, low-complexity beamforming design, and explainable artificial intelligence (AI)-enabled optimization. Within these areas, eight critical sub-enabling technologies are discussed in detail. Finally, we present two case studies (pre-optimized irregular arrays and cell-free movable antennas) that showcase the significant potential of flexible MIMO technologies to enhance system capacity.

Submitted 9 June, 2025; originally announced June 2025.

Comments: 9 pages, 5 figures, 1 table

arXiv:2506.06710 [pdf, ps, other]

A Systematic Investigation on Deep Learning-Based Omnidirectional Image and Video Super-Resolution

Authors: Qianqian Zhao, Chunle Guo, Tianyi Zhang, Junpei Zhang, Peiyang Jia, Tan Su, Wenjie Jiang, Chongyi Li

Abstract: Omnidirectional image and video super-resolution is a crucial research topic in low-level vision, playing an essential role in virtual reality and augmented reality applications. Its goal is to reconstruct high-resolution images or video frames from low-resolution inputs, thereby enhancing detail preservation and enabling more accurate scene analysis and interpretation. In recent years, numerous innovative and effective approaches have been proposed, predominantly based on deep learning techniques, involving diverse network architectures, loss functions, projection strategies, and training datasets. This paper presents a systematic review of recent progress in omnidirectional image and video super-resolution, focusing on deep learning-based methods. Given that existing datasets predominantly rely on synthetic degradation and fall short in capturing real-world distortions, we introduce a new dataset, 360Insta, that comprises authentically degraded omnidirectional images and videos collected under diverse conditions, including varying lighting, motion, and exposure settings. This dataset addresses a critical gap in current omnidirectional benchmarks and enables more robust evaluation of the generalization capabilities of omnidirectional super-resolution methods. We conduct comprehensive qualitative and quantitative evaluations of existing methods on both public datasets and our proposed dataset. Furthermore, we provide a systematic overview of the current status of research and discuss promising directions for future exploration.
All datasets, methods, and evaluation metrics introduced in this work are publicly available and will be regularly updated. Project page: https://github.com/nqian1/Survey-on-ODISR-and-ODVSR.

Submitted 7 June, 2025; originally announced June 2025.

arXiv:2506.06360 [pdf]

Towards Generalizable Drowsiness Monitoring with Physiological Sensors: A Preliminary Study

Authors: Jiyao Wang, Suzan Ayas, Jiahao Zhang, Xiao Wen, Dengbo He, Birsen Donmez

Abstract: Accurately detecting drowsiness is vital to driving safety. Among all measures, physiological-signal-based drowsiness monitoring can be more privacy-preserving than a camera-based approach. However, conflicts exist regarding how physiological metrics are associated with different drowsiness labels across datasets. Thus, we analyzed key features from electrocardiograms (ECG), electrodermal activity (EDA), and respiratory (RESP) signals across four datasets, where different drowsiness inducers (such as fatigue and low arousal) and assessment methods (subjective vs. objective) were used. Binary logistic regression models were built to identify the physiological metrics that are associated with drowsiness. Findings indicate that different drowsiness inducers can lead to different physiological responses, and objective assessments were more sensitive than subjective ones in detecting drowsiness. Further, increased heart rate stability, reduced respiratory amplitude, and decreased tonic EDA are robustly associated with increased drowsiness. The results enhance understanding of drowsiness detection and can inform future generalizable monitoring designs.

Submitted 3 June, 2025; originally announced June 2025.

Comments: Accepted by HFES2025

arXiv:2506.05921 [pdf, ps, other]

Multi-Modal Large Models Based Beam Prediction: An Example Empowered by DeepSeek

Authors: Yizhu Zhao, Li Yu, Lianzheng Shi, Jianhua Zhang, Guangyi Liu

Abstract: Beam prediction is an effective approach to reduce training overhead in massive multiple-input multiple-output (MIMO) systems. However, existing beam prediction models still exhibit limited generalization ability in diverse scenarios, which remains a critical challenge. In this paper, we propose MLM-BP, a beam prediction framework based on the multi-modal large model released by DeepSeek, with full consideration of multi-modal environmental information. Specifically, the distribution of scatterers that impact the optimal beam is captured by the sensing devices. Then positions are tokenized to generate text-based representations, and multi-view images are processed by an image encoder, which is fine-tuned with low-rank adaptation (LoRA), to extract environmental embeddings. Finally, these embeddings are fed into the large model, and an output projection module is designed to determine the optimal beam index. Simulation results show that MLM-BP achieves 98.1% Top-1 accuracy on the simulation dataset. Additionally, it demonstrates few-shot generalization on a real-world dataset, achieving 72.7% Top-1 accuracy and 92.4% Top-3 accuracy with only 30% of the dataset, outperforming the existing small models by over 15%.

Submitted 6 June, 2025; originally announced June 2025.

arXiv:2506.04594 [pdf, other]

Intelligent Channel Allocation for IEEE 802.11be Multi-Link Operation: When MAB Meets LLM

Authors: Shumin Lian, Jingwen Tong, Jun Zhang, Liqun Fu

Abstract: WiFi networks have achieved remarkable success in enabling seamless communication and data exchange worldwide. The IEEE 802.11be standard, known as WiFi 7, introduces Multi-Link Operation (MLO), a groundbreaking feature that enables devices to establish multiple simultaneous connections across different bands and channels. While MLO promises substantial improvements in network throughput and latency reduction, it presents significant challenges in channel allocation, particularly in dense network environments. Current research has predominantly focused on performance analysis and throughput optimization within static WiFi 7 network configurations. In contrast, this paper addresses the dynamic channel allocation problem in dense WiFi 7 networks with MLO capabilities. We formulate this challenge as a combinatorial optimization problem, leveraging a novel network performance analysis mechanism. Given the inherent lack of prior network information, we model the problem within a Multi-Armed Bandit (MAB) framework to enable online learning of optimal channel allocations. Our proposed Best-Arm Identification-enabled Monte Carlo Tree Search (BAI-MCTS) algorithm includes rigorous theoretical analysis, providing upper bounds for both sample complexity and error probability. To further reduce sample complexity and enhance generalizability across diverse network scenarios, we put forth LLM-BAI-MCTS, an intelligent algorithm for the dynamic channel allocation problem by integrating the Large Language Model (LLM) into the BAI-MCTS algorithm.
Numerical results demonstrate that the BAI-MCTS algorithm achieves a convergence rate approximately $50.44\%$ faster than the state-of-the-art algorithms when reaching $98\%$ of the optimal value. Notably, the convergence rate of the LLM-BAI-MCTS algorithm increases by over $63.32\%$ in dense networks.

Submitted 4 June, 2025; originally announced June 2025.

Comments: This work has been accepted by JSAC 2025

ACM Class: I.2.7

arXiv:2506.02642 [pdf, ps, other]

Joint Optimization based on Two-phase GNN in RIS- and DF-assisted MISO Systems with Fine-grained Rate Demands

Authors: Huijun Tang, Jieling Zhang, Zhidong Zhao, Huaming Wu, Hongjian Sun, Pengfei Jiao

Abstract: Reconfigurable intelligent surfaces (RIS) and half-duplex decode-and-forward (DF) relays can collaborate to optimize wireless signal propagation in communication systems. Users typically have different rate demands and are clustered into groups in practice based on their requirements, where the former results in the trade-off between maximizing the rate and satisfying fine-grained rate demands, while the latter causes a trade-off between inter-group competition and intra-group cooperation when maximizing the sum rate. However, traditional approaches often overlook the joint optimization encompassing both of these trade-offs, disregarding potential optimal solutions and even leaving some users consistently at low data rates. To address this issue, we propose a novel joint optimization model for a RIS- and DF-assisted multiple-input single-output (MISO) system where a base station (BS) with multiple antennas transmits data via multiple RISs and DF relays to serve grouped users with fine-grained rate demands. We design a new loss function to not only optimize the sum rate of all groups but also adjust the satisfaction ratio of fine-grained rate demands by modifying the penalty parameter. We further propose a two-phase graph neural network (GNN) based approach that inputs channel state information (CSI) to simultaneously and autonomously learn efficient phase shifts, beamforming, and relay selection. The experimental results demonstrate that the proposed method significantly improves system performance.

Submitted 3 June, 2025; originally announced June 2025.

Comments: 14 Pages, 9 figures, accepted by IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS

arXiv:2506.00480 [pdf, ps, other]

The Coupling Effect of Sensing Targets on the Environment for 3GPP ISAC Channels: Observation, Modeling, and Validation

Authors: Yameng Liu, Jianhua Zhang, Yuxiang Zhang, Hongbo Xing, Yifeng Xiong, Zhiqiang Yuan, Guangyi Liu

Abstract: Integrated Sensing And Communication (ISAC) has been identified as a key 6G application by ITU and 3GPP, with standardization efforts already underway. Sensing tasks, such as target localization, demand more precise characterization of the sensing target (ST) in ISAC channel modeling. The ST couples complexly with environmental scatterers, potentially blocking some multipaths and generating new ones, resulting in power variations compared to the original channel. To accurately model this effect, this paper proposes a coupled ISAC channel model based on measurements and validates it through similarity analysis between simulated and measured channels. In this work, we first conduct ISAC channel measurements in an indoor factory scenario at 105 GHz, where the multipath power variations caused by the ST's interaction with the environment are clearly observed. Then, we propose an ISAC channel modeling framework that incorporates two novel parameters: the Blockage-Region Coupling Factor (BR-CF) and the Forward-Scattering (FS)-CF, which characterize the spatial region and intensity of the coupling effect, respectively. Finally, the proposed model is validated through similarity comparison with measured data, demonstrating higher accuracy for both LoS and NLoS scenarios compared to the non-coupled model. This realistic ISAC channel model provides an effective framework for capturing the ST-environment coupling effect, supporting the design and evaluation of ISAC technologies.

Submitted 31 May, 2025; originally announced June 2025.

arXiv:2506.00466 [pdf, ps, other]

M3ANet: Multi-scale and Multi-Modal Alignment Network for Brain-Assisted Target Speaker Extraction

Authors: Cunhang Fan, Ying Chen, Jian Zhou, Zexu Pan, Jingjing Zhang, Youdian Gao, Xiaoke Yang, Zhengqi Wen, Zhao Lv

Abstract: Brain-assisted target speaker extraction (TSE) aims to extract the attended speech from mixed speech by utilizing brain neural activity, for example electroencephalography (EEG). However, existing models overlook the issue of temporal misalignment between speech and EEG modalities, which hampers TSE performance. In addition, the speech encoder in current models typically uses basic temporal operations (e.g., one-dimensional convolution), which are unable to effectively extract target speaker information. To address these issues, this paper proposes a multi-scale and multi-modal alignment network (M3ANet) for brain-assisted TSE. Specifically, to eliminate the temporal inconsistency between EEG and speech modalities, a modal alignment module that uses a contrastive learning strategy is applied to align the temporal features of both modalities. Additionally, to fully extract speech information, multi-scale convolutions with GroupMamba modules are used as the speech encoder, which scans speech features at each scale from different directions, enabling the model to capture deep sequence information. Experimental results on three publicly available datasets show that the proposed model outperforms current state-of-the-art methods across various evaluation metrics, highlighting the effectiveness of our proposed method. The source code is available at: https://github.com/fchest/M3ANet.

Submitted 31 May, 2025; originally announced June 2025.

Comments: Accepted to IJCAI 2025

arXiv:2505.24576 [pdf, ps, other]

A Composite Predictive-Generative Approach to Monaural Universal Speech Enhancement

Authors: Jie Zhang, Haoyin Yan, Xiaofei Li

Abstract: It is promising to design a single model that can suppress various distortions and improve speech quality, i.e., universal speech enhancement (USE). Compared to supervised learning-based predictive methods, diffusion-based generative models have shown greater potential due to the generative capacities from degraded speech with severely damaged information. However, artifacts may be introduced in highly adverse conditions, and diffusion models often suffer from a heavy computational burden due to many steps for inference. In order to jointly leverage the superiority of prediction and generation and overcome the respective defects, in this work we propose a universal speech enhancement model called PGUSE by combining predictive and generative modeling. Our model consists of two branches: the predictive branch directly predicts clean samples from degraded signals, while the generative branch optimizes the denoising objective of diffusion models. We utilize the output fusion and truncated diffusion scheme to effectively integrate predictive and generative modeling, where the former directly combines results from both branches and the latter modifies the reverse diffusion process with initial estimates from the predictive branch. Extensive experiments on several datasets verify the superiority of the proposed model over state-of-the-art baselines, demonstrating the complementarity and benefits of combining predictive and generative modeling.

Submitted 30 May, 2025; originally announced May 2025.

Comments: Accepted by IEEE Transactions on Audio, Speech and Language Processing

arXiv:2505.22029 [pdf, ps, other]

Analysis and Evaluation of Synthetic Data Generation in Speech Dysfluency Detection

Authors: Jinming Zhang, Xuanru Zhou, Jiachen Lian, Shuhe Li, William Li, Zoe Ezzes, Rian Bogley, Lisa Wauters, Zachary Miller, Jet Vonk, Brittany Morin, Maria Gorno-Tempini, Gopala Anumanchipalli

Abstract: Speech dysfluency detection is crucial for clinical diagnosis and language assessment, but existing methods are limited by the scarcity of high-quality annotated data. Although recent advances in TTS models have enabled synthetic dysfluency generation, existing synthetic datasets suffer from unnatural prosody and limited contextual diversity. To address these limitations, we propose LLM-Dys -- the most comprehensive dysfluent speech corpus with LLM-enhanced dysfluency simulation. This dataset captures 11 dysfluency categories spanning both word and phoneme levels. Building upon this resource, we improve an end-to-end dysfluency detection framework. Experimental validation demonstrates state-of-the-art performance. All data, models, and code are open-sourced at https://github.com/Berkeley-Speech-Group/LLM-Dys.

Submitted 22 June, 2025; v1 submitted 28 May, 2025; originally announced May 2025.

Comments: Accepted by Interspeech 2025

arXiv:2505.21384 [pdf]

Label-free Super-Resolution Microvessel Color Flow Imaging with Ultrasound

Authors: Zhengchang Kou, Junhang Zhang, Chen Gong, Jie Ji, Nathiya Vaithiyalingam Chandra Sekaran, Zikai Wang, Rita J. Miller, Yaoheng Yang, Daniel Adolfo Llano, Qifa Zhou, Michael L. Oelze

Abstract: We present phase subtraction imaging (PSI), a new spatial-temporal beamforming method that enables micrometer level resolution imaging of microvessels in live animals without labels, which are microbubbles in ultrasound super-resolution imaging. Subtraction of relative phase differences between consecutive frames beamformed with mismatched apodizations is used in PSI to overcome the diffraction limit. We validated this method by imaging both the mouse brain and rabbit kidney using different ultrasound probes and scanning machines.

Submitted 27 May, 2025; originally announced May 2025.

arXiv:2505.20984 [pdf, ps, other]

Generative Image Compression by Estimating Gradients of the Rate-variable Feature Distribution

Authors: Minghao Han, Weiyi You, Jinhua Zhang, Leheng Zhang, Ce Zhu, Shuhang Gu

Abstract: While learned image compression (LIC) focuses on efficient data transmission, generative image compression (GIC) extends this framework by integrating generative modeling to produce photo-realistic reconstructed images. In this paper, we propose a novel diffusion-based generative modeling framework tailored for generative image compression. Unlike prior diffusion-based approaches that indirectly exploit diffusion modeling, we reinterpret the compression process itself as a forward diffusion path governed by stochastic differential equations (SDEs). A reverse neural network is trained to reconstruct images by reversing the compression process directly, without requiring Gaussian noise initialization. This approach achieves smooth rate adjustment and photo-realistic reconstructions with only a minimal number of sampling steps. Extensive experiments on benchmark datasets demonstrate that our method outperforms existing generative image compression approaches across a range of metrics, including perceptual distortion, statistical fidelity, and no-reference quality assessments.

Submitted 27 May, 2025; originally announced May 2025.

arXiv:2505.20673 [pdf, other]

A Unified RCS Modeling of Typical Targets for 3GPP ISAC Channel Standardization and Experimental Analysis

Authors: Yuxiang Zhang, Jianhua Zhang, Xidong Hu, Jiwei Zhang, Hongbo Xing, Huiwen Gong, Shilin Luo, Yifeng Xiong, Li Yu, Zhiqing Yuan, Guangyi Liu, Tao Jiang

Abstract: Accurate radar cross section (RCS) modeling is crucial for characterizing target scattering and improving the precision of Integrated Sensing and Communication (ISAC) channel modeling. Existing RCS models are typically designed for specific target types, leading to increased complexity and lack of generalization. This makes it difficult to standardize RCS models for 3GPP ISAC channels, which need to account for multiple typical target types simultaneously. Furthermore, 3GPP models must support both system-level and link-level simulations, requiring the integration of large-scale and small-scale scattering characteristics. To address these challenges, this paper proposes a unified RCS modeling framework that consolidates these two aspects. The model decomposes RCS into three components: (1) a large-scale power factor representing overall scattering strength, (2) a small-scale angular-dependent component describing directional scattering, and (3) a random component accounting for variations across target instances. We validate the model through mono-static RCS measurements for UAV, human, and vehicle targets across five frequency bands. The results demonstrate that the proposed model can effectively capture RCS variations for different target types. Finally, the model is incorporated into an ISAC channel simulation platform to assess the impact of target RCS characteristics on path loss, delay spread, and angular spread, providing valuable insights for future ISAC system design.

Submitted 26 May, 2025; originally announced May 2025.

Comments: 13 pages, 12 figures, 39 conferences, submitted to IEEE Journal on Selected Areas in Communications

arXiv:2505.20424 [pdf, ps, other]

Robot Operation of Home Appliances by Reading User Manuals

Authors: Jian Zhang, Hanbo Zhang, Anxing Xiao, David Hsu

Abstract: Operating home appliances, among the most common tools in every household, is a critical capability for assistive home robots. This paper presents ApBot, a robot system that operates novel household appliances by "reading" their user manuals. ApBot faces multiple challenges: (i) infer goal-conditioned partial policies from their unstructured, textual descriptions in a user manual document, (ii) ground the policies to the appliance in the physical world, and (iii) execute the policies reliably over potentially many steps, despite compounding errors. To tackle these challenges, ApBot constructs a structured, symbolic model of an appliance from its manual, with the help of a large vision-language model (VLM). It grounds the symbolic actions visually to control panel elements. Finally, ApBot closes the loop by updating the model based on visual feedback. Our experiments show that across a wide range of simulated and real-world appliances, ApBot achieves consistent and statistically significant improvements in task success rate, compared with state-of-the-art large VLMs used directly as control policies. These results suggest that a structured internal representation plays an important role in robust robot operation of home appliances, especially complex ones.

Submitted 26 May, 2025; originally announced May 2025.

arXiv:2505.19539 [pdf, ps, other]

Water Level Sensing via Communication Signals in a Bi-Static System

Authors: Zhongqin Wang, J. Andrew Zhang, Kai Wu, Y. Jay Guo

Abstract: Accurate water level sensing is essential for flood monitoring, agricultural irrigation, and water resource optimization. Traditional methods require dedicated sensor deployments, leading to high installation costs, vulnerability to interference, and limited resolution. This work proposes PMNs-WaterSense, a novel scheme leveraging Channel State Information (CSI) from existing mobile networks for water level sensing. Our scheme begins with a CSI-power method to eliminate phase offsets caused by clock asynchrony in bi-static systems. We then apply multi-domain filtering across the time (Doppler), frequency (delay), and spatial (Angle-of-Arrival, AoA) domains to extract phase features that finely capture variations in path length over water. To resolve the $2π$ phase ambiguity, we introduce a Kalman filter-based unwrapping technique. Additionally, we exploit transceiver geometry to convert path length variations into water level height changes, even with limited antenna configurations. We validate our framework through controlled experiments with 28 GHz mmWave and 3.1 GHz LTE signals in real time, achieving average height estimation errors of 0.025 cm and 0.198 cm, respectively. Moreover, real-world river monitoring with 2.6 GHz LTE signals achieves an average error of 4.8 cm for a 1-meter water level change, demonstrating its effectiveness in practical deployments.

Submitted 26 May, 2025; originally announced May 2025.

arXiv:2505.17426 [pdf, ps, other]

UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information

Authors: Rui Wang, Qianguo Sun, Tianrong Chen, Zhiyun Zeng, Junlong Wu, Jiaxing Zhang

Abstract: The emergence of multi-codebook neural audio codecs such as Residual Vector Quantization (RVQ) and Group Vector Quantization (GVQ) has significantly advanced Large-Language-Model (LLM) based Text-to-Speech (TTS) systems. These codecs are crucial in separating semantic and acoustic information while efficiently harnessing semantic priors. However, since semantic and acoustic information cannot be fully aligned, a significant drawback of these methods when applied to LLM-based TTS is that large language models may have limited access to comprehensive audio information. To address this limitation, we propose DistilCodec and UniTTS, which collectively offer the following advantages: 1) This method can distill a multi-codebook audio codec into a single-codebook audio codec with 32,768 codes while achieving a near 100\% utilization. 2) As DistilCodec does not employ a semantic alignment scheme, a large amount of high-quality unlabeled audio (such as audiobooks with sound effects, songs, etc.) can be incorporated during training, further expanding data diversity and broadening its applicability. 3) Leveraging the comprehensive audio information modeling of DistilCodec, we integrated three key tasks into UniTTS's pre-training framework: audio modality autoregression, text modality autoregression, and speech-text cross-modal autoregression. This allows UniTTS to accept interleaved text and speech/audio prompts while substantially preserving LLM's text capabilities.
4) UniTTS employs a three-stage training process: Pre-Training, Supervised Fine-Tuning (SFT), and Alignment. Source code and model checkpoints are publicly available at https://github.com/IDEA-Emdoor-Lab/UniTTS and https://github.com/IDEA-Emdoor-Lab/DistilCodec.

Submitted 22 May, 2025; originally announced May 2025.

arXiv:2505.17421

Adaptive Implicit-Based Deep Learning Channel Estimation for 6G Communications

Authors: Zhen Qiao, Jiang Xue, Junkai Zhang, Guanzhang Liu, Xiaoqin Ma, Runhua Li, Faheem A. Khan, John S. Thompson, Zongben Xu

Abstract: With the widespread deployment of fifth-generation (5G) wireless networks, research on sixth-generation (6G) technology is gaining momentum. Artificial Intelligence (AI) is anticipated to play a significant role in 6G, particularly through integration with the physical layer for tasks such as channel estimation. Considering resource limitations in real systems, the AI algorithm should be designed to dynamically balance accuracy and resource consumption according to the scenario. However, conventional explicit multilayer-stacked Deep Learning (DL) models struggle to adapt due to their heavy reliance on the structure of deep neural networks. This article proposes an adaptive Implicit-layer DL Channel Estimation Network (ICENet) with a lightweight framework for vehicle-to-everything communications. This novel approach balances computational complexity and channel estimation accuracy by dynamically adjusting computational resources based on input data conditions, such as channel quality. Unlike explicit multilayer-stacked DL-based channel estimation models, ICENet offers a flexible framework, where specific requirements can be achieved by adaptively changing the number of iterations of the iterative layer. Meanwhile, ICENet requires less memory while maintaining high performance. The article concludes by highlighting open research challenges and promising future research directions.

Submitted 22 May, 2025; originally announced May 2025.

arXiv:2505.16369

X-ARES: A Comprehensive Framework for Assessing Audio Encoder Performance

Authors: Junbo Zhang, Heinrich Dinkel, Yadong Niu, Chenyu Liu, Si Cheng, Anbei Zhao, Jian Luan

Abstract: We introduce X-ARES (eXtensive Audio Representation and Evaluation Suite), a novel open-source benchmark designed to systematically assess audio encoder performance across diverse domains. By encompassing tasks spanning speech, environmental sounds, and music, X-ARES provides two approaches for evaluating audio representations: linear fine-tuning and unparameterized evaluation. The framework includes 22 distinct tasks that cover essential aspects of audio processing, from speech recognition and emotion detection to sound event classification and music genre identification. Our extensive evaluation of state-of-the-art audio encoders reveals significant performance variations across different tasks and domains, highlighting the complexity of general audio representation learning.

Submitted 27 May, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

Comments: Accepted by Interspeech 2025

arXiv:2505.16351

Dysfluent WFST: A Framework for Zero-Shot Speech Dysfluency Transcription and Detection

Authors: Chenxu Guo, Jiachen Lian, Xuanru Zhou, Jinming Zhang, Shuhe Li, Zongli Ye, Hwi Joo Park, Anaisha Das, Zoe Ezzes, Jet Vonk, Brittany Morin, Rian Bogley, Lisa Wauters, Zachary Miller, Maria Gorno-Tempini, Gopala Anumanchipalli

Abstract: Automatic detection of speech dysfluency aids speech-language pathologists in efficient transcription of disordered speech, enhancing diagnostics and treatment planning. Traditional methods, often limited to classification, provide insufficient clinical insight, and text-independent models misclassify dysfluency, especially in context-dependent cases. This work introduces Dysfluent-WFST, a zero-shot decoder that simultaneously transcribes phonemes and detects dysfluency. Unlike previous models, Dysfluent-WFST operates with upstream encoders like WavLM and requires no additional training. It achieves state-of-the-art performance in both phonetic error rate and dysfluency detection on simulated and real speech data. Our approach is lightweight, interpretable, and effective, demonstrating that explicit modeling of pronunciation behavior in decoding, rather than complex architectures, is key to improving dysfluency processing systems.

Submitted 24 May, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

Comments: Accepted for Interspeech 2025

arXiv:2505.16168

Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty

Authors: Hongfei Xue, Yufeng Tang, Jun Zhang, Xuelong Geng, Lei Xie

Abstract: Although multilingual automatic speech recognition (ASR) systems have significantly advanced, enabling a single model to handle multiple languages, inherent linguistic differences and data imbalances challenge SOTA performance across all languages. While language identification (LID) models can route speech to the appropriate ASR model, they incur high costs from invoking SOTA commercial models and suffer from inaccuracies due to misclassification. To overcome these limitations, we propose SIMA, a selective invocation approach for multilingual ASR that adapts to the difficulty level of the input speech. Built on a spoken large language model (SLLM), SIMA evaluates whether the input is simple enough for direct transcription or requires the invocation of a SOTA ASR model. Our approach reduces word error rates by 18.7% compared to the SLLM and halves invocation costs compared to LID-based methods. Tests on three datasets show that SIMA is a scalable, cost-effective solution for multilingual ASR applications.

Submitted 21 May, 2025; originally announced May 2025.

Comments: Accepted by INTERSPEECH 2025

Showing 1–50 of 1,469 results for author: Zhang, J