Search | arXiv e-print repository

Self-supervised Deep Learning for Denoising in Ultrasound Microvascular Imaging

Authors: Lijie Huang, Jingyi Yin, Jingke Zhang, U-Wai Lok, Ryan M. DeRuiter, Jieyang Jin, Kate M. Knoll, Kendra E. Petersen, James D. Krier, Xiang-yang Zhu, Gina K. Hesley, Kathryn A. Robinson, Andrew J. Bentall, Thomas D. Atwell, Andrew D. Rule, Lilach O. Lerman, Shigao Chen, Chengwu Huang

Abstract: Ultrasound microvascular imaging (UMI) is often hindered by low signal-to-noise ratio (SNR), especially in contrast-free or deep tissue scenarios, which impairs subsequent vascular quantification and reliable disease diagnosis. To address this challenge, we propose Half-Angle-to-Half-Angle (HA2HA), a self-supervised denoising framework specifically designed for UMI. HA2HA constructs training pairs from complementary angular subsets of beamformed radio-frequency (RF) blood flow data, across which vascular signals remain consistent while noise varies. HA2HA was trained using in-vivo contrast-free pig kidney data and validated across diverse datasets, including contrast-free and contrast-enhanced data from pig kidneys, as well as human liver and kidney. An improvement exceeding 15 dB in both contrast-to-noise ratio (CNR) and SNR was observed, indicating a substantial enhancement in image quality. In addition to power Doppler imaging, denoising directly in the RF domain is also beneficial for other downstream processing such as color Doppler imaging (CDI). CDI results of human liver derived from the HA2HA-denoised signals exhibited improved microvascular flow visualization, with a suppressed noisy background. HA2HA offers a label-free, generalizable, and clinically applicable solution for robust vascular imaging in both contrast-free and contrast-enhanced UMI.

Submitted 7 July, 2025; originally announced July 2025.

Comments: 12 pages, 10 figures. Supplementary materials are available at https://zenodo.org/records/15832003
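
As a rough illustration of the pairing idea in the HA2HA abstract above, the sketch below splits the steering angles of one frame into two complementary halves and compounds each half separately. The even/odd split, the function name `split_half_angles`, the data layout, and the toy values are all assumptions made for illustration; the abstract does not say how the angular subsets are chosen.

```python
def split_half_angles(per_angle):
    """Compound two complementary angle subsets into two images.

    per_angle: list with one entry per steering angle, each a list of
    complex pixel values of equal length (hypothetical layout).
    Returns (half_a, half_b): means over even- and odd-indexed angles.
    """
    def compound(subset):
        n = len(subset)
        # zip(*subset) groups the same pixel across all angles in the subset.
        return [sum(pixels) / n for pixels in zip(*subset)]

    even = per_angle[0::2]  # assumption: even-indexed angles form one half
    odd = per_angle[1::2]   # assumption: odd-indexed angles form the other
    return compound(even), compound(odd)


# Toy frame: 4 angles, 3 pixels; same signal, different additive "noise".
signal = [1 + 0j, 2 + 0j, 3 + 0j]
noise = [[0.1, -0.1, 0.0], [-0.1, 0.1, 0.0], [0.2, 0.0, -0.2], [-0.2, 0.0, 0.2]]
frame = [[s + n for s, n in zip(signal, row)] for row in noise]
half_a, half_b = split_half_angles(frame)
```

In a Noise2Noise-style setup, one half could plausibly serve as the network input and the other as the target, since both share the signal but carry different noise; whether HA2HA trains exactly this way is not stated in the abstract.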

arXiv:2507.01445 [pdf, ps, other]

Basis Expansion Extrapolation based Long-Term Channel Prediction for Massive MIMO OTFS Systems

Authors: Yanfeng Zhang, Xu Zhu, Yujie Liu, Yong Liang Guan, David González G., Vincent K. N. Lau

Abstract: Massive multi-input multi-output (MIMO) combined with orthogonal time frequency space (OTFS) modulation has emerged as a promising technique for high-mobility scenarios. However, its performance could be severely degraded due to channel aging caused by user mobility and high processing latency. In this paper, an integrated scheme of uplink (UL) channel estimation and downlink (DL) channel prediction is proposed to alleviate channel aging in time division duplex (TDD) massive MIMO-OTFS systems. Specifically, first, an iterative basis expansion model (BEM) based UL channel estimation scheme is proposed to accurately estimate UL channels with the aid of carefully designed OTFS frame pattern. Then a set of Slepian sequences are used to model the estimated UL channels, and the dynamic Slepian coefficients are fitted by a set of orthogonal polynomials. A channel predictor is derived to predict DL channels by iteratively extrapolating the Slepian coefficients. Simulation results verify that the proposed UL channel estimation and DL channel prediction schemes outperform the existing schemes in terms of normalized mean square error of channel estimation/prediction and DL spectral efficiency, with less pilot overhead.

Submitted 2 July, 2025; originally announced July 2025.

arXiv:2506.19181 [pdf, ps, other]

VHU-Net: Variational Hadamard U-Net for Body MRI Bias Field Correction

Authors: Xin Zhu, Ahmet Enis Cetin, Gorkem Durak, Batuhan Gundogdu, Ziliang Hong, Hongyi Pan, Ertugrul Aktas, Elif Keles, Hatice Savas, Aytekin Oto, Hiten Patel, Adam B. Murphy, Ashley Ross, Frank Miller, Baris Turkbey, Ulas Bagci

Abstract: Bias field artifacts in magnetic resonance imaging (MRI) scans introduce spatially smooth intensity inhomogeneities that degrade image quality and hinder downstream analysis. To address this challenge, we propose a novel variational Hadamard U-Net (VHU-Net) for effective body MRI bias field correction. The encoder comprises multiple convolutional Hadamard transform blocks (ConvHTBlocks), each integrating convolutional layers with a Hadamard transform (HT) layer. Specifically, the HT layer performs channel-wise frequency decomposition to isolate low-frequency components, while a subsequent scaling layer and semi-soft thresholding mechanism suppress redundant high-frequency noise. To compensate for the HT layer's inability to model inter-channel dependencies, the decoder incorporates an inverse HT-reconstructed transformer block, enabling global, frequency-aware attention for the recovery of spatially consistent bias fields. The stacked decoder ConvHTBlocks further enhance the capacity to reconstruct the underlying ground-truth bias field. Building on the principles of variational inference, we formulate a new evidence lower bound (ELBO) as the training objective, promoting sparsity in the latent space while ensuring accurate bias field estimation. Comprehensive experiments on abdominal and prostate MRI datasets demonstrate the superiority of VHU-Net over existing state-of-the-art methods in terms of intensity uniformity, signal fidelity, and tissue contrast. Moreover, the corrected images yield substantial downstream improvements in segmentation accuracy. Our framework offers computational efficiency, interpretability, and robust performance across multi-center datasets, making it suitable for clinical deployment.

Submitted 2 July, 2025; v1 submitted 23 June, 2025; originally announced June 2025.
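
To make the "Hadamard transform plus thresholding" idea in the VHU-Net abstract above more concrete, here is a minimal sketch on a 1-D signal. It is not the paper's layer: the abstract does not define its semi-soft thresholding or scaling, so plain soft thresholding is used as a stand-in, the DC coefficient is left untouched by choice, and the names `fwht` and `soft_threshold` are hypothetical.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (length must be a power of 2)."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    return out


def soft_threshold(coeffs, tau):
    """Shrink each coefficient toward zero by tau (stand-in for semi-soft)."""
    return [max(abs(c) - tau, 0.0) * (1 if c >= 0 else -1) for c in coeffs]


# A flat signal with a small +1/-1 ripple at two positions.
x = [4.0, 4.0, 4.0, 4.0, 5.0, 3.0, 4.0, 4.0]
coeffs = fwht(x)
# Keep the DC term as is; shrink all other coefficients (illustrative choice).
kept = [coeffs[0]] + soft_threshold(coeffs[1:], tau=2.5)
# The Hadamard matrix is its own inverse up to a factor 1/n.
recon = [v / len(x) for v in fwht(kept)]
```

With this threshold every non-DC coefficient is zeroed, so `recon` is the flat signal with the ripple removed; a learned layer would instead tune the scaling and thresholds during training.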

arXiv:2506.15191 [pdf]

Islanding Strategy for Smart Grids Oriented to Resilience Enhancement and Its Power Supply Range Optimization

Authors: Yanhong Luo, Wenchao Meng, Xi Zhu, Andreas Elombo, Hu Rong, Bing Xie, Tianwen Zhang

Abstract: With the increasing prevalence of distributed generators, islanded operation based on distributed generation is considered a vital means to enhance the reliability and resilience of smart grids. This paper investigates the main factors in islanding partition of smart grids and establishes a mathematical model for islanding division. A method to determine the maximum power supply range of distributed energy resources (DERs) based on the reachability matrix and power circle algorithm is proposed to improve computational efficiency. A dynamic programming method based on breadth-first search (BFS) is used to solve the islanding partition scheme, and a region correction method is applied to modify the maximum power supply area by considering controllable loads and prioritizing critical load restoration, thereby enhancing system resilience. Finally, simulation results verify the effectiveness of the proposed algorithm in improving smart grid resilience.

Submitted 18 June, 2025; originally announced June 2025.

arXiv:2506.13291 [pdf, ps, other]

Aggregating Inverter-Based Resources for Fast Frequency Response: A Nash Bargaining Game-Based Approach

Authors: Xiang Zhu, Hua Geng, Hongyang Qing, Xin Zou

Abstract: This paper proposes a multi-objective optimization (MOO) approach for grid-level frequency regulation by aggregating inverter-based resources (IBRs). Virtual power plants (VPPs), acting as aggregators, can efficiently respond to dynamic response requirements from the grid. Through parametric modeling, grid-level frequency regulation requirements are accurately quantified and translated into a feasible parameter region defined by device-level parameters. Based on this feasible region, an MOO model is developed to address the conflicting demands of IBRs during frequency response. A Nash bargaining game-based approach is then employed to optimally allocate regulation requirements within the VPP, balancing the various demands of the IBRs. Numerical experiments demonstrate the effectiveness of the proposed method in enhancing frequency stability and improving coordination among IBRs.

Submitted 16 June, 2025; originally announced June 2025.

Comments: Accepted by the 2025 IEEE IAS Annual Meeting

arXiv:2506.11496 [pdf, ps, other]

Taming Stable Diffusion for Computed Tomography Blind Super-Resolution

Authors: Chunlei Li, Yilei Shi, Haoxi Hu, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou

Abstract: High-resolution computed tomography (CT) imaging is essential for medical diagnosis but requires increased radiation exposure, creating a critical trade-off between image quality and patient safety. While deep learning methods have shown promise in CT super-resolution, they face challenges with complex degradations and limited medical training data. Meanwhile, large-scale pre-trained diffusion models, particularly Stable Diffusion, have demonstrated remarkable capabilities in synthesizing fine details across various vision tasks. Motivated by this, we propose a novel framework that adapts Stable Diffusion for CT blind super-resolution. We employ a practical degradation model to synthesize realistic low-quality images and leverage a pre-trained vision-language model to generate corresponding descriptions. Subsequently, we perform super-resolution using Stable Diffusion with a specialized controlling strategy, conditioned on both low-resolution inputs and the generated text descriptions. Extensive experiments show that our method outperforms existing approaches, demonstrating its potential for achieving high-quality CT imaging at reduced radiation doses. Our code will be made publicly available.

Submitted 13 June, 2025; originally announced June 2025.

arXiv:2506.07876 [pdf, ps, other]

Versatile Loco-Manipulation through Flexible Interlimb Coordination

Authors: Xinghao Zhu, Yuxin Chen, Lingfeng Sun, Farzad Niroui, Simon Le Cleac'h, Jiuguang Wang, Kuan Fang

Abstract: The ability to flexibly leverage limbs for loco-manipulation is essential for enabling autonomous robots to operate in unstructured environments. Yet, prior work on loco-manipulation is often constrained to specific tasks or predetermined limb configurations. In this work, we present Reinforcement Learning for Interlimb Coordination (ReLIC), an approach that enables versatile loco-manipulation through flexible interlimb coordination. The key to our approach is an adaptive controller that seamlessly bridges the execution of manipulation motions and the generation of stable gaits based on task demands. Through the interplay between two controller modules, ReLIC dynamically assigns each limb for manipulation or locomotion and robustly coordinates them to achieve the task success. Using efficient reinforcement learning in simulation, ReLIC learns to perform stable gaits in accordance with the manipulation goals in the real world. To solve diverse and complex tasks, we further propose to interface the learned controller with different types of task specifications, including target trajectories, contact points, and natural language instructions. Evaluated on 12 real-world tasks that require diverse and complex coordination patterns, ReLIC demonstrates its versatility and robustness by achieving a success rate of 78.9% on average. Videos and code can be found at https://relic-locoman.rai-inst.com.

Submitted 10 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.03445 [pdf, ps, other]

Maximum Likelihood for Logistic Regression Model with Incomplete and Hybrid-Type Covariates

Authors: Mohamed Cherifi, Xujia Zhu, Mohammed Nabil El Korso, Ammar Mesloub

Abstract: Logistic regression is a fundamental and widely used statistical method for modeling binary outcomes based on covariates. However, the presence of missing data, particularly in settings involving hybrid covariates (a mix of discrete and continuous variables), poses significant challenges. In this paper, we propose a novel Expectation-Maximization based algorithm tailored for parameter estimation in logistic regression models with missing hybrid covariates. The proposed method is specifically designed to handle these complexities, delivering efficient parameter estimates. Through comprehensive simulations and real-world application, we demonstrate that our approach consistently outperforms traditional methods, achieving superior accuracy and reliability.

Submitted 3 June, 2025; originally announced June 2025.

Comments: 20 pages, 4 figures. To appear in IEEE Signal Processing Letters

MSC Class: 62F10; 62J12; 62H30

arXiv:2505.22063 [pdf, ps, other]

Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR

Authors: Mingchen Shao, Xinfa Zhu, Chengyou Wang, Bingshen Mu, Hai Li, Ying Yan, Junhui Liu, Danming Xie, Lei Xie

Abstract: Despite remarkable achievements, automatic speech recognition (ASR) in low-resource scenarios still faces two challenges: high-quality data scarcity and high computational demands. This paper proposes EThai-ASR, the first to apply large language models (LLMs) to Thai ASR and create an efficient LLM-based ASR system. EThai-ASR comprises a speech encoder, a connection module and a Thai LLM decoder. To address the data scarcity and obtain a powerful speech encoder, EThai-ASR introduces a self-evolving data refinement strategy to refine weak labels, yielding an enhanced speech encoder. Moreover, we propose a pluggable sequence compression module used in the connection module with three modes designed to reduce the sequence length, thus decreasing computational demands while maintaining decent performance. Extensive experiments demonstrate that EThai-ASR has achieved state-of-the-art accuracy in multiple datasets. We release our refined text transcripts to promote further research.

Submitted 28 May, 2025; originally announced May 2025.

Comments: Accepted by INTERSPEECH 2025

arXiv:2505.19476 [pdf, ps, other]

FlowSE: Efficient and High-Quality Speech Enhancement via Flow Matching

Authors: Ziqian Wang, Zikai Liu, Xinfa Zhu, Yike Zhu, Mingshuai Liu, Jun Chen, Longshuai Xiao, Chao Weng, Lei Xie

Abstract: Generative models have excelled in audio tasks using approaches such as language models, diffusion, and flow matching. However, existing generative approaches for speech enhancement (SE) face notable challenges: language model-based methods suffer from quantization loss, leading to compromised speaker similarity and intelligibility, while diffusion models require complex training and high inference latency. To address these challenges, we propose FlowSE, a flow-matching-based model for SE. Flow matching learns a continuous transformation between noisy and clean speech distributions in a single pass, significantly reducing inference latency while maintaining high-quality reconstruction. Specifically, FlowSE trains on noisy mel spectrograms and optional character sequences, optimizing a conditional flow matching loss with ground-truth mel spectrograms as supervision. It implicitly learns speech's temporal-spectral structure and text-speech alignment. During inference, FlowSE can operate with or without textual information, achieving impressive results in both scenarios, with further improvements when transcripts are available. Extensive experiments demonstrate that FlowSE significantly outperforms state-of-the-art generative methods, establishing a new paradigm for generative-based SE and demonstrating the potential of flow matching to advance the field. Our code, pre-trained checkpoints, and audio samples are available.

Submitted 27 May, 2025; v1 submitted 25 May, 2025; originally announced May 2025.

Comments: Accepted to InterSpeech 2025

arXiv:2505.13880 [pdf, ps, other]

U-SAM: An audio language Model for Unified Speech, Audio, and Music Understanding

Authors: Ziqian Wang, Xianjun Xia, Xinfa Zhu, Lei Xie

Abstract: The text generation paradigm for audio tasks has opened new possibilities for unified audio understanding. However, existing models face significant challenges in achieving a comprehensive understanding across diverse audio types, such as speech, general audio events, and music. Furthermore, their exclusive reliance on cross-entropy loss for alignment often falls short, as it treats all tokens equally and fails to account for redundant audio features, leading to weaker cross-modal alignment. To deal with the above challenges, this paper introduces U-SAM, an advanced audio language model that integrates specialized encoders for speech, audio, and music with a pre-trained large language model (LLM). U-SAM employs a Mixture of Experts (MoE) projector for task-aware feature fusion, dynamically routing and integrating the domain-specific encoder outputs. Additionally, U-SAM incorporates a Semantic-Aware Contrastive Loss Module, which explicitly identifies redundant audio features under language supervision and rectifies their semantic and spectral representations to enhance cross-modal alignment. Extensive experiments demonstrate that U-SAM consistently outperforms both specialized models and existing audio language models across multiple benchmarks. Moreover, it exhibits emergent capabilities on unseen tasks, showcasing its generalization potential. Code is available (https://github.com/Honee-W/U-SAM/).

Submitted 27 May, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

Comments: Accepted to Interspeech 2025

arXiv:2505.12740 [pdf, ps, other]

Multi-Reference and Adaptive Nonlinear Transform Source-Channel Coding for Wireless Image Semantic Transmission

Authors: Cheng Yuan, Yufei Jiang, Xu Zhu

Abstract: We propose a multi-reference and adaptive nonlinear transform source-channel coding (MA-NTSCC) system for wireless image semantic transmission to improve rate-distortion (RD) performance by introducing multi-dimensional contexts into the entropy model of the state-of-the-art (SOTA) NTSCC system. Improvements in RD performance of the proposed MA-NTSCC system are particularly significant in high-resolution image transmission under low bandwidth constraints. The proposed multi-reference entropy model leverages correlations within the latent representation in both spatial and channel dimensions. In the spatial dimension, the latent representation is divided into anchors and non-anchors in a checkerboard pattern, where anchors serve as reference to estimate the mutual information between anchors and non-anchors. In the channel dimension, the latent representation is partitioned into multiple groups, and features in previous groups are analyzed to estimate the mutual information between features in previous and current groups. Taking mutual information into account, the entropy model provides an accurate estimation on the entropy, which enables efficient bandwidth allocation and enhances RD performance. Additionally, the proposed lightweight adaptation modules enable the proposed MA-NTSCC model to achieve transmission quality comparable to separately trained models across various channel conditions and bandwidth requirements. In contrast, traditional NTSCC models provide signal-to-noise ratio (SNR)-distortion performance degrading with channel quality deviating from the fixed training SNR, and consume inflexible bandwidth to transmit an image. Comprehensive experiments are conducted to verify the peak signal-to-noise ratio (PSNR) performance and adaptability of the proposed MA-NTSCC model superior to SOTA methods over both additive white Gaussian noise channel and Rayleigh fading channel.

Submitted 19 May, 2025; originally announced May 2025.
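
The checkerboard anchor/non-anchor split mentioned in the MA-NTSCC abstract above can be sketched as follows. This is only an illustration of the spatial partition: the function name `checkerboard_split` and the choice of which parity counts as "anchor" are assumptions, and the entropy model itself is not reproduced.

```python
def checkerboard_split(height, width):
    """Return (anchors, non_anchors) as lists of (row, col) positions."""
    anchors, non_anchors = [], []
    for r in range(height):
        for c in range(width):
            # Assumption: positions with even (row + col) are the anchors.
            (anchors if (r + c) % 2 == 0 else non_anchors).append((r, c))
    return anchors, non_anchors


anchors, non_anchors = checkerboard_split(4, 4)
```

With this layout every non-anchor position has only anchors as its horizontal and vertical neighbours, which is what lets anchors act as spatial context for the remaining positions.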

arXiv:2505.09980 [pdf, ps, other]

Event-Triggered Synergistic Controllers with Dwell-Time Transmission

Authors: Xuanzhi Zhu, Pedro Casau, Carlos Silvestre

Abstract: We propose novel event-triggered synergistic controllers for nonlinear continuous-time plants by incorporating event-triggered control into stabilizing synergistic controllers. We highlight that a naive application of common event-triggering conditions may not ensure dwell-time transmission due to the joint jumping dynamics of the closed-loop system. Under mild conditions, we develop a suite of event-triggered synergistic controllers that guarantee both dwell-time transmission and global asymptotic stability. Through numerical simulations, we demonstrate the effectiveness of our controller applied to the problem of rigid body attitude stabilization.

Submitted 15 May, 2025; originally announced May 2025.

Comments: 9 pages, 2 figures, 1 table

arXiv:2505.01476 [pdf, other]

CostFilter-AD: Enhancing Anomaly Detection through Matching Cost Filtering

Authors: Zhe Zhang, Mingxiu Cai, Hanxiao Wang, Gaochang Wu, Tianyou Chai, Xiatian Zhu

Abstract: Unsupervised anomaly detection (UAD) seeks to localize the anomaly mask of an input image with respect to normal samples. Either by reconstructing normal counterparts (reconstruction-based) or by learning an image feature embedding space (embedding-based), existing approaches fundamentally rely on image-level or feature-level matching to derive anomaly scores. Often, such a matching process is inaccurate yet overlooked, leading to sub-optimal detection. To address this issue, we introduce the concept of cost filtering, borrowed from classical matching tasks, such as depth and flow estimation, into the UAD problem. We call this approach CostFilter-AD. Specifically, we first construct a matching cost volume between the input and normal samples, comprising two spatial dimensions and one matching dimension that encodes potential matches. To refine this, we propose a cost volume filtering network, guided by the input observation as an attention query across multiple feature layers, which effectively suppresses matching noise while preserving edge structures and capturing subtle anomalies. Designed as a generic post-processing plug-in, CostFilter-AD can be integrated with either reconstruction-based or embedding-based methods. Extensive experiments on MVTec-AD and VisA benchmarks validate the generic benefits of CostFilter-AD for both single- and multi-class UAD tasks. Code and models will be released at https://github.com/ZHE-SAPI/CostFilter-AD.

Submitted 23 May, 2025; v1 submitted 2 May, 2025; originally announced May 2025.

Comments: 25 pages, 12 figures, 20 tables, accepted by Forty-Second International Conference on Machine Learning (ICML 2025), link: https://icml.cc/virtual/2025/poster/46359

arXiv:2505.01212 [pdf, other]

High Dynamic Range Novel View Synthesis with Single Exposure

Authors: Kaixuan Zhang, Hu Wang, Minxian Li, Mingwu Ren, Mao Ye, Xiatian Zhu

Abstract: High Dynamic Range Novel View Synthesis (HDR-NVS) aims to establish a 3D scene HDR model from Low Dynamic Range (LDR) imagery. Typically, multiple-exposure LDR images are employed to capture a wider range of brightness levels in a scene, as a single LDR image cannot represent both the brightest and darkest regions simultaneously. While effective, this multiple-exposure HDR-NVS approach has significant limitations, including susceptibility to motion artifacts (e.g., ghosting and blurring), high capture and storage costs. To overcome these challenges, we introduce, for the first time, the single-exposure HDR-NVS problem, where only single exposure LDR images are available during training. We further introduce a novel approach, Mono-HDR-3D, featuring two dedicated modules formulated by the LDR image formation principles, one for converting LDR colors to HDR counterparts, and the other for transforming HDR images to LDR format so that unsupervised learning is enabled in a closed loop. Designed as a meta-algorithm, our approach can be seamlessly integrated with existing NVS models. Extensive experiments show that Mono-HDR-3D significantly outperforms previous methods. Source code will be released.

Submitted 19 May, 2025; v1 submitted 2 May, 2025; originally announced May 2025.

Comments: It has been accepted by ICML 2025

arXiv:2504.07703 [pdf, other]

Optimal Frequency Support from Virtual Power Plants: Minimal Reserve and Allocation

Authors: Xiang Zhu, Guangchun Ruan, Hua Geng

Abstract: This paper proposes a novel reserve-minimizing and allocation strategy for virtual power plants (VPPs) to deliver optimal frequency support. The proposed strategy enables VPPs, acting as aggregators for inverter-based resources (IBRs), to provide optimal frequency support economically. The proposed strategy captures time-varying active power injections, reducing the unnecessary redundancy compared to traditional fixed reserve schemes. Reserve requirements for the VPPs are determined based on system frequency response and safety constraints, ensuring efficient grid support. Furthermore, an energy-based allocation model decomposes power injections for each IBR, accounting for their specific limitations. Numerical experiments validate the feasibility of the proposed approach, highlighting significant financial gains for VPPs, especially as system inertia decreases due to higher renewable energy integration.

Submitted 10 April, 2025; originally announced April 2025.

Comments: Accepted by Applied Energy

arXiv:2504.01025 [pdf]

Diagnosis of Pulmonary Hypertension by Integrating Multimodal Data with a Hybrid Graph Convolutional and Transformer Network

Authors: Fubao Zhu, Yang Zhang, Gengmin Liang, Jiaofen Nan, Yanting Li, Chuang Han, Danyang Sun, Zhiguo Wang, Chen Zhao, Wenxuan Zhou, Jian He, Yi Xu, Iokfai Cheang, Xu Zhu, Yanli Zhou, Weihua Zhou

Abstract: Early and accurate diagnosis of pulmonary hypertension (PH) is essential for optimal patient management. Differentiating between pre-capillary and post-capillary PH is critical for guiding treatment decisions. This study develops and validates a deep learning-based diagnostic model for PH, designed to classify patients as non-PH, pre-capillary PH, or post-capillary PH. This retrospective study analyzed data from 204 patients (112 with pre-capillary PH, 32 with post-capillary PH, and 60 non-PH controls) at the First Affiliated Hospital of Nanjing Medical University. Diagnoses were confirmed through right heart catheterization. We selected 6 samples from each category for the test set (18 samples, 10%), with the remaining 186 samples used for the training set. This process was repeated 35 times for testing. This paper proposes a deep learning model that combines graph convolutional networks (GCN), convolutional neural networks (CNN), and Transformers. The model was developed to process multimodal data, including short-axis (SAX) sequences, four-chamber (4CH) sequences, and clinical parameters. Our model achieved an area under the receiver operating characteristic curve (AUC) of 0.81 ± 0.06 (standard deviation) and an accuracy (ACC) of 0.73 ± 0.06 on the test set. The discriminative abilities were as follows: non-PH subjects (AUC = 0.74 ± 0.11), pre-capillary PH (AUC = 0.86 ± 0.06), and post-capillary PH (AUC = 0.83 ± 0.10). The model has the potential to support clinical decision-making by effectively integrating multimodal data to assist physicians in making accurate and timely diagnoses.
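
The evaluation protocol in this abstract (6 test samples per category, repeated 35 times) can be illustrated with a minimal Python sketch. It assumes the test samples are drawn uniformly at random within each category, which the abstract does not state. The function name `repeated_holdout`, the label strings, and the seed are hypothetical and are not taken from the paper.

```python
import random


def repeated_holdout(labels, per_class=6, repeats=35, seed=0):
    """Build repeated class-balanced hold-out splits.

    Assumption: test samples are drawn at random within each class;
    the abstract does not say how they were selected.
    Returns a list of (train_indices, test_indices) pairs.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    splits = []
    for _ in range(repeats):
        test = []
        for label in sorted(by_class):
            # Draw per_class distinct samples from this class.
            test.extend(rng.sample(by_class[label], per_class))
        test_set = set(test)
        train = [i for i in range(len(labels)) if i not in test_set]
        splits.append((train, sorted(test)))
    return splits


# Cohort sizes as reported in the abstract: 112 / 32 / 60 patients.
labels = ["pre"] * 112 + ["post"] * 32 + ["non"] * 60
splits = repeated_holdout(labels)
```

Each of the 35 splits holds out 18 samples and leaves 186 for training, matching the counts in the abstract. The reported mean and standard deviation would then be computed over the per-split test metrics.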

Submitted 27 March, 2025; originally announced April 2025.

Comments: 23 pages, 8 figures, 4 tables

arXiv:2503.23783 [pdf]

ANNs-SaDE: A Machine-Learning-Based Design Automation Framework for Microwave Branch-Line Couplers

Authors: Tianqi Chen, Wei Huang, Qiang Wu, Li Yang, Roberto Gómez-García, Xi Zhu

Abstract: The traditional method for designing branch-line couplers involves a trial-and-error optimization process that requires multiple design iterations through electromagnetic (EM) simulations. Thus, it is extremely time-consuming and labor-intensive. In this paper, a novel machine-learning-based framework is proposed to tackle this issue. It integrates artificial neural networks with a self-adaptive differential evolution algorithm (ANNs-SaDE). This framework enables the self-adaptive design of various types of microwave branch-line couplers by precisely optimizing essential electrical properties, such as coupling factor, isolation, and phase difference between output ports. The effectiveness of the ANNs-SaDE framework is demonstrated by the designs of folded single-stage branch-line couplers and multi-stage wideband branch-line couplers.

Submitted 31 March, 2025; originally announced March 2025.

Comments: This paper has been accepted for presentation at ISCAS 2025

arXiv:2503.19703 [pdf, other]

High-Quality Spatial Reconstruction and Orthoimage Generation Using Efficient 2D Gaussian Splatting

Authors: Qian Wang, Zhihao Zhan, Jialei He, Zhituo Tu, Xiang Zhu, Jie Yuan

Abstract: Highly accurate geometric precision and dense image features characterize True Digital Orthophoto Maps (TDOMs), which are in great demand for applications such as urban planning, infrastructure management, and environmental monitoring. Traditional TDOM generation methods need sophisticated processes, such as Digital Surface Models (DSM) and occlusion detection, which are computationally expensive and prone to errors. This work presents an alternative technique rooted in 2D Gaussian Splatting (2DGS), free of explicit DSM and occlusion detection. With depth map generation, spatial information for every pixel within the TDOM is retrieved and can reconstruct the scene with high precision. A divide-and-conquer strategy achieves excellent GS training and rendering with high-resolution TDOMs at a lower resource cost, which preserves higher quality of rendering on complex terrain and thin structures without a decrease in efficiency. Experimental results demonstrate the efficiency of large-scale scene reconstruction and high-precision terrain modeling. This approach provides accurate spatial data, which assists users in better planning and decision-making based on maps.

Submitted 13 May, 2025; v1 submitted 25 March, 2025; originally announced March 2025.

arXiv:2503.14966 [pdf, other]

Ultrasound Image-to-Video Synthesis via Latent Dynamic Diffusion Models

Authors: Tingxiu Chen, Yilei Shi, Zixuan Zheng, Bingcong Yan, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou

Abstract: Ultrasound video classification enables automated diagnosis and has emerged as an important research area. However, publicly available ultrasound video datasets remain scarce, hindering progress in developing effective video classification models. We propose addressing this shortage by synthesizing plausible ultrasound videos from readily available, abundant ultrasound images. To this end, we introduce a latent dynamic diffusion model (LDDM) to efficiently translate static images to dynamic sequences with realistic video characteristics. We demonstrate strong quantitative results and visually appealing synthesized videos on the BUSV benchmark. Notably, training video classification models on combinations of real and LDDM-synthesized videos substantially improves performance over using real data alone, indicating our method successfully emulates dynamics critical for discrimination. Our image-to-video approach provides an effective data augmentation solution to advance ultrasound video analysis. Code is available at https://github.com/MedAITech/U_I2V.

Submitted 19 March, 2025; originally announced March 2025.

Comments: MICCAI 2024

arXiv:2503.13987 [pdf, other]

Striving for Simplicity: Simple Yet Effective Prior-Aware Pseudo-Labeling for Semi-Supervised Ultrasound Image Segmentation

Authors: Yaxiong Chen, Yujie Wang, Zixuan Zheng, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou

Abstract: Medical ultrasound imaging is ubiquitous, but manual analysis struggles to keep pace. Automated segmentation can help but requires large labeled datasets, which are scarce. Semi-supervised learning leveraging both unlabeled and limited labeled data is a promising approach. State-of-the-art methods use consistency regularization or pseudo-labeling but grow increasingly complex. Without sufficient labels, these models often latch onto artifacts or allow anatomically implausible segmentations. In this paper, we present a simple yet effective pseudo-labeling method with an adversarially learned shape prior to regularize segmentations. Specifically, we devise an encoder-twin-decoder network where the shape prior acts as an implicit shape model, penalizing anatomically implausible but not ground-truth-deviating predictions. Without bells and whistles, our simple approach achieves state-of-the-art performance on two benchmarks under different partition protocols. We provide a strong baseline for future semi-supervised medical image segmentation. Code is available at https://github.com/WUTCM-Lab/Shape-Prior-Semi-Seg.

Submitted 18 March, 2025; originally announced March 2025.

Comments: MICCAI 2024

arXiv:2503.09961 [pdf, other]

Edge-Fog Computing-Enabled EEG Data Compression via Asymmetrical Variational Discrete Cosine Transform Network

Authors: Xin Zhu, Hongyi Pan, Ahmet Enis Cetin

Abstract: The large volume of electroencephalograph (EEG) data produced by brain-computer interface (BCI) systems presents challenges for rapid transmission over bandwidth-limited channels in Internet of Things (IoT) networks. To address the issue, we propose a novel multi-channel asymmetrical variational discrete cosine transform (DCT) network for EEG data compression within an edge-fog computing framework. At the edge level, low-complexity DCT compression units are designed using parallel trainable hard-thresholding and scaling operators to remove redundant data and extract the effective latent space representation. At the fog level, an adaptive filter bank is applied to merge important features from adjacent channels into each individual channel by leveraging inter-channel correlations. Then, the inverse DCT reconstructed multi-head attention is developed to capture both local and global dependencies and reconstruct the original signals. Furthermore, by applying the principles of variational inference, a new evidence lower bound is formulated as the loss function, driving the model to balance compression efficiency and reconstruction accuracy. Experimental results on two public datasets demonstrate that the proposed method achieves superior compression performance without sacrificing any useful information for BCI detection compared with state-of-the-art techniques, indicating a feasible solution for EEG data compression.

Submitted 12 March, 2025; originally announced March 2025.

Comments: Accepted by the IEEE Internet of Things Journal

arXiv:2503.03355 [pdf, other]

Rethinking Video Super-Resolution: Towards Diffusion-Based Methods without Motion Alignment

Authors: Zhihao Zhan, Wang Pang, Xiang Zhu, Yechao Bai

Abstract: In this work, we rethink the approach to video super-resolution by introducing a method based on the Diffusion Posterior Sampling framework, combined with an unconditional video diffusion transformer operating in latent space. The video generation model, a diffusion transformer, functions as a space-time model. We argue that a powerful model, which learns the physics of the real world, can easily handle various kinds of motion patterns as prior knowledge, thus eliminating the need for explicit estimation of optical flows or motion parameters for pixel alignment. Furthermore, a single instance of the proposed video diffusion transformer model can adapt to different sampling conditions without re-training. Empirical results on synthetic and real-world datasets illustrate the feasibility of diffusion-based, alignment-free video super-resolution.

Submitted 8 May, 2025; v1 submitted 5 March, 2025; originally announced March 2025.

arXiv:2503.01710 [pdf, other]

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

Authors: Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang, Songxiang Liu, Linqin Li, Zheng Liang, Qixi Zheng, Rui Wang, Xiaoqin Feng, Weizhen Bian, Zhen Ye, Sitong Cheng, Ruibin Yuan, Zhixian Zhao, Xinfa Zhu, Jiahao Pan, Liumeng Xue, Pengcheng Zhu, Yunlin Chen, Zhifei Li, Xie Chen, Lei Xie, Yike Guo, Wei Xue

Abstract: Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content and fixed-length global tokens for speaker attributes. This disentangled representation, combined with the Qwen2.5 LLM and a chain-of-thought (CoT) generation approach, enables both coarse-grained control (e.g., gender, speaking style) and fine-grained adjustments (e.g., precise pitch values, speaking rate). To facilitate research in controllable TTS, we introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations. Extensive experiments demonstrate that Spark-TTS not only achieves state-of-the-art zero-shot voice cloning but also generates highly customizable voices that surpass the limitations of reference-based synthesis. Source code, pre-trained models, and audio samples are available at https://github.com/SparkAudio/Spark-TTS.

Submitted 3 March, 2025; originally announced March 2025.

Comments: Submitted to ACL 2025

arXiv:2503.01202 [pdf, other]

A Multi-Sensor Fusion Approach for Rapid Orthoimage Generation in Large-Scale UAV Mapping

Authors: Jialei He, Zhihao Zhan, Zhituo Tu, Xiang Zhu, Jie Yuan

Abstract: Rapid generation of large-scale orthoimages from Unmanned Aerial Vehicles (UAVs) has been a long-standing focus of research in the field of aerial mapping. A multi-sensor UAV system, integrating the Global Positioning System (GPS), Inertial Measurement Unit (IMU), 4D millimeter-wave radar and camera, can provide an effective solution to this problem. In this paper, we utilize multi-sensor data to overcome the limitations of conventional orthoimage generation methods in terms of temporal performance, system robustness, and geographic reference accuracy. A prior-pose-optimized feature matching method is introduced to enhance matching speed and accuracy, reducing the number of required features and providing precise references for the Structure from Motion (SfM) process. The proposed method exhibits robustness in low-texture scenes like farmlands, where feature matching is difficult. Experiments show that our approach achieves accurate feature matching orthoimage generation in a short time. The proposed drone system effectively aids in farmland detection and management.

Submitted 4 March, 2025; v1 submitted 3 March, 2025; originally announced March 2025.

arXiv:2503.00980 [pdf, other]

RSSI Positioning with Fluid Antenna Systems

Authors: Wenzhi Liu, Zhisheng Rong, Xiayue Liu, Yufei Jiang, Xu Zhu

Abstract: We introduce a novel received signal strength intensity (RSSI)-based positioning method using fluid antenna systems (FAS), leveraging their inherent channel correlation properties to improve location accuracy. By enabling a single antenna to sample multiple spatial positions, FAS exhibits high correlation between its ports. We integrate this high inter-port correlation with a logarithmic path loss model to mitigate the impact of fast fading on RSSI signals, and derive a simplified multipoint positioning model based on the established relationship between channel correlation and RSSI signal correlation. A maximum likelihood estimator (MLE) is then developed, for which we provide a closed-form solution. Results demonstrate that our approach outperforms both traditional least squares (LS) methods and single-antenna systems, achieving accuracy comparable to conventional multi-antenna positioning. Furthermore, we analyze the impact of different antenna structures on positioning performance, offering practical guidance for FAS antenna design.
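
For readers unfamiliar with the "logarithmic path loss model" mentioned in this abstract, the sketch below shows the textbook log-distance form, RSSI(d) = P0 - 10 n log10(d / d0), and its inversion to a distance estimate. It is not the paper's FAS-based estimator. The reference power `p0_dbm`, path loss exponent `n`, and reference distance `d0` are illustrative values chosen here, and fading and shadowing terms are omitted.

```python
import math


def rssi_dbm(d, p0_dbm=-40.0, n=2.0, d0=1.0):
    """Mean RSSI in dBm at distance d under the log-distance model.

    p0_dbm is the RSSI at the reference distance d0 and n is the
    path loss exponent. All defaults are illustrative assumptions.
    """
    return p0_dbm - 10.0 * n * math.log10(d / d0)


def distance_from_rssi(rssi, p0_dbm=-40.0, n=2.0, d0=1.0):
    """Invert the model above to estimate distance from a noise-free RSSI."""
    return d0 * 10.0 ** ((p0_dbm - rssi) / (10.0 * n))


# Round trip at 10 m with the default parameters.
measured = rssi_dbm(10.0)
estimated = distance_from_rssi(measured)
```

In practice, fast fading perturbs each RSSI reading, so a single inversion like this is noisy. The abstract describes using inter-port correlation in FAS to mitigate that effect before positioning.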

Submitted 2 March, 2025; originally announced March 2025.

arXiv:2503.00493 [pdf, ps, other]

LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement

Authors: Boyi Kang, Xinfa Zhu, Zihan Zhang, Zhen Ye, Mingshuai Liu, Ziqian Wang, Yike Zhu, Guobin Ma, Jun Chen, Longshuai Xiao, Chao Weng, Wei Xue, Lei Xie

Abstract: Recent advancements in language models (LMs) have demonstrated strong capabilities in semantic understanding and contextual modeling, which have flourished in generative speech enhancement (SE). However, many LM-based SE approaches primarily focus on semantic information, often neglecting the critical role of acoustic information, which leads to acoustic inconsistency after enhancement and limited generalization across diverse SE tasks. In this paper, we introduce LLaSE-G1, a LLaMA-based language model that incentivizes generalization capabilities for speech enhancement. LLaSE-G1 offers the following key contributions: First, to mitigate acoustic inconsistency, LLaSE-G1 employs continuous representations from WavLM as input and predicts speech tokens from X-Codec2, maximizing acoustic preservation. Second, to promote generalization capability, LLaSE-G1 introduces dual-channel inputs and outputs, unifying multiple SE tasks without requiring task-specific IDs. Third, LLaSE-G1 outperforms prior task-specific discriminative and generative SE models, demonstrating scaling effects at test time and emerging capabilities for unseen SE tasks. Additionally, we release our code and models to support further research in this area.

Submitted 10 June, 2025; v1 submitted 1 March, 2025; originally announced March 2025.

Comments: ACL2025 main, Codes available at https://github.com/Kevin-naticl/LLaSE-G1

arXiv:2503.00348 [pdf, other]

SHAZAM: Self-Supervised Change Monitoring for Hazard Detection and Mapping

Authors: Samuel Garske, Konrad Heidler, Bradley Evans, KC Wong, Xiao Xiang Zhu

Abstract: The increasing frequency of environmental hazards due to climate change underscores the urgent need for effective monitoring systems. Current approaches either rely on expensive labelled datasets, struggle with seasonal variations, or require multiple observations for confirmation (which delays detection). To address these challenges, this work presents SHAZAM - Self-Supervised Change Monitoring for Hazard Detection and Mapping. SHAZAM uses a lightweight conditional UNet to generate expected images of a region of interest (ROI) for any day of the year, allowing for the direct modelling of normal seasonal changes and the ability to distinguish potential hazards. A modified structural similarity measure compares the generated images with actual satellite observations to compute region-level anomaly scores and pixel-level hazard maps. Additionally, a theoretically grounded seasonal threshold eliminates the need for dataset-specific optimisation. Evaluated on four diverse datasets that contain bushfires (wildfires), burned regions, extreme and out-of-season snowfall, floods, droughts, algal blooms, and deforestation, SHAZAM achieved F1 score improvements of between 0.066 and 0.234 over existing methods. This was achieved primarily through more effective hazard detection (higher recall) while using only 473K parameters. SHAZAM demonstrated superior mapping capabilities through higher spatial resolution and improved ability to suppress background features while accentuating both immediate and gradual hazards. SHAZAM has been established as an effective and generalisable solution for hazard detection and mapping across different geographical regions and a diverse range of hazards. The Python code is available at: https://github.com/WiseGamgee/SHAZAM

Submitted 28 February, 2025; originally announced March 2025.

Comments: 20 pages, 9 figures, 3 tables, code available at: https://github.com/WiseGamgee/SHAZAM

arXiv:2502.18311 [pdf, other]

Cost-Effective Single-Antenna RSSI Positioning Through Dynamic Radiation Pattern Analysis

Authors: Zhisheng Rong, Wenzhi Liu, Xiayue Liu, Zhixiang Xu, Yufei Jiang, Xu Zhu

Abstract: This paper presents a novel indoor positioning approach that leverages antenna radiation pattern characteristics through Received Signal Strength Indication (RSSI) measurements in a single-antenna system. By rotating the antenna or reconfiguring its radiation pattern, we derive a maximum likelihood estimation (MLE) algorithm that achieves near-optimal positioning accuracy approaching the Cramer-Rao lower bound (CRLB). Through theoretical analysis, we establish three fundamental theorems characterizing the estimation accuracy bounds and demonstrating how performance improves with increased signal-to-noise ratio, antenna rotation count, and radiation pattern variations. Additionally, we propose a two-position measurement strategy that eliminates dependence on receiving antenna patterns. Simulation results validate that our approach provides an effective solution for indoor robot tracking applications where both accuracy and system simplicity are essential considerations.

Submitted 3 March, 2025; v1 submitted 25 February, 2025; originally announced February 2025.

Comments: 6 pages, 7 figures

arXiv:2502.18186 [pdf, other]

Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought

Authors: Zhixian Zhao, Xinfa Zhu, Xinsheng Wang, Shuiyuan Wang, Xuelong Geng, Wenjie Tian, Lei Xie

Abstract: Large-scale audio language models (ALMs), such as Qwen2-Audio, are capable of comprehending diverse audio signals, performing audio analysis and generating textual responses. However, in speech emotion recognition (SER), ALMs often suffer from hallucinations, resulting in misclassifications or irrelevant outputs. To address these challenges, we propose C$^2$SER, a novel ALM designed to enhance the stability and accuracy of SER through Contextual perception and Chain of Thought (CoT). C$^2$SER integrates the Whisper encoder for semantic perception and Emotion2Vec-S for acoustic perception, where Emotion2Vec-S extends Emotion2Vec with semi-supervised learning to enhance emotional discrimination. Additionally, C$^2$SER employs a CoT approach, processing SER in a step-by-step manner while leveraging speech content and speaking styles to improve recognition. To further enhance stability, C$^2$SER introduces self-distillation from explicit CoT to implicit CoT, mitigating error accumulation and boosting recognition accuracy. Extensive experiments show that C$^2$SER outperforms existing popular ALMs, such as Qwen2-Audio and SECap, delivering more stable and precise emotion recognition. We release the training code, checkpoints, and test sets to facilitate further research.

Submitted 25 February, 2025; originally announced February 2025.

arXiv:2502.04128 [pdf, other]

Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

Authors: Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi Dai, Hongzhan Lin, Jianyi Chen, Xingjian Du, Liumeng Xue, Yunlin Chen, Zhifei Li, Lei Xie, Qiuqiang Kong, Yike Guo, Wei Xue

Abstract: Recent advances in text-based large language models (LLMs), particularly in the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., diffusion models after LLM), complicating the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First, we explore the scaling of train-time and inference-time compute for speech synthesis. Second, we propose a simple framework Llasa for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as Llama. Our experiments reveal that scaling train-time compute for Llasa consistently improves the naturalness of synthesized speech and enables the generation of more complex and accurate prosody patterns. Furthermore, from the perspective of scaling inference-time compute, we employ speech understanding models as verifiers during the search, finding that scaling inference-time compute shifts the sampling modes toward the preferences of specific verifiers, thereby improving emotional expressiveness, timbre consistency, and content accuracy. In addition, we have made the checkpoints and training code for our TTS models (1B, 3B, 8B) and codec model publicly available.

Submitted 22 February, 2025; v1 submitted 6 February, 2025; originally announced February 2025.

arXiv:2501.16761 [pdf, other]

CosyAudio: Improving Audio Generation with Confidence Scores and Synthetic Captions

Authors: Xinfa Zhu, Wenjie Tian, Xinsheng Wang, Lei He, Xi Wang, Sheng Zhao, Lei Xie

Abstract: Text-to-Audio (TTA) generation is an emerging area within AI-generated content (AIGC), where audio is created from natural language descriptions. Despite growing interest, developing robust TTA models remains challenging due to the scarcity of well-labeled datasets and the prevalence of noisy or inaccurate captions in large-scale, weakly labeled corpora. To address these challenges, we propose CosyAudio, a novel framework that utilizes confidence scores and synthetic captions to enhance the quality of audio generation. CosyAudio consists of two core components: AudioCapTeller and an audio generator. AudioCapTeller generates synthetic captions for audio and provides confidence scores to evaluate their accuracy. The audio generator uses these synthetic captions and confidence scores to enable quality-aware audio generation. Additionally, we introduce a self-evolving training strategy that iteratively optimizes CosyAudio across both well-labeled and weakly-labeled datasets. Initially trained with well-labeled data, AudioCapTeller leverages its assessment capabilities on weakly-labeled datasets for high-quality filtering and reinforcement learning, which further improves its performance. The well-trained AudioCapTeller refines corpora by generating new captions and confidence scores, serving for the audio generator training. Extensive experiments on open-source datasets demonstrate that CosyAudio outperforms existing models in automated audio captioning, generates more faithful audio, and exhibits strong generalization across diverse scenarios.

Submitted 28 January, 2025; originally announced January 2025.

Comments: 12 pages, 5 figures, 7 tables

arXiv:2501.15085 [pdf, other]

Data Center Cooling System Optimization Using Offline Reinforcement Learning

Authors: Xianyuan Zhan, Xiangyu Zhu, Peng Cheng, Xiao Hu, Ziteng He, Hanfei Geng, Jichao Leng, Huiwen Zheng, Chenhui Liu, Tianshun Hong, Yan Liang, Yunxin Liu, Feng Zhao

Abstract: The recent advances in information technology and artificial intelligence have fueled a rapid expansion of the data center (DC) industry worldwide, accompanied by an immense appetite for electricity to power the DCs. In a typical DC, around 30~40% of the energy is spent on the cooling system rather than on computer servers, posing a pressing need for developing new energy-saving optimization technologies for DC cooling systems. However, optimizing such real-world industrial systems faces numerous challenges, including but not limited to a lack of reliable simulation environments, limited historical data, and stringent safety and control robustness requirements. In this work, we present a novel physics-informed offline reinforcement learning (RL) framework for energy efficiency optimization of DC cooling systems. The proposed framework models the complex dynamical patterns and physical dependencies inside a server room using a purposely designed graph neural network architecture that is compliant with the fundamental time-reversal symmetry. Because of its well-behaved and generalizable state-action representations, the model enables sample-efficient and robust latent space offline policy learning using limited real-world operational data. Our framework has been successfully deployed and verified in a large-scale production DC for closed-loop control of its air-cooling units (ACUs). We conducted a total of 2000 hours of short and long-term experiments in the production DC environment. The results show that our method achieves 14~21% energy savings in the DC cooling system, without any violation of the safety or operational constraints. Our results have demonstrated the significant potential of offline RL in solving a broad range of data-limited, safety-critical real-world industrial control problems.

Submitted 14 February, 2025; v1 submitted 25 January, 2025; originally announced January 2025.

Comments: Accepted in ICLR 2025

arXiv:2501.13306 [pdf, other]

OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia

Authors: Xuelong Geng, Kun Wei, Qijie Shao, Shuiyun Liu, Zhennan Lin, Zhixian Zhao, Guojian Li, Wenjie Tian, Peikun Chen, Yangze Li, Pengcheng Guo, Mingchen Shao, Shuiyuan Wang, Yuang Cao, Chengyou Wang, Tianyi Xu, Yuhang Dai, Xinfa Zhu, Yue Li, Li Zhang, Lei Xie

Abstract: Large Language Models (LLMs) have made significant progress in various downstream tasks, inspiring the development of Speech Understanding Language Models (SULMs) to enable comprehensive speech-based interactions. However, most advanced SULMs are developed by the industry, leveraging large-scale datasets and computational resources that are not readily available to the academic community. Moreover, the lack of transparency in training details creates additional barriers to further innovation. In this study, we present OSUM, an Open Speech Understanding Model designed to explore the potential of training SULMs under constrained academic resources. The OSUM model combines a Whisper encoder with a Qwen2 LLM and supports a wide range of speech tasks, including speech recognition (ASR), speech recognition with timestamps (SRWT), vocal event detection (VED), speech emotion recognition (SER), speaking style recognition (SSR), speaker gender classification (SGC), speaker age prediction (SAP), and speech-to-text chat (STTC). By employing an ASR+X training strategy, OSUM achieves efficient and stable multi-task training by simultaneously optimizing ASR alongside target tasks. Beyond delivering strong performance, OSUM emphasizes transparency by providing openly available data preparation and training methodologies, offering valuable insights and practical guidance for the academic community. By doing so, we aim to accelerate research and innovation in advanced SULM technologies.

Submitted 16 February, 2025; v1 submitted 22 January, 2025; originally announced January 2025.

Comments: OSUM Technical Report v2. The experimental results reported herein differ from those in v1 because of adding new data and training in more steps

arXiv:2501.12604 [pdf, other]

Image Motion Blur Removal in the Temporal Dimension with Video Diffusion Models

Authors: Wang Pang, Zhihao Zhan, Xiang Zhu, Yechao Bai

Abstract: Most motion deblurring algorithms rely on spatial-domain convolution models, which struggle with the complex, non-linear blur arising from camera shake and object motion. In contrast, we propose a novel single-image deblurring approach that treats motion blur as a temporal averaging phenomenon. Our core innovation lies in leveraging a pre-trained video diffusion transformer model to capture diverse motion dynamics within a latent space. It sidesteps explicit kernel estimation and effectively accommodates diverse motion patterns. We implement the algorithm within a diffusion-based inverse problem framework. Empirical results on synthetic and real-world datasets demonstrate that our method outperforms existing techniques in deblurring complex motion blur scenarios. This work paves the way for utilizing powerful video diffusion models to address single-image deblurring challenges.

Submitted 21 January, 2025; originally announced January 2025.

arXiv:2501.11737 [pdf, other]

Efficient Bearing Sensor Data Compression via an Asymmetrical Autoencoder with a Lifting Wavelet Transform Layer

Authors: Xin Zhu, Ahmet Enis Cetin

Abstract: Bearing data compression is vital to manage the large volumes of data generated during condition monitoring. In this paper, a novel asymmetrical autoencoder with a lifting wavelet transform (LWT) layer is developed to compress bearing sensor data. The encoder part of the network consists of a convolutional layer followed by a wavelet filterbank layer. Specifically, a dual-channel convolutional block with diverse convolutional kernel sizes and varying processing depths is integrated into the wavelet filterbank layer to enable comprehensive feature extraction from the wavelet domain. Additionally, the adaptive hard-thresholding nonlinearity is applied to remove redundant components while denoising the primary wavelet coefficients. On the decoder side, inverse LWT, along with multiple linear layers and activation functions, is employed to reconstruct the original signals. Furthermore, to enhance compression efficiency, a sparsity constraint is introduced during training to impose sparsity on the latent representations. The experimental results demonstrate that the proposed approach achieves superior data compression performance compared to state-of-the-art methods.

Submitted 20 January, 2025; originally announced January 2025.

Comments: Accepted at the 2025 IEEE International Symposium on Circuits and Systems

arXiv:2501.04416 [pdf, other]

ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training

Authors: Xinfa Zhu, Lei He, Yujia Xiao, Xi Wang, Xu Tan, Sheng Zhao, Lei Xie

Abstract: Style voice conversion aims to transform the speaking style of source speech into a desired style while keeping the original speaker's identity. However, previous style voice conversion approaches primarily focus on well-defined domains such as emotional aspects, limiting their practical applications. In this study, we present ZSVC, a novel Zero-shot Style Voice Conversion approach that utilizes a speech codec and a latent diffusion model with a speech prompting mechanism to facilitate in-context learning for speaking style conversion. To disentangle speaking style and speaker timbre, we introduce an information bottleneck to filter speaking style in the source speech and employ Uncertainty Modeling Adaptive Instance Normalization (UMAdaIN) to perturb the speaker timbre in the style prompt. Moreover, we propose a novel adversarial training strategy to enhance in-context learning and improve style similarity. Experiments conducted on 44,000 hours of speech data demonstrate the superior performance of ZSVC in generating speech with diverse speaking styles in zero-shot scenarios.

Submitted 8 January, 2025; originally announced January 2025.

Comments: 5 pages, 3 figures, accepted by ICASSP 2025

arXiv:2412.18077 [pdf]

Optimizing In Vivo Data Acquisition for Robust Clinical Microvascular Imaging Using Ultrasound Localization Microscopy

Authors: Chengwu Huang, U-Wai Lok, Jingke Zhang, Xiang Yang Zhu, James D. Krier, Amy Stern, Kate M. Knoll, Kendra E. Petersen, Kathryn A. Robinson, Gina K. Hesley, Andrew J. Bentall, Thomas D. Atwell, Andrew D. Rule, Lilach O. Lerman, Shigao Chen

Abstract: Ultrasound localization microscopy (ULM) enables microvascular imaging at spatial resolutions beyond the acoustic diffraction limit, offering significant clinical potential. However, ULM performance relies heavily on microbubble (MB) signal sparsity, the number of detected MBs, and signal-to-noise ratio (SNR), all of which vary in clinical scenarios involving bolus MB injections. These sources of variation underscore the need to optimize MB dosage, data acquisition timing, and imaging settings in order to standardize and optimize ULM of microvasculature. This pilot study investigated temporal changes in MB signals during bolus injections in both pig and human models to optimize data acquisition for clinical ULM. Quantitative indices were developed to evaluate MB signal quality, guiding selection of acquisition timing that balances the MB localization quality and adequate MB counts. The effects of transmitted voltage and dosage were also explored. In the pig model, a relatively short window (approximately 10 seconds) for optimal acquisition was identified during the rapid wash-out phase, highlighting the need for real-time MB signal monitoring during data acquisition. The slower wash-out phase in humans allowed for a more flexible imaging window of 1-2 minutes, while trade-offs were observed between localization quality and MB density (or acquisition length) at different wash-out phase timings. Guided by these findings, robust ULM imaging was achieved in both pig and human kidneys using a short period of data acquisition, demonstrating its feasibility in clinical practice. This study provides insights into optimizing data acquisition for consistent and reproducible ULM, paving the way for its standardization and broader clinical applications.

Submitted 23 December, 2024; originally announced December 2024.

Comments: 33 pages, 9 figures

arXiv:2412.16846 [pdf, other]

Autoregressive Speech Synthesis with Next-Distribution Prediction

Authors: Xinfa Zhu, Wenjie Tian, Lei Xie

Abstract: We introduce KALL-E, a novel autoregressive (AR) language modeling approach with next-distribution prediction for text-to-speech (TTS) synthesis. Unlike existing methods, KALL-E directly models and predicts the continuous speech distribution conditioned on text without relying on VAE- or diffusion-based components. Specifically, we use WaveVAE to extract continuous speech distributions from waveforms instead of using discrete speech tokens. A single AR language model predicts these continuous speech distributions from text, with a Kullback-Leibler divergence loss as the constraint. Experimental results show that KALL-E outperforms open-source implementations of YourTTS, VALL-E, NaturalSpeech 2, and CosyVoice in terms of naturalness and speaker similarity in zero-shot TTS scenarios. Moreover, KALL-E demonstrates exceptional zero-shot capabilities in emotion and accent cloning. Importantly, KALL-E presents a more straightforward and effective paradigm for using continuous speech representations in TTS. Audio samples are available at: https://zxf-icpc.github.io/kalle/.

Submitted 21 December, 2024; originally announced December 2024.

Comments: Technical report, work in progress

arXiv:2412.13676 [pdf, ps, other]

Robust UAV Jittering and Task Scheduling in Mobile Edge Computing with Data Compression

Authors: Bin Li, Xiao Zhu, Junyi Wang

Abstract: Data compression technology is able to reduce data size, which can be applied to lower the cost of task offloading in mobile edge computing (MEC). This paper addresses the practical challenges for robust trajectory and scheduling optimization based on data compression in the unmanned aerial vehicle (UAV)-assisted MEC, aiming to minimize the sum energy cost of terminal users while maintaining robust performance during UAV flight. Considering the non-convexity of the problem and the dynamic nature of the scenario, the optimization problem is reformulated as a Markov decision process. Then, a randomized ensembled double Q-learning (REDQ) algorithm is adopted to solve the issue. The algorithm allows for a higher feasible update-to-data ratio, enabling more effective learning from observed data. The simulation results show that the proposed scheme effectively reduces the energy consumption while ensuring flight robustness. Compared to the PPO and A2C algorithms, energy consumption is reduced by approximately 21.9% and 35.4%, respectively. This method demonstrates significant advantages in complex environments and holds great potential for practical applications.

Submitted 18 December, 2024; originally announced December 2024.

Comments: 10 pages, 8 figures

arXiv:2412.09168 [pdf, other]

YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls

Authors: Zihao Chen, Haomin Zhang, Xinhan Di, Haoyu Wang, Sizhe Shan, Junjie Zheng, Yunming Liang, Yihan Fan, Xinfa Zhu, Wenjie Tian, Yihua Wang, Chaofan Ding, Lei Xie

Abstract: Generating sound effects for product-level videos, where only a small amount of labeled data is available for diverse scenes, requires the production of high-quality sounds in few-shot settings. To tackle the challenge of limited labeled data in real-world scenes, we introduce YingSound, a foundation model designed for video-guided sound generation that supports high-quality audio generation in few-shot settings. Specifically, YingSound consists of two major modules. The first module uses a conditional flow matching transformer to achieve effective semantic alignment in sound generation across audio and visual modalities. This module aims to build a learnable audio-visual aggregator (AVA) that integrates high-resolution visual features with corresponding audio features at multiple stages. The second module is developed with a proposed multi-modal visual-audio chain-of-thought (CoT) approach to generate finer sound effects in few-shot settings. Finally, an industry-standard video-to-audio (V2A) dataset that encompasses various real-world scenarios is presented. We show that YingSound effectively generates high-quality synchronized sounds across diverse conditional inputs through automated evaluations and human studies. Project Page: https://giantailab.github.io/yingsound/

Submitted 12 December, 2024; originally announced December 2024.

Comments: 16 pages, 4 figures

arXiv:2412.06451 [pdf, other]

How Certain are Uncertainty Estimates? Three Novel Earth Observation Datasets for Benchmarking Uncertainty Quantification in Machine Learning

Authors: Yuanyuan Wang, Qian Song, Dawood Wasif, Muhammad Shahzad, Christoph Koller, Jonathan Bamber, Xiao Xiang Zhu

Abstract: Uncertainty quantification (UQ) is essential for assessing the reliability of Earth observation (EO) products. However, the extensive use of machine learning models in EO introduces an additional layer of complexity, as those models themselves are inherently uncertain. While various UQ methods do exist for machine learning models, their performance on EO datasets remains largely unevaluated. A key challenge in the community is the absence of the ground truth for uncertainty, i.e. how certain the uncertainty estimates are, apart from the labels for the image/signal. This article fills this gap by introducing three benchmark datasets specifically designed for UQ in EO machine learning models. These datasets address three common problem types in EO: regression, image segmentation, and scene classification. They enable a transparent comparison of different UQ methods for EO machine learning models. We describe the creation and characteristics of each dataset, including data sources, preprocessing steps, and label generation, with a particular focus on calculating the reference uncertainty. We also showcase baseline performance of several machine learning models on each dataset, highlighting the utility of these benchmarks for model development and comparison. Overall, this article offers a valuable resource for researchers and practitioners working in artificial intelligence for EO, promoting a more accurate and reliable quality measure of the outputs of machine learning models. The dataset and code are accessible via https://gitlab.lrz.de/ai4eo/WG_Uncertainty.

Submitted 9 December, 2024; originally announced December 2024.

Comments: Submitted to IEEE Geoscience and Remote Sensing Magazine

arXiv:2411.18918 [pdf, other]

CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion

Authors: Yuke Li, Xinfa Zhu, Hanzhao Li, JiXun Yao, WenJie Tian, XiPeng Yang, YunLin Chen, Zhifei Li, Lei Xie

Abstract: Zero-shot voice conversion (VC) aims to convert the original speaker's timbre to any target speaker while keeping the linguistic content. Current mainstream zero-shot voice conversion approaches depend on pre-trained recognition models to disentangle linguistic content and speaker representation. This results in a timbre residue within the decoupled linguistic content and inadequacies in speaker representation modeling. In this study, we propose CoDiff-VC, an end-to-end framework for zero-shot voice conversion that integrates a speech codec and a diffusion model to produce high-fidelity waveforms. Our approach involves employing a single-codebook codec to separate linguistic content from the source speech. To enhance content disentanglement, we introduce Mix-Style layer normalization (MSLN) to perturb the original timbre. Additionally, we incorporate a multi-scale speaker timbre modeling approach to ensure timbre consistency and improve voice detail similarity. To improve speech quality and speaker similarity, we introduce dual classifier-free guidance, providing both content and timbre guidance during the generation process. Objective and subjective experiments affirm that CoDiff-VC significantly improves speaker similarity, generating natural and higher-quality speech.

Submitted 3 December, 2024; v1 submitted 28 November, 2024; originally announced November 2024.

arXiv:2411.13362 [pdf, other]

doi 10.1109/ISCAS56072.2025.11044297

RTSR: A Real-Time Super-Resolution Model for AV1 Compressed Content

Authors: Yuxuan Jiang, Jakub Nawała, Chen Feng, Fan Zhang, Xiaoqing Zhu, Joel Sole, David Bull

Abstract: Super-resolution (SR) is a key technique for improving the visual quality of video content by increasing its spatial resolution while reconstructing fine details. SR has been employed in many applications including video streaming, where compressed low-resolution content is typically transmitted to end users and then reconstructed with a higher resolution and enhanced quality. To support real-time playback, it is important to implement fast SR models while preserving reconstruction quality; however, most existing solutions, in particular those based on complex deep neural networks, fail to do so. To address this issue, this paper proposes a low-complexity SR method, RTSR, designed to enhance the visual quality of compressed video content, focusing on resolution up-scaling from a) 360p to 1080p and from b) 540p to 4K. The proposed approach utilizes a CNN-based network architecture, which was optimized for AV1 (SVT)-encoded content at various quantization levels based on a dual-teacher knowledge distillation method. This method was submitted to the AIM 2024 Video Super-Resolution Challenge, specifically targeting the Efficient/Mobile Real-Time Video Super-Resolution competition. It achieved the best trade-off between complexity and coding performance (measured in PSNR, SSIM and VMAF) among all six submissions. The code will be available soon.

Submitted 20 November, 2024; originally announced November 2024.

arXiv:2411.09307 [pdf, ps, other]

Model-Based Event-Triggered Implementation of Hybrid Controllers Using Finite-Time Convergent Observers

Authors: Xuanzhi Zhu, Pedro Casau, Carlos Silvestre

Abstract: In this paper, we explore the conditions for asymptotic stability of the hybrid closed-loop system resulting from the interconnection of a nonlinear plant, an intelligent sensor that generates finite-time convergent estimates of the plant state, and a controller node that receives opportunistic samples from the sensor node when certain model-based event-triggering conditions are met. The proposed method is endowed with a degree of separation, in the sense that the controller design is independent of the sensor design. This is achieved under mild regularity conditions imposed on the hybrid closed-loop system and the existence of persistently flowing solutions. We demonstrate the versatility of the method by implementing it on: 1) a sampled-data controller for regulation of linear plants; 2) a synergistic controller for attitude stabilization of rigid bodies. The effectiveness of these novel controllers is demonstrated through numerical simulations.

Submitted 14 November, 2024; originally announced November 2024.

arXiv:2411.08680 [pdf, other]

Finite-Alphabet-Aware Trajectory and Precoder Optimization for UAV Relaying

Authors: Haoyang Di, Xiaodong Zhu, Yulin Shao

Abstract: Unmanned aerial vehicles (UAVs) have become key enablers in relay-assisted wireless communications thanks to their flexibility and line-of-sight channel advantage. However, most existing trajectory optimization frameworks assume ideal Gaussian inputs, overlooking the fact that practical wireless systems rely on structured, finite-alphabet constellations. This mismatch can lead to suboptimal, and sometimes misleading, design choices. In this paper, we challenge that convention by introducing a finite-alphabet-aware framework for joint trajectory and precoder optimization in UAV-assisted relay systems. We formulate a non-convex design problem that directly accounts for discrete signal structures and propose an efficient solution based on alternating optimization and successive convex approximation. Simulation results reveal that strategies optimized under Gaussian assumptions can waste energy and degrade throughput in real deployments. In contrast, our approach adapts both the UAV's trajectory and transmission strategy to the underlying modulation format, delivering consistent performance gains under practical system constraints. This work takes a key step toward aligning UAV communication design with the realities of modern wireless systems: discrete signals, power limits, and intelligent mobility.

Submitted 12 May, 2025; v1 submitted 13 November, 2024; originally announced November 2024.

arXiv:2411.08413 [pdf, other]

Inference-Aware State Reconstruction for Industrial Metaverse under Synchronous/Asynchronous Short-Packet Transmission

Authors: Qinqin Xiong, Jie Cao, Xu Zhu, Yufei Jiang, Nikolaos Pappas

Abstract: We consider a real-time state reconstruction system for industrial metaverse. The time-varying physical process states in real space are captured by multiple sensors via wireless links, and then reconstructed in virtual space. In this paper, we use the spatial-temporal correlation of the sensor data of interest to infer the real-time data of the target sensor to reduce the mean squared error (MSE) of reconstruction for industrial metaverse under short-packet transmission (SPT). Both synchronous and asynchronous transmission modes for multiple sensors are considered. It is proved that the average MSE of reconstruction and average block error probability (BLEP) have a positive correlation under inference with synchronous transmission scheme, and they have a negative correlation in some conditions under inference with asynchronous transmission scheme. Also, it is proved that the average MSE of reconstruction with inference can be significantly lower than that without inference, even under weak mean squared spatial correlation (MSSC). In addition, closed-form MSSC thresholds are derived for the superiority regions of the inference with synchronous transmission and inference with asynchronous transmission schemes, respectively. Adaptations of blocklength and time shift of asynchronous transmission are conducted to minimize the average MSE of reconstruction. Simulation results show that the two schemes significantly outperform the no inference case, with an average MSE reduction of more than 50%.

Submitted 13 November, 2024; originally announced November 2024.

arXiv:2411.02236 [pdf, other]

3D Audio-Visual Segmentation

Authors: Artem Sokolov, Swapnil Bhosale, Xiatian Zhu

Abstract: Recognizing the sounding objects in scenes is a longstanding objective in embodied AI, with diverse applications in robotics and AR/VR/MR. To that end, Audio-Visual Segmentation (AVS), taking as condition an audio signal to identify the masks of the target sounding objects in an input image with synchronous camera and microphone sensors, has been recently advanced. However, this paradigm is still insufficient for real-world operation, as the mapping from 2D images to 3D scenes is missing. To address this fundamental limitation, we introduce a novel research problem, 3D Audio-Visual Segmentation, extending the existing AVS to the 3D output space. This problem poses more challenges due to variations in camera extrinsics, audio scattering, occlusions, and diverse acoustics across sounding object categories. To facilitate this research, we create the very first simulation-based benchmark, 3DAVS-S34-O7, providing photorealistic 3D scene environments with grounded spatial audio under single-instance and multi-instance settings, across 34 scenes and 7 object categories. This is made possible by re-purposing the Habitat simulator to generate comprehensive annotations of sounding object locations and corresponding 3D masks. Subsequently, we propose a new approach, EchoSegnet, characterized by integrating the ready-to-use knowledge from pretrained 2D audio-visual foundation models synergistically with 3D visual scene representation through spatial audio-aware mask alignment and refinement. Extensive experiments demonstrate that EchoSegnet can effectively segment sounding objects in 3D space on our new benchmark, representing a significant advancement in the field of embodied AI. Project page: https://surrey-uplab.github.io/research/3d-audio-visual-segmentation/

Submitted 4 November, 2024; originally announced November 2024.

Comments: Accepted at the NeurIPS 2024 Workshop on Audio Imagination

arXiv:2411.00373

Discrete RIS Enhanced Space Shift Keying MIMO System via Reflecting Beamforming Optimization

Authors: Xusheng Zhu, Qingqing Wu, Wen Chen, Xinyuan He, Lexi Xu, Yaxin Zhang

Abstract: In this paper, a discrete reconfigurable intelligent surface (RIS)-assisted space shift keying (SSK) multiple-input multiple-output (MIMO) scheme is investigated, in which a direct link between the transmitter and the receiver is considered. To improve the reliability of the RIS-SSK-MIMO scheme, we formulate an objective function based on minimizing the average bit error probability (ABEP). Since the reflecting phase shift of RIS is discrete, it is difficult to address this problem directly. To this end, we optimize the RIS phase shift to maximize the Euclidean distance between the minimum constellations by applying the successive convex approximation (SCA) and penalty-alternating optimization method. Simulation results verify the superiority of the proposed RIS-SSK-MIMO scheme and demonstrate the impact of the number of RIS elements, the number of phase quantization bits, and the number of receive and transmit antennas in terms of reliability.

Submitted 1 November, 2024; originally announced November 2024.

arXiv:2410.23815

The NPU-HWC System for the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge

Authors: Dake Guo, Jixun Yao, Xinfa Zhu, Kangxiang Xia, Zhao Guo, Ziyu Zhang, Yao Wang, Jie Liu, Lei Xie

Abstract: This paper presents the NPU-HWC system submitted to the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge (ICAGC). Our system consists of two modules: a speech generator for Track 1 and a background audio generator for Track 2. In Track 1, we employ Single-Codec to tokenize the speech into discrete tokens and use a language-model-based approach to achieve zero-shot speaking style cloning. Single-Codec effectively decouples timbre and speaking style at the token level, reducing the acoustic modeling burden on the autoregressive language model. Additionally, we use DSPGAN to upsample 16 kHz mel-spectrograms to high-fidelity 48 kHz waveforms. In Track 2, we propose a background audio generator based on large language models (LLMs). This system produces scene-appropriate accompaniment descriptions, synthesizes background audio with Tango 2, and integrates it with the speech generated by our Track 1 system. Our submission achieves second place in Track 1 and first place in Track 2.

Submitted 31 October, 2024; originally announced October 2024.

Comments: accepted by ISCSLP 2024

Showing 1–50 of 253 results for author: Zhu, X