Showing 1–50 of 253 results for author: Zhu, X

Searching in archive eess.
  1. arXiv:2507.05451  [pdf]

    eess.IV cs.CV eess.SP

    Self-supervised Deep Learning for Denoising in Ultrasound Microvascular Imaging

    Authors: Lijie Huang, Jingyi Yin, Jingke Zhang, U-Wai Lok, Ryan M. DeRuiter, Jieyang Jin, Kate M. Knoll, Kendra E. Petersen, James D. Krier, Xiang-yang Zhu, Gina K. Hesley, Kathryn A. Robinson, Andrew J. Bentall, Thomas D. Atwell, Andrew D. Rule, Lilach O. Lerman, Shigao Chen, Chengwu Huang

    Abstract: Ultrasound microvascular imaging (UMI) is often hindered by low signal-to-noise ratio (SNR), especially in contrast-free or deep tissue scenarios, which impairs subsequent vascular quantification and reliable disease diagnosis. To address this challenge, we propose Half-Angle-to-Half-Angle (HA2HA), a self-supervised denoising framework specifically designed for UMI. HA2HA constructs training pairs…

    Submitted 7 July, 2025; originally announced July 2025.

    Comments: 12 pages, 10 figures. Supplementary materials are available at https://zenodo.org/records/15832003

  2. arXiv:2507.01445  [pdf, ps, other]

    eess.SP

    Basis Expansion Extrapolation based Long-Term Channel Prediction for Massive MIMO OTFS Systems

    Authors: Yanfeng Zhang, Xu Zhu, Yujie Liu, Yong Liang Guan, David González G., Vincent K. N. Lau

    Abstract: Massive multi-input multi-output (MIMO) combined with orthogonal time frequency space (OTFS) modulation has emerged as a promising technique for high-mobility scenarios. However, its performance could be severely degraded due to channel aging caused by user mobility and high processing latency. In this paper, an integrated scheme of uplink (UL) channel estimation and downlink (DL) channel predicti…

    Submitted 2 July, 2025; originally announced July 2025.

  3. arXiv:2506.19181  [pdf, ps, other]

    eess.IV

    VHU-Net: Variational Hadamard U-Net for Body MRI Bias Field Correction

    Authors: Xin Zhu, Ahmet Enis Cetin, Gorkem Durak, Batuhan Gundogdu, Ziliang Hong, Hongyi Pan, Ertugrul Aktas, Elif Keles, Hatice Savas, Aytekin Oto, Hiten Patel, Adam B. Murphy, Ashley Ross, Frank Miller, Baris Turkbey, Ulas Bagci

    Abstract: Bias field artifacts in magnetic resonance imaging (MRI) scans introduce spatially smooth intensity inhomogeneities that degrade image quality and hinder downstream analysis. To address this challenge, we propose a novel variational Hadamard U-Net (VHU-Net) for effective body MRI bias field correction. The encoder comprises multiple convolutional Hadamard transform blocks (ConvHTBlocks), each inte…

    Submitted 2 July, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

  4. arXiv:2506.15191  [pdf]

    eess.SY

    Islanding Strategy for Smart Grids Oriented to Resilience Enhancement and Its Power Supply Range Optimization

    Authors: Yanhong Luo, Wenchao Meng, Xi Zhu, Andreas Elombo, Hu Rong, Bing Xie, Tianwen Zhang

    Abstract: With the increasing prevalence of distributed generators, islanded operation based on distributed generation is considered a vital means to enhance the reliability and resilience of smart grids. This paper investigates the main factors in islanding partition of smart grids and establishes a mathematical model for islanding division. A method to determine the maximum power supply range of distribut…

    Submitted 18 June, 2025; originally announced June 2025.

  5. arXiv:2506.13291  [pdf, ps, other]

    eess.SY

    Aggregating Inverter-Based Resources for Fast Frequency Response: A Nash Bargaining Game-Based Approach

    Authors: Xiang Zhu, Hua Geng, Hongyang Qing, Xin Zou

    Abstract: This paper proposes a multi-objective optimization (MOO) approach for grid-level frequency regulation by aggregating inverter-based resources (IBRs). Virtual power plants (VPPs), acting as aggregators, can efficiently respond to dynamic response requirements from the grid. Through parametric modeling, grid-level frequency regulation requirements are accurately quantified and translated into a feas…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: Accepted by the 2025 IEEE IAS Annual Meeting

  6. arXiv:2506.11496  [pdf, ps, other]

    eess.IV cs.CV

    Taming Stable Diffusion for Computed Tomography Blind Super-Resolution

    Authors: Chunlei Li, Yilei Shi, Haoxi Hu, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou

    Abstract: High-resolution computed tomography (CT) imaging is essential for medical diagnosis but requires increased radiation exposure, creating a critical trade-off between image quality and patient safety. While deep learning methods have shown promise in CT super-resolution, they face challenges with complex degradations and limited medical training data. Meanwhile, large-scale pre-trained diffusion mod…

    Submitted 13 June, 2025; originally announced June 2025.

  7. arXiv:2506.07876  [pdf, ps, other]

    cs.RO eess.SY

    Versatile Loco-Manipulation through Flexible Interlimb Coordination

    Authors: Xinghao Zhu, Yuxin Chen, Lingfeng Sun, Farzad Niroui, Simon Le Cleac'h, Jiuguang Wang, Kuan Fang

    Abstract: The ability to flexibly leverage limbs for loco-manipulation is essential for enabling autonomous robots to operate in unstructured environments. Yet, prior work on loco-manipulation is often constrained to specific tasks or predetermined limb configurations. In this work, we present Reinforcement Learning for Interlimb Coordination (ReLIC), an approach that enables versatile loco-manipulation thr…

    Submitted 10 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

  8. arXiv:2506.03445  [pdf, ps, other]

    stat.ME eess.SP

    Maximum Likelihood for Logistic Regression Model with Incomplete and Hybrid-Type Covariates

    Authors: Mohamed Cherifi, Xujia Zhu, Mohammed Nabil El Korso, Ammar Mesloub

    Abstract: Logistic regression is a fundamental and widely used statistical method for modeling binary outcomes based on covariates. However, the presence of missing data, particularly in settings involving hybrid covariates (a mix of discrete and continuous variables), poses significant challenges. In this paper, we propose a novel Expectation-Maximization based algorithm tailored for parameter estimation i…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: 20 pages, 4 figures. To appear in IEEE Signal Processing Letters

    MSC Class: 62F10; 62J12; 62H30

  9. arXiv:2505.22063  [pdf, ps, other]

    cs.SD eess.AS

    Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR

    Authors: Mingchen Shao, Xinfa Zhu, Chengyou Wang, Bingshen Mu, Hai Li, Ying Yan, Junhui Liu, Danming Xie, Lei Xie

    Abstract: Despite remarkable achievements, automatic speech recognition (ASR) in low-resource scenarios still faces two challenges: high-quality data scarcity and high computational demands. This paper proposes EThai-ASR, the first to apply large language models (LLMs) to Thai ASR and create an efficient LLM-based ASR system. EThai-ASR comprises a speech encoder, a connection module and a Thai LLM decoder.…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: Accepted by INTERSPEECH 2025

  10. arXiv:2505.19476  [pdf, ps, other]

    eess.AS eess.SP

    FlowSE: Efficient and High-Quality Speech Enhancement via Flow Matching

    Authors: Ziqian Wang, Zikai Liu, Xinfa Zhu, Yike Zhu, Mingshuai Liu, Jun Chen, Longshuai Xiao, Chao Weng, Lei Xie

    Abstract: Generative models have excelled in audio tasks using approaches such as language models, diffusion, and flow matching. However, existing generative approaches for speech enhancement (SE) face notable challenges: language model-based methods suffer from quantization loss, leading to compromised speaker similarity and intelligibility, while diffusion models require complex training and high inferenc…

    Submitted 27 May, 2025; v1 submitted 25 May, 2025; originally announced May 2025.

    Comments: Accepted to InterSpeech 2025

  11. arXiv:2505.13880  [pdf, ps, other]

    eess.AS cs.SD eess.SP

    U-SAM: An audio language Model for Unified Speech, Audio, and Music Understanding

    Authors: Ziqian Wang, Xianjun Xia, Xinfa Zhu, Lei Xie

    Abstract: The text generation paradigm for audio tasks has opened new possibilities for unified audio understanding. However, existing models face significant challenges in achieving a comprehensive understanding across diverse audio types, such as speech, general audio events, and music. Furthermore, their exclusive reliance on cross-entropy loss for alignment often falls short, as it treats all tokens equ…

    Submitted 27 May, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

    Comments: Accepted to Interspeech 2025

  12. arXiv:2505.12740  [pdf, ps, other]

    eess.SP

    Multi-Reference and Adaptive Nonlinear Transform Source-Channel Coding for Wireless Image Semantic Transmission

    Authors: Cheng Yuan, Yufei Jiang, Xu Zhu

    Abstract: We propose a multi-reference and adaptive nonlinear transform source-channel coding (MA-NTSCC) system for wireless image semantic transmission to improve rate-distortion (RD) performance by introducing multi-dimensional contexts into the entropy model of the state-of-the-art (SOTA) NTSCC system. Improvements in RD performance of the proposed MA-NTSCC system are particularly significant in high-res…

    Submitted 19 May, 2025; originally announced May 2025.

  13. arXiv:2505.09980  [pdf, ps, other]

    eess.SY

    Event-Triggered Synergistic Controllers with Dwell-Time Transmission

    Authors: Xuanzhi Zhu, Pedro Casau, Carlos Silvestre

    Abstract: We propose novel event-triggered synergistic controllers for nonlinear continuous-time plants by incorporating event-triggered control into stabilizing synergistic controllers. We highlight that a naive application of common event-triggering conditions may not ensure dwell-time transmission due to the joint jumping dynamics of the closed-loop system. Under mild conditions, we develop a suite of ev…

    Submitted 15 May, 2025; originally announced May 2025.

    Comments: 9 pages, 2 figures, 1 table

  14. arXiv:2505.01476  [pdf, other]

    eess.IV cs.AI cs.CV

    CostFilter-AD: Enhancing Anomaly Detection through Matching Cost Filtering

    Authors: Zhe Zhang, Mingxiu Cai, Hanxiao Wang, Gaochang Wu, Tianyou Chai, Xiatian Zhu

    Abstract: Unsupervised anomaly detection (UAD) seeks to localize the anomaly mask of an input image with respect to normal samples. Either by reconstructing normal counterparts (reconstruction-based) or by learning an image feature embedding space (embedding-based), existing approaches fundamentally rely on image-level or feature-level matching to derive anomaly scores. Often, such a matching process is ina…

    Submitted 23 May, 2025; v1 submitted 2 May, 2025; originally announced May 2025.

    Comments: 25 pages, 12 figures, 20 tables, accepted by Forty-Second International Conference on Machine Learning (ICML 2025), link: https://icml.cc/virtual/2025/poster/46359

  15. arXiv:2505.01212  [pdf, other]

    cs.CV eess.IV

    High Dynamic Range Novel View Synthesis with Single Exposure

    Authors: Kaixuan Zhang, Hu Wang, Minxian Li, Mingwu Ren, Mao Ye, Xiatian Zhu

    Abstract: High Dynamic Range Novel View Synthesis (HDR-NVS) aims to establish a 3D scene HDR model from Low Dynamic Range (LDR) imagery. Typically, multiple-exposure LDR images are employed to capture a wider range of brightness levels in a scene, as a single LDR image cannot represent both the brightest and darkest regions simultaneously. While effective, this multiple-exposure HDR-NVS approach has signifi…

    Submitted 19 May, 2025; v1 submitted 2 May, 2025; originally announced May 2025.

    Comments: It has been accepted by ICML 2025

  16. arXiv:2504.07703  [pdf, other]

    eess.SY

    Optimal Frequency Support from Virtual Power Plants: Minimal Reserve and Allocation

    Authors: Xiang Zhu, Guangchun Ruan, Hua Geng

    Abstract: This paper proposes a novel reserve-minimizing and allocation strategy for virtual power plants (VPPs) to deliver optimal frequency support. The proposed strategy enables VPPs, acting as aggregators for inverter-based resources (IBRs), to provide optimal frequency support economically. The proposed strategy captures time-varying active power injections, reducing the unnecessary redundancy compared…

    Submitted 10 April, 2025; originally announced April 2025.

    Comments: Accepted by Applied Energy

  17. arXiv:2504.01025  [pdf]

    eess.IV cs.AI cs.CV physics.med-ph

    Diagnosis of Pulmonary Hypertension by Integrating Multimodal Data with a Hybrid Graph Convolutional and Transformer Network

    Authors: Fubao Zhu, Yang Zhang, Gengmin Liang, Jiaofen Nan, Yanting Li, Chuang Han, Danyang Sun, Zhiguo Wang, Chen Zhao, Wenxuan Zhou, Jian He, Yi Xu, Iokfai Cheang, Xu Zhu, Yanli Zhou, Weihua Zhou

    Abstract: Early and accurate diagnosis of pulmonary hypertension (PH) is essential for optimal patient management. Differentiating between pre-capillary and post-capillary PH is critical for guiding treatment decisions. This study develops and validates a deep learning-based diagnostic model for PH, designed to classify patients as non-PH, pre-capillary PH, or post-capillary PH. This retrospective study ana…

    Submitted 27 March, 2025; originally announced April 2025.

    Comments: 23 pages, 8 figures, 4 tables

  18. arXiv:2503.23783  [pdf]

    eess.SP

    ANNs-SaDE: A Machine-Learning-Based Design Automation Framework for Microwave Branch-Line Couplers

    Authors: Tianqi Chen, Wei Huang, Qiang Wu, Li Yang, Roberto Gómez-García, Xi Zhu

    Abstract: The traditional method for designing branch-line couplers involves a trial-and-error optimization process that requires multiple design iterations through electromagnetic (EM) simulations. Thus, it is extremely time consuming and labor intensive. In this paper, a novel machine-learning-based framework is proposed to tackle this issue. It integrates artificial neural networks with a self-adaptive d…

    Submitted 31 March, 2025; originally announced March 2025.

    Comments: This paper has been accepted for presentation at ISCAS 2025

  19. arXiv:2503.19703  [pdf, other]

    cs.CV eess.IV

    High-Quality Spatial Reconstruction and Orthoimage Generation Using Efficient 2D Gaussian Splatting

    Authors: Qian Wang, Zhihao Zhan, Jialei He, Zhituo Tu, Xiang Zhu, Jie Yuan

    Abstract: Highly accurate geometric precision and dense image features characterize True Digital Orthophoto Maps (TDOMs), which are in great demand for applications such as urban planning, infrastructure management, and environmental monitoring. Traditional TDOM generation methods need sophisticated processes, such as Digital Surface Models (DSM) and occlusion detection, which are computationally expensive a…

    Submitted 13 May, 2025; v1 submitted 25 March, 2025; originally announced March 2025.

  20. arXiv:2503.14966  [pdf, other]

    cs.CV eess.IV

    Ultrasound Image-to-Video Synthesis via Latent Dynamic Diffusion Models

    Authors: Tingxiu Chen, Yilei Shi, Zixuan Zheng, Bingcong Yan, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou

    Abstract: Ultrasound video classification enables automated diagnosis and has emerged as an important research area. However, publicly available ultrasound video datasets remain scarce, hindering progress in developing effective video classification models. We propose addressing this shortage by synthesizing plausible ultrasound videos from readily available, abundant ultrasound images. To this end, we intr…

    Submitted 19 March, 2025; originally announced March 2025.

    Comments: MICCAI 2024

  21. arXiv:2503.13987  [pdf, other]

    eess.IV cs.CV

    Striving for Simplicity: Simple Yet Effective Prior-Aware Pseudo-Labeling for Semi-Supervised Ultrasound Image Segmentation

    Authors: Yaxiong Chen, Yujie Wang, Zixuan Zheng, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou

    Abstract: Medical ultrasound imaging is ubiquitous, but manual analysis struggles to keep pace. Automated segmentation can help but requires large labeled datasets, which are scarce. Semi-supervised learning leveraging both unlabeled and limited labeled data is a promising approach. State-of-the-art methods use consistency regularization or pseudo-labeling but grow increasingly complex. Without sufficient l…

    Submitted 18 March, 2025; originally announced March 2025.

    Comments: MICCAI 2024

  22. arXiv:2503.09961  [pdf, other]

    eess.SP

    Edge-Fog Computing-Enabled EEG Data Compression via Asymmetrical Variational Discrete Cosine Transform Network

    Authors: Xin Zhu, Hongyi Pan, Ahmet Enis Cetin

    Abstract: The large volume of electroencephalograph (EEG) data produced by brain-computer interface (BCI) systems presents challenges for rapid transmission over bandwidth-limited channels in Internet of Things (IoT) networks. To address the issue, we propose a novel multi-channel asymmetrical variational discrete cosine transform (DCT) network for EEG data compression within an edge-fog computing framework…

    Submitted 12 March, 2025; originally announced March 2025.

    Comments: Accepted by the IEEE Internet of Things Journal

  23. arXiv:2503.03355  [pdf, other]

    cs.CV cs.LG eess.IV

    Rethinking Video Super-Resolution: Towards Diffusion-Based Methods without Motion Alignment

    Authors: Zhihao Zhan, Wang Pang, Xiang Zhu, Yechao Bai

    Abstract: In this work, we rethink the approach to video super-resolution by introducing a method based on the Diffusion Posterior Sampling framework, combined with an unconditional video diffusion transformer operating in latent space. The video generation model, a diffusion transformer, functions as a space-time model. We argue that a powerful model, which learns the physics of the real world, can easily…

    Submitted 8 May, 2025; v1 submitted 5 March, 2025; originally announced March 2025.

  24. arXiv:2503.01710  [pdf, other]

    cs.SD cs.AI eess.AS

    Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

    Authors: Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang, Songxiang Liu, Linqin Li, Zheng Liang, Qixi Zheng, Rui Wang, Xiaoqin Feng, Weizhen Bian, Zhen Ye, Sitong Cheng, Ruibin Yuan, Zhixian Zhao, Xinfa Zhu, Jiahao Pan, Liumeng Xue, Pengcheng Zhu, Yunlin Chen, Zhifei Li, Xie Chen, Lei Xie, Yike Guo, Wei Xue

    Abstract: Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a sin…

    Submitted 3 March, 2025; originally announced March 2025.

    Comments: Submitted to ACL 2025

  25. arXiv:2503.01202  [pdf, other]

    cs.CV cs.RO eess.IV

    A Multi-Sensor Fusion Approach for Rapid Orthoimage Generation in Large-Scale UAV Mapping

    Authors: Jialei He, Zhihao Zhan, Zhituo Tu, Xiang Zhu, Jie Yuan

    Abstract: Rapid generation of large-scale orthoimages from Unmanned Aerial Vehicles (UAVs) has been a long-standing focus of research in the field of aerial mapping. A multi-sensor UAV system, integrating the Global Positioning System (GPS), Inertial Measurement Unit (IMU), 4D millimeter-wave radar and camera, can provide an effective solution to this problem. In this paper, we utilize multi-sensor data to…

    Submitted 4 March, 2025; v1 submitted 3 March, 2025; originally announced March 2025.

  26. arXiv:2503.00980  [pdf, other]

    eess.SP

    RSSI Positioning with Fluid Antenna Systems

    Authors: Wenzhi Liu, Zhisheng Rong, Xiayue Liu, Yufei Jiang, Xu Zhu

    Abstract: We introduce a novel received signal strength intensity (RSSI)-based positioning method using fluid antenna systems (FAS), leveraging their inherent channel correlation properties to improve location accuracy. By enabling a single antenna to sample multiple spatial positions, FAS exhibits high correlation between its ports. We integrate this high inter-port correlation with a logarithmic path loss…

    Submitted 2 March, 2025; originally announced March 2025.

  27. arXiv:2503.00493  [pdf, ps, other]

    eess.AS cs.AI cs.CL cs.SD

    LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement

    Authors: Boyi Kang, Xinfa Zhu, Zihan Zhang, Zhen Ye, Mingshuai Liu, Ziqian Wang, Yike Zhu, Guobin Ma, Jun Chen, Longshuai Xiao, Chao Weng, Wei Xue, Lei Xie

    Abstract: Recent advancements in language models (LMs) have demonstrated strong capabilities in semantic understanding and contextual modeling, which have flourished in generative speech enhancement (SE). However, many LM-based SE approaches primarily focus on semantic information, often neglecting the critical role of acoustic information, which leads to acoustic inconsistency after enhancement and limited…

    Submitted 10 June, 2025; v1 submitted 1 March, 2025; originally announced March 2025.

    Comments: ACL2025 main, Codes available at https://github.com/Kevin-naticl/LLaSE-G1

  28. arXiv:2503.00348  [pdf, other]

    cs.CV eess.IV

    SHAZAM: Self-Supervised Change Monitoring for Hazard Detection and Mapping

    Authors: Samuel Garske, Konrad Heidler, Bradley Evans, KC Wong, Xiao Xiang Zhu

    Abstract: The increasing frequency of environmental hazards due to climate change underscores the urgent need for effective monitoring systems. Current approaches either rely on expensive labelled datasets, struggle with seasonal variations, or require multiple observations for confirmation (which delays detection). To address these challenges, this work presents SHAZAM - Self-Supervised Change Monitoring f…

    Submitted 28 February, 2025; originally announced March 2025.

    Comments: 20 pages, 9 figures, 3 tables, code available at: https://github.com/WiseGamgee/SHAZAM

  29. arXiv:2502.18311  [pdf, other]

    eess.SP

    Cost-Effective Single-Antenna RSSI Positioning Through Dynamic Radiation Pattern Analysis

    Authors: Zhisheng Rong, Wenzhi Liu, Xiayue Liu, Zhixiang Xu, Yufei Jiang, Xu Zhu

    Abstract: This paper presents a novel indoor positioning approach that leverages antenna radiation pattern characteristics through Received Signal Strength Indication (RSSI) measurements in a single-antenna system. By rotating the antenna or reconfiguring its radiation pattern, we derive a maximum likelihood estimation (MLE) algorithm that achieves near-optimal positioning accuracy approaching the Cramer-Ra…

    Submitted 3 March, 2025; v1 submitted 25 February, 2025; originally announced February 2025.

    Comments: 6 pages, 7 figures

  30. arXiv:2502.18186  [pdf, other]

    cs.SD cs.CL eess.AS

    Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought

    Authors: Zhixian Zhao, Xinfa Zhu, Xinsheng Wang, Shuiyuan Wang, Xuelong Geng, Wenjie Tian, Lei Xie

    Abstract: Large-scale audio language models (ALMs), such as Qwen2-Audio, are capable of comprehending diverse audio signals, performing audio analysis and generating textual responses. However, in speech emotion recognition (SER), ALMs often suffer from hallucinations, resulting in misclassifications or irrelevant outputs. To address these challenges, we propose C$^2$SER, a novel ALM designed to enhance the…

    Submitted 25 February, 2025; originally announced February 2025.

  31. arXiv:2502.04128  [pdf, other]

    eess.AS cs.AI cs.CL cs.MM cs.SD

    Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

    Authors: Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi Dai, Hongzhan Lin, Jianyi Chen, Xingjian Du, Liumeng Xue, Yunlin Chen, Zhifei Li, Lei Xie, Qiuqiang Kong, Yike Guo, Wei Xue

    Abstract: Recent advances in text-based large language models (LLMs), particularly in the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., diffusion models after LLM), complicating the decision of whether to scale a pa…

    Submitted 22 February, 2025; v1 submitted 6 February, 2025; originally announced February 2025.

  32. arXiv:2501.16761  [pdf, other]

    eess.AS cs.SD

    CosyAudio: Improving Audio Generation with Confidence Scores and Synthetic Captions

    Authors: Xinfa Zhu, Wenjie Tian, Xinsheng Wang, Lei He, Xi Wang, Sheng Zhao, Lei Xie

    Abstract: Text-to-Audio (TTA) generation is an emerging area within AI-generated content (AIGC), where audio is created from natural language descriptions. Despite growing interest, developing robust TTA models remains challenging due to the scarcity of well-labeled datasets and the prevalence of noisy or inaccurate captions in large-scale, weakly labeled corpora. To address these challenges, we propose Cos…

    Submitted 28 January, 2025; originally announced January 2025.

    Comments: 12 pages, 5 figures, 7 tables

  33. arXiv:2501.15085  [pdf, other]

    cs.AI cs.LG eess.SY

    Data Center Cooling System Optimization Using Offline Reinforcement Learning

    Authors: Xianyuan Zhan, Xiangyu Zhu, Peng Cheng, Xiao Hu, Ziteng He, Hanfei Geng, Jichao Leng, Huiwen Zheng, Chenhui Liu, Tianshun Hong, Yan Liang, Yunxin Liu, Feng Zhao

    Abstract: The recent advances in information technology and artificial intelligence have fueled a rapid expansion of the data center (DC) industry worldwide, accompanied by an immense appetite for electricity to power the DCs. In a typical DC, around 30~40% of the energy is spent on the cooling system rather than on computer servers, posing a pressing need for developing new energy-saving optimization techn…

    Submitted 14 February, 2025; v1 submitted 25 January, 2025; originally announced January 2025.

    Comments: Accepted in ICLR 2025

  34. arXiv:2501.13306  [pdf, other]

    cs.SD cs.CL eess.AS

    OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia

    Authors: Xuelong Geng, Kun Wei, Qijie Shao, Shuiyun Liu, Zhennan Lin, Zhixian Zhao, Guojian Li, Wenjie Tian, Peikun Chen, Yangze Li, Pengcheng Guo, Mingchen Shao, Shuiyuan Wang, Yuang Cao, Chengyou Wang, Tianyi Xu, Yuhang Dai, Xinfa Zhu, Yue Li, Li Zhang, Lei Xie

    Abstract: Large Language Models (LLMs) have made significant progress in various downstream tasks, inspiring the development of Speech Understanding Language Models (SULMs) to enable comprehensive speech-based interactions. However, most advanced SULMs are developed by the industry, leveraging large-scale datasets and computational resources that are not readily available to the academic community. Moreover…

    Submitted 16 February, 2025; v1 submitted 22 January, 2025; originally announced January 2025.

    Comments: OSUM Technical Report v2. The experimental results reported herein differ from those in v1 because of adding new data and training in more steps

  35. arXiv:2501.12604  [pdf, other]

    eess.IV cs.CV cs.LG

    Image Motion Blur Removal in the Temporal Dimension with Video Diffusion Models

    Authors: Wang Pang, Zhihao Zhan, Xiang Zhu, Yechao Bai

    Abstract: Most motion deblurring algorithms rely on spatial-domain convolution models, which struggle with the complex, non-linear blur arising from camera shake and object motion. In contrast, we propose a novel single-image deblurring approach that treats motion blur as a temporal averaging phenomenon. Our core innovation lies in leveraging a pre-trained video diffusion transformer model to capture divers…

    Submitted 21 January, 2025; originally announced January 2025.

  36. arXiv:2501.11737  [pdf, other]

    eess.SP

    Efficient Bearing Sensor Data Compression via an Asymmetrical Autoencoder with a Lifting Wavelet Transform Layer

    Authors: Xin Zhu, Ahmet Enis Cetin

    Abstract: Bearing data compression is vital to manage the large volumes of data generated during condition monitoring. In this paper, a novel asymmetrical autoencoder with a lifting wavelet transform (LWT) layer is developed to compress bearing sensor data. The encoder part of the network consists of a convolutional layer followed by a wavelet filterbank layer. Specifically, a dual-channel convolutional blo…

    Submitted 20 January, 2025; originally announced January 2025.

    Comments: Accepted at the 2025 IEEE International Symposium on Circuits and Systems

  37. arXiv:2501.04416  [pdf, other]

    eess.AS cs.SD

    ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training

    Authors: Xinfa Zhu, Lei He, Yujia Xiao, Xi Wang, Xu Tan, Sheng Zhao, Lei Xie

    Abstract: Style voice conversion aims to transform the speaking style of source speech into a desired style while keeping the original speaker's identity. However, previous style voice conversion approaches primarily focus on well-defined domains such as emotional aspects, limiting their practical applications. In this study, we present ZSVC, a novel Zero-shot Style Voice Conversion approach that utilizes a…

    Submitted 8 January, 2025; originally announced January 2025.

    Comments: 5 pages, 3 figures, accepted by ICASSP 2025

  38. arXiv:2412.18077  [pdf]

    physics.med-ph eess.IV

    Optimizing In Vivo Data Acquisition for Robust Clinical Microvascular Imaging Using Ultrasound Localization Microscopy

    Authors: Chengwu Huang, U-Wai Lok, Jingke Zhang, Xiang Yang Zhu, James D. Krier, Amy Stern, Kate M. Knoll, Kendra E. Petersen, Kathryn A. Robinson, Gina K. Hesley, Andrew J. Bentall, Thomas D. Atwell, Andrew D. Rule, Lilach O. Lerman, Shigao Chen

    Abstract: Ultrasound localization microscopy (ULM) enables microvascular imaging at spatial resolutions beyond the acoustic diffraction limit, offering significant clinical potential. However, ULM performance relies heavily on microbubble (MB) signal sparsity, the number of detected MBs, and signal-to-noise ratio (SNR), all of which vary in clinical scenarios involving bolus MB injections. These sources of…

    Submitted 23 December, 2024; originally announced December 2024.

    Comments: 33 pages, 9 figures

  39. arXiv:2412.16846  [pdf, other]

    eess.AS cs.CL cs.SD

    Autoregressive Speech Synthesis with Next-Distribution Prediction

    Authors: Xinfa Zhu, Wenjie Tian, Lei Xie

    Abstract: We introduce KALL-E, a novel autoregressive (AR) language modeling approach with next-distribution prediction for text-to-speech (TTS) synthesis. Unlike existing methods, KALL-E directly models and predicts the continuous speech distribution conditioned on text without relying on VAE- or diffusion-based components. Specifically, we use WaveVAE to extract continuous speech distributions from wavefo…

    Submitted 21 December, 2024; originally announced December 2024.

    Comments: Technical report, work in progress

  40. arXiv:2412.13676  [pdf, ps, other]

    cs.ET eess.SP

    Robust UAV Jittering and Task Scheduling in Mobile Edge Computing with Data Compression

    Authors: Bin Li, Xiao Zhu, Junyi Wang

    Abstract: Data compression technology is able to reduce data size, which can be applied to lower the cost of task offloading in mobile edge computing (MEC). This paper addresses the practical challenges for robust trajectory and scheduling optimization based on data compression in the unmanned aerial vehicle (UAV)-assisted MEC, aiming to minimize the sum energy cost of terminal users while maintaining robus…

    Submitted 18 December, 2024; originally announced December 2024.

    Comments: 10 pages, 8 figures

  41. arXiv:2412.09168  [pdf, other]

    cs.SD cs.CV cs.MM eess.AS

    YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls

    Authors: Zihao Chen, Haomin Zhang, Xinhan Di, Haoyu Wang, Sizhe Shan, Junjie Zheng, Yunming Liang, Yihan Fan, Xinfa Zhu, Wenjie Tian, Yihua Wang, Chaofan Ding, Lei Xie

    Abstract: Generating sound effects for product-level videos, where only a small amount of labeled data is available for diverse scenes, requires the production of high-quality sounds in few-shot settings. To tackle the challenge of limited labeled data in real-world scenes, we introduce YingSound, a foundation model designed for video-guided sound generation that supports high-quality audio generation in fe…

    Submitted 12 December, 2024; originally announced December 2024.

    Comments: 16 pages, 4 figures

  42. arXiv:2412.06451  [pdf, other]

    cs.LG cs.AI eess.IV

    How Certain are Uncertainty Estimates? Three Novel Earth Observation Datasets for Benchmarking Uncertainty Quantification in Machine Learning

    Authors: Yuanyuan Wang, Qian Song, Dawood Wasif, Muhammad Shahzad, Christoph Koller, Jonathan Bamber, Xiao Xiang Zhu

    Abstract: Uncertainty quantification (UQ) is essential for assessing the reliability of Earth observation (EO) products. However, the extensive use of machine learning models in EO introduces an additional layer of complexity, as those models themselves are inherently uncertain. While various UQ methods do exist for machine learning models, their performance on EO datasets remains largely unevaluated. A key…

    Submitted 9 December, 2024; originally announced December 2024.

    Comments: Submitted to IEEE Geoscience and Remote Sensing Magazine

  43. arXiv:2411.18918  [pdf, other]

    cs.SD eess.AS

    CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion

    Authors: Yuke Li, Xinfa Zhu, Hanzhao Li, JiXun Yao, WenJie Tian, XiPeng Yang, YunLin Chen, Zhifei Li, Lei Xie

    Abstract: Zero-shot voice conversion (VC) aims to convert the original speaker's timbre to any target speaker while keeping the linguistic content. Current mainstream zero-shot voice conversion approaches depend on pre-trained recognition models to disentangle linguistic content and speaker representation. This results in a timbre residue within the decoupled linguistic content and inadequacies in speaker r…

    Submitted 3 December, 2024; v1 submitted 28 November, 2024; originally announced November 2024.

  44. RTSR: A Real-Time Super-Resolution Model for AV1 Compressed Content

    Authors: Yuxuan Jiang, Jakub Nawała, Chen Feng, Fan Zhang, Xiaoqing Zhu, Joel Sole, David Bull

    Abstract: Super-resolution (SR) is a key technique for improving the visual quality of video content by increasing its spatial resolution while reconstructing fine details. SR has been employed in many applications including video streaming, where compressed low-resolution content is typically transmitted to end users and then reconstructed with a higher resolution and enhanced quality. To support real-time…

    Submitted 20 November, 2024; originally announced November 2024.

  45. arXiv:2411.09307  [pdf, ps, other]

    eess.SY

    Model-Based Event-Triggered Implementation of Hybrid Controllers Using Finite-Time Convergent Observers

    Authors: Xuanzhi Zhu, Pedro Casau, Carlos Silvestre

    Abstract: In this paper, we explore the conditions for asymptotic stability of the hybrid closed-loop system resulting from the interconnection of a nonlinear plant, an intelligent sensor that generates finite-time convergent estimates of the plant state, and a controller node that receives opportunistic samples from the sensor node when certain model-based event-triggering conditions are met. The proposed…

    Submitted 14 November, 2024; originally announced November 2024.

  46. arXiv:2411.08680  [pdf, other]

    cs.IT eess.SP

    Finite-Alphabet-Aware Trajectory and Precoder Optimization for UAV Relaying

    Authors: Haoyang Di, Xiaodong Zhu, Yulin Shao

    Abstract: Unmanned aerial vehicles (UAVs) have become key enablers in relay-assisted wireless communications thanks to their flexibility and line-of-sight channel advantage. However, most existing trajectory optimization frameworks assume ideal Gaussian inputs, overlooking the fact that practical wireless systems rely on structured, finite-alphabet constellations. This mismatch can lead to suboptimal, and s…

    Submitted 12 May, 2025; v1 submitted 13 November, 2024; originally announced November 2024.

  47. arXiv:2411.08413  [pdf, other]

    eess.SY

    Inference-Aware State Reconstruction for Industrial Metaverse under Synchronous/Asynchronous Short-Packet Transmission

    Authors: Qinqin Xiong, Jie Cao, Xu Zhu, Yufei Jiang, Nikolaos Pappas

    Abstract: We consider a real-time state reconstruction system for industrial metaverse. The time-varying physical process states in real space are captured by multiple sensors via wireless links, and then reconstructed in virtual space. In this paper, we use the spatial-temporal correlation of the sensor data of interest to infer the real-time data of the target sensor to reduce the mean squared error (MSE)…

    Submitted 13 November, 2024; originally announced November 2024.

  48. arXiv:2411.02236  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    3D Audio-Visual Segmentation

    Authors: Artem Sokolov, Swapnil Bhosale, Xiatian Zhu

    Abstract: Recognizing the sounding objects in scenes is a longstanding objective in embodied AI, with diverse applications in robotics and AR/VR/MR. To that end, Audio-Visual Segmentation (AVS), taking as condition an audio signal to identify the masks of the target sounding objects in an input image with synchronous camera and microphone sensors, has been recently advanced. However, this paradigm is still…

    Submitted 4 November, 2024; originally announced November 2024.

    Comments: Accepted at the NeurIPS 2024 Workshop on Audio Imagination

  49. arXiv:2411.00373  [pdf, other]

    cs.IT eess.SP

    Discrete RIS Enhanced Space Shift Keying MIMO System via Reflecting Beamforming Optimization

    Authors: Xusheng Zhu, Qingqing Wu, Wen Chen, Xinyuan He, Lexi Xu, Yaxin Zhang

    Abstract: In this paper, a discrete reconfigurable intelligent surface (RIS)-assisted spatial shift keying (SSK) multiple-input multiple-output (MIMO) scheme is investigated, in which a direct link between the transmitter and the receiver is considered. To improve the reliability of the RIS-SSK-MIMO scheme, we formulate an objective function based on minimizing the average bit error probability (ABEP). Sinc…

    Submitted 1 November, 2024; originally announced November 2024.

  50. arXiv:2410.23815  [pdf, other]

    cs.SD cs.AI eess.AS

    The NPU-HWC System for the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge

    Authors: Dake Guo, Jixun Yao, Xinfa Zhu, Kangxiang Xia, Zhao Guo, Ziyu Zhang, Yao Wang, Jie Liu, Lei Xie

    Abstract: This paper presents the NPU-HWC system submitted to the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge 2024 (ICAGC). Our system consists of two modules: a speech generator for Track 1 and a background audio generator for Track 2. In Track 1, we employ Single-Codec to tokenize the speech into discrete tokens and use a language-model-based approach to achieve zero-shot speaking…

    Submitted 31 October, 2024; originally announced October 2024.

    Comments: Accepted by ISCSLP 2024