Showing 1–50 of 210 results for author: Xiao, J

Searching in archive eess.
  1. arXiv:2507.00366  [pdf, ps, other]

    cs.IT eess.SP

    Wireless AI Evolution: From Statistical Learners to Electromagnetic-Guided Foundation Models

    Authors: Jian Xiao, Ji Wang, Kunrui Cao, Xingwang Li, Zhao Chen, Chau Yuen

    Abstract: While initial applications of artificial intelligence (AI) in wireless communications over the past decade have demonstrated considerable potential using specialized models for targeted communication tasks, the revolutionary demands of sixth-generation (6G) networks for holographic communications, ubiquitous sensing, and native intelligence are propelling a necessary evolution towards AI-native wi…

    Submitted 30 June, 2025; originally announced July 2025.

  2. arXiv:2505.11909  [pdf, other]

    eess.IV cs.CV

    Bridging the Inter-Domain Gap through Low-Level Features for Cross-Modal Medical Image Segmentation

    Authors: Pengfei Lyu, Pak-Hei Yeung, Xiaosheng Yu, Jing Xia, Jianning Chi, Chengdong Wu, Jagath C. Rajapakse

    Abstract: This paper addresses the task of cross-modal medical image segmentation by exploring unsupervised domain adaptation (UDA) approaches. We propose a model-agnostic UDA framework, LowBridge, which builds on a simple observation that cross-modal images share some similar low-level features (e.g., edges) as they are depicting the same structures. Specifically, we first train a generative model to recov…

    Submitted 17 May, 2025; originally announced May 2025.

    Comments: 11 pages, 2 figures

  3. arXiv:2505.04281  [pdf, other]

    cs.CV eess.IV

    TS-Diff: Two-Stage Diffusion Model for Low-Light RAW Image Enhancement

    Authors: Yi Li, Zhiyuan Zhang, Jiangnan Xia, Jianghan Cheng, Qilong Wu, Junwei Li, Yibin Tian, Hui Kong

    Abstract: This paper presents a novel Two-Stage Diffusion Model (TS-Diff) for enhancing extremely low-light RAW images. In the pre-training stage, TS-Diff synthesizes noisy images by constructing multiple virtual cameras based on a noise space. Camera Feature Integration (CFI) modules are then designed to enable the model to learn generalizable features across diverse virtual cameras. During the aligning st…

    Submitted 7 May, 2025; originally announced May 2025.

    Comments: International Joint Conference on Neural Networks (IJCNN)

  4. arXiv:2504.10819  [pdf, other]

    cs.SD eess.AS

    Generalized Audio Deepfake Detection Using Frame-level Latent Information Entropy

    Authors: Botao Zhao, Zuheng Kang, Yayun He, Xiaoyang Qu, Junqing Peng, Jing Xiao, Jianzong Wang

    Abstract: Generalizability, the capacity of a robust model to perform effectively on unseen data, is crucial for audio deepfake detection due to the rapid evolution of text-to-speech (TTS) and voice conversion (VC) technologies. A promising approach to differentiate between bonafide and spoof samples lies in identifying intrinsic disparities to enhance model generalizability. From an information-theoretic p…

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: Accepted by IEEE International Conference on Multimedia & Expo 2025 (ICME 2025)

  5. arXiv:2503.17970  [pdf, other]

    eess.IV cs.CV

    PathoHR: Breast Cancer Survival Prediction on High-Resolution Pathological Images

    Authors: Yang Luo, Shiru Wang, Jun Liu, Jiaxuan Xiao, Rundong Xue, Zeyu Zhang, Hao Zhang, Yu Lu, Yang Zhao, Yutong Xie

    Abstract: Breast cancer survival prediction in computational pathology presents a remarkable challenge due to tumor heterogeneity. For instance, different regions of the same tumor in the pathology image can show distinct morphological and molecular characteristics. This makes it difficult to extract representative features from whole slide images (WSIs) that truly reflect the tumor's aggressive potential a…

    Submitted 23 March, 2025; originally announced March 2025.

  6. arXiv:2503.13268  [pdf, other]

    cs.IT eess.SP

    Channel Estimation for Pinching-Antenna Systems (PASS)

    Authors: Jian Xiao, Ji Wang, Yuanwei Liu

    Abstract: Pinching Antennas (PAs) represent a revolutionary flexible antenna technology that leverages dielectric waveguides and electromagnetic coupling to mitigate large-scale path loss. This letter is the first to explore channel estimation for Pinching-Antenna SyStems (PASS), addressing their uniquely ill-conditioned and underdetermined channel characteristics. In particular, two efficient deep learning…

    Submitted 10 May, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

  7. arXiv:2503.09560  [pdf, other]

    eess.IV cs.CV

    FCaS: Fine-grained Cardiac Image Synthesis based on 3D Template Conditional Diffusion Model

    Authors: Jiahao Xia, Yutao Hu, Yaolei Qi, Zhenliang Li, Wenqi Shao, Junjun He, Ying Fu, Longjiang Zhang, Guanyu Yang

    Abstract: Solving medical imaging data scarcity through semantic image generation has attracted significant attention in recent years. However, existing methods primarily focus on generating whole-organ or large-tissue structures, showing limited effectiveness for organs with fine-grained structure. Due to stringent topological consistency, fragile coronary features, and complex 3D morphological heterogenei…

    Submitted 12 March, 2025; originally announced March 2025.

    Comments: 16 pages, 9 figures

  8. arXiv:2502.02862  [pdf, other]

    eess.IV cs.AI cs.CV

    Learning Generalizable Features for Tibial Plateau Fracture Segmentation Using Masked Autoencoder and Limited Annotations

    Authors: Peiyan Yue, Die Cai, Chu Guo, Mengxing Liu, Jun Xia, Yi Wang

    Abstract: Accurate automated segmentation of tibial plateau fractures (TPF) from computed tomography (CT) requires large amounts of annotated data to train deep learning models, but obtaining such annotations presents unique challenges. The process demands expert knowledge to identify diverse fracture patterns, assess severity, and account for individual anatomical variations, making the annotation process…

    Submitted 9 April, 2025; v1 submitted 4 February, 2025; originally announced February 2025.

    Comments: 5 pages, 6 figures. Accepted to IEEE EMBC 2025

  9. arXiv:2501.19129  [pdf, other]

    cs.CV eess.IV

    RGB-Event ISP: The Dataset and Benchmark

    Authors: Yunfan Lu, Yanlin Qian, Ziyang Rao, Junren Xiao, Liming Chen, Hui Xiong

    Abstract: Event-guided imaging has received significant attention due to its potential to revolutionize instant imaging systems. However, the prior methods primarily focus on enhancing RGB images in a post-processing manner, neglecting the challenges of image signal processor (ISP) dealing with event sensor and the benefits events provide for reforming the ISP process. To achieve this, we conduct the first…

    Submitted 31 January, 2025; originally announced January 2025.

    Comments: Accepted by ICLR 2025; 14 pages, 8 figures, 4 tables

  10. arXiv:2501.12235  [pdf, other]

    cs.CV eess.IV

    DLEN: Dual Branch of Transformer for Low-Light Image Enhancement in Dual Domains

    Authors: Junyu Xia, Jiesong Bai, Yihang Dong

    Abstract: Low-light image enhancement (LLE) aims to improve the visual quality of images captured in poorly lit conditions, which often suffer from low brightness, low contrast, noise, and color distortions. These issues hinder the performance of computer vision tasks such as object detection, facial recognition, and autonomous driving. Traditional enhancement techniques, such as multi-scale fusion and histo…

    Submitted 21 April, 2025; v1 submitted 21 January, 2025; originally announced January 2025.

    Comments: 9 pages and 6 figures

  11. arXiv:2501.10891  [pdf, other]

    eess.IV cs.AI cs.CV eess.SP

    OpenEarthMap-SAR: A Benchmark Synthetic Aperture Radar Dataset for Global High-Resolution Land Cover Mapping

    Authors: Junshi Xia, Hongruixuan Chen, Clifford Broni-Bediako, Yimin Wei, Jian Song, Naoto Yokoya

    Abstract: High-resolution land cover mapping plays a crucial role in addressing a wide range of global challenges, including urban planning, environmental monitoring, disaster response, and sustainable development. However, creating accurate, large-scale land cover datasets remains a significant challenge due to the inherent complexities of geospatial data, such as diverse terrain, varying sensor modalities…

    Submitted 21 January, 2025; v1 submitted 18 January, 2025; originally announced January 2025.

    Comments: 8 pages, 3 figures

  12. arXiv:2501.07830  [pdf, other]

    eess.SP

    Deep Learning Waveform Channel Modeling for Wideband Optical Fiber Transmission: Model Comparisons, Challenges and Potential Solutions

    Authors: Minghui Shi, Hang Yang, Zekun Niu, Chuyan Zeng, Junzhe Xiao, Yunfan Zhang, Mingzhe Chen, Weisheng Hu, Lilin Yi

    Abstract: Fast and accurate waveform simulation is critical for understanding fiber channel characteristics, developing digital signal processing (DSP) technologies, optimizing optical network configurations, and advancing the optical fiber transmission system towards wideband. Deep learning (DL) has emerged as a powerful tool for waveform modeling, offering high accuracy and low complexity compared to trad…

    Submitted 3 April, 2025; v1 submitted 13 January, 2025; originally announced January 2025.

  13. arXiv:2501.06019  [pdf, other]

    cs.CV cs.AI eess.IV eess.SP

    BRIGHT: A globally distributed multimodal building damage assessment dataset with very-high-resolution for all-weather disaster response

    Authors: Hongruixuan Chen, Jian Song, Olivier Dietrich, Clifford Broni-Bediako, Weihao Xuan, Junjue Wang, Xinlei Shao, Yimin Wei, Junshi Xia, Cuiling Lan, Konrad Schindler, Naoto Yokoya

    Abstract: Disaster events occur around the world and cause significant damage to human life and property. Earth observation (EO) data enables rapid and comprehensive building damage assessment (BDA), an essential capability in the aftermath of a disaster to reduce human casualties and to inform disaster relief efforts. Recent research focuses on the development of AI models to achieve accurate mapping of un…

    Submitted 18 April, 2025; v1 submitted 10 January, 2025; originally announced January 2025.

  14. arXiv:2412.08161  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    Collaborative Hybrid Propagator for Temporal Misalignment in Audio-Visual Segmentation

    Authors: Kexin Li, Zongxin Yang, Yi Yang, Jun Xiao

    Abstract: Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-producing objects that accurately align with the corresponding audio. However, existing methods often face temporal misalignment, where audio cues and segmentation results are not temporally coordinated. Audio provides two critical pieces of information: i) target object-level details and ii) the timing of when objec…

    Submitted 11 December, 2024; originally announced December 2024.

  15. arXiv:2411.17552  [pdf, other]

    eess.SY

    Ensuring Safety in Target Pursuit Control: A CBF-Safe Reinforcement Learning Approach

    Authors: Yaosheng Deng, Junjie Gao, Jiaping Xiao, Mir Feroskhan

    Abstract: This paper addresses the target-pursuit problem, aiming to ensure each pursuer's safety regarding collision avoidance, sensing range, and input saturation. An input-constrained CBF is proposed to dynamically regulate the pursuer's control, ensuring effective target pursuit even when the target performs evasive maneuvers. To further ensure safety, two sets of CBF constraints are designed to regulat…

    Submitted 10 December, 2024; v1 submitted 26 November, 2024; originally announced November 2024.

    Comments: 12 pages

  16. arXiv:2411.12450  [pdf, other]

    cs.CV eess.IV

    Frequency-Aware Guidance for Blind Image Restoration via Diffusion Models

    Authors: Jun Xiao, Zihang Lyu, Hao Xie, Cong Zhang, Yakun Ju, Changjian Shui, Kin-Man Lam

    Abstract: Blind image restoration remains a significant challenge in low-level vision tasks. Recently, denoising diffusion models have shown remarkable performance in image synthesis. Guided diffusion models, leveraging the potent generative priors of pre-trained models along with a differential guidance loss, have achieved promising results in blind image restoration. However, these models typically consid…

    Submitted 19 November, 2024; originally announced November 2024.

    Comments: 17 pages, 6 figures, has been accepted by the ECCV 2024: AIM workshop

  17. arXiv:2411.06685  [pdf, other]

    cs.CV cs.AI eess.IV

    High-Frequency Enhanced Hybrid Neural Representation for Video Compression

    Authors: Li Yu, Zhihui Li, Jimin Xiao, Moncef Gabbouj

    Abstract: Neural Representations for Videos (NeRV) have simplified the video codec process and achieved swift decoding speeds by encoding video content into a neural network, presenting a promising solution for video compression. However, existing work overlooks the crucial issue that videos reconstructed by these methods lack high-frequency details. To address this problem, this paper introduces a High-Fre…

    Submitted 29 April, 2025; v1 submitted 10 November, 2024; originally announced November 2024.

  18. arXiv:2411.05842  [pdf, other]

    eess.SY cs.LG

    Efficient and Robust Freeway Traffic Speed Estimation under Oblique Grid using Vehicle Trajectory Data

    Authors: Yang He, Chengchuan An, Yuheng Jia, Jiachao Liu, Zhenbo Lu, Jingxin Xia

    Abstract: Accurately estimating spatiotemporal traffic states on freeways is a significant challenge due to limited sensor deployment and potential data corruption. In this study, we propose an efficient and robust low-rank model for precise spatiotemporal traffic speed state estimation (TSE) using low-penetration vehicle trajectory data. Leveraging traffic wave priors, an oblique grid-based matrix is first…

    Submitted 6 November, 2024; originally announced November 2024.

    Comments: accepted by T-ITS

  19. arXiv:2410.23325  [pdf]

    eess.AS cs.AI cs.MM cs.SD

    Transfer Learning in Vocal Education: Technical Evaluation of Limited Samples Describing Mezzo-soprano

    Authors: Zhenyi Hou, Xu Zhao, Kejie Ye, Xinyu Sheng, Shanggerile Jiang, Jiajing Xia, Yitao Zhang, Chenxi Ban, Daijun Luo, Jiaxing Chen, Yan Zou, Yuchao Feng, Guangyu Fan, Xin Yuan

    Abstract: Vocal education in the music field is difficult to quantify due to the individual differences in singers' voices and the different quantitative criteria of singing techniques. Deep learning has great potential to be applied in music education due to its efficiency to handle complex data and perform quantitative analysis. However, accurate evaluations with limited samples over rare vocal types, suc…

    Submitted 30 October, 2024; originally announced October 2024.

  20. arXiv:2410.07876  [pdf]

    eess.IV cs.CV

    FDDM: Frequency-Decomposed Diffusion Model for Rectum Cancer Dose Prediction in Radiotherapy

    Authors: Xin Liao, Zhenghao Feng, Jianghong Xiao, Xingchen Peng, Yan Wang

    Abstract: Accurate dose distribution prediction is crucial in the radiotherapy planning. Although previous methods based on convolutional neural network have shown promising performance, they have the problem of over-smoothing, leading to prediction without important high-frequency details. Recently, diffusion model has achieved great success in computer vision, which excels in generating images with more h…

    Submitted 10 October, 2024; originally announced October 2024.

  21. arXiv:2409.19627  [pdf, other]

    cs.MM cs.CR cs.SD eess.AS

    IDEAW: Robust Neural Audio Watermarking with Invertible Dual-Embedding

    Authors: Pengcheng Li, Xulong Zhang, Jing Xiao, Jianzong Wang

    Abstract: The audio watermarking technique embeds messages into audio and accurately extracts messages from the watermarked audio. Traditional methods develop algorithms based on expert experience to embed watermarks into the time-domain or transform-domain of signals. With the development of deep neural networks, deep learning-based neural audio watermarking has emerged. Compared to traditional algorithms,…

    Submitted 29 September, 2024; originally announced September 2024.

    Comments: Accepted by the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024)

    ACM Class: K.6.5; D.4.6

  22. arXiv:2408.11982  [pdf, other]

    eess.IV cs.CV cs.MM

    AIM 2024 Challenge on Compressed Video Quality Assessment: Methods and Results

    Authors: Maksim Smirnov, Aleksandr Gushchin, Anastasia Antsiferova, Dmitry Vatolin, Radu Timofte, Ziheng Jia, Zicheng Zhang, Wei Sun, Jiaying Qian, Yuqin Cao, Yinan Sun, Yuxin Zhu, Xiongkuo Min, Guangtao Zhai, Kanjar De, Qing Luo, Ao-Xiang Zhang, Peng Zhang, Haibo Lei, Linyan Jiang, Yaqing Li, Wenhui Meng, Zhenzhong Chen, Zhengxue Cheng, Jiahao Xiao, et al. (7 additional authors not shown)

    Abstract: Video quality assessment (VQA) is a crucial task in the development of video compression standards, as it directly impacts the viewer experience. This paper presents the results of the Compressed Video Quality Assessment challenge, held in conjunction with the Advances in Image Manipulation (AIM) workshop at ECCV 2024. The challenge aimed to evaluate the performance of VQA methods on a diverse dat…

    Submitted 22 October, 2024; v1 submitted 21 August, 2024; originally announced August 2024.

  23. arXiv:2408.08228  [pdf, other]

    eess.IV cs.CV

    Rethinking Medical Anomaly Detection in Brain MRI: An Image Quality Assessment Perspective

    Authors: Zixuan Pan, Jun Xia, Zheyu Yan, Guoyue Xu, Yawen Wu, Zhenge Jia, Jianxu Chen, Yiyu Shi

    Abstract: Reconstruction-based methods, particularly those leveraging autoencoders, have been widely adopted to perform anomaly detection in brain MRI. While most existing works try to improve detection accuracy by proposing new model structures or algorithms, we tackle the problem through image quality assessment, an underexplored perspective in the field. We propose a fusion quality loss function that com…

    Submitted 15 August, 2024; originally announced August 2024.

  24. arXiv:2408.07592  [pdf, other]

    eess.SP

    Multi-periodicity dependency Transformer based on spectrum offset for radio frequency fingerprint identification

    Authors: Jing Xiao, Wenrui Ding, Zeqi Shao, Duona Zhang, Yanan Ma, Yufeng Wang, Jian Wang

    Abstract: Radio Frequency Fingerprint Identification (RFFI) has emerged as a pivotal task for reliable device authentication. Despite advancements in RFFI methods, background noise and intentional modulation features result in weak energy and subtle differences in the RFF features. These challenges diminish the capability of RFFI methods in feature representation, complicating the effective identification o…

    Submitted 14 August, 2024; originally announced August 2024.

  25. arXiv:2407.05619  [pdf, other]

    cs.RO eess.SY

    AIRA: A Low-cost IR-based Approach Towards Autonomous Precision Drone Landing and NLOS Indoor Navigation

    Authors: Yanchen Liu, Minghui Zhao, Kaiyuan Hou, Junxi Xia, Charlie Carver, Stephen Xia, Xia Zhou, Xiaofan Jiang

    Abstract: Automatic drone landing is an important step for achieving fully autonomous drones. Although there are many works that leverage GPS, video, wireless signals, and active acoustic sensing to perform precise landing, autonomous drone landing remains an unsolved challenge for palm-sized microdrones that may not be able to support the high computational requirements of vision, wireless, or active audio…

    Submitted 8 July, 2024; originally announced July 2024.

  26. Deep Learning Segmentation of Ascites on Abdominal CT Scans for Automatic Volume Quantification

    Authors: Benjamin Hou, Sung-Won Lee, Jung-Min Lee, Christopher Koh, Jing Xiao, Perry J. Pickhardt, Ronald M. Summers

    Abstract: Purpose: To evaluate the performance of an automated deep learning method in detecting ascites and subsequently quantifying its volume in patients with liver cirrhosis and ovarian cancer. Materials and Methods: This retrospective study included contrast-enhanced and non-contrast abdominal-pelvic CT scans of patients with cirrhotic ascites and patients with ovarian cancer from two institutions, N…

    Submitted 22 June, 2024; originally announced June 2024.

  27. arXiv:2406.10869  [pdf, other]

    eess.IV cs.CV

    Geometric Distortion Guided Transformer for Omnidirectional Image Super-Resolution

    Authors: Cuixin Yang, Rongkang Dong, Jun Xiao, Cong Zhang, Kin-Man Lam, Fei Zhou, Guoping Qiu

    Abstract: As virtual and augmented reality applications gain popularity, omnidirectional image (ODI) super-resolution has become increasingly important. Unlike 2D plain images that are formed on a plane, ODIs are projected onto spherical surfaces. Applying established image super-resolution methods to ODIs, therefore, requires performing equirectangular projection (ERP) to map the ODIs onto a plane. ODI sup…

    Submitted 16 January, 2025; v1 submitted 16 June, 2024; originally announced June 2024.

    Comments: 13 pages, 12 figures, journal

  28. Blind Super-Resolution via Meta-learning and Markov Chain Monte Carlo Simulation

    Authors: Jingyuan Xia, Zhixiong Yang, Shengxi Li, Shuanghui Zhang, Yaowen Fu, Deniz Gündüz, Xiang Li

    Abstract: Learning-based approaches have witnessed great successes in blind single image super-resolution (SISR) tasks, however, handcrafted kernel priors and learning based kernel priors are typically required. In this paper, we propose a Meta-learning and Markov Chain Monte Carlo (MCMC) based SISR approach to learn kernel priors from organized randomness. In concrete, a lightweight network is adopted as k…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: This paper has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)

  29. arXiv:2406.08835  [pdf, other]

    cs.SD eess.AS

    EffectiveASR: A Single-Step Non-Autoregressive Mandarin Speech Recognition Architecture with High Accuracy and Inference Speed

    Authors: Ziyang Zhuang, Chenfeng Miao, Kun Zou, Ming Fang, Tao Wei, Zijian Li, Ning Cheng, Wei Hu, Shaojun Wang, Jing Xiao

    Abstract: Non-autoregressive (NAR) automatic speech recognition (ASR) models predict tokens independently and simultaneously, bringing high inference speed. However, there is still a gap in the accuracy of the NAR models compared to the autoregressive (AR) models. In this paper, we propose a single-step NAR ASR architecture with high accuracy and inference speed, called EffectiveASR. It uses an Index Mappin…

    Submitted 8 January, 2025; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted by IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2025

  30. arXiv:2405.17028  [pdf, other]

    cs.SD eess.AS

    RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis

    Authors: Haoxiang Shi, Jianzong Wang, Xulong Zhang, Ning Cheng, Jun Yu, Jing Xiao

    Abstract: Although current Text-To-Speech (TTS) models are able to generate high-quality speech samples, there are still challenges in developing emotion intensity controllable TTS. Most existing TTS models achieve emotion intensity control by extracting intensity information from reference speeches. Unfortunately, limited by the lack of modeling for intra-class emotion intensity and the model's information…

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: Accepted by the 8th APWeb-WAIM International Joint Conference on Web and Big Data

  31. arXiv:2405.01242  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    TRAMBA: A Hybrid Transformer and Mamba Architecture for Practical Audio and Bone Conduction Speech Super Resolution and Enhancement on Mobile and Wearable Platforms

    Authors: Yueyuan Sui, Minghui Zhao, Junxi Xia, Xiaofan Jiang, Stephen Xia

    Abstract: We propose TRAMBA, a hybrid transformer and Mamba architecture for acoustic and bone conduction speech enhancement, suitable for mobile and wearable platforms. Bone conduction speech enhancement has been impractical to adopt in mobile and wearable platforms for several reasons: (i) data collection is labor-intensive, resulting in scarcity; (ii) there exists a performance gap between state-of-the-art m…

    Submitted 29 May, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

  32. arXiv:2405.00930  [pdf, other]

    cs.SD eess.AS

    MAIN-VC: Lightweight Speech Representation Disentanglement for One-shot Voice Conversion

    Authors: Pengcheng Li, Jianzong Wang, Xulong Zhang, Yong Zhang, Jing Xiao, Ning Cheng

    Abstract: One-shot voice conversion aims to change the timbre of any source speech to match that of the unseen target speaker with only one speech sample. Existing methods face difficulties in satisfactory speech representation disentanglement and suffer from sizable networks as some of them leverage numerous complex modules for disentanglement. In this paper, we propose a model named MAIN-VC to effectively…

    Submitted 24 November, 2024; v1 submitted 1 May, 2024; originally announced May 2024.

    Comments: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)

    ACM Class: I.2.7; H.5.5

  33. arXiv:2405.00603  [pdf, other]

    cs.SD eess.AS

    Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation

    Authors: Yimin Deng, Jianzong Wang, Xulong Zhang, Ning Cheng, Jing Xiao

    Abstract: Voice conversion is the task to transform voice characteristics of source speech while preserving content information. Nowadays, self-supervised representation learning models are increasingly utilized in content extraction. However, in these representations, a lot of hidden speaker information leads to timbre leakage while the prosodic information of hidden units lacks use. To address these issue…

    Submitted 1 May, 2024; originally announced May 2024.

    Comments: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)

  34. arXiv:2405.00577  [pdf, other]

    cs.LG eess.SP q-bio.NC

    Discovering robust biomarkers of psychiatric disorders from resting-state functional MRI via graph neural networks: A systematic review

    Authors: Yi Hao Chan, Deepank Girish, Sukrit Gupta, Jing Xia, Chockalingam Kasi, Yinan He, Conghao Wang, Jagath C. Rajapakse

    Abstract: Graph neural networks (GNN) have emerged as a popular tool for modelling functional magnetic resonance imaging (fMRI) datasets. Many recent studies have reported significant improvements in disorder classification performance via more sophisticated GNN designs and highlighted salient features that could be potential biomarkers of the disorder. However, existing methods of evaluating their robustne…

    Submitted 1 February, 2025; v1 submitted 1 May, 2024; originally announced May 2024.

  35. arXiv:2404.19214  [pdf, other]

    cs.SD eess.AS

    EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization

    Authors: Jianzong Wang, Ziqi Liang, Xulong Zhang, Ning Cheng, Jing Xiao

    Abstract: In recent years, Transformer networks have shown remarkable performance in speech recognition tasks. However, their deployment poses challenges due to high computational and storage resource requirements. To address this issue, a lightweight model called EfficientASR is proposed in this paper, aiming to enhance the versatility of Transformer models. EfficientASR employs two primary modules: Shared…

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)

  36. arXiv:2404.19212  [pdf, other]

    cs.SD eess.AS

    EAD-VC: Enhancing Speech Auto-Disentanglement for Voice Conversion with IFUB Estimator and Joint Text-Guided Consistent Learning

    Authors: Ziqi Liang, Jianzong Wang, Xulong Zhang, Yong Zhang, Ning Cheng, Jing Xiao

    Abstract: Using unsupervised learning to disentangle speech into content, rhythm, pitch, and timbre for voice conversion has become a hot research topic. Existing works generally take into account disentangling speech components through human-crafted bottleneck features which cannot achieve sufficient information disentangling, while pitch and rhythm may still be mixed together. There is a risk of informat…

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)

  37. arXiv:2404.19187  [pdf, other]

    cs.SD eess.AS

    CONTUNER: Singing Voice Beautifying with Pitch and Expressiveness Condition

    Authors: Jianzong Wang, Pengcheng Li, Xulong Zhang, Ning Cheng, Jing Xiao

    Abstract: Singing voice beautifying is a novel task that has application value in people's daily life, aiming to correct the pitch of the singing voice and improve the expressiveness without changing the original timbre and content. Existing methods rely on paired data or only concentrate on the correction of pitch. However, professional songs and amateur songs from the same person are hard to obtain, and s…

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)

  38. arXiv:2404.15704  [pdf, other]

    cs.LG cs.AI cs.SD eess.AS

    Efficient Multi-Model Fusion with Adversarial Complementary Representation Learning

    Authors: Zuheng Kang, Yayun He, Jianzong Wang, Junqing Peng, Jing Xiao

    Abstract: Single-model systems often suffer from deficiencies in tasks such as speaker verification (SV) and image classification, relying heavily on partial prior knowledge during decision-making, resulting in suboptimal performance. Although multi-model fusion (MMF) can mitigate some of these issues, redundancy in learned representations may limit improvements. To this end, we propose an adversarial comp…

    Submitted 24 April, 2024; originally announced April 2024.

    Comments: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)

  39. arXiv:2404.15620  [pdf, other]

    eess.IV

    A Dynamic Kernel Prior Model for Unsupervised Blind Image Super-Resolution

    Authors: Zhixiong Yang, Jingyuan Xia, Shengxi Li, Xinghua Huang, Shuanghui Zhang, Zhen Liu, Yaowen Fu, Yongxiang Liu

    Abstract: Deep learning-based methods have achieved significant successes on solving the blind super-resolution (BSR) problem. However, most of them request supervised pre-training on labelled datasets. This paper proposes an unsupervised kernel estimation model, named dynamic kernel prior (DKP), to realize an unsupervised and pre-training-free learning-based algorithm for solving the BSR problem. DKP can a…

    Submitted 25 April, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

    Comments: Accepted for publication in CVPR 2024

  40. arXiv:2404.13892  [pdf, other]

    cs.SD cs.AI eess.AS

    Retrieval-Augmented Audio Deepfake Detection

    Authors: Zuheng Kang, Yayun He, Botao Zhao, Xiaoyang Qu, Junqing Peng, Jing Xiao, Jianzong Wang

    Abstract: With recent advances in speech synthesis including text-to-speech (TTS) and voice conversion (VC) systems enabling the generation of ultra-realistic audio deepfakes, there is growing concern about their potential misuse. However, most deepfake (DF) detection methods rely solely on the fuzzy knowledge learned by a single model, resulting in performance bottlenecks and transparency issues. Inspired…

    Submitted 23 April, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

    Comments: Accepted by the 2024 International Conference on Multimedia Retrieval (ICMR 2024)

  41. arXiv:2404.13786  [pdf, other]

    eess.SY cs.AI cs.DC cs.LG

    Soar: Design and Deployment of A Smart Roadside Infrastructure System for Autonomous Driving

    Authors: Shuyao Shi, Neiwen Ling, Zhehao Jiang, Xuan Huang, Yuze He, Xiaoguang Zhao, Bufang Yang, Chen Bian, Jingfei Xia, Zhenyu Yan, Raymond Yeung, Guoliang Xing

    Abstract: Recently, smart roadside infrastructure (SRI) has demonstrated the potential of achieving fully autonomous driving systems. To explore the potential of infrastructure-assisted autonomous driving, this paper presents the design and deployment of Soar, the first end-to-end SRI system specifically designed to support autonomous driving systems. Soar consists of both software and hardware components ca…

    Submitted 21 April, 2024; originally announced April 2024.

  42. arXiv:2404.03425  [pdf, other]

    eess.IV cs.AI cs.CV

    ChangeMamba: Remote Sensing Change Detection With Spatiotemporal State Space Model

    Authors: Hongruixuan Chen, Jian Song, Chengxi Han, Junshi Xia, Naoto Yokoya

    Abstract: Convolutional neural networks (CNN) and Transformers have made impressive progress in the field of remote sensing change detection (CD). However, both architectures have inherent shortcomings: CNN are constrained by a limited receptive field that may hinder their ability to capture broader spatial contexts, while Transformers are computationally intensive, making them costly to train and deploy on…

    Submitted 30 December, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

    Comments: Accepted by IEEE TGRS: https://ieeexplore.ieee.org/document/10565926

    Journal ref: IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1-20, 2024, Art no. 4409720

  43. arXiv:2403.07919  [pdf, ps, other]

    cs.IT eess.SP

    Freshness-aware Resource Allocation for Non-orthogonal Wireless-powered IoT Networks

    Authors: Yunfeng Chen, Yong Liu, Jinhao Xiao, Qunying Wu, Han Zhang, Fen Hou

    Abstract: This paper investigates a wireless-powered Internet of Things (IoT) network comprising a hybrid access point (HAP) and two devices. The HAP facilitates downlink wireless energy transfer (WET) for device charging and uplink wireless information transfer (WIT) to collect status updates from the devices. To keep the information fresh, concurrent WET and WIT are allowed, and orthogonal multiple access…

    Submitted 27 February, 2024; originally announced March 2024.

  44. arXiv:2402.04356  [pdf, other]

    cs.SD cs.CV eess.AS

    Bidirectional Autoregressive Diffusion Model for Dance Generation

    Authors: Canyu Zhang, Youbao Tang, Ning Zhang, Ruei-Sung Lin, Mei Han, Jing Xiao, Song Wang

    Abstract: Dance serves as a powerful medium for expressing human emotions, but the lifelike generation of dance is still a considerable challenge. Recently, diffusion models have showcased remarkable generative abilities across various domains. They hold promise for human motion generation due to their adaptable many-to-many nature. Nonetheless, current diffusion-based motion generation models often create…

    Submitted 22 June, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

  45. Wideband Beamforming for RIS Assisted Near-Field Communications

    Authors: Ji Wang, Jian Xiao, Yixuan Zou, Wenwu Xie, Yuanwei Liu

    Abstract: A near-field wideband beamforming scheme is investigated for reconfigurable intelligent surface (RIS) assisted multiple-input multiple-output (MIMO) systems, in which a deep learning-based end-to-end (E2E) optimization framework is proposed to maximize the system spectral efficiency. To deal with the near-field double beam split effect, the base station is equipped with frequency-dependent hybrid…

    Submitted 7 January, 2025; v1 submitted 20 January, 2024; originally announced January 2024.

    Journal ref: IEEE Transactions on Wireless Communications, 2024

  46. arXiv:2401.08166  [pdf, other]

    eess.AS cs.SD

    ED-TTS: Multi-Scale Emotion Modeling using Cross-Domain Emotion Diarization for Emotional Speech Synthesis

    Authors: Haobin Tang, Xulong Zhang, Ning Cheng, Jing Xiao, Jianzong Wang

    Abstract: Existing emotional speech synthesis methods often utilize an utterance-level style embedding extracted from reference audio, neglecting the inherent multi-scale property of speech prosody. We introduce ED-TTS, a multi-scale emotional speech synthesis model that leverages Speech Emotion Diarization (SED) and Speech Emotion Recognition (SER) to model emotions at different levels. Specifically, our p…

    Submitted 16 January, 2024; originally announced January 2024.

    Comments: Accepted by 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP2024)

  47. arXiv:2401.08096  [pdf, other]

    cs.SD eess.AS

    Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval

    Authors: Yimin Deng, Huaizhen Tang, Xulong Zhang, Ning Cheng, Jing Xiao, Jianzong Wang

    Abstract: Voice conversion refers to transferring speaker identity with well-preserved content. Better disentanglement of speech representations leads to better voice conversion. Recent studies have found that phonetic information from input audio has the potential ability to well represent content. Besides, speaker-style modeling with pre-trained models makes the process more complex. To tackle these…

    Submitted 17 January, 2024; v1 submitted 15 January, 2024; originally announced January 2024.

    Comments: Accepted by 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP2024)

  48. arXiv:2401.08049  [pdf, other]

    cs.CV cs.SD eess.AS

    EmoTalker: Emotionally Editable Talking Face Generation via Diffusion Model

    Authors: Bingyuan Zhang, Xulong Zhang, Ning Cheng, Jun Yu, Jing Xiao, Jianzong Wang

    Abstract: In recent years, the field of talking face generation has attracted considerable attention, with certain methods adept at generating virtual faces that convincingly imitate human expressions. However, existing methods face challenges related to limited generalization, particularly when dealing with challenging identities. Furthermore, methods for editing expressions are often confined to a singul…

    Submitted 15 January, 2024; originally announced January 2024.

    Comments: Accepted by 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP2024)

  49. arXiv:2311.08670  [pdf, other]

    cs.SD eess.AS

    CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation

    Authors: Yimin Deng, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

    Abstract: Better disentanglement of speech representation is essential to improve the quality of voice conversion. Recently, contrastive learning has been applied successfully to voice conversion based on speaker labels. However, model performance degrades in conversion between similar speakers. Hence, we propose an augmented negative sample selection to address the issue. Specifically, we create hard ne…

    Submitted 14 November, 2023; originally announced November 2023.

    Comments: Accepted by the 21st IEEE International Symposium on Parallel and Distributed Processing with Applications (IEEE ISPA 2023)

  50. arXiv:2311.07965  [pdf, other]

    cs.SD eess.AS

    DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation

    Authors: Jianzong Wang, Pengcheng Li, Xulong Zhang, Ning Cheng, Jing Xiao

    Abstract: Most existing neural-based text-to-speech methods rely on extensive datasets and face challenges under low-resource conditions. In this paper, we introduce a novel semi-supervised text-to-speech synthesis model that learns from both paired and unpaired data to address this challenge. The key component of the proposed model is a dynamic quantized representation module, which is integrated into a seq…

    Submitted 2 February, 2024; v1 submitted 14 November, 2023; originally announced November 2023.

    Comments: Accepted by the 13th IEEE International Conference on Big Data and Cloud Computing (IEEE BDCloud 2023)