Search | arXiv e-print repository

doi 10.1145/3731715.3733471

Bridging the Gap Between Semantic and User Preference Spaces for Multi-modal Music Representation Learning

Authors: Xiaofeng Pan, Jing Chen, Haitong Zhang, Menglin Xing, Jiayi Wei, Xuefeng Mu, Zhongqian Xie

Abstract: Recent works of music representation learning mainly focus on learning acoustic music representations with unlabeled audios or further attempt to acquire multi-modal music representations with scarce annotated audio-text pairs. They either ignore the language semantics or rely on labeled audio datasets that are difficult and expensive to create. Moreover, merely modeling semantic space usually fai… ▽ More Recent works of music representation learning mainly focus on learning acoustic music representations with unlabeled audios or further attempt to acquire multi-modal music representations with scarce annotated audio-text pairs. They either ignore the language semantics or rely on labeled audio datasets that are difficult and expensive to create. Moreover, merely modeling semantic space usually fails to achieve satisfactory performance on music recommendation tasks since the user preference space is ignored. In this paper, we propose a novel Hierarchical Two-stage Contrastive Learning (HTCL) method that models similarity from the semantic perspective to the user perspective hierarchically to learn a comprehensive music representation bridging the gap between semantic and user preference spaces. We devise a scalable audio encoder and leverage a pre-trained BERT model as the text encoder to learn audio-text semantics via large-scale contrastive pre-training. Further, we explore a simple yet effective way to exploit interaction data from our online music platform to adapt the semantic space to user preference space via contrastive fine-tuning, which differs from previous works that follow the idea of collaborative filtering. As a result, we obtain a powerful audio encoder that not only distills language semantics from the text encoder but also models similarity in user preference space with the integrity of semantic space preserved. Experimental results on both music semantic and recommendation tasks confirm the effectiveness of our method. △ Less

Submitted 29 May, 2025; originally announced May 2025.

Comments: ICMR 2025

arXiv:2504.13394 [pdf, other]

A Deep Learning-Based Supervised Transfer Learning Framework for DOA Estimation with Array Imperfections

Authors: Bo Zhou, Kaijie Xu, Yinghui Quan, Mengdao Xing

Abstract: In practical scenarios, processes such as sensor design, manufacturing, and installation will introduce certain errors. Furthermore, mutual interference occurs when the sensors receive signals. These defects in array systems are referred to as array imperfections, which can significantly degrade the performance of Direction of Arrival (DOA) estimation. In this study, we propose a deep-learning bas… ▽ More In practical scenarios, processes such as sensor design, manufacturing, and installation will introduce certain errors. Furthermore, mutual interference occurs when the sensors receive signals. These defects in array systems are referred to as array imperfections, which can significantly degrade the performance of Direction of Arrival (DOA) estimation. In this study, we propose a deep-learning based transfer learning approach, which effectively mitigates the degradation of deep-learning based DOA estimation performance caused by array imperfections. In the proposed approach, we highlight three major contributions. First, we propose a Vision Transformer (ViT) based method for DOA estimation, which achieves excellent performance in scenarios with low signal-to-noise ratios (SNR) and limited snapshots. Second, we introduce a transfer learning framework that extends deep learning models from ideal simulation scenarios to complex real-world scenarios with array imperfections. By leveraging prior knowledge from ideal simulation data, the proposed transfer learning framework significantly improves deep learning-based DOA estimation performance in the presence of array imperfections, without the need for extensive real-world data. Finally, we incorporate visualization and evaluation metrics to assess the performance of DOA estimation algorithms, which allow for a more thorough evaluation of algorithms and further validate the proposed method. Our code can be accessed at https://github.com/zzb-nice/DOA_est_Master. △ Less

Submitted 17 April, 2025; originally announced April 2025.

arXiv:2412.08237 [pdf, other]

TouchTTS: An Embarrassingly Simple TTS Framework that Everyone Can Touch

Authors: Xingchen Song, Mengtao Xing, Changwei Ma, Shengqiang Li, Di Wu, Binbin Zhang, Fuping Pan, Dinghao Zhou, Yuekai Zhang, Shun Lei, Zhendong Peng, Zhiyong Wu

Abstract: It is well known that LLM-based systems are data-hungry. Recent LLM-based TTS works typically employ complex data processing pipelines to obtain high-quality training data. These sophisticated pipelines require excellent models at each stage (e.g., speech denoising, speech enhancement, speaker diarization, and punctuation models), which themselves demand high-quality training data and are rarely o… ▽ More It is well known that LLM-based systems are data-hungry. Recent LLM-based TTS works typically employ complex data processing pipelines to obtain high-quality training data. These sophisticated pipelines require excellent models at each stage (e.g., speech denoising, speech enhancement, speaker diarization, and punctuation models), which themselves demand high-quality training data and are rarely open-sourced. Even with state-of-the-art models, issues persist, such as incomplete background noise removal and misalignment between punctuation and actual speech pauses. Moreover, the stringent filtering strategies often retain only 10-30\% of the original data, significantly impeding data scaling efforts. In this work, we leverage a noise-robust audio tokenizer (S3Tokenizer) to design a simplified yet effective TTS data processing pipeline that maintains data quality while substantially reducing data acquisition costs, achieving a data retention rate of over 50\%. Beyond data scaling challenges, LLM-based TTS systems also incur higher deployment costs compared to conventional approaches. Current systems typically use LLMs solely for text-to-token generation, while requiring separate models (e.g., flow matching models) for token-to-waveform generation, which cannot be directly executed by LLM inference engines, further complicating deployment. To address these challenges, we eliminate redundant modules in both LLM and flow components, replacing the flow model backbone with an LLM architecture. Building upon this simplified flow backbone, we propose a unified architecture for both streaming and non-streaming inference, significantly reducing deployment costs. Finally, we explore the feasibility of unifying TTS and ASR tasks using the same data for training, thanks to the simplified pipeline and the S3Tokenizer that reduces the quality requirements for TTS training data. △ Less

Submitted 12 December, 2024; v1 submitted 11 December, 2024; originally announced December 2024.

Comments: Technical Report

arXiv:2111.08387 [pdf, other]

S-DCCRN: Super Wide Band DCCRN with learnable complex feature for speech enhancement

Authors: Shubo Lv, Yihui Fu, Mengtao Xing, Jiayao Sun, Lei Xie, Jun Huang, Yannan Wang, Tao Yu

Abstract: In speech enhancement, complex neural network has shown promising performance due to their effectiveness in processing complex-valued spectrum. Most of the recent speech enhancement approaches mainly focus on wide-band signal with a sampling rate of 16K Hz. However, research on super wide band (e.g., 32K Hz) or even full-band (48K) denoising is still lacked due to the difficulty of modeling more f… ▽ More In speech enhancement, complex neural network has shown promising performance due to their effectiveness in processing complex-valued spectrum. Most of the recent speech enhancement approaches mainly focus on wide-band signal with a sampling rate of 16K Hz. However, research on super wide band (e.g., 32K Hz) or even full-band (48K) denoising is still lacked due to the difficulty of modeling more frequency bands and particularly high frequency components. In this paper, we extend our previous deep complex convolution recurrent neural network (DCCRN) substantially to a super wide band version -- S-DCCRN, to perform speech denoising on speech of 32K Hz sampling rate. We first employ a cascaded sub-band and full-band processing module, which consists of two small-footprint DCCRNs -- one operates on sub-band signal and one operates on full-band signal, aiming at benefiting from both local and global frequency information. Moreover, instead of simply adopting the STFT feature as input, we use a complex feature encoder trained in an end-to-end manner to refine the information of different frequency bands. We also use a complex feature decoder to revert the feature to time-frequency domain. Finally, a learnable spectrum compression method is adopted to adjust the energy of different frequency bands, which is beneficial for neural network learning. The proposed model, S-DCCRN, has surpassed PercepNet as well as several competitive models and achieves state-of-the-art performance in terms of speech quality and intelligibility. Ablation studies also demonstrate the effectiveness of different contributions. △ Less

Submitted 16 November, 2021; originally announced November 2021.

Comments: Submitted to ICASSP 2022

arXiv:2107.06170 [pdf]

Robust Blind Source Separation by Soft Decision-Directed Non-Unitary Joint Diagonalization

Authors: Wenjuan Liu, Dazheng Feng, Bingnan Pei, Mengdao Xing, Xinhong Meng, Qianru Wei

Abstract: Approximate joint diagonalization of a set of matrices provides a powerful framework for numerous statistical signal processing applications. For non-unitary joint diagonalization (NUJD) based on the least-squares (LS) criterion, outliers, also referred to as anomaly or discordant observations, have a negative influence on the performance, since squaring the residuals magnifies the effects of them… ▽ More Approximate joint diagonalization of a set of matrices provides a powerful framework for numerous statistical signal processing applications. For non-unitary joint diagonalization (NUJD) based on the least-squares (LS) criterion, outliers, also referred to as anomaly or discordant observations, have a negative influence on the performance, since squaring the residuals magnifies the effects of them. To solve this problem, we propose a novel cost function that incorporates the soft decision-directed scheme into the least-squares algorithm and develops an efficient algorithm. The influence of the outliers is mitigated by applying decision-directed weights which are associated with the residual error at each iterative step. Specifically, the mixing matrix is estimated by a modified stationary point method, in which the updating direction is determined based on the linear approximation to the gradient function. Simulation results demonstrate that the proposed algorithm outperforms conventional non-unitary diagonalization algorithms in terms of both convergence performance and robustness to outliers. △ Less

Submitted 28 June, 2021; originally announced July 2021.

Comments: 19 pages, 9 figures

arXiv:2011.02131 [pdf, other]

DESNet: A Multi-channel Network for Simultaneous Speech Dereverberation, Enhancement and Separation

Authors: Yihui Fu, Jian Wu, Yanxin Hu, Mengtao Xing, Lei Xie

Abstract: In this paper, we propose a multi-channel network for simultaneous speech dereverberation, enhancement and separation (DESNet). To enable gradient propagation and joint optimization, we adopt the attentional selection mechanism of the multi-channel features, which is originally proposed in end-to-end unmixing, fixed-beamforming and extraction (E2E-UFE) structure. Furthermore, the novel deep comple… ▽ More In this paper, we propose a multi-channel network for simultaneous speech dereverberation, enhancement and separation (DESNet). To enable gradient propagation and joint optimization, we adopt the attentional selection mechanism of the multi-channel features, which is originally proposed in end-to-end unmixing, fixed-beamforming and extraction (E2E-UFE) structure. Furthermore, the novel deep complex convolutional recurrent network (DCCRN) is used as the structure of the speech unmixing and the neural network based weighted prediction error (WPE) is cascaded beforehand for speech dereverberation. We also introduce the staged SNR strategy and symphonic loss for the training of the network to further improve the final performance. Experiments show that in non-dereverberated case, the proposed DESNet outperforms DCCRN and most state-of-the-art structures in speech enhancement and separation, while in dereverberated scenario, DESNet also shows improvements over the cascaded WPE-DCCRN networks. △ Less

Submitted 14 November, 2020; v1 submitted 4 November, 2020; originally announced November 2020.

Comments: Accepted at IEEE SLT 2021

arXiv:2008.00264 [pdf, other]

DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement

Authors: Yanxin Hu, Yun Liu, Shubo Lv, Mengtao Xing, Shimin Zhang, Yihui Fu, Jian Wu, Bihong Zhang, Lei Xie

Abstract: Speech enhancement has benefited from the success of deep learning in terms of intelligibility and perceptual quality. Conventional time-frequency (TF) domain methods focus on predicting TF-masks or speech spectrum, via a naive convolution neural network (CNN) or recurrent neural network (RNN). Some recent studies use complex-valued spectrogram as a training target but train in a real-valued netwo… ▽ More Speech enhancement has benefited from the success of deep learning in terms of intelligibility and perceptual quality. Conventional time-frequency (TF) domain methods focus on predicting TF-masks or speech spectrum, via a naive convolution neural network (CNN) or recurrent neural network (RNN). Some recent studies use complex-valued spectrogram as a training target but train in a real-valued network, predicting the magnitude and phase component or real and imaginary part, respectively. Particularly, convolution recurrent network (CRN) integrates a convolutional encoder-decoder (CED) structure and long short-term memory (LSTM), which has been proven to be helpful for complex targets. In order to train the complex target more effectively, in this paper, we design a new network structure simulating the complex-valued operation, called Deep Complex Convolution Recurrent Network (DCCRN), where both CNN and RNN structures can handle complex-valued operation. The proposed DCCRN models are very competitive over other previous networks, either on objective or subjective metric. With only 3.7M parameters, our DCCRN models submitted to the Interspeech 2020 Deep Noise Suppression (DNS) challenge ranked first for the real-time-track and second for the non-real-time track in terms of Mean Opinion Score (MOS). △ Less

Submitted 22 September, 2020; v1 submitted 1 August, 2020; originally announced August 2020.

Comments: Accepted by Interspeech 2020

arXiv:2004.03379 [pdf]

Granular Computing: An Augmented Scheme of Degranulation Through a Modified Partition Matrix

Authors: Kaijie Xu, Witold Pedrycz, Zhiwu Li, Mengdao Xing

Abstract: As an important technology in artificial intelligence Granular Computing (GrC) has emerged as a new multi-disciplinary paradigm and received much attention in recent years. Information granules forming an abstract and efficient characterization of large volumes of numeric data have been considered as the fundamental constructs of GrC. By generating prototypes and partition matrix, fuzzy clustering… ▽ More As an important technology in artificial intelligence Granular Computing (GrC) has emerged as a new multi-disciplinary paradigm and received much attention in recent years. Information granules forming an abstract and efficient characterization of large volumes of numeric data have been considered as the fundamental constructs of GrC. By generating prototypes and partition matrix, fuzzy clustering is a commonly encountered way of information granulation. Degranulation involves data reconstruction completed on a basis of the granular representatives. Previous studies have shown that there is a relationship between the reconstruction error and the performance of the granulation process. Typically, the lower the degranulation error is, the better performance of granulation is. However, the existing methods of degranulation usually cannot restore the original numeric data, which is one of the important reasons behind the occurrence of the reconstruction error. To enhance the quality of degranulation, in this study, we develop an augmented scheme through modifying the partition matrix. By proposing the augmented scheme, we dwell on a novel collection of granulation-degranulation mechanisms. In the constructed approach, the prototypes can be expressed as the product of the dataset matrix and the partition matrix. Then, in the degranulation process, the reconstructed numeric data can be decomposed into the product of the partition matrix and the matrix of prototypes. Both the granulation and degranulation are regarded as generalized rotation between the data subspace and the prototype subspace with the partition matrix and the fuzzification factor. By modifying the partition matrix, the new partition matrix is constructed through a series of matrix operations. We offer a thorough analysis of the developed scheme. The experimental results are in agreement with the underlying conceptual framework △ Less

Submitted 2 April, 2020; originally announced April 2020.

Report number: No. CYB-E-2018-06-1082

arXiv:2001.00769 [pdf]

doi 10.1109/MGRS.2019.2955120

InSAR Phase Denoising: A Review of Current Technologies and Future Directions

Authors: Gang Xu, Yandong Gao, Jinwei Li, Mengdao Xing

Abstract: Nowadays, interferometric synthetic aperture radar (InSAR) has been a powerful tool in remote sensing by enhancing the information acquisition. During the InSAR processing, phase denoising of interferogram is a mandatory step for topography mapping and deformation monitoring. Over the last three decades, a large number of effective algorithms have been developed to do efforts on this topic. In thi… ▽ More Nowadays, interferometric synthetic aperture radar (InSAR) has been a powerful tool in remote sensing by enhancing the information acquisition. During the InSAR processing, phase denoising of interferogram is a mandatory step for topography mapping and deformation monitoring. Over the last three decades, a large number of effective algorithms have been developed to do efforts on this topic. In this paper, we give a comprehensive overview of InSAR phase denoising methods, classifying the established and emerging algorithms into four main categories. The first two parts refer to the categories of traditional local filters and transformed-domain filters, respectively. The third part focuses on the category of nonlocal (NL) filters, considering their outstanding performances. Latter, some advanced methods based on new concept of signal processing are also introduced to show their potentials in this field. Moreover, several popular phase denoising methods are illustrated and compared by performing the numerical experiments using both simulated and measured data. The purpose of this paper is intended to provide necessary guideline and inspiration to related researchers by promoting the architecture development of InSAR signal processing. △ Less

Submitted 19 December, 2020; v1 submitted 3 January, 2020; originally announced January 2020.

Comments: G. Xu, Y. Gao, J. Li and M. Xing, "InSAR Phase Denoising: A Review of Current Technologies and Future Directions," IEEE Geoscience and Remote Sensing Magazine, vol. 8, no. 2, pp. 64-82, June 2020. DOI 10.1109/MGRS.2019.2955120

Showing 1–9 of 9 results for author: Xing, M