-
A Survey on Cross-Modal Interaction Between Music and Multimodal Data
Authors:
Sifei Li,
Mining Tan,
Feier Shen,
Minyan Luo,
Zijiao Yin,
Fan Tang,
Weiming Dong,
Changsheng Xu
Abstract:
Multimodal learning has driven innovation across various industries, particularly in the field of music. By enabling more intuitive interaction experiences and enhancing immersion, it not only lowers the entry barriers to music but also increases its overall appeal. This survey aims to provide a comprehensive review of multimodal tasks related to music, outlining how music contributes to multimodal learning and offering insights for researchers seeking to expand the boundaries of computational music. Unlike text and images, which are often semantically or visually intuitive, music primarily interacts with humans through auditory perception, making its data representation inherently less intuitive. Therefore, this paper first introduces the representations of music and provides an overview of music datasets. Subsequently, we categorize cross-modal interactions between music and multimodal data into three types: music-driven cross-modal interactions, music-oriented cross-modal interactions, and bidirectional music cross-modal interactions. For each category, we systematically trace the development of relevant sub-tasks, analyze existing limitations, and discuss emerging trends. Furthermore, we provide a comprehensive summary of datasets and evaluation metrics used in multimodal tasks related to music, offering benchmark references for future research. Finally, we discuss the current challenges in cross-modal interactions involving music and propose potential directions for future research.
Submitted 17 April, 2025;
originally announced April 2025.
-
FireRedTTS-1S: An Upgraded Streamable Foundation Text-to-Speech System
Authors:
Hao-Han Guo,
Yao Hu,
Fei-Yu Shen,
Xu Tang,
Yi-Chen Wu,
Feng-Long Xie,
Kun Xie
Abstract:
In this work, we upgrade FireRedTTS to a new version, FireRedTTS-1S, a high-quality streaming foundation text-to-speech system. FireRedTTS-1S achieves streaming speech generation via two steps: text-to-semantic decoding and semantic-to-acoustic decoding. In text-to-semantic decoding, a semantic-aware speech tokenizer converts the speech signal into semantic tokens, which can be synthesized from the text via a language model in an auto-regressive manner. Meanwhile, the semantic-to-acoustic decoding module simultaneously translates generated semantic tokens into the speech signal in a streaming way. We implement two approaches to achieve this module: 1) a chunk-wise streamable flow-matching approach, and 2) a multi-stream language model-based approach. They both present high-quality and streamable speech generation but differ in real-time factor (RTF) and latency. Specifically, flow-matching decoding can generate speech by chunks, presenting a lower RTF of 0.1 but a higher latency of 300ms. Instead, the multi-stream language model generates speech by frames in an autoregressive manner, presenting a higher RTF of 0.3 but a low latency of 150ms. In experiments on zero-shot voice cloning, the objective results validate FireRedTTS-1S as a high-quality foundation model with comparable intelligibility and speaker similarity over industrial baseline systems. Furthermore, the subjective score of FireRedTTS-1S highlights its impressive synthesis performance, achieving comparable quality to the ground-truth recordings. These results validate FireRedTTS-1S as a high-quality streaming foundation TTS system.
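The real-time factor (RTF) quoted in this abstract is conventionally the ratio of synthesis time to the duration of the audio produced, so an RTF below 1 means speech is generated faster than it plays back. The following is a minimal sketch of that arithmetic only; the function names and the example timings are illustrative assumptions, not part of FireRedTTS-1S. Latency (the delay before the first audio is available) is a separate quantity and is not derived from RTF here.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def is_faster_than_real_time(rtf: float) -> bool:
    """An RTF below 1.0 means audio is produced faster than it plays back."""
    return rtf < 1.0


# Hypothetical timings: 1 second of compute for 10 seconds of speech.
rtf_example = real_time_factor(1.0, 10.0)
```

Under these hypothetical timings the ratio is 0.1, the same order as the flow-matching figure reported in the abstract.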
Submitted 26 May, 2025; v1 submitted 26 March, 2025;
originally announced March 2025.
-
PrimeK-Net: Multi-scale Spectral Learning via Group Prime-Kernel Convolutional Neural Networks for Single Channel Speech Enhancement
Authors:
Zizhen Lin,
Junyu Wang,
Ruili Li,
Fei Shen,
Xi Xuan
Abstract:
Single-channel speech enhancement is a challenging ill-posed problem focused on estimating clean speech from degraded signals. Existing studies have demonstrated the competitive performance of combining convolutional neural networks (CNNs) with Transformers in speech enhancement tasks. However, existing frameworks have not sufficiently addressed computational efficiency and have overlooked the natural multi-scale distribution of the spectrum. Additionally, the potential of CNNs in speech enhancement has yet to be fully realized. To address these issues, this study proposes a Deep Separable Dilated Dense Block (DSDDB) and a Group Prime Kernel Feedforward Channel Attention (GPFCA) module. Specifically, the DSDDB introduces higher parameter and computational efficiency to the Encoder/Decoder of existing frameworks. The GPFCA module replaces the position of the Conformer, extracting deep temporal and frequency features of the spectrum with linear complexity. The GPFCA leverages the proposed Group Prime Kernel Feedforward Network (GPFN) to integrate multi-granularity long-range, medium-range, and short-range receptive fields, while utilizing the properties of prime numbers to avoid periodic overlap effects. Experimental results demonstrate that PrimeK-Net, proposed in this study, achieves state-of-the-art (SOTA) performance on the VoiceBank+Demand dataset, reaching a PESQ score of 3.61 with only 1.41M parameters.
Submitted 27 February, 2025;
originally announced February 2025.
-
Deep Learning-Powered Electrical Brain Signals Analysis: Advancing Neurological Diagnostics
Authors:
Jiahe Li,
Xin Chen,
Fanqi Shen,
Junru Chen,
Yuxin Liu,
Daoze Zhang,
Zhizhang Yuan,
Fang Zhao,
Meng Li,
Yang Yang
Abstract:
Neurological disorders represent significant global health challenges, driving the advancement of brain signal analysis methods. Scalp electroencephalography (EEG) and intracranial electroencephalography (iEEG) are widely used to diagnose and monitor neurological conditions. However, dataset heterogeneity and task variations pose challenges in developing robust deep learning solutions. This review systematically examines recent advances in deep learning approaches for EEG/iEEG-based neurological diagnostics, focusing on applications across 7 neurological conditions using 46 datasets. We explore trends in data utilization, model design, and task-specific adaptations, highlighting the importance of pre-trained multi-task models for scalable, generalizable solutions. To advance research, we propose a standardized benchmark for evaluating models across diverse datasets to enhance reproducibility. This survey emphasizes how recent innovations can transform neurological diagnostics and enable the development of intelligent, adaptable healthcare solutions.
Submitted 24 February, 2025;
originally announced February 2025.
-
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Authors:
Ailin Huang,
Boyong Wu,
Bruce Wang,
Chao Yan,
Chen Hu,
Chengli Feng,
Fei Tian,
Feiyu Shen,
Jingbei Li,
Mingrui Chen,
Peng Liu,
Ruihang Miao,
Wang You,
Xi Chen,
Xuerui Yang,
Yechang Huang,
Yuxiang Zhang,
Zheng Gong,
Zixin Zhang,
Hongyu Zhou,
Jianjian Sun,
Brian Li,
Chengting Feng,
Changyi Wan,
Hanpeng Hu
, et al. (120 additional authors not shown)
Abstract:
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.
Submitted 18 February, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications
Authors:
Hao-Han Guo,
Yao Hu,
Kun Liu,
Fei-Yu Shen,
Xu Tang,
Yi-Chen Wu,
Feng-Long Xie,
Kun Xie,
Kai-Tuo Xu
Abstract:
This work proposes FireRedTTS, a foundation text-to-speech framework, to meet the growing demands for personalized and diverse generative speech applications. The framework comprises three parts: data processing, foundation system, and downstream applications. First, we comprehensively present our data processing pipeline, which transforms massive raw audio into a large-scale high-quality TTS dataset with rich annotations and a wide coverage of content, speaking style, and timbre. Then, we propose a language-model-based foundation TTS system. The speech signal is compressed into discrete semantic tokens via a semantic-aware speech tokenizer, and can be generated by a language model from the prompt text and audio. Then, a two-stage waveform generator is proposed to decode them to the high-fidelity waveform. We present two applications of this system: voice cloning for dubbing and human-like speech generation for chatbots. The experimental results demonstrate the solid in-context learning capability of FireRedTTS, which can stably synthesize high-quality speech consistent with the prompt text and audio. For dubbing, FireRedTTS can clone target voices in a zero-shot way for the UGC scenario and adapt to studio-level expressive voice characters in the PUGC scenario via few-shot fine-tuning with 1-hour recording. Moreover, FireRedTTS achieves controllable human-like speech generation in a casual style with paralinguistic behaviors and emotions via instruction tuning, to better serve spoken chatbots.
Submitted 11 April, 2025; v1 submitted 5 September, 2024;
originally announced September 2024.
-
Lesion-aware network for diabetic retinopathy diagnosis
Authors:
Xue Xia,
Kun Zhan,
Yuming Fang,
Wenhui Jiang,
Fei Shen
Abstract:
Deep learning brought boosts to auto diabetic retinopathy (DR) diagnosis, thus, greatly helping ophthalmologists for early disease detection, which contributes to preventing disease deterioration that may eventually lead to blindness. It has been proved that convolutional neural network (CNN)-aided lesion identifying or segmentation benefits auto DR screening. The key to fine-grained lesion tasks mainly lies in: (1) extracting features being both sensitive to tiny lesions and robust against DR-irrelevant interference, and (2) exploiting and re-using encoded information to restore lesion locations under extremely imbalanced data distribution. To this end, we propose a CNN-based DR diagnosis network with attention mechanism involved, termed lesion-aware network, to better capture lesion information from imbalanced data. Specifically, we design the lesion-aware module (LAM) to capture noise-like lesion areas across deeper layers, and the feature-preserve module (FPM) to assist shallow-to-deep feature fusion. Afterward, the proposed lesion-aware network (LANet) is constructed by embedding the LAM and FPM into the CNN decoders for DR-related information utilization. The proposed LANet is then further extended to a DR screening network by adding a classification layer. Through experiments on three public fundus datasets with pixel-level annotations, our method outperforms the mainstream methods with an area under curve of 0.967 in DR screening, and increases the overall average precision by 7.6%, 2.1%, and 1.2% in lesion segmentation on three datasets. Besides, the ablation study validates the effectiveness of the proposed sub-modules.
Submitted 13 August, 2024;
originally announced August 2024.
-
On the Effectiveness of Acoustic BPE in Decoder-Only TTS
Authors:
Bohan Li,
Feiyu Shen,
Yiwei Guo,
Shuai Wang,
Xie Chen,
Kai Yu
Abstract:
Discretizing speech into tokens and generating them by a decoder-only model have been a promising direction for text-to-speech (TTS) and spoken language modeling (SLM). To shorten the sequence length of speech tokens, acoustic byte-pair encoding (BPE) has emerged in SLM that treats speech tokens from self-supervised semantic representations as characters to further compress the token sequence. But the gain in TTS has not been fully investigated, and the proper choice of acoustic BPE remains unclear. In this work, we conduct a comprehensive study on various settings of acoustic BPE to explore its effectiveness in decoder-only TTS models with semantic speech tokens. Experiments on LibriTTS verify that acoustic BPE uniformly increases the intelligibility and diversity of synthesized speech, while showing different features across BPE settings. Hence, acoustic BPE is a favorable tool for decoder-only TTS.
Submitted 4 July, 2024;
originally announced July 2024.
-
Low-Complexity Estimation Algorithm and Decoupling Scheme for FRaC System
Authors:
Mengjiang Sun,
Peng Chen,
Zhenxin Cao,
Fei Shen
Abstract:
With the leaping advances in autonomous vehicles and transportation infrastructure, dual function radar-communication (DFRC) systems have become attractive due to their size, cost and resource efficiency. A frequency modulated continuous waveform (FMCW)-based radar-communication system (FRaC) utilizing both sparse multiple-input and multiple-output (MIMO) arrays and index modulation (IM) has been proposed to form a DFRC system specifically designed for vehicular applications. In this paper, the three-dimensional (3D) parameter estimation problem in the FRaC is considered. Since the 3D-parameters including range, direction of arrival (DOA) and velocity are coupled in the estimating matrix of the FRaC system, the existing estimation algorithms cannot estimate the 3D-parameters accurately. Hence, a novel decomposed decoupled atomic norm minimization (DANM) method is proposed by splitting the 3D-parameter estimating matrix into multiple 2D matrices with sparsity constraints. Then, the 3D-parameters are estimated efficiently and separately with the optimized decoupled estimating matrix. Moreover, the Cramér-Rao lower bound (CRLB) of the 3D-parameter estimation is derived, and the computational complexity of the proposed algorithm is analyzed. Simulation results show that the proposed decomposed DANM method exploits the advantage of the virtual aperture in the presence of coupling caused by IM and sparse MIMO array and outperforms the co-estimation algorithm with lower computation complexity.
Submitted 27 March, 2024;
originally announced March 2024.
-
BrainWave: A Brain Signal Foundation Model for Clinical Applications
Authors:
Zhizhang Yuan,
Fanqi Shen,
Meng Li,
Yuguo Yu,
Chenhao Tan,
Yang Yang
Abstract:
Neural electrical activity is fundamental to brain function, underlying a range of cognitive and behavioral processes, including movement, perception, decision-making, and consciousness. Abnormal patterns of neural signaling often indicate the presence of underlying brain diseases. The variability among individuals, the diverse array of clinical symptoms from various brain disorders, and the limited availability of diagnostic classifications have posed significant barriers to formulating a reliable model of neural signals for diverse application contexts. Here, we present BrainWave, the first foundation model for both invasive and non-invasive neural recordings, pretrained on more than 40,000 hours of electrical brain recordings (13.79 TB of data) from approximately 16,000 individuals. Our analysis shows that BrainWave outperforms all other competing models and consistently achieves state-of-the-art performance in the diagnosis and identification of neurological disorders. We also demonstrate robust capabilities of BrainWave in enabling zero-shot transfer learning across varying recording conditions and brain diseases, as well as few-shot classification without fine-tuning, suggesting that BrainWave learns highly generalizable representations of neural signals. We hence believe that open-sourcing BrainWave will facilitate a wide range of clinical applications in medicine, paving the way for AI-driven approaches to investigate brain disorders and advance neuroscience research.
Submitted 19 September, 2024; v1 submitted 15 February, 2024;
originally announced February 2024.
-
A Simple Geometric-Aware Indoor Positioning Interpolation Algorithm Based on Manifold Learning
Authors:
Suorong Yang,
Geng Zhang,
Jian Zhao,
Furao Shen
Abstract:
Interpolation methodologies have been widely used within the domain of indoor positioning systems. However, existing indoor positioning interpolation algorithms exhibit several inherent limitations, including reliance on complex mathematical models, limited flexibility, and relatively low precision. To enhance the accuracy and efficiency of indoor positioning interpolation techniques, this paper proposes a simple yet powerful geometric-aware interpolation algorithm for indoor positioning tasks. The key to our algorithm is to exploit the geometric attributes of the local topological manifold using manifold learning principles. Therefore, instead of constructing complicated mathematical models, the proposed algorithm facilitates the more precise and efficient estimation of points grounded in the local topological manifold. Moreover, our proposed method can be effortlessly integrated into any indoor positioning system, thereby bolstering its adaptability. Through a systematic array of experiments and comprehensive performance analyses conducted on both simulated and real-world datasets, we demonstrate that the proposed algorithm consistently outperforms the most commonly used and representative interpolation approaches regarding interpolation accuracy and efficiency. Furthermore, the experimental results also underscore the substantial practical utility of our method and its potential applicability in real-time indoor positioning scenarios.
Submitted 27 November, 2023;
originally announced November 2023.
-
Acoustic BPE for Speech Generation with Discrete Tokens
Authors:
Feiyu Shen,
Yiwei Guo,
Chenpeng Du,
Xie Chen,
Kai Yu
Abstract:
Discrete audio tokens derived from self-supervised learning models have gained widespread usage in speech generation. However, current practice of directly utilizing audio tokens poses challenges for sequence modeling due to the length of the token sequence. Additionally, this approach places the burden on the model to establish correlations between tokens, further complicating the modeling process. To address this issue, we propose acoustic BPE which encodes frequent audio token patterns by utilizing byte-pair encoding. Acoustic BPE effectively reduces the sequence length and leverages the prior morphological information present in token sequence, which alleviates the modeling challenges of token correlation. Through comprehensive investigations on a speech language model trained with acoustic BPE, we confirm the notable advantages it offers, including faster inference and improved syntax capturing capabilities. In addition, we propose a novel rescore method to select the optimal synthetic speech among multiple candidates generated by rich-diversity TTS system. Experiments prove that rescore selection aligns closely with human preference, which highlights acoustic BPE's potential to other speech generation tasks.
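The core idea in this abstract, applying byte-pair encoding to sequences of discrete audio tokens, can be illustrated with a generic BPE merge loop over integer token IDs. This is a minimal sketch of textbook BPE, not the authors' implementation; the function names `apply_merge` and `learn_bpe`, the `first_new_id` parameter, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter


def apply_merge(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(sequences, num_merges, first_new_id):
    """Greedily merge the most frequent adjacent token pair, up to num_merges times.

    Returns the learned merges as ((left, right), new_id) tuples and the
    re-encoded (shorter) sequences.
    """
    seqs = [list(s) for s in sequences]
    merges = []
    next_id = first_new_id
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in seqs:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best_pair, count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so merging would not shorten anything
        merges.append((best_pair, next_id))
        seqs = [apply_merge(seq, best_pair, next_id) for seq in seqs]
        next_id += 1
    return merges, seqs


# Toy audio-token sequences (hypothetical IDs); new merged units start at 100.
merges, encoded = learn_bpe([[1, 2, 1, 2, 3], [1, 2, 3]], num_merges=2, first_new_id=100)
```

On this toy input the two sequences shrink from lengths 5 and 3 to lengths 2 and 1, which is the sequence-length reduction the abstract refers to.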
Submitted 15 January, 2024; v1 submitted 23 October, 2023;
originally announced October 2023.
-
Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS
Authors:
Yifan Yang,
Feiyu Shen,
Chenpeng Du,
Ziyang Ma,
Kai Yu,
Daniel Povey,
Xie Chen
Abstract:
Self-supervised learning (SSL) proficiency in speech-related tasks has driven research into utilizing discrete tokens for speech tasks like recognition and translation, which offer lower storage requirements and great potential to employ natural language processing techniques. However, these studies, mainly single-task focused, faced challenges like overfitting and performance degradation in speech recognition tasks, often at the cost of sacrificing performance in multi-task scenarios. This study presents a comprehensive comparison and optimization of discrete tokens generated by various leading SSL models in speech recognition and synthesis tasks. We aim to explore the universality of speech discrete tokens across multiple speech tasks. Experimental results demonstrate that discrete tokens achieve comparable results against systems trained on FBank features in speech recognition tasks and outperform mel-spectrogram features in speech synthesis in subjective and objective metrics. These findings suggest that universal discrete tokens have enormous potential in various speech-related tasks. Our work is open-source and publicly available at https://github.com/k2-fsa/icefall.
Submitted 14 December, 2023; v1 submitted 13 September, 2023;
originally announced September 2023.
-
A Long-Tail Friendly Representation Framework for Artist and Music Similarity
Authors:
Haoran Xiang,
Junyu Dai,
Xuchen Song,
Furao Shen
Abstract:
The investigation of the similarity between artists and music is crucial in music retrieval and recommendation, and addressing the challenge of the long-tail phenomenon is increasingly important. This paper proposes a Long-Tail Friendly Representation Framework (LTFRF) that utilizes neural networks to model the similarity relationship. Our approach integrates music, user, metadata, and relationship data into a unified metric learning framework, and employs a meta-consistency relationship as a regular term to introduce the Multi-Relationship Loss. Compared to the Graph Neural Network (GNN), our proposed framework improves the representation performance in long-tail scenarios, which are characterized by sparse relationships between artists and music. We conduct experiments and analysis on the AllMusic dataset, and the results demonstrate that our framework provides a favorable generalization of artist and music representation. Specifically, on similar artist/music recommendation tasks, the LTFRF outperforms the baseline by 9.69%/19.42% in Hit Ratio@10, and in long-tail cases, the framework achieves 11.05%/14.14% higher than the baseline in Consistent@10.
Submitted 8 September, 2023;
originally announced September 2023.
-
UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding
Authors:
Chenpeng Du,
Yiwei Guo,
Feiyu Shen,
Zhijun Liu,
Zheng Liang,
Xie Chen,
Shuai Wang,
Hui Zhang,
Kai Yu
Abstract:
The utilization of discrete speech tokens, divided into semantic tokens and acoustic tokens, has been proven superior to traditional acoustic feature mel-spectrograms in terms of naturalness and robustness for text-to-speech (TTS) synthesis. Recent popular models, such as VALL-E and SPEAR-TTS, allow zero-shot speaker adaptation through auto-regressive (AR) continuation of acoustic tokens extracted from a short speech prompt. However, these AR models are restricted to generate speech only in a left-to-right direction, making them unsuitable for speech editing where both preceding and following contexts are provided. Furthermore, these models rely on acoustic tokens, which have audio quality limitations imposed by the performance of audio codec models. In this study, we propose a unified context-aware TTS framework called UniCATS, which is capable of both speech continuation and editing. UniCATS comprises two components, an acoustic model CTX-txt2vec and a vocoder CTX-vec2wav. CTX-txt2vec employs contextual VQ-diffusion to predict semantic tokens from the input text, enabling it to incorporate the semantic context and maintain seamless concatenation with the surrounding context. Following that, CTX-vec2wav utilizes contextual vocoding to convert these semantic tokens into waveforms, taking into consideration the acoustic context. Our experimental results demonstrate that CTX-vec2wav outperforms HifiGAN and AudioLM in terms of speech resynthesis from semantic tokens. Moreover, we show that UniCATS achieves state-of-the-art performance in both speech continuation and editing.
Submitted 28 March, 2024; v1 submitted 13 June, 2023;
originally announced June 2023.
-
Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge
Authors:
Chenpeng Du,
Yiwei Guo,
Feiyu Shen,
Kai Yu
Abstract:
In this paper, we describe the systems developed by the SJTU X-LANCE team for the LIMMITS 2023 Challenge, and we mainly focus on the winning system on naturalness for track 1. The aim of this challenge is to build a multi-speaker multi-lingual text-to-speech (TTS) system for Marathi, Hindi and Telugu. Each of the languages has a male and a female speaker in the given dataset. In track 1, only 5 hours of data from each speaker can be selected to train the TTS model. Our system is based on the recently proposed VQTTS that utilizes VQ acoustic features rather than mel-spectrograms. We introduce additional speaker embeddings and language embeddings to VQTTS for controlling the speaker and language information. In the cross-lingual evaluations where we need to synthesize speech in a cross-lingual speaker's voice, we provide a native speaker's embedding to the acoustic model and the target speaker's embedding to the vocoder. In the subjective MOS listening test on naturalness, our system achieves 4.77, which ranks first.
Submitted 8 November, 2024; v1 submitted 25 April, 2023;
originally announced April 2023.
-
Learning Efficient, Explainable and Discriminative Representations for Pulmonary Nodules Classification
Authors:
Hanliang Jiang,
Fuhao Shen,
Fei Gao,
Weidong Han
Abstract:
Automatic pulmonary nodule classification is important for the early diagnosis of lung cancers. Recently, deep learning techniques have enabled remarkable progress in this field. However, these deep models are typically of high computational complexity and work in a black-box manner. To combat these challenges, in this work, we aim to build an efficient and (partially) explainable classification model. Specifically, we use neural architecture search (NAS) to automatically search 3D network architectures with an excellent accuracy/speed trade-off. Besides, we use the convolutional block attention module (CBAM) in the networks, which helps us understand the reasoning process. During training, we use A-Softmax loss to learn angularly discriminative representations. In the inference stage, we employ an ensemble of diverse neural networks to improve the prediction accuracy and robustness. We conduct extensive experiments on the LIDC-IDRI database. Compared with the previous state of the art, our model shows highly comparable performance while using fewer than 1/40 of the parameters. Besides, an empirical study shows that the reasoning process of the learned networks is consistent with physicians' diagnoses. Related code and results have been released at: https://github.com/fei-hdu/NAS-Lung.
Submitted 18 January, 2021;
originally announced January 2021.
-
3D Spectrum Mapping Based on ROI-Driven UAV Deployment
Authors:
Qihui Wu,
Feng Shen,
Zheng Wang,
Guoru Ding
Abstract:
Given the explosive growth of Internet of Things (IoT) devices ranging from the two-dimensional (2D) ground to the three-dimensional (3D) space, it is a necessity to establish a 3D spectrum map to comprehensively present and effectively manage the 3D spatial spectrum resources in smart city infrastructures. By leveraging the popularity and location flexibility of the unmanned aerial vehicles (UAVs), we are able to execute spatial sampling with these emerging flying spectrum-monitoring devices (SMDs) at will. In this paper, we first present a brief survey to show the state-of-the-art studies on spectrum mapping. Then, we introduce the 3D spectrum mapping model. Next, we propose a 3D spectrum mapping framework which is composed of pre-sampling, spectrum situation estimation, UAV deployment and spectrum recovery. Therein we develop a Region of Interest (ROI)-driven UAV deployment scheme, which selects new sampling points of the highest estimated interest and the lowest energy cost iteratively. Meanwhile, we slice the entire 3D spectrum map into a series of "images" and "repair" those unsampled locations. Furthermore, we provide an exemplary case study on the 3D spectrum mapping, where, for example, an important event is being held and the entire spectrum situation needs to be monitored in real time to deal with malicious interference sources. Lastly, the challenges and open issues are discussed.
Submitted 6 August, 2020;
originally announced August 2020.
-
A Global Solution Method for Decentralized Multi-Area SCUC and Savings Allocation Based on MILP Value Functions
Authors:
Xiaodong Zheng,
Haoyong Chen,
Yan Xu,
Feifan Shen,
Zipeng Liang
Abstract:
To address the issue that Lagrangian dual function based algorithms cannot guarantee convergence and global optimality for decentralized multi-area security constrained unit commitment (M-SCUC) problems, a novel decomposition and coordination method using MILP (mixed integer linear programming) value functions is proposed in this paper. Each regional system operator sets the tie-line power injections as variational parameters in its regional SCUC model, and utilizes a finite algorithm to generate a MILP value function, which returns the optimal generation cost for any given interchange scheduling. With the value functions available from all system operators, theoretically, a coordinator is able to derive a globally optimal interchange scheduling. Since power exchanges may alter the financial position of each area considerably from what it would have been via scheduling independently, we then propose a fair savings allocation method using the value functions derived above and the Shapley value in cooperative game theory. Numerical experiments on a two-area 12-bus system and a three-area 457-bus system are carried out. The validity of the value function based method is verified for the decentralized M-SCUC problems. The outcome of savings allocation is compared with that of the locational marginal cost based method.
Submitted 17 June, 2019;
originally announced June 2019.
-
Understanding the Temporal Fading in Wireless Industrial Networks: Measurements and Analyses
Authors:
Qilong Zhang,
Qiwei Zhang,
Wuxiong Zhang,
Fei Shen,
Tian Hong Loh,
Fei Qin
Abstract:
The wide deployment of wireless industrial networks still faces the challenge of unreliable service due to severe multipath fading in industrial environments. Such fading effects are caused not only by the massive metal surfaces existing within the industrial environment but also, more significantly, by moving objects including operators and logistical vehicles. As a result, the mature analytical framework of mobile fading channels may not be appropriate for wireless industrial networks, especially for the fixed wireless links that form the majority. In this paper, we propose a qualitative analysis framework to characterize the temporal fading effects of the fixed wireless links in industrial environments, which reveals the essential reason for the correlated temporal variation of both the specular and scattered power. Extensive measurements of both the envelope distribution and impulse response from field experiments validate the proposed qualitative framework, which can be used to simulate the industrial multipath fading characteristics and to derive accurate link quality metrics to support reliable wireless network service in various industrial applications.
Submitted 28 September, 2018;
originally announced September 2018.