Efficient Multilingual ASR Finetuning via LoRA Language Experts

Authors: Jiahong Li, Yiwen Shao, Jianheng Zhuo, Chenda Li, Liliang Tang, Dong Yu, Yanmin Qian

Abstract: Recent advancements in deep learning have significantly enhanced multilingual automatic speech recognition (ASR) due to the development of advanced model architectures and available large-scale multilingual datasets. Despite that, multilingual ASR still suffers from the curse of multilinguality in that different languages tend to interfere with each other, making it difficult for the ASR model to identify multiple languages effectively while sharing model capacity across them. This paper proposes an efficient finetuning framework for customized multilingual ASR via prepared LoRA language experts based on Whisper. Through LoRA expert fusion or knowledge distillation, our approach achieves better recognition performance on target languages than standard fine-tuning methods. Experimental results demonstrate that the proposed models yield approximately 10% and 15% relative performance gains in language-aware and language-agnostic scenarios, respectively.

Submitted 11 June, 2025; originally announced June 2025.

Comments: Accepted in Interspeech 2025

arXiv:2506.07520 [pdf, ps, other]

LeVo: High-Quality Song Generation with Multi-Preference Alignment

Authors: Shun Lei, Yaoxun Xu, Zhiwei Lin, Huaicheng Zhang, Wei Tan, Hangting Chen, Jianwei Yu, Yixuan Zhang, Chenyu Yang, Haina Zhu, Shuai Wang, Zhiyong Wu, Dong Yu

Abstract: Recent advances in large language models (LLMs) and audio language models have significantly improved music generation, particularly in lyrics-to-song generation. However, existing approaches still struggle with the complex composition of songs and the scarcity of high-quality data, leading to limitations in sound quality, musicality, instruction following, and vocal-instrument harmony. To address these challenges, we introduce LeVo, an LM-based framework consisting of LeLM and a music codec. LeLM is capable of parallelly modeling two types of tokens: mixed tokens, which represent the combined audio of vocals and accompaniment to achieve vocal-instrument harmony, and dual-track tokens, which separately encode vocals and accompaniment for high-quality song generation. It employs two decoder-only transformers and a modular extension training strategy to prevent interference between different token types. To further enhance musicality and instruction following, we introduce a multi-preference alignment method based on Direct Preference Optimization (DPO). This method handles diverse human preferences through a semi-automatic data construction process and DPO post-training. Experimental results demonstrate that LeVo consistently outperforms existing methods on both objective and subjective metrics. Ablation studies further justify the effectiveness of our designs. Audio examples are available at https://levo-demo.github.io/. Code is released at https://github.com/tencent-ailab/songgeneration.

Submitted 15 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.05891 [pdf, ps, other]

WAKE: Watermarking Audio with Key Enrichment

Authors: Yaoxun Xu, Jianwei Yu, Hangting Chen, Zhiyong Wu, Xixin Wu, Dong Yu, Rongzhi Gu, Yi Luo

Abstract: As deep learning advances in audio generation, challenges in audio security and copyright protection highlight the need for robust audio watermarking. Recent neural network-based methods have made progress but still face three main issues: preventing unauthorized access, decoding initial watermarks after multiple embeddings, and embedding varying lengths of watermarks. To address these issues, we propose WAKE, the first key-controllable audio watermark framework. WAKE embeds watermarks using specific keys and recovers them with corresponding keys, enhancing security by making incorrect key decoding impossible. It also resolves the overwriting issue by allowing watermark decoding after multiple embeddings and supports variable-length watermark insertion. WAKE outperforms existing models in both watermarked audio quality and watermark detection accuracy. Code, more results, and demo page: https://thuhcsi.github.io/WAKE.

Submitted 6 June, 2025; originally announced June 2025.

Comments: Accepted by InterSpeech2025

arXiv:2505.22045 [pdf, ps, other]

Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning

Authors: Le Xu, Chenxing Li, Yong Ren, Yujie Chen, Yu Gu, Ruibo Fu, Shan Yang, Dong Yu

Abstract: Current vision-guided audio captioning systems frequently fail to address audiovisual misalignment in real-world scenarios, such as dubbed content or off-screen sounds. To bridge this critical gap, we present an entropy-aware gated fusion framework that dynamically modulates visual information flow through cross-modal uncertainty quantification. Our novel approach employs attention entropy analysis in cross-attention layers to automatically identify and suppress misleading visual cues during modal fusion. Complementing this architecture, we develop a batch-wise audiovisual shuffling technique that generates synthetic mismatched training pairs, greatly enhancing model resilience against alignment noise. Evaluations on the AudioCaps benchmark demonstrate our system's superior performance over existing baselines, especially in mismatched modality scenarios. Furthermore, our solution demonstrates an approximately 6x improvement in inference speed compared to the baseline.

Submitted 28 May, 2025; originally announced May 2025.

Comments: Accepted by INTERSPEECH 2025

arXiv:2505.21527 [pdf, ps, other]

VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining

Authors: Jianheng Zhuo, Yifan Yang, Yiwen Shao, Yong Xu, Dong Yu, Kai Yu, Xie Chen

Abstract: Automatic speech recognition (ASR) has made remarkable progress but heavily relies on large-scale labeled data, which is scarce for low-resource languages like Vietnamese. While existing systems such as Whisper, USM, and MMS achieve promising performance, their efficacy remains inadequate in terms of training costs, latency, and accessibility. To address these issues, we propose VietASR, a novel ASR training pipeline that leverages vast amounts of unlabeled data and a small set of labeled data. Through multi-iteration ASR-biased self-supervised learning on a large-scale unlabeled dataset, VietASR offers a cost-effective and practical solution for enhancing ASR performance. Experiments demonstrate that pre-training on 70,000-hour unlabeled data and fine-tuning on merely 50-hour labeled data yield a lightweight but powerful ASR model. It outperforms Whisper Large-v3 and commercial ASR systems on real-world data. Our code and models will be open-sourced to facilitate research in low-resource ASR.

Submitted 29 May, 2025; v1 submitted 23 May, 2025; originally announced May 2025.

arXiv:2505.13062 [pdf, other]

Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model

Authors: Yong Ren, Chenxing Li, Le Xu, Hao Gu, Duzhen Zhang, Yujie Chen, Manjie Xu, Ruibo Fu, Shan Yang, Dong Yu

Abstract: Humans can intuitively infer sounds from silent videos, but whether multimodal large language models can perform modal-mismatch reasoning without accessing target modalities remains relatively unexplored. Current text-assisted-video-to-audio (VT2A) methods excel in video foley tasks but struggle to acquire audio descriptions during inference. We introduce the task of Reasoning Audio Descriptions from Silent Videos (SVAD) to address this challenge and investigate vision-language models' (VLMs) capabilities on this task. To further enhance the VLMs' reasoning capacity for the SVAD task, we construct a CoT-AudioCaps dataset and propose a Chain-of-Thought-based supervised fine-tuning strategy. Experiments on SVAD and subsequent VT2A tasks demonstrate our method's effectiveness in two key aspects: significantly improving VLMs' modal-mismatch reasoning for SVAD and effectively addressing the challenge of acquiring audio descriptions during VT2A inference.

Submitted 27 May, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

Comments: Accepted by Interspeech 2025

arXiv:2504.08661 [pdf, other]

Safe Flow Matching: Robot Motion Planning with Control Barrier Functions

Authors: Xiaobing Dai, Zewen Yang, Dian Yu, Shanshan Zhang, Hamid Sadeghian, Sami Haddadin, Sandra Hirche

Abstract: Recent advances in generative modeling have led to promising results in robot motion planning, particularly through diffusion and flow matching (FM)-based models that capture complex, multimodal trajectory distributions. However, these methods are typically trained offline and remain limited when faced with new environments with constraints, often lacking explicit mechanisms to ensure safety during deployment. In this work, we propose Safe Flow Matching (SafeFlow), a motion planning framework, for trajectory generation that integrates flow matching with safety guarantees. SafeFlow leverages our proposed flow matching barrier functions (FMBF) to ensure the planned trajectories remain within safe regions across the entire planning horizon. Crucially, our approach enables training-free, real-time safety enforcement at test time, eliminating the need for retraining. We evaluate SafeFlow on a diverse set of tasks, including planar robot navigation and 7-DoF manipulation, demonstrating superior safety and planning performance compared to state-of-the-art generative planners. Comprehensive resources are available on the project website: https://safeflowmatching.github.io

Submitted 3 May, 2025; v1 submitted 11 April, 2025; originally announced April 2025.

arXiv:2503.16288 [pdf, other]

doi 10.1109/TCSVT.2025.3552971

Overview of Variable Rate Coding in JPEG AI

Authors: Panqi Jia, Fabian Brand, Dequan Yu, Alexander Karabutov, Elena Alshina, Andre Kaup

Abstract: Empirical evidence has demonstrated that learning-based image compression can outperform classical compression frameworks. This has led to the ongoing standardization of learned-based image codecs, namely Joint Photographic Experts Group (JPEG) AI. The objective of JPEG AI is to enhance compression efficiency and provide a software and hardware-friendly solution. Based on our research, JPEG AI represents the first standardization that can facilitate the implementation of a learned image codec on a mobile device. This article presents an overview of the variable rate coding functionality in JPEG AI, which includes three variable rate adaptations: a three-dimensional quality map, a fast bit rate matching algorithm, and a training strategy. The variable rate adaptations offer a continuous rate function up to 2.0 bpp, exhibiting a high level of performance, a flexible bit allocation between different color components, and a region of interest function for the specified use case. The evaluation of performance encompasses both objective and subjective results. With regard to the objective bit rate matching, the main profile with low complexity yielded a 13.1% BD-rate gain over VVC intra, while the high profile with high complexity achieved a 19.2% BD-rate gain over VVC intra. The BD-rate result is calculated as the mean of the seven perceptual metrics defined in the JPEG AI common test conditions. With respect to subjective results, the example of improving the quality of the region of interest is illustrated.

Submitted 20 March, 2025; originally announced March 2025.

arXiv:2503.12936 [pdf, other]

FNSE-SBGAN: Far-field Speech Enhancement with Schrodinger Bridge and Generative Adversarial Networks

Authors: Tong Lei, Qinwen Hu, Ziyao Lin, Andong Li, Rilin Chen, Meng Yu, Dong Yu, Jing Lu

Abstract: The prevailing method for neural speech enhancement predominantly utilizes fully-supervised deep learning with simulated pairs of far-field noisy-reverberant speech and clean speech. Nonetheless, these models frequently demonstrate restricted generalizability to mixtures recorded in real-world conditions. To address this issue, this study investigates training enhancement models directly on real mixtures. Specifically, we revisit the single-channel far-field to near-field speech enhancement (FNSE) task, focusing on real-world data characterized by low signal-to-noise ratio (SNR), high reverberation, and mid-to-high frequency attenuation. We propose FNSE-SBGAN, a framework that integrates a Schrodinger Bridge (SB)-based diffusion model with generative adversarial networks (GANs). Our approach achieves state-of-the-art performance across various metrics and subjective evaluations, significantly reducing the character error rate (CER) by up to 14.58% compared to far-field signals. Experimental results demonstrate that FNSE-SBGAN preserves superior subjective quality and establishes a new benchmark for real-world far-field speech enhancement. Additionally, we introduce an evaluation framework leveraging matrix rank analysis in the time-frequency domain, providing systematic insights into model performance and revealing the strengths and weaknesses of different generative methods.

Submitted 15 April, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

Comments: 13 pages, 6 figures

arXiv:2502.16897 [pdf, other]

Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM

Authors: Jiatong Shi, Chunlei Zhang, Jinchuan Tian, Junrui Ni, Hao Zhang, Shinji Watanabe, Dong Yu

Abstract: Recent efforts have extended textual LLMs to the speech domain. Yet, a key challenge remains, which is balancing speech understanding and generation while avoiding catastrophic forgetting when integrating acoustically rich codec-based representations into models originally trained on text. In this work, we propose a novel approach that leverages continual pre-training (CPT) on a pre-trained textual LLM to create a codec-based speech language model. This strategy mitigates the modality gap between text and speech, preserving the linguistic reasoning of the original model while enabling high-fidelity speech synthesis. We validate our approach with extensive experiments across multiple tasks, including automatic speech recognition, text-to-speech, speech-to-text translation, and speech-to-speech translation (S2ST), demonstrating that our model achieves superior TTS performance and, notably, the first end-to-end S2ST system based on neural codecs.

Submitted 24 February, 2025; originally announced February 2025.

arXiv:2502.14145 [pdf, other]

LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems

Authors: Hao Zhang, Weiwei Li, Rilin Chen, Vinay Kothapally, Meng Yu, Dong Yu

Abstract: Achieving full-duplex communication in spoken dialogue systems (SDS) requires real-time coordination between listening, speaking, and thinking. This paper proposes a semantic voice activity detection (VAD) module as a dialogue manager (DM) to efficiently manage turn-taking in full-duplex SDS. Implemented as a lightweight (0.5B) LLM fine-tuned on full-duplex conversation data, the semantic VAD predicts four control tokens to regulate turn-switching and turn-keeping, distinguishing between intentional and unintentional barge-ins while detecting query completion for handling user pauses and hesitations. By processing input speech in short intervals, the semantic VAD enables real-time decision-making, while the core dialogue engine (CDE) is only activated for response generation, reducing computational overhead. This design allows independent DM optimization without retraining the CDE, balancing interaction accuracy and inference efficiency for scalable, next-generation full-duplex SDS.

Submitted 24 February, 2025; v1 submitted 19 February, 2025; originally announced February 2025.

Comments: In submission to INTERSPEECH 2025

arXiv:2501.06670 [pdf, other]

A Geometric Analysis-Based Safety Assessment Framework for MASS Route Decision-Making in Restricted Waters

Authors: Zilong Xu, Zihao Wang, He Li, Dingli Yu, Zaili Yang, Jin Wang

Abstract: To enhance the safety of Maritime Autonomous Surface Ships (MASS) navigating in restricted waters, this paper aims to develop a geometric analysis-based route safety assessment (GARSA) framework, specifically designed for their route decision-making in irregularly shaped waterways. Utilizing line and point geometric elements to define waterway boundaries, the framework enables the construction of a dynamic width characterization function to quantify spatial safety along intricate waterways. An iterative method is developed to calculate this function, enabling an abstracted spatial property representation of the waterways. Based on this, we introduce a navigational safety index that balances global navigational safety and local risk to determine the safest route. To accommodate ship kinematic constraints, path modifications are applied using a dynamic window approach. A case study in a simulated Port of Hamburg environment shows that GARSA effectively identifies safe routes and avoids the risk of entering narrow waterways in an autonomous manner, thereby prioritizing safety in route decision-making for MASS in confined waters.

Submitted 11 January, 2025; originally announced January 2025.

arXiv:2501.01102 [pdf, other]

Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT

Authors: Dongyang Dai, Zhiyong Wu, Shiyin Kang, Xixin Wu, Jia Jia, Dan Su, Dong Yu, Helen Meng

Abstract: Grapheme-to-phoneme (G2P) conversion serves as an essential component in Chinese Mandarin text-to-speech (TTS) systems, where polyphone disambiguation is the core issue. In this paper, we propose an end-to-end framework to predict the pronunciation of a polyphonic character, which accepts a sentence containing a polyphonic character as input in the form of a Chinese character sequence without the necessity of any preprocessing. The proposed method consists of a pre-trained bidirectional encoder representations from Transformers (BERT) model and a neural network (NN) based classifier. The pre-trained BERT model extracts semantic features from a raw Chinese character sequence and the NN based classifier predicts the polyphonic character's pronunciation according to BERT output. In our experiments, we implemented three classifiers: a fully-connected network based classifier, a long short-term memory (LSTM) network based classifier, and a Transformer block based classifier. The experimental results, compared with the baseline approach based on LSTM, demonstrate that the pre-trained model extracts effective semantic features, which greatly enhances the performance of polyphone disambiguation. In addition, we also explored the impact of contextual information on polyphone disambiguation.

Submitted 2 January, 2025; originally announced January 2025.

Comments: Accepted at INTERSPEECH 2019

Journal ref: Proc. Interspeech 2019, pp. 2090-2094

arXiv:2411.16729 [pdf, other]

DiM-Gestor: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2

Authors: Fan Zhang, Siyuan Zhao, Naye Ji, Zhaohan Wang, Jingmei Wu, Fuxing Gao, Zhenqing Ye, Leyao Yan, Lanxin Dai, Weidong Geng, Xin Lyu, Bozuo Zhao, Dingguo Yu, Hui Du, Bin Hu

Abstract: Speech-driven gesture generation using transformer-based generative models represents a rapidly advancing area within virtual human creation. However, existing models face significant challenges due to their quadratic time and space complexities, limiting scalability and efficiency. To address these limitations, we introduce DiM-Gestor, an innovative end-to-end generative model leveraging the Mamba-2 architecture. DiM-Gestor features a dual-component framework: (1) a fuzzy feature extractor and (2) a speech-to-gesture mapping module, both built on the Mamba-2. The fuzzy feature extractor, integrated with a Chinese Pre-trained Model and Mamba-2, autonomously extracts implicit, continuous speech features. These features are synthesized into a unified latent representation and then processed by the speech-to-gesture mapping module. This module employs an Adaptive Layer Normalization (AdaLN)-enhanced Mamba-2 mechanism to uniformly apply transformations across all sequence tokens. This enables precise modeling of the nuanced interplay between speech features and gesture dynamics. We utilize a diffusion model to train and infer diverse gesture outputs. Extensive subjective and objective evaluations conducted on the newly released Chinese Co-Speech Gestures dataset corroborate the efficacy of our proposed model. Compared with Transformer-based architecture, the assessments reveal that our approach delivers competitive results and significantly reduces memory usage, approximately 2.4 times, and enhances inference speeds by 2 to 4 times. Additionally, we released the CCG dataset, a Chinese Co-Speech Gestures dataset, comprising 15.97 hours (six styles across five scenarios) of 3D full-body skeleton gesture motion performed by professional Chinese TV broadcasters.

Submitted 23 November, 2024; originally announced November 2024.

Comments: 13 pages, 11 figures

arXiv:2410.06544 [pdf, other]

SRC-gAudio: Sampling-Rate-Controlled Audio Generation

Authors: Chenxing Li, Manjie Xu, Dong Yu

Abstract: We introduce SRC-gAudio, a novel audio generation model designed to facilitate text-to-audio generation across a wide range of sampling rates within a single model architecture. SRC-gAudio incorporates the sampling rate as part of the generation condition to guide the diffusion-based audio generation process. Our model enables the generation of audio at multiple sampling rates with a single unified model. Furthermore, we explore the potential benefits of large-scale, low-sampling-rate data in enhancing the generation quality of high-sampling-rate audio. Through extensive experiments, we demonstrate that SRC-gAudio effectively generates audio under controlled sampling rates. Additionally, our results indicate that pre-training on low-sampling-rate data can lead to significant improvements in audio quality across various metrics.

Submitted 9 October, 2024; originally announced October 2024.

Comments: Accepted by APSIPA2024

arXiv:2410.03751 [pdf, other]

Recent Advances in Speech Language Models: A Survey

Authors: Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Yiwen Guo, Irwin King

Abstract: Large Language Models (LLMs) have recently garnered significant attention, primarily for their capabilities in text-based interactions. However, natural human interaction often relies on speech, necessitating a shift towards voice-based models. A straightforward approach to achieve this involves a pipeline of "Automatic Speech Recognition (ASR) + LLM + Text-to-Speech (TTS)", where input speech is transcribed to text, processed by an LLM, and then converted back to speech. Despite being straightforward, this method suffers from inherent limitations, such as information loss during modality conversion, significant latency due to the complex pipeline, and error accumulation across the three stages. To address these issues, Speech Language Models (SpeechLMs) -- end-to-end models that generate speech without converting from text -- have emerged as a promising alternative. This survey paper provides the first comprehensive overview of recent methodologies for constructing SpeechLMs, detailing the key components of their architecture and the various training recipes integral to their development. Additionally, we systematically survey the various capabilities of SpeechLMs, categorize their evaluation metrics, and discuss the challenges and future research directions in this rapidly evolving field. The GitHub repository is available at https://github.com/dreamtheater123/Awesome-SpeechLM-Survey

Submitted 5 February, 2025; v1 submitted 1 October, 2024; originally announced October 2024.

Comments: Work in progress

arXiv:2410.01150 [pdf, other]

Restorative Speech Enhancement: A Progressive Approach Using SE and Codec Modules

Authors: Hsin-Tien Chiang, Hao Zhang, Yong Xu, Meng Yu, Dong Yu

Abstract: In challenging environments with significant noise and reverberation, traditional speech enhancement (SE) methods often lead to over-suppressed speech, creating artifacts during listening and harming downstream task performance. To overcome these limitations, we propose a novel approach called Restorative SE (RestSE), which combines a lightweight SE module with a generative codec module to progressively enhance and restore speech quality. The SE module initially reduces noise, while the codec module subsequently performs dereverberation and restores speech using generative capabilities. We systematically explore various quantization techniques within the codec module to optimize performance. Additionally, we introduce a weighted loss function and feature fusion that merges the SE output with the original mixture, particularly at segments where the SE output is heavily distorted. Experimental results demonstrate the effectiveness of our proposed method in enhancing speech quality under adverse conditions. Audio demos are available at: https://sophie091524.github.io/RestorativeSE/.

Submitted 1 October, 2024; originally announced October 2024.

Comments: Paper in submission

arXiv:2409.14709 [pdf, other]

Video-to-Audio Generation with Fine-grained Temporal Semantics

Authors: Yuchen Hu, Yu Gu, Chenxing Li, Rilin Chen, Dong Yu

Abstract: With recent advances in AIGC, video generation has gained a surge of research interest in both academia and industry (e.g., Sora). However, it remains a challenge to produce temporally aligned audio that synchronizes with the generated video, considering the complicated semantic information included in the latter. In this work, inspired by the recent success of text-to-audio (TTA) generation, we first investigate the video-to-audio (VTA) generation framework based on the latent diffusion model (LDM). Similar to the latest pioneering explorations in VTA, our preliminary results also show the great potential of LDM in the VTA task, but it still suffers from sub-optimal temporal alignment. To this end, we propose to enhance the temporal alignment of VTA with frame-level semantic information. With the recently popular grounding segment anything model (Grounding SAM), we can extract the fine-grained semantics in video frames to enable VTA to produce better-aligned audio signals. Extensive experiments demonstrate the effectiveness of our system on both objective and subjective evaluation metrics, which show both better audio quality and fine-grained temporal alignment.

Submitted 23 September, 2024; originally announced September 2024.

arXiv:2409.10819 [pdf, ps, other]

EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

Authors: Jiarui Hai, Yong Xu, Hao Zhang, Chenxing Li, Helin Wang, Mounya Elhilali, Dong Yu

Abstract: We introduce EzAudio, a text-to-audio (T2A) generation framework designed to produce high-quality, natural-sounding sound effects. Core designs include: (1) We propose EzAudio-DiT, an optimized Diffusion Transformer (DiT) designed for audio latent representations, improving convergence speed, as well as parameter and memory efficiency. (2) We apply a classifier-free guidance (CFG) rescaling technique to mitigate fidelity loss at higher CFG scores and enhance prompt adherence without compromising audio quality. (3) We propose a synthetic caption generation strategy leveraging recent advances in audio understanding and LLMs to enhance T2A pretraining. We show that EzAudio, with its computationally efficient architecture and fast convergence, is a competitive open-source model that excels in both objective and subjective evaluations by delivering highly realistic listening experiences. Code, data, and pre-trained models are released at: https://haidog-yaqub.github.io/EzAudio-Page/.

Submitted 19 June, 2025; v1 submitted 16 September, 2024; originally announced September 2024.

Comments: Accepted at Interspeech 2025

arXiv:2409.08601 [pdf, other]

STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment

Authors: Yong Ren, Chenxing Li, Manjie Xu, Wei Liang, Yu Gu, Rilin Chen, Dong Yu

Abstract: Visual and auditory perception are two crucial ways humans experience the world. Text-to-video generation has made remarkable progress over the past year, but the absence of harmonious audio in generated video limits its broader applications. In this paper, we propose Semantic and Temporal Aligned Video-to-Audio (STA-V2A), an approach that enhances audio generation from videos by extracting both local temporal and global semantic video features and combining these refined video features with text as cross-modal guidance. To address the issue of information redundancy in videos, we propose an onset prediction pretext task for local temporal feature extraction and an attentive pooling module for global semantic feature extraction. To supplement the insufficient semantic information in videos, we propose a Latent Diffusion Model with Text-to-Audio priors initialization and cross-modal guidance. We also introduce Audio-Audio Align, a new metric to assess audio-temporal alignment. Subjective and objective metrics demonstrate that our method surpasses existing Video-to-Audio models in generating audio with better quality, semantic consistency, and temporal alignment. The ablation experiment validated the effectiveness of each module. Audio samples are available at https://y-ren16.github.io/STAV2A.

Submitted 24 March, 2025; v1 submitted 13 September, 2024; originally announced September 2024.

Comments: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

arXiv:2409.07556 [pdf, other]

SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis

Authors: Helin Wang, Meng Yu, Jiarui Hai, Chen Chen, Yuchen Hu, Rilin Chen, Najim Dehak, Dong Yu

Abstract: In this paper, we introduce SSR-Speech, a neural codec autoregressive model designed for stable, safe, and robust zero-shot text-based speech editing and text-to-speech synthesis. SSR-Speech is built on a Transformer decoder and incorporates classifier-free guidance to enhance the stability of the generation process. A watermark Encodec is proposed to embed frame-level watermarks into the edited regions of the speech so that which parts were edited can be detected. In addition, the waveform reconstruction leverages the original unedited speech segments, providing superior recovery compared to the Encodec model. Our approach achieves state-of-the-art performance in the RealEdit speech editing task and the LibriTTS text-to-speech task, surpassing previous methods. Furthermore, SSR-Speech excels in multi-span speech editing and also demonstrates remarkable robustness to background sounds. The source code and demos are released.

Submitted 1 January, 2025; v1 submitted 11 September, 2024; originally announced September 2024.

Comments: ICASSP 2025

arXiv:2409.06954 [pdf, other]

Neural Ambisonic Encoding For Multi-Speaker Scenarios Using A Circular Microphone Array

Authors: Yue Qiao, Vinay Kothapally, Meng Yu, Dong Yu

Abstract: Spatial audio formats like Ambisonics are playback device layout-agnostic and well-suited for applications such as teleconferencing and virtual reality. Conventional Ambisonic encoding methods often rely on spherical microphone arrays for efficient sound field capture, which limits their flexibility in practical scenarios. We propose a deep learning (DL)-based approach, leveraging a two-stage network architecture for encoding circular microphone array signals into second-order Ambisonics (SOA) in multi-speaker environments. In addition, we introduce: (i) a novel loss function based on spatial power maps to regularize inter-channel correlations of the Ambisonic signals, and (ii) a channel permutation technique to resolve the ambiguity of encoding vertical information using a horizontal circular array. Evaluation on simulated speech and noise datasets shows that our approach consistently outperforms traditional signal processing (SP) and DL-based methods, providing significantly better timbral and spatial quality and higher source localization accuracy. Binaural audio demos with visualizations are available at https://bridgoon97.github.io/NeuralAmbisonicEncoding/.

Submitted 16 September, 2024; v1 submitted 10 September, 2024; originally announced September 2024.

Comments: Submitted to ICASSP 2025

arXiv:2409.01622 [pdf]

T1-contrast Enhanced MRI Generation from Multi-parametric MRI for Glioma Patients with Latent Tumor Conditioning

Authors: Zach Eidex, Mojtaba Safari, Richard L. J. Qiu, David S. Yu, Hui-Kuo Shu, Hui Mao, Xiaofeng Yang

Abstract: Objective: Gadolinium-based contrast agents (GBCAs) are commonly used in MRI scans of patients with gliomas to enhance brain tumor characterization using T1-weighted (T1W) MRI. However, there is growing concern about GBCA toxicity. This study develops a deep-learning framework to generate T1-postcontrast (T1C) from pre-contrast multiparametric MRI. Approach: We propose the tumor-aware vision transformer (TA-ViT) model that predicts high-quality T1C images. The predicted tumor region is significantly improved (P < .001) by conditioning the transformer layers on predicted segmentation maps through an adaptive layer norm zero mechanism. The predicted segmentation maps were generated with the multi-parametric residual (MPR) ViT model and transformed into a latent space to produce compressed, feature-rich representations. The TA-ViT model predicted T1C MRI images of 501 glioma cases. Selected patients were split into training (N=400), validation (N=50), and test (N=51) sets. Main Results: Both qualitative and quantitative results demonstrate that the TA-ViT model outperforms the benchmark MRP-ViT model. Our method produces synthetic T1C MRI with high soft tissue contrast and more accurately reconstructs both the tumor and whole brain volumes. The synthesized T1C images achieved remarkable improvements in both tumor and healthy tissue regions compared to the MRP-ViT model. For healthy tissue and tumor regions, the results were as follows: NMSE: 8.53 +/- 4.61E-4; PSNR: 31.2 +/- 2.2; NCC: 0.908 +/- .041 and NMSE: 1.22 +/- 1.27E-4, PSNR: 41.3 +/- 4.7, and NCC: 0.879 +/- 0.042, respectively. Significance: The proposed method generates synthetic T1C images that closely resemble real T1C images. Future development and application of this approach may enable contrast-agent-free MRI for brain tumor patients, eliminating the risk of GBCA toxicity and simplifying the MRI scan protocol.

Submitted 3 September, 2024; originally announced September 2024.

Comments: arXiv admin note: text overlap with arXiv:2407.02616

arXiv:2408.17431 [pdf, other]

Advancing Multi-talker ASR Performance with Large Language Models

Authors: Mohan Shi, Zengrui Jin, Yaoxun Xu, Yong Xu, Shi-Xiong Zhang, Kun Wei, Yiwen Shao, Chunlei Zhang, Dong Yu

Abstract: Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problems for automatic speech recognition (ASR). Serialized output training (SOT) is a classic method to address multi-talker ASR, with the idea of concatenating transcriptions from multiple speakers according to the emission times of their speech for training. However, SOT-style transcriptions, derived from concatenating multiple related utterances in a conversation, depend significantly on modeling long contexts. Therefore, compared to traditional methods that primarily emphasize encoder performance in attention-based encoder-decoder (AED) architectures, a novel approach utilizing large language models (LLMs) that leverages the capabilities of pre-trained decoders may be better suited for such complex and challenging scenarios. In this paper, we propose an LLM-based SOT approach for multi-talker ASR, leveraging a pre-trained speech encoder and LLM and fine-tuning them on multi-talker datasets using appropriate strategies. Experimental results demonstrate that our approach surpasses traditional AED-based methods on the simulated dataset LibriMix and achieves state-of-the-art performance on the evaluation set of the real-world dataset AMI, outperforming the AED model trained with 1000 times more supervised data in previous works.

Submitted 30 August, 2024; originally announced August 2024.

Comments: 8 pages, accepted by IEEE SLT 2024

arXiv:2408.01320 [pdf, ps, other]

Generalized Reduced-WMMSE Approach for Cell-Free Massive MIMO With Per-AP Power Constraints

Authors: Wonsik Yoo, Daesung Yu, Hoon Lee, Seok-Hwan Park

Abstract: The optimization of cooperative beamforming vectors in cell-free massive MIMO (mMIMO) systems is presented where multi-antenna access points (APs) support downlink data transmission of multiple users. Despite the success of the weighted minimum mean squared error (WMMSE) algorithm and its variants, they lack careful investigation of the computational complexity that scales with the number of antennas and APs. We propose a generalized and reduced WMMSE (G-R-WMMSE) approach whose complexity is significantly lower than conventional WMMSE. We partition the set of beamforming coefficients into subvectors, with each subvector corresponding to a specific AP. Such a partitioning approach decomposes the original WMMSE problem across individual APs. By leveraging the Lagrange duality analysis, a closed-form solution can be derived for each subproblem, which substantially reduces the computation burden. Additionally, we present a parallel execution of the proposed G-R-WMMSE with adaptive step sizes, aiming at further reducing the time complexity. Numerical results validate that the proposed G-R-WMMSE schemes achieve over 99% complexity savings compared to the conventional WMMSE scheme while maintaining almost the same performance.

Submitted 2 August, 2024; originally announced August 2024.

Comments: accepted for publication in IEEE Wireless Communications Letters

arXiv:2407.07464 [pdf, other]

Video-to-Audio Generation with Hidden Alignment

Authors: Manjie Xu, Chenxing Li, Xinyi Tu, Yong Ren, Rilin Chen, Yu Gu, Wei Liang, Dong Yu

Abstract: Generating semantically and temporally aligned audio content in accordance with video input has become a focal point for researchers, particularly following the remarkable breakthrough in text-to-video generation. In this work, we aim to offer insights into the video-to-audio generation paradigm, focusing on three crucial aspects: vision encoders, auxiliary embeddings, and data augmentation techniques. Beginning with a foundational model built on a simple yet surprisingly effective intuition, we explore various vision encoders and auxiliary embeddings through ablation studies. Employing a comprehensive evaluation pipeline that emphasizes generation quality and video-audio synchronization alignment, we demonstrate that our model exhibits state-of-the-art video-to-audio generation capabilities. Furthermore, we provide critical insights into the impact of different data augmentation methods on enhancing the generation framework's overall capacity. We showcase possibilities to advance the challenge of generating synchronized audio from semantic and temporal perspectives. We hope these insights will serve as a stepping stone toward developing more realistic and accurate audio-visual generation models.

Submitted 11 March, 2025; v1 submitted 10 July, 2024; originally announced July 2024.

Comments: https://sites.google.com/view/vta-ldm

arXiv:2406.11175 [pdf, other]

doi 10.1109/SLT61566.2024.10832279

SMRU: Split-and-Merge Recurrent-based UNet for Acoustic Echo Cancellation and Noise Suppression

Authors: Zhihang Sun, Andong Li, Rilin Chen, Hao Zhang, Meng Yu, Yi Zhou, Dong Yu

Abstract: The proliferation of deep neural networks has spawned the rapid development of acoustic echo cancellation and noise suppression, and many prior approaches have been proposed that yield promising performance. Nevertheless, they rarely consider deployment generality across different processing scenarios, such as edge devices and cloud processing. To this end, this paper proposes a general model, termed SMRU, to cover different application scenarios. The novelty is twofold. First, a multi-scale band split layer and band merge layer are proposed to effectively fuse local frequency bands for lower complexity modeling. Second, by simulating the multi-resolution feature modeling characteristic of the classical UNet structure, a novel recurrent-dominated UNet is devised. It consists of multiple variable frame rate blocks, each of which involves the causal time down-/up-sampling layer with varying compression ratios and the dual-path structure for inter- and intra-band modeling. The model is configured from 50 M/s to 6.8 G/s in terms of MACs, and the experimental results show that the proposed approach yields competitive or even better performance over existing baselines, and has the full potential to adapt to more general scenarios with varying complexity requirements.

Submitted 24 January, 2025; v1 submitted 16 June, 2024; originally announced June 2024.

Comments: 8 pages, Accepted to SLT 2024

Journal ref: 2024 IEEE Spoken Language Technology Workshop (SLT), pp. 317-324, 2024

arXiv:2406.09589 [pdf, other]

Multi-Channel Multi-Speaker ASR Using Target Speaker's Solo Segment

Authors: Yiwen Shao, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Daniel Povey, Sanjeev Khudanpur

Abstract: In the field of multi-channel, multi-speaker Automatic Speech Recognition (ASR), the task of discerning and accurately transcribing a target speaker's speech within background noise remains a formidable challenge. Traditional approaches often rely on microphone array configurations and the information of the target speaker's location or voiceprint. This study introduces the Solo Spatial Feature (Solo-SF), an innovative method that utilizes a target speaker's isolated speech segment to enhance ASR performance, thereby circumventing the need for conventional inputs like microphone array layouts. We explore effective strategies for selecting optimal solo segments, a crucial aspect for Solo-SF's success. Through evaluations conducted on the AliMeeting dataset and AISHELL-1 simulations, Solo-SF demonstrates superior performance over existing techniques, significantly lowering Character Error Rates (CER) in various test conditions. Our findings highlight Solo-SF's potential as an effective solution for addressing the complexities of multi-channel, multi-speaker ASR tasks.

Submitted 17 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

Comments: Accepted for presentation at Interspeech 2024

arXiv:2406.04350 [pdf, other]

Prompt-guided Precise Audio Editing with Diffusion Models

Authors: Manjie Xu, Chenxing Li, Duzhen Zhang, Dan Su, Wei Liang, Dong Yu

Abstract: Audio editing involves the arbitrary manipulation of audio content through precise control. Although text-guided diffusion models have made significant advancements in text-to-audio generation, they still face challenges in finding a flexible and precise way to modify target events within an audio track. We present a novel approach, referred to as PPAE, which serves as a general module for diffusion models and enables precise audio editing. The editing is based on the input textual prompt only and is entirely training-free. We exploit the cross-attention maps of diffusion models to facilitate accurate local editing and employ a hierarchical local-global pipeline to ensure a smoother editing process. Experimental results highlight the effectiveness of our method in various editing tasks.

Submitted 11 May, 2024; originally announced June 2024.

Comments: Accepted by ICML 2024

arXiv:2406.00976 [pdf, other]

Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer

Authors: Yongxin Zhu, Dan Su, Liqiang He, Linli Xu, Dong Yu

Abstract: While recent advancements in speech language models have achieved significant progress, they face remarkable challenges in modeling the long acoustic sequences of neural audio codecs. In this paper, we introduce Generative Pre-trained Speech Transformer (GPST), a hierarchical transformer designed for efficient speech language modeling. GPST quantizes audio waveforms into two distinct types of discrete speech representations and integrates them within a hierarchical transformer architecture, allowing for a unified one-stage generation process and enhancing Hi-Res audio generation capabilities. By training on large corpora of speech in an end-to-end unsupervised manner, GPST can generate syntactically consistent speech with diverse speaker identities. Given a brief 3-second prompt, GPST can produce natural and coherent personalized speech, demonstrating in-context learning abilities. Moreover, our approach can be easily extended to spoken cross-lingual speech generation by incorporating multi-lingual semantic tokens and universal acoustic tokens. Experimental results indicate that GPST significantly outperforms the existing speech language models in terms of word error rate, speech quality, and speaker similarity. The code is available at https://github.com/youngsheen/GPST.

Submitted 1 November, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

Comments: Accept in ACL2024-main

arXiv:2404.08549 [pdf]

Practical Guidelines for Cell Segmentation Models Under Optical Aberrations in Microscopy

Authors: Boyuan Peng, Jiaju Chen, P. Bilha Githinji, Ijaz Gul, Qihui Ye, Minjiang Chen, Peiwu Qin, Xingru Huang, Chenggang Yan, Dongmei Yu, Jiansong Ji, Zhenglin Chen

Abstract: Cell segmentation is essential in biomedical research for analyzing cellular morphology and behavior. Deep learning methods, particularly convolutional neural networks (CNNs), have revolutionized cell segmentation by extracting intricate features from images. However, the robustness of these methods under microscope optical aberrations remains a critical challenge. This study evaluates cell image segmentation models under optical aberrations from fluorescence and bright field microscopy. By simulating different types of aberrations, including astigmatism, coma, spherical aberration, trefoil, and mixed aberrations, we conduct a thorough evaluation of various cell instance segmentation models using the DynamicNuclearNet (DNN) and LIVECell datasets, representing fluorescence and bright field microscopy cell datasets, respectively. We train and test several segmentation models, including the Otsu threshold method and Mask R-CNN with different network heads (FPN, C3) and backbones (ResNet, VGG, Swin Transformer), under aberrated conditions. Additionally, we provide usage recommendations for the Cellpose 2.0 Toolbox on complex cell degradation images. The results indicate that the combination of FPN and SwinS demonstrates superior robustness in handling simple cell images affected by minor aberrations. In contrast, Cellpose 2.0 proves effective for complex cell images under similar conditions. Furthermore, we innovatively propose the Point Spread Function Image Label Classification Model (PLCM). This model can quickly and accurately identify aberration types and amplitudes from PSF images, assisting researchers without optical training. Through PLCM, researchers can better apply our proposed cell segmentation guidelines.

Submitted 25 August, 2024; v1 submitted 12 April, 2024; originally announced April 2024.

arXiv:2404.08453 [pdf, other]

Lightweight Multi-System Multivariate Interconnection and Divergence Discovery

Authors: Mulugeta Weldezgina Asres, Christian Walter Omlin, Jay Dittmann, Pavel Parygin, Joshua Hiltbrand, Seth I. Cooper, Grace Cummings, David Yu

Abstract: Identifying outlier behavior among sensors and subsystems is essential for discovering faults and facilitating diagnostics in large systems. At the same time, exploring large systems with numerous multivariate data sets is challenging. This study presents a lightweight interconnection and divergence discovery mechanism (LIDD) to identify abnormal behavior in multi-system environments. The approach employs a multivariate analysis technique that first estimates the similarity heatmaps among the sensors for each system and then applies information retrieval algorithms to provide relevant multi-level interconnection and discrepancy details. Our experiment on the readout systems of the Hadron Calorimeter of the Compact Muon Solenoid (CMS) experiment at CERN demonstrates the effectiveness of the proposed method. Our approach clusters readout systems and their sensors consistent with the expected calorimeter interconnection configurations, while capturing unusual behavior in divergent clusters and estimating their root causes.

Submitted 12 April, 2024; originally announced April 2024.

Comments: 8 pages, 12 figures

arXiv:2404.08285 [pdf]

A Survey of Neural Network Robustness Assessment in Image Recognition

Authors: Jie Wang, Jun Ai, Minyan Lu, Haoran Su, Dan Yu, Yutao Zhang, Junda Zhu, Jingyu Liu

Abstract: In recent years, there has been significant attention given to the robustness assessment of neural networks. Robustness plays a critical role in ensuring reliable operation of artificial intelligence (AI) systems in complex and uncertain environments. Deep learning's robustness problem is particularly significant, highlighted by the discovery of adversarial attacks on image classification models. Researchers have dedicated efforts to evaluate robustness in diverse perturbation conditions for image recognition tasks. Robustness assessment encompasses two main techniques: robustness verification/certification for deliberate adversarial attacks and robustness testing for random data corruptions. In this survey, we present a detailed examination of both adversarial robustness (AR) and corruption robustness (CR) in neural network assessment. Analyzing current research papers and standards, we provide an extensive overview of robustness assessment in image recognition. Three essential aspects are analyzed: concepts, metrics, and assessment methods. We investigate the perturbation metrics and range representations used to measure the degree of perturbations on images, as well as the robustness metrics specifically for the robustness conditions of classification models. The strengths and limitations of the existing methods are also discussed, and some potential directions for future research are provided.

Submitted 15 April, 2024; v1 submitted 12 April, 2024; originally announced April 2024.

Comments: Corrected typos and grammatical errors in Section 5

arXiv:2403.02307 [pdf, other]

Harnessing Intra-group Variations Via a Population-Level Context for Pathology Detection

Authors: P. Bilha Githinji, Xi Yuan, Zhenglin Chen, Ijaz Gul, Dingqi Shang, Wen Liang, Jianming Deng, Dan Zeng, Dongmei Yu, Chenggang Yan, Peiwu Qin

Abstract: Realizing sufficient separability between the distributions of healthy and pathological samples is a critical obstacle for pathology detection convolutional models. Moreover, these models exhibit a bias for contrast-based images, with diminished performance on texture-based medical images. This study introduces the notion of a population-level context for pathology detection and employs a graph-theoretic approach to model and incorporate it into the latent code of an autoencoder via a refinement module we term PopuSense. PopuSense seeks to capture additional intra-group variations inherent in biomedical data that a local or global context of the convolutional model might miss or smooth out. Proof-of-concept experiments on contrast-based and texture-based images, with minimal adaptation, encounter the existing preference for intensity-based input. Nevertheless, PopuSense demonstrates improved separability in contrast-based images, presenting an additional avenue for refining representations learned by a model.

Submitted 25 July, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

arXiv:2402.01828 [pdf, other]

Retrieval Augmented End-to-End Spoken Dialog Models

Authors: Mingqiu Wang, Izhak Shafran, Hagen Soltau, Wei Han, Yuan Cao, Dian Yu, Laurent El Shafey

Abstract: We recently developed SLM, a joint speech and language model, which fuses a pretrained foundational speech model and a large language model (LLM), while preserving the in-context learning capability intrinsic to the pretrained LLM. In this paper, we apply SLM to speech dialog applications where the dialog states are inferred directly from the audio signal. Task-oriented dialogs often contain domain-specific entities, i.e., restaurants, hotels, train stations, and city names, which are difficult to recognize yet critical for the downstream applications. Inspired by the RAG (retrieval-augmented generation) paradigm, we propose a retrieval augmented SLM (ReSLM) that overcomes this weakness. We first train a speech retriever to retrieve text entities mentioned in the audio. The retrieved entities are then added as text inputs to the underlying SLM to bias model predictions. We evaluated ReSLM on the speech MultiWoz task (DSTC-11 challenge), and found that this retrieval augmentation boosts model performance, improving joint goal accuracy (38.6% vs 32.7%), slot error rate (20.6% vs 24.8%) and ASR word error rate (5.5% vs 6.7%). While demonstrated on dialog state tracking, our approach is broadly applicable to other speech tasks requiring contextual information or domain-specific entities, such as contextual ASR with biasing capability.

Submitted 2 February, 2024; originally announced February 2024.

Journal ref: Proc. ICASSP 2024

arXiv:2312.06101 [pdf, other]

Hundred-Kilobyte Lookup Tables for Efficient Single-Image Super-Resolution

Authors: Binxiao Huang, Jason Chun Lok Li, Jie Ran, Boyu Li, Jiajun Zhou, Dahai Yu, Ngai Wong

Abstract: Conventional super-resolution (SR) schemes make heavy use of convolutional neural networks (CNNs), which involve intensive multiply-accumulate (MAC) operations, and require specialized hardware such as graphics processing units. This contradicts the regime of edge AI that often runs on devices strained by power, computing, and storage resources. Such a challenge has motivated a series of lookup table (LUT)-based SR schemes that employ simple LUT readout and largely elude CNN computation. Nonetheless, the multi-megabyte LUTs in existing methods still prohibit on-chip storage and necessitate off-chip memory transport. This work tackles this storage hurdle and innovates hundred-kilobyte LUT (HKLUT) models amenable to on-chip cache. Utilizing an asymmetric two-branch multistage network coupled with a suite of specialized kernel patterns, HKLUT demonstrates an uncompromising performance and superior hardware efficiency over existing LUT schemes. Our implementation is publicly available at: https://github.com/jasonli0707/hklut.

Submitted 8 May, 2024; v1 submitted 10 December, 2023; originally announced December 2023.

arXiv:2311.13075 [pdf, other]

Deep Audio Zooming: Beamwidth-Controllable Neural Beamformer

Authors: Meng Yu, Dong Yu

Abstract: Audio zooming, a signal processing technique, enables selective focusing and enhancement of sound signals from a specified region, attenuating others. While traditional beamforming and neural beamforming techniques, centered on creating a directional array, necessitate the designation of a singular target direction, they often overlook the concept of a field of view (FOV), which defines an angular area. In this paper, we propose a simple yet effective FOV feature, amalgamating all directional attributes within the user-defined field. In conjunction, we introduce a counter FOV feature capturing directional aspects outside the desired field. Such advancements ensure refined sound capture, particularly emphasizing the FOV's boundaries, and guarantee the enhanced capture of all desired sound sources inside the user-defined field. The results from the experiment demonstrate the efficacy of the introduced angular FOV feature and its seamless incorporation into a low-power subband model suited for real-time applications.

Submitted 21 November, 2023; originally announced November 2023.

Comments: 6 pages, 5 figures

arXiv:2311.07202 [pdf, other]

doi 10.1016/j.apenergy.2024.124472

Real-Time Machine-Learning-Based Optimization Using Input Convex Long Short-Term Memory Network

Authors: Zihao Wang, Donghan Yu, Zhe Wu

Abstract: Neural network-based optimization and control methods, often referred to as black-box approaches, are increasingly gaining attention in energy and manufacturing systems, particularly in situations where first-principles models are either unavailable or inaccurate. However, their non-convex nature significantly slows down the optimization and control processes, limiting their application in real-time decision-making processes. To address this challenge, we propose a novel Input Convex Long Short-Term Memory (IC-LSTM) network to enhance the computational efficiency of neural network-based optimization. Through two case studies employing real-time neural network-based optimization for optimizing energy and chemical systems, we demonstrate the superior performance of IC-LSTM-based optimization in terms of runtime. Specifically, in a real-time optimization problem of a real-world solar photovoltaic energy system at LHT Holdings in Singapore, IC-LSTM-based optimization achieved at least 4-fold speedup compared to conventional LSTM-based optimization. These results highlight the potential of IC-LSTM networks to significantly enhance the efficiency of neural network-based optimization and control in practical applications. Source code is available at https://github.com/killingbear999/ICLSTM.

Submitted 10 September, 2024; v1 submitted 13 November, 2023; originally announced November 2023.

Comments: Applied Energy

arXiv:2311.00146 [pdf, other]

RIR-SF: Room Impulse Response Based Spatial Feature for Target Speech Recognition in Multi-Channel Multi-Speaker Scenarios

Authors: Yiwen Shao, Shi-Xiong Zhang, Dong Yu

Abstract: Automatic speech recognition (ASR) on multi-talker recordings is challenging. Current methods using 3D spatial data from multi-channel audio and visual cues focus mainly on direct waves from the target speaker, overlooking reflection wave impacts, which hinders performance in reverberant environments. Our research introduces RIR-SF, a novel spatial feature based on room impulse response (RIR) that leverages the speaker's position, room acoustics, and reflection dynamics. RIR-SF significantly outperforms traditional 3D spatial features, showing superior theoretical and empirical performance. We also propose an optimized all-neural multi-channel ASR framework for RIR-SF, achieving a relative 21.3% reduction in CER for target speaker ASR in multi-channel settings. RIR-SF enhances recognition accuracy and demonstrates robustness in high-reverberation scenarios, overcoming the limitations of previous methods.

Submitted 11 June, 2024; v1 submitted 31 October, 2023; originally announced November 2023.

Comments: Accepted for presentation at Interspeech 2024

arXiv:2310.16367 [pdf, other]

UniX-Encoder: A Universal $X$-Channel Speech Encoder for Ad-Hoc Microphone Array Speech Processing

Authors: Zili Huang, Yiwen Shao, Shi-Xiong Zhang, Dong Yu

Abstract: The speech field is evolving to solve more challenging scenarios, such as multi-channel recordings with multiple simultaneous talkers. Given the many types of microphone setups out there, we present the UniX-Encoder. It is a universal encoder designed for multiple tasks that works with any microphone array, in both solo and multi-talker environments. Our research enhances previous multi-channel speech processing efforts in four key areas: 1) Adaptability: Contrasting traditional models constrained to certain microphone array configurations, our encoder is universally compatible. 2) Multi-Task Capability: Beyond the single-task focus of previous systems, UniX-Encoder acts as a robust upstream model, adeptly extracting features for diverse tasks including ASR and speaker recognition. 3) Self-Supervised Training: The encoder is trained without requiring labeled multi-channel data. 4) End-to-End Integration: In contrast to models that first beamform and then process single channels, our encoder offers an end-to-end solution, bypassing explicit beamforming or separation. To validate its effectiveness, we tested the UniX-Encoder on a synthetic multi-channel dataset from the LibriSpeech corpus. Across tasks like speech recognition and speaker diarization, our encoder consistently outperformed combinations like the WavLM model with the BeamformIt frontend.

Submitted 25 October, 2023; originally announced October 2023.

Comments: Submitted to ICASSP 2024

arXiv:2310.11954 [pdf, other]

MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models

Authors: Dingyao Yu, Kaitao Song, Peiling Lu, Tianyu He, Xu Tan, Wei Ye, Shikun Zhang, Jiang Bian

Abstract: AI-empowered music processing is a diverse field that encompasses dozens of tasks, ranging from generation tasks (e.g., timbre synthesis) to comprehension tasks (e.g., music classification). For developers and amateurs, it is very difficult to grasp all of these tasks to satisfy their requirements in music processing, especially considering the huge differences in the representations of music data and the model applicability across platforms among various tasks. Consequently, it is necessary to build a system to organize and integrate these tasks, and thus help practitioners to automatically analyze their demand and call suitable tools as solutions to fulfill their requirements. Inspired by the recent success of large language models (LLMs) in task automation, we develop a system, named MusicAgent, which integrates numerous music-related tools and an autonomous workflow to address user requirements. More specifically, we build 1) a toolset that collects tools from diverse sources, including Hugging Face, GitHub, and Web APIs, and 2) an autonomous workflow empowered by LLMs (e.g., ChatGPT) to organize these tools, automatically decompose user requests into multiple sub-tasks, and invoke the corresponding music tools. The primary goal of this system is to free users from the intricacies of AI-music tools, enabling them to concentrate on the creative aspect. By granting users the freedom to effortlessly combine tools, the system offers a seamless and enriching music experience.

Submitted 25 October, 2023; v1 submitted 18 October, 2023; originally announced October 2023.

arXiv:2310.10992 [pdf, other]

A High Fidelity and Low Complexity Neural Audio Coding

Authors: Wenzhe Liu, Wei Xiao, Meng Wang, Shan Yang, Yupeng Shi, Yuyong Kang, Dan Su, Shidong Shang, Dong Yu

Abstract: Audio coding is an essential module in the real-time communication system. Neural audio codecs can compress audio samples with a low bitrate due to the strong modeling and generative capabilities of deep neural networks. To address the poor high-frequency expression and the high computational cost and storage consumption, we propose an integrated framework that utilizes a neural network to model wide-band components and adopts traditional signal processing to compress high-band components according to psychological hearing knowledge. Inspired by auditory perception theory, a perception-based loss function is designed to improve harmonic modeling. Besides, generative adversarial network (GAN) compression is proposed for the first time for neural audio codecs. Our method is superior to prior advanced neural codecs across subjective and objective metrics and allows real-time inference on desktop and mobile.

Submitted 17 October, 2023; originally announced October 2023.

arXiv:2310.01292 [pdf, other]

Efficient Remote Sensing Segmentation With Generative Adversarial Transformer

Authors: Luyi Qiu, Dayu Yu, Xiaofeng Zhang, Chenxiao Zhang

Abstract: Most deep learning methods that achieve high segmentation accuracy require deep network architectures that are too heavy and complex to run on embedded devices with limited storage and memory space. To address this issue, this paper proposes an efficient Generative Adversarial Transformer (GATrans) for achieving high-precision semantic segmentation while maintaining an extremely efficient size. The framework utilizes a Global Transformer Network (GTNet) as the generator, efficiently extracting multi-level features through residual connections. GTNet employs global transformer blocks with progressively linear computational complexity to reassign global features based on a learnable similarity function. To focus on object-level and pixel-level information, the GATrans optimizes the objective function by combining structural similarity losses. We validate the effectiveness of our approach through extensive experiments on the Vaihingen dataset, achieving an average F1 score of 90.17% and an overall accuracy of 91.92%.

Submitted 2 October, 2023; originally announced October 2023.

arXiv:2310.00900 [pdf, other]

uSee: Unified Speech Enhancement and Editing with Conditional Diffusion Models

Authors: Muqiao Yang, Chunlei Zhang, Yong Xu, Zhongweiyang Xu, Heming Wang, Bhiksha Raj, Dong Yu

Abstract: Speech enhancement aims to improve speech signals in terms of quality and intelligibility, and speech editing refers to the process of editing the speech according to specific user needs. In this paper, we propose a Unified Speech Enhancement and Editing (uSee) model with conditional diffusion models to handle various tasks at the same time in a generative manner. Specifically, by providing multiple types of conditions including self-supervised learning embeddings and proper text prompts to the score-based diffusion model, we can enable controllable generation of the unified speech enhancement and editing model to perform corresponding actions on the source speech. Our experiments show that our proposed uSee model can achieve superior performance in both speech denoising and dereverberation compared to other related generative speech enhancement models, and can perform speech editing given desired environmental sound text description, signal-to-noise ratios (SNR), and room impulse responses (RIR). Demos of the generated speech are available at https://muqiaoy.github.io/usee.

Submitted 2 October, 2023; originally announced October 2023.

arXiv:2310.00230 [pdf, other]

SLM: Bridge the thin gap between speech and text foundation models

Authors: Mingqiu Wang, Wei Han, Izhak Shafran, Zelin Wu, Chung-Cheng Chiu, Yuan Cao, Yongqiang Wang, Nanxin Chen, Yu Zhang, Hagen Soltau, Paul Rubenstein, Lukas Zilka, Dian Yu, Zhong Meng, Golan Pundak, Nikhil Siddhartha, Johan Schalkwyk, Yonghui Wu

Abstract: We present a joint Speech and Language Model (SLM), a multitask, multilingual, and dual-modal model that takes advantage of pretrained foundational speech and language models. SLM freezes the pretrained foundation models to maximally preserve their capabilities, and only trains a simple adapter with just 1% (156M) of the foundation models' parameters. This adaptation not only leads SLM to achieve strong performance on conventional tasks such as speech recognition (ASR) and speech translation (AST), but also introduces the novel capability of zero-shot instruction-following for more diverse tasks: given a speech input and a text instruction, SLM is able to perform unseen generation tasks including contextual biasing ASR using real-time context, dialog generation, speech continuation, and question answering, etc. Our approach demonstrates that the representational gap between pretrained speech and language models might be narrower than one would expect, and can be bridged by a simple adaptation mechanism. As a result, SLM is not only efficient to train, but also inherits strong capabilities already acquired in foundation models of different modalities.

Submitted 29 September, 2023; originally announced October 2023.

arXiv:2309.16049 [pdf, other]

Neural Network Augmented Kalman Filter for Robust Acoustic Howling Suppression

Authors: Yixuan Zhang, Hao Zhang, Meng Yu, Dong Yu

Abstract: Acoustic howling suppression (AHS) is a critical challenge in audio communication systems. In this paper, we propose a novel approach that leverages the power of neural networks (NN) to enhance the performance of traditional Kalman filter algorithms for AHS. Specifically, our method integrates NN modules into the Kalman filter to refine the reference signal, a key factor in effective adaptive filtering, and to estimate covariance metrics for the filter, which are crucial for adaptability in dynamic conditions. As a result, the proposed method achieves improved AHS performance compared to both standalone NN and Kalman filter methods. Experimental evaluations validate the effectiveness of our approach.

Submitted 27 September, 2023; originally announced September 2023.

Comments: Paper in submission

arXiv:2309.16048 [pdf, other]

Advancing Acoustic Howling Suppression through Recursive Training of Neural Networks

Authors: Hao Zhang, Yixuan Zhang, Meng Yu, Dong Yu

Abstract: In this paper, we introduce a novel training framework designed to comprehensively address the acoustic howling issue by examining its fundamental formation process. This framework integrates a neural network (NN) module into the closed-loop system during training with signals generated recursively on the fly to closely mimic the streaming process of acoustic howling suppression (AHS). The proposed recursive training strategy bridges the gap between training and real-world inference scenarios, marking a departure from previous NN-based methods that typically approach AHS as either noise suppression or acoustic echo cancellation. Within this framework, we explore two methodologies: one exclusively relying on NN and the other combining NN with the traditional Kalman filter. Additionally, we propose strategies, including howling detection and initialization using pre-trained offline models, to bolster trainability and expedite the training process. Experimental results validate that this framework offers a substantial improvement over previous methodologies for acoustic howling suppression.

Submitted 27 September, 2023; originally announced September 2023.

Comments: Paper in submission

arXiv:2309.09028 [pdf, other]

Unifying Robustness and Fidelity: A Comprehensive Study of Pretrained Generative Methods for Speech Enhancement in Adverse Conditions

Authors: Heming Wang, Meng Yu, Hao Zhang, Chunlei Zhang, Zhongweiyang Xu, Muqiao Yang, Yixuan Zhang, Dong Yu

Abstract: Enhancing speech signal quality in adverse acoustic environments is a persistent challenge in speech processing. Existing deep-learning-based enhancement methods often struggle to effectively remove background noise and reverberation in real-world scenarios, hampering listening experiences. To address these challenges, we propose a novel approach that uses pre-trained generative methods to resynthesize clean, anechoic speech from degraded inputs. This study leverages pre-trained vocoder or codec models to synthesize high-quality speech while enhancing robustness in challenging scenarios. Generative methods effectively handle information loss in speech signals, resulting in regenerated speech that has improved fidelity and reduced artifacts. By harnessing the capabilities of pre-trained models, we achieve faithful reproduction of the original speech in adverse conditions. Experimental evaluations on both simulated datasets and realistic samples demonstrate the effectiveness and robustness of our proposed methods. In particular, by leveraging a codec, we achieve superior subjective scores for both simulated and realistic recordings. The generated speech exhibits enhanced audio quality and reduced background noise and reverberation. Our findings highlight the potential of pre-trained generative techniques in speech processing, particularly in scenarios where traditional methods falter. Demos are available at https://whmrtm.github.io/SoundResynthesis.

Submitted 16 September, 2023; originally announced September 2023.

Comments: Paper in submission

arXiv:2309.07432 [pdf, other]

SpatialCodec: Neural Spatial Speech Coding

Authors: Zhongweiyang Xu, Yong Xu, Vinay Kothapally, Heming Wang, Muqiao Yang, Dong Yu

Abstract: In this work, we address the challenge of encoding speech captured by a microphone array using deep learning techniques with the aim of preserving and accurately reconstructing crucial spatial cues embedded in multi-channel recordings. We propose a neural spatial audio coding framework that achieves a high compression ratio, leveraging a single-channel neural sub-band codec and SpatialCodec. Our approach encompasses two phases: (i) a neural sub-band codec is designed to encode the reference channel with low bit rates, and (ii) a SpatialCodec captures relative spatial information for accurate multi-channel reconstruction at the decoder end. In addition, we propose novel evaluation metrics to assess the spatial cue preservation: (i) spatial similarity, which calculates cosine similarity on a spatially intuitive beamspace, and (ii) beamformed audio quality. Our system shows superior spatial performance compared with high bitrate baselines and a black-box neural architecture. Demos are available at https://xzwy.github.io/SpatialCodecDemo. Codes and models are available at https://github.com/XZWY/SpatialCodec.

Submitted 8 July, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

Comments: Accepted by ICASSP 2024

arXiv:2309.07416 [pdf, other]

BANC: Towards Efficient Binaural Audio Neural Codec for Overlapping Speech

Authors: Anton Ratnarajah, Shi-Xiong Zhang, Dong Yu

Abstract: We introduce BANC, a neural binaural audio codec designed for efficient speech compression in single and two-speaker scenarios while preserving the spatial location information of each speaker. Our key contributions are as follows: 1) The ability of our proposed model to compress and decode overlapping speech. 2) A novel architecture that compresses speech content and spatial cues separately, ensuring the preservation of each speaker's spatial context after decoding. 3) BANC's proficiency in reducing the bandwidth required for compressing binaural speech by 48% compared to compressing individual binaural channels. In our evaluation, we employed speech enhancement, room acoustics, and perceptual metrics to assess the accuracy of BANC's clean speech and spatial cue estimates.

Submitted 24 November, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

Comments: More results and source code are available at https://anton-jeran.github.io/MAD/

Showing 1–50 of 187 results for author: Yu, D