Skip to main content

Showing 1–50 of 103 results for author: Chng, E

Searching in archive eess. Search in all archives.
.
  1. arXiv:2506.13396  [pdf, ps, other

    cs.CL eess.AS

    Bi-directional Context-Enhanced Speech Large Language Models for Multilingual Conversational ASR

    Authors: Yizhou Peng, Hexin Liu, Eng Siong Chng

    Abstract: This paper introduces the integration of language-specific bi-directional context into a speech large language model (SLLM) to improve multilingual continuous conversational automatic speech recognition (ASR). We propose a character-level contextual masking strategy during training, which randomly removes portions of the context to enhance robustness and better emulate the flawed transcriptions th… ▽ More

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: Submitted to Interspeech 2025 MLC-SLM workshop as a Research Paper

  2. arXiv:2506.13339  [pdf, ps, other

    cs.CL eess.AS

    NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025

    Authors: Yizhou Peng, Bin Wang, Yi-Wen Chao, Ziyang Ma, Haoyang Zhang, Hexin Liu, Xie Chen, Eng Siong Chng

    Abstract: This report details the NTU Speechlab system developed for the Interspeech 2025 Multilingual Conversational Speech and Language Model (MLC-SLM) Challenge (Task I), where we achieved 5th place. We present comprehensive analyses of our multilingual automatic speech recognition system, highlighting key advancements in model architecture, data selection, and training strategies. In particular, languag… ▽ More

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: Submitted to Interspeech 2025 MLC-SLM challenge (5th place). System report

  3. arXiv:2506.11403  [pdf, ps, other

    cs.SD cs.AI eess.AS

    A correlation-permutation approach for speech-music encoders model merging

    Authors: Fabian Ritter-Gutierrez, Yi-Cheng Lin, Jeremy H. M Wong, Hung-yi Lee, Eng Siong Chng, Nancy F. Chen

    Abstract: Creating a unified speech and music model requires expensive pre-training. Model merging can instead create an unified audio model with minimal computational expense. However, direct merging is challenging when the models are not aligned in the weight space. Motivated by Git Re-Basin, we introduce a correlation-permutation approach that aligns a music encoder's internal layers with a speech encode… ▽ More

    Submitted 12 June, 2025; originally announced June 2025.

    Comments: Under review

  4. arXiv:2505.17417  [pdf, ps, other

    eess.AS cs.CL cs.SD

    Speechless: Speech Instruction Training Without Speech for Low Resource Languages

    Authors: Alan Dao, Dinh Bach Vu, Huy Hoang Ha, Tuan Le Duc Anh, Shreyas Gopal, Yue Heng Yeo, Warren Keng Hoong Low, Eng Siong Chng, Jia Qi Yip

    Abstract: The rapid growth of voice assistants powered by large language models (LLM) has highlighted a need for speech instruction data to train these systems. Despite the abundance of speech recognition data, there is a notable scarcity of speech instruction data, which is essential for fine-tuning models to understand and execute spoken commands. Generating high-quality synthetic speech requires a good t… ▽ More

    Submitted 22 May, 2025; originally announced May 2025.

    Comments: This paper was accepted by INTERSPEECH 2025

  5. arXiv:2505.17076  [pdf, ps, other

    cs.CL cs.AI cs.SD eess.AS

    Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English

    Authors: Haoyang Zhang, Hexin Liu, Xiangyu Zhang, Qiquan Zhang, Yuchen Hu, Junqi Zhao, Fei Tian, Xuerui Yang, Leibny Paola Garcia, Eng Siong Chng

    Abstract: The speech tokenizer plays a crucial role in recent speech tasks, generally serving as a bridge between speech signals and language models. While low-frame-rate codecs are widely employed as speech tokenizers, the impact of frame rates on speech tokens remains underexplored. In this study, we investigate how varying frame rates affect speech tokenization by examining Mandarin and English, two typo… ▽ More

    Submitted 13 June, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

    Comments: 6 pages, 5 figures

    MSC Class: 68T10 ACM Class: I.2.7

  6. arXiv:2505.16211  [pdf, ps, other

    cs.SD cs.AI cs.CL eess.AS

    AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models

    Authors: Kai Li, Can Shen, Yile Liu, Jirui Han, Kelong Zheng, Xuechao Zou, Zhe Wang, Xingjian Du, Shun Zhang, Hanjun Luo, Yingbin Jin, Xinxin Xing, Ziyang Ma, Yue Liu, Xiaojun Jia, Yifan Zhang, Junfeng Fang, Kun Wang, Yibo Yan, Haoyang Li, Yiming Li, Xiaobin Zhuang, Yang Liu, Haibo Hu, Zhuo Chen , et al. (7 additional authors not shown)

    Abstract: The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, systematic research on evaluating these models, particularly concerning risks unique to the audio modality, remains largely unexplored. Existing evaluation frameworks primarily focus on the text modality or address only a restricted set of safet… ▽ More

    Submitted 22 May, 2025; originally announced May 2025.

    Comments: Technical Report

  7. arXiv:2505.15004  [pdf, ps, other

    eess.AS cs.SD

    EASY: Emotion-aware Speaker Anonymization via Factorized Distillation

    Authors: Jixun Yao, Hexin Liu, Eng Siong Chng, Lei Xie

    Abstract: Emotion plays a significant role in speech interaction, conveyed through tone, pitch, and rhythm, enabling the expression of feelings and intentions beyond words to create a more personalized experience. However, most existing speaker anonymization systems employ parallel disentanglement methods, which only separate speech into linguistic content and speaker identity, often neglecting the preserva… ▽ More

    Submitted 30 May, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

    Comments: Accepted by INTERSPEECH 2025

  8. arXiv:2505.13270  [pdf, ps, other

    cs.SD eess.AS

    Distilling a speech and music encoder with task arithmetic

    Authors: Fabian Ritter-Gutierrez, Yi-Cheng Lin, Jui-Chiang Wei, Jeremy H. M Wong, Eng Siong Chng, Nancy F. Chen, Hung-yi Lee

    Abstract: Despite the progress in self-supervised learning (SSL) for speech and music, existing models treat these domains separately, limiting their capacity for unified audio understanding. A unified model is desirable for applications that require general representations, e.g. audio large language models. Nonetheless, directly training a general model for speech and music is computationally expensive. Kn… ▽ More

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: Accepted at INTERSPEECH 2025

  9. arXiv:2505.13032  [pdf, other

    cs.SD cs.CL cs.MM eess.AS

    MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

    Authors: Ziyang Ma, Yinghao Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, Ruiyang Xu, Wenxi Chen, Yuanzhe Chen, Zhuo Chen, Jian Cong, Kai Li, Keliang Li, Siyou Li, Xinfeng Li, Xiquan Li, Zheng Lian, Yuzhe Liang, Minghao Liu, Zhikang Niu, Tianrui Wang, Yuping Wang, Yuxuan Wang, Yihao Wu, Guanrou Yang, Jianwei Yu , et al. (9 additional authors not shown)

    Abstract: We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that… ▽ More

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: Open-source at https://github.com/ddlBoJack/MMAR

  10. arXiv:2505.07235  [pdf, other

    cs.SD eess.AS

    Multi-band Frequency Reconstruction for Neural Psychoacoustic Coding

    Authors: Dianwen Ng, Kun Zhou, Yi-Wen Chao, Zhiwei Xiong, Bin Ma, Eng Siong Chng

    Abstract: Achieving high-fidelity audio compression while preserving perceptual quality across diverse content remains a key challenge in Neural Audio Coding (NAC). We introduce MUFFIN, a fully convolutional Neural Psychoacoustic Coding (NPC) framework that leverages psychoacoustically guided multi-band frequency reconstruction. At its core is a Multi-Band Spectral Residual Vector Quantization (MBS-RVQ) mod… ▽ More

    Submitted 12 May, 2025; originally announced May 2025.

  11. arXiv:2503.05110  [pdf, other

    cs.SD eess.AS

    UniArray: Unified Spectral-Spatial Modeling for Array-Geometry-Agnostic Speech Separation

    Authors: Weiguang Chen, Junjie Zhang, Jielong Yang, Eng Siong Chng, Xionghu Zhong

    Abstract: Array-geometry-agnostic speech separation (AGA-SS) aims to develop an effective separation method regardless of the microphone array geometry. Conventional methods rely on permutation-free operations, such as summation or attention mechanisms, to capture spatial information. However, these approaches often incur high computational costs or disrupt the effective use of spatial information during in… ▽ More

    Submitted 6 March, 2025; originally announced March 2025.

    Comments: 5 pages, Prepirnt

  12. arXiv:2502.16240  [pdf, other

    eess.AS cs.AI cs.LG cs.SD

    Speech Enhancement Using Continuous Embeddings of Neural Audio Codec

    Authors: Haoyang Li, Jia Qi Yip, Tianyu Fan, Eng Siong Chng

    Abstract: Recent advancements in Neural Audio Codec (NAC) models have inspired their use in various speech processing tasks, including speech enhancement (SE). In this work, we propose a novel, efficient SE approach by leveraging the pre-quantization output of a pretrained NAC encoder. Unlike prior NAC-based SE methods, which process discrete speech tokens using Language Models (LMs), we perform SE within t… ▽ More

    Submitted 22 February, 2025; originally announced February 2025.

    Comments: Accepted to ICASSP 2025

  13. arXiv:2502.02942  [pdf, other

    eess.AS cs.SD

    GenSE: Generative Speech Enhancement via Language Models using Hierarchical Modeling

    Authors: Jixun Yao, Hexin Liu, Chen Chen, Yuchen Hu, EngSiong Chng, Lei Xie

    Abstract: Semantic information refers to the meaning conveyed through words, phrases, and contextual relationships within a given linguistic structure. Humans can leverage semantic information, such as familiar linguistic patterns and contextual cues, to reconstruct incomplete or masked speech signals in noisy environments. However, existing speech enhancement (SE) approaches often overlook the rich semanti… ▽ More

    Submitted 5 February, 2025; originally announced February 2025.

    Comments: Accepted by ICLR 2025

  14. arXiv:2501.17202  [pdf, other

    cs.SD cs.CL eess.AS

    Audio Large Language Models Can Be Descriptive Speech Quality Evaluators

    Authors: Chen Chen, Yuchen Hu, Siyin Wang, Helin Wang, Zhehuai Chen, Chao Zhang, Chao-Han Huck Yang, Eng Siong Chng

    Abstract: An ideal multimodal agent should be aware of the quality of its input modalities. Recent advances have enabled large language models (LLMs) to incorporate auditory systems for handling various speech-related tasks. However, most audio LLMs remain unaware of the quality of the speech they process. This limitation arises because speech quality evaluation is typically excluded from multi-task trainin… ▽ More

    Submitted 11 March, 2025; v1 submitted 27 January, 2025; originally announced January 2025.

    Comments: ICLR 2025

  15. arXiv:2501.07246  [pdf, other

    cs.SD cs.CL cs.MM eess.AS

    Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model

    Authors: Ziyang Ma, Zhuo Chen, Yuping Wang, Eng Siong Chng, Xie Chen

    Abstract: Large Audio-Language Models (LALMs) have demonstrated remarkable performance in tasks involving audio perception and understanding, such as speech recognition and audio captioning. However, their reasoning capabilities - critical for solving complex real-world problems - remain underexplored. In this work, we conduct the first exploration into integrating Chain-of-Thought (CoT) reasoning into LALM… ▽ More

    Submitted 13 January, 2025; originally announced January 2025.

  16. arXiv:2412.17778  [pdf, other

    eess.AS cs.AI cs.LG

    From KAN to GR-KAN: Advancing Speech Enhancement with KAN-Based Methodology

    Authors: Haoyang Li, Yuchen Hu, Chen Chen, Sabato Marco Siniscalchi, Songting Liu, Eng Siong Chng

    Abstract: Deep neural network (DNN)-based speech enhancement (SE) usually uses conventional activation functions, which lack the expressiveness to capture complex multiscale structures needed for high-fidelity SE. Group-Rational KAN (GR-KAN), a variant of Kolmogorov-Arnold Networks (KAN), retains KAN's expressiveness while improving scalability on complex tasks. We adapt GR-KAN to existing DNN-based SE by r… ▽ More

    Submitted 21 May, 2025; v1 submitted 23 December, 2024; originally announced December 2024.

    Comments: Accepted to Interspeech2025

  17. arXiv:2411.19770  [pdf, other

    cs.SD cs.CL eess.AS

    Noro: A Noise-Robust One-shot Voice Conversion System with Hidden Speaker Representation Capabilities

    Authors: Haorui He, Yuchen Song, Yuancheng Wang, Haoyang Li, Xueyao Zhang, Li Wang, Gongping Huang, Eng Siong Chng, Zhizheng Wu

    Abstract: One-shot voice conversion (VC) aims to alter the timbre of speech from a source speaker to match that of a target speaker using just a single reference speech from the target, while preserving the semantic content of the original source speech. Despite advancements in one-shot VC, its effectiveness decreases in real-world scenarios where reference speeches, often sourced from the internet, contain… ▽ More

    Submitted 29 November, 2024; originally announced November 2024.

    Comments: Submitted to IEEE OJSP

  18. arXiv:2411.17998  [pdf, other

    eess.AS

    Speech Separation using Neural Audio Codecs with Embedding Loss

    Authors: Jia Qi Yip, Chin Yuen Kwok, Bin Ma, Eng Siong Chng

    Abstract: Neural audio codecs have revolutionized audio processing by enabling speech tasks to be performed on highly compressed representations. Recent work has shown that speech separation can be achieved within these compressed domains, offering faster training and reduced inference costs. However, current approaches still rely on waveform-based loss functions, necessitating unnecessary decoding steps du… ▽ More

    Submitted 26 November, 2024; originally announced November 2024.

    Comments: Accepted by APSIPA ASC 2024

  19. NTU-NPU System for Voice Privacy 2024 Challenge

    Authors: Nikita Kuzmin, Hieu-Thi Luong, Jixun Yao, Lei Xie, Kong Aik Lee, Eng Siong Chng

    Abstract: In this work, we describe our submissions for the Voice Privacy Challenge 2024. Rather than proposing a novel speech anonymization system, we enhance the provided baselines to meet all required conditions and improve evaluated metrics. Specifically, we implement emotion embedding and experiment with WavLM and ECAPA2 speaker embedders for the B3 baseline. Additionally, we compare different speaker… ▽ More

    Submitted 3 October, 2024; originally announced October 2024.

    Comments: System description for VPC 2024

    Journal ref: 2024 Challenge. Proc. 4th Symposium on Security and Privacy in Speech Communication, 72-79

  20. arXiv:2409.16005  [pdf, other

    cs.CL cs.SD eess.AS

    Bridging Speech and Text: Enhancing ASR with Pinyin-to-Character Pre-training in LLMs

    Authors: Yang Yuhang, Peng Yizhou, Eng Siong Chng, Xionghu Zhong

    Abstract: The integration of large language models (LLMs) with pre-trained speech models has opened up new avenues in automatic speech recognition (ASR). While LLMs excel in multimodal understanding tasks, effectively leveraging their capabilities for ASR remains a significant challenge. This paper presents a novel training approach to enhance LLM performance in ASR tasks. We propose pre-training LLMs on Pi… ▽ More

    Submitted 24 September, 2024; originally announced September 2024.

    Comments: Accepted by ISCSLP2024-Special session-Speech Processing in LLM Era

  21. arXiv:2409.14743  [pdf, other

    eess.AS cs.SD

    LlamaPartialSpoof: An LLM-Driven Fake Speech Dataset Simulating Disinformation Generation

    Authors: Hieu-Thi Luong, Haoyang Li, Lin Zhang, Kong Aik Lee, Eng Siong Chng

    Abstract: Previous fake speech datasets were constructed from a defender's perspective to develop countermeasure (CM) systems without considering diverse motivations of attackers. To better align with real-life scenarios, we created LlamaPartialSpoof, a 130-hour dataset that contains both fully and partially fake speech, using a large language model (LLM) and voice cloning technologies to evaluate the robus… ▽ More

    Submitted 5 January, 2025; v1 submitted 23 September, 2024; originally announced September 2024.

    Comments: 5 pages, ICASSP 2025

  22. arXiv:2409.14712  [pdf, other

    eess.AS cs.SD

    Room Impulse Responses help attackers to evade Deep Fake Detection

    Authors: Hieu-Thi Luong, Duc-Tuan Truong, Kong Aik Lee, Eng Siong Chng

    Abstract: The ASVspoof 2021 benchmark, a widely-used evaluation framework for anti-spoofing, consists of two subsets: Logical Access (LA) and Deepfake (DF), featuring samples with varied coding characteristics and compression artifacts. Notably, the current state-of-the-art (SOTA) system boasts impressive performance, achieving an Equal Error Rate (EER) of 0.87% on the LA subset and 2.58% on the DF. However… ▽ More

    Submitted 23 September, 2024; originally announced September 2024.

    Comments: 7 pages, to be presented at SLT 2024

  23. arXiv:2409.09785  [pdf, other

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition

    Authors: Chao-Han Huck Yang, Taejin Park, Yuan Gong, Yuanchao Li, Zhehuai Chen, Yen-Ting Lin, Chen Chen, Yuchen Hu, Kunal Dhawan, Piotr Żelasko, Chao Zhang, Yun-Nung Chen, Yu Tsao, Jagadeesh Balam, Boris Ginsburg, Sabato Marco Siniscalchi, Eng Siong Chng, Peter Bell, Catherine Lai, Shinji Watanabe, Andreas Stolcke

    Abstract: Given recent advances in generative AI technology, a key question is how large language models (LLMs) can enhance acoustic modeling tasks using text decoding results from a frozen, pretrained automatic speech recognition (ASR) model. To explore new capabilities in language modeling for speech processing, we introduce the generative speech transcription error correction (GenSEC) challenge. This cha… ▽ More

    Submitted 18 October, 2024; v1 submitted 15 September, 2024; originally announced September 2024.

    Comments: IEEE SLT 2024. The initial draft version has been done in December 2023. Post-ASR Text Processing and Understanding Community and LlaMA-7B pre-training correction model: https://huggingface.co/GenSEC-LLM/SLT-Task1-Llama2-7b-HyPo-baseline

  24. arXiv:2409.04173  [pdf, other

    eess.AS

    NPU-NTU System for Voice Privacy 2024 Challenge

    Authors: Jixun Yao, Nikita Kuzmin, Qing Wang, Pengcheng Guo, Ziqian Ning, Dake Guo, Kong Aik Lee, Eng-Siong Chng, Lei Xie

    Abstract: Speaker anonymization is an effective privacy protection solution that conceals the speaker's identity while preserving the linguistic content and paralinguistic information of the original speech. To establish a fair benchmark and facilitate comparison of speaker anonymization systems, the VoicePrivacy Challenge (VPC) was held in 2020 and 2022, with a new edition planned for 2024. In this paper,… ▽ More

    Submitted 4 February, 2025; v1 submitted 6 September, 2024; originally announced September 2024.

    Comments: System description for VPC 2024

  25. arXiv:2407.17902  [pdf, other

    eess.AS

    Multi-Stage Face-Voice Association Learning with Keynote Speaker Diarization

    Authors: Ruijie Tao, Zhan Shi, Yidi Jiang, Duc-Tuan Truong, Eng-Siong Chng, Massimo Alioto, Haizhou Li

    Abstract: The human brain has the capability to associate the unknown person's voice and face by leveraging their general relationship, referred to as ``cross-modal speaker verification''. This task poses significant challenges due to the complex relationship between the modalities. In this paper, we propose a ``Multi-stage Face-voice Association Learning with Keynote Speaker Diarization''~(MFV-KSD) framewo… ▽ More

    Submitted 25 July, 2024; originally announced July 2024.

  26. Continual Learning Optimizations for Auto-regressive Decoder of Multilingual ASR systems

    Authors: Chin Yuen Kwok, Jia Qi Yip, Eng Siong Chng

    Abstract: Continual Learning (CL) involves fine-tuning pre-trained models with new data while maintaining the performance on the pre-trained data. This is particularly relevant for expanding multilingual ASR (MASR) capabilities. However, existing CL methods, mainly designed for computer vision and reinforcement learning tasks, often yield sub-optimal results when directly applied to MASR. We hypothesise tha… ▽ More

    Submitted 27 September, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

    Comments: Proceedings of Interspeech

  27. arXiv:2407.02243  [pdf, other

    cs.CL cs.SD eess.AS

    Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization

    Authors: Yuchen Hu, Chen Chen, Siyin Wang, Eng Siong Chng, Chao Zhang

    Abstract: In this paper, we propose reverse inference optimization (RIO), a simple and effective method designed to enhance the robustness of autoregressive-model-based zero-shot text-to-speech (TTS) systems using reinforcement learning from human feedback (RLHF). To assess the quality of speech produced by the TTS system without human annotations, RIO introduces a novel concept termed as reverse inference… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: 12 pages, Work in progress

  28. Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection

    Authors: Duc-Tuan Truong, Ruijie Tao, Tuan Nguyen, Hieu-Thi Luong, Kong Aik Lee, Eng Siong Chng

    Abstract: Recent synthetic speech detectors leveraging the Transformer model have superior performance compared to the convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of the multi-head self-attention (MHSA) in the Transformer model, which learns the temporal relationship of each input token. However, artifacts of synthetic speech can be located in sp… ▽ More

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH 2024

  29. arXiv:2406.12434  [pdf, other

    cs.SD cs.LG eess.AS

    Towards Audio Codec-based Speech Separation

    Authors: Jia Qi Yip, Shengkui Zhao, Dianwen Ng, Eng Siong Chng, Bin Ma

    Abstract: Recent improvements in neural audio codec (NAC) models have generated interest in adopting pre-trained codecs for a variety of speech processing applications to take advantage of the efficiencies gained from high compression, but these have yet been applied to the speech separation (SS) task. SS can benefit from high compression because the compute required for traditional SS models makes them imp… ▽ More

    Submitted 5 July, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

    Comments: This paper was accepted by Interspeech 2024

  30. arXiv:2406.02963  [pdf, other

    cs.SD eess.AS

    Dataset-Distillation Generative Model for Speech Emotion Recognition

    Authors: Fabian Ritter-Gutierrez, Kuan-Po Huang, Jeremy H. M Wong, Dianwen Ng, Hung-yi Lee, Nancy F. Chen, Eng Siong Chng

    Abstract: Deep learning models for speech rely on large datasets, presenting computational challenges. Yet, performance hinges on training data size. Dataset Distillation (DD) aims to learn a smaller dataset without much performance degradation when training with it. DD has been investigated in computer vision but not yet in speech. This paper presents the first approach for DD to speech targeting Speech Em… ▽ More

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  31. arXiv:2406.00654  [pdf, other

    cs.CL cs.SD eess.AS

    Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback

    Authors: Chen Chen, Yuchen Hu, Wen Wu, Helin Wang, Eng Siong Chng, Chao Zhang

    Abstract: In recent years, text-to-speech (TTS) technology has witnessed impressive advancements, particularly with large-scale training datasets, showcasing human-level speech quality and impressive zero-shot capabilities on unseen speakers. However, despite human subjective evaluations, such as the mean opinion score (MOS), remaining the gold standard for assessing the quality of synthetic speech, even st… ▽ More

    Submitted 2 June, 2024; originally announced June 2024.

    Comments: 19 pages, Preprint

  32. arXiv:2405.14161  [pdf, other

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models

    Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Chengwei Qin, Pin-Yu Chen, Eng Siong Chng, Chao Zhang

    Abstract: We propose an unsupervised adaptation framework, Self-TAught Recognizer (STAR), which leverages unlabeled data to enhance the robustness of automatic speech recognition (ASR) systems in diverse target domains, such as noise and accents. STAR is developed for prevalent speech foundation models based on Transformer-related architecture with auto-regressive decoding (e.g., Whisper, Canary). Specifica… ▽ More

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: 23 pages, Preprint

  33. arXiv:2405.10025  [pdf, other

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models

    Authors: Yuchen Hu, Chen Chen, Chengwei Qin, Qiushi Zhu, Eng Siong Chng, Ruizhe Li

    Abstract: Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which aims to predict the ground-truth transcription from the decoded N-best hypotheses. Thanks to the strong language generation ability of LLMs and rich information in the N-best list, GER shows great effectiveness in enhancing ASR results. However, it still suf… ▽ More

    Submitted 16 May, 2024; originally announced May 2024.

    Comments: 14 pages, Accepted by ACL 2024

  34. arXiv:2403.05887  [pdf, other

    eess.AS

    Aligning Speech to Languages to Enhance Code-switching Speech Recognition

    Authors: Hexin Liu, Xiangyu Zhang, Haoyang Zhang, Leibny Paola Garcia, Andy W. H. Khong, Eng Siong Chng, Shinji Watanabe

    Abstract: Code-switching (CS) refers to the switching of languages within a speech signal and results in language confusion for automatic speech recognition (ASR). To address language confusion, we introduce a novel language alignment loss into ASR training to align acoustic features to pseudo-language labels learned from the ASR decoder. This approach enables frame-level language identification without the… ▽ More

    Submitted 2 February, 2025; v1 submitted 9 March, 2024; originally announced March 2024.

    Comments: Manuscript submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

  35. arXiv:2402.10642  [pdf, other

    eess.AS cs.AI

    Speaking in Wavelet Domain: A Simple and Efficient Approach to Speed up Speech Diffusion Model

    Authors: Xiangyu Zhang, Daijiao Liu, Hexin Liu, Qiquan Zhang, Hanyu Meng, Leibny Paola Garcia, Eng Siong Chng, Lina Yao

    Abstract: Recently, Denoising Diffusion Probabilistic Models (DDPMs) have attained leading performances across a diverse range of generative tasks. However, in the field of speech synthesis, although DDPMs exhibit impressive performance, their long training duration and substantial inference costs hinder practical deployment. Existing approaches primarily focus on enhancing inference speed, while approaches… ▽ More

    Submitted 23 September, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

  36. arXiv:2402.06894  [pdf, other

    cs.CL cs.AI cs.LG cs.SD eess.AS

    GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators

    Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Dong Zhang, Zhehuai Chen, Eng Siong Chng

    Abstract: Recent advances in large language models (LLMs) have stepped forward the development of multilingual speech and machine translation by its reduced representation errors and incorporated external knowledge. However, both translation tasks typically utilize beam search decoding and top-1 hypothesis selection for inference. These techniques struggle to fully exploit the rich information in the divers… ▽ More

    Submitted 16 May, 2024; v1 submitted 10 February, 2024; originally announced February 2024.

    Comments: 18 pages, Accepted by ACL 2024. This work is open sourced at: https://github.com/YUCHEN005/GenTranslate

  37. arXiv:2402.05457  [pdf, other

    cs.CL cs.AI cs.MM cs.SD eess.AS

    It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition

    Authors: Chen Chen, Ruizhe Li, Yuchen Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, Ensiong Chng, Chao-Han Huck Yang

    Abstract: Recent studies have successfully shown that large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output. Specifically, an LLM is utilized to carry out a direct mapping from the N-best hypotheses list generated by an ASR system to the predicted output transcription. However, despite its effectiveness, GER introd… ▽ More

    Submitted 8 February, 2024; originally announced February 2024.

    Comments: Accepted to ICLR 2024, 17 pages. This work will be open sourced under MIT license

  38. arXiv:2401.10446  [pdf, other

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Large Language Models are Efficient Learners of Noise-Robust Speech Recognition

    Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Chao Zhang, Pin-Yu Chen, EnSiong Chng

    Abstract: Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which leverages the rich linguistic knowledge and powerful reasoning ability of LLMs to improve recognition results. The latest work proposes a GER benchmark with HyPoradise dataset to learn the mapping from ASR N-best hypotheses to ground-truth transcription by e… ▽ More

    Submitted 18 January, 2024; originally announced January 2024.

    Comments: Accepted to ICLR 2024, Spotlight top 5%, 24 pages. This work will be open sourced at: https://github.com/YUCHEN005/RobustGER under MIT license

  39. arXiv:2401.03473  [pdf, ps, other

    cs.SD cs.AI eess.AS

    ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge

    Authors: He Wang, Pengcheng Guo, Yue Li, Ao Zhang, Jiayao Sun, Lei Xie, Wei Chen, Pan Zhou, Hui Bu, Xin Xu, Binbin Zhang, Zhuo Chen, Jian Wu, Longbiao Wang, Eng Siong Chng, Sun Li

    Abstract: To promote speech processing and recognition research in driving scenarios, we build on the success of the Intelligent Cockpit Speech Recognition Challenge (ICSRC) held at ISCSLP 2022 and launch the ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge. This challenge collects over 100 hours of multi-channel speech data recorded inside a new energy vehicle and 40 hours… ▽ More

    Submitted 20 February, 2024; v1 submitted 7 January, 2024; originally announced January 2024.

    Comments: Accepted at ICASSP 2024

  40. arXiv:2312.12153  [pdf, other

    cs.SD eess.AS

    Noise robust distillation of self-supervised speech models via correlation metrics

    Authors: Fabian Ritter-Gutierrez, Kuan-Po Huang, Dianwen Ng, Jeremy H. M. Wong, Hung-yi Lee, Eng Siong Chng, Nancy F. Chen

    Abstract: Compared to large speech foundation models, small distilled models exhibit degraded noise robustness. The student's robustness can be improved by introducing noise at the inputs during pre-training. Despite this, using the standard distillation loss still yields a student with degraded performance. Thus, this paper proposes improving student robustness via distillation with correlation metrics. Te… ▽ More

    Submitted 19 December, 2023; originally announced December 2023.

    Comments: 6 pages

  41. arXiv:2311.17382  [pdf, other

    eess.AS

    Adapting OpenAI's Whisper for Speech Recognition on Code-Switch Mandarin-English SEAME and ASRU2019 Datasets

    Authors: Yuhang Yang, Yizhou Peng, Xionghu Zhong, Hao Huang, Eng Siong Chng

    Abstract: This paper details the experimental results of adapting the OpenAI's Whisper model for Code-Switch Mandarin-English Speech Recognition (ASR) on the SEAME and ASRU2019 corpora. We conducted 2 experiments: a) using adaptation data from 1 to 100/200 hours to demonstrate effectiveness of adaptation, b) examining different language ID setup on Whisper prompt. The Mixed Error Rate results show that th… ▽ More

    Submitted 29 November, 2023; originally announced November 2023.

    Comments: 6 pages, 3 figures, 4 tables

  42. arXiv:2310.13013  [pdf, other

    cs.CL cs.AI cs.SD eess.AS

    Generative error correction for code-switching speech recognition using large language models

    Authors: Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Hexin Liu, Sabato Marco Siniscalchi, Eng Siong Chng

    Abstract: Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence. Despite the recent advances in automatic speech recognition (ASR), CS-ASR is still a challenging task ought to the grammatical structure complexity of the phenomenon and the data scarcity of specific training corpus. In this work, we propose to leverage large language models (LLMs) and lis… ▽ More

    Submitted 17 October, 2023; originally announced October 2023.

    Comments: Submitted to ICASSP2024

  43. arXiv:2309.15701  [pdf, other

    cs.CL cs.AI cs.LG cs.SD eess.AS

    HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models

    Authors: Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Sabato Macro Siniscalchi, Pin-Yu Chen, Eng Siong Chng

    Abstract: Advancements in deep neural networks have allowed automatic speech recognition (ASR) systems to attain human parity on several publicly available clean speech datasets. However, even state-of-the-art ASR systems experience performance degradation when confronted with adverse conditions, as a well-trained acoustic model is sensitive to variations in the speech domain, e.g., background noise. Intuit… ▽ More

    Submitted 16 October, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

    Comments: Accepted to NeurIPS 2023, 24 pages. Datasets and Benchmarks Track. Added the first Mandarin and code-switching (zh-cn and en-us) results from the LLM-based generative ASR error correction to Table 8 on Page 21

  44. Emphasized Non-Target Speaker Knowledge in Knowledge Distillation for Automatic Speaker Verification

    Authors: Duc-Tuan Truong, Ruijie Tao, Jia Qi Yip, Kong Aik Lee, Eng Siong Chng

    Abstract: Knowledge distillation (KD) is used to enhance automatic speaker verification performance by ensuring consistency between large teacher networks and lightweight student networks at the embedding level or label level. However, the conventional label-level KD overlooks the significant knowledge from non-target speakers, particularly their classification probabilities, which can be crucial for automa… ▽ More

    Submitted 14 January, 2024; v1 submitted 26 September, 2023; originally announced September 2023.

    Comments: Accepted by ICASSP 2024

    Journal ref: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 10336-10340

  45. arXiv:2309.12608  [pdf, other

    eess.AS cs.SD

    SPGM: Prioritizing Local Features for enhanced speech separation performance

    Authors: Jia Qi Yip, Shengkui Zhao, Yukun Ma, Chongjia Ni, Chong Zhang, Hao Wang, Trung Hieu Nguyen, Kun Zhou, Dianwen Ng, Eng Siong Chng, Bin Ma

    Abstract: Dual-path is a popular architecture for speech separation models (e.g. Sepformer) which splits long sequences into overlapping chunks for its intra- and inter-blocks that separately model intra-chunk local features and inter-chunk global relationships. However, it has been found that inter-blocks, which comprise half a dual-path model's parameters, contribute minimally to performance. Thus, we pro… ▽ More

    Submitted 10 March, 2024; v1 submitted 21 September, 2023; originally announced September 2023.

    Comments: This paper was accepted by ICASSP 2024

  46. arXiv:2309.09413  [pdf, other

    cs.SD eess.AS

    Are Soft Prompts Good Zero-shot Learners for Speech Recognition?

    Authors: Dianwen Ng, Chong Zhang, Ruixi Zhang, Yukun Ma, Fabian Ritter-Gutierrez, Trung Hieu Nguyen, Chongjia Ni, Shengkui Zhao, Eng Siong Chng, Bin Ma

    Abstract: Large self-supervised pre-trained speech models require computationally expensive fine-tuning for downstream tasks. Soft prompt tuning offers a simple parameter-efficient alternative by utilizing minimal soft prompt guidance, enhancing portability while also maintaining competitive performance. However, not many people understand how and why this is so. In this study, we aim to deepen our understa… ▽ More

    Submitted 17 September, 2023; originally announced September 2023.

  47. arXiv:2309.07466  [pdf, other

    eess.AS cs.SD

    Codec Data Augmentation for Time-domain Heart Sound Classification

    Authors: Ansh Mishra, Jia Qi Yip, Eng Siong Chng

    Abstract: Heart auscultations are a low-cost and effective way of detecting valvular heart diseases early, which can save lives. Nevertheless, it has been difficult to scale this screening method since the effectiveness of auscultations is dependent on the skill of doctors. As such, there has been increasing research interest in the automatic classification of heart sounds using deep learning algorithms. Ho… ▽ More

    Submitted 14 September, 2023; originally announced September 2023.

    Comments: Accepted by ICAICTA 2023

  48. arXiv:2307.08029  [pdf, other

    eess.AS cs.LG cs.SD

    Noise-aware Speech Enhancement using Diffusion Probabilistic Model

    Authors: Yuchen Hu, Chen Chen, Ruizhe Li, Qiushi Zhu, Eng Siong Chng

    Abstract: With recent advances of diffusion model, generative speech enhancement (SE) has attracted a surge of research interest due to its great potential for unseen testing noises. However, existing efforts mainly focus on inherent properties of clean speech, underexploiting the varying noise information in real world. In this paper, we propose a noise-aware speech enhancement (NASE) approach that extract… ▽ More

    Submitted 4 June, 2024; v1 submitted 16 July, 2023; originally announced July 2023.

    Comments: 5 pages, 2 figures, Accepted by InterSpeech 2024

  49. arXiv:2306.10567  [pdf, other

    eess.AS cs.CV cs.MM cs.SD

    MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition

    Authors: Yuchen Hu, Chen Chen, Ruizhe Li, Heqing Zou, Eng Siong Chng

    Abstract: Audio-visual speech recognition (AVSR) attracts a surge of research interest recently by leveraging multimodal signals to understand human speech. Mainstream approaches addressing this task have developed sophisticated architectures and techniques for multi-modality fusion and representation learning. However, the natural heterogeneity of different modalities causes distribution gap between their… ▽ More

    Submitted 18 June, 2023; originally announced June 2023.

    Comments: 14 pages, 5 figures, Accepted by ACL 2023

  50. arXiv:2306.10563  [pdf, other

    eess.AS cs.CV cs.MM cs.SD

    Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition

    Authors: Yuchen Hu, Ruizhe Li, Chen Chen, Chengwei Qin, Qiushi Zhu, Eng Siong Chng

    Abstract: Audio-visual speech recognition (AVSR) provides a promising solution to ameliorate the noise-robustness of audio-only speech recognition with visual information. However, most existing efforts still focus on audio modality to improve robustness considering its dominance in AVSR task, with noise adaptation techniques such as front-end denoise processing. Though effective, these methods are usually… ▽ More

    Submitted 18 June, 2023; originally announced June 2023.

    Comments: 19 pages, 9 figures, Accepted by ACL 2023