Skip to main content

Showing 1–41 of 41 results for author: Geng, M

Searching in archive eess. Search in all archives.
.
  1. arXiv:2506.11069  [pdf, ps, other

    eess.AS cs.AI cs.CL cs.SD

    Regularized Federated Learning for Privacy-Preserving Dysarthric and Elderly Speech Recognition

    Authors: Tao Zhong, Mengzhe Geng, Shujie Hu, Guinan Li, Xunying Liu

    Abstract: Accurate recognition of dysarthric and elderly speech remains challenging to date. While privacy concerns have driven a shift from centralized approaches to federated learning (FL) to ensure data confidentiality, this further exacerbates the challenges of data scarcity, imbalanced data distribution and speaker heterogeneity. To this end, this paper conducts a systematic investigation of regularize… ▽ More

    Submitted 1 June, 2025; originally announced June 2025.

  2. arXiv:2505.24224  [pdf, ps, other

    eess.AS

    MOPSA: Mixture of Prompt-Experts Based Speaker Adaptation for Elderly Speech Recognition

    Authors: Chengxi Deng, Xurong Xie, Shujie Hu, Mengzhe Geng, Yicong Jiang, Jiankun Zhao, Jiajun Deng, Guinan Li, Youjun Chen, Huimeng Wang, Haoning Xu, Mingyu Cui, Xunying Liu

    Abstract: This paper proposes a novel Mixture of Prompt-Experts based Speaker Adaptation approach (MOPSA) for elderly speech recognition. It allows zero-shot, real-time adaptation to unseen speakers, and leverages domain knowledge tailored to elderly speakers. Top-K most distinctive speaker prompt clusters derived using K-means serve as experts. A router network is trained to dynamically combine clustered p… ▽ More

    Submitted 30 May, 2025; originally announced May 2025.

    Comments: Accepted by Interspeech 2025

  3. arXiv:2505.23236  [pdf, ps, other

    cs.SD cs.HC eess.AS

    Towards LLM-Empowered Fine-Grained Speech Descriptors for Explainable Emotion Recognition

    Authors: Youjun Chen, Xurong Xie, Haoning Xu, Mengzhe Geng, Guinan Li, Chengxi Deng, Huimeng Wang, Shujie Hu, Xunying Liu

    Abstract: This paper presents a novel end-to-end LLM-empowered explainable speech emotion recognition (SER) approach. Fine-grained speech emotion descriptor (SED) features, e.g., pitch, tone and emphasis, are disentangled from HuBERT SSL representations via alternating LLM fine-tuning to joint SER-SED prediction and ASR tasks. VAE compressed HuBERT features derived via Information Bottleneck (IB) are used t… ▽ More

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: Accepted by INTERSPEECH2025

  4. arXiv:2505.22608  [pdf, ps, other

    cs.SD cs.AI eess.AS

    Effective and Efficient One-pass Compression of Speech Foundation Models Using Sparsity-aware Self-pinching Gates

    Authors: Haoning Xu, Zhaoqing Li, Youjun Chen, Huimeng Wang, Guinan Li, Mengzhe Geng, Chengxi Deng, Xunying Liu

    Abstract: This paper presents a novel approach for speech foundation models compression that tightly integrates model pruning and parameter update into a single stage. Highly compact layer-level tied self-pinching gates each containing only a single learnable threshold are jointly trained with uncompressed models and used in fine-grained neuron level pruning. Experiments conducted on the LibriSpeech-100hr c… ▽ More

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: Submitted to Interspeech 2025

  5. arXiv:2505.22072  [pdf, other

    cs.SD eess.AS

    On-the-fly Routing for Zero-shot MoE Speaker Adaptation of Speech Foundation Models for Dysarthric Speech Recognition

    Authors: Shujie HU, Xurong Xie, Mengzhe Geng, Jiajun Deng, Huimeng Wang, Guinan Li, Chengxi Deng, Tianzi Wang, Mingyu Cui, Helen Meng, Xunying Liu

    Abstract: This paper proposes a novel MoE-based speaker adaptation framework for foundation models based dysarthric speech recognition. This approach enables zero-shot adaptation and real-time processing while incorporating domain knowledge. Speech impairment severity and gender conditioned adapter experts are dynamically combined using on-the-fly predicted speaker-dependent routing parameters. KL-divergenc… ▽ More

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: Accepted by Interspeech 2025

  6. arXiv:2504.10497  [pdf, other

    cs.IR cs.AI cs.HC cs.MA eess.SY

    Exploring Generative AI Techniques in Government: A Case Study

    Authors: Sunyi Liu, Mengzhe Geng, Rebecca Hart

    Abstract: The swift progress of Generative Artificial intelligence (GenAI), notably Large Language Models (LLMs), is reshaping the digital landscape. Recognizing this transformative potential, the National Research Council of Canada (NRC) launched a pilot initiative to explore the integration of GenAI techniques into its daily operation for performance excellence, where 22 projects were launched in May 2024… ▽ More

    Submitted 12 May, 2025; v1 submitted 6 April, 2025; originally announced April 2025.

    Comments: In submission to IEEE Intelligent Systems

  7. arXiv:2501.04379  [pdf, other

    cs.SD eess.AS

    Phone-purity Guided Discrete Tokens for Dysarthric Speech Recognition

    Authors: Huimeng Wang, Xurong Xie, Mengzhe Geng, Shujie Hu, Haoning Xu, Youjun Chen, Zhaoqing Li, Jiajun Deng, Xunying Liu

    Abstract: Discrete tokens extracted provide efficient and domain adaptable speech features. Their application to disordered speech that exhibits articulation imprecision and large mismatch against normal voice remains unexplored. To improve their phonetic discrimination that is weakened during unsupervised K-means or vector quantization of continuous features, this paper proposes novel phone-purity guided (… ▽ More

    Submitted 8 January, 2025; originally announced January 2025.

    Comments: ICASSP 2025

  8. arXiv:2501.03643  [pdf, other

    cs.SD cs.AI eess.AS

    Effective and Efficient Mixed Precision Quantization of Speech Foundation Models

    Authors: Haoning Xu, Zhaoqing Li, Zengrui Jin, Huimeng Wang, Youjun Chen, Guinan Li, Mengzhe Geng, Shujie Hu, Jiajun Deng, Xunying Liu

    Abstract: This paper presents a novel mixed-precision quantization approach for speech foundation models that tightly integrates mixed-precision learning and quantized model parameter estimation into one single model compression stage. Experiments conducted on LibriSpeech dataset with fine-tuned wav2vec2.0-base and HuBERT-large models suggest the resulting mixed-precision quantized models increased the loss… ▽ More

    Submitted 11 January, 2025; v1 submitted 7 January, 2025; originally announced January 2025.

    Comments: To appear at IEEE ICASSP 2025

  9. arXiv:2412.18832  [pdf, other

    eess.AS cs.SD

    Structured Speaker-Deficiency Adaptation of Foundation Models for Dysarthric and Elderly Speech Recognition

    Authors: Shujie Hu, Xurong Xie, Mengzhe Geng, Jiajun Deng, Zengrui Jin, Tianzi Wang, Mingyu Cui, Guinan Li, Zhaoqing Li, Helen Meng, Xunying Liu

    Abstract: Data-intensive fine-tuning of speech foundation models (SFMs) to scarce and diverse dysarthric and elderly speech leads to data bias and poor generalization to unseen speakers. This paper proposes novel structured speaker-deficiency adaptation approaches for SSL pre-trained SFMs on such data. Speaker and speech deficiency invariant SFMs were constructed in their supervised adaptive fine-tuning sta… ▽ More

    Submitted 25 December, 2024; originally announced December 2024.

  10. arXiv:2407.13782  [pdf, other

    eess.AS cs.AI cs.SD

    Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition

    Authors: Shujie Hu, Xurong Xie, Mengzhe Geng, Zengrui Jin, Jiajun Deng, Guinan Li, Yi Wang, Mingyu Cui, Tianzi Wang, Helen Meng, Xunying Liu

    Abstract: Self-supervised learning (SSL) based speech foundation models have been applied to a wide range of ASR tasks. However, their application to dysarthric and elderly speech via data-intensive parameter fine-tuning is confronted by in-domain data scarcity and mismatch. To this end, this paper explores a series of approaches to integrate domain fine-tuned SSL pre-trained models and their features into… ▽ More

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing

  11. arXiv:2407.06310  [pdf, other

    cs.SD cs.AI cs.HC cs.LG eess.AS

    Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation

    Authors: Mengzhe Geng, Xurong Xie, Jiajun Deng, Zengrui Jin, Guinan Li, Tianzi Wang, Shujie Hu, Zhaoqing Li, Helen Meng, Xunying Liu

    Abstract: The application of data-intensive automatic speech recognition (ASR) technologies to dysarthric and elderly adult speech is confronted by their mismatch against healthy and nonaged voices, data scarcity and large speaker-level variability. To this end, this paper proposes two novel data-efficient methods to learn homogeneous dysarthric and elderly speaker-level features for rapid, on-the-fly test-… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: In submission to IEEE/ACM Transactions on Audio, Speech, and Language Processing

  12. arXiv:2406.10160  [pdf, other

    cs.SD cs.AI eess.AS

    One-pass Multiple Conformer and Foundation Speech Systems Compression and Quantization Using An All-in-one Neural Model

    Authors: Zhaoqing Li, Haoning Xu, Tianzi Wang, Shoukang Hu, Zengrui Jin, Shujie Hu, Jiajun Deng, Mingyu Cui, Mengzhe Geng, Xunying Liu

    Abstract: We propose a novel one-pass multiple ASR systems joint compression and quantization approach using an all-in-one neural model. A single compression cycle allows multiple nested systems with varying Encoder depths, widths, and quantization precision settings to be simultaneously constructed without the need to train and store individual target systems separately. Experiments consistently demonstrat… ▽ More

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  13. arXiv:2406.10152  [pdf, other

    cs.SD eess.AS

    Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition

    Authors: Guinan Li, Jiajun Deng, Youjun Chen, Mengzhe Geng, Shujie Hu, Zhe Li, Zengrui Jin, Tianzi Wang, Xurong Xie, Helen Meng, Xunying Liu

    Abstract: This paper proposes joint speaker feature learning methods for zero-shot adaptation of audio-visual multichannel speech separation and recognition systems. xVector and ECAPA-TDNN speaker encoders are connected using purpose-built fusion blocks and tightly integrated with the complete system training. Experiments conducted on LRS3-TED data simulated multichannel overlapped speech suggest that joint… ▽ More

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  14. arXiv:2406.10034  [pdf, other

    cs.SD cs.AI eess.AS

    Towards Effective and Efficient Non-autoregressive Decoding Using Block-based Attention Mask

    Authors: Tianzi Wang, Xurong Xie, Zhaoqing Li, Shoukang Hu, Zengrui Jin, Jiajun Deng, Mingyu Cui, Shujie Hu, Mengzhe Geng, Guinan Li, Helen Meng, Xunying Liu

    Abstract: This paper proposes a novel non-autoregressive (NAR) block-based Attention Mask Decoder (AMD) that flexibly balances performance-efficiency trade-offs for Conformer ASR systems. AMD performs parallel NAR inference within contiguous blocks of output labels that are concealed using attention masks, while conducting left-to-right AR prediction and history context amalgamation between blocks. A beam s… ▽ More

    Submitted 30 August, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

    Comments: 5 pages, 2 figures, 2 tables, Interspeech24 conference

  15. arXiv:2406.08911  [pdf, other

    cs.CL eess.AS

    An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios

    Authors: Cheng Gong, Erica Cooper, Xin Wang, Chunyu Qiang, Mengzhe Geng, Dan Wells, Longbiao Wang, Jianwu Dang, Marc Tessier, Aidan Pine, Korin Richmond, Junichi Yamagishi

    Abstract: Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system proposed in our previous work. We conducted experiments on… ▽ More

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  16. arXiv:2405.05814  [pdf

    eess.IV cs.CV

    MSDiff: Multi-Scale Diffusion Model for Ultra-Sparse View CT Reconstruction

    Authors: Pinhuang Tan, Mengxiao Geng, Jingya Lu, Liu Shi, Bin Huang, Qiegen Liu

    Abstract: Computed Tomography (CT) technology reduces radiation haz-ards to the human body through sparse sampling, but fewer sampling angles pose challenges for image reconstruction. Score-based generative models are widely used in sparse-view CT re-construction, performance diminishes significantly with a sharp reduction in projection angles. Therefore, we propose an ultra-sparse view CT reconstruction me… ▽ More

    Submitted 9 May, 2024; originally announced May 2024.

  17. arXiv:2401.00662  [pdf, other

    cs.SD eess.AS

    Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation

    Authors: Huimeng Wang, Zengrui Jin, Mengzhe Geng, Shujie Hu, Guinan Li, Tianzi Wang, Haoning Xu, Xunying Liu

    Abstract: Automatic recognition of dysarthric speech remains a highly challenging task to date. Neuro-motor conditions and co-occurring physical disabilities create difficulty in large-scale data collection for ASR system development. Adapting SSL pre-trained ASR models to limited dysarthric speech via data-intensive parameter fine-tuning leads to poor generalization. To this end, this paper presents an ext… ▽ More

    Submitted 31 December, 2023; originally announced January 2024.

    Comments: To appear at IEEE ICASSP 2024

  18. arXiv:2312.08641  [pdf, other

    eess.AS cs.SD

    Towards Automatic Data Augmentation for Disordered Speech Recognition

    Authors: Zengrui Jin, Xurong Xie, Tianzi Wang, Mengzhe Geng, Jiajun Deng, Guinan Li, Shujie Hu, Xunying Liu

    Abstract: Automatic recognition of disordered speech remains a highly challenging task to date due to data scarcity. This paper presents a reinforcement learning (RL) based on-the-fly data augmentation approach for training state-of-the-art PyChain TDNN and end-to-end Conformer ASR systems on such data. The handcrafted temporal and spectral mask operations in the standard SpecAugment method that are task an… ▽ More

    Submitted 13 December, 2023; originally announced December 2023.

    Comments: To appear at IEEE ICASSP 2024

  19. arXiv:2308.03018  [pdf, other

    cs.CV eess.IV

    Recurrent Spike-based Image Restoration under General Illumination

    Authors: Lin Zhu, Yunlong Zheng, Mengyue Geng, Lizhi Wang, Hua Huang

    Abstract: Spike camera is a new type of bio-inspired vision sensor that records light intensity in the form of a spike array with high temporal resolution (20,000 Hz). This new paradigm of vision sensor offers significant advantages for many vision tasks such as high speed image reconstruction. However, existing spike-based approaches typically assume that the scenes are with sufficient light intensity, whi… ▽ More

    Submitted 6 August, 2023; originally announced August 2023.

    Comments: Accepted by ACM MM 2023

  20. arXiv:2307.02909  [pdf, other

    eess.AS cs.AI cs.SD

    Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

    Authors: Guinan Li, Jiajun Deng, Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Mingyu Cui, Helen Meng, Xunying Liu

    Abstract: Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberation remains a highly challenging task to date. Motivated by the invariance of visual modality to acoustic signal corruption, an audio-visual multi-channel speech separation, dereverberation and recognition approach featuring a full incorporation of visual information into all system components is pro… ▽ More

    Submitted 6 July, 2023; originally announced July 2023.

    Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing

  21. arXiv:2306.15265  [pdf, other

    eess.AS cs.LG

    Hyper-parameter Adaptation of Conformer ASR Systems for Elderly and Dysarthric Speech Recognition

    Authors: Tianzi Wang, Shoukang Hu, Jiajun Deng, Zengrui Jin, Mengzhe Geng, Yi Wang, Helen Meng, Xunying Liu

    Abstract: Automatic recognition of disordered and elderly speech remains highly challenging tasks to date due to data scarcity. Parameter fine-tuning is often used to exploit the large quantities of non-aged and healthy speech pre-trained models, while neural architecture hyper-parameters are set using expert knowledge and remain unchanged. This paper investigates hyper-parameter adaptation for Conformer AS… ▽ More

    Submitted 27 June, 2023; originally announced June 2023.

    Comments: 5 pages, 3 figures, 3 tables, accepted by Interspeech2023

  22. arXiv:2306.14608  [pdf, other

    eess.AS cs.CL

    Factorised Speaker-environment Adaptive Training of Conformer Speech Recognition Systems

    Authors: Jiajun Deng, Guinan Li, Xurong Xie, Zengrui Jin, Mingyu Cui, Tianzi Wang, Shujie Hu, Mengzhe Geng, Xunying Liu

    Abstract: Rich sources of variability in natural speech present significant challenges to current data intensive speech recognition technologies. To model both speaker and environment level diversity, this paper proposes a novel Bayesian factorised speaker-environment adaptive training and test time adaptation approach for Conformer ASR models. Speaker and environment level characteristics are separately mo… ▽ More

    Submitted 26 June, 2023; originally announced June 2023.

    Comments: Accepted by INTERSPEECH 2023

  23. arXiv:2305.10659  [pdf, other

    eess.AS cs.AI cs.LG cs.SD

    Use of Speech Impairment Severity for Dysarthric Speech Recognition

    Authors: Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Jiajun Deng, Mingyu Cui, Guinan Li, Jianwei Yu, Xurong Xie, Xunying Liu

    Abstract: A key challenge in dysarthric speech recognition is the speaker-level diversity attributed to both speaker-identity associated factors such as gender, and speech impairment severity. Most prior researches on addressing this issue focused on using speaker-identity only. To this end, this paper proposes a novel set of techniques to use both severity and speaker-identity in dysarthric speech recognit… ▽ More

    Submitted 17 May, 2023; originally announced May 2023.

    Comments: Accepted to INTERSPEECH2023

  24. arXiv:2302.14564  [pdf, other

    cs.SD cs.AI eess.AS

    Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition

    Authors: Shujie Hu, Xurong Xie, Zengrui Jin, Mengzhe Geng, Yi Wang, Mingyu Cui, Jiajun Deng, Xunying Liu, Helen Meng

    Abstract: Automatic recognition of disordered and elderly speech remains a highly challenging task to date due to the difficulty in collecting such data in large quantities. This paper explores a series of approaches to integrate domain adapted SSL pre-trained models into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition: a) input feature fusion between standard acoustic frontends… ▽ More

    Submitted 22 June, 2023; v1 submitted 28 February, 2023; originally announced February 2023.

    Comments: accepted by ICASSP 2023

  25. arXiv:2211.01646  [pdf, other

    eess.AS cs.SD

    Adversarial Data Augmentation Using VAE-GAN for Disordered Speech Recognition

    Authors: Zengrui Jin, Xurong Xie, Mengzhe Geng, Tianzi Wang, Shujie Hu, Jiajun Deng, Guinan Li, Xunying Liu

    Abstract: Automatic recognition of disordered speech remains a highly challenging task to date. The underlying neuro-motor conditions, often compounded with co-occurring physical disabilities, lead to the difficulty in collecting large quantities of impaired speech required for ASR system development. This paper presents novel variational auto-encoder generative adversarial network (VAE-GAN) based personali… ▽ More

    Submitted 19 March, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  26. arXiv:2206.13232  [pdf, other

    eess.AS cs.LG cs.SD

    Conformer Based Elderly Speech Recognition System for Alzheimer's Disease Detection

    Authors: Tianzi Wang, Jiajun Deng, Mengzhe Geng, Zi Ye, Shoukang Hu, Yi Wang, Mingyu Cui, Zengrui Jin, Xunying Liu, Helen Meng

    Abstract: Early diagnosis of Alzheimer's disease (AD) is crucial in facilitating preventive care to delay further progression. This paper presents the development of a state-of-the-art Conformer based speech recognition system built on the DementiaBank Pitt corpus for automatic AD detection. The baseline Conformer system trained with speed perturbation and SpecAugment based data augmentation is significantl… ▽ More

    Submitted 23 June, 2022; originally announced June 2022.

    Comments: 5 pages, 1 figure, accepted by INTERSPEECH 2022

  27. arXiv:2206.12045  [pdf, other

    eess.AS cs.SD

    Confidence Score Based Conformer Speaker Adaptation for Speech Recognition

    Authors: Jiajun Deng, Xurong Xie, Tianzi Wang, Mingyu Cui, Boyang Xue, Zengrui Jin, Mengzhe Geng, Guinan Li, Xunying Liu, Helen Meng

    Abstract: A key challenge for automatic speech recognition (ASR) systems is to model the speaker level variability. In this paper, compact speaker dependent learning hidden unit contributions (LHUC) are used to facilitate both speaker adaptive training (SAT) and test time unsupervised speaker adaptation for state-of-the-art Conformer based end-to-end ASR systems. The sensitivity during adaptation to supervi… ▽ More

    Submitted 23 June, 2022; originally announced June 2022.

    Comments: It's accepted to INTERSPEECH 2022. arXiv admin note: text overlap with arXiv:2206.11596

  28. Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems

    Authors: Mingyu Cui, Jiajun Deng, Shoukang Hu, Xurong Xie, Tianzi Wang, Shujie Hu, Mengzhe Geng, Boyang Xue, Xunying Liu, Helen Meng

    Abstract: Fundamental modelling differences between hybrid and end-to-end (E2E) automatic speech recognition (ASR) systems create large diversity and complementarity among them. This paper investigates multi-pass rescoring and cross adaptation based system combination approaches for hybrid TDNN and Conformer E2E ASR systems. In multi-pass rescoring, state-of-the-art hybrid LF-MMI trained CNN-TDNN system fea… ▽ More

    Submitted 23 June, 2022; originally announced June 2022.

    Comments: It' s accepted to ISCA 2022

  29. arXiv:2206.07327  [pdf, other

    eess.AS cs.AI

    Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition

    Authors: Shujie Hu, Xurong Xie, Mengzhe Geng, Mingyu Cui, Jiajun Deng, Guinan Li, Tianzi Wang, Xunying Liu, Helen Meng

    Abstract: Articulatory features are inherently invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition (ASR) systems designed for normal speech. Their practical application to atypical task domains such as elderly and disordered speech across languages is often limited by the difficulty in collecting such specialist data from target speakers. This pa… ▽ More

    Submitted 22 June, 2023; v1 submitted 15 June, 2022; originally announced June 2022.

    Comments: accepted by INTERSPEECH 2023

  30. Personalized Adversarial Data Augmentation for Dysarthric and Elderly Speech Recognition

    Authors: Zengrui Jin, Mengzhe Geng, Jiajun Deng, Tianzi Wang, Shujie Hu, Guinan Li, Xunying Liu

    Abstract: Despite the rapid progress of automatic speech recognition (ASR) technologies targeting normal speech, accurate recognition of dysarthric and elderly speech remains highly challenging tasks to date. It is difficult to collect large quantities of such data for ASR system development due to the mobility issues often found among these users. To this end, data augmentation techniques play a vital role… ▽ More

    Submitted 23 June, 2022; v1 submitted 13 May, 2022; originally announced May 2022.

    Comments: arXiv admin note: text overlap with arXiv:2202.10290

  31. arXiv:2203.14593  [pdf, other

    eess.AS cs.AI cs.LG cs.SD

    On-the-Fly Feature Based Rapid Speaker Adaptation for Dysarthric and Elderly Speech Recognition

    Authors: Mengzhe Geng, Xurong Xie, Rongfeng Su, Jianwei Yu, Zengrui Jin, Tianzi Wang, Shujie Hu, Zi Ye, Helen Meng, Xunying Liu

    Abstract: Accurate recognition of dysarthric and elderly speech remain challenging tasks to date. Speaker-level heterogeneity attributed to accent or gender, when aggregated with age and speech impairment, create large diversity among these speakers. Scarcity of speaker-level data limits the practical use of data-intensive model based speaker adaptation methods. To this end, this paper proposes two novel fo… ▽ More

    Submitted 28 May, 2023; v1 submitted 28 March, 2022; originally announced March 2022.

    Comments: Accepted to INTERSPEECH 2023

  32. arXiv:2203.10274  [pdf, other

    eess.AS cs.AI

    Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition

    Authors: Shujie Hu, Shansong Liu, Xurong Xie, Mengzhe Geng, Tianzi Wang, Shoukang Hu, Mingyu Cui, Xunying Liu, Helen Meng

    Abstract: Articulatory features are inherently invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition (ASR) systems for normal speech. Their practical application to disordered speech recognition is often limited by the difficulty in collecting such specialist data from impaired speakers. This paper presents a cross-domain acoustic-to-articulatory (… ▽ More

    Submitted 19 March, 2022; originally announced March 2022.

    Comments: accepted by ICASSP 2022

  33. arXiv:2202.10290  [pdf, other

    eess.AS cs.AI cs.LG cs.SD q-bio.QM

    Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition

    Authors: Mengzhe Geng, Xurong Xie, Zi Ye, Tianzi Wang, Guinan Li, Shujie Hu, Xunying Liu, Helen Meng

    Abstract: Despite the rapid progress of automatic speech recognition (ASR) technologies targeting normal speech in recent decades, accurate recognition of dysarthric and elderly speech remains highly challenging tasks to date. Sources of heterogeneity commonly found in normal speech including accent or gender, when further compounded with the variability over age and speech pathology severity level, create… ▽ More

    Submitted 17 March, 2022; v1 submitted 21 February, 2022; originally announced February 2022.

    Comments: In submission to IEEE/ACM Transactions on Audio Speech and Language Processing

  34. arXiv:2201.05845  [pdf, other

    eess.AS cs.AI cs.LG cs.SD

    Recent Progress in the CUHK Dysarthric Speech Recognition System

    Authors: Shansong Liu, Mengzhe Geng, Shoukang Hu, Xurong Xie, Mingyu Cui, Jianwei Yu, Xunying Liu, Helen Meng

    Abstract: Despite the rapid progress of automatic speech recognition (ASR) technologies in the past few decades, recognition of disordered speech remains a highly challenging task to date. Disordered speech presents a wide spectrum of challenges to current data intensive deep neural networks (DNNs) based ASR technologies that predominantly target normal speech. This paper presents recent research efforts at… ▽ More

    Submitted 26 February, 2022; v1 submitted 15 January, 2022; originally announced January 2022.

  35. arXiv:2201.05562  [pdf, other

    cs.SD cs.AI cs.LG eess.AS

    Investigation of Data Augmentation Techniques for Disordered Speech Recognition

    Authors: Mengzhe Geng, Xurong Xie, Shansong Liu, Jianwei Yu, Shoukang Hu, Xunying Liu, Helen Meng

    Abstract: Disordered speech recognition is a highly challenging task. The underlying neuro-motor conditions of people with speech disorders, often compounded with co-occurring physical disabilities, lead to the difficulty in collecting large quantities of speech required for system development. This paper investigates a set of data augmentation techniques for disordered speech recognition, including vocal t… ▽ More

    Submitted 14 January, 2022; originally announced January 2022.

    Comments: Proceedings of INTERSPEECH 2020

  36. arXiv:2201.05554  [pdf, other

    cs.SD cs.AI cs.LG eess.AS

    Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition

    Authors: Mengzhe Geng, Shansong Liu, Jianwei Yu, Xurong Xie, Shoukang Hu, Zi Ye, Zengrui Jin, Xunying Liu, Helen Meng

    Abstract: Automatic recognition of disordered speech remains a highly challenging task to date. Sources of variability commonly found in normal speech including accent, age or gender, when further compounded with the underlying causes of speech impairment and varying severity levels, create large diversity among speakers. To this end, speaker adaptation techniques play a vital role in current speech recogni… ▽ More

    Submitted 14 January, 2022; originally announced January 2022.

    Comments: Proceedings of INTERSPEECH 2021

  37. arXiv:2201.03943  [pdf, other

    eess.AS cs.SD

    Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks

    Authors: Shoukang Hu, Xurong Xie, Mingyu Cui, Jiajun Deng, Shansong Liu, Jianwei Yu, Mengzhe Geng, Xunying Liu, Helen Meng

    Abstract: State-of-the-art automatic speech recognition (ASR) system development is data and computation intensive. The optimal design of deep neural networks (DNNs) for these systems often require expert knowledge and empirical evaluation. In this paper, a range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper-parameters of factored time delay neural network… ▽ More

    Submitted 28 March, 2022; v1 submitted 8 January, 2022; originally announced January 2022.

    Comments: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP). arXiv admin note: text overlap with arXiv:2007.08818

  38. arXiv:2108.00899  [pdf, other

    eess.AS cs.SD

    Adversarial Data Augmentation for Disordered Speech Recognition

    Authors: Zengrui Jin, Mengzhe Geng, Xurong Xie, Jianwei Yu, Shansong Liu, Xunying Liu, Helen Meng

    Abstract: Automatic recognition of disordered speech remains a highly challenging task to date. The underlying neuro-motor conditions, often compounded with co-occurring physical disabilities, lead to the difficulty in collecting large quantities of impaired speech required for ASR system development. To this end, data augmentation techniques play a vital role in current disordered speech recognition system… ▽ More

    Submitted 2 August, 2021; originally announced August 2021.

    Comments: 5 pages, 3 figures, INTERSPEECH 2021

  39. arXiv:2012.04494  [pdf, other

    eess.AS

    Bayesian Learning of LF-MMI Trained Time Delay Neural Networks for Speech Recognition

    Authors: Shoukang Hu, Xurong Xie, Shansong Liu, Jianwei Yu, Zi Ye, Mengzhe Geng, Xunying Liu, Helen Meng

    Abstract: Discriminative training techniques define state-of-the-art performance for automatic speech recognition systems. However, they are inherently prone to overfitting, leading to poor generalization performance when using limited training data. In order to address this issue, this paper presents a full Bayesian framework to account for model uncertainty in sequence discriminative training of factored… ▽ More

    Submitted 10 May, 2021; v1 submitted 8 December, 2020; originally announced December 2020.

    Comments: Published in TASLP

  40. arXiv:2011.07755  [pdf, other

    eess.AS

    Audio-visual Multi-channel Integration and Recognition of Overlapped Speech

    Authors: Jianwei Yu, Shi-Xiong Zhang, Bo Wu, Shansong Liu, Shoukang Hu, Mengzhe Geng, Xunying Liu, Helen Meng, Dong Yu

    Abstract: Automatic speech recognition (ASR) technologies have been significantly advanced in the past few decades. However, recognition of overlapped speech remains a highly challenging task to date. To this end, multi-channel microphone array data are widely used in current ASR systems. Motivated by the invariance of visual modality to acoustic signal corruption and the additional cues they provide to sep… ▽ More

    Submitted 30 August, 2021; v1 submitted 16 November, 2020; originally announced November 2020.

    Comments: TASLP 2021

  41. arXiv:2007.08818  [pdf, other

    eess.AS cs.CL cs.LG cs.SD

    Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks

    Authors: Shoukang Hu, Xurong Xie, Shansong Liu, Mingyu Cui, Mengzhe Geng, Xunying Liu, Helen Meng

    Abstract: Deep neural networks (DNNs) based automatic speech recognition (ASR) systems are often designed using expert knowledge and empirical evaluation. In this paper, a range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper-parameters of state-of-the-art factored time delay neural networks (TDNNs): i) the left and right splicing context offsets; and ii) th… ▽ More

    Submitted 7 February, 2021; v1 submitted 17 July, 2020; originally announced July 2020.

    Comments: Accepted by ICASSP 2021