Search | arXiv e-print repository

Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement

Authors: Tuan-Nam Nguyen, Ngoc-Quan Pham, Seymanur Akti, Alexander Waibel

Abstract: We propose a first streaming accent conversion (AC) model that transforms non-native speech into a native-like accent while preserving speaker identity, prosody and improving pronunciation. Our approach enables stream processing by modifying a previous AC architecture with an Emformer encoder and an optimized inference mechanism. Additionally, we integrate a native text-to-speech (TTS) model to ge… ▽ More We propose a first streaming accent conversion (AC) model that transforms non-native speech into a native-like accent while preserving speaker identity, prosody and improving pronunciation. Our approach enables stream processing by modifying a previous AC architecture with an Emformer encoder and an optimized inference mechanism. Additionally, we integrate a native text-to-speech (TTS) model to generate ideal ground-truth data for efficient training. Our streaming AC model achieves comparable performance to the top AC models while maintaining stable latency, making it the first AC system capable of streaming. △ Less

Submitted 19 June, 2025; originally announced June 2025.

Comments: Accepted to INTERSPEECH 2025

arXiv:2506.16574 [pdf, ps, other]

Weight Factorization and Centralization for Continual Learning in Speech Recognition

Authors: Enes Yavuz Ugan, Ngoc-Quan Pham, Alexander Waibel

Abstract: Modern neural network based speech recognition models are required to continually absorb new data without re-training the whole system, especially in downstream applications using foundation models, having no access to the original training data. Continually training the models in a rehearsal-free, multilingual, and language agnostic condition, likely leads to catastrophic forgetting, when a seemi… ▽ More Modern neural network based speech recognition models are required to continually absorb new data without re-training the whole system, especially in downstream applications using foundation models, having no access to the original training data. Continually training the models in a rehearsal-free, multilingual, and language agnostic condition, likely leads to catastrophic forgetting, when a seemingly insignificant disruption to the weights can destructively harm the quality of the models. Inspired by the ability of human brains to learn and consolidate knowledge through the waking-sleeping cycle, we propose a continual learning approach with two distinct phases: factorization and centralization, learning and merging knowledge accordingly. Our experiments on a sequence of varied code-switching datasets showed that the centralization stage can effectively prevent catastrophic forgetting by accumulating the knowledge in multiple scattering low-rank adapters. △ Less

Submitted 19 June, 2025; originally announced June 2025.

Comments: Accepted to INTERSPEECH 2025

arXiv:2506.00368 [pdf, ps, other]

Neural Network-based Information-Theoretic Transceivers for High-Order Modulation Schemes

Authors: Ngoc Long Pham, Tri Nhu Do

Abstract: Neural network (NN)-based end-to-end (E2E) communication systems, in which each system component may consist of a portion of a neural network, have been investigated as potential tools for developing artificial intelligence (Al)-native E2E systems. In this paper, we propose an NN-based bitwise receiver that improves computational efficiency while maintaining performance comparable to baseline dema… ▽ More Neural network (NN)-based end-to-end (E2E) communication systems, in which each system component may consist of a portion of a neural network, have been investigated as potential tools for developing artificial intelligence (Al)-native E2E systems. In this paper, we propose an NN-based bitwise receiver that improves computational efficiency while maintaining performance comparable to baseline demappers. Building on this foundation, we introduce a novel symbol-wise autoencoder (AE)-based E2E system that jointly optimizes the transmitter and receiver at the physical layer. We evaluate the proposed NN-based receiver using bit-error rate (BER) analysis to confirm that the numerical BER achieved by NN-based receivers or transceivers is accurate. Results demonstrate that the AE-based system outperforms baseline architectures, particularly for higher-order modulation schemes. We further show that the training signal-to-noise ratio (SNR) significantly affects the performance of the systems when inference is conducted at different SNR levels. △ Less

Submitted 30 May, 2025; originally announced June 2025.

arXiv:2410.14997 [pdf, other]

Improving Pronunciation and Accent Conversion through Knowledge Distillation And Synthetic Ground-Truth from Native TTS

Authors: Tuan Nam Nguyen, Seymanur Akti, Ngoc Quan Pham, Alexander Waibel

Abstract: Previous approaches on accent conversion (AC) mainly aimed at making non-native speech sound more native while maintaining the original content and speaker identity. However, non-native speakers sometimes have pronunciation issues, which can make it difficult for listeners to understand them. Hence, we developed a new AC approach that not only focuses on accent conversion but also improves pronunc… ▽ More Previous approaches on accent conversion (AC) mainly aimed at making non-native speech sound more native while maintaining the original content and speaker identity. However, non-native speakers sometimes have pronunciation issues, which can make it difficult for listeners to understand them. Hence, we developed a new AC approach that not only focuses on accent conversion but also improves pronunciation of non-native accented speaker. By providing the non-native audio and the corresponding transcript, we generate the ideal ground-truth audio with native-like pronunciation with original duration and prosody. This ground-truth data aids the model in learning a direct mapping between accented and native speech. We utilize the end-to-end VITS framework to achieve high-quality waveform reconstruction for the AC task. As a result, our system not only produces audio that closely resembles native accents and while retaining the original speaker's identity but also improve pronunciation, as demonstrated by evaluation results. △ Less

Submitted 4 March, 2025; v1 submitted 19 October, 2024; originally announced October 2024.

Comments: accepted at ICASSP 2025

arXiv:2410.08229 [pdf, ps, other]

Improvement of Spiking Neural Network with Bit Planes and Color Models

Authors: Nhan T. Luu, Duong T. Luu, Nam N. Pham, Thang C. Truong

Abstract: Spiking neural network (SNN) has emerged as a promising paradigm in computational neuroscience and artificial intelligence, offering advantages such as low energy consumption and small memory footprint. However, their practical adoption is constrained by several challenges, prominently among them being performance optimization. In this study, we present a novel approach to enhance the performance… ▽ More Spiking neural network (SNN) has emerged as a promising paradigm in computational neuroscience and artificial intelligence, offering advantages such as low energy consumption and small memory footprint. However, their practical adoption is constrained by several challenges, prominently among them being performance optimization. In this study, we present a novel approach to enhance the performance of SNN for images through a new coding method that exploits bit plane representation. Our proposed technique is designed to improve the accuracy of SNN without increasing model size. Also, we investigate the impacts of color models of the proposed coding process. Through extensive experimental validation, we demonstrate the effectiveness of our coding strategy in achieving performance gain across multiple datasets. To the best of our knowledge, this is the first research that considers bit planes and color models in the context of SNN. By leveraging the unique characteristics of bit planes, we hope to unlock new potentials in SNNs performance, potentially paving the way for more efficient and effective SNNs models in future researches and applications. △ Less

Submitted 11 July, 2025; v1 submitted 28 September, 2024; originally announced October 2024.

Comments: 2024 IEEE 16th International Conference on Computational Intelligence and Communication Networks (CICN)

arXiv:2410.03734 [pdf, other]

Accent conversion using discrete units with parallel data synthesized from controllable accented TTS

Authors: Tuan Nam Nguyen, Ngoc Quan Pham, Alexander Waibel

Abstract: The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity. Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent. This paper presents a promising AC model that can convert many accents into native to overcome… ▽ More The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity. Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent. This paper presents a promising AC model that can convert many accents into native to overcome these issues. Our approach utilizes discrete units, derived from clustering self-supervised representations of native speech, as an intermediary target for accent conversion. Leveraging multi-speaker text-to-speech synthesis, it transforms these discrete representations back into native speech while retaining the speaker identity. Additionally, we develop an efficient data augmentation method to train the system without demanding a lot of non-native resources. Our system is proved to improve non-native speaker fluency, sound like a native accent, and preserve original speaker identity well. △ Less

Submitted 30 September, 2024; originally announced October 2024.

Comments: Accepted at Syndata4genAI

arXiv:2406.19077 [pdf, other]

Parameter Dependent Chen--Fliess Series and Their Nonrecursive Interconnections

Authors: W. Steven Gray, Natalie Pham

Abstract: A class of parameter dependent Chen--Fliess series is introduced where the series coefficients are taken from a noncommutative ring of multivariable differential operators. Such series are shown in the linear case to represent formal solutions to Cauchy initial value problems for nonhomogeneous PDEs and thus are useful for characterizing the input-output maps of distributed control systems. It is… ▽ More A class of parameter dependent Chen--Fliess series is introduced where the series coefficients are taken from a noncommutative ring of multivariable differential operators. Such series are shown in the linear case to represent formal solutions to Cauchy initial value problems for nonhomogeneous PDEs and thus are useful for characterizing the input-output maps of distributed control systems. It is also shown that this class of functional series is almost closed under the set of nonrecursive interconnections, that is, any finite combination of parallel and series interconnections without a closed-loop. Some sufficient conditions are needed for the series interconnection. Specific examples are given involving the transport equation and the wave equation. △ Less

Submitted 27 June, 2024; originally announced June 2024.

MSC Class: 41A58; 93C10; 35C10

arXiv:2401.05425 [pdf]

An Unobtrusive and Lightweight Ear-worn System for Continuous Epileptic Seizure Detection

Authors: Abdul Aziz, Nhat Pham, Neel Vora, Cody Reynolds, Jaime Lehnen, Pooja Venkatesh, Zhuoran Yao, Jay Harvey, Tam Vu, Kan Ding, Phuc Nguyen

Abstract: Epilepsy is one of the most common neurological diseases globally (around 50 million people worldwide). Fortunately, up to 70% of people with epilepsy could live seizure-free if properly diagnosed and treated, and a reliable technique to monitor the onset of seizures could improve the quality of life of patients who are constantly facing the fear of random seizure attacks. The scalp-based EEG test… ▽ More Epilepsy is one of the most common neurological diseases globally (around 50 million people worldwide). Fortunately, up to 70% of people with epilepsy could live seizure-free if properly diagnosed and treated, and a reliable technique to monitor the onset of seizures could improve the quality of life of patients who are constantly facing the fear of random seizure attacks. The scalp-based EEG test, despite being the gold standard for diagnosing epilepsy, is costly, necessitates hospitalization, demands skilled professionals for operation, and is discomforting for users. In this paper, we propose EarSD, a novel lightweight, unobtrusive, and socially acceptable ear-worn system to detect epileptic seizure onsets by measuring the physiological signals from behind the user's ears. EarSD includes an integrated custom-built sensing-computing-communication PCB to collect and amplify the signals of interest, remove the noises caused by motion artifacts and environmental impacts, and stream the data wirelessly to the computer/mobile phone nearby, where data are uploaded to the host computer for further processing. We conducted both in-lab and in-hospital experiments with epileptic seizure patients who were hospitalized for seizure studies. △ Less

Submitted 24 October, 2024; v1 submitted 1 January, 2024; originally announced January 2024.

arXiv:2311.11096 [pdf, other]

On the Out of Distribution Robustness of Foundation Models in Medical Image Segmentation

Authors: Duy Minh Ho Nguyen, Tan Ngoc Pham, Nghiem Tuong Diep, Nghi Quoc Phan, Quang Pham, Vinh Tong, Binh T. Nguyen, Ngan Hoang Le, Nhat Ho, Pengtao Xie, Daniel Sonntag, Mathias Niepert

Abstract: Constructing a robust model that can effectively generalize to test samples under distribution shifts remains a significant challenge in the field of medical imaging. The foundational models for vision and language, pre-trained on extensive sets of natural image and text data, have emerged as a promising approach. It showcases impressive learning abilities across different tasks with the need for… ▽ More Constructing a robust model that can effectively generalize to test samples under distribution shifts remains a significant challenge in the field of medical imaging. The foundational models for vision and language, pre-trained on extensive sets of natural image and text data, have emerged as a promising approach. It showcases impressive learning abilities across different tasks with the need for only a limited amount of annotated samples. While numerous techniques have focused on developing better fine-tuning strategies to adapt these models for specific domains, we instead examine their robustness to domain shifts in the medical image segmentation task. To this end, we compare the generalization performance to unseen domains of various pre-trained models after being fine-tuned on the same in-distribution dataset and show that foundation-based models enjoy better robustness than other architectures. From here, we further developed a new Bayesian uncertainty estimation for frozen models and used them as an indicator to characterize the model's performance on out-of-distribution (OOD) data, proving particularly beneficial for real-world applications. Our experiments not only reveal the limitations of current indicators like accuracy on the line or agreement on the line commonly used in natural image applications but also emphasize the promise of the introduced Bayesian uncertainty. Specifically, lower uncertainty predictions usually tend to higher out-of-distribution (OOD) performance. △ Less

Submitted 18 November, 2023; originally announced November 2023.

Comments: Advances in Neural Information Processing Systems (NeurIPS) 2023, Workshop on robustness of zero/few-shot learning in foundation models

arXiv:2211.11703 [pdf, other]

Towards continually learning new languages

Authors: Ngoc-Quan Pham, Jan Niehues, Alexander Waibel

Abstract: Multilingual speech recognition with neural networks is often implemented with batch-learning, when all of the languages are available before training. An ability to add new languages after the prior training sessions can be economically beneficial, but the main challenge is catastrophic forgetting. In this work, we combine the qualities of weight factorization and elastic weight consolidation in… ▽ More Multilingual speech recognition with neural networks is often implemented with batch-learning, when all of the languages are available before training. An ability to add new languages after the prior training sessions can be economically beneficial, but the main challenge is catastrophic forgetting. In this work, we combine the qualities of weight factorization and elastic weight consolidation in order to counter catastrophic forgetting and facilitate learning new languages quickly. Such combination allowed us to eliminate catastrophic forgetting while still achieving performance for the new languages comparable with having all languages at once, in experiments of learning from an initial 10 languages to achieve 26 languages without catastrophic forgetting and a reasonable performance compared to training all languages from scratch. △ Less

Submitted 17 July, 2024; v1 submitted 21 November, 2022; originally announced November 2022.

Comments: Work in progress

arXiv:2211.02592 [pdf]

A Large-Scale Study of a Sleep Tracking and Improving Device with Closed-loop and Personalized Real-time Acoustic Stimulation

Authors: Anh Nguyen, Galen Pogoncheff, Ban Xuan Dong, Nam Bui, Hoang Truong, Nhat Pham, Linh Nguyen, Hoang Huu Nguyen, Sy Duong-Quy, Sangtae Ha, Tam Vu

Abstract: Various intervention therapies ranging from pharmaceutical to hi-tech tailored solutions have been available to treat difficulty in falling asleep commonly caused by insomnia in modern life. However, current techniques largely remain ill-suited, ineffective, and unreliable due to their lack of precise real-time sleep tracking, in-time feedback on the therapies, an ability to keep people asleep dur… ▽ More Various intervention therapies ranging from pharmaceutical to hi-tech tailored solutions have been available to treat difficulty in falling asleep commonly caused by insomnia in modern life. However, current techniques largely remain ill-suited, ineffective, and unreliable due to their lack of precise real-time sleep tracking, in-time feedback on the therapies, an ability to keep people asleep during the night, and a large-scale effectiveness evaluation. Here, we introduce a novel sleep aid system, called Earable, that can continuously sense multiple head-based physiological signals and simultaneously enable closed-loop auditory stimulation to entrain brain activities in time for effective sleep promotion. We develop the system in a lightweight, comfortable, and user-friendly headband with a comprehensive set of algorithms and dedicated own-designed audio stimuli. We conducted multiple protocols from 883 sleep studies on 377 subjects (241 women, 119 men) wearing either a gold-standard device (PSG), Earable, or both concurrently. We demonstrate that our system achieves (1) a strong correlation (0.89 +/- 0.03) between the physiological signals acquired by Earable and those from the gold-standard PSG, (2) an 87.8 +/- 5.3% agreement on sleep scoring using our automatic real-time sleep staging algorithm with the consensus scored by three sleep technicians, and (3) a successful non-pharmacological stimulation alternative to effectively shorten the duration of sleep falling by 24.1 +/- 0.1 minutes. These results show that the efficacy of Earable exceeds existing techniques in intentions to promote fast falling asleep, track sleep state accurately, and achieve high social acceptance for real-time closed-loop personalized neuromodulation-based home sleep care. △ Less

Submitted 4 November, 2022; originally announced November 2022.

Comments: 33 pages, 8 figures

arXiv:2205.12304 [pdf, ps, other]

Adaptive multilingual speech recognition with pretrained models

Authors: Ngoc-Quan Pham, Alex Waibel, Jan Niehues

Abstract: Multilingual speech recognition with supervised learning has achieved great results as reflected in recent research. With the development of pretraining methods on audio and text data, it is imperative to transfer the knowledge from unsupervised multilingual models to facilitate recognition, especially in many languages with limited data. Our work investigated the effectiveness of using two pretra… ▽ More Multilingual speech recognition with supervised learning has achieved great results as reflected in recent research. With the development of pretraining methods on audio and text data, it is imperative to transfer the knowledge from unsupervised multilingual models to facilitate recognition, especially in many languages with limited data. Our work investigated the effectiveness of using two pretrained models for two modalities: wav2vec 2.0 for audio and MBART50 for text, together with the adaptive weight techniques to massively improve the recognition quality on the public datasets containing CommonVoice and Europarl. Overall, we noticed an 44% improvement over purely supervised learning, and more importantly, each technique provides a different reinforcement in different languages. We also explore other possibilities to potentially obtain the best model by slightly adding either depth or relative attention to the architecture. △ Less

Submitted 24 May, 2022; originally announced May 2022.

Comments: Submitted to INTERSPEECH 2022

arXiv:2109.09026 [pdf, other]

Hybrid Data Augmentation and Deep Attention-based Dilated Convolutional-Recurrent Neural Networks for Speech Emotion Recognition

Authors: Nhat Truong Pham, Duc Ngoc Minh Dang, Sy Dzung Nguyen

Abstract: Speech emotion recognition (SER) has been one of the significant tasks in Human-Computer Interaction (HCI) applications. However, it is hard to choose the optimal features and deal with imbalance labeled data. In this article, we investigate hybrid data augmentation (HDA) methods to generate and balance data based on traditional and generative adversarial networks (GAN) methods. To evaluate the ef… ▽ More Speech emotion recognition (SER) has been one of the significant tasks in Human-Computer Interaction (HCI) applications. However, it is hard to choose the optimal features and deal with imbalance labeled data. In this article, we investigate hybrid data augmentation (HDA) methods to generate and balance data based on traditional and generative adversarial networks (GAN) methods. To evaluate the effectiveness of HDA methods, a deep learning framework namely (ADCRNN) is designed by integrating deep dilated convolutional-recurrent neural networks with an attention mechanism. Besides, we choose 3D log Mel-spectrogram (MelSpec) features as the inputs for the deep learning framework. Furthermore, we reconfigure a loss function by combining a softmax loss and a center loss to classify the emotions. For validating our proposed methods, we use the EmoDB dataset that consists of several emotions with imbalanced samples. Experimental results prove that the proposed methods achieve better accuracy than the state-of-the-art methods on the EmoDB with 87.12% and 88.47% for the traditional and GAN-based methods, respectively. △ Less

Submitted 18 September, 2021; originally announced September 2021.

Comments: 12 pages, 16 figures, 6 tables

arXiv:2109.03219 [pdf, other]

Fruit-CoV: An Efficient Vision-based Framework for Speedy Detection and Diagnosis of SARS-CoV-2 Infections Through Recorded Cough Sounds

Authors: Long H. Nguyen, Nhat Truong Pham, Van Huong Do, Liu Tai Nguyen, Thanh Tin Nguyen, Van Dung Do, Hai Nguyen, Ngoc Duy Nguyen

Abstract: SARS-CoV-2 is colloquially known as COVID-19 that had an initial outbreak in December 2019. The deadly virus has spread across the world, taking part in the global pandemic disease since March 2020. In addition, a recent variant of SARS-CoV-2 named Delta is intractably contagious and responsible for more than four million deaths over the world. Therefore, it is vital to possess a self-testing serv… ▽ More SARS-CoV-2 is colloquially known as COVID-19 that had an initial outbreak in December 2019. The deadly virus has spread across the world, taking part in the global pandemic disease since March 2020. In addition, a recent variant of SARS-CoV-2 named Delta is intractably contagious and responsible for more than four million deaths over the world. Therefore, it is vital to possess a self-testing service of SARS-CoV-2 at home. In this study, we introduce Fruit-CoV, a two-stage vision framework, which is capable of detecting SARS-CoV-2 infections through recorded cough sounds. Specifically, we convert sounds into Log-Mel Spectrograms and use the EfficientNet-V2 network to extract its visual features in the first stage. In the second stage, we use 14 convolutional layers extracted from the large-scale Pretrained Audio Neural Networks for audio pattern recognition (PANNs) and the Wavegram-Log-Mel-CNN to aggregate feature representations of the Log-Mel Spectrograms. Finally, we use the combined features to train a binary classifier. In this study, we use a dataset provided by the AICovidVN 115M Challenge, which includes a total of 7371 recorded cough sounds collected throughout Vietnam, India, and Switzerland. Experimental results show that our proposed model achieves an AUC score of 92.8% and ranks the 1st place on the leaderboard of the AICovidVN Challenge. More importantly, our proposed framework can be integrated into a call center or a VoIP system to speed up detecting SARS-CoV-2 infections through online/recorded cough sounds. △ Less

Submitted 6 September, 2021; originally announced September 2021.

Comments: 4 pages

arXiv:2108.11089 [pdf, other]

Detecting Drill Failure in the Small Short-sound Drill Dataset

Authors: Thanh Tran, Nhat Truong Pham, Jan Lundgren

Abstract: Monitoring the conditions of machines is vital in the manufacturing industry. Early detection of faulty components in machines for stopping and repairing the failed components can minimize the downtime of the machine. This article presents an approach to detect the failure occurring in drill machines based on drill sounds from Valmet AB. The drill dataset includes three classes: anomalous sounds,… ▽ More Monitoring the conditions of machines is vital in the manufacturing industry. Early detection of faulty components in machines for stopping and repairing the failed components can minimize the downtime of the machine. This article presents an approach to detect the failure occurring in drill machines based on drill sounds from Valmet AB. The drill dataset includes three classes: anomalous sounds, normal sounds, and irrelevant sounds, which are also labeled as "Broken", "Normal", and "Other", respectively. Detecting drill failure effectively remains a challenge due to the following reasons. The waveform of drill sound is complex and short for detection. Additionally, in realistic soundscapes, there are sounds and noise in the context at the same time. Moreover, the balanced dataset is small to apply state-of-the-art deep learning techniques. To overcome these aforementioned difficulties, we augmented sounds to increase the number of sounds in the dataset. We then proposed a convolutional neural network (CNN) combined with a long short-term memory (LSTM) to extract features from log-Mel spectrograms and learn global high-level feature representation for the classification of three classes. A leaky rectified linear unit (Leaky ReLU) was utilized as the activation function for our proposed CNN instead of the rectified linear unit (ReLU). Moreover, we deployed an attention mechanism at the frame level after the LSTM layer to learn long-term global feature representations. As a result, the proposed method reached an overall accuracy of 92.35% for the drill failure detection system. △ Less

Submitted 9 November, 2021; v1 submitted 25 August, 2021; originally announced August 2021.

Comments: 8 pages, 10 figures, journal

arXiv:2105.03010 [pdf, ps, other]

Efficient Weight factorization for Multilingual Speech Recognition

Authors: Ngoc-Quan Pham, Tuan-Nam Nguyen, Sebastian Stueker, Alexander Waibel

Abstract: End-to-end multilingual speech recognition involves using a single model training on a compositional speech corpus including many languages, resulting in a single neural network to handle transcribing different languages. Due to the fact that each language in the training data has different characteristics, the shared network may struggle to optimize for all various languages simultaneously. In th… ▽ More End-to-end multilingual speech recognition involves using a single model training on a compositional speech corpus including many languages, resulting in a single neural network to handle transcribing different languages. Due to the fact that each language in the training data has different characteristics, the shared network may struggle to optimize for all various languages simultaneously. In this paper we propose a novel multilingual architecture that targets the core operation in neural networks: linear transformation functions. The key idea of the method is to assign fast weight matrices for each language by decomposing each weight matrix into a shared component and a language dependent component. The latter is then factorized into vectors using rank-1 assumptions to reduce the number of parameters per language. This efficient factorization scheme is proved to be effective in two multilingual settings with $7$ and $27$ languages, reducing the word error rates by $26\%$ and $27\%$ rel. for two popular architectures LSTM and Transformer, respectively. △ Less

Submitted 6 May, 2021; originally announced May 2021.

Comments: Submitted to Interspeech 2021

arXiv:2005.09940 [pdf, other]

Relative Positional Encoding for Speech Recognition and Direct Translation

Authors: Ngoc-Quan Pham, Thanh-Le Ha, Tuan-Nam Nguyen, Thai-Son Nguyen, Elizabeth Salesky, Sebastian Stueker, Jan Niehues, Alexander Waibel

Abstract: Transformer models are powerful sequence-to-sequence architectures that are capable of directly mapping speech inputs to transcriptions or translations. However, the mechanism for modeling positions in this model was tailored for text modeling, and thus is less ideal for acoustic inputs. In this work, we adapt the relative position encoding scheme to the Speech Transformer, where the key addition… ▽ More Transformer models are powerful sequence-to-sequence architectures that are capable of directly mapping speech inputs to transcriptions or translations. However, the mechanism for modeling positions in this model was tailored for text modeling, and thus is less ideal for acoustic inputs. In this work, we adapt the relative position encoding scheme to the Speech Transformer, where the key addition is relative distance between input states in the self-attention network. As a result, the network can better adapt to the variable distributions present in speech data. Our experiments show that our resulting model achieves the best recognition result on the Switchboard benchmark in the non-augmentation condition, and the best published result in the MuST-C speech translation benchmark. We also show that this model is able to better utilize synthetic data than the Transformer, and adapts better to variable sentence segmentation quality for speech translation. △ Less

Submitted 20 May, 2020; originally announced May 2020.

Comments: Submitted to Interspeech 2020

arXiv:2003.10022 [pdf, other]

High Performance Sequence-to-Sequence Model for Streaming Speech Recognition

Authors: Thai-Son Nguyen, Ngoc-Quan Pham, Sebastian Stueker, Alex Waibel

Abstract: Recently sequence-to-sequence models have started to achieve state-of-the-art performance on standard speech recognition tasks when processing audio data in batch mode, i.e., the complete audio data is available when starting processing. However, when it comes to performing run-on recognition on an input stream of audio data while producing recognition results in real-time and with low word-based… ▽ More Recently sequence-to-sequence models have started to achieve state-of-the-art performance on standard speech recognition tasks when processing audio data in batch mode, i.e., the complete audio data is available when starting processing. However, when it comes to performing run-on recognition on an input stream of audio data while producing recognition results in real-time and with low word-based latency, these models face several challenges. For many techniques, the whole audio sequence to be decoded needs to be available at the start of the processing, e.g., for the attention mechanism or the bidirectional LSTM (BLSTM). In this paper, we propose several techniques to mitigate these problems. We introduce an additional loss function controlling the uncertainty of the attention mechanism, a modified beam search identifying partial, stable hypotheses, ways of working with BLSTM in the encoder, and the use of chunked BLSTM. Our experiments show that with the right combination of these techniques, it is possible to perform run-on speech recognition with low word-based latency without sacrificing in word error rate performance. △ Less

Submitted 26 July, 2020; v1 submitted 22 March, 2020; originally announced March 2020.

Comments: To appear in Interspeech 2020

arXiv:1910.05603 [pdf, other]

VAIS ASR: Building a conversational speech recognition system using language model combination

Authors: Quang Minh Nguyen, Thai Binh Nguyen, Ngoc Phuong Pham, The Loc Nguyen

Abstract: Automatic Speech Recognition (ASR) systems have been evolving quickly and reaching human parity in certain cases. The systems usually perform pretty well on reading style and clean speech, however, most of the available systems suffer from situation where the speaking style is conversation and in noisy environments. It is not straight-forward to tackle such problems due to difficulties in data col… ▽ More Automatic Speech Recognition (ASR) systems have been evolving quickly and reaching human parity in certain cases. The systems usually perform pretty well on reading style and clean speech, however, most of the available systems suffer from situation where the speaking style is conversation and in noisy environments. It is not straight-forward to tackle such problems due to difficulties in data collection for both speech and text. In this paper, we attempt to mitigate the problems using language models combination techniques that allows us to utilize both large amount of writing style text and small number of conversation text data. Evaluation on the VLSP 2019 ASR challenges showed that our system achieved 4.85% WER on the VLSP 2018 and 15.09% WER on the VLSP 2019 data sets. △ Less

Submitted 12 October, 2019; originally announced October 2019.

Comments: 3 pages, 1 figures, Vietnamese Language and Speech Processing conference)

arXiv:1908.09766 [pdf]

A Hybrid of Adaptation and Dynamic Routing based on SDN for Improving QoE in HTTP Adaptive VBR Video Streaming

Authors: Hong Thinh Pham, Ngoc Nam Pham, Huu Thanh Nguyen, Alan Marshall, Thu Huong Truong

Abstract: Recently, HTTP Adaptive Streaming HAS has received significant attention from both industry and academia based on its ability to enhancing media streaming services over the Internet. Recent research solutions that have tried to improve HAS by adaptation at the client side only may not be completely effective without interacting with routing decisions in the upper layers. In this paper, we address… ▽ More Recently, HTTP Adaptive Streaming HAS has received significant attention from both industry and academia based on its ability to enhancing media streaming services over the Internet. Recent research solutions that have tried to improve HAS by adaptation at the client side only may not be completely effective without interacting with routing decisions in the upper layers. In this paper, we address the aforementioned issue by proposing a dynamic bandwidth allocation and management architecture for streaming video flows to improve users satisfaction. We also introduce an initial cross layer hybrid method that combines quality adaptation of variable bitrate video streaming over the HTTP protocol at the client side and SDN based dynamical routing. This scheme is enabled by the Software Defined Networking architecture that is now being considered as an emerging paradigm that disassociates the forwarding process from the routing process. SDN brings flexibility and the ability to flexibly change routing solutions, in turn resulting in dynamically improving the services provided in the application layer. Our experimental results show that the proposed solution offers significantly higher overall bitrates as well as smoother viewing experience than existing methods. △ Less

Submitted 26 August, 2019; originally announced August 2019.

Comments: 14 pages, 17 figures, IJCSNS International Journal of Computer Science and Network Security, http://paper.ijcsns.org/07_book/201907/20190708.pdf

Journal ref: VOL.19 No.7, July 2019

arXiv:1904.13377 [pdf, other]

Very Deep Self-Attention Networks for End-to-End Speech Recognition

Authors: Ngoc-Quan Pham, Thai-Son Nguyen, Jan Niehues, Markus Müller, Sebastian Stüker, Alexander Waibel

Abstract: Recently, end-to-end sequence-to-sequence models for speech recognition have gained significant interest in the research community. While previous architecture choices revolve around time-delay neural networks (TDNN) and long short-term memory (LSTM) recurrent neural networks, we propose to use self-attention via the Transformer architecture as an alternative. Our analysis shows that deep Transfor… ▽ More Recently, end-to-end sequence-to-sequence models for speech recognition have gained significant interest in the research community. While previous architecture choices revolve around time-delay neural networks (TDNN) and long short-term memory (LSTM) recurrent neural networks, we propose to use self-attention via the Transformer architecture as an alternative. Our analysis shows that deep Transformer networks with high learning capacity are able to exceed performance from previous end-to-end approaches and even match the conventional hybrid systems. Moreover, we trained very deep models with up to 48 Transformer layers for both encoder and decoders combined with stochastic residual connections, which greatly improve generalizability and training efficiency. The resulting models outperform all previous end-to-end ASR approaches on the Switchboard benchmark. An ensemble of these models achieve 9.9% and 17.7% WER on Switchboard and CallHome test sets respectively. This finding brings our end-to-end models to competitive levels with previous hybrid systems. Further, with model ensembling the Transformers can outperform certain hybrid systems, which are more complicated in terms of both structure and training procedure. △ Less

Submitted 3 May, 2019; v1 submitted 30 April, 2019; originally announced April 2019.

Comments: Submitted to INTERSPEECH 2019

Showing 1–21 of 21 results for author: Pham, N