SegAug: CTC-Aligned Segmented Augmentation For Robust RNN-Transducer Based Speech Recognition

Authors: Khanh Le, Tuan Vu Ho, Dung Tran, Duc Thanh Chau

Abstract: RNN-Transducer (RNN-T) is a widely adopted architecture in speech recognition, integrating acoustic and language modeling in an end-to-end framework. However, the RNN-T predictor tends to over-rely on consecutive word dependencies in training data, leading to high deletion error rates, particularly with less common or out-of-domain phrases. Existing solutions, such as regularization and data augmentation, often compromise other aspects of performance. We propose SegAug, an alignment-based augmentation technique that generates contextually varied audio-text pairs with low sentence-level semantics. This method encourages the model to focus more on acoustic features while diversifying the learned textual patterns of its internal language model, thereby reducing deletion errors and enhancing overall performance. Evaluations on the LibriSpeech and Tedlium-v3 datasets demonstrate a relative WER reduction of up to 12.5% on small-scale and 6.9% on large-scale settings. Notably, most of the improvement stems from reduced deletion errors, with relative reductions of 45.4% and 18.5%, respectively. These results highlight SegAug's effectiveness in improving RNN-T's robustness, offering a promising solution for enhancing speech recognition performance across diverse and challenging scenarios.

Submitted 20 February, 2025; originally announced February 2025.

Comments: Accepted to ICASSP 2025

arXiv:2502.14673 [pdf, other]

ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription

Authors: Khanh Le, Tuan Vu Ho, Dung Tran, Duc Thanh Chau

Abstract: Deploying ASR models at an industrial scale poses significant challenges in hardware resource management, especially for long-form transcription tasks where audio may last for hours. Large Conformer models, despite their capabilities, are limited to processing only 15 minutes of audio on an 80GB GPU. Furthermore, variable input lengths worsen inefficiencies, as standard batching leads to excessive padding, increasing resource consumption and execution time. To address this, we introduce ChunkFormer, an efficient ASR model that uses chunk-wise processing with relative right context, enabling long audio transcriptions on low-memory GPUs. ChunkFormer handles up to 16 hours of audio on an 80GB GPU, 1.5x longer than the current state-of-the-art FastConformer, while also boosting long-form transcription performance with up to 7.7% absolute reduction on word error rate and maintaining accuracy on shorter tasks compared to Conformer. By eliminating the need for padding in standard batching, ChunkFormer's masked batching technique reduces execution time and memory usage by more than 3x in batch processing, substantially reducing costs for a wide range of ASR systems, particularly regarding GPU resources for models serving in real-world applications.

Submitted 20 February, 2025; originally announced February 2025.

Comments: Accepted to ICASSP 2025

arXiv:2301.10966 [pdf]

Design of Mobile Manipulator for Fire Extinguisher Testing. Part II: Design and Simulation

Authors: Thai Nguyen Chau, Xuan Quang Ngo, Van Tu Duong, Trong Trung Nguyen, Huy Hung Nguyen, Tan Tien Nguyen

Abstract: All flames are extinguished as early as possible, or fire services have to deal with major conflagrations. This leads to the fact that the quality of fire extinguishers has become a very sensitive and important issue in firefighting. Inspired by the development of automatic fire fighting systems, this paper presents a mobile manipulator to evaluate the power of fire extinguishers, which is designed according to the standard of fire extinguishers named as ISO 7165:2009 and ISO 11601:2008. A detailed discussion on key specifications solutions and mechanical design of the chassis of the mobile manipulator has been presented in Part I: Key Specifications and Conceptual Design. The focus of this part is on the rest of the mechanical design and controller design of the mobile manipulator.

Submitted 26 January, 2023; originally announced January 2023.

Comments: 10 pages, 15 figures, the 7th International Conference on Advanced Engineering, Theory and Applications

arXiv:2110.13431 [pdf]

Meter-Range Wireless Motor Drive for Pipeline Transportation

Authors: Wei Liu, K. T. Chau, Hui Wang, Tengbo Yang

Abstract: This paper proposes and implements a meter-range wireless motor drive (WMD) system for promising applications of underground pipeline transportations or in-pipe robots. To power a pipeline network beneath the earth, both the power grid and the control system are usually required to be deployed deep underground, thus increasing the construction cost, maintenance difficulty and system complexity. The proposed system newly develops a hybrid repeater to enable the desired meter-range wireless power and drive transfer, which can offer a fault-tolerant network with a robust structure for the underground sensor-free WMD while maintaining a high transmission efficiency. Hence, this wireless pipeline network can reduce the maintenance requirement and regulate the flow rate effectively. A full-scale prototype has been built for practical verification, and the system efficiency can reach 88.8% at a long transfer distance of 150 cm. Theoretical analysis, software simulation and hardware experimentation are given to verify the feasibility of proposed meter-range WMD for underground pipeline transportations.

Submitted 16 January, 2022; v1 submitted 26 October, 2021; originally announced October 2021.

arXiv:2103.05824 [pdf]

A Cyber-Physical Perspective to Pinning-Decision for Distributed Multi-Agent Control in Microgrid against Stochastic Communication Disruptions

Authors: Samson S. Yu, Tat Kei Chau

Abstract: In this study, we propose a decision-making strategy for pinning-based distributed multi-agent (PDMA) automatic generation control (AGC) in islanded microgrids against stochastic communication disruptions. The target microgrid is construed as a cyber-physical system, wherein the physical microgrid is modeled as an inverter-interfaced autonomous grid with detailed system dynamic formulation, and the communication network topology is regarded as a cyber-system independent of its physical connection. The primal goal of the proposed method is to decide the minimum number of generators to be pinned and their identities amongst all distributed generators (DGs). The pinning decisions are made based on complex network theories using the genetic algorithm (GA), for the purpose of synchronizing and regulating the frequencies and voltages of all generator busbars in a PDMA control structure, i.e., without resorting to a central AGC agent. Thereafter, the mapping of cyber-system topology and the pinning decision is constructed using deep learning (DL) technique, so that the pinning-decision can be made nearly instantly upon detecting a new cyber-system topology after stochastic communication disruptions. The proposed decision-making approach is verified using a 10-generator, 38-bus microgrid through time-domain simulation for transient stability analysis.

Submitted 9 March, 2021; originally announced March 2021.

Comments: 8 pages, 7 figures, 2 tables

arXiv:2010.15250 [pdf, other]

Semantic video segmentation for autonomous driving

Authors: Minh Triet Chau

Abstract: We aim to solve semantic video segmentation in autonomous driving, namely road detection in real time video, using techniques discussed in (Shelhamer et al., 2016a). While fully convolutional network gives good result, we show that the speed can be halved while preserving the accuracy. The test dataset being used is KITTI, which consists of real footage from Germany's streets.

Submitted 28 October, 2020; originally announced October 2020.

Comments: This work was done around 2017. Some minor changes were added

arXiv:2008.07660 [pdf, ps, other]

Revisiting the Application of Feature Selection Methods to Speech Imagery BCI Datasets

Authors: Javad Rahimipour Anaraki, Jae Moon, Tom Chau

Abstract: Brain-computer interface (BCI) aims to establish and improve human and computer interactions. There has been an increasing interest in designing new hardware devices to facilitate the collection of brain signals through various technologies, such as wet and dry electroencephalogram (EEG) and functional near-infrared spectroscopy (fNIRS) devices. The promising results of machine learning methods have attracted researchers to apply these methods to their data. However, some methods can be overlooked simply due to their inferior performance against a particular dataset. This paper shows how relatively simple yet powerful feature selection/ranking methods can be applied to speech imagery datasets and generate significant results. To do so, we introduce two approaches, horizontal and vertical settings, to use any feature selection and ranking methods to speech imagery BCI datasets. Our primary goal is to improve the resulting classification accuracies from support vector machines, $k$-nearest neighbour, decision tree, linear discriminant analysis and long short-term memory recurrent neural network classifiers. Our experimental results show that using a small subset of channels, we can retain and, in most cases, improve the resulting classification accuracies regardless of the classifier.

Submitted 17 August, 2020; originally announced August 2020.

Comments: 5 pages, 2 figures

ACM Class: I.2.8

arXiv:2007.08668 [pdf, other]

BRP-NAS: Prediction-based NAS using GCNs

Authors: Łukasz Dudziak, Thomas Chau, Mohamed S. Abdelfattah, Royson Lee, Hyeji Kim, Nicholas D. Lane

Abstract: Neural architecture search (NAS) enables researchers to automatically explore broad design spaces in order to improve efficiency of neural networks. This efficiency is especially important in the case of on-device deployment, where improvements in accuracy should be balanced out with computational demands of a model. In practice, performance metrics of a model are computationally expensive to obtain. Previous work uses a proxy (e.g., number of operations) or a layer-wise measurement of neural network layers to estimate end-to-end hardware performance but the imprecise prediction diminishes the quality of NAS. To address this problem, we propose BRP-NAS, an efficient hardware-aware NAS enabled by an accurate performance predictor based on a graph convolutional network (GCN). What is more, we investigate prediction quality on different metrics and show that sample efficiency of the predictor-based NAS can be improved by considering binary relations of models and an iterative data selection strategy. We show that our proposed method outperforms all prior methods on NAS-Bench-101 and NAS-Bench-201, and that our predictor can consistently learn to extract useful features from the DARTS search space, improving upon the second-order baseline. Finally, to raise awareness of the fact that accurate latency estimation is not a trivial task, we release LatBench, a latency dataset of NAS-Bench-201 models running on a broad range of devices.

Submitted 19 January, 2021; v1 submitted 16 July, 2020; originally announced July 2020.

Comments: Published at NeurIPS 2020

arXiv:2002.05022 [pdf, other]

Best of Both Worlds: AutoML Codesign of a CNN and its Hardware Accelerator

Authors: Mohamed S. Abdelfattah, Łukasz Dudziak, Thomas Chau, Royson Lee, Hyeji Kim, Nicholas D. Lane

Abstract: Neural architecture search (NAS) has been very successful at outperforming human-designed convolutional neural networks (CNN) in accuracy, and when hardware information is present, latency as well. However, NAS-designed CNNs typically have a complicated topology, therefore, it may be difficult to design a custom hardware (HW) accelerator for such CNNs. We automate HW-CNN codesign using NAS by including parameters from both the CNN model and the HW accelerator, and we jointly search for the best model-accelerator pair that boosts accuracy and efficiency. We call this Codesign-NAS. In this paper we focus on defining the Codesign-NAS multiobjective optimization problem, demonstrating its effectiveness, and exploring different ways of navigating the codesign search space. For CIFAR-10 image classification, we enumerate close to 4 billion model-accelerator pairs, and find the Pareto frontier within that large search space. This allows us to evaluate three different reinforcement-learning-based search strategies. Finally, compared to ResNet on its most optimal HW accelerator from within our HW design space, we improve on CIFAR-100 classification accuracy by 1.3% while simultaneously increasing performance/area by 41% in just 1000 GPU-hours of running Codesign-NAS.

Submitted 6 March, 2020; v1 submitted 11 February, 2020; originally announced February 2020.

Comments: accepted at DAC 2020

arXiv:1912.04828 [pdf]

Navigating in Virtual Reality using Thought: The Development and Assessment of a Motor Imagery based Brain-Computer Interface

Authors: Behnam Reyhani-Masoleh, Tom Chau

Abstract: Brain-computer interface (BCI) systems have potential as assistive technologies for individuals with severe motor impairments. Nevertheless, individuals must first participate in many training sessions to obtain adequate data for optimizing the classification algorithm and subsequently acquiring brain-based control. Such traditional training paradigms have been dubbed unengaging and unmotivating for users. In recent years, it has been shown that the synergy of virtual reality (VR) and a BCI can lead to increased user engagement. This study created a 3-class BCI with a rather elaborate EEG signal processing pipeline that heavily utilizes machine learning. The BCI initially presented sham feedback but was eventually driven by EEG associated with motor imagery. The BCI tasks consisted of motor imagery of the feet and left and right hands, which were used to navigate a single-path maze in VR. Ten of the eleven recruited participants achieved online performance superior to chance (p < 0.01), while the majority successfully completed more than 70% of the prescribed navigational tasks. These results indicate that the proposed paradigm warrants further consideration as a neurofeedback BCI training tool: a paradigm that allows users, from their perspective, control from the outset without the need for prior data collection sessions.

Submitted 10 December, 2019; originally announced December 2019.

Comments: 23 pages, 10 figures

Showing 1–10 of 10 results for author: Chau, T