-
Federated Representation Learning for Automatic Speech Recognition
Authors:
Guruprasad V Ramesh,
Gopinath Chennupati,
Milind Rao,
Anit Kumar Sahu,
Ariya Rastrow,
Jasha Droppo
Abstract:
Federated Learning (FL) is a privacy-preserving paradigm, allowing edge devices to learn collaboratively without sharing data. Edge devices like Alexa and Siri are prospective sources of unlabeled audio data that can be tapped to learn robust audio representations. In this work, we bring Self-supervised Learning (SSL) and FL together to learn representations for Automatic Speech Recognition respecting data privacy constraints. We use the speaker and chapter information in the unlabeled speech dataset, Libri-Light, to simulate non-IID speaker-siloed data distributions and pre-train an LSTM encoder with the Contrastive Predictive Coding framework with FedSGD. We show that the pre-trained ASR encoder in FL performs as well as a centrally pre-trained model and produces an improvement of 12-15% (WER) compared to no pre-training. We further adapt the federated pre-trained models to a new language, French, and show a 20% (WER) improvement over no pre-training.
Submitted 7 August, 2023; v1 submitted 3 August, 2023;
originally announced August 2023.
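The FedSGD aggregation step named in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `fedsgd_round`, the flat list-of-floats parameter layout, and the weighting of clients by local sample count are all choices made here for the example.

```python
def fedsgd_round(weights, client_grads, client_sizes, lr):
    """One FedSGD round: average client gradients, then take one server step.

    weights      -- current global parameters (list of floats)
    client_grads -- one gradient list per client, computed on local data
    client_sizes -- local sample count per client (assumed aggregation weights)
    lr           -- server learning rate
    """
    total = sum(client_sizes)
    # Sample-count-weighted average of the client gradients, per coordinate.
    avg_grad = [
        sum(n * g[i] for g, n in zip(client_grads, client_sizes)) / total
        for i in range(len(weights))
    ]
    # Single gradient-descent step on the shared model; raw data never leaves clients.
    return [w - lr * g for w, g in zip(weights, avg_grad)]
```

Only gradients cross the network in this sketch, which is the property that lets speaker-siloed clients contribute without sharing audio.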
-
Federated Self-Learning with Weak Supervision for Speech Recognition
Authors:
Milind Rao,
Gopinath Chennupati,
Gautam Tiwari,
Anit Kumar Sahu,
Anirudh Raju,
Ariya Rastrow,
Jasha Droppo
Abstract:
Low-footprint automatic speech recognition (ASR) models are increasingly being deployed on edge devices for conversational agents, which enhances privacy. We study the problem of federated continual incremental learning for recurrent neural network-transducer (RNN-T) ASR models in the privacy-enhancing scheme of learning on-device, without access to ground truth human transcripts or machine transcriptions from a stronger ASR model. In particular, we study the performance of a self-learning based scheme, with a paired teacher model updated through an exponential moving average of ASR models. Further, we propose using possibly noisy weak-supervision signals such as feedback scores and natural language understanding semantics determined from user behavior across multiple turns in a session of interactions with the conversational agent. These signals are leveraged in a multi-task policy-gradient training approach to improve the performance of self-learning for ASR. Finally, we show how catastrophic forgetting can be mitigated by combining on-device learning with a memory-replay approach using selected historical datasets. These innovations allow for a 10% relative improvement in WER on new use cases with minimal degradation on other test sets in the absence of strong-supervision signals such as ground-truth transcriptions.
Submitted 21 June, 2023;
originally announced June 2023.
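The teacher update mentioned in the abstract, an exponential moving average (EMA) of student ASR models, can be illustrated with a short sketch. The function name `ema_update`, the flat parameter list, and the default `decay` value are assumptions for illustration; the abstract does not specify them.

```python
def ema_update(teacher, student, decay=0.999):
    """Move each teacher parameter a small step toward the student parameter.

    teacher, student -- parameter lists of equal length
    decay            -- hypothetical smoothing factor; closer to 1 means a slower teacher
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]
```

Because the teacher changes slowly, its pseudo-labels tend to be more stable than those of the rapidly updated student, which is the usual motivation for pairing the two in self-learning.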
-
Learning When to Trust Which Teacher for Weakly Supervised ASR
Authors:
Aakriti Agrawal,
Milind Rao,
Anit Kumar Sahu,
Gopinath Chennupati,
Andreas Stolcke
Abstract:
Automatic speech recognition (ASR) training can utilize multiple experts as teacher models, each trained on a specific domain or accent. Teacher models may be opaque in nature since their architecture may not be known or their training cadence is different from that of the student ASR model. Still, the student models are updated incrementally using the pseudo-labels generated independently by the expert teachers. In this paper, we exploit supervision from multiple domain experts in training student ASR models. This training strategy is especially useful in scenarios where few or no human transcriptions are available. To that end, we propose a Smart-Weighter mechanism that selects an appropriate expert based on the input audio, and then trains the student model in an unsupervised setting. We show the efficacy of our approach using LibriSpeech and LibriLight benchmarks and find an improvement of 4 to 25% over baselines that uniformly weight all the experts, use a single expert model, or combine experts using ROVER.
Submitted 21 June, 2023;
originally announced June 2023.
-
Data-driven Thermal Model Inference with ARMAX, in Smart Environments, based on Normalized Mutual Information
Authors:
Zhanhong Jiang,
Jonathan Francis,
Anit Kumar Sahu,
Sirajum Munir,
Charles Shelton,
Anthony Rowe,
Mario Bergés
Abstract:
Understanding the models that characterize the thermal dynamics in a smart building is important for the comfort of its occupants and for its energy optimization. A significant amount of research has attempted to utilize thermodynamics (physical) models for smart building control, but these approaches remain challenging due to the stochastic nature of the intermittent environmental disturbances. This paper presents a novel data-driven approach for indoor thermal model inference, which combines an Autoregressive Moving Average with eXogenous inputs model (ARMAX) with a Normalized Mutual Information scheme (NMI). Based on this information-theoretic method, NMI, causal dependencies between the indoor temperature and exogenous inputs are explicitly obtained as a guideline for the ARMAX model to find the dominating inputs. For validation, we use three datasets based on building energy systems, against which we compare our method to an autoregressive model with exogenous inputs (ARX), a regularized ARMAX model, and state-space models.
Submitted 10 June, 2020;
originally announced June 2020.
-
MATCHA: Speeding Up Decentralized SGD via Matching Decomposition Sampling
Authors:
Jianyu Wang,
Anit Kumar Sahu,
Zhouyi Yang,
Gauri Joshi,
Soummya Kar
Abstract:
This paper studies the problem of error-runtime trade-off, typically encountered in decentralized training based on stochastic gradient descent (SGD) using a given network. While a denser (sparser) network topology results in faster (slower) error convergence in terms of iterations, it incurs more (less) communication time/delay per iteration. In this paper, we propose MATCHA, an algorithm that can achieve a win-win in this error-runtime trade-off for any arbitrary network topology. The main idea of MATCHA is to parallelize inter-node communication by decomposing the topology into matchings. To preserve fast error convergence speed, it identifies and communicates more frequently over critical links, and saves communication time by using other links less frequently. Experiments on a suite of datasets and deep neural networks validate the theoretical analyses and demonstrate that MATCHA takes up to $5\times$ less time than vanilla decentralized SGD to reach the same training loss.
Submitted 18 November, 2019; v1 submitted 22 May, 2019;
originally announced May 2019.
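The core idea in the abstract, decomposing the topology into matchings so that links within one matching can communicate in parallel, can be sketched as below. This is an illustration only: it uses a simple greedy decomposition (not necessarily the decomposition used in the paper), and the function names and the user-supplied activation probabilities `probs` are hypothetical.

```python
import random


def decompose_into_matchings(edges):
    """Greedily split an edge list into matchings (no two edges share a node)."""
    matchings = []  # each entry: (edges in this matching, nodes already used)
    for u, v in edges:
        for match_edges, used in matchings:
            if u not in used and v not in used:
                match_edges.append((u, v))
                used.update((u, v))
                break
        else:
            # No existing matching can take this edge, so open a new one.
            matchings.append(([(u, v)], {u, v}))
    return [match_edges for match_edges, _ in matchings]


def sample_topology(matchings, probs, rng=random):
    """Activate each matching independently with its (assumed) probability."""
    active = []
    for match_edges, p in zip(matchings, probs):
        if rng.random() < p:
            active.extend(match_edges)
    return active
```

Assigning a higher probability to a matching that contains critical links makes those links communicate more often, while rarely activated matchings save communication time in most iterations.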
-
Distributed stochastic optimization with gradient tracking over strongly-connected networks
Authors:
Ran Xin,
Anit Kumar Sahu,
Usman A. Khan,
Soummya Kar
Abstract:
In this paper, we study distributed stochastic optimization to minimize a sum of smooth and strongly-convex local cost functions over a network of agents, communicating over a strongly-connected graph. Assuming that each agent has access to a stochastic first-order oracle ($\mathcal{SFO}$), we propose a novel distributed method, called $\mathcal{S}$-$\mathcal{AB}$, where each agent uses an auxiliary variable to asymptotically track the gradient of the global cost in expectation. The $\mathcal{S}$-$\mathcal{AB}$ algorithm employs row- and column-stochastic weights simultaneously to ensure both consensus and optimality. Since doubly-stochastic weights are not used, $\mathcal{S}$-$\mathcal{AB}$ is applicable to arbitrary strongly-connected graphs. We show that under a sufficiently small constant step-size, $\mathcal{S}$-$\mathcal{AB}$ converges linearly (in expected mean-square sense) to a neighborhood of the global minimizer. We present numerical simulations based on real-world data sets to illustrate the theoretical results.
Submitted 9 April, 2019; v1 submitted 18 March, 2019;
originally announced March 2019.
-
Distributed Sequential Detection for Gaussian Shift-in-Mean Hypothesis Testing
Authors:
Anit Kumar Sahu,
Soummya Kar
Abstract:
This paper studies the problem of sequential Gaussian shift-in-mean hypothesis testing in a distributed multi-agent network. A sequential probability ratio test (SPRT) type algorithm in a distributed framework of the consensus+innovations form is proposed, in which the agents update their decision statistics by simultaneously processing the latest observations (innovations) sensed sequentially over time and information obtained from neighboring agents (consensus). For each pre-specified set of type I and type II error probabilities, local decision parameters are derived which ensure that the algorithm achieves the desired error performance and terminates in finite time almost surely (a.s.) at each network agent. Large deviation exponents for the tail probabilities of the agent stopping time distributions are obtained and it is shown that asymptotically (in the number of agents or in the high signal-to-noise-ratio regime) these exponents associated with the distributed algorithm approach those of the optimal centralized detector. The expected stopping time for the proposed algorithm at each network agent is evaluated and is benchmarked with respect to the optimal centralized algorithm. The efficiency of the proposed algorithm in the sense of the expected stopping times is characterized in terms of network connectivity. Finally, simulation studies are presented which illustrate and verify the analytical findings.
Submitted 31 August, 2015; v1 submitted 27 November, 2014;
originally announced November 2014.
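For orientation, the classical single-agent SPRT for a Gaussian shift in mean, on which the distributed algorithm builds, can be sketched as follows. This is not the paper's consensus+innovations algorithm: it omits all inter-agent communication, uses Wald's standard approximate thresholds, and the function name and return convention are assumptions for the example.

```python
import math


def sprt_gaussian_mean(observations, mu0, mu1, sigma, alpha, beta):
    """Wald SPRT for H0: mean mu0 versus H1: mean mu1, known std sigma.

    alpha -- target type I error probability
    beta  -- target type II error probability
    Returns (decision, number of samples used).
    """
    upper = math.log((1.0 - beta) / alpha)   # crossing above accepts H1
    lower = math.log(beta / (1.0 - alpha))   # crossing below accepts H0
    llr = 0.0
    for t, y in enumerate(observations, start=1):
        # Log-likelihood ratio increment for one Gaussian observation.
        llr += (mu1 - mu0) / sigma**2 * (y - 0.5 * (mu0 + mu1))
        if llr >= upper:
            return "H1", t
        if llr <= lower:
            return "H0", t
    return "undecided", len(observations)
```

In a distributed setting, each agent would additionally mix its statistic with those of its neighbors before adding the new increment; the thresholds derived in the paper for that case are not reproduced here.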
-
Fast and Accurate Frequency Estimation Using Sliding DFT
Authors:
Anit Kumar Sahu,
Mrityunjoy Chakraborty
Abstract:
Frequency estimation of a complex exponential is a problem relevant to a large number of fields. This paper presents a computationally efficient and accurate frequency estimator based on the guaranteed-stable sliding DFT, which provides both stability and computational efficiency. For large N, the estimator approaches Jacobsen's estimator and Candan's estimator, with an extra correction term multiplied in to stabilize the sliding DFT. Simulation results show that the proposed estimator performs better than Jacobsen's estimator and Candan's estimator.
Submitted 20 February, 2012;
originally announced February 2012.
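The two building blocks the abstract refers to, a sliding DFT and Jacobsen's three-bin estimator, can be sketched as below. This is a baseline illustration only, not the estimator proposed in the paper: it uses the plain sliding DFT recurrence rather than the guaranteed-stable variant, it updates all N bins per sample for simplicity, and the function names are assumptions.

```python
import cmath
import math


def sliding_dft(samples, n):
    """Return the n-point DFT of the last n samples via the sliding recurrence."""
    twiddle = [cmath.exp(2j * math.pi * k / n) for k in range(n)]
    bins = [0j] * n
    window = [0j] * n  # circular buffer of the current window, initially zeros
    for i, x in enumerate(samples):
        delta = x - window[i % n]  # newest sample minus the sample leaving the window
        window[i % n] = x
        # X_k(new) = (X_k(old) - x_oldest + x_newest) * exp(j*2*pi*k/n)
        bins = [(b + delta) * w for b, w in zip(bins, twiddle)]
    return bins


def jacobsen_frequency(bins):
    """Estimate frequency (cycles per sample) from the peak bin and its neighbors."""
    n = len(bins)
    k = max(range(n), key=lambda i: abs(bins[i]))
    lo, hi = bins[(k - 1) % n], bins[(k + 1) % n]
    # Jacobsen's fractional-bin offset from three complex DFT values.
    offset = -((hi - lo) / (2 * bins[k] - lo - hi)).real
    return (k + offset) / n
```

The recurrence costs one complex multiply-add per bin per sample instead of a full DFT, which is the source of the computational efficiency; the plain recurrence can accumulate rounding error over long runs, which is the issue stabilized variants address.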