
Showing 1–14 of 14 results for author: Hermansky, H

  1. arXiv:2303.12908  [pdf, other]

    eess.AS cs.SD

    Self-supervised Learning with Speech Modulation Dropout

    Authors: Samik Sadhu, Hynek Hermansky

    Abstract: We show that training a multi-headed self-attention-based deep network to predict deleted, information-dense 2-8 Hz speech modulations over a 1.5-second section of a speech utterance is an effective way to make machines learn to extract speech modulations using time-domain contextual information. Our work exhibits that, once trained on large volumes of unlabelled data, the outputs of the self-atte…

    Submitted 22 March, 2023; originally announced March 2023.

  2. arXiv:2303.04187  [pdf, other]

    cs.LG

    Stabilized training of joint energy-based models and their practical applications

    Authors: Martin Sustek, Samik Sadhu, Lukas Burget, Hynek Hermansky, Jesus Villalba, Laureano Moro-Velazquez, Najim Dehak

    Abstract: The recently proposed Joint Energy-based Model (JEM) interprets discriminatively trained classifier $p(y|x)$ as an energy model, which is also trained as a generative model describing the distribution of the input observations $p(x)$. The JEM training relies on "positive examples" (i.e. examples from the training data set) as well as on "negative examples", which are samples from the modeled distr…

    Submitted 7 March, 2023; originally announced March 2023.

  3. arXiv:2210.00117  [pdf, other]

    eess.AS cs.CL cs.SD

    Blind Signal Dereverberation for Machine Speech Recognition

    Authors: Samik Sadhu, Hynek Hermansky

    Abstract: We present a method to remove unknown convolutive noise introduced to speech by reverberations of recording environments, utilizing some amount of training speech data from the reverberant environment, and any available non-reverberant speech data. Using Fourier transform computed over long temporal windows, which ideally cover the entire room impulse response, we convert room induced convolution…

    Submitted 30 September, 2022; originally announced October 2022.

  4. arXiv:2204.00065  [pdf, other]

    eess.AS cs.SD

    Importance of Different Temporal Modulations of Speech: A Tale of Two Perspectives

    Authors: Samik Sadhu, Hynek Hermansky

    Abstract: How important are different temporal speech modulations for speech recognition? We answer this question from two complementary perspectives. Firstly, we quantify the amount of phonetic information in the modulation spectrum of speech by computing the mutual information between temporal modulations with frame-wise phoneme labels. Looking from another perspective, we ask - which speech modu…

    Submitted 22 March, 2023; v1 submitted 31 March, 2022; originally announced April 2022.

    Comments: Submitted to ICASSP 2023

  5. arXiv:2203.13216  [pdf, other]

    cs.SD eess.AS eess.SP

    Complex Frequency Domain Linear Prediction: A Tool to Compute Modulation Spectrum of Speech

    Authors: Samik Sadhu, Hynek Hermansky

    Abstract: Conventional Frequency Domain Linear Prediction (FDLP) technique models the squared Hilbert envelope of speech with varied degrees of approximation which can be sampled at the required frame rate and used as features for Automatic Speech Recognition (ASR). Although previously the complex cepstrum of the conventional FDLP model has been used as compact frame-wise speech features, it has lacked inte…

    Submitted 31 March, 2022; v1 submitted 24 March, 2022; originally announced March 2022.

    Comments: Submitted to INTERSPEECH 2022

  6. arXiv:2103.14129  [pdf, other]

    eess.AS cs.SD

    Radically Old Way of Computing Spectra: Applications in End-to-End ASR

    Authors: Samik Sadhu, Hynek Hermansky

    Abstract: We propose a technique to compute spectrograms using Frequency Domain Linear Prediction (FDLP) that uses all-pole models to fit the squared Hilbert envelope of speech in different frequency sub-bands. The spectrogram of a complete speech utterance is computed by overlap-add of contiguous all-pole model responses. A long context window of 1.5 seconds allows us to capture the low frequency temporal…

    Submitted 2 April, 2021; v1 submitted 25 March, 2021; originally announced March 2021.

    Comments: submitted to INTERSPEECH 2021

  7. arXiv:2102.03055  [pdf, other]

    cs.SD cs.CL eess.AS

    Two-Stage Augmentation and Adaptive CTC Fusion for Improved Robustness of Multi-Stream End-to-End ASR

    Authors: Ruizhi Li, Gregory Sell, Hynek Hermansky

    Abstract: Performance degradation of an Automatic Speech Recognition (ASR) system is commonly observed when the test acoustic condition is different from training. Hence, it is essential to make ASR systems robust against various environmental distortions, such as background noises and reverberations. In a multi-stream paradigm, improving robustness takes account of handling a variety of unseen single-strea…

    Submitted 5 February, 2021; originally announced February 2021.

    Comments: Accepted at IEEE SLT 2021

  8. arXiv:1910.10671  [pdf, other]

    cs.CL cs.LG eess.AS

    A practical two-stage training strategy for multi-stream end-to-end speech recognition

    Authors: Ruizhi Li, Gregory Sell, Xiaofei Wang, Shinji Watanabe, Hynek Hermansky

    Abstract: The multi-stream paradigm of audio processing, in which several sources are simultaneously considered, has been an active research area for information fusion. Our previous study offered a promising direction within end-to-end automatic speech recognition, where parallel encoders aim to capture diverse information followed by a stream-level fusion based on attention mechanisms to combine the diffe…

    Submitted 23 October, 2019; originally announced October 2019.

    Comments: submitted to ICASSP 2019

  9. arXiv:1906.08041  [pdf, other]

    eess.AS cs.CL cs.SD

    Multi-Stream End-to-End Speech Recognition

    Authors: Ruizhi Li, Xiaofei Wang, Sri Harish Mallidi, Shinji Watanabe, Takaaki Hori, Hynek Hermansky

    Abstract: Attention-based methods and Connectionist Temporal Classification (CTC) network have been promising research directions for end-to-end (E2E) Automatic Speech Recognition (ASR). The joint CTC/Attention model has achieved great success by utilizing both architectures during multi-task training and joint decoding. In this work, we present a multi-stream framework based on joint CTC/Attention E2E ASR…

    Submitted 18 October, 2019; v1 submitted 17 June, 2019; originally announced June 2019.

    Comments: submitted to IEEE TASLP (In review). arXiv admin note: substantial text overlap with arXiv:1811.04897, arXiv:1811.04903

  10. arXiv:1904.04896  [pdf, other]

    cs.CL

    Performance Monitoring for End-to-End Speech Recognition

    Authors: Ruizhi Li, Gregory Sell, Hynek Hermansky

    Abstract: Measuring performance of an automatic speech recognition (ASR) system without ground-truth could be beneficial in many scenarios, especially with data from unseen domains, where performance can be highly inconsistent. In conventional ASR systems, several performance monitoring (PM) techniques have been well-developed to monitor performance by looking at tri-phone posteriors or pre-softmax activati…

    Submitted 9 April, 2019; originally announced April 2019.

    Comments: Submitted to Interspeech 2019

  11. arXiv:1904.04294  [pdf, other]

    cs.CL cs.SD eess.AS

    Exploring Methods for the Automatic Detection of Errors in Manual Transcription

    Authors: Xiaofei Wang, Jinyi Yang, Ruizhi Li, Samik Sadhu, Hynek Hermansky

    Abstract: Quality of data plays an important role in most deep learning tasks. In the speech community, transcription of speech recording is indispensable. Since the transcription is usually generated artificially, automatically finding errors in manual transcriptions not only saves time and labors but benefits the performance of tasks that need the training process. Inspired by the success of hybrid automa…

    Submitted 21 July, 2019; v1 submitted 8 April, 2019; originally announced April 2019.

    Comments: Submitted in Interspeech 2019

  12. arXiv:1811.04903  [pdf, other]

    cs.CL cs.SD eess.AS

    Stream attention-based multi-array end-to-end speech recognition

    Authors: Xiaofei Wang, Ruizhi Li, Sri Harish Mallidi, Takaaki Hori, Shinji Watanabe, Hynek Hermansky

    Abstract: Automatic Speech Recognition (ASR) using multiple microphone arrays has achieved great success in the far-field robustness. Taking advantage of all the information that each array shares and contributes is crucial in this task. Motivated by the advances of joint Connectionist Temporal Classification (CTC)/attention mechanism in the End-to-End (E2E) ASR, a stream attention-based multi-array framewo…

    Submitted 18 February, 2019; v1 submitted 12 November, 2018; originally announced November 2018.

    Comments: Submitted to ICASSP 2019

  13. arXiv:1811.04897  [pdf, other]

    cs.CL

    Multi-encoder multi-resolution framework for end-to-end speech recognition

    Authors: Ruizhi Li, Xiaofei Wang, Sri Harish Mallidi, Takaaki Hori, Shinji Watanabe, Hynek Hermansky

    Abstract: Attention-based methods and Connectionist Temporal Classification (CTC) network have been promising research directions for end-to-end Automatic Speech Recognition (ASR). The joint CTC/Attention model has achieved great success by utilizing both architectures during multi-task training and joint decoding. In this work, we present a novel Multi-Encoder Multi-Resolution (MEMR) framework based on the…

    Submitted 12 November, 2018; originally announced November 2018.

  14. arXiv:1711.11141  [pdf, ps, other]

    cs.SD cs.HC eess.AS

    Stream Attention for far-field multi-microphone ASR

    Authors: Xiaofei Wang, Yonghong Yan, Hynek Hermansky

    Abstract: A stream attention framework has been applied to the posterior probabilities of the deep neural network (DNN) to improve the far-field automatic speech recognition (ASR) performance in the multi-microphone configuration. The stream attention scheme has been realized through an attention vector, which is derived by predicting the ASR performance from the phoneme posterior distribution of individual…

    Submitted 29 November, 2017; originally announced November 2017.