Showing 1–12 of 12 results for author: Hermansky, H

Searching in archive eess.

  1. arXiv:2506.07536  [pdf, ps, other]

    eess.AS

    Bayesian Learning for Domain-Invariant Speaker Verification and Anti-Spoofing

    Authors: Jin Li, Man-Wai Mak, Johan Rohdin, Kong Aik Lee, Hynek Hermansky

    Abstract: The performance of automatic speaker verification (ASV) and anti-spoofing drops seriously under real-world domain mismatch conditions. The relaxed instance frequency-wise normalization (RFN), which normalizes the frequency components based on the feature statistics along the time and channel axes, is a promising approach to reducing the domain dependence in the feature maps of a speaker embedding…

    Submitted 9 June, 2025; originally announced June 2025.

    Comments: Accepted to Interspeech 2025

  2. arXiv:2303.12908  [pdf, other]

    eess.AS cs.SD

    Self-supervised Learning with Speech Modulation Dropout

    Authors: Samik Sadhu, Hynek Hermansky

    Abstract: We show that training a multi-headed self-attention-based deep network to predict deleted, information-dense 2-8 Hz speech modulations over a 1.5-second section of a speech utterance is an effective way to make machines learn to extract speech modulations using time-domain contextual information. Our work exhibits that, once trained on large volumes of unlabelled data, the outputs of the self-atte…

    Submitted 22 March, 2023; originally announced March 2023.

  3. arXiv:2210.00117  [pdf, other]

    eess.AS cs.CL cs.SD

    Blind Signal Dereverberation for Machine Speech Recognition

    Authors: Samik Sadhu, Hynek Hermansky

    Abstract: We present a method to remove unknown convolutive noise introduced to speech by reverberations of recording environments, utilizing some amount of training speech data from the reverberant environment, and any available non-reverberant speech data. Using Fourier transform computed over long temporal windows, which ideally cover the entire room impulse response, we convert room induced convolution…

    Submitted 30 September, 2022; originally announced October 2022.

  4. arXiv:2204.00065  [pdf, other]

    eess.AS cs.SD

    Importance of Different Temporal Modulations of Speech: A Tale of Two Perspectives

    Authors: Samik Sadhu, Hynek Hermansky

    Abstract: How important are different temporal speech modulations for speech recognition? We answer this question from two complementary perspectives. Firstly, we quantify the amount of phonetic information in the modulation spectrum of speech by computing the mutual information between temporal modulations with frame-wise phoneme labels. Looking from another perspective, we ask - which speech modu…

    Submitted 22 March, 2023; v1 submitted 31 March, 2022; originally announced April 2022.

    Comments: Submitted to ICASSP 2023

  5. arXiv:2203.13216  [pdf, other]

    cs.SD eess.AS eess.SP

    Complex Frequency Domain Linear Prediction: A Tool to Compute Modulation Spectrum of Speech

    Authors: Samik Sadhu, Hynek Hermansky

    Abstract: Conventional Frequency Domain Linear Prediction (FDLP) technique models the squared Hilbert envelope of speech with varied degrees of approximation which can be sampled at the required frame rate and used as features for Automatic Speech Recognition (ASR). Although previously the complex cepstrum of the conventional FDLP model has been used as compact frame-wise speech features, it has lacked inte…

    Submitted 31 March, 2022; v1 submitted 24 March, 2022; originally announced March 2022.

    Comments: Submitted to INTERSPEECH 2022

  6. arXiv:2103.14129  [pdf, other]

    eess.AS cs.SD

    Radically Old Way of Computing Spectra: Applications in End-to-End ASR

    Authors: Samik Sadhu, Hynek Hermansky

    Abstract: We propose a technique to compute spectrograms using Frequency Domain Linear Prediction (FDLP) that uses all-pole models to fit the squared Hilbert envelope of speech in different frequency sub-bands. The spectrogram of a complete speech utterance is computed by overlap-add of contiguous all-pole model responses. A long context window of 1.5 seconds allows us to capture the low frequency temporal…

    Submitted 2 April, 2021; v1 submitted 25 March, 2021; originally announced March 2021.

    Comments: Submitted to INTERSPEECH 2021

  7. arXiv:2102.03055  [pdf, other]

    cs.SD cs.CL eess.AS

    Two-Stage Augmentation and Adaptive CTC Fusion for Improved Robustness of Multi-Stream End-to-End ASR

    Authors: Ruizhi Li, Gregory Sell, Hynek Hermansky

    Abstract: Performance degradation of an Automatic Speech Recognition (ASR) system is commonly observed when the test acoustic condition is different from training. Hence, it is essential to make ASR systems robust against various environmental distortions, such as background noises and reverberations. In a multi-stream paradigm, improving robustness takes account of handling a variety of unseen single-strea…

    Submitted 5 February, 2021; originally announced February 2021.

    Comments: Accepted at IEEE SLT 2021

  8. arXiv:1910.10671  [pdf, other]

    cs.CL cs.LG eess.AS

    A practical two-stage training strategy for multi-stream end-to-end speech recognition

    Authors: Ruizhi Li, Gregory Sell, Xiaofei Wang, Shinji Watanabe, Hynek Hermansky

    Abstract: The multi-stream paradigm of audio processing, in which several sources are simultaneously considered, has been an active research area for information fusion. Our previous study offered a promising direction within end-to-end automatic speech recognition, where parallel encoders aim to capture diverse information followed by a stream-level fusion based on attention mechanisms to combine the diffe…

    Submitted 23 October, 2019; originally announced October 2019.

    Comments: Submitted to ICASSP 2019

  9. arXiv:1906.08041  [pdf, other]

    eess.AS cs.CL cs.SD

    Multi-Stream End-to-End Speech Recognition

    Authors: Ruizhi Li, Xiaofei Wang, Sri Harish Mallidi, Shinji Watanabe, Takaaki Hori, Hynek Hermansky

    Abstract: Attention-based methods and Connectionist Temporal Classification (CTC) network have been promising research directions for end-to-end (E2E) Automatic Speech Recognition (ASR). The joint CTC/Attention model has achieved great success by utilizing both architectures during multi-task training and joint decoding. In this work, we present a multi-stream framework based on joint CTC/Attention E2E ASR…

    Submitted 18 October, 2019; v1 submitted 17 June, 2019; originally announced June 2019.

    Comments: submitted to IEEE TASLP (In review). arXiv admin note: substantial text overlap with arXiv:1811.04897, arXiv:1811.04903

  10. arXiv:1904.04294  [pdf, other]

    cs.CL cs.SD eess.AS

    Exploring Methods for the Automatic Detection of Errors in Manual Transcription

    Authors: Xiaofei Wang, Jinyi Yang, Ruizhi Li, Samik Sadhu, Hynek Hermansky

    Abstract: Quality of data plays an important role in most deep learning tasks. In the speech community, transcription of speech recording is indispensable. Since the transcription is usually generated artificially, automatically finding errors in manual transcriptions not only saves time and labor but also benefits the performance of tasks that need the training process. Inspired by the success of hybrid automa…

    Submitted 21 July, 2019; v1 submitted 8 April, 2019; originally announced April 2019.

    Comments: Submitted to Interspeech 2019

  11. arXiv:1811.04903  [pdf, other]

    cs.CL cs.SD eess.AS

    Stream attention-based multi-array end-to-end speech recognition

    Authors: Xiaofei Wang, Ruizhi Li, Sri Harish Mallidi, Takaaki Hori, Shinji Watanabe, Hynek Hermansky

    Abstract: Automatic Speech Recognition (ASR) using multiple microphone arrays has achieved great success in the far-field robustness. Taking advantage of all the information that each array shares and contributes is crucial in this task. Motivated by the advances of joint Connectionist Temporal Classification (CTC)/attention mechanism in the End-to-End (E2E) ASR, a stream attention-based multi-array framewo…

    Submitted 18 February, 2019; v1 submitted 12 November, 2018; originally announced November 2018.

    Comments: Submitted to ICASSP 2019

  12. arXiv:1711.11141  [pdf, ps, other]

    cs.SD cs.HC eess.AS

    Stream Attention for far-field multi-microphone ASR

    Authors: Xiaofei Wang, Yonghong Yan, Hynek Hermansky

    Abstract: A stream attention framework has been applied to the posterior probabilities of the deep neural network (DNN) to improve the far-field automatic speech recognition (ASR) performance in the multi-microphone configuration. The stream attention scheme has been realized through an attention vector, which is derived by predicting the ASR performance from the phoneme posterior distribution of individual…

    Submitted 29 November, 2017; originally announced November 2017.