Showing 1–8 of 8 results for author: Serdyuk, D

Searching in archive eess.
  1. USM RNN-T model weights binarization

    Authors: Oleg Rybakov, Dmitriy Serdyuk, Chengjian Zheng

    Abstract: Large-scale universal speech models (USM) are already used in production. However, as the model size grows, the serving cost grows too. Serving cost of large models is dominated by model size that is why model size reduction is an important research topic. In this work we are focused on model size reduction using weights only quantization. We present the weights binarization of USM Recurrent Neura…

    Submitted 5 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

  2. arXiv:2312.10088  [pdf, ps, other]

    eess.AS cs.CV cs.LG cs.SD

    On Robustness to Missing Video for Audiovisual Speech Recognition

    Authors: Oscar Chang, Otavio Braga, Hank Liao, Dmitriy Serdyuk, Olivier Siohan

    Abstract: It has been shown that learning audiovisual features can lead to improved speech recognition performance over audio-only features, especially for noisy speech. However, in many common applications, the visual features are partially or entirely missing, e.g. the speaker might move off screen. Multi-modal models need to be robust: missing video frames should not degrade the performance of an audiovi…

    Submitted 18 December, 2023; v1 submitted 13 December, 2023; originally announced December 2023.

  3. arXiv:2312.09369  [pdf, other]

    cs.SD cs.AI eess.AS

    Audio-visual fine-tuning of audio-only ASR models

    Authors: Avner May, Dmitriy Serdyuk, Ankit Parag Shah, Otavio Braga, Olivier Siohan

    Abstract: Audio-visual automatic speech recognition (AV-ASR) models are very effective at reducing word error rates on noisy speech, but require large amounts of transcribed AV training data. Recently, audio-visual self-supervised learning (SSL) approaches have been developed to reduce this dependence on transcribed AV data, but these methods are quite complex and computationally expensive. In this work, we…

    Submitted 14 December, 2023; originally announced December 2023.

  4. arXiv:2302.10915  [pdf, other]

    cs.LG cs.CL cs.CV cs.SD eess.AS

    Conformers are All You Need for Visual Speech Recognition

    Authors: Oscar Chang, Hank Liao, Dmitriy Serdyuk, Ankit Shah, Olivier Siohan

    Abstract: Visual speech recognition models extract visual features in a hierarchical manner. At the lower level, there is a visual front-end with a limited temporal receptive field that processes the raw pixels depicting the lips or faces. At the higher level, there is an encoder that attends to the embeddings produced by the front-end over a large temporal receptive field. Previous work has focused on impr…

    Submitted 12 December, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

  5. arXiv:2201.10439  [pdf, other]

    cs.CV cs.LG cs.SD eess.AS

    Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-Person Video

    Authors: Dmitriy Serdyuk, Otavio Braga, Olivier Siohan

    Abstract: Audio-visual automatic speech recognition (AV-ASR) extends speech recognition by introducing the video modality as an additional source of information. In this work, the information contained in the motion of the speaker's mouth is used to augment the audio features. The video modality is traditionally processed with a 3D convolutional neural network (e.g. 3D version of VGG). Recently, image trans…

    Submitted 31 October, 2022; v1 submitted 25 January, 2022; originally announced January 2022.

    Comments: 5 pages, 3 figures, published at Interspeech 2022

  6. arXiv:1808.05777  [pdf, other]

    eess.AS cs.LG cs.SD

    Unsupervised adversarial domain adaptation for acoustic scene classification

    Authors: Shayan Gharib, Konstantinos Drossos, Emre Çakir, Dmitriy Serdyuk, Tuomas Virtanen

    Abstract: A general problem in acoustic scene classification task is the mismatched conditions between training and testing data, which significantly reduces the performance of the developed methods on classification accuracy. As a countermeasure, we present the first method of unsupervised adversarial domain adaptation for acoustic scene classification. We employ a model pre-trained on data from one set of…

    Submitted 17 August, 2018; originally announced August 2018.

  7. arXiv:1804.05374  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.NE

    Twin Regularization for online speech recognition

    Authors: Mirco Ravanelli, Dmitriy Serdyuk, Yoshua Bengio

    Abstract: Online speech recognition is crucial for developing natural human-machine interfaces. This modality, however, is significantly more challenging than off-line ASR, since real-time/low-latency constraints inevitably hinder the use of future information, that is known to be very helpful to perform robust predictions. A popular solution to mitigate this issue consists of feeding neural acoustic models…

    Submitted 11 June, 2018; v1 submitted 15 April, 2018; originally announced April 2018.

    Comments: Accepted at INTERSPEECH 2018

  8. arXiv:1802.00300  [pdf, other]

    cs.SD eess.AS

    MaD TwinNet: Masker-Denoiser Architecture with Twin Networks for Monaural Sound Source Separation

    Authors: Konstantinos Drossos, Stylianos Ioannis Mimilakis, Dmitriy Serdyuk, Gerald Schuller, Tuomas Virtanen, Yoshua Bengio

    Abstract: Monaural singing voice separation task focuses on the prediction of the singing voice from a single channel music mixture signal. Current state of the art (SOTA) results in monaural singing voice separation are obtained with deep learning based methods. In this work we present a novel deep learning based method that learns long-term temporal patterns and structures of a musical piece. We build upo…

    Submitted 1 February, 2018; originally announced February 2018.