Showing 1–20 of 20 results for author: Serdyuk, D

  1. USM RNN-T model weights binarization

    Authors: Oleg Rybakov, Dmitriy Serdyuk, Chengjian Zheng

    Abstract: Large-scale universal speech models (USM) are already used in production. However, as the model size grows, the serving cost grows too. The serving cost of large models is dominated by model size, which is why model size reduction is an important research topic. In this work we focus on model size reduction using weights-only quantization. We present the weights binarization of USM Recurrent Neura…

    Submitted 5 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

  2. arXiv:2312.10088

    eess.AS cs.CV cs.LG cs.SD

    On Robustness to Missing Video for Audiovisual Speech Recognition

    Authors: Oscar Chang, Otavio Braga, Hank Liao, Dmitriy Serdyuk, Olivier Siohan

    Abstract: It has been shown that learning audiovisual features can lead to improved speech recognition performance over audio-only features, especially for noisy speech. However, in many common applications, the visual features are partially or entirely missing, e.g. the speaker might move off screen. Multi-modal models need to be robust: missing video frames should not degrade the performance of an audiovi…

    Submitted 18 December, 2023; v1 submitted 13 December, 2023; originally announced December 2023.

  3. arXiv:2312.09369

    cs.SD cs.AI eess.AS

    Audio-visual fine-tuning of audio-only ASR models

    Authors: Avner May, Dmitriy Serdyuk, Ankit Parag Shah, Otavio Braga, Olivier Siohan

    Abstract: Audio-visual automatic speech recognition (AV-ASR) models are very effective at reducing word error rates on noisy speech, but require large amounts of transcribed AV training data. Recently, audio-visual self-supervised learning (SSL) approaches have been developed to reduce this dependence on transcribed AV data, but these methods are quite complex and computationally expensive. In this work, we…

    Submitted 14 December, 2023; originally announced December 2023.

  4. arXiv:2302.10915

    cs.LG cs.CL cs.CV cs.SD eess.AS

    Conformers are All You Need for Visual Speech Recognition

    Authors: Oscar Chang, Hank Liao, Dmitriy Serdyuk, Ankit Shah, Olivier Siohan

    Abstract: Visual speech recognition models extract visual features in a hierarchical manner. At the lower level, there is a visual front-end with a limited temporal receptive field that processes the raw pixels depicting the lips or faces. At the higher level, there is an encoder that attends to the embeddings produced by the front-end over a large temporal receptive field. Previous work has focused on impr…

    Submitted 12 December, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

  5. arXiv:2201.10439

    cs.CV cs.LG cs.SD eess.AS

    Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-Person Video

    Authors: Dmitriy Serdyuk, Otavio Braga, Olivier Siohan

    Abstract: Audio-visual automatic speech recognition (AV-ASR) extends speech recognition by introducing the video modality as an additional source of information. In this work, the information contained in the motion of the speaker's mouth is used to augment the audio features. The video modality is traditionally processed with a 3D convolutional neural network (e.g. 3D version of VGG). Recently, image trans…

    Submitted 31 October, 2022; v1 submitted 25 January, 2022; originally announced January 2022.

    Comments: 5 pages, 3 figures, published at Interspeech 2022

  6. arXiv:2109.09536

    cs.CV cs.LG

    Audio-Visual Speech Recognition is Worth 32×32×8 Voxels

    Authors: Dmitriy Serdyuk, Otavio Braga, Olivier Siohan

    Abstract: Audio-visual automatic speech recognition (AV-ASR) introduces the video modality into the speech recognition process, often by relying on information conveyed by the motion of the speaker's mouth. The use of the video signal requires extracting visual features, which are then combined with the acoustic features to build an AV-ASR system [1]. This is traditionally done with some form of 3D convolut…

    Submitted 20 September, 2021; originally announced September 2021.

    Comments: 7 pages, 2 figures, 4 tables. A draft for a paper accepted to ASRU workshop

  7. arXiv:2103.03098

    cs.LG stat.ML

    Accounting for Variance in Machine Learning Benchmarks

    Authors: Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Naz Sepah, Edward Raff, Kanika Madan, Vikram Voleti, Samira Ebrahimi Kahou, Vincent Michalski, Dmitriy Serdyuk, Tal Arbel, Chris Pal, Gaël Varoquaux, Pascal Vincent

    Abstract: Strong empirical evidence that one machine-learning algorithm A outperforms another one B ideally calls for multiple trials optimizing the learning pipeline over sources of variation such as data sampling, data augmentation, parameter initialization, and hyperparameters choices. This is prohibitively expensive, and corners are cut to reach conclusions. We model the whole benchmarking process, reve…

    Submitted 1 March, 2021; originally announced March 2021.

    Comments: Submitted to MLSys2021

  8. arXiv:1808.05777

    eess.AS cs.LG cs.SD

    Unsupervised adversarial domain adaptation for acoustic scene classification

    Authors: Shayan Gharib, Konstantinos Drossos, Emre Çakir, Dmitriy Serdyuk, Tuomas Virtanen

    Abstract: A general problem in acoustic scene classification task is the mismatched conditions between training and testing data, which significantly reduces the performance of the developed methods on classification accuracy. As a countermeasure, we present the first method of unsupervised adversarial domain adaptation for acoustic scene classification. We employ a model pre-trained on data from one set of…

    Submitted 17 August, 2018; originally announced August 2018.

  9. arXiv:1804.05374

    eess.AS cs.AI cs.CL cs.LG cs.NE

    Twin Regularization for online speech recognition

    Authors: Mirco Ravanelli, Dmitriy Serdyuk, Yoshua Bengio

    Abstract: Online speech recognition is crucial for developing natural human-machine interfaces. This modality, however, is significantly more challenging than off-line ASR, since real-time/low-latency constraints inevitably hinder the use of future information, that is known to be very helpful to perform robust predictions. A popular solution to mitigate this issue consists of feeding neural acoustic models…

    Submitted 11 June, 2018; v1 submitted 15 April, 2018; originally announced April 2018.

    Comments: Accepted at INTERSPEECH 2018

  10. arXiv:1804.02485

    stat.ML cs.LG

    Fortified Networks: Improving the Robustness of Deep Networks by Modeling the Manifold of Hidden Representations

    Authors: Alex Lamb, Jonathan Binas, Anirudh Goyal, Dmitriy Serdyuk, Sandeep Subramanian, Ioannis Mitliagkas, Yoshua Bengio

    Abstract: Deep networks have achieved impressive results across a variety of important tasks. However a known weakness is a failure to perform well when evaluated on data which differ from the training distribution, even if these differences are very small, as is the case with adversarial examples. We propose Fortified Networks, a simple transformation of existing networks, which fortifies the hidden layers…

    Submitted 6 April, 2018; originally announced April 2018.

    Comments: Under Review ICML 2018

  11. arXiv:1802.08395

    cs.CL

    Towards end-to-end spoken language understanding

    Authors: Dmitriy Serdyuk, Yongqiang Wang, Christian Fuegen, Anuj Kumar, Baiyang Liu, Yoshua Bengio

    Abstract: A spoken language understanding system is traditionally designed as a pipeline of a number of components. First, the audio signal is processed by an automatic speech recognizer for transcription or n-best hypotheses. With the recognition results, a natural language understanding system classifies the text to structured data as domain, intent and slots for down-streaming consumers, such as dialog sys…

    Submitted 23 February, 2018; originally announced February 2018.

    Comments: submitted to ICASSP 2018

  12. arXiv:1802.00300

    cs.SD eess.AS

    MaD TwinNet: Masker-Denoiser Architecture with Twin Networks for Monaural Sound Source Separation

    Authors: Konstantinos Drossos, Stylianos Ioannis Mimilakis, Dmitriy Serdyuk, Gerald Schuller, Tuomas Virtanen, Yoshua Bengio

    Abstract: Monaural singing voice separation task focuses on the prediction of the singing voice from a single channel music mixture signal. Current state of the art (SOTA) results in monaural singing voice separation are obtained with deep learning based methods. In this work we present a novel deep learning based method that learns long-term temporal patterns and structures of a musical piece. We build upo…

    Submitted 1 February, 2018; originally announced February 2018.

  13. arXiv:1708.06742

    cs.LG stat.ML

    Twin Networks: Matching the Future for Sequence Generation

    Authors: Dmitriy Serdyuk, Nan Rosemary Ke, Alessandro Sordoni, Adam Trischler, Chris Pal, Yoshua Bengio

    Abstract: We propose a simple technique for encouraging generative RNNs to plan ahead. We train a "backward" recurrent network to generate a given sequence in reverse order, and we encourage states of the forward model to predict cotemporal states of the backward model. The backward network is used only during training, and plays no role during sampling or inference. We hypothesize that our approach eases m…

    Submitted 23 February, 2018; v1 submitted 22 August, 2017; originally announced August 2017.

    Comments: 12 pages, 3 figures, published at ICLR 2018

  14. arXiv:1705.09792

    cs.NE cs.LG

    Deep Complex Networks

    Authors: Chiheb Trabelsi, Olexa Bilaniuk, Ying Zhang, Dmitriy Serdyuk, Sandeep Subramanian, João Felipe Santos, Soroush Mehri, Negar Rostamzadeh, Yoshua Bengio, Christopher J Pal

    Abstract: At present, the vast majority of building blocks, techniques, and architectures for deep learning are based on real-valued operations and representations. However, recent work on recurrent neural networks and older fundamental theoretical analysis suggests that complex numbers could have a richer representational capacity and could also facilitate noise-robust memory retrieval mechanisms. Despite…

    Submitted 25 February, 2018; v1 submitted 27 May, 2017; originally announced May 2017.

  15. arXiv:1612.01928

    cs.CL cs.CV cs.LG cs.SD stat.ML

    Invariant Representations for Noisy Speech Recognition

    Authors: Dmitriy Serdyuk, Kartik Audhkhasi, Philémon Brakel, Bhuvana Ramabhadran, Samuel Thomas, Yoshua Bengio

    Abstract: Modern automatic speech recognition (ASR) systems need to be robust under acoustic variability arising from environmental, speaker, channel, and recording conditions. Ensuring such robustness to variability is a challenge in modern day neural network-based ASR systems, especially when all types of variability are not seen during training. We attempt to address this problem by encouraging the neura…

    Submitted 27 November, 2016; originally announced December 2016.

    Comments: 5 pages, 1 figure, 1 table, NIPS workshop on end-to-end speech recognition

  16. arXiv:1605.02688

    cs.SC cs.LG cs.MS

    Theano: A Python framework for fast computation of mathematical expressions

    Authors: The Theano Development Team, Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, Alexander Belopolsky, Yoshua Bengio, Arnaud Bergeron, James Bergstra, Valentin Bisson, Josh Bleecher Snyder, Nicolas Bouchard, Nicolas Boulanger-Lewandowski, Xavier Bouthillier, Alexandre de Brébisson, Olivier Breuleux, Pierre-Luc Carrier, Kyunghyun Cho, Jan Chorowski, Paul Christiano, et al. (88 additional authors not shown)

    Abstract: Theano is a Python library that allows to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Since its introduction, it has been one of the most used CPU and GPU mathematical compilers - especially in the machine learning community - and has shown steady performance improvements. Theano is being actively and continuously developed since 2008, mu…

    Submitted 9 May, 2016; originally announced May 2016.

    Comments: 19 pages, 5 figures

  17. arXiv:1511.06456

    cs.LG

    Task Loss Estimation for Sequence Prediction

    Authors: Dzmitry Bahdanau, Dmitriy Serdyuk, Philémon Brakel, Nan Rosemary Ke, Jan Chorowski, Aaron Courville, Yoshua Bengio

    Abstract: Often, the performance on a supervised machine learning task is evaluated with a task loss function that cannot be optimized directly. Examples of such loss functions include the classification error, the edit distance and the BLEU score. A common workaround for this problem is to instead optimize a surrogate loss function, such as for instance cross-entropy or hinge loss. In order for…

    Submitted 19 January, 2016; v1 submitted 19 November, 2015; originally announced November 2015.

    Comments: Submitted to ICLR 2016

  18. arXiv:1508.04395

    cs.CL cs.AI cs.LG cs.NE

    End-to-End Attention-based Large Vocabulary Speech Recognition

    Authors: Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, Yoshua Bengio

    Abstract: Many of the current state-of-the-art Large Vocabulary Continuous Speech Recognition Systems (LVCSR) are hybrids of neural networks and Hidden Markov Models (HMMs). Most of these systems contain separate components that deal with the acoustic modelling, language modelling and sequence decoding. We investigate a more direct approach in which the HMM is replaced with a Recurrent Neural Network (RNN)…

    Submitted 14 March, 2016; v1 submitted 18 August, 2015; originally announced August 2015.

  19. arXiv:1506.07503

    cs.CL cs.LG cs.NE stat.ML

    Attention-Based Models for Speech Recognition

    Authors: Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio

    Abstract: Recurrent sequence generators conditioned on input data through an attention mechanism have recently shown very good performance on a range of tasks including machine translation, handwriting synthesis and image caption generation. We extend the attention-mechanism with features needed for speech recognition. We show that while an adaptation of the model used for machine translation in reaches…

    Submitted 24 June, 2015; originally announced June 2015.

  20. arXiv:1506.00619

    cs.LG cs.NE stat.ML

    Blocks and Fuel: Frameworks for deep learning

    Authors: Bart van Merriënboer, Dzmitry Bahdanau, Vincent Dumoulin, Dmitriy Serdyuk, David Warde-Farley, Jan Chorowski, Yoshua Bengio

    Abstract: We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training th…

    Submitted 1 June, 2015; originally announced June 2015.