Showing 1–11 of 11 results for author: Gosztolya, G

  1. arXiv:2506.03831  [pdf, ps, other]

    cs.SD cs.MM eess.AS

    Conformer-based Ultrasound-to-Speech Conversion

    Authors: Ibrahim Ibrahimov, Csaba Zainkó, Gábor Gosztolya

    Abstract: Deep neural networks have shown promising potential for the ultrasound-to-speech conversion task towards Silent Speech Interfaces. In this work, we applied two Conformer-based DNN architectures (Base and one with bi-LSTM) for this task. Speaker-specific models were trained on the data of four speakers from the Ultrasuite-Tal80 dataset, while the generated mel spectrograms were synthesized to audio wav…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: accepted to Interspeech 2025

  2. Adaptation of Tongue Ultrasound-Based Silent Speech Interfaces Using Spatial Transformer Networks

    Authors: László Tóth, Amin Honarmandi Shandiz, Gábor Gosztolya, Tamás Gábor Csapó

    Abstract: Thanks to the latest deep learning algorithms, silent speech interfaces (SSI) are now able to synthesize intelligible speech from articulatory movement data under certain conditions. However, the resulting models are rather speaker-specific, making a quick switch between users troublesome. Even for the same speaker, these models perform poorly cross-session, i.e. after dismounting and re-mounting…

    Submitted 17 October, 2023; v1 submitted 30 May, 2023; originally announced May 2023.

    Comments: 5 pages, 3 figures, 3 tables

    Journal ref: the Proceedings of Interspeech 2023

  3. arXiv:2107.12051  [pdf, other]

    eess.AS cs.AI cs.SD

    Adaptation of Tacotron2-based Text-To-Speech for Articulatory-to-Acoustic Mapping using Ultrasound Tongue Imaging

    Authors: Csaba Zainkó, László Tóth, Amin Honarmandi Shandiz, Gábor Gosztolya, Alexandra Markó, Géza Németh, Tamás Gábor Csapó

    Abstract: For articulatory-to-acoustic mapping, typically only limited parallel training data is available, making it impossible to apply fully end-to-end solutions like Tacotron2. In this paper, we experimented with transfer learning and adaptation of a Tacotron2 text-to-speech model to improve the final synthesis quality of ultrasound-based articulatory-to-acoustic mapping with a limited database. We use…

    Submitted 26 July, 2021; originally announced July 2021.

    Comments: accepted at SSW11. arXiv admin note: text overlap with arXiv:2008.03152

  4. arXiv:2107.02003  [pdf, other]

    eess.AS cs.SD

    Speech Synthesis from Text and Ultrasound Tongue Image-based Articulatory Input

    Authors: Tamás Gábor Csapó, László Tóth, Gábor Gosztolya, Alexandra Markó

    Abstract: Articulatory information has been shown to be effective in improving the performance of HMM-based and DNN-based text-to-speech synthesis. Speech synthesis research focuses traditionally on text-to-speech conversion, when the input is text or an estimated linguistic representation, and the target is synthesized speech. However, a research field that has risen in the last decade is articulation-to-s…

    Submitted 5 July, 2021; originally announced July 2021.

    Comments: accepted at SSW11 (11th Speech Synthesis Workshop)

  5. arXiv:2106.04552  [pdf, ps, other]

    cs.SD

    Neural Speaker Embeddings for Ultrasound-based Silent Speech Interfaces

    Authors: Amin Honarmandi Shandiz, László Tóth, Gábor Gosztolya, Alexandra Markó, Tamás Gábor Csapó

    Abstract: Articulatory-to-acoustic mapping seeks to reconstruct speech from a recording of the articulatory movements, for example, an ultrasound video. Just like speech signals, these recordings represent not only the linguistic content, but are also highly specific to the actual speaker. Hence, due to the lack of multi-speaker data sets, researchers have so far concentrated on speaker-dependent modeling.…

    Submitted 11 June, 2021; v1 submitted 8 June, 2021; originally announced June 2021.

    Comments: 5 pages, 3 figures, 3 tables

  6. Improving Neural Silent Speech Interface Models by Adversarial Training

    Authors: Amin Honarmandi Shandiz, László Tóth, Gábor Gosztolya, Alexandra Markó, Tamás Gábor Csapó

    Abstract: Besides the well-known classification task, these days neural networks are frequently being applied to generate or transform data, such as images and audio signals. In such tasks, the conventional loss functions like the mean squared error (MSE) may not give satisfactory results. To improve the perceptual quality of the generated signals, one possibility is to increase their similarity to real sig…

    Submitted 23 April, 2021; originally announced April 2021.

    Comments: 11 pages, 3 tables, 2 figures

  7. arXiv:2008.03183  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Applying Speech Tempo-Derived Features, BoAW and Fisher Vectors to Detect Elderly Emotion and Speech in Surgical Masks

    Authors: Gábor Gosztolya, László Tóth

    Abstract: The 2020 INTERSPEECH Computational Paralinguistics Challenge (ComParE) consists of three Sub-Challenges, where the tasks are to identify the level of arousal and valence of elderly speakers, determine whether the actual speaker is wearing a surgical mask, and estimate the actual breathing of the speaker. In our contribution to the Challenge, we focus on the Elderly Emotion and the Mask sub-challenges…

    Submitted 7 August, 2020; originally announced August 2020.

    Comments: rejected from Interspeech, ComParE Challenge (Mask & Elderly Emotion Sub-Challenges)

  8. arXiv:2008.03152  [pdf, other]

    eess.AS cs.SD

    Ultrasound-based Articulatory-to-Acoustic Mapping with WaveGlow Speech Synthesis

    Authors: Tamás Gábor Csapó, Csaba Zainkó, László Tóth, Gábor Gosztolya, Alexandra Markó

    Abstract: For articulatory-to-acoustic mapping using deep neural networks, typically spectral and excitation parameters of vocoders have been used as the training targets. However, vocoding often results in buzzy and muffled final speech quality. Therefore, in this paper on ultrasound-based articulatory-to-acoustic conversion, we use a flow-based neural vocoder (WaveGlow) pre-trained on a large amount of En…

    Submitted 6 August, 2020; originally announced August 2020.

    Comments: 5 pages, accepted for publication at Interspeech 2020. arXiv admin note: substantial text overlap with arXiv:1906.09885

  9. arXiv:1906.09885  [pdf, other]

    cs.SD eess.AS eess.IV

    Ultrasound-based Silent Speech Interface Built on a Continuous Vocoder

    Authors: Tamás Gábor Csapó, Mohammed Salah Al-Radhi, Géza Németh, Gábor Gosztolya, Tamás Grósz, László Tóth, Alexandra Markó

    Abstract: Recently it was shown that within the Silent Speech Interface (SSI) field, the prediction of F0 is possible from Ultrasound Tongue Images (UTI) as the articulatory input, using Deep Neural Networks for articulatory-to-acoustic mapping. Moreover, text-to-speech synthesizers were shown to produce higher quality speech when using a continuous pitch estimate, which takes non-zero pitch values even whe…

    Submitted 24 June, 2019; originally announced June 2019.

    Comments: 5 pages, 3 figures, accepted for publication at Interspeech 2019

  10. arXiv:1904.05259  [pdf, other]

    cs.SD eess.AS

    Autoencoder-Based Articulatory-to-Acoustic Mapping for Ultrasound Silent Speech Interfaces

    Authors: Gábor Gosztolya, Ádám Pintér, László Tóth, Tamás Grósz, Alexandra Markó, Tamás Gábor Csapó

    Abstract: When using ultrasound video as input, Deep Neural Network-based Silent Speech Interfaces usually rely on the whole image to estimate the spectral parameters required for the speech synthesis step. Although this approach is quite straightforward, and it permits the synthesis of understandable speech, it has several disadvantages as well. Besides the inability to capture the relations between close…

    Submitted 10 April, 2019; originally announced April 2019.

    Comments: 8 pages, 6 figures, Accepted to IJCNN 2019

  11. GMM-Free Flat Start Sequence-Discriminative DNN Training

    Authors: Gábor Gosztolya, Tamás Grósz, László Tóth

    Abstract: Recently, attempts have been made to remove Gaussian mixture models (GMM) from the training process of deep neural network-based hidden Markov models (HMM/DNN). For the GMM-free training of a HMM/DNN hybrid we have to solve two problems, namely the initial alignment of the frame-level state labels and the creation of context-dependent states. Although flat-start training via iteratively realigning…

    Submitted 11 October, 2016; originally announced October 2016.