Showing 1–21 of 21 results for author: Csapó, T G

  1. arXiv:2402.16996  [pdf, other]

    cs.HC cs.LG cs.SD eess.AS q-bio.NC

    Towards Decoding Brain Activity During Passive Listening of Speech

    Authors: Milán András Fodor, Tamás Gábor Csapó, Frigyes Viktor Arthur

    Abstract: The aim of the study is to investigate the complex mechanisms of speech perception and ultimately decode the electrical changes in the brain occurring while listening to speech. We attempt to decode heard speech from intracranial electroencephalographic (iEEG) data using deep learning methods. The goal is to aid the advancement of brain-computer interface (BCI) technology for speech synthesis, and,…

    Submitted 26 February, 2024; originally announced February 2024.

    Comments: 27 pages, 7 figures

  2. arXiv:2306.05374  [pdf, other]

    physics.med-ph cs.SD eess.AS eess.IV

    Towards Ultrasound Tongue Image prediction from EEG during speech production

    Authors: Tamás Gábor Csapó, Frigyes Viktor Arthur, Péter Nagy, Ádám Boncz

    Abstract: Previous initial research has already been carried out to propose speech-based BCI using brain signals (e.g. non-invasive EEG and invasive sEEG / ECoG), but there is a lack of combined methods that investigate non-invasive brain, articulation, and speech signals together and analyze the cognitive processes in the brain, the kinematics of the articulatory movement and the resulting speech signal. I…

    Submitted 18 October, 2023; v1 submitted 22 May, 2023; originally announced June 2023.

    Comments: accepted at Interspeech 2023

    Journal ref: Proceedings of Interspeech 2023

  3. arXiv:2208.07122  [pdf]

    cs.SD eess.AS

    Towards Parametric Speech Synthesis Using Gaussian-Markov Model of Spectral Envelope and Wavelet-Based Decomposition of F0

    Authors: Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Csaba Zainkó, Géza Németh

    Abstract: Neural network-based Text-to-Speech has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron2, FastSpeech, FastPitch) usually generate a Mel-spectrogram from text and then synthesize speech using a vocoder (e.g., WaveNet, WaveGlow, HiFiGAN). Compared with traditional parametric approaches (e.g., STRAIGHT and WORLD), neural vocoder based end-to-end models suffer f…

    Submitted 15 August, 2022; originally announced August 2022.

    Comments: accepted at EUSIPCO2022

  4. arXiv:2108.01154  [pdf, other]

    cs.SD eess.AS

    Speaker Adaptation with Continuous Vocoder-based DNN-TTS

    Authors: Ali Raheem Mandeel, Mohammed Salah Al-Radhi, Tamás Gábor Csapó

    Abstract: Traditional vocoder-based statistical parametric speech synthesis can be advantageous in applications that require low computational complexity. Recent neural vocoders, which can produce high naturalness, still cannot fulfill the requirement of being real-time during synthesis. In this paper, we experiment with our earlier continuous vocoder, in which the excitation is modeled with two one-dimensi…

    Submitted 2 August, 2021; originally announced August 2021.

    Comments: 10 pages, 3 figures, 23RD INTERNATIONAL CONFERENCE ON SPEECH AND COMPUTER SPECOM 2021

  5. arXiv:2107.12051  [pdf, other]

    eess.AS cs.AI cs.SD

    Adaptation of Tacotron2-based Text-To-Speech for Articulatory-to-Acoustic Mapping using Ultrasound Tongue Imaging

    Authors: Csaba Zainkó, László Tóth, Amin Honarmandi Shandiz, Gábor Gosztolya, Alexandra Markó, Géza Németh, Tamás Gábor Csapó

    Abstract: For articulatory-to-acoustic mapping, typically only limited parallel training data is available, making it impossible to apply fully end-to-end solutions like Tacotron2. In this paper, we experimented with transfer learning and adaptation of a Tacotron2 text-to-speech model to improve the final synthesis quality of ultrasound-based articulatory-to-acoustic mapping with a limited database. We use…

    Submitted 26 July, 2021; originally announced July 2021.

    Comments: accepted at SSW11. arXiv admin note: text overlap with arXiv:2008.03152

  6. arXiv:2107.05550  [pdf, other]

    eess.AS cs.SD

    Extending Text-to-Speech Synthesis with Articulatory Movement Prediction using Ultrasound Tongue Imaging

    Authors: Tamás Gábor Csapó

    Abstract: In this paper, we present our first experiments in text-to-articulation prediction, using ultrasound tongue image targets. We extend a traditional (vocoder-based) DNN-TTS framework with predicting PCA-compressed ultrasound images, of which the continuous tongue motion can be reconstructed in synchrony with synthesized speech. We use the data of eight speakers, train fully connected and recurrent n…

    Submitted 12 July, 2021; originally announced July 2021.

    Comments: accepted at SSW11 (11th Speech Synthesis Workshop). arXiv admin note: text overlap with arXiv:2107.02003

  7. arXiv:2107.02003  [pdf, other]

    eess.AS cs.SD

    Speech Synthesis from Text and Ultrasound Tongue Image-based Articulatory Input

    Authors: Tamás Gábor Csapó, László Tóth, Gábor Gosztolya, Alexandra Markó

    Abstract: Articulatory information has been shown to be effective in improving the performance of HMM-based and DNN-based text-to-speech synthesis. Speech synthesis research traditionally focuses on text-to-speech conversion, where the input is text or an estimated linguistic representation, and the target is synthesized speech. However, a research field that has risen in the last decade is articulation-to-s…

    Submitted 5 July, 2021; originally announced July 2021.

    Comments: accepted at SSW11 (11th Speech Synthesis Workshop)

  8. arXiv:2106.10481  [pdf]

    cs.SD cs.AI eess.AS

    Advances in Speech Vocoding for Text-to-Speech with Continuous Parameters

    Authors: Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Géza Németh

    Abstract: Vocoders have received renewed attention as main components in statistical parametric text-to-speech (TTS) synthesis and speech transformation systems. Even though some vocoding techniques give almost acceptable synthesized speech, their high computational complexity and irregular structures are still considered challenging concerns, which yield a variety of voice quality degradations. Therefore, thi…

    Submitted 19 June, 2021; originally announced June 2021.

    Comments: 6 pages, 3 figures, International Conference on Artificial Intelligence and Speech Technology (AIST2020)

  9. arXiv:2106.06863  [pdf]

    cs.SD eess.AS

    Continuous Wavelet Vocoder-based Decomposition of Parametric Speech Waveform Synthesis

    Authors: Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Csaba Zainkó, Géza Németh

    Abstract: To date, various speech technology systems have adopted the vocoder approach, a method for synthesizing speech waveforms that plays a major role in the performance of statistical parametric speech synthesis. WaveNet, one of the best models that nearly resembles the human voice, has to generate a waveform in a time-consuming sequential manner with an extremely complex structure of its neural networks…

    Submitted 12 June, 2021; originally announced June 2021.

    Comments: 5 pages, 4 figures, accepted to the conference of Interspeech 2021

  10. arXiv:2106.04552  [pdf, ps, other]

    cs.SD

    Neural Speaker Embeddings for Ultrasound-based Silent Speech Interfaces

    Authors: Amin Honarmandi Shandiz, László Tóth, Gábor Gosztolya, Alexandra Markó, Tamás Gábor Csapó

    Abstract: Articulatory-to-acoustic mapping seeks to reconstruct speech from a recording of the articulatory movements, for example, an ultrasound video. Just like speech signals, these recordings represent not only the linguistic content, but are also highly specific to the actual speaker. Hence, due to the lack of multi-speaker data sets, researchers have so far concentrated on speaker-dependent modeling.…

    Submitted 11 June, 2021; v1 submitted 8 June, 2021; originally announced June 2021.

    Comments: 5 pages, 3 figures, 3 tables

  11. arXiv:2104.14467  [pdf, other]

    cs.CV

    Towards a practical lip-to-speech conversion system using deep neural networks and mobile application frontend

    Authors: Frigyes Viktor Arthur, Tamás Gábor Csapó

    Abstract: Articulatory-to-acoustic (forward) mapping is a technique to predict speech using various articulatory acquisition techniques as input (e.g. ultrasound tongue imaging, MRI, lip video). The advantage of lip video is that it is easily available and affordable: most modern smartphones have a front camera. There are already a few solutions for lip-to-speech synthesis, but they mostly concentrate on of…

    Submitted 29 April, 2021; originally announced April 2021.

    Comments: 10 pages, 6 figures

  12. Improving Neural Silent Speech Interface Models by Adversarial Training

    Authors: Amin Honarmandi Shandiz, László Tóth, Gábor Gosztolya, Alexandra Markó, Tamás Gábor Csapó

    Abstract: Besides the well-known classification task, these days neural networks are frequently being applied to generate or transform data, such as images and audio signals. In such tasks, the conventional loss functions like the mean squared error (MSE) may not give satisfactory results. To improve the perceptual quality of the generated signals, one possibility is to increase their similarity to real sig…

    Submitted 23 April, 2021; originally announced April 2021.

    Comments: 11 pages, 3 tables, 2 figures

  13. arXiv:2101.11245  [pdf, other]

    cs.CV

    Convolutional Neural Network-Based Age Estimation Using B-Mode Ultrasound Tongue Image

    Authors: Kele Xu, Tamás Gábor Csapó, Ming Feng

    Abstract: Ultrasound tongue imaging is widely used for speech production research, and it has attracted increasing attention as its potential applications seem to be evident in many different fields, such as the visual biofeedback tool for second language acquisition and silent speech interface. Unlike previous studies, here we explore the feasibility of age estimation using the ultrasound tongue image of t…

    Submitted 27 January, 2021; originally announced January 2021.

    Comments: 5 Figures

  14. arXiv:2008.03152  [pdf, other]

    eess.AS cs.SD

    Ultrasound-based Articulatory-to-Acoustic Mapping with WaveGlow Speech Synthesis

    Authors: Tamás Gábor Csapó, Csaba Zainkó, László Tóth, Gábor Gosztolya, Alexandra Markó

    Abstract: For articulatory-to-acoustic mapping using deep neural networks, typically spectral and excitation parameters of vocoders have been used as the training targets. However, vocoding often results in buzzy and muffled final speech quality. Therefore, in this paper on ultrasound-based articulatory-to-acoustic conversion, we use a flow-based neural vocoder (WaveGlow) pre-trained on a large amount of En…

    Submitted 6 August, 2020; originally announced August 2020.

    Comments: 5 pages, accepted for publication at Interspeech 2020. arXiv admin note: substantial text overlap with arXiv:1906.09885

  15. arXiv:2008.02470  [pdf, other]

    eess.AS cs.SD

    Quantification of Transducer Misalignment in Ultrasound Tongue Imaging

    Authors: Tamás Gábor Csapó, Kele Xu

    Abstract: In speech production research, different imaging modalities have been employed to obtain accurate information about the movement and shaping of the vocal tract. Ultrasound is an affordable and non-invasive imaging modality with relatively high temporal and spatial resolution to study the dynamic behavior of the tongue during speech production. However, a long-standing problem for ultrasound tongue ima…

    Submitted 6 August, 2020; originally announced August 2020.

    Comments: 5 pages, accepted for publication at Interspeech 2020

  16. arXiv:2008.02098  [pdf, other]

    eess.AS cs.SD

    Speaker dependent acoustic-to-articulatory inversion using real-time MRI of the vocal tract

    Authors: Tamás Gábor Csapó

    Abstract: Acoustic-to-articulatory inversion (AAI) methods estimate articulatory movements from the acoustic speech signal, which can be useful in several tasks such as speech recognition, synthesis, talking heads and language tutoring. Most earlier inversion studies are based on point-tracking articulatory techniques (e.g. EMA or XRMB). The advantage of rtMRI is that it provides dynamic information about t…

    Submitted 4 August, 2020; originally announced August 2020.

    Comments: 5 pages, accepted for publication at Interspeech 2020. arXiv admin note: substantial text overlap with arXiv:2008.00889

  17. arXiv:2008.00889  [pdf, other]

    eess.AS cs.SD eess.IV

    Speaker dependent articulatory-to-acoustic mapping using real-time MRI of the vocal tract

    Authors: Tamás Gábor Csapó

    Abstract: Articulatory-to-acoustic (forward) mapping is a technique to predict speech using various articulatory acquisition techniques (e.g. ultrasound tongue imaging, lip video). Real-time MRI (rtMRI) of the vocal tract has not been used before for this purpose. The advantage of MRI is that it has a high 'relative' spatial resolution: it can capture not only lingual, labial and jaw motion, but also the ve…

    Submitted 3 August, 2020; originally announced August 2020.

    Comments: 5 pages, accepted for publication at Interspeech 2020

  18. arXiv:1906.09885  [pdf, other]

    cs.SD eess.AS eess.IV

    Ultrasound-based Silent Speech Interface Built on a Continuous Vocoder

    Authors: Tamás Gábor Csapó, Mohammed Salah Al-Radhi, Géza Németh, Gábor Gosztolya, Tamás Grósz, László Tóth, Alexandra Markó

    Abstract: Recently it was shown that within the Silent Speech Interface (SSI) field, the prediction of F0 is possible from Ultrasound Tongue Images (UTI) as the articulatory input, using Deep Neural Networks for articulatory-to-acoustic mapping. Moreover, text-to-speech synthesizers were shown to produce higher quality speech when using a continuous pitch estimate, which takes non-zero pitch values even whe…

    Submitted 24 June, 2019; originally announced June 2019.

    Comments: 5 pages, 3 figures, accepted for publication at Interspeech 2019

  19. arXiv:1904.06083  [pdf, other]

    cs.SD eess.AS q-bio.TO

    DNN-based Acoustic-to-Articulatory Inversion using Ultrasound Tongue Imaging

    Authors: Dagoberto Porras, Alexander Sepúlveda-Sepúlveda, Tamás Gábor Csapó

    Abstract: Speech sounds are produced as the coordinated movement of the speaking organs. There are several available methods to model the relation of articulatory movements and the resulting speech signal. The reverse problem is often called acoustic-to-articulatory inversion (AAI). In this paper we have implemented several different Deep Neural Networks (DNNs) to estimate the articulatory information fr…

    Submitted 12 April, 2019; originally announced April 2019.

    Comments: 8 pages, 5 figures, Accepted to IJCNN 2019

  20. arXiv:1904.06075  [pdf]

    cs.SD eess.AS

    RNN-based speech synthesis using a continuous sinusoidal model

    Authors: Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Géza Németh

    Abstract: Recently in statistical parametric speech synthesis, we proposed a continuous sinusoidal model (CSM) using continuous F0 (contF0) in combination with Maximum Voiced Frequency (MVF), which successfully achieved state-of-the-art vocoder performance (e.g. similar to STRAIGHT) in synthesized speech. In this paper, we address the use of sequence-to-sequence modeling with recurrent neural networks (R…

    Submitted 12 April, 2019; originally announced April 2019.

    Comments: 8 pages, 4 figures, Accepted to IJCNN 2019

  21. arXiv:1904.05259  [pdf, other]

    cs.SD eess.AS

    Autoencoder-Based Articulatory-to-Acoustic Mapping for Ultrasound Silent Speech Interfaces

    Authors: Gábor Gosztolya, Ádám Pintér, László Tóth, Tamás Grósz, Alexandra Markó, Tamás Gábor Csapó

    Abstract: When using ultrasound video as input, Deep Neural Network-based Silent Speech Interfaces usually rely on the whole image to estimate the spectral parameters required for the speech synthesis step. Although this approach is quite straightforward, and it permits the synthesis of understandable speech, it has several disadvantages as well. Besides the inability to capture the relations between close…

    Submitted 10 April, 2019; originally announced April 2019.

    Comments: 8 pages, 6 figures, Accepted to IJCNN 2019