Showing 1–16 of 16 results for author: Alku, P

  1. arXiv:2410.17028

    eess.AS cs.AI cs.CL cs.LG cs.SD

    Can a Machine Distinguish High and Low Amount of Social Creak in Speech?

    Authors: Anne-Maria Laukkanen, Sudarsana Reddy Kadiri, Shrikanth Narayanan, Paavo Alku

    Abstract: Objectives: Increased prevalence of social creak particularly among female speakers has been reported in several studies. The study of social creak has been previously conducted by combining perceptual evaluation of speech with conventional acoustical parameters such as the harmonic-to-noise ratio and cepstral peak prominence. In the current study, machine learning (ML) was used to automatically di…

    Submitted 22 October, 2024; originally announced October 2024.

    Comments: Accepted in Journal of Voice

  2. arXiv:2309.14107

    eess.AS cs.CL cs.LG cs.SD eess.SP

    Wav2vec-based Detection and Severity Level Classification of Dysarthria from Speech

    Authors: Farhad Javanmardi, Saska Tirronen, Manila Kodali, Sudarsana Reddy Kadiri, Paavo Alku

    Abstract: Automatic detection and severity level classification of dysarthria directly from acoustic speech signals can be used as a tool in medical diagnosis. In this work, the pre-trained wav2vec 2.0 model is studied as a feature extractor to build detection and severity level classification systems for dysarthric speech. The experiments were carried out with the popularly used UA-speech database. In the…

    Submitted 17 October, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

    Comments: Copyright 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

    Journal ref: in Proc. ICASSP, Rhodes Island, Greece, June 4-10, 2023

  3. arXiv:2309.14080

    eess.AS cs.CL cs.LG cs.SD eess.SP

    Analysis and Detection of Pathological Voice using Glottal Source Features

    Authors: Sudarsana Reddy Kadiri, Paavo Alku

    Abstract: Automatic detection of voice pathology enables objective assessment and earlier intervention for the diagnosis. This study provides a systematic analysis of glottal source features and investigates their effectiveness in voice pathology detection. Glottal source features are extracted using glottal flows estimated with the quasi-closed phase (QCP) glottal inverse filtering method, using approximat…

    Submitted 17 October, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

    Comments: Copyright 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

    Journal ref: IEEE Journal of Selected Topics in Signal Processing, Vol. 14, No. 2, pp. 367-379, February 2020

  4. arXiv:2308.16540

    eess.AS cs.CL cs.SD eess.SP

    Time-Varying Quasi-Closed-Phase Analysis for Accurate Formant Tracking in Speech Signals

    Authors: Dhananjaya Gowda, Sudarsana Reddy Kadiri, Brad Story, Paavo Alku

    Abstract: In this paper, we propose a new method for the accurate estimation and tracking of formants in speech signals using time-varying quasi-closed-phase (TVQCP) analysis. Conventional formant tracking methods typically adopt a two-stage estimate-and-track strategy wherein an initial set of formant candidates is estimated using short-time analysis (e.g., 10–50 ms), followed by a tracking stage based o…

    Submitted 31 August, 2023; originally announced August 2023.

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 28, pp. 1901-1914, 2020

  5. arXiv:2308.09051

    eess.AS cs.AI cs.LG cs.SD eess.SP

    Refining a Deep Learning-based Formant Tracker using Linear Prediction Methods

    Authors: Paavo Alku, Sudarsana Reddy Kadiri, Dhananjaya Gowda

    Abstract: In this study, formant tracking is investigated by refining the formants tracked by an existing data-driven tracker, DeepFormants, using the formants estimated in a model-driven manner by linear prediction (LP)-based methods. As LP-based formant estimation methods, conventional covariance analysis (LP-COV) and the recently proposed quasi-closed phase forward-backward (QCP-FB) analysis are used. In…

    Submitted 17 August, 2023; originally announced August 2023.

    Comments: Computer Speech and Language, Vol. 81, Article 101515, June 2023

  6. arXiv:2308.09042

    eess.AS cs.AI cs.SD eess.SP

    Severity Classification of Parkinson's Disease from Speech using Single Frequency Filtering-based Features

    Authors: Sudarsana Reddy Kadiri, Manila Kodali, Paavo Alku

    Abstract: Developing objective methods for assessing the severity of Parkinson's disease (PD) is crucial for improving the diagnosis and treatment. This study proposes two sets of novel features derived from the single frequency filtering (SFF) method: (1) SFF cepstral coefficients (SFFCC) and (2) MFCCs from the SFF (MFCC-SFF) for the severity classification of PD. Prior studies have demonstrated that SFF o…

    Submitted 17 August, 2023; originally announced August 2023.

    Comments: Accepted by INTERSPEECH 2023

  7. arXiv:2308.03226

    eess.AS cs.AI cs.CL cs.MM cs.SD

    Investigation of Self-supervised Pre-trained Models for Classification of Voice Quality from Speech and Neck Surface Accelerometer Signals

    Authors: Sudarsana Reddy Kadiri, Farhad Javanmardi, Paavo Alku

    Abstract: Prior studies in the automatic classification of voice quality have mainly studied the use of the acoustic speech signal as input. Recently, a few studies have been carried out by jointly using both speech and neck surface accelerometer (NSA) signals as inputs, and by extracting MFCCs and glottal source features. This study examines simultaneously-recorded speech and NSA signals in the classificat…

    Submitted 6 August, 2023; originally announced August 2023.

    Comments: Accepted by Computer Speech & Language

  8. arXiv:2201.01525

    eess.AS cs.LG cs.SD

    Formant Tracking Using Quasi-Closed Phase Forward-Backward Linear Prediction Analysis and Deep Neural Networks

    Authors: Dhananjaya Gowda, Bajibabu Bollepalli, Sudarsana Reddy Kadiri, Paavo Alku

    Abstract: Formant tracking is investigated in this study by using trackers based on dynamic programming (DP) and deep neural nets (DNNs). Using the DP approach, six formant estimation methods were first compared. The six methods include linear prediction (LP) algorithms, weighted LP algorithms and the recently developed quasi-closed phase forward-backward (QCP-FB) method. QCP-FB gave the best performance in…

    Submitted 5 January, 2022; originally announced January 2022.

    Journal ref: Published in IEEE ACCESS. Vol. 9, 2021, pp. 151631-151640

  9. arXiv:1912.12604

    cs.SD cs.CL eess.AS

    Glottal Source Processing: from Analysis to Applications

    Authors: Thomas Drugman, Paavo Alku, Abeer Alwan, Bayya Yegnanarayana

    Abstract: The great majority of current voice technology applications relies on acoustic features characterizing the vocal tract response, such as the widely used MFCC or LPC parameters. Nonetheless, the airflow passing through the vocal folds, called the glottal flow, is expected to exhibit a relevant complementarity. Unfortunately, glottal analysis from speech recordings requires specific and more complex…

    Submitted 29 December, 2019; originally announced December 2019.

  10. arXiv:1911.01601

    eess.AS cs.CR cs.SD eess.SP

    ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech

    Authors: Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Hector Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, Lauri Juvela, Paavo Alku, Yu-Huai Peng, Hsin-Te Hwang, Yu Tsao, Hsin-Min Wang, Sebastien Le Maguer, Markus Becker, Fergus Henderson, Rob Clark, Yu Zhang, Quan Wang, Ye Jia, Kai Onuma, Koji Mushika, et al. (15 additional authors not shown)

    Abstract: Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as "presentation attacks." These vulnerabilities are generally unacceptable and call for spoofing countermeasures or "presentation attack detection" systems. In addition to imperso…

    Submitted 14 July, 2020; v1 submitted 4 November, 2019; originally announced November 2019.

    Comments: Accepted, Computer Speech and Language. This manuscript version is made available under the CC-BY-NC-ND 4.0. For the published version on Elsevier website, please visit https://doi.org/10.1016/j.csl.2020.101114

  11. arXiv:1904.03976

    eess.AS cs.LG cs.SD

    GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-spectrogram

    Authors: Lauri Juvela, Bajibabu Bollepalli, Junichi Yamagishi, Paavo Alku

    Abstract: Recent advances in neural network-based text-to-speech have reached human-level naturalness in synthetic speech. The present sequence-to-sequence models can directly map text to mel-spectrogram acoustic features, which are convenient for modeling, but present additional challenges for vocoding (i.e., waveform generation from the acoustic features). High-quality synthesis can be achieved with neur…

    Submitted 26 June, 2019; v1 submitted 8 April, 2019; originally announced April 2019.

    Comments: Interspeech 2019 accepted version

  12. arXiv:1903.05955

    eess.AS cs.CL cs.LG cs.SD

    Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis

    Authors: Bajibabu Bollepalli, Lauri Juvela, Paavo Alku

    Abstract: Recent studies have shown that text-to-speech synthesis quality can be improved by using glottal vocoding. This refers to vocoders that parameterize speech into two parts, the glottal excitation and vocal tract, that occur in the human speech production apparatus. Current glottal vocoders generate the glottal excitation waveform by using deep neural networks (DNNs). However, the squared error-base…

    Submitted 14 March, 2019; originally announced March 2019.

    Comments: Accepted in Interspeech

    Journal ref: Interspeech-2017

  13. arXiv:1810.12598

    eess.AS cs.SD stat.ML

    Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks

    Authors: Lauri Juvela, Bajibabu Bollepalli, Junichi Yamagishi, Paavo Alku

    Abstract: The state-of-the-art in text-to-speech synthesis has recently improved considerably due to novel neural waveform generation methods, such as WaveNet. However, these methods suffer from their slow sequential inference process, while their parallel versions are difficult to train and even more expensive computationally. Meanwhile, generative adversarial networks (GANs) have achieved impressive resul…

    Submitted 30 October, 2018; originally announced October 2018.

    Comments: Submitted to ICASSP 2019

  14. arXiv:1810.12051

    cs.SD cs.CL eess.AS

    Speaking style adaptation in Text-To-Speech synthesis using Sequence-to-sequence models with attention

    Authors: Bajibabu Bollepalli, Lauri Juvela, Paavo Alku

    Abstract: Currently, there is increasing interest in using sequence-to-sequence models with attention for text-to-speech (TTS) synthesis. These models are end-to-end, meaning that they learn both co-articulation and duration properties directly from text and speech. Since these models are entirely data-driven, they need large amounts of data to generate synthetic speech with good quality. However, in challeng…

    Submitted 29 October, 2018; originally announced October 2018.

    Comments: 5 pages, 5 figures. Submitted to ICASSP 2019

  15. arXiv:1804.09593

    eess.AS cs.SD stat.ML

    Speaker-independent raw waveform model for glottal excitation

    Authors: Lauri Juvela, Vassilis Tsiaras, Bajibabu Bollepalli, Manu Airaksinen, Junichi Yamagishi, Paavo Alku

    Abstract: Recent speech technology research has seen a growing interest in using WaveNets as statistical vocoders, i.e., generating speech waveforms from acoustic features. These models have been shown to improve the generated speech quality over classical vocoders in many tasks, such as text-to-speech synthesis and voice conversion. Furthermore, conditioning WaveNets with acoustic features allows sharing t…

    Submitted 25 April, 2018; originally announced April 2018.

    Comments: Submitted to Interspeech 2018

  16. arXiv:1804.00920

    eess.AS cs.CL cs.SD stat.ML

    Speech waveform synthesis from MFCC sequences with generative adversarial networks

    Authors: Lauri Juvela, Bajibabu Bollepalli, Xin Wang, Hirokazu Kameoka, Manu Airaksinen, Junichi Yamagishi, Paavo Alku

    Abstract: This paper proposes a method for generating speech from filterbank mel frequency cepstral coefficients (MFCC), which are widely used in speech applications, such as ASR, but are generally considered unusable for speech synthesis. First, we predict fundamental frequency and voicing information from MFCCs with an autoregressive recurrent neural net. Second, the spectral envelope information containe…

    Submitted 3 April, 2018; originally announced April 2018.