A Large Dataset of Spontaneous Speech with the Accent Spoken in São Paulo for Automatic Speech Recognition Evaluation
Authors:
Rodrigo Lima,
Sidney Evaldo Leal,
Arnaldo Candido Junior,
Sandra Maria Aluísio
Abstract:
We present a freely available spontaneous speech corpus for the Brazilian Portuguese language and report preliminary automatic speech recognition (ASR) results, using both the Wav2Vec2-XLSR-53 and Distil-Whisper models fine-tuned and trained on our corpus. The NURC-SP Audio Corpus comprises 401 different speakers (204 females, 197 males) with a total of 239.30 hours of transcribed audio recordings…
▽ More
We present a freely available spontaneous speech corpus for the Brazilian Portuguese language and report preliminary automatic speech recognition (ASR) results, using both the Wav2Vec2-XLSR-53 and Distil-Whisper models fine-tuned and trained on our corpus. The NURC-SP Audio Corpus comprises 401 different speakers (204 females, 197 males) with a total of 239.30 hours of transcribed audio recordings. To the best of our knowledge, this is the first large Paulistano accented spontaneous speech corpus dedicated to the ASR task in Portuguese. We first present the design and development procedures of the NURC-SP Audio Corpus, and then describe four ASR experiments in detail. The experiments demonstrated promising results for the applicability of the corpus for ASR. Specifically, we fine-tuned two versions of Wav2Vec2-XLSR-53 model, trained a Distil-Whisper model using our dataset with labels determined by Whisper Large-V3 model, and fine-tuned this Distil-Whisper model with our corpus. Our best results were the Distil-Whisper fine-tuned over NURC-SP Audio Corpus with a WER of 24.22% followed by a fine-tuned versions of Wav2Vec2-XLSR-53 model with a WER of 33.73%, that is almost 10% point worse than Distil-Whisper's. To enable experiment reproducibility, we share the NURC-SP Audio Corpus dataset, pre-trained models, and training recipes in Hugging-Face and Github repositories.
△ Less
Submitted 10 September, 2024;
originally announced September 2024.
NILC-Metrix: assessing the complexity of written and spoken language in Brazilian Portuguese
Authors:
Sidney Evaldo Leal,
Magali Sanches Duran,
Carolina Evaristo Scarton,
Nathan Siegle Hartmann,
Sandra Maria Aluísio
Abstract:
This paper presents and makes publicly available the NILC-Metrix, a computational system comprising 200 metrics proposed in studies on discourse, psycholinguistics, cognitive and computational linguistics, to assess textual complexity in Brazilian Portuguese (BP). These metrics are relevant for descriptive analysis and the creation of computational models and can be used to extract information fro…
▽ More
This paper presents and makes publicly available the NILC-Metrix, a computational system comprising 200 metrics proposed in studies on discourse, psycholinguistics, cognitive and computational linguistics, to assess textual complexity in Brazilian Portuguese (BP). These metrics are relevant for descriptive analysis and the creation of computational models and can be used to extract information from various linguistic levels of written and spoken language. The metrics in NILC-Metrix were developed during the last 13 years, starting in 2008 with Coh-Metrix-Port, a tool developed within the scope of the PorSimples project. Coh-Metrix-Port adapted some metrics to BP from the Coh-Metrix tool that computes metrics related to cohesion and coherence of texts in English. After the end of PorSimples in 2010, new metrics were added to the initial 48 metrics of Coh-Metrix-Port. Given the large number of metrics, we present them following an organisation similar to the metrics of Coh-Metrix v3.0 to facilitate comparisons made with metrics in Portuguese and English. In this paper, we illustrate the potential of NILC-Metrix by presenting three applications: (i) a descriptive analysis of the differences between children's film subtitles and texts written for Elementary School I and II (Final Years); (ii) a new predictor of textual complexity for the corpus of original and simplified texts of the PorSimples project; (iii) a complexity prediction model for school grades, using transcripts of children's story narratives told by teenagers. For each application, we evaluate which groups of metrics are more discriminative, showing their contribution for each task.
△ Less
Submitted 17 December, 2021;
originally announced January 2022.