Methods to Increase the Amount of Data for Speech Recognition for Low Resource Languages

Ayrapetyan, Alexan; Kostandian, Sofia; Yeroyan, Ara; Yerznkanyan, Mher; Karpov, Nikolay; Tadevosyan, Nune; Lavrukhin, Vitaly; Ginsburg, Boris

arXiv:2501.14788 (cs)

[Submitted on 8 Jan 2025 (v1), last revised 7 Feb 2025 (this version, v2)]

Abstract: This study explores methods to increase data volume for low-resource languages using techniques such as crowdsourcing, pseudo-labeling, advanced data preprocessing, and various permissive data sources such as audiobooks, Common Voice, and YouTube. While these methods are well explored for high-resource languages, their application to low-resource languages remains underexplored. Using Armenian and Georgian as case studies, we demonstrate how linguistic and resource-specific characteristics influence the success of these methods. This work provides practical guidance for researchers to choose cost-effective and quality-driven dataset extension strategies for low-resource languages. The key takeaway from the various data extension approaches is that paid crowdsourcing offers the best balance between cost and quality, outperforming volunteer crowdsourcing, open-source audiobooks, and unlabeled data usage. An ablation study shows that models trained on the expanded datasets outperform existing baselines and achieve ASR word error rates of 5.73% for Georgian and 9.9% for Armenian using a relatively small FastConformer architecture. We open-sourced both the Armenian and Georgian models to allow further research and practical applications.
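
For readers unfamiliar with the metric quoted above, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer's output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code. The function name `word_error_rate` and the English example sentences are illustrative assumptions, and any text normalization (casing, punctuation) the paper may apply before scoring is not described in the abstract and is not modeled here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; splits on whitespace only and
    applies no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Under this definition, a reported figure such as 5.73% corresponds to roughly 5.7 word-level errors per 100 reference words. Because insertions count as errors, the ratio can exceed 1.0 for very poor hypotheses.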

Comments:	The first four authors contributed equally
Subjects:	Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2501.14788 [cs.SD]
	(or arXiv:2501.14788v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2501.14788

Submission history

From: Alexan Ayrapetyan
[v1] Wed, 8 Jan 2025 15:18:42 UTC (660 KB)
[v2] Fri, 7 Feb 2025 07:21:50 UTC (643 KB)
