MultiMed: Multilingual Medical Speech Recognition via Attention Encoder Decoder

Le-Duc, Khai; Phan, Phuc; Pham, Tan-Hanh; Tat, Bach Phan; Ngo, Minh-Huong; Ngo, Chris; Nguyen-Tang, Thanh; Hy, Truong-Son

arXiv:2409.14074 (cs)

[Submitted on 21 Sep 2024 (v1), last revised 15 May 2025 (this version, v3)]

Abstract: Multilingual automatic speech recognition (ASR) in the medical domain serves as a foundational task for various downstream applications such as speech translation, spoken language understanding, and voice-activated assistants. This technology improves patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we introduce MultiMed, the first multilingual medical ASR dataset, along with the first collection of small-to-large end-to-end medical ASR models, spanning five languages: Vietnamese, English, German, French, and Mandarin Chinese. To the best of our knowledge, MultiMed is the world's largest medical ASR dataset across all major benchmarks: total duration, number of recording conditions, number of accents, and number of speaking roles. Furthermore, we present the first multilinguality study for medical ASR, which includes reproducible empirical baselines, a monolinguality-multilinguality analysis, an Attention Encoder Decoder (AED) vs. Hybrid comparative study, and a linguistic analysis. We also present practical end-to-end ASR training schemes optimized for a fixed number of trainable parameters that are common in industry settings. All code, data, and models are available online: this https URL.

Comments:	ACL 2025, 38 pages
Subjects:	Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2409.14074 [cs.CL]
	(or arXiv:2409.14074v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2409.14074

Submission history

From: Khai Le-Duc
[v1] Sat, 21 Sep 2024 09:05:48 UTC (7,773 KB)
[v2] Thu, 9 Jan 2025 10:50:12 UTC (7,824 KB)
[v3] Thu, 15 May 2025 04:35:00 UTC (122 KB)
