Empowering the Deaf and Hard of Hearing Community: Enhancing Video Captions Using Large Language Models

Fathallah, Nadeen; Bhole, Monika; Staab, Steffen

arXiv:2412.00342 (cs)

[Submitted on 30 Nov 2024 (v1), last revised 21 May 2025 (this version, v2)]

Abstract: In today's digital age, video content is prevalent, serving as a primary source of information, education, and entertainment. However, the Deaf and Hard of Hearing (DHH) community often faces significant challenges in accessing video content because automatic speech recognition (ASR) systems frequently fail to provide accurate and reliable captions. This paper addresses the urgent need to improve video caption quality by leveraging Large Language Models (LLMs). We present a comprehensive study of how LLMs can be integrated to improve the accuracy and context-awareness of captions generated by ASR systems. Our methodology is a novel pipeline that uses advanced LLMs to correct ASR-generated captions. It focuses on GPT-3.5 and Llama2-13B because of their strong performance in language comprehension and generation tasks. To evaluate the proposed pipeline, we introduce a dataset representative of the real-world challenges the DHH community faces. Our results indicate that LLM-enhanced captions are significantly more accurate: ChatGPT-3.5 achieves a notably lower Word Error Rate (WER) of 9.75%, compared with 23.07% for the original ASR captions, an improvement in WER of approximately 57.72%.
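WER is conventionally computed as the word-level edit distance between a hypothesis caption and a reference transcript (substitutions + deletions + insertions) divided by the number of reference words. The sketch below assumes that standard definition together with a simple normalization (lowercasing and whitespace tokenization). The exact normalization used in the paper is not stated here. The sketch also shows how the relative improvement follows from the two reported WER values. Recomputing it from the rounded figures gives roughly 57.7%, consistent with the reported approximate 57.72%. The small gap is plausibly due to rounding of the published WER values. The caption pair is hypothetical and not taken from the paper's dataset.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words (one row of the Levenshtein table).
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra hypothesis word
                prev[j - 1] + cost,  # substitution (or match when cost == 0)
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


def relative_improvement(baseline_wer: float, improved_wer: float) -> float:
    """Percentage reduction in WER relative to the baseline WER."""
    return (baseline_wer - improved_wer) / baseline_wer * 100


if __name__ == "__main__":
    # Hypothetical caption pair for illustration only (not from the paper's dataset).
    reference = "the lecture starts at nine tomorrow morning"
    asr_output = "the lecture stars at nine tomorrow mourning"
    print(f"WER: {word_error_rate(reference, asr_output):.2%}")

    # WER values reported in the abstract (in percent): ASR vs. ChatGPT-3.5.
    print(f"Relative improvement: {relative_improvement(23.07, 9.75):.2f}%")
```

Because insertions count as errors, WER is not capped at 100%. It can exceed 100% when a hypothesis contains many more words than its reference.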

Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2412.00342 [cs.AI]
	(or arXiv:2412.00342v2 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2412.00342

Submission history

From: Nadeen Fathallah
[v1] Sat, 30 Nov 2024 03:52:08 UTC (503 KB)
[v2] Wed, 21 May 2025 06:43:46 UTC (661 KB)
