Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Fold Paralysis

Zhang, Yucong; Zou, Xin; Yang, Jinshan; Chen, Wenjun; Liu, Juan; Liang, Faya; Li, Ming

Computer Science > Sound

arXiv:2409.03597 (cs)

[Submitted on 5 Sep 2024 (v1), last revised 22 Apr 2025 (this version, v3)]

Title:Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Fold Paralysis

Authors:Yucong Zhang, Xin Zou, Jinshan Yang, Wenjun Chen, Juan Liu, Faya Liang, Ming Li

View PDF HTML (experimental)

Abstract:This paper presents the Multimodal Laryngoscopic Video Analyzing System (MLVAS), a novel system that leverages both audio and video data to automatically extract key video segments and metrics from raw laryngeal videostroboscopic videos for assisted clinical assessment. The system integrates video-based glottis detection with an audio keyword spotting method to analyze both video and audio data, identifying patient vocalizations and refining video highlights to ensure optimal inspection of vocal fold movements. Beyond key video segment extraction from the raw laryngeal videos, MLVAS is able to generate effective audio and visual features for Vocal Fold Paralysis (VFP) detection. Pre-trained audio encoders are utilized to encode the patient voice to get the audio features. Visual features are generated by measuring the angle deviation of both the left and right vocal folds to the estimated glottal midline on the segmented glottis masks. To get better masks, we introduce a diffusion-based refinement that follows traditional U-Net segmentation to reduce false positives. We conducted several ablation studies to demonstrate the effectiveness of each module and modalities in the proposed MLVAS. The experimental results on a public segmentation dataset show the effectiveness of our proposed segmentation module. In addition, unilateral VFP classification results on a real-world clinic dataset demonstrate MLVAS's ability of providing reliable and objective metrics as well as visualization for assisted clinical diagnosis.

Comments:	Submitted to CSL
Subjects:	Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2409.03597 [cs.SD]
	(or arXiv:2409.03597v3 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2409.03597

Submission history

From: Yucong Zhang [view email]
[v1] Thu, 5 Sep 2024 14:56:38 UTC (11,620 KB)
[v2] Wed, 27 Nov 2024 03:19:11 UTC (14,779 KB)
[v3] Tue, 22 Apr 2025 15:32:41 UTC (15,890 KB)

Computer Science > Sound

Title:Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Fold Paralysis

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Sound

Title:Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Fold Paralysis

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators