Phonology-Guided Speech-to-Speech Translation for African Languages

Ochieng, Peter; Kaburu, Dennis

arXiv:2410.23323 (eess)

[Submitted on 30 Oct 2024 (v1), last revised 11 Jun 2025 (this version, v3)]

Abstract: We present a prosody-guided framework for speech-to-speech translation (S2ST) that aligns and translates speech without transcripts by leveraging cross-linguistic pause synchrony. Analyzing a 6,000-hour East African news corpus spanning five languages, we show that within-phylum language pairs exhibit 30-40% lower pause variance and over 3x higher onset/offset correlation compared to cross-phylum pairs. These findings motivate SPaDA, a dynamic-programming alignment algorithm that integrates silence consistency, rate synchrony, and semantic similarity. SPaDA improves alignment F1 by 3-4 points and eliminates up to 38% of spurious matches relative to greedy VAD baselines. Using SPaDA-aligned segments, we train SegUniDiff, a diffusion-based S2ST model guided by external gradients from frozen semantic and speaker encoders. SegUniDiff matches an enhanced cascade in BLEU (30.3 on CVSS-C vs. 28.9 for UnitY), reduces speaker error rate (EER) from 12.5% to 5.3%, and runs at an RTF of 1.02. To support evaluation in low-resource settings, we also release a three-tier, transcript-free BLEU suite (M1-M3) that correlates strongly with human judgments. Together, our results show that prosodic cues in multilingual speech provide a reliable scaffold for scalable, non-autoregressive S2ST.

Subjects:	Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2410.23323 [eess.AS]
	(or arXiv:2410.23323v3 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2410.23323

Submission history

From: Peter Ochieng
[v1] Wed, 30 Oct 2024 09:44:52 UTC (2,463 KB)
[v2] Tue, 10 Jun 2025 08:24:10 UTC (1,001 KB)
[v3] Wed, 11 Jun 2025 10:02:24 UTC (1,001 KB)
