Synchronising audio and ultrasound by learning cross-modal embeddings

Eshky, Aciel; Ribeiro, Manuel Sam; Richmond, Korin; Renals, Steve

arXiv:1907.00758 (cs)

[Submitted on 1 Jul 2019 (v1), last revised 27 Nov 2019 (this version, v2)]

Abstract: Audiovisual synchronisation is the task of determining the time offset between speech audio and a video recording of the articulators. In child speech therapy, audio and ultrasound videos of the tongue are captured using instruments which rely on hardware to synchronise the two modalities at recording time. Hardware synchronisation can fail in practice, and no mechanism exists to synchronise the signals post hoc. To address this problem, we employ a two-stream neural network which exploits the correlation between the two modalities to find the offset. We train our model on recordings from 69 speakers, and show that it correctly synchronises 82.9% of test utterances from unseen therapy sessions and unseen speakers, thus considerably reducing the number of utterances to be manually synchronised. An analysis of model performance on the test utterances shows that directed phone articulations are more difficult to synchronise automatically than utterances containing natural variation in speech, such as words, sentences, or conversations.
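
The abstract does not spell out how an offset is chosen once the two streams are embedded, so the following is only an illustrative sketch under stated assumptions, not the authors' method. It assumes that each stream has already been mapped to one embedding vector per frame at a common frame rate, and that the best offset is the shift minimising the mean Euclidean distance between aligned embeddings. The function name `estimate_offset` and its parameters are hypothetical.

```python
import numpy as np


def estimate_offset(audio_emb, video_emb, max_shift):
    """Pick the frame shift that best aligns two embedding sequences.

    Assumption (not from the paper): both inputs are arrays of shape
    (n_frames, dim) at the same frame rate. A shift s means that
    audio frame i + s is compared with video frame i.
    """
    best_shift, best_dist = 0, np.inf
    for shift in range(-max_shift, max_shift + 1):
        if shift >= 0:
            a, v = audio_emb[shift:], video_emb
        else:
            a, v = audio_emb, video_emb[-shift:]
        n = min(len(a), len(v))
        if n == 0:
            continue  # no overlap left at this shift
        # Mean Euclidean distance over the overlapping frames.
        dist = np.linalg.norm(a[:n] - v[:n], axis=1).mean()
        if dist < best_dist:
            best_shift, best_dist = shift, dist
    return best_shift


# Toy check with synthetic embeddings: the "video" stream starts
# three frames later in the shared underlying sequence.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 8))
audio = base[0:40]
video = base[3:43]
recovered = estimate_offset(audio, video, max_shift=5)
```

In this toy setup the two streams share identical embeddings, so the distance is exactly zero at the true shift. With a trained network the embeddings would only be close, and the candidate shifts would need converting from frames to milliseconds.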

Comments:	5 pages, 1 figure, 4 tables; Interspeech 2019 with the following edits: 1) Loss and accuracy upon convergence were accidentally reported from an older model. Now updated with model described throughout the paper. All other results remain unchanged. 2) Max true offset in the training data corrected from 179ms to 1789ms. 3) Detectability "boundary/range" renamed to detectability "thresholds"
Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
Cite as:	arXiv:1907.00758 [cs.CL]
	(or arXiv:1907.00758v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.1907.00758

Submission history

From: Aciel Eshky
[v1] Mon, 1 Jul 2019 13:22:48 UTC (115 KB)
[v2] Wed, 27 Nov 2019 11:24:26 UTC (113 KB)
