A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions

Hessel, Jack; Pang, Bo; Zhu, Zhenhai; Soricut, Radu

arXiv:1910.02930 (cs)

[Submitted on 7 Oct 2019]

Abstract: Instructional videos receive high traffic on video-sharing platforms, and prior work suggests that providing time-stamped subtask annotations (e.g., "heat the oil in the pan") improves user experience. However, current automatic annotation methods based on visual features alone perform only slightly better than constant prediction. Taking cues from prior work, we show that we can improve performance significantly by considering automatic speech recognition (ASR) tokens as input. Furthermore, jointly modeling ASR tokens and visual features results in higher performance than training on either modality individually. We find that unstated background information is better explained by visual features, whereas fine-grained distinctions (e.g., "add oil" vs. "add olive oil") are disambiguated more easily via ASR tokens.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:1910.02930 [cs.CL]
	(or arXiv:1910.02930v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.1910.02930
Journal reference:	Published in The SIGNLL Conference on Computational Natural Language Learning (CoNLL) 2019

Submission history

From: Jack Hessel
[v1] Mon, 7 Oct 2019 17:39:39 UTC (1,607 KB)
