HL Dataset: Visually-grounded Description of Scenes, Actions and Rationales

Cafagna, Michele; van Deemter, Kees; Gatt, Albert

arXiv:2302.12189 (cs)

[Submitted on 23 Feb 2023 (v1), last revised 25 Sep 2023 (this version, v3)]

Abstract: Current captioning datasets focus on object-centric captions, describing the visible objects in the image, e.g. "people eating food in a park". Although these datasets are useful for evaluating the ability of Vision & Language models to recognize and describe visual content, they do not support controlled experiments, involving model testing or fine-tuning, with the more high-level captions that humans find easy and natural to produce. For example, people often describe images based on the type of scene they depict ("people at a holiday resort") and the actions being performed ("people having a picnic"). Such descriptions draw on personal experience and commonsense assumptions. We present the High-Level Dataset, a dataset extending 14,997 images from the COCO dataset, aligned with a new set of 134,973 human-annotated (high-level) captions collected along three axes: scenes, actions, and rationales. We further extend this dataset with confidence scores collected from an independent set of readers, as well as a set of narrative captions generated synthetically by combining the three axes. We describe this dataset and analyse it extensively. We also present baseline results for the High-Level Captioning task.
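As a quick sanity check on the figures quoted in the abstract, the following minimal sketch divides the caption count by the image count. The even split across the three axes is an assumption made here for illustration; the abstract does not state how captions are distributed per axis, and the names `captions_per_image` and `AXES` are hypothetical.

```python
# Figures taken from the abstract.
N_IMAGES = 14_997
N_CAPTIONS = 134_973

# Axis names from the abstract; the tuple itself is an illustrative choice.
AXES = ("scenes", "actions", "rationales")


def captions_per_image(n_captions: int, n_images: int) -> int:
    """Return captions per image, failing if the counts do not divide evenly."""
    per_image, remainder = divmod(n_captions, n_images)
    if remainder != 0:
        raise ValueError("caption count is not a multiple of image count")
    return per_image


per_image = captions_per_image(N_CAPTIONS, N_IMAGES)

# ASSUMPTION: captions are spread evenly over the three axes.
per_axis = per_image // len(AXES)

print(per_image, per_axis)  # 9 3
```

The counts divide exactly, giving nine captions per image, which is consistent with three captions per axis under the even-split assumption.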

Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2302.12189 [cs.CL]
	(or arXiv:2302.12189v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2302.12189

Submission history

From: Michele Cafagna
[v1] Thu, 23 Feb 2023 17:30:18 UTC (5,348 KB)
[v2] Tue, 1 Aug 2023 09:53:21 UTC (8,863 KB)
[v3] Mon, 25 Sep 2023 07:37:20 UTC (8,864 KB)
