HL Dataset: Grounding High-Level Linguistic Concepts in Vision

Cafagna, Michele; van Deemter, Kees; Gatt, Albert

arXiv:2302.12189v1 (cs)

[Submitted on 23 Feb 2023 (this version), latest version 25 Sep 2023 (v3)]

Abstract: Current captioning datasets focus on object-centric captions that describe the visible objects in an image and often end up stating what is obvious to humans, e.g. "people eating food in a park". Although these datasets are useful for evaluating the ability of Vision & Language models to recognize visual content, they do not express abstract concepts that humans find trivial, e.g. "people having a picnic". Such concepts are licensed by humans' personal experience and contribute to forming common-sense assumptions. We present the High-Level Dataset, which extends 14997 images of the COCO dataset with 134973 human-annotated high-level (abstract) captions collected along three axes: scenes, actions and rationales. We describe and release this dataset and show how it can be used to assess models' multimodal grounding of abstract concepts and to enrich models' visio-linguistic representations. Moreover, we describe potential tasks enabled by this dataset that involve interactions between high- and low-level concepts.
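
As a quick sanity check on the figures quoted in the abstract, the short Python sketch below divides the caption count by the image count. The abstract does not say how captions are distributed over the three axes, so the per-axis figure rests on an assumed even split; the function name `captions_per_image` is hypothetical and used only for illustration.

```python
# Figures quoted in the abstract.
NUM_IMAGES = 14997
NUM_CAPTIONS = 134973
AXES = ("scenes", "actions", "rationales")


def captions_per_image(num_captions: int, num_images: int) -> int:
    """Return captions per image, requiring the division to be exact."""
    quotient, remainder = divmod(num_captions, num_images)
    if remainder != 0:
        raise ValueError("captions do not divide evenly across images")
    return quotient


per_image = captions_per_image(NUM_CAPTIONS, NUM_IMAGES)

# ASSUMPTION (not stated in the abstract): captions are split evenly
# across the three axes.
per_axis = per_image // len(AXES)

print(f"captions per image: {per_image}")
print(f"captions per axis per image (assumed even split): {per_axis}")
```

The division is exact, which is consistent with every image carrying the same number of captions, although the abstract itself does not confirm that.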

Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2302.12189 [cs.CL]
	(or arXiv:2302.12189v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2302.12189

Submission history

From: Michele Cafagna
[v1] Thu, 23 Feb 2023 17:30:18 UTC (5,348 KB)
[v2] Tue, 1 Aug 2023 09:53:21 UTC (8,863 KB)
[v3] Mon, 25 Sep 2023 07:37:20 UTC (8,864 KB)
