Using Multimodal Deep Neural Networks to Disentangle Language from Visual Aesthetics

Conwell, Colin; Hamblin, Christopher; Boccagno, Chelsea; Mayo, David; Cummings, Jesse; Isik, Leyla; Barbu, Andrei

arXiv:2410.23603 (cs)

[Submitted on 31 Oct 2024]

Abstract: When we experience a visual stimulus as beautiful, how much of that experience derives from perceptual computations we cannot describe versus conceptual knowledge we can readily translate into natural language? Disentangling perception from language in visually-evoked affective and aesthetic experiences through behavioral paradigms or neuroimaging is often empirically intractable. Here, we circumnavigate this challenge by using linear decoding over the learned representations of unimodal vision, unimodal language, and multimodal (language-aligned) deep neural network (DNN) models to predict human beauty ratings of naturalistic images. We show that unimodal vision models (e.g. SimCLR) account for the vast majority of explainable variance in these ratings. Language-aligned vision models (e.g. SLIP) yield small gains relative to unimodal vision. Unimodal language models (e.g. GPT2) conditioned on visual embeddings to generate captions (via CLIPCap) yield no further gains. Caption embeddings alone yield less accurate predictions than image and caption embeddings combined (concatenated). Taken together, these results suggest that whatever words we may eventually find to describe our experience of beauty, the ineffable computations of feedforward perception may provide sufficient foundation for that experience.
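
To make the comparison in the abstract concrete, the following is a minimal sketch of linear decoding from frozen embeddings to per-image ratings. It is not the authors' pipeline. The abstract says only "linear decoding", so the closed-form ridge regression, the 5-fold cross-validation, the Pearson correlation score, the array shapes, and every name in the block (`ridge_fit`, `cv_score`, `image_emb`, `caption_emb`, `ratings`) are assumptions made for illustration. All arrays are random placeholders, so the printed scores carry no information about the paper's results.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc for the weight vector w.
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(gram, Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


def cv_score(X, y, alpha=1.0, n_folds=5, seed=0):
    """Pearson r between out-of-fold predictions and the true ratings."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    preds = np.empty(len(y))
    for test_idx in np.array_split(order, n_folds):
        train_idx = np.setdiff1d(order, test_idx)
        w, b = ridge_fit(X[train_idx], y[train_idx], alpha)
        preds[test_idx] = X[test_idx] @ w + b
    return float(np.corrcoef(preds, y)[0, 1])


# Random placeholders (assumption): stand-ins for real model features
# and real human ratings, used only so the sketch runs end to end.
rng = np.random.default_rng(0)
n_images = 400
image_emb = rng.normal(size=(n_images, 64))    # e.g. vision-model features
caption_emb = rng.normal(size=(n_images, 32))  # e.g. caption embeddings
ratings = rng.normal(size=n_images)            # e.g. mean beauty rating per image

# The three feature sets compared in the abstract: image embeddings,
# caption embeddings, and the two concatenated along the feature axis.
feature_sets = {
    "image": image_emb,
    "caption": caption_emb,
    "image+caption": np.concatenate([image_emb, caption_emb], axis=1),
}

scores = {name: cv_score(X, ratings) for name, X in feature_sets.items()}
for name, r in scores.items():
    print(f"{name:>14s}: r = {r:+.3f}")
```

To use the sketch on real data, replace the three placeholder arrays with embeddings extracted from the models of interest and with the measured ratings, keeping one row per image in the same order across all arrays.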

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as: arXiv:2410.23603 [cs.CV] (or arXiv:2410.23603v1 [cs.CV] for this version)
DOI: https://doi.org/10.48550/arXiv.2410.23603

Submission history

From: Colin Conwell
[v1] Thu, 31 Oct 2024 03:37:21 UTC (7,419 KB)
