Disentangling continuous and discrete linguistic signals in transformer-based sentence embeddings

Nastase, Vivi; Merlo, Paola

arXiv:2312.11272v1 (cs)

[Submitted on 18 Dec 2023]

Abstract: Sentence and word embeddings encode structural and semantic information in a distributed manner. Part of the encoded information -- particularly lexical information -- can be seen as continuous, whereas other parts -- like structural information -- are most often discrete. We explore whether we can compress transformer-based sentence embeddings into a representation that separates different linguistic signals -- in particular, information relevant to subject-verb agreement and verb alternations. We show that by compressing an input sequence that shares a targeted phenomenon into the latent layer of a variational autoencoder-like system, the targeted linguistic information becomes more explicit. A latent layer with both discrete and continuous components captures the targeted phenomena better than a latent layer with only discrete or only continuous components. These experiments are a step towards separating linguistic signals from distributed text embeddings and linking them to more symbolic representations.
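
As a rough illustration of what a latent layer with both continuous and discrete components can look like, the sketch below samples a joint latent vector in plain Python. The abstract does not say how the latent layer is sampled, so everything here is an assumption: the continuous part uses the standard Gaussian reparameterisation, the discrete part uses a Gumbel-softmax relaxation, and the names `sample_continuous`, `sample_discrete`, and `sample_joint_latent` are hypothetical. It is not the authors' implementation.

```python
import math
import random


def sample_continuous(mu, log_var, rng):
    """Reparameterised Gaussian sample: z = mu + sigma * eps, eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def sample_discrete(logits, temperature, rng):
    """Gumbel-softmax relaxation of a categorical sample (entries sum to 1)."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed logits.
    top = max(noisy)
    exps = [math.exp(x - top) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def sample_joint_latent(mu, log_var, logits, temperature=0.5, seed=0):
    """Concatenate a continuous and a (relaxed) discrete latent component.

    Assumption: the two parts are sampled independently and concatenated;
    the abstract only states that the latent layer has both kinds of
    components.
    """
    rng = random.Random(seed)
    z_cont = sample_continuous(mu, log_var, rng)
    z_disc = sample_discrete(logits, temperature, rng)
    return z_cont + z_disc


# Hypothetical encoder outputs: 2 continuous units, 1 categorical with 3 classes.
latent = sample_joint_latent(mu=[0.0, 1.0], log_var=[0.0, -2.0],
                             logits=[2.0, 0.5, 0.1])
```

In this sketch the last three entries of `latent` form a relaxed one-hot vector, and lowering `temperature` pushes it closer to a hard categorical choice.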

Subjects:	Computation and Language (cs.CL)
ACM classes:	I.2.7
Cite as:	arXiv:2312.11272 [cs.CL]
	(or arXiv:2312.11272v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2312.11272

Submission history

From: Vivi Nastase
[v1] Mon, 18 Dec 2023 15:16:54 UTC (7,573 KB)
