-
Exploring Italian sentence embeddings properties through multi-tasking
Authors:
Vivi Nastase,
Giuseppe Samo,
Chunyang Jiang,
Paola Merlo
Abstract:
We investigate to what degree existing LLMs encode abstract linguistic information in Italian in a multi-task setting. We exploit curated synthetic data on a large scale -- several Blackbird Language Matrices (BLMs) problems in Italian -- and use them to study how sentence representations built using pre-trained language models encode specific syntactic and semantic information. We use a two-level architecture to model separately a compression of the sentence embeddings into a representation that contains relevant information for a task, and a BLM task. We then investigate whether we can obtain compressed sentence representations that encode syntactic and semantic information relevant to several BLM tasks. While we expected that the sentence structure -- in terms of sequence of phrases/chunks -- and chunk properties could be shared across tasks, performance and error analysis show that the clues for the different tasks are encoded in different manners in the sentence embeddings, suggesting that abstract linguistic notions such as constituents or thematic roles do not seem to be present in the pretrained sentence embeddings.
Submitted 29 November, 2024; v1 submitted 10 September, 2024;
originally announced September 2024.
-
Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement
Authors:
Vivi Nastase,
Chunyang Jiang,
Giuseppe Samo,
Paola Merlo
Abstract:
In this paper, our goal is to investigate to what degree multilingual pretrained language models capture cross-linguistically valid abstract linguistic representations. We take the approach of developing curated synthetic data on a large scale, with specific properties, and using them to study sentence representations built using pretrained language models. We use a new multiple-choice task and datasets, Blackbird Language Matrices (BLMs), to focus on a specific grammatical structural phenomenon -- subject-verb agreement across a variety of sentence structures -- in several languages. Finding a solution to this task requires a system to detect complex linguistic patterns and paradigms in text representations. Using a two-level architecture that solves the problem in two steps -- detect syntactic objects and their properties in individual sentences, and find patterns across an input sequence of sentences -- we show that despite having been trained on multilingual texts in a consistent manner, multilingual pretrained language models have language-specific differences, and syntactic structure is not shared, even across closely related languages.
Submitted 29 November, 2024; v1 submitted 10 September, 2024;
originally announced September 2024.
-
Tracking linguistic information in transformer-based sentence embeddings through targeted sparsification
Authors:
Vivi Nastase,
Paola Merlo
Abstract:
Analyses of transformer-based models have shown that they encode a variety of linguistic information from their textual input. While these analyses have shed light on the relation between linguistic information on one side, and internal architecture and parameters on the other, a question remains unanswered: how is this linguistic information reflected in sentence embeddings? Using datasets consisting of sentences with known structure, we test to what degree information about chunks (in particular noun, verb or prepositional phrases), such as grammatical number, or semantic role, can be localized in sentence embeddings. Our results show that such information is not distributed over the entire sentence embedding, but rather it is encoded in specific regions. Understanding how the information from an input text is compressed into sentence embeddings helps us understand current transformer models and helps build future explainable neural models.
Submitted 25 July, 2024;
originally announced July 2024.
-
Are there identifiable structural parts in the sentence embedding whole?
Authors:
Vivi Nastase,
Paola Merlo
Abstract:
Sentence embeddings from transformer models encode much linguistic information in a fixed-length vector. We explore the hypothesis that these embeddings consist of overlapping layers of information that can be separated, and on which specific types of information -- such as information about chunks and their structural and semantic properties -- can be detected. We show that this is the case using a dataset consisting of sentences with known chunk structure, and two linguistic intelligence datasets whose solutions rely on detecting chunks and, respectively, their grammatical number and their semantic roles. Our evidence comes from analyses of the performance on the tasks and of the internal representations built during learning.
Submitted 2 July, 2024; v1 submitted 24 June, 2024;
originally announced June 2024.
-
Disentangling continuous and discrete linguistic signals in transformer-based sentence embeddings
Authors:
Vivi Nastase,
Paola Merlo
Abstract:
Sentence and word embeddings encode structural and semantic information in a distributed manner. Part of the information encoded -- particularly lexical information -- can be seen as continuous, whereas other information -- like structural information -- is most often discrete. We explore whether we can compress transformer-based sentence embeddings into a representation that separates different linguistic signals -- in particular, information relevant to subject-verb agreement and verb alternations. We show that by compressing an input sequence that shares a targeted phenomenon into the latent layer of a variational autoencoder-like system, the targeted linguistic information becomes more explicit. A latent layer with both discrete and continuous components captures the targeted phenomena better than a latent layer with only discrete or only continuous components. These experiments are a step towards separating linguistic signals from distributed text embeddings and linking them to more symbolic representations.
Submitted 18 December, 2023;
originally announced December 2023.
-
Grammatical information in BERT sentence embeddings as two-dimensional arrays
Authors:
Vivi Nastase,
Paola Merlo
Abstract:
Sentence embeddings induced with various transformer architectures encode much semantic and syntactic information in a distributed manner in a one-dimensional array. We investigate whether specific grammatical information can be accessed in these distributed representations. Using data from a task developed to test rule-like generalizations, our experiments on detecting subject-verb agreement yield several promising results. First, we show that while the usual sentence representations encoded as one-dimensional arrays do not easily support extraction of rule-like regularities, a two-dimensional reshaping of these vectors allows various learning architectures to access such information. Next, we show that various architectures can detect patterns in these two-dimensional reshaped sentence embeddings and successfully learn a model based on smaller amounts of simpler training data, which performs well on more complex test data. This indicates that current sentence embeddings contain information that is regularly distributed, and which can be captured when the embeddings are reshaped into higher dimensional arrays. Our results cast light on representations produced by language models and help move towards developing few-shot learning approaches.
Submitted 15 December, 2023;
originally announced December 2023.
-
Semantic Relations and Deep Learning
Authors:
Vivi Nastase,
Stan Szpakowicz
Abstract:
The second edition of "Semantic Relations Between Nominals" by Vivi Nastase, Stan Szpakowicz, Preslav Nakov and Diarmuid Ó Séaghdha was published in April 2021 by Morgan & Claypool (www.morganclaypoolpublishers.com/catalog_Orig/product_info.php?products_id=1627). A new Chapter 5 of the book, by Vivi Nastase and Stan Szpakowicz, discusses relation classification/extraction in the deep-learning paradigm which arose after the first edition appeared. This is Chapter 5, made public by the kind permission of Morgan & Claypool.
Submitted 15 April, 2021; v1 submitted 11 September, 2020;
originally announced September 2020.
-
Assessing the Difficulty of Classifying ConceptNet Relations in a Multi-Label Classification Setting
Authors:
Maria Becker,
Michael Staniek,
Vivi Nastase,
Anette Frank
Abstract:
Commonsense knowledge relations are crucial for advanced NLU tasks. We examine the learnability of such relations as represented in CONCEPTNET, taking into account their specific properties, which can make relation classification difficult: a given concept pair can be linked by multiple relation types, and relations can have multi-word arguments of diverse semantic types. We explore a neural open world multi-label classification approach that focuses on the evaluation of classification accuracy for individual relations. Based on an in-depth study of the specific properties of the CONCEPTNET resource, we investigate the impact of different relation representations and model variations. Our analysis reveals that the complexity of argument types and relation ambiguity are the most important challenges to address. We design a customized evaluation method to address the incompleteness of the resource that can be expanded in future work.
Submitted 14 May, 2019;
originally announced May 2019.
-
Analysis of the Impact of Negative Sampling on Link Prediction in Knowledge Graphs
Authors:
Bhushan Kotnis,
Vivi Nastase
Abstract:
Knowledge graphs are large, useful, but incomplete knowledge repositories. They encode knowledge through entities and relations which define each other through the connective structure of the graph. This has inspired methods for the joint embedding of entities and relations in continuous low-dimensional vector spaces, that can be used to induce new edges in the graph, i.e., link prediction in knowledge graphs. Learning these representations relies on contrasting positive instances with negative ones. Knowledge graphs include only positive relation instances, leaving the door open for a variety of methods for selecting negative examples. In this paper we present an empirical study on the impact of negative sampling on the learned embeddings, assessed through the task of link prediction. We use state-of-the-art knowledge graph embeddings -- RESCAL, TransE, DistMult and ComplEx -- and evaluate on benchmark datasets -- FB15k and WN18. We compare well known methods for negative sampling and additionally propose embedding based sampling methods. We note a marked difference in the impact of these sampling methods on the two datasets, with the "traditional" corrupting positives method leading to best results on WN18, while embedding based methods benefit the task on FB15k.
Submitted 2 March, 2018; v1 submitted 22 August, 2017;
originally announced August 2017.
-
Learning Knowledge Graph Embeddings with Type Regularizer
Authors:
Bhushan Kotnis,
Vivi Nastase
Abstract:
Learning relations based on evidence from knowledge bases relies on processing the available relation instances. Many relations, however, have clear domain and range, which we hypothesize could help learn a better, more generalizing, model. We include such information in the RESCAL model in the form of a regularization factor added to the loss function that takes into account the types (categories) of the entities that appear as arguments to relations in the knowledge base. We note increased performance compared to the baseline model in terms of mean reciprocal rank and hits@N, N = 1, 3, 10. Furthermore, we discover scenarios that significantly impact the effectiveness of the type regularizer.
Submitted 2 March, 2018; v1 submitted 28 June, 2017;
originally announced June 2017.
-
Coarse-grained Cross-lingual Alignment of Comparable Texts with Topic Models and Encyclopedic Knowledge
Authors:
Vivi Nastase,
Angela Fahrni
Abstract:
We present a method for coarse-grained cross-lingual alignment of comparable texts: segments consisting of contiguous paragraphs that discuss the same theme (e.g. history, economy) are aligned based on induced multilingual topics. The method combines three ideas: a two-level LDA model that filters out words that do not convey themes, an HMM that models the ordering of themes in the collection of documents, and language-independent concept annotations to serve as a cross-language bridge and to strengthen the connection between paragraphs in the same segment through concept relations. The method is evaluated on English and French data previously used for monolingual alignment. The results show state-of-the-art performance in both monolingual and cross-lingual settings.
Submitted 28 November, 2014;
originally announced November 2014.