-
Graph Linearization Methods for Reasoning on Graphs with Large Language Models
Authors:
Christos Xypolopoulos,
Guokan Shang,
Xiao Fei,
Giannis Nikolentzos,
Hadi Abdine,
Iakovos Evdaimon,
Michail Chatzianastasis,
Giorgos Stamou,
Michalis Vazirgiannis
Abstract:
Large language models have evolved to process multiple modalities beyond text, such as images and audio, which motivates us to explore how to effectively leverage them for graph reasoning tasks. The key question, therefore, is how to transform graphs into linear sequences of tokens, a process we term "graph linearization", so that LLMs can handle graphs naturally. We consider that graphs should be linearized meaningfully to reflect certain properties of natural language text, such as local dependency and global alignment, so that contemporary LLMs, trained on trillions of textual tokens, can better understand graphs. To achieve this, we developed several graph linearization methods based on graph centrality and degeneracy. These methods are further enhanced using node relabeling techniques. The experimental results demonstrate the effectiveness of our methods compared to the random linearization baseline. Our work introduces novel graph representations suitable for LLMs, contributing to the potential integration of graph machine learning with the trend of multimodal processing using a unified transformer model.
Submitted 25 June, 2025; v1 submitted 25 October, 2024;
originally announced October 2024.
-
Bias in the Mirror: Are LLMs opinions robust to their own adversarial attacks ?
Authors:
Virgile Rennard,
Christos Xypolopoulos,
Michalis Vazirgiannis
Abstract:
Large language models (LLMs) inherit biases from their training data and alignment processes, influencing their responses in subtle ways. While many studies have examined these biases, little work has explored their robustness during interactions. In this paper, we introduce a novel approach where two instances of an LLM engage in self-debate, arguing opposing viewpoints to persuade a neutral version of the model. Through this, we evaluate how firmly biases hold and whether models are susceptible to reinforcing misinformation or shifting to harmful viewpoints. Our experiments span multiple LLMs of varying sizes, origins, and languages, providing deeper insights into bias persistence and flexibility across linguistic and cultural contexts.
Submitted 5 November, 2024; v1 submitted 17 October, 2024;
originally announced October 2024.
-
Neural Graph Generator: Feature-Conditioned Graph Generation using Latent Diffusion Models
Authors:
Iakovos Evdaimon,
Giannis Nikolentzos,
Christos Xypolopoulos,
Ahmed Kammoun,
Michail Chatzianastasis,
Hadi Abdine,
Michalis Vazirgiannis
Abstract:
Graph generation has emerged as a crucial task in machine learning, with significant challenges in generating graphs that accurately reflect specific properties. Existing methods often fall short in efficiently addressing this need as they struggle with the high-dimensional complexity and varied nature of graph properties. In this paper, we introduce the Neural Graph Generator (NGG), a novel approach which utilizes conditioned latent diffusion models for graph generation. NGG demonstrates a remarkable capacity to model complex graph patterns, offering control over the graph generation process. NGG employs a variational graph autoencoder for graph compression and a diffusion process in the latent vector space, guided by vectors summarizing graph statistics. We demonstrate NGG's versatility across various graph generation tasks, showing its capability to capture desired graph properties and generalize to unseen graphs. We also compare our generator to the graph generation capabilities of different LLMs. This work signifies a shift in graph generation methodologies, offering a more practical and efficient solution for generating diverse graphs with specific characteristics.
Submitted 18 September, 2024; v1 submitted 3 March, 2024;
originally announced March 2024.
-
GreekBART: The First Pretrained Greek Sequence-to-Sequence Model
Authors:
Iakovos Evdaimon,
Hadi Abdine,
Christos Xypolopoulos,
Stamatis Outsios,
Michalis Vazirgiannis,
Giorgos Stamou
Abstract:
The era of transfer learning has revolutionized the fields of Computer Vision and Natural Language Processing, bringing powerful pretrained models with exceptional performance across a variety of tasks. Specifically, Natural Language Processing tasks have been dominated by transformer-based language models. In Natural Language Inference and Natural Language Generation tasks, the BERT model and its variants, as well as the GPT model and its successors, demonstrated exemplary performance. However, the majority of these models are pretrained and assessed primarily for the English language or on a multilingual corpus. In this paper, we introduce GreekBART, the first Seq2Seq model based on the BART-base architecture and pretrained on a large-scale Greek corpus. We evaluate and compare GreekBART against BART-random, Greek-BERT, and XLM-R on a variety of discriminative tasks. In addition, we examine its performance on two NLG tasks from GreekSUM, a newly introduced summarization dataset for the Greek language. The model, the code, and the new summarization dataset will be publicly available.
Submitted 3 April, 2023;
originally announced April 2023.
-
NLP Research and Resources at DaSciM, Ecole Polytechnique
Authors:
Hadi Abdine,
Yanzhu Guo,
Moussa Kamal Eddine,
Giannis Nikolentzos,
Stamatis Outsios,
Guokan Shang,
Christos Xypolopoulos,
Michalis Vazirgiannis
Abstract:
DaSciM (Data Science and Mining), part of LIX at Ecole Polytechnique, was established in 2013 and has since been producing research results in the area of large-scale data analysis via machine and deep learning methods. The group has been particularly active in NLP and text mining, with interesting results at both the methodological and resource levels. Here we present our different contributions of interest to the AFIA community.
Submitted 1 December, 2021;
originally announced December 2021.
-
BERTweetFR : Domain Adaptation of Pre-Trained Language Models for French Tweets
Authors:
Yanzhu Guo,
Virgile Rennard,
Christos Xypolopoulos,
Michalis Vazirgiannis
Abstract:
We introduce BERTweetFR, the first large-scale pre-trained language model for French tweets. Our model is initialized using the general-domain French language model CamemBERT, which follows the base architecture of RoBERTa. Experiments show that BERTweetFR outperforms all previous general-domain French language models on two downstream Twitter NLP tasks: offensiveness identification and named entity recognition. The dataset used in the offensiveness detection task was created and annotated by our team, filling the gap left by the lack of such analytic datasets in French. We make our model publicly available in the transformers library with the aim of promoting future research in analytic tasks for French tweets.
Submitted 21 September, 2021;
originally announced September 2021.
-
Evaluation Of Word Embeddings From Large-Scale French Web Content
Authors:
Hadi Abdine,
Christos Xypolopoulos,
Moussa Kamal Eddine,
Michalis Vazirgiannis
Abstract:
Distributed word representations are widely used in many natural language processing tasks. Moreover, word vectors pretrained on huge text corpora have achieved high performance in many different NLP tasks. This paper introduces multiple high-quality word vectors for the French language, two of which are trained on massive crawled French data collected during this study, while the others are trained on an already existing French corpus. We also evaluate the quality of our proposed word vectors and the existing French word vectors on the French word analogy task. In addition, we evaluate them on multiple real NLP tasks, which shows a substantial performance improvement of the pre-trained word vectors compared to the existing and random ones. Finally, we created a demo web application to test and visualize the obtained word embeddings. The produced French word embeddings are available to the public, along with the finetuning code on the NLU tasks and the demo code.
Submitted 10 March, 2022; v1 submitted 5 May, 2021;
originally announced May 2021.
-
How COVID-19 Is Changing Our Language : Detecting Semantic Shift in Twitter Word Embeddings
Authors:
Yanzhu Guo,
Christos Xypolopoulos,
Michalis Vazirgiannis
Abstract:
Words are malleable objects, influenced by events that are reflected in written texts. Situated in the global outbreak of COVID-19, our research aims at detecting semantic shifts in social media language triggered by the health crisis. With COVID-19 related big data extracted from Twitter, we train separate word embedding models for different time periods after the outbreak. We employ an alignment-based approach to compare these embeddings with a general-purpose Twitter embedding unrelated to COVID-19. We also compare our trained embeddings with one another to observe diachronic evolution. Carrying out case studies on a set of words chosen by topic detection, we verify that our alignment approach is valid. Finally, we quantify the size of global semantic shift by a stability measure based on back-and-forth rotational alignment.
Submitted 15 February, 2021;
originally announced February 2021.
-
Performance in the Courtroom: Automated Processing and Visualization of Appeal Court Decisions in France
Authors:
Paul Boniol,
George Panagopoulos,
Christos Xypolopoulos,
Rajaa El Hamdani,
David Restrepo Amariles,
Michalis Vazirgiannis
Abstract:
Artificial Intelligence techniques are already popular and important in the legal domain. We extract legal indicators from judicial judgments to decrease the asymmetry of information in the legal system and the access-to-justice gap. We use NLP methods to extract interesting entities/data from judgments to construct networks of lawyers and judgments. We propose metrics to rank lawyers based on their experience, win/loss ratio and their importance in the network of lawyers. We also perform community detection in the network of judgments and propose metrics to represent the difficulty of cases, capitalising on community features.
Submitted 9 July, 2020; v1 submitted 11 June, 2020;
originally announced June 2020.
-
Unsupervised Word Polysemy Quantification with Multiresolution Grids of Contextual Embeddings
Authors:
Christos Xypolopoulos,
Antoine J. -P. Tixier,
Michalis Vazirgiannis
Abstract:
The number of senses of a given word, or polysemy, is a very subjective notion, which varies widely across annotators and resources. We propose a novel method to estimate polysemy, based on simple geometry in the contextual embedding space. Our approach is fully unsupervised and purely data-driven. We show through rigorous experiments that our rankings are well correlated (with strong statistical significance) with 6 different rankings derived from famous human-constructed resources such as WordNet, OntoNotes, Oxford, Wikipedia etc., for 6 different standard metrics. We also visualize and analyze the correlation between the human rankings. A valuable by-product of our method is the ability to sample, at no extra cost, sentences containing different senses of a given word. Finally, the fully unsupervised nature of our method makes it applicable to any language.
Code and data are publicly available at https://github.com/ksipos/polysemy-assessment .
The paper was accepted as a long paper at EACL 2021.
Submitted 12 February, 2021; v1 submitted 23 March, 2020;
originally announced March 2020.
-
Word Embeddings from Large-Scale Greek Web Content
Authors:
Stamatis Outsios,
Konstantinos Skianis,
Polykarpos Meladianos,
Christos Xypolopoulos,
Michalis Vazirgiannis
Abstract:
Word embeddings are undoubtedly very useful components in many NLP tasks. In this paper, we present word embeddings and other linguistic resources trained on the largest digital Greek language corpus to date. We also present a live web tool for testing the Greek word embeddings, offering "analogy", "similarity score" and "most similar words" functions. Through our explorer, one can interact with the Greek word vectors.
Submitted 26 October, 2018; v1 submitted 8 October, 2018;
originally announced October 2018.