-
Graph-Convolutional Networks: Named Entity Recognition and Large Language Model Embedding in Document Clustering
Authors:
Imed Keraghel,
Mohamed Nadif
Abstract:
Recent advances in machine learning, particularly Large Language Models (LLMs) such as BERT and GPT, provide rich contextual embeddings that improve text representation. However, current document clustering approaches often ignore the deeper relationships between named entities (NEs) and the potential of LLM embeddings. This paper proposes a novel approach that integrates Named Entity Recognition (NER) and LLM embeddings within a graph-based framework for document clustering. The method builds a graph with nodes representing documents and edges weighted by named entity similarity, optimized using a graph-convolutional network (GCN). This ensures a more effective grouping of semantically related documents. Experimental results indicate that our approach outperforms conventional co-occurrence-based methods in clustering, notably for documents rich in named entities.
Submitted 19 December, 2024;
originally announced December 2024.
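As a rough illustration of the graph construction described in the abstract above, the sketch below builds a document graph whose edges are weighted by named entity similarity. It makes several assumptions that are not in the abstract: the entity sets are hand-written placeholders standing in for the output of an NER model, Jaccard overlap is used as the similarity measure, and the function names `jaccard` and `entity_graph` are hypothetical. The GCN optimization step is not shown.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def entity_graph(doc_entities):
    """Return a symmetric weighted adjacency matrix as a list of lists.

    Entry [i][j] is the entity-set similarity between documents i and j.
    """
    n = len(doc_entities)
    adj = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        w = jaccard(doc_entities[i], doc_entities[j])
        adj[i][j] = adj[j][i] = w
    return adj


# Hypothetical entity sets, as if produced by an NER model (assumption).
docs = [
    {"Paris", "UNESCO", "France"},
    {"Paris", "France", "Louvre"},
    {"NASA", "Mars"},
]
adj = entity_graph(docs)
```

Here the first two documents share two of four distinct entities, so their edge weight is 0.5, while the third document shares none and stays disconnected from both.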
-
More Discriminative Sentence Embeddings via Semantic Graph Smoothing
Authors:
Chakib Fettal,
Lazhar Labiod,
Mohamed Nadif
Abstract:
This paper explores an empirical approach to learn more discriminative sentence representations in an unsupervised fashion. Leveraging semantic graph smoothing, we enhance sentence embeddings obtained from pretrained models to improve results for the text clustering and classification tasks. Our method, validated on eight benchmarks, demonstrates consistent improvements, showcasing the potential of semantic graph smoothing in improving sentence embeddings for the supervised and unsupervised document categorization tasks.
Submitted 20 February, 2024;
originally announced February 2024.
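The abstract above does not say which graph or filter is used for smoothing, so the following is only a minimal sketch of the general idea under stated assumptions: a k-nearest-neighbour graph is built from cosine similarity between embeddings, and each embedding is replaced by the mean of itself and its neighbours (a simple low-pass graph filter). The toy 2-D vectors and the names `cosine` and `smooth` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def smooth(embeddings, k=1):
    """One smoothing step: average each vector with its k nearest neighbours."""
    n = len(embeddings)
    smoothed = []
    for i in range(n):
        # Rank all other vectors by similarity to vector i.
        sims = [(cosine(embeddings[i], embeddings[j]), j)
                for j in range(n) if j != i]
        neighbours = [j for _, j in sorted(sims, reverse=True)[:k]]
        group = [i] + neighbours
        dim = len(embeddings[i])
        smoothed.append(
            [sum(embeddings[g][d] for g in group) / len(group)
             for d in range(dim)]
        )
    return smoothed


# Toy 2-D "sentence embeddings" (assumption, for illustration only).
X = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
S = smooth(X, k=1)
```

After one step the first two vectors, which are each other's nearest neighbour, collapse onto the same point, which is the kind of within-group tightening that smoothing is meant to produce.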
-
Scalable Multi-view Clustering via Explicit Kernel Features Maps
Authors:
Chakib Fettal,
Lazhar Labiod,
Mohamed Nadif
Abstract:
A growing awareness of multi-view learning as an important component in data science and machine learning is a consequence of the increasing prevalence of multiple views in real-world applications, especially in the context of networks. In this paper we introduce a new scalability framework for multi-view subspace clustering. An efficient optimization strategy is proposed, leveraging kernel feature maps to reduce the computational burden while maintaining good clustering performance. The scalability of the algorithm means that it can be applied to large-scale datasets, including those with millions of data points, using a standard machine, in a few minutes. We conduct extensive experiments on real-world benchmark networks of various sizes in order to evaluate the performance of our algorithm against state-of-the-art multi-view subspace clustering methods and attributed-network multi-view approaches.
Submitted 7 February, 2024;
originally announced February 2024.
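To make the notion of an explicit kernel feature map in the abstract above concrete, here is a small sketch. The abstract does not name the kernel, so the degree-2 homogeneous polynomial kernel k(x, y) = (x . y)^2 is an assumption chosen because its feature map is exact and easy to verify; the function names `poly2_features` and `dot` are hypothetical. The point is that inner products of mapped vectors reproduce the kernel, so an n x n kernel matrix never needs to be formed.

```python
import math


def poly2_features(x):
    """Explicit feature map phi with phi(x) . phi(y) == (x . y) ** 2."""
    d = len(x)
    feats = []
    for i in range(d):
        for j in range(i, d):
            # Off-diagonal monomials appear twice in (x . y) ** 2,
            # hence the sqrt(2) scaling.
            scale = 1.0 if i == j else math.sqrt(2.0)
            feats.append(scale * x[i] * x[j])
    return feats


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


x = [1.0, 2.0]
y = [3.0, -1.0]
kernel_value = dot(x, y) ** 2                               # kernel evaluated directly
mapped_value = dot(poly2_features(x), poly2_features(y))    # via the explicit map
```

Both values agree up to floating-point error, so any linear method applied to the mapped features behaves like its kernelized counterpart at a cost linear in the number of points.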
-
Graph Cuts with Arbitrary Size Constraints Through Optimal Transport
Authors:
Chakib Fettal,
Lazhar Labiod,
Mohamed Nadif
Abstract:
A common way of partitioning graphs is through minimum cuts. One drawback of classical minimum cut methods is that they tend to produce small groups, which is why more balanced variants such as normalized and ratio cuts have seen more success. However, we believe that with these variants, the balance constraints can be too restrictive for some applications, such as clustering imbalanced datasets, while not being restrictive enough when searching for perfectly balanced partitions. Here, we propose a new graph cut algorithm for partitioning graphs under arbitrary size constraints. We formulate the graph cut problem as a Gromov-Wasserstein problem with a concave regularizer. We then propose to solve it using an accelerated proximal gradient descent algorithm, which guarantees global convergence to a critical point, results in sparse solutions, and incurs only an additional factor of $\mathcal{O}(\log(n))$ compared to the classical spectral clustering algorithm, yet was observed to be more efficient.
Submitted 4 October, 2024; v1 submitted 7 February, 2024;
originally announced February 2024.
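The motivation in the abstract above, that plain minimum cuts favour small groups while ratio cuts favour balance, can be checked on a toy graph. The sketch below is not the paper's algorithm (no optimal transport is involved); the graph, its weights, and the function names `cut_value` and `ratio_cut` are assumptions made purely for illustration.

```python
def cut_value(edges, part):
    """Total weight of edges with exactly one endpoint in `part`."""
    return sum(w for u, v, w in edges if (u in part) != (v in part))


def ratio_cut(edges, part, n_nodes):
    """Two-way ratio cut: cut(A, B) / |A| + cut(A, B) / |B|."""
    c = cut_value(edges, part)
    return c / len(part) + c / (n_nodes - len(part))


# Toy graph (assumption): two triangles joined by a bridge,
# plus a pendant node 6 attached by a lighter edge.
edges = [
    (0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),   # first triangle
    (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),   # second triangle
    (2, 3, 1.0),                             # bridge
    (5, 6, 0.8),                             # pendant node
]
n_nodes = 7
small = {6}            # isolates the pendant node
balanced = {0, 1, 2}   # splits at the bridge

small_cut = cut_value(edges, small)
balanced_cut = cut_value(edges, balanced)
small_ratio = ratio_cut(edges, small, n_nodes)
balanced_ratio = ratio_cut(edges, balanced, n_nodes)
```

The raw cut prefers isolating the single pendant node, whereas the ratio cut prefers the split at the bridge; neither objective lets the user prescribe arbitrary group sizes, which is the gap the abstract addresses.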
-
Recent Advances in Named Entity Recognition: A Comprehensive Survey and Comparative Study
Authors:
Imed Keraghel,
Stanislas Morbieu,
Mohamed Nadif
Abstract:
Named Entity Recognition seeks to extract substrings within a text that name real-world objects and to determine their type (for example, whether they refer to persons or organizations). In this survey, we first present an overview of recent popular approaches, including advancements in Transformer-based methods and Large Language Models (LLMs) that have not had much coverage in other surveys. In addition, we discuss reinforcement learning and graph-based approaches, highlighting their role in enhancing NER performance. Second, we focus on methods designed for datasets with scarce annotations. Third, we evaluate the performance of the main NER implementations on a variety of datasets with differing characteristics (as regards their domain, their size, and their number of classes). We thus provide a deep comparison of algorithms that have never been considered together. Our experiments shed some light on how the characteristics of datasets affect the behavior of the methods we compare.
Submitted 20 December, 2024; v1 submitted 19 January, 2024;
originally announced January 2024.
-
Spectral Clustering via Ensemble Deep Autoencoder Learning (SC-EDAE)
Authors:
Severine Affeldt,
Lazhar Labiod,
Mohamed Nadif
Abstract:
Recently, a number of works have studied clustering strategies that combine classical clustering algorithms and deep learning methods. These approaches are either sequential, where a deep representation is learned using a deep autoencoder before obtaining clusters with k-means, or simultaneous, where the deep representation and clusters are learned jointly by optimizing a single objective function. Both strategies improve clustering performance; however, the robustness of these approaches is impeded by several deep autoencoder settings, among them the weight initialization, the width and number of layers, and the number of epochs. To alleviate the impact of such hyperparameter settings on clustering performance, we propose a new model which combines the strengths of spectral clustering and deep autoencoders in an ensemble learning framework. Extensive experiments on various benchmark datasets demonstrate the potential and robustness of our approach compared to state-of-the-art deep clustering methods.
Submitted 12 June, 2019; v1 submitted 8 January, 2019;
originally announced January 2019.
-
Data Leak Aware Crowdsourcing in Social Network
Authors:
Iheb Ben Amor,
Athman Bougetteya,
Mourad Ouziri,
Salima Benbernou,
Mohamed Nadif
Abstract:
Harnessing human computation for solving complex problems raises the issue of finding the unknown competitive group of solvers. In this paper, we propose an approach called Friendlysourcing to build up teams from a social network to answer a business call, all the while avoiding partial solution disclosure to competitive groups. The contributions of this paper include (i) a clustering-based approach for discovering collaborative and competitive teams in a social network and (ii) a Markov-chain-based algorithm for discovering implicit interactions in the social network.
Submitted 28 May, 2013;
originally announced May 2013.