-
Leiden-Fusion Partitioning Method for Effective Distributed Training of Graph Embeddings
Authors:
Yuhe Bai,
Camelia Constantin,
Hubert Naacke
Abstract:
In the area of large-scale training of graph embeddings, effective training frameworks and partitioning methods are critical for handling large networks. However, they face two major challenges: 1) existing synchronized distributed frameworks require continuous communication to access information from other machines, and 2) the inability of current partitioning methods to ensure that subgraphs rem…
▽ More
In the area of large-scale training of graph embeddings, effective training frameworks and partitioning methods are critical for handling large networks. However, they face two major challenges: 1) existing synchronized distributed frameworks require continuous communication to access information from other machines, and 2) the inability of current partitioning methods to ensure that subgraphs remain connected components without isolated nodes, which is essential for effective training of GNNs since training relies on information aggregation from neighboring nodes. To address these issues, we introduce a novel partitioning method, named Leiden-Fusion, designed for large-scale training of graphs with minimal communication. Our method extends the Leiden community detection algorithm with a greedy algorithm that merges the smallest communities with highly connected neighboring communities. Our method guarantees that, for an initially connected graph, each partition is a densely connected subgraph with no isolated nodes. After obtaining the partitions, we train a GNN for each partition independently, and finally integrate all embeddings for node classification tasks, which significantly reduces the need for network communication and enhances the efficiency of distributed graph training. We demonstrate the effectiveness of our method through extensive evaluations on several benchmark datasets, achieving high efficiency while preserving the quality of the graph embeddings for node classification tasks.
△ Less
Submitted 15 September, 2024;
originally announced September 2024.
-
ATEM: A Topic Evolution Model for the Detection of Emerging Topics in Scientific Archives
Authors:
Hamed Rahimi,
Hubert Naacke,
Camelia Constantin,
Bernd Amann
Abstract:
This paper presents ATEM, a novel framework for studying topic evolution in scientific archives. ATEM is based on dynamic topic modeling and dynamic graph embedding techniques that explore the dynamics of content and citations of documents within a scientific corpus. ATEM explores a new notion of contextual emergence for the discovery of emerging interdisciplinary research topics based on the dyna…
▽ More
This paper presents ATEM, a novel framework for studying topic evolution in scientific archives. ATEM is based on dynamic topic modeling and dynamic graph embedding techniques that explore the dynamics of content and citations of documents within a scientific corpus. ATEM explores a new notion of contextual emergence for the discovery of emerging interdisciplinary research topics based on the dynamics of citation links in topic clusters. Our experiments show that ATEM can efficiently detect emerging cross-disciplinary topics within the DBLP archive of over five million computer science articles.
△ Less
Submitted 3 June, 2023;
originally announced June 2023.
-
Contextualized Topic Coherence Metrics
Authors:
Hamed Rahimi,
Jacob Louis Hoover,
David Mimno,
Hubert Naacke,
Camelia Constantin,
Bernd Amann
Abstract:
The recent explosion in work on neural topic modeling has been criticized for optimizing automated topic evaluation metrics at the expense of actual meaningful topic identification. But human annotation remains expensive and time-consuming. We propose LLM-based methods inspired by standard human topic evaluations, in a family of metrics called Contextualized Topic Coherence (CTC). We evaluate both…
▽ More
The recent explosion in work on neural topic modeling has been criticized for optimizing automated topic evaluation metrics at the expense of actual meaningful topic identification. But human annotation remains expensive and time-consuming. We propose LLM-based methods inspired by standard human topic evaluations, in a family of metrics called Contextualized Topic Coherence (CTC). We evaluate both a fully automated version as well as a semi-automated CTC that allows human-centered evaluation of coherence while maintaining the efficiency of automated methods. We evaluate CTC relative to five other metrics on six topic models and find that it outperforms automated topic coherence methods, works well on short documents, and is not susceptible to meaningless but high-scoring topics.
△ Less
Submitted 23 May, 2023;
originally announced May 2023.
-
ANTM: An Aligned Neural Topic Model for Exploring Evolving Topics
Authors:
Hamed Rahimi,
Hubert Naacke,
Camelia Constantin,
Bernd Amann
Abstract:
This paper presents an algorithmic family of dynamic topic models called Aligned Neural Topic Models (ANTM), which combine novel data mining algorithms to provide a modular framework for discovering evolving topics. ANTM maintains the temporal continuity of evolving topics by extracting time-aware features from documents using advanced pre-trained Large Language Models (LLMs) and employing an over…
▽ More
This paper presents an algorithmic family of dynamic topic models called Aligned Neural Topic Models (ANTM), which combine novel data mining algorithms to provide a modular framework for discovering evolving topics. ANTM maintains the temporal continuity of evolving topics by extracting time-aware features from documents using advanced pre-trained Large Language Models (LLMs) and employing an overlapping sliding window algorithm for sequential document clustering. This overlapping sliding window algorithm identifies a different number of topics within each time frame and aligns semantically similar document clusters across time periods. This process captures emerging and fading trends across different periods and allows for a more interpretable representation of evolving topics. Experiments on four distinct datasets show that ANTM outperforms probabilistic dynamic topic models in terms of topic coherence and diversity metrics. Moreover, it improves the scalability and flexibility of dynamic topic models by being accessible and adaptable to different types of algorithms. Additionally, a Python package is developed for researchers and scientists who wish to study the trends and evolving patterns of topics in large-scale textual data.
△ Less
Submitted 4 June, 2023; v1 submitted 2 February, 2023;
originally announced February 2023.
-
Localisable Monads
Authors:
Carmen Constantin,
Nuiok Dicaire,
Chris Heunen
Abstract:
Monads govern computational side-effects in programming semantics. They can be combined in a ''bottom-up'' way to handle several instances of such effects. Indexed monads and graded monads do this in a modular way. Here, instead, we equip monads with fine-grained structure in a ''top-down'' way, using techniques from tensor topology. This provides an intrinsic theory of local computational effects…
▽ More
Monads govern computational side-effects in programming semantics. They can be combined in a ''bottom-up'' way to handle several instances of such effects. Indexed monads and graded monads do this in a modular way. Here, instead, we equip monads with fine-grained structure in a ''top-down'' way, using techniques from tensor topology. This provides an intrinsic theory of local computational effects without needing to know how constituent effects interact beforehand.
Specifically, any monoidal category decomposes as a sheaf of local categories over a base space. We identify a notion of localisable monads which characterises when a monad decomposes as a sheaf of monads. Equivalently, localisable monads are formal monads in an appropriate presheaf 2-category, whose algebras we characterise. Three extended examples demonstrate how localisable monads can interpret the base space as locations in a computer memory, as sites in a network of interacting agents acting concurrently, and as time in stochastic processes.
△ Less
Submitted 3 August, 2021;
originally announced August 2021.