-
SoK: The Faults in our Graph Benchmarks
Authors:
Puneet Mehrotra,
Vaastav Anand,
Daniel Margo,
Milad Rezaei Hajidehi,
Margo Seltzer
Abstract:
Graph-structured data is prevalent in domains such as social networks, financial transactions, brain networks, and protein interactions. As a result, the research community has produced new databases and analytics engines to process such data. Unfortunately, there is not yet widespread benchmark standardization in graph processing, and the heterogeneity of evaluations found in the literature can l…
▽ More
Graph-structured data is prevalent in domains such as social networks, financial transactions, brain networks, and protein interactions. As a result, the research community has produced new databases and analytics engines to process such data. Unfortunately, there is not yet widespread benchmark standardization in graph processing, and the heterogeneity of evaluations found in the literature can lead researchers astray. Evaluations frequently ignore datasets' statistical idiosyncrasies, which significantly affect system performance. Scalability studies often use datasets that fit easily in memory on a modest desktop. Some studies rely on synthetic graph generators, but these generators produce graphs with unnatural characteristics that also affect performance, producing misleading results. Currently, the community has no consistent and principled manner with which to compare systems and provide guidance to developers who wish to select the system most suited to their application.
We provide three different systematizations of benchmarking practices. First, we present a 12-year literary review of graph processing benchmarking, including a summary of the prevalence of specific datasets and benchmarks used in these papers. Second, we demonstrate the impact of two statistical properties of datasets that drastically affect benchmark performance. We show how different assignments of IDs to vertices, called vertex orderings, dramatically alter benchmark performance due to the caching behavior they induce. We also show the impact of zero-degree vertices on the runtime of benchmarks such as breadth-first search and single-source shortest path. We show that these issues can cause performance to change by as much as 38% on several popular graph processing systems. Finally, we suggest best practices to account for these issues when evaluating graph systems.
△ Less
Submitted 31 March, 2024;
originally announced April 2024.
-
CUTTANA: Scalable Graph Partitioning for Faster Distributed Graph Databases and Analytics
Authors:
Milad Rezaei Hajidehi,
Sraavan Sridhar,
Margo Seltzer
Abstract:
Graph partitioning plays a pivotal role in various distributed graph processing applications, including graph analytics, graph neural network training, and distributed graph databases. Graphs that require distributed settings are often too large to fit in the main memory of a single machine. This challenge renders traditional in-memory graph partitioners infeasible, leading to the emergence of str…
▽ More
Graph partitioning plays a pivotal role in various distributed graph processing applications, including graph analytics, graph neural network training, and distributed graph databases. Graphs that require distributed settings are often too large to fit in the main memory of a single machine. This challenge renders traditional in-memory graph partitioners infeasible, leading to the emergence of streaming solutions. Streaming partitioners produce lower-quality partitions because they work from partial information and must make premature decisions before they have a complete view of a vertex's neighborhood. We introduce CUTTANA, a streaming graph partitioner that partitions massive graphs (Web/Twitter scale) with superior quality compared to existing streaming solutions. CUTTANA uses a novel buffering technique that prevents the premature assignment of vertices to partitions and a scalable coarsening and refinement technique that enables a complete graph view, improving the intermediate assignment made by a streaming partitioner. We implemented a parallel version for CUTTANA that offers nearly the same partitioning latency as existing streaming partitioners.
Our experimental analysis shows that CUTTANA consistently yields better partitioning quality than existing state-of-the-art streaming vertex partitioners in terms of both edge-cut and communication volume metrics. We also evaluate the workload latencies that result from using CUTTANA and other partitioners in distributed graph analytics and databases. CUTTANA outperforms the other methods in most scenarios (algorithms, datasets). In analytics applications, CUTTANA improves runtime performance by up to 59% compared to various streaming partitioners (HDRF, Fennel, Ginger, HeiStream). In graph database tasks, CUTTANA results in higher query throughput by up to 23%, without hurting tail latency.
△ Less
Submitted 9 December, 2024; v1 submitted 13 December, 2023;
originally announced December 2023.
-
CS-TGN: Community Search via Temporal Graph Neural Networks
Authors:
Farnoosh Hashemi,
Ali Behrouz,
Milad Rezaei Hajidehi
Abstract:
Searching for local communities is an important research challenge that allows for personalized community discovery and supports advanced data analysis in various complex networks, such as the World Wide Web, social networks, and brain networks. The evolution of these networks over time has motivated several recent studies to identify local communities in temporal networks. Given any query nodes,…
▽ More
Searching for local communities is an important research challenge that allows for personalized community discovery and supports advanced data analysis in various complex networks, such as the World Wide Web, social networks, and brain networks. The evolution of these networks over time has motivated several recent studies to identify local communities in temporal networks. Given any query nodes, Community Search aims to find a densely connected subgraph containing query nodes. However, existing community search approaches in temporal networks have two main limitations: (1) they adopt pre-defined subgraph patterns to model communities, which cannot find communities that do not conform to these patterns in real-world networks, and (2) they only use the aggregation of disjoint structural information to measure quality, missing the dynamic of connections and temporal properties. In this paper, we propose a query-driven Temporal Graph Convolutional Network (CS-TGN) that can capture flexible community structures by learning from the ground-truth communities in a data-driven manner. CS-TGN first combines the local query-dependent structure and the global graph embedding in each snapshot of the network and then uses a GRU cell with contextual attention to learn the dynamics of interactions and update node embeddings over time. We demonstrate how this model can be used for interactive community search in an online setting, allowing users to evaluate the found communities and provide feedback. Experiments on real-world temporal graphs with ground-truth communities validate the superior quality of the solutions obtained and the efficiency of our model in both temporal and interactive static settings.
△ Less
Submitted 15 March, 2023;
originally announced March 2023.