-
Learning Graph Node Embeddings by Smooth Pair Sampling
Authors:
Konstantin Kutzkov
Abstract:
Random walk-based node embedding algorithms have attracted a lot of attention due to their scalability and ease of implementation. Previous research has focused on different walk strategies, optimization objectives, and embedding learning models. Inspired by observations on real data, we take a different approach and propose a new regularization technique. More precisely, the frequencies of node pairs generated by the skip-gram model on random walk node sequences follow a highly skewed distribution which causes learning to be dominated by a fraction of the pairs. We address the issue by designing an efficient sampling procedure that generates node pairs according to their {\em smoothed frequency}. Theoretical and experimental results demonstrate the advantages of our approach.
Submitted 22 January, 2025;
originally announced January 2025.
-
LoNe Sampler: Graph node embeddings by coordinated local neighborhood sampling
Authors:
Konstantin Kutzkov
Abstract:
Local graph neighborhood sampling is a fundamental computational problem that is at the heart of algorithms for node representation learning. Several works have presented algorithms for learning discrete node embeddings where graph nodes are represented by discrete features such as attributes of neighborhood nodes. Discrete embeddings offer several advantages compared to continuous word2vec-like node embeddings: ease of computation, scalability, and interpretability. We present LoNe Sampler, a suite of algorithms for generating discrete node embeddings by Local Neighborhood Sampling, and address two shortcomings of previous work. First, our algorithms have rigorously understood theoretical properties. Second, we show how to generate approximate explicit vector maps that avoid the expensive computation of a Gram matrix for the training of a kernel model. Experiments on benchmark datasets confirm the theoretical findings and demonstrate the advantages of the proposed methods.
Submitted 28 November, 2022;
originally announced November 2022.
-
COLOGNE: Coordinated Local Graph Neighborhood Sampling
Authors:
Konstantin Kutzkov
Abstract:
Representation learning for graphs enables the application of standard machine learning algorithms and data analysis tools to graph data. Replacing discrete unordered objects such as graph nodes by real-valued vectors is at the heart of many approaches to learning from graph data. Such vector representations, or embeddings, capture the discrete relationships in the original data by representing nodes as vectors in a high-dimensional space.
In most applications graphs model the relationship between real-life objects and often nodes contain valuable meta-information about the original objects. While being a powerful machine learning tool, embeddings are not able to preserve such node attributes. We address this shortcoming and consider the problem of learning discrete node embeddings such that the coordinates of the node vector representations are graph nodes. This opens the door to designing interpretable machine learning algorithms for graphs as all attributes originally present in the nodes are preserved.
We present a framework for coordinated local graph neighborhood sampling (COLOGNE) such that each node is represented by a fixed number of graph nodes, together with their attributes. Individual samples are coordinated and they preserve the similarity between node neighborhoods. We consider different notions of similarity for which we design scalable algorithms. We show theoretical results for all proposed algorithms. Experiments on benchmark graphs evaluate the quality of the designed embeddings and demonstrate how the proposed embeddings can be used in training interpretable machine learning algorithms for graph data.
Submitted 9 February, 2021;
originally announced February 2021.
-
Query-Efficient Correlation Clustering
Authors:
David García-Soriano,
Konstantin Kutzkov,
Francesco Bonchi,
Charalampos Tsourakakis
Abstract:
Correlation clustering is arguably the most natural formulation of clustering. Given n objects and a pairwise similarity measure, the goal is to cluster the objects so that, to the best possible extent, similar objects are put in the same cluster and dissimilar objects are put in different clusters.
A main drawback of correlation clustering is that it requires as input the $\Theta(n^2)$ pairwise similarities. This is often infeasible to compute or even just to store. In this paper we study \emph{query-efficient} algorithms for correlation clustering. Specifically, we devise a correlation clustering algorithm that, given a budget of $Q$ queries, attains a solution whose expected number of disagreements is at most $3\cdot OPT + O(\frac{n^3}{Q})$, where $OPT$ is the optimal cost for the instance. Its running time is $O(Q)$, and it can easily be made non-adaptive (meaning it can specify all its queries at the outset and make them in parallel) with the same guarantees. Up to constant factors, our algorithm yields a provably optimal trade-off between the number of queries $Q$ and the worst-case error attained, even for adaptive algorithms.
Finally, we perform an experimental study of our proposed method on both synthetic and real data, showing the scalability and the accuracy of our algorithm.
Submitted 26 February, 2020;
originally announced February 2020.
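To make the notion of "disagreements" in the abstract above concrete, here is a minimal sketch of the classic randomized pivot scheme for correlation clustering, together with a function that counts disagreements. It is an illustration only and not necessarily the algorithm of the paper: it enforces no query budget $Q$, and the names `pivot_clustering`, `disagreements`, and `similar` are hypothetical.

```python
import random
from itertools import combinations


def pivot_clustering(nodes, similar, rng):
    """Classic randomized pivot clustering (illustrative; no query budget).

    `similar(u, v)` is an assumed oracle returning True if u and v are similar.
    """
    remaining = list(nodes)
    rng.shuffle(remaining)  # a random order yields a random pivot at each step
    clusters = []
    while remaining:
        pivot = remaining[0]
        # The pivot takes every remaining node it is similar to.
        cluster = [pivot] + [v for v in remaining[1:] if similar(pivot, v)]
        clusters.append(cluster)
        members = set(cluster)
        remaining = [v for v in remaining if v not in members]
    return clusters


def disagreements(clusters, nodes, similar):
    """Count pairs that are similar but separated, or dissimilar but together."""
    label = {v: i for i, cluster in enumerate(clusters) for v in cluster}
    return sum(
        (label[u] == label[v]) != similar(u, v)
        for u, v in combinations(nodes, 2)
    )


if __name__ == "__main__":
    nodes = list(range(9))
    same_block = lambda u, v: u // 3 == v // 3  # three perfect clusters
    found = pivot_clustering(nodes, same_block, random.Random(0))
    print(found, disagreements(found, nodes, same_block))
```

In this sketch each pivot queries its similarity to every remaining node, so the number of queries can grow quadratically in the worst case; the abstract's point is precisely to cap that number.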
-
KONG: Kernels for ordered-neighborhood graphs
Authors:
Moez Draief,
Konstantin Kutzkov,
Kevin Scaman,
Milan Vojnovic
Abstract:
We present novel graph kernels for graphs with node and edge labels that have ordered neighborhoods, i.e. when neighbor nodes follow an order. Graphs with ordered neighborhoods are a natural data representation for evolving graphs where edges are created over time, which induces an order. Combining convolutional subgraph kernels and string kernels, we design new scalable algorithms for generation of explicit graph feature maps using sketching techniques. We obtain precise bounds for the approximation accuracy and computational complexity of the proposed approaches and demonstrate their applicability on real datasets. In particular, our experiments demonstrate that neighborhood ordering results in more informative features. For the special case of general graphs, i.e. graphs without ordered neighborhoods, the new graph kernels yield efficient and simple algorithms for the comparison of label distributions between graphs.
Submitted 29 May, 2018; v1 submitted 25 May, 2018;
originally announced May 2018.
-
Learning Convolutional Neural Networks for Graphs
Authors:
Mathias Niepert,
Mohamed Ahmed,
Konstantin Kutzkov
Abstract:
Numerous important problems can be framed as learning from graph data. We propose a framework for learning convolutional neural networks for arbitrary graphs. These graphs may be undirected, directed, and with both discrete and continuous node and edge attributes. Analogous to image-based convolutional networks that operate on locally connected regions of the input, we present a general approach to extracting locally connected regions from graphs. Using established benchmark data sets, we demonstrate that the learned feature representations are competitive with state of the art graph kernels and that their computation is highly efficient.
Submitted 8 June, 2016; v1 submitted 17 May, 2016;
originally announced May 2016.
-
Triangle counting in dynamic graph streams
Authors:
Laurent Bulteau,
Vincent Froese,
Konstantin Kutzkov,
Rasmus Pagh
Abstract:
Estimating the number of triangles in graph streams using a limited amount of memory has become a popular topic in the last decade. Different variations of the problem have been studied, depending on whether the graph edges are provided in an arbitrary order or as incidence lists. However, with a few exceptions, the algorithms have considered {\em insert-only} streams. We present a new algorithm estimating the number of triangles in {\em dynamic} graph streams where edges can be both inserted and deleted. We show that our algorithm achieves better time and space complexity than previous solutions for various graph classes, for example sparse graphs with a relatively small number of triangles. Also, for graphs with constant transitivity coefficient, a common situation in real graphs, this is the first algorithm achieving constant processing time per edge. The result is achieved by a novel approach combining sampling of vertex triples and sparsification of the input graph. In the course of the analysis of the algorithm we present a lower bound on the number of pairwise independent 2-paths in general graphs which might be of independent interest. At the end of the paper we discuss lower bounds on the space complexity of triangle counting algorithms that make no assumptions on the structure of the graph.
Submitted 14 July, 2015; v1 submitted 18 April, 2014;
originally announced April 2014.
-
Consistent Subset Sampling
Authors:
Konstantin Kutzkov,
Rasmus Pagh
Abstract:
Consistent sampling is a technique for specifying, in small space, a subset $S$ of a potentially large universe $U$ such that the elements in $S$ satisfy a suitably chosen sampling condition. Given a subset $\mathcal{I}\subseteq U$ it should be possible to quickly compute $\mathcal{I}\cap S$, i.e., the elements in $\mathcal{I}$ satisfying the sampling condition. Consistent sampling has important applications in similarity estimation, and estimation of the number of distinct items in a data stream.
In this paper we generalize consistent sampling to the setting where we are interested in sampling size-$k$ subsets occurring in some set in a collection of sets of bounded size $b$, where $k$ is a small integer. This can be done by applying standard consistent sampling to the $k$-subsets of each set, but that approach requires time $\Theta(b^k)$. Using a carefully designed hash function, for a given sampling probability $p \in (0,1]$, we show how to improve the time complexity to $\Theta(b^{\lceil k/2\rceil}\log \log b + pb^k)$ in expectation, while maintaining strong concentration bounds for the sample. The space usage of our method is $\Theta(b^{\lceil k/4\rceil})$.
We demonstrate the utility of our technique by applying it to several well-studied data mining problems. We show how to efficiently estimate the number of frequent $k$-itemsets in a stream of transactions and the number of bipartite cliques in a graph given as incidence stream. Further, building upon a recent work by Campagna et al., we show that our approach can be applied to frequent itemset mining in a parallel or distributed setting. We also present applications in graph stream mining.
Submitted 18 April, 2014;
originally announced April 2014.
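The basic idea of consistent sampling described in the abstract above can be sketched as follows: the sample $S$ is never stored; it is defined implicitly by a seeded hash function and a threshold, so $\mathcal{I}\cap S$ is obtained by testing each element of $\mathcal{I}$. This shows only standard per-element consistent sampling, not the paper's faster method for $k$-subsets; the names `in_sample`, `consistent_sample`, and the seed value are assumptions made for illustration.

```python
import hashlib


def in_sample(item: str, p: float, seed: bytes = b"demo-seed") -> bool:
    """Sampling condition: keyed 64-bit hash of the item falls below p * 2^64."""
    digest = hashlib.blake2b(item.encode("utf-8"), key=seed, digest_size=8).digest()
    return int.from_bytes(digest, "big") < p * 2**64


def consistent_sample(items, p, seed=b"demo-seed"):
    """Return the elements of `items` that satisfy the sampling condition."""
    return {x for x in items if in_sample(x, p, seed)}


if __name__ == "__main__":
    a = {f"item{i}" for i in range(1000)}
    b = {f"item{i}" for i in range(500, 1500)}
    # The same element is kept or dropped regardless of which set it came from.
    print(consistent_sample(a, 0.1) & b == consistent_sample(a & b, 0.1))
```

Because the decision depends only on the element, the seed, and $p$, two parties that share the seed sample the same elements without communicating.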
-
Local correlation clustering
Authors:
Francesco Bonchi,
David García-Soriano,
Konstantin Kutzkov
Abstract:
Correlation clustering is perhaps the most natural formulation of clustering. Given $n$ objects and a pairwise similarity measure, the goal is to cluster the objects so that, to the best possible extent, similar objects are put in the same cluster and dissimilar objects are put in different clusters. Despite its theoretical appeal, the practical relevance of correlation clustering still remains largely unexplored, mainly due to the fact that correlation clustering requires the $\Theta(n^2)$ pairwise similarities as input.
In this paper we initiate the investigation into \emph{local} algorithms for correlation clustering. In \emph{local correlation clustering} we are given the identifier of a single object and we want to return the cluster to which it belongs in some globally consistent near-optimal clustering, using a small number of similarity queries. Local algorithms for correlation clustering open the door to \emph{sublinear-time} algorithms, which are particularly useful when the similarity between items is costly to compute, as is often the case in many practical application domains. They also imply $(i)$ distributed and streaming clustering algorithms, $(ii)$ constant-time estimators and testers for cluster edit distance, and $(iii)$ property-preserving parallel reconstruction algorithms for clusterability.
Specifically, we devise a local clustering algorithm attaining a $(3, \varepsilon)$-approximation in time $O(1/\varepsilon^2)$ independently of the dataset size. An explicit approximate clustering for all objects can be produced in time $O(n/\varepsilon)$ (which is provably optimal). We also provide a fully additive $(1,\varepsilon)$-approximation with local query complexity $poly(1/\varepsilon)$ and time complexity $2^{poly(1/\varepsilon)}$. The latter yields the fastest polynomial-time approximation scheme for correlation clustering known to date.
Submitted 18 December, 2013;
originally announced December 2013.
-
On Parallelizing Matrix Multiplication by the Column-Row Method
Authors:
Andrea Campagna,
Konstantin Kutzkov,
Rasmus Pagh
Abstract:
We consider the problem of sparse matrix multiplication by the column row method in a distributed setting where the matrix product is not necessarily sparse. We present a surprisingly simple method for "consistent" parallel processing of sparse outer products (column-row vector products) over several processors, in a communication-avoiding setting where each processor has a copy of the input. The method is consistent in the sense that a given output entry is always assigned to the same processor independently of the specific structure of the outer product. We show guarantees on the work done by each processor, and achieve linear speedup down to the point where the cost is dominated by reading the input. Our method gives a way of distributing (or parallelizing) matrix product computations in settings where the main bottlenecks are storing the result matrix, and inter-processor communication. Motivated by observations on real data that often the absolute values of the entries in the product adhere to a power law, we combine our approach with frequent items mining algorithms and show how to obtain a tight approximation of the weight of the heaviest entries in the product matrix.
As a case study we present the application of our approach to frequent pair mining in transactional data streams, a problem that can be phrased in terms of sparse $\{0,1\}$-integer matrix multiplication by the column-row method. Experimental evaluation of the proposed method on real-life data supports the theoretical findings.
Submitted 19 November, 2012; v1 submitted 1 October, 2012;
originally announced October 2012.
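The column-row method mentioned in the abstract above writes the product $AB$ as a sum of outer products of the $k$-th column of $A$ with the $k$-th row of $B$. The sketch below illustrates that decomposition together with a "consistent" assignment in which an output entry $(i, j)$ always goes to the same processor. It is a naive illustration under stated assumptions: the `owner` rule and the function names are hypothetical, and filtering every pair as done here does not achieve the per-processor work guarantees claimed in the paper.

```python
from collections import defaultdict


def owner(i, j, num_procs):
    """Hypothetical fixed rule mapping output entry (i, j) to a processor."""
    return (i * 1_000_003 + j) % num_procs


def column_row_product(a_cols, b_rows, num_procs, proc_id):
    """Entries of AB owned by `proc_id`.

    a_cols[k] maps row index i to A[i][k]; b_rows[k] maps column index j to
    B[k][j]. Only nonzero entries are stored.
    """
    out = defaultdict(float)
    for k in a_cols.keys() & b_rows.keys():
        # Outer product of column k of A with row k of B.
        for i, a in a_cols[k].items():
            for j, b in b_rows[k].items():
                if owner(i, j, num_procs) == proc_id:
                    out[(i, j)] += a * b
    return dict(out)


if __name__ == "__main__":
    # A = [[1, 0], [2, 3]], B = [[4, 5], [0, 6]]
    a_cols = {0: {0: 1, 1: 2}, 1: {1: 3}}
    b_rows = {0: {0: 4, 1: 5}, 1: {1: 6}}
    for proc in range(2):
        print(proc, column_row_product(a_cols, b_rows, 2, proc))
```

Since ownership depends only on $(i, j)$, the partial results of different processors cover disjoint sets of entries and can be combined without any summation across processors.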
-
Deterministic algorithms for skewed matrix products
Authors:
Konstantin Kutzkov
Abstract:
Recently, Pagh presented a randomized approximation algorithm for the multiplication of real-valued matrices building upon work for detecting the most frequent items in data streams. We continue this line of research and present new {\em deterministic} matrix multiplication algorithms.
Motivated by applications in data mining, we first consider the case of real-valued, nonnegative $n$-by-$n$ input matrices $A$ and $B$, and show how to obtain a deterministic approximation of the weights of individual entries, as well as the entrywise $p$-norm, of the product $AB$. The algorithm is simple, space efficient and runs in one pass over the input matrices. For a user defined $b \in (0, n^2)$ the algorithm runs in time $O(nb + n\cdot\text{Sort}(n))$ and space $O(n + b)$ and returns an approximation of the entries of $AB$ within an additive factor of $\|AB\|_{E1}/b$, where $\|C\|_{E1} = \sum_{i, j} |C_{ij}|$ is the entrywise 1-norm of a matrix $C$ and $\text{Sort}(n)$ is the time required to sort $n$ real numbers in linear space. Building upon a result by Berinde et al. we show that for skewed matrix products (a common situation in many real-life applications) the algorithm is more efficient and achieves better approximation guarantees than previously known randomized algorithms.
When the input matrices are not restricted to nonnegative entries, we present a new deterministic group testing algorithm detecting nonzero entries in the matrix product with large absolute value. The algorithm is clearly outperformed by randomized matrix multiplication algorithms, but as a byproduct we obtain the first $O(n^{2 + \varepsilon})$-time deterministic algorithm for matrix products with $O(\sqrt{n})$ nonzero entries.
Submitted 20 September, 2012;
originally announced September 2012.
-
Using CSP To Improve Deterministic 3-SAT
Authors:
Konstantin Kutzkov,
Dominik Scheder
Abstract:
We show how one can use certain deterministic algorithms for higher-value constraint satisfaction problems (CSPs) to speed up deterministic local search for 3-SAT. This way, we improve the deterministic worst-case running time for 3-SAT to O(1.439^n).
Submitted 26 July, 2010; v1 submitted 7 July, 2010;
originally announced July 2010.