Tensor Decomposition Meets RKHS: Efficient Algorithms for Smooth and Misaligned Data

Authors: Brett W. Larsen, Tamara G. Kolda, Anru R. Zhang, Alex H. Williams

Abstract: The canonical polyadic (CP) tensor decomposition decomposes a multidimensional data array into a sum of outer products of finite-dimensional vectors. Instead, we can replace some or all of the vectors with continuous functions (infinite-dimensional vectors) from a reproducing kernel Hilbert space (RKHS). We refer to tensors with some infinite-dimensional modes as quasitensors, and the approach of decomposing a tensor with some continuous RKHS modes is referred to as CP-HiFi (hybrid infinite and finite dimensional) tensor decomposition. An advantage of CP-HiFi is that it can enforce smoothness in the infinite-dimensional modes. Further, CP-HiFi does not require the observed data to lie on a regular and finite rectangular grid and naturally incorporates misaligned data. We detail the methodology and illustrate it on a synthetic example.

Submitted 10 August, 2024; originally announced August 2024.

arXiv:2305.06927

Convergence of Alternating Gradient Descent for Matrix Factorization

Authors: Rachel Ward, Tamara G. Kolda

Abstract: We consider alternating gradient descent (AGD) with fixed step size applied to the asymmetric matrix factorization objective. We show that, for a rank-$r$ matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, $T = C \left(\frac{\sigma_1(\mathbf{A})}{\sigma_r(\mathbf{A})}\right)^2 \log(1/\epsilon)$ iterations of alternating gradient descent suffice to reach an $\epsilon$-optimal factorization $\| \mathbf{A} - \mathbf{X} \mathbf{Y}^{T} \|^2 \leq \epsilon \| \mathbf{A}\|^2$ with high probability starting from an atypical random initialization. The factors have rank $d \geq r$ so that $\mathbf{X}_{T}\in\mathbb{R}^{m \times d}$ and $\mathbf{Y}_{T} \in\mathbb{R}^{n \times d}$, and mild overparameterization suffices for the constant $C$ in the iteration complexity $T$ to be an absolute constant. Experiments suggest that our proposed initialization is not merely of theoretical benefit, but rather significantly improves the convergence rate of gradient descent in practice. Our proof is conceptually simple: a uniform Polyak-Łojasiewicz (PL) inequality and uniform Lipschitz smoothness constant are guaranteed for a sufficient number of iterations, starting from our random initialization. Our proof method should be useful for extending and simplifying convergence analyses for a broader class of nonconvex low-rank factorization problems.

Submitted 7 February, 2024; v1 submitted 11 May, 2023; originally announced May 2023.

arXiv:2202.06930

Tensor Moments of Gaussian Mixture Models: Theory and Applications

Authors: João M. Pereira, Joe Kileel, Tamara G. Kolda

Abstract: Gaussian mixture models (GMMs) are fundamental tools in statistical and data sciences. We study the moments of multivariate Gaussians and GMMs. The $d$-th moment of an $n$-dimensional random variable is a symmetric $d$-way tensor of size $n^d$, so working with moments naively is assumed to be prohibitively expensive for $d>2$ and larger values of $n$. In this work, we develop theory and numerical methods for implicit computations with moment tensors of GMMs, reducing the computational and storage costs to $\mathcal{O}(n^2)$ and $\mathcal{O}(n^3)$, respectively, for general covariance matrices, and to $\mathcal{O}(n)$ and $\mathcal{O}(n)$, respectively, for diagonal ones. We derive concise analytic expressions for the moments in terms of symmetrized tensor products, relying on the correspondence between symmetric tensors and homogeneous polynomials, and combinatorial identities involving Bell polynomials. The primary application of this theory is to estimating GMM parameters (means and covariances) from a set of observations, when formulated as a moment-matching optimization problem. If there is a known and common covariance matrix, we also show it is possible to debias the data observations, in which case the problem of estimating the unknown means reduces to symmetric CP tensor decomposition. Numerical results validate and illustrate the numerical efficiency of our approaches. This work potentially opens the door to the competitiveness of the method of moments as compared to expectation maximization methods for parameter estimation of GMMs.

Submitted 21 March, 2022; v1 submitted 14 February, 2022; originally announced February 2022.

arXiv:2201.10638

Sketching Matrix Least Squares via Leverage Scores Estimates

Authors: Brett W. Larsen, Tamara G. Kolda

Abstract: We consider the matrix least squares problem of the form $\| \mathbf{A} \mathbf{X}-\mathbf{B} \|_F^2$ where the design matrix $\mathbf{A} \in \mathbb{R}^{N \times r}$ is tall and skinny with $N \gg r$. We propose to create a sketched version $\| \tilde{\mathbf{A}}\mathbf{X}-\tilde{\mathbf{B}} \|_F^2$ where the sketched matrices $\tilde{\mathbf{A}}$ and $\tilde{\mathbf{B}}$ contain weighted subsets of the rows of $\mathbf{A}$ and $\mathbf{B}$, respectively. The subset of rows is determined via random sampling based on leverage score estimates for each row. We say that the sketched problem is $\epsilon$-accurate if its solution $\tilde{\mathbf{X}}_{\text{opt}} = \operatorname{argmin} \| \tilde{\mathbf{A}}\mathbf{X}-\tilde{\mathbf{B}} \|_F^2$ satisfies $\|\mathbf{A}\tilde{\mathbf{X}}_{\text{opt}}-\mathbf{B} \|_F^2 \leq (1+\epsilon) \min \| \mathbf{A}\mathbf{X}-\mathbf{B} \|_F^2$ with high probability. We prove that the number of samples required for an $\epsilon$-accurate solution is $O(r/(\beta\epsilon))$ where $\beta \in (0,1]$ is a measure of the quality of the leverage score estimates.

Submitted 25 January, 2022; originally announced January 2022.

Comments: This is a detailed and standalone derivation of a result that already appears in (arXiv:2006.16438, Appendix A). arXiv admin note: substantial text overlap with arXiv:2006.16438

arXiv:2110.14514

Streaming Generalized Canonical Polyadic Tensor Decompositions

Authors: Eric Phipps, Nick Johnson, Tamara G. Kolda

Abstract: In this paper, we develop a method which we call OnlineGCP for computing the Generalized Canonical Polyadic (GCP) tensor decomposition of streaming data. GCP differs from traditional canonical polyadic (CP) tensor decompositions as it allows for arbitrary objective functions which the CP model attempts to minimize. This approach can provide better fits and more interpretable models when the observed tensor data is strongly non-Gaussian. In the streaming case, tensor data is gradually observed over time and the algorithm must incrementally update a GCP factorization with limited access to prior data. In this work, we extend the GCP formalism to the streaming context by deriving a GCP optimization problem to be solved as new tensor data is observed, formulate a tunable history term to balance reconstruction of recently observed data with data observed in the past, develop a scalable solution strategy based on segregated solves using stochastic gradient descent methods, describe a software implementation that provides performance and portability to contemporary CPU and GPU architectures and integrates with Matlab for enhanced usability, and demonstrate the utility and performance of the approach and software on several synthetic and real tensor data sets.

Submitted 27 October, 2021; originally announced October 2021.

arXiv:2104.11079

doi 10.2172/1807223

Randomized Algorithms for Scientific Computing (RASC)

Authors: Aydin Buluc, Tamara G. Kolda, Stefan M. Wild, Mihai Anitescu, Anthony DeGennaro, John Jakeman, Chandrika Kamath, Ramakrishnan Kannan, Miles E. Lopes, Per-Gunnar Martinsson, Kary Myers, Jelani Nelson, Juan M. Restrepo, C. Seshadhri, Draguna Vrabie, Brendt Wohlberg, Stephen J. Wright, Chao Yang, Peter Zwart

Abstract: Randomized algorithms have propelled advances in artificial intelligence and represent a foundational research area in advancing AI for Science. Future advancements in DOE Office of Science priority areas such as climate science, astrophysics, fusion, advanced materials, combustion, and quantum computing all require randomized algorithms for surmounting challenges of complexity, robustness, and scalability. This report summarizes the outcomes of the workshop "Randomized Algorithms for Scientific Computing (RASC)," held virtually across four days in December 2020 and January 2021.

Submitted 21 March, 2022; v1 submitted 19 April, 2021; originally announced April 2021.

arXiv:1909.04801

doi 10.1093/imaiai/iaaa028

Faster Johnson-Lindenstrauss Transforms via Kronecker Products

Authors: Ruhui Jin, Tamara G. Kolda, Rachel Ward

Abstract: The Kronecker product is an important matrix operation with a wide range of applications in supporting fast linear transforms, including signal processing, graph theory, quantum computing and deep learning. In this work, we introduce a generalization of the fast Johnson-Lindenstrauss projection for embedding vectors with Kronecker product structure, the Kronecker fast Johnson-Lindenstrauss transform (KFJLT). The KFJLT reduces the embedding cost by an exponential factor of the standard fast Johnson-Lindenstrauss transform (FJLT)'s cost when applied to vectors with Kronecker structure, by avoiding explicitly forming the full Kronecker products. We prove that this computational gain comes with only a small price in embedding power: given $N = \prod_{k=1}^d n_k$, consider a finite set of $p$ points in a tensor product of $d$ constituent Euclidean spaces $\bigotimes_{k=d}^{1}\mathbb{R}^{n_k} \subset \mathbb{R}^{N}$. With high probability, a random KFJLT matrix of dimension $m \times N$ embeds the set of points up to multiplicative distortion $(1\pm \varepsilon)$ provided $m \gtrsim \varepsilon^{-2} \cdot \log^{2d - 1} (p) \cdot \log N$. We conclude by describing a direct application of the KFJLT to the efficient solution of large-scale Kronecker-structured least squares problems for fitting the CP tensor decomposition.

Submitted 30 July, 2020; v1 submitted 10 September, 2019; originally announced September 2019.

Comments: Information and Inference: A Journal of the IMA, 2020

arXiv:1906.01687

doi 10.1137/19M1266265

Stochastic Gradients for Large-Scale Tensor Decomposition

Authors: Tamara G. Kolda, David Hong

Abstract: Tensor decomposition is a well-known tool for multiway data analysis. This work proposes using stochastic gradients for efficient generalized canonical polyadic (GCP) tensor decomposition of large-scale tensors. GCP tensor decomposition is a recently proposed version of tensor decomposition that allows for a variety of loss functions such as Bernoulli loss for binary data or Huber loss for robust estimation. The stochastic gradient is formed from randomly sampled elements of the tensor and is efficient because it can be computed using the sparse matricized-tensor-times-Khatri-Rao product (MTTKRP) tensor kernel. For dense tensors, we simply use uniform sampling. For sparse tensors, we propose two types of stratified sampling that give precedence to sampling nonzeros. Numerical results demonstrate the advantages of the proposed approach and its scalability to large-scale problems.

Submitted 7 July, 2020; v1 submitted 4 June, 2019; originally announced June 2019.

Journal ref: SIAM Journal on Mathematics of Data Science, Vol. 2, No. 4, pp. 1066-1095, 2020

arXiv:1901.06043

doi 10.1145/3378445

TuckerMPI: A Parallel C++/MPI Software Package for Large-scale Data Compression via the Tucker Tensor Decomposition

Authors: Grey Ballard, Alicia Klinvex, Tamara G. Kolda

Abstract: Our goal is compression of massive-scale grid-structured data, such as the multi-terabyte output of a high-fidelity computational simulation. For such data sets, we have developed a new software package called TuckerMPI, a parallel C++/MPI software package for compressing distributed data. The approach is based on treating the data as a tensor, i.e., a multidimensional array, and computing its truncated Tucker decomposition, a higher-order analogue to the truncated singular value decomposition of a matrix. The result is a low-rank approximation of the original tensor-structured data. Compression efficiency is achieved by detecting latent global structure within the data, which we contrast to most compression methods that are focused on local structure. In this work, we describe TuckerMPI, our implementation of the truncated Tucker decomposition, including details of the data distribution and in-memory layouts, the parallel and serial implementations of the key kernels, and analysis of the storage, communication, and computational costs. We test the software on 4.5 terabyte and 6.7 terabyte data sets distributed across 100s of nodes (1000s of MPI processes), achieving compression rates between 100$\times$ and 200,000$\times$, which equates to 99-99.999% compression (depending on the desired accuracy), in substantially less time than it would take to even read the same data set from a parallel filesystem. Moreover, we show that our method also allows for reconstruction of partial or down-sampled data on a single node, without a parallel computer, so long as the reconstructed portion is small enough to fit on a single machine, e.g., in the instance of reconstructing/visualizing a single down-sampled time step or computing summary statistics.
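The link between a compression ratio and the percentage figures quoted in the abstract is simple arithmetic: a ratio of $c\times$ keeps $1/c$ of the original bytes, so the reduction is $100(1 - 1/c)$ percent. The following is a minimal sketch of that conversion; the function name `percent_reduction` is illustrative and is not part of TuckerMPI.

```python
def percent_reduction(ratio: float) -> float:
    """Percentage of storage saved at a given compression ratio (original / compressed)."""
    if ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return 100.0 * (1.0 - 1.0 / ratio)


# The two ends of the range quoted in the abstract.
for ratio in (100, 200_000):
    print(f"{ratio}x -> {percent_reduction(ratio):.4f}% smaller")
```

A ratio of 100 gives a 99% reduction and a ratio of 200,000 gives roughly 99.9995%, which matches the quoted range up to rounding.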

Submitted 21 August, 2019; v1 submitted 17 January, 2019; originally announced January 2019.

Journal ref: ACM Transactions on Mathematical Software, Vol. 46, No. 2, Article 13, June 2020

arXiv:1809.09175

doi 10.1137/18M1210691

Software for Sparse Tensor Decomposition on Emerging Computing Architectures

Authors: Eric Phipps, Tamara G. Kolda

Abstract: In this paper, we develop software for decomposing sparse tensors that is portable to and performant on a variety of multicore, manycore, and GPU computing architectures. The result is a single code whose performance matches optimized architecture-specific implementations. The key to a portable approach is to determine multiple levels of parallelism that can be mapped in different ways to different architectures, and we explain how to do this for the matricized tensor times Khatri-Rao product (MTTKRP) which is the key kernel in canonical polyadic tensor decomposition. Our implementation leverages the Kokkos framework, which enables a single code to achieve high performance across multiple architectures that differ in how they approach fine-grained parallelism. We also introduce a new construct for portable thread-local arrays, which we call compile-time polymorphic arrays. Not only are the specifics of our approaches and implementation interesting for tuning tensor computations, but they also provide a roadmap for developing other portable high-performance codes. As a last step in optimizing performance, we modify the MTTKRP algorithm itself to do a permuted traversal of tensor nonzeros to reduce atomic-write contention. We test the performance of our implementation on 16- and 68-core Intel CPUs and the K80 and P100 NVIDIA GPUs, showing that we are competitive with state-of-the-art architecture-specific codes while having the advantage of being able to run on a variety of architectures.

Submitted 21 January, 2019; v1 submitted 24 September, 2018; originally announced September 2018.

Journal ref: SIAM Journal on Scientific Computing, Vol. 41, No. 3, pp. C269-C290, 22 pages, 2019

arXiv:1808.07510

XPCA: Extending PCA for a Combination of Discrete and Continuous Variables

Authors: Clifford Anderson-Bergman, Tamara G. Kolda, Kina Kincher-Winoto

Abstract: Principal component analysis (PCA) is arguably the most popular tool in multivariate exploratory data analysis. In this paper, we consider the question of how to handle heterogeneous variables, including continuous, binary, and ordinal ones. In the probabilistic interpretation of low-rank PCA, the data has a multivariate normal distribution and, therefore, normal marginal distributions for each column. If some marginals are continuous but not normal, the semiparametric copula-based principal component analysis (COCA) method is an alternative to PCA that combines a Gaussian copula with nonparametric marginals. If some marginals are discrete or semi-continuous, we propose a new extended PCA (XPCA) method that also uses a Gaussian copula and nonparametric marginals and accounts for discrete variables in the likelihood calculation by integrating over appropriate intervals. Like PCA, the factors produced by XPCA can be used to find latent structure in data, build predictive models, and perform dimensionality reduction. We present the new model, its induced likelihood function, and a fitting algorithm which can be applied in the presence of missing data. We demonstrate how to use XPCA to produce an estimated full conditional distribution for each data point, and use this to provide estimates for missing data that are automatically range respecting. We compare the methods as applied to simulated and real-world data sets that have a mixture of discrete and continuous variables.

Submitted 22 August, 2018; originally announced August 2018.

arXiv:1808.07452

doi 10.1137/18M1203626

Generalized Canonical Polyadic Tensor Decomposition

Authors: David Hong, Tamara G. Kolda, Jed A. Duersch

Abstract: Tensor decomposition is a fundamental unsupervised machine learning method in data science, with applications including network analysis and sensor data processing. This work develops a generalized canonical polyadic (GCP) low-rank tensor decomposition that allows other loss functions besides squared error. For instance, we can use logistic loss or Kullback-Leibler divergence, enabling tensor decomposition for binary or count data. We present a variety of statistically motivated loss functions for various scenarios. We provide a generalized framework for computing gradients and handling missing data that enables the use of standard optimization methods for fitting the model. We demonstrate the flexibility of GCP on several real-world examples including interactions in a social network, neural activity in a mouse, and monthly rainfall measurements in India.
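To make the idea of swapping the elementwise loss concrete, here is a minimal sketch of an objective that sums a chosen loss over paired data and model entries. The three losses are standard textbook forms (squared error, the Bernoulli negative log-likelihood with a logit parameter, and the Poisson negative log-likelihood, which equals the Kullback-Leibler divergence up to terms that do not depend on the model). The function names and the flat-list representation are assumptions made for illustration; the paper's exact parameterizations may differ.

```python
import math


def squared_error(x, m):
    """Standard least-squares loss for real-valued data."""
    return (x - m) ** 2


def bernoulli_logit_loss(x, m):
    """Negative log-likelihood of binary x when m is the log-odds."""
    return math.log1p(math.exp(m)) - x * m


def poisson_loss(x, m):
    """Negative log-likelihood of count x with mean m > 0 (constant terms dropped)."""
    return m - x * math.log(m)


def elementwise_objective(loss, data, model):
    """Sum a chosen elementwise loss over paired data and model entries."""
    return sum(loss(x, m) for x, m in zip(data, model))


# Same objective skeleton, different loss functions.
print(elementwise_objective(squared_error, [1.0, 2.0], [1.0, 4.0]))
print(elementwise_objective(bernoulli_logit_loss, [1, 0], [0.0, 0.0]))
print(elementwise_objective(poisson_loss, [0, 3], [1.0, 3.0]))
```

Entries that are missing can simply be left out of `data` and `model`, which is one simple way to picture how a sum over observed entries accommodates missing data.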

Submitted 21 January, 2019; v1 submitted 22 August, 2018; originally announced August 2018.

Journal ref: SIAM Review, Vol. 62, No. 1, pp. 133-163, 2020

arXiv:1607.08673

doi 10.1093/comnet/cnx001

Measuring and Modeling Bipartite Graphs with Community Structure

Authors: Sinan Aksoy, Tamara G. Kolda, Ali Pinar

Abstract: Network science is a powerful tool for analyzing complex systems in fields ranging from sociology to engineering to biology. This paper is focused on generative models of large-scale bipartite graphs, also known as two-way graphs or two-mode networks. We propose two generative models that can be easily tuned to reproduce the characteristics of real-world networks, not just qualitatively, but quantitatively. The characteristics we consider are the degree distributions and the metamorphosis coefficient. The metamorphosis coefficient, a bipartite analogue of the clustering coefficient, is the proportion of length-three paths that participate in length-four cycles. Having a high metamorphosis coefficient is a necessary condition for close-knit community structure. We define edge, node, and degreewise metamorphosis coefficients, enabling a more detailed understanding of the bipartite connectivity that is not explained by degree distribution alone. Our first model, bipartite Chung-Lu (CL), is able to reproduce real-world degree distributions, and our second model, bipartite block two-level Erdős-Rényi (BTER), reproduces both the degree distributions and the degreewise metamorphosis coefficients. We demonstrate the effectiveness of these models on several real-world data sets.
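The definition of the metamorphosis coefficient given in the abstract can be checked by brute force on a small graph. The sketch below enumerates every length-three path u1-v1-u2-v2 by its middle edge and counts how many are closed into a four-cycle by the edge u1-v2. The function name and the edge-list input format are assumptions for illustration, and this exhaustive enumeration is only practical for small graphs.

```python
def metamorphosis_coefficient(edges):
    """Fraction of 3-edge paths u1-v1-u2-v2 closed into a 4-cycle by the edge u1-v2.

    `edges` is an iterable of (left, right) pairs; left and right labels are
    treated as separate node sets, as in a bipartite graph.
    """
    left_nbrs, right_nbrs = {}, {}
    for u, v in set(edges):
        left_nbrs.setdefault(u, set()).add(v)
        right_nbrs.setdefault(v, set()).add(u)

    paths = closed = 0
    # Each 3-edge path has exactly one middle edge (u2, v1), so it is counted once.
    for u2, nbrs in left_nbrs.items():
        for v1 in nbrs:
            for u1 in right_nbrs[v1] - {u2}:
                for v2 in nbrs - {v1}:
                    paths += 1
                    if v2 in left_nbrs[u1]:
                        closed += 1
    return closed / paths if paths else 0.0


# A single 4-cycle: every length-three path is closed.
square = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
# A simple path a-x-b-y: its one length-three path is open.
chain = [("a", "x"), ("b", "x"), ("b", "y")]
print(metamorphosis_coefficient(square), metamorphosis_coefficient(chain))
```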

Submitted 29 October, 2016; v1 submitted 28 July, 2016; originally announced July 2016.

Journal ref: Journal of Complex Networks, Vol. 5, No. 4, pp. 581-603, 2017

arXiv:1510.06689

doi 10.1109/IPDPS.2016.67

Parallel Tensor Compression for Large-Scale Scientific Data

Authors: Woody Austin, Grey Ballard, Tamara G. Kolda

Abstract: As parallel computing trends towards the exascale, scientific data produced by high-fidelity simulations are growing increasingly massive. For instance, a simulation on a three-dimensional spatial grid with 512 points per dimension that tracks 64 variables per grid point for 128 time steps yields 8 TB of data, assuming double precision. By viewing the data as a dense five-way tensor, we can compute a Tucker decomposition to find inherent low-dimensional multilinear structure, achieving compression ratios of up to 5000 on real-world data sets with negligible loss in accuracy. So that we can operate on such massive data, we present the first-ever distributed-memory parallel implementation for the Tucker decomposition, whose key computations correspond to parallel linear algebra operations, albeit with nonstandard data layouts. Our approach specifies a data distribution for tensors that avoids any tensor data redistribution, either locally or in parallel. We provide accompanying analysis of the computation and communication costs of the algorithms. To demonstrate the compression and accuracy of the method, we apply our approach to real-world data sets from combustion science simulations. We also provide detailed performance results, including parallel performance in both weak and strong scaling experiments.
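The 8 TB figure in the abstract follows directly from the tensor dimensions. As a quick check, the sketch below multiplies the five mode sizes by 8 bytes per double-precision value; the helper name `dense_tensor_bytes` is illustrative.

```python
import math


def dense_tensor_bytes(shape, bytes_per_value=8):
    """Storage for a dense tensor with the given mode sizes (8 bytes = double precision)."""
    return math.prod(shape) * bytes_per_value


# 512^3 spatial grid, 64 variables, 128 time steps, as in the abstract's example.
shape = (512, 512, 512, 64, 128)
total = dense_tensor_bytes(shape)
print(total)          # 8796093022208 bytes, i.e. 2**43
print(total / 2**40)  # 8.0 binary terabytes
```

The total is exactly $2^{43}$ bytes, which is 8 TB in binary units (about 8.8 TB in decimal units).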

Submitted 23 February, 2016; v1 submitted 22 October, 2015; originally announced October 2015.

Journal ref: IPDPS'16: Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium, pp. 912-922, May 2016

arXiv:1506.03872

doi 10.1109/ICDM.2015.46

Diamond Sampling for Approximate Maximum All-pairs Dot-product (MAD) Search

Authors: Grey Ballard, Ali Pinar, Tamara G. Kolda, C. Seshadhri

Abstract: Given two sets of vectors, $A = \{a_1, \dots, a_m\}$ and $B=\{b_1,\dots,b_n\}$, our problem is to find the top-$t$ dot products, i.e., the largest $|a_i\cdot b_j|$ among all possible pairs. This is a fundamental mathematical problem that appears in numerous data applications involving similarity search, link prediction, and collaborative filtering. We propose a sampling-based approach that avoids direct computation of all $mn$ dot products. We select diamonds (i.e., four-cycles) from the weighted tripartite representation of $A$ and $B$. The probability of selecting a diamond corresponding to pair $(i,j)$ is proportional to $(a_i\cdot b_j)^2$, amplifying the focus on the largest-magnitude entries. Experimental results indicate that diamond sampling is orders of magnitude faster than direct computation and requires far fewer samples than any competing approach. We also apply diamond sampling to the special case of maximum inner product search, and get significantly better results than the state-of-the-art hashing methods.

Submitted 18 June, 2015; v1 submitted 11 June, 2015; originally announced June 2015.

Journal ref: ICDM 2015: Proceedings of the 2015 IEEE International Conference on Data Mining, pp. 11-20, November 2015

arXiv:1404.5874

Using Triangles to Improve Community Detection in Directed Networks

Authors: Christine Klymko, David Gleich, Tamara G. Kolda

Abstract: In a graph, a community may be loosely defined as a group of nodes that are more closely connected to one another than to the rest of the graph. While there are a variety of metrics that can be used to specify the quality of a given community, one common theme is that flows tend to stay within communities. Hence, we expect cycles to play an important role in community detection. For undirected graphs, the importance of triangles (undirected 3-cycles) has been known for a long time and can be used to improve community detection. In directed graphs, the situation is more nuanced. The smallest cycle is simply two nodes with a reciprocal connection, and using information about reciprocation has proven to improve community detection. Our new idea is based on the four types of directed triangles that contain cycles. To identify communities in directed networks, then, we propose an undirected edge-weighting scheme based on the type of the directed triangles in which edges are involved. We also propose a new metric on quality of the communities that is based on the number of 3-cycles that are split across communities. To demonstrate the impact of our new weighting, we use the standard METIS graph partitioning tool to determine communities and show experimentally that the resulting communities lead to fewer 3-cycles being cut. The magnitude of the effect varies between a 10% and 50% reduction, and we also find evidence that this weighting scheme improves a task where plausible ground-truth communities are known.

Submitted 23 April, 2014; originally announced April 2014.

Comments: 10 pages, 3 figures

arXiv:1403.2226

Accelerating Community Detection by Using K-core Subgraphs

Authors: Chengbin Peng, Tamara G. Kolda, Ali Pinar

Abstract: Community detection is expensive, and the cost generally depends at least linearly on the number of vertices in the graph. We propose working with a reduced graph that has many fewer nodes but nonetheless captures key community structure. The K-core of a graph is the largest subgraph within which each node has at least K connections. We propose a framework that accelerates community detection by applying an expensive algorithm (modularity optimization, the Louvain method, spectral clustering, etc.) to the K-core and then using an inexpensive heuristic (such as local modularity maximization) to infer community labels for the remaining nodes. Our experiments demonstrate that the proposed framework can reduce the running time by more than 80% while preserving the quality of the solutions. Recent theoretical investigations provide support for using the K-core as a reduced representation.
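The K-core defined in the abstract can be found by the standard peeling procedure: repeatedly delete any node with fewer than K remaining neighbours until none is left. The sketch below shows only that reduction step, not the community detection or label-inference stages of the proposed framework. It assumes an undirected graph stored as a symmetric dictionary of neighbour sets, and the function name `k_core` is illustrative.

```python
from collections import deque


def k_core(adj, k):
    """Return the node set of the k-core of an undirected graph.

    `adj` maps each node to the set of its neighbours and must be symmetric.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    removed = set()
    queue = deque(v for v, d in degree.items() if d < k)
    while queue:
        v = queue.popleft()
        if v in removed:
            continue  # a node can be queued more than once
        removed.add(v)
        for w in adj[v]:
            if w not in removed:
                degree[w] -= 1
                if degree[w] < k:
                    queue.append(w)
    return set(adj) - removed


# A triangle (1, 2, 3) with a pendant node 4 attached to node 3.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(k_core(graph, 2))  # the pendant node is peeled away
print(k_core(graph, 3))  # no node keeps three neighbours
```

Each node and edge is touched a bounded number of times, so the reduction is cheap compared with the expensive algorithm that is then run on the smaller graph.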

Submitted 13 October, 2014; v1 submitted 10 March, 2014; originally announced March 2014.

Comments: 15 pages, 8 figures
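
The K-core definition in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' implementation; the function name `k_core`, the edge-list input, and the simple peeling strategy (repeatedly deleting nodes with fewer than K remaining neighbors) are assumptions made for illustration.

```python
from collections import defaultdict


def k_core(edges, k):
    """Return the node set of the k-core of an undirected simple graph.

    `edges` is an iterable of (u, v) pairs; self-loops are ignored.
    (Hypothetical helper, written only to illustrate the definition.)
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    # Start with every node whose degree is already below k.
    stack = [n for n in adj if len(adj[n]) < k]
    removed = set()
    while stack:
        n = stack.pop()
        if n in removed:
            continue
        removed.add(n)
        # Deleting n lowers the degree of each surviving neighbor,
        # which may push that neighbor below k as well.
        for m in adj[n]:
            if m not in removed:
                adj[m].discard(n)
                if len(adj[m]) < k:
                    stack.append(m)
        adj[n].clear()
    return set(adj) - removed


# A triangle (1, 2, 3) with one pendant node 4 attached to node 3.
example_edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
two_core = k_core(example_edges, 2)
three_core = k_core(example_edges, 3)
```

In this toy graph the pendant node is peeled away for K = 2, leaving the triangle, and nothing survives for K = 3.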

arXiv:1309.3321 [pdf, other]

doi 10.1002/sam.11224

Wedge Sampling for Computing Clustering Coefficients and Triangle Counts on Large Graphs

Authors: C. Seshadhri, Ali Pinar, Tamara G. Kolda

Abstract: Graphs are used to model interactions in a variety of contexts, and there is a growing need to quickly assess the structure of such graphs. Some of the most useful graph metrics are based on triangles, such as those measuring social cohesion. Algorithms to compute them can be extremely expensive, even for moderately-sized graphs with only millions of edges. Previous work has considered node and edge sampling; in contrast, we consider wedge sampling, which provides faster and more accurate approximations than competing techniques. Additionally, wedge sampling enables estimation of local clustering coefficients, degree-wise clustering coefficients, uniform triangle sampling, and directed triangle counts. Our methods come with provable and practical probabilistic error estimates for all computations. We provide extensive results that show our methods are both more accurate and faster than state-of-the-art alternatives.

Submitted 14 January, 2014; v1 submitted 12 September, 2013; originally announced September 2013.

Comments: Full version of SDM 2013 paper "Triadic Measures on Graphs: The Power of Wedge Sampling" (arxiv:1202.5230)

Journal ref: Statistical Analysis and Data Mining, Vol. 7, No. 4, pp. 294-307, August 2014
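
As a hedged illustration of the wedge-sampling idea named in the abstract above, the following Python sketch estimates the global clustering coefficient (the fraction of wedges, i.e. paths of length 2, that are closed into triangles). It is a sketch under stated assumptions rather than the paper's code: the function name, the dict-of-sets graph representation, and the choice to pick wedge centers with probability proportional to the number of wedges centered there are assumptions for illustration.

```python
import random


def wedge_sample_transitivity(adj, num_samples, seed=0):
    """Estimate the fraction of closed wedges by uniform wedge sampling.

    `adj` maps each node to the set of its neighbors (undirected, simple).
    (Hypothetical helper; names and signature are illustrative only.)
    """
    rng = random.Random(seed)
    # Only nodes with at least two neighbors are the center of a wedge.
    centers = [v for v in adj if len(adj[v]) >= 2]
    if not centers or num_samples <= 0:
        return 0.0
    # A node of degree d is the center of d * (d - 1) / 2 wedges, so
    # weighting by that count makes each wedge equally likely.
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in centers]
    sampled_centers = rng.choices(centers, weights=weights, k=num_samples)

    closed = 0
    for v in sampled_centers:
        a, b = rng.sample(list(adj[v]), 2)  # two distinct neighbors of v
        if b in adj[a]:  # the wedge a - v - b is closed by edge (a, b)
            closed += 1
    return closed / num_samples


# In a complete graph every wedge is closed; in a star none is.
complete4 = {i: {j for j in range(4) if j != i} for i in range(4)}
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
est_complete = wedge_sample_transitivity(complete4, 200)
est_star = wedge_sample_transitivity(star, 200)
```

The two toy graphs are chosen because their estimates do not depend on the random draws, which makes the sketch easy to check.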

arXiv:1303.6385 [pdf, other]

Dynamics of Trust Reciprocation in Heterogenous MMOG Networks

Authors: Ayush Singhal, Karthik Subbian, Jaideep Srivastava, Tamara G. Kolda, Ali Pinar

Abstract: Understanding the dynamics of reciprocation is of great interest in sociology and computational social science. The recent growth of Massively Multi-player Online Games (MMOGs) has provided unprecedented access to large-scale data which enables us to study such complex human behavior in a more systematic manner. In this paper, we consider three different networks in the EverQuest2 game: chat, trade, and trust. The chat network has the highest level of reciprocation (33%) because there are essentially no barriers to it. The trade network has a lower rate of reciprocation (27%) because it has the obvious barrier of requiring more goods or money for exchange; moreover, there is no clear benefit to returning a trade link except in terms of social connections. The trust network has the lowest reciprocation (14%) because this equates to sharing certain within-game assets such as weapons, and so there is a high barrier for such connections because they require faith in the players that are granted such high access. In general, we observe that reciprocation rate is inversely related to the barrier level in these networks. We also note that reciprocation has connections across the heterogeneous networks. Our experiments indicate that players make use of the medium-barrier reciprocations to strengthen a relationship. We hypothesize that lower-barrier interactions are an important component to predicting higher-barrier ones. We verify our hypothesis using predictive models for trust reciprocations using features from trade interactions. Using the number of trades (both before and after the initial trust link) boosts our ability to predict if the trust will be reciprocated up to 11% with respect to the AUC.

Submitted 18 April, 2013; v1 submitted 26 March, 2013; originally announced March 2013.

arXiv:1302.6636 [pdf, other]

doi 10.1137/130914218

A Scalable Generative Graph Model with Community Structure

Authors: Tamara G. Kolda, Ali Pinar, Todd Plantenga, C. Seshadhri

Abstract: Network data is ubiquitous and growing, yet we lack realistic generative network models that can be calibrated to match real-world data. The recently proposed Block Two-Level Erdős-Rényi (BTER) model can be tuned to capture two fundamental properties: degree distribution and clustering coefficients. The latter is particularly important for reproducing graphs with community structure, such as social networks. In this paper, we compare BTER to other scalable models and show that it gives a better fit to real data. We provide a scalable implementation that requires only O(d_max) storage where d_max is the maximum number of neighbors for a single node. The generator is trivially parallelizable, and we show results for a Hadoop MapReduce implementation for modeling a real-world web graph with over 4.6 billion edges. We propose that the BTER model can be used as a graph generator for benchmarking purposes and provide idealized degree distributions and clustering coefficient profiles that can be tuned for user specifications.

Submitted 9 December, 2013; v1 submitted 26 February, 2013; originally announced February 2013.

Journal ref: SIAM Journal on Scientific Computing, Vol. 36, No. 5, pp. C424-C452, September 2014

arXiv:1302.6220 [pdf, ps, other]

Directed closure measures for networks with reciprocity

Authors: C. Seshadhri, Ali Pinar, Nurcan Durak, Tamara G. Kolda

Abstract: The study of triangles in graphs is a standard tool in network analysis, leading to measures such as the \emph{transitivity}, i.e., the fraction of paths of length $2$ that participate in triangles. Real-world networks are often directed, and it can be difficult to "measure" this network structure meaningfully. We propose a collection of \emph{directed closure values} for measuring triangles in directed graphs in a way that is analogous to transitivity in an undirected graph. Our study of these values reveals much information about directed triadic closure. For instance, we immediately see that reciprocal edges have a high propensity to participate in triangles. We also observe striking similarities between the triadic closure patterns of different web and social networks. We perform mathematical and empirical analysis showing that directed configuration models that preserve reciprocity cannot capture the triadic closure patterns of real networks.

Submitted 23 April, 2014; v1 submitted 25 February, 2013; originally announced February 2013.

Comments: Updated version; new results on expected directed closures for reciprocal configuration model

arXiv:1301.7744 [pdf, ps, other]

doi 10.1137/130907215

Exploiting Symmetry in Tensors for High Performance: Multiplication with Symmetric Tensors

Authors: Martin D. Schatz, Tze Meng Low, Robert A. van de Geijn, Tamara G. Kolda

Abstract: Symmetric tensor operations arise in a wide variety of computations. However, the benefits of exploiting symmetry in order to reduce storage and computation are in conflict with a desire to simplify memory access patterns. In this paper, we propose a blocked data structure (Blocked Compact Symmetric Storage) wherein we consider the tensor by blocks and store only the unique blocks of a symmetric tensor. We propose an algorithm-by-blocks, already shown of benefit for matrix computations, that exploits this storage format by utilizing a series of temporary tensors to avoid redundant computation. Further, partial symmetry within temporaries is exploited to further avoid redundant storage and redundant computation. A detailed analysis shows that, relative to storing and computing with tensors without taking advantage of symmetry and partial symmetry, storage requirements are reduced by a factor of $O\left( m! \right)$ and computational requirements by a factor of $O\left( (m+1)!/2^m \right)$, where $m$ is the order of the tensor. However, as the analysis shows, care must be taken in choosing the correct block size to ensure these storage and computational benefits are achieved (particularly for low-order tensors). An implementation demonstrates that storage is greatly reduced and the complexity introduced by storing and computing with tensors by blocks is manageable. Preliminary results demonstrate that computational time is also reduced. The paper concludes with a discussion of how insights in this paper point to opportunities for generalizing recent advances in the domain of linear algebra libraries to the field of multi-linear computation.

Submitted 9 April, 2014; v1 submitted 31 January, 2013; originally announced January 2013.

MSC Class: 15-02 (Primary)

Journal ref: SIAM Journal on Scientific Computing, Vol. 36, No. 5, pp. C453-C479, September 2014

arXiv:1301.5887 [pdf, other]

doi 10.1137/13090729X

Counting Triangles in Massive Graphs with MapReduce

Authors: Tamara G. Kolda, Ali Pinar, Todd Plantenga, C. Seshadhri, Christine Task

Abstract: Graphs and networks are used to model interactions in a variety of contexts. There is a growing need to quickly assess the characteristics of a graph in order to understand its underlying structure. Some of the most useful metrics are triangle-based and give a measure of the connectedness of mutual friends. This is often summarized in terms of clustering coefficients, which measure the likelihood that two neighbors of a node are themselves connected. Computing these measures exactly for large-scale networks is prohibitively expensive in both memory and time. However, a recent wedge sampling algorithm has proved successful in efficiently and accurately estimating clustering coefficients. In this paper, we describe how to implement this approach in MapReduce to deal with massive graphs. We show results on publicly-available networks, the largest of which is 132M nodes and 4.7B edges, as well as artificially generated networks (using the Graph500 benchmark), the largest of which has 240M nodes and 8.5B edges. We can estimate the clustering coefficient by degree bin (e.g., we use exponential binning) and the number of triangles per bin, as well as the global clustering coefficient and total number of triangles, in an average of 0.33 seconds per million edges plus overhead (approximately 225 seconds total for our configuration). The technique can also be used to study triangle statistics such as the ratio of the highest and lowest degree, and we highlight differences between social and non-social networks. To the best of our knowledge, these are the largest triangle-based graph computations published to date.

Submitted 9 December, 2013; v1 submitted 24 January, 2013; originally announced January 2013.

Journal ref: SIAM Journal on Scientific Computing, Vol. 36, No. 5, pp. S44-S77, October 2014

arXiv:1210.5288 [pdf, other]

A Scalable Null Model for Directed Graphs Matching All Degree Distributions: In, Out, and Reciprocal

Authors: Nurcan Durak, Tamara G. Kolda, Ali Pinar, C. Seshadhri

Abstract: Degree distributions are arguably the most important property of real-world networks. The classic edge configuration model or Chung-Lu model can generate an undirected graph with any desired degree distribution. This serves as a good null model to compare algorithms or perform experimental studies. Furthermore, there are scalable algorithms that implement these models and they are invaluable in the study of graphs. However, networks in the real world are often directed, and have a significant proportion of reciprocal edges. A stronger relation exists between two nodes when they each point to one another (reciprocal edge) as compared to when only one points to the other (one-way edge). Despite their importance, reciprocal edges have been disregarded by most directed graph models. We propose a null model for directed graphs inspired by the Chung-Lu model that matches the in-, out-, and reciprocal-degree distributions of the real graphs. Our algorithm is scalable and requires $O(m)$ random numbers to generate a graph with $m$ edges. We perform a series of experiments on real datasets and compare with existing graph models.

Submitted 25 April, 2013; v1 submitted 18 October, 2012; originally announced October 2012.

Comments: Camera ready version for IEEE Workshop on Network Science; fixed some typos in table

Journal ref: Proceedings of IEEE 2013 2nd International Network Science Workshop (NSW 2013), pp. 22--30

arXiv:1207.7125 [pdf, other]

doi 10.1145/2396761.2398503

Degree Relations of Triangles in Real-world Networks and Models

Authors: Nurcan Durak, Ali Pinar, Tamara G. Kolda, C. Seshadhri

Abstract: Triangles are an important building block and distinguishing feature of real-world networks, but their structure is still poorly understood. Despite numerous reports on the abundance of triangles, there is very little information on what these triangles look like. We initiate the study of degree-labeled triangles -- specifically, degree homogeneity versus heterogeneity in triangles. This yields new insight into the structure of real-world graphs. We observe that networks coming from social and collaborative situations are dominated by homogeneous triangles, i.e., degrees of vertices in a triangle are quite similar to each other. On the other hand, information networks (e.g., web graphs) are dominated by heterogeneous triangles, i.e., the degrees in triangles are quite disparate. Surprisingly, nodes within the top 1% of degrees participate in the vast majority of triangles in heterogeneous graphs. We also ask whether current graph models reproduce the types of triangles that are observed in real data and show that most models fail to accurately capture these salient features.

Submitted 30 July, 2012; originally announced July 2012.

Journal ref: CIKM '12: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, ACM, pp. 1712-1716, 2012

arXiv:1202.5230 [pdf, other]

doi 10.1137/1.9781611972832.2

Triadic Measures on Graphs: The Power of Wedge Sampling

Authors: C. Seshadhri, Ali Pinar, Tamara G. Kolda

Abstract: Graphs are used to model interactions in a variety of contexts, and there is a growing need to quickly assess the structure of a graph. Some of the most useful graph metrics, especially those measuring social cohesion, are based on triangles. Despite the importance of these triadic measures, associated algorithms can be extremely expensive. We propose a new method based on wedge sampling. This versatile technique allows for the fast and accurate approximation of all current variants of clustering coefficients and enables rapid uniform sampling of the triangles of a graph. Our methods come with provable and practical time-approximation tradeoffs for all computations. We provide extensive results that show our methods are orders of magnitude faster than the state-of-the-art, while providing nearly the accuracy of full enumeration. Our results will enable more wide-scale adoption of triadic measures for analysis of extremely large graphs, as demonstrated on several real-world examples.

Submitted 18 October, 2012; v1 submitted 23 February, 2012; originally announced February 2012.

Journal ref: SDM13: Proceedings of the 2013 SIAM International Conference on Data Mining, pp. 10-18, May 2013

arXiv:1112.3644 [pdf, other]

doi 10.1103/PhysRevE.85.056109

Community structure and scale-free collections of Erdös-Rényi graphs

Authors: C. Seshadhri, Tamara G. Kolda, Ali Pinar

Abstract: Community structure plays a significant role in the analysis of social networks and similar graphs, yet this structure is little understood and not well captured by most models. We formally define a community to be a subgraph that is internally highly connected and has no deeper substructure. We use tools of combinatorics to show that any such community must contain a dense Erdös-Rényi (ER) subgraph. Based on mathematical arguments, we hypothesize that any graph with a heavy-tailed degree distribution and community structure must contain a scale free collection of dense ER subgraphs. These theoretical observations corroborate well with empirical evidence. From this, we propose the Block Two-Level Erdös-Rényi (BTER) model, and demonstrate that it accurately captures the observable properties of many real-world social networks.

Submitted 15 December, 2011; originally announced December 2011.

Journal ref: Physical Review E 85(5):056109, 2012

arXiv:1110.4925 [pdf, other]

The Similarity between Stochastic Kronecker and Chung-Lu Graph Models

Authors: Ali Pinar, C. Seshadhri, Tamara G. Kolda

Abstract: The analysis of massive graphs is now becoming a very important part of science and industrial research. This has led to the construction of a large variety of graph models, each with their own advantages. The Stochastic Kronecker Graph (SKG) model has been chosen by the Graph500 steering committee to create supercomputer benchmarks for graph algorithms. The major reasons for this are its easy parallelization and ability to mirror real data. Although SKG is easy to implement, there is little understanding of the properties and behavior of this model. We show that the parallel variant of the edge-configuration model given by Chung and Lu (referred to as CL) is notably similar to the SKG model. The graph properties of an SKG are extremely close to those of a CL graph generated with the appropriate parameters. Indeed, the final probability matrix used by SKG is almost identical to that of a CL model. This implies that the graph distribution represented by SKG is almost the same as that given by a CL model. We also show that when it comes to fitting real data, CL performs as well as SKG based on empirical studies of graph properties. CL has the added benefit of a trivially simple fitting procedure and exactly matching the degree distribution. Our results suggest that users of the SKG model should consider the CL model because of its similar properties, simpler structure, and ability to fit a wider range of degree distributions. At the very least, CL is a good control model to compare against.

Submitted 26 October, 2011; v1 submitted 21 October, 2011; originally announced October 2011.

Journal ref: SDM12: Proceedings of the Twelfth SIAM International Conference on Data Mining, pp. 1071-1082, April 2012
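
For readers unfamiliar with the edge-configuration (Chung-Lu style) model mentioned in the abstract above, the following Python sketch shows one common edge-sampling formulation: draw m edges, choosing each endpoint independently with probability proportional to its target degree. This is an assumption-laden illustration, not the procedure from the paper; the function name and input format are hypothetical, and the raw output may contain self-loops and repeated edges that a real generator would need to handle.

```python
import random


def chung_lu_edges(degrees, seed=0):
    """Sample an edge list whose expected degrees follow `degrees`.

    `degrees[i]` is the target degree of node i. Each of the m edges gets
    two endpoints drawn independently, proportional to target degree.
    (Hypothetical helper; self-loops and duplicates are NOT removed.)
    """
    rng = random.Random(seed)
    m = sum(degrees) // 2  # number of edges implied by the degree sum
    if m == 0:
        return []
    nodes = list(range(len(degrees)))
    # 2m independent endpoint draws, then pair them up into m edges.
    endpoints = rng.choices(nodes, weights=degrees, k=2 * m)
    return list(zip(endpoints[:m], endpoints[m:]))


target = [3, 2, 2, 1, 0]
sampled = chung_lu_edges(target)
touched = {node for edge in sampled for node in edge}
```

Because each edge needs only two independent draws, edges can be generated in any order or in separate batches, which is one reason this style of generator parallelizes easily.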

arXiv:1103.2068 [pdf, other]

doi 10.1109/ICDM.2011.39

COMET: A Recipe for Learning and Using Large Ensembles on Massive Data

Authors: Justin D. Basilico, M. Arthur Munson, Tamara G. Kolda, Kevin R. Dixon, W. Philip Kegelmeyer

Abstract: COMET is a single-pass MapReduce algorithm for learning on large-scale data. It builds multiple random forest ensembles on distributed blocks of data and merges them into a mega-ensemble. This approach is appropriate when learning from massive-scale data that is too large to fit on a single machine. To get the best accuracy, IVoting should be used instead of bagging to generate the training subset for each decision tree in the random forest. Experiments with two large datasets (5GB and 50GB compressed) show that COMET compares favorably (in both accuracy and training time) to learning on a subsample of data using a serial algorithm. Finally, we propose a new Gaussian approach for lazy ensemble evaluation which dynamically decides how many ensemble members to evaluate per data point; this can reduce evaluation cost by 100X or more.

Submitted 8 September, 2011; v1 submitted 10 March, 2011; originally announced March 2011.

ACM Class: I.5; I.2.6; H.2.8

Journal ref: ICDM 2011: Proceedings of the 2011 IEEE International Conference on Data Mining, pp. 41-50, 2011

arXiv:1102.5046 [pdf, other]

doi 10.1145/2450142.2450149

An In-Depth Analysis of Stochastic Kronecker Graphs

Authors: C. Seshadhri, Ali Pinar, Tamara G. Kolda

Abstract: Graph analysis is playing an increasingly important role in science and industry. Due to numerous limitations in sharing real-world graphs, models for generating massive graphs are critical for developing better algorithms. In this paper, we analyze the stochastic Kronecker graph model (SKG), which is the foundation of the Graph500 supercomputer benchmark due to its favorable properties and easy parallelization. Our goal is to provide a deeper understanding of the parameters and properties of this model so that its functionality as a benchmark is increased. We develop a rigorous mathematical analysis that shows this model cannot generate a power-law distribution or even a lognormal distribution. However, we formalize an enhanced version of the SKG model that uses random noise for smoothing. We prove both in theory and in practice that this enhancement leads to a lognormal distribution. Additionally, we provide a precise analysis of isolated vertices, showing that the graphs that are produced by SKG might be quite different than intended. For example, between 50% and 75% of the vertices in the Graph500 benchmarks will be isolated. Finally, we show that this model tends to produce extremely small core numbers (compared to most social networks and other real graphs) for common parameter choices.

Submitted 2 January, 2013; v1 submitted 24 February, 2011; originally announced February 2011.

Journal ref: Journal of the ACM 60(2):13 (32 pages), April 2013

Showing 1–30 of 30 results for author: Kolda, T G