-
Thermodynamics of the Minimum Description Length on Community Detection
Authors:
Juan Ignacio Perotti,
Claudio Juan Tessone,
Aaron Clauset,
Guido Caldarelli
Abstract:
Modern statistical modeling is an important complement to the more traditional approach of physics, where Complex Systems are studied by means of extremely simple idealized models. The Minimum Description Length (MDL) is a principled approach to statistical modeling that combines Occam's razor with Information Theory to select the models providing the most concise descriptions. In this work, we introduce the Boltzmannian MDL (BMDL), a formalization of the principle of MDL with a parametric complexity conveniently formulated as the free energy of an artificial thermodynamic system. In this way, we leverage the rich theoretical and technical background of statistical mechanics to show the crucial importance of phase transitions and other thermodynamic concepts for the problem of statistical modeling from an information-theoretic point of view. For example, we provide information-theoretic justifications for why a high-temperature series expansion can be used to compute systematic approximations of the BMDL when the formalism is used to model data, and why statistically significant model selections can be identified with ordered phases when the BMDL is used to model models. To test the introduced formalism, we compute approximations of the BMDL for the problem of community detection in complex networks, where we obtain a principled MDL derivation of the Girvan-Newman (GN) modularity and the Zhang-Moore (ZM) community detection method. Here, by means of analytical estimations and numerical experiments on synthetic and empirical networks, we find that BMDL-based correction terms of the GN modularity improve the quality of the detected communities, and we also find an information-theoretic justification for why the ZM criterion for estimating the number of network communities is better than alternative approaches such as the bare minimization of a free energy.
Submitted 18 June, 2018;
originally announced June 2018.
-
Detectability thresholds and optimal algorithms for community structure in dynamic networks
Authors:
Amir Ghasemian,
Pan Zhang,
Aaron Clauset,
Cristopher Moore,
Leto Peel
Abstract:
We study the fundamental limits on learning latent community structure in dynamic networks. Specifically, we study dynamic stochastic block models where nodes change their community membership over time, but where edges are generated independently at each time step. In this setting (which is a special case of several existing models), we are able to derive the detectability threshold exactly, as a function of the rate of change and the strength of the communities. Below this threshold, we claim that no algorithm can identify the communities better than chance. We then give two algorithms that are optimal in the sense that they succeed all the way down to this limit. The first uses belief propagation (BP), which gives asymptotically optimal accuracy, and the second is a fast spectral clustering algorithm, based on linearizing the BP equations. We verify our analytic and algorithmic results via numerical simulation, and close with a brief discussion of extensions and open questions.
Submitted 19 June, 2015;
originally announced June 2015.
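The generative setting described in the abstract above (memberships that drift over time, edges drawn independently at each time step) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `dynamic_sbm`, the parameter names, and the specific update rule (a node keeps its group with probability `eta`, otherwise it draws a new group uniformly at random) are assumptions chosen for the example.

```python
import random

def dynamic_sbm(n, k, steps, eta, p_in, p_out, seed=0):
    """Sample a sequence of graph snapshots from a simple dynamic block model.

    Assumed rule (hypothetical parameterization): at each step after the first,
    every node keeps its group with probability eta, otherwise it picks a group
    uniformly at random. Edges are redrawn independently at every step.
    """
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in range(n)]
    snapshots = []
    for t in range(steps):
        if t > 0:
            # Membership drift: stay with probability eta, else resample.
            labels = [g if rng.random() < eta else rng.randrange(k) for g in labels]
        # Independent edges: p_in within a group, p_out between groups.
        edges = [
            (u, v)
            for u in range(n)
            for v in range(u + 1, n)
            if rng.random() < (p_in if labels[u] == labels[v] else p_out)
        ]
        snapshots.append((list(labels), edges))
    return snapshots

snapshots = dynamic_sbm(n=60, k=2, steps=5, eta=0.8, p_in=0.3, p_out=0.05)
static = dynamic_sbm(n=20, k=2, steps=3, eta=1.0, p_in=0.5, p_out=0.1)
```

With `eta=1.0` the memberships never change and the model reduces to independent draws from a static block model, which is a convenient sanity check.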
-
The performance of modularity maximization in practical contexts
Authors:
Benjamin H. Good,
Yves-Alexandre de Montjoye,
Aaron Clauset
Abstract:
Although widely used in practice, the behavior and accuracy of the popular module identification technique called modularity maximization is not well understood in practical contexts. Here, we present a broad characterization of its performance in such situations. First, we revisit and clarify the resolution limit phenomenon for modularity maximization. Second, we show that the modularity function Q exhibits extreme degeneracies: it typically admits an exponential number of distinct high-scoring solutions and typically lacks a clear global maximum. Third, we derive the limiting behavior of the maximum modularity Q_max for one model of infinitely modular networks, showing that it depends strongly both on the size of the network and on the number of modules it contains. Finally, using three real-world metabolic networks as examples, we show that the degenerate solutions can fundamentally disagree on many, but not all, partition properties such as the composition of the largest modules and the distribution of module sizes. These results imply that the output of any modularity maximization procedure should be interpreted cautiously in scientific contexts. They also explain why many heuristics are often successful at finding high-scoring partitions in practice and why different heuristics can disagree on the modular structure of the same network. We conclude by discussing avenues for mitigating some of these behaviors, such as combining information from many degenerate solutions or using generative models.
Submitted 1 April, 2010; v1 submitted 1 October, 2009;
originally announced October 2009.
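For reference, the modularity function Q discussed in the abstract above can be evaluated for a given partition with a few lines of code. The sketch below assumes the standard Newman-Girvan definition for an undirected, unweighted graph, Q = sum over communities of (e_c / m) - (d_c / 2m)^2, where e_c is the number of internal edges, d_c the total degree of community c, and m the number of edges. The function name and the toy graph are illustrative only.

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman-Girvan modularity of a partition.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to a community label.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph with no edges")
    internal = defaultdict(int)    # edges with both endpoints in community c
    degree_sum = defaultdict(int)  # total degree of the nodes in community c
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )

# Toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}
q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
```

Placing all nodes in one community always gives Q = 0, so any partition worth reporting should score above that baseline. The abstract's caution still applies: many structurally different partitions can score nearly as high as the best one.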
-
Power-law distributions in empirical data
Authors:
Aaron Clauset,
Cosma Rohilla Shalizi,
M. E. J. Newman
Abstract:
Power-law distributions occur in many situations of scientific interest and have significant consequences for our understanding of natural and man-made phenomena. Unfortunately, the detection and characterization of power laws is complicated by the large fluctuations that occur in the tail of the distribution -- the part of the distribution representing large but rare events -- and by the difficulty of identifying the range over which power-law behavior holds. Commonly used methods for analyzing power-law data, such as least-squares fitting, can produce substantially inaccurate estimates of parameters for power-law distributions, and even in cases where such methods return accurate answers they are still unsatisfactory because they give no indication of whether the data obey a power law at all. Here we present a principled statistical framework for discerning and quantifying power-law behavior in empirical data. Our approach combines maximum-likelihood fitting methods with goodness-of-fit tests based on the Kolmogorov-Smirnov statistic and likelihood ratios. We evaluate the effectiveness of the approach with tests on synthetic data and give critical comparisons to previous approaches. We also apply the proposed methods to twenty-four real-world data sets from a range of different disciplines, each of which has been conjectured to follow a power-law distribution. In some cases we find these conjectures to be consistent with the data while in others the power law is ruled out.
Submitted 2 February, 2009; v1 submitted 7 June, 2007;
originally announced June 2007.
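The two ingredients named in the abstract above, maximum-likelihood fitting and a Kolmogorov-Smirnov goodness-of-fit distance, can be illustrated for the continuous case. This is a minimal sketch, not the authors' released code: it assumes the lower cutoff `xmin` is already known (selecting it is part of the full framework and is omitted here), and it uses the standard continuous estimator alpha = 1 + n / sum(ln(x_i / xmin)). Function names are hypothetical.

```python
import math
import random

def fit_alpha(data, xmin):
    """Maximum-likelihood exponent of a continuous power law for x >= xmin."""
    tail = [x for x in data if x >= xmin]
    n = len(tail)
    alpha = 1.0 + n / sum(math.log(x / xmin) for x in tail)
    return alpha, n

def ks_distance(data, xmin, alpha):
    """Largest gap between the empirical CDF and the fitted power-law CDF."""
    tail = sorted(x for x in data if x >= xmin)
    n = len(tail)
    d = 0.0
    for i, x in enumerate(tail):
        model_cdf = 1.0 - (x / xmin) ** (1.0 - alpha)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs(model_cdf - i / n), abs(model_cdf - (i + 1) / n))
    return d

# Synthetic check: draw from a power law with a known exponent by
# inverse-transform sampling, then recover the exponent.
rng = random.Random(42)
true_alpha, xmin = 2.5, 1.0
sample = [xmin * (1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0))
          for _ in range(20000)]
alpha_hat, n_tail = fit_alpha(sample, xmin)
d_ks = ks_distance(sample, xmin, alpha_hat)
```

A small KS distance alone does not establish that the data follow a power law. The abstract's framework additionally relies on goodness-of-fit tests and likelihood-ratio comparisons against alternative distributions, which this sketch leaves out.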
-
Finding local community structure in networks
Authors:
Aaron Clauset
Abstract:
Although the inference of global community structure in networks has recently become a topic of great interest in the physics community, all such algorithms require that the graph be completely known. Here, we define both a measure of local community structure and an algorithm that infers the hierarchy of communities that enclose a given vertex by exploring the graph one vertex at a time. This algorithm runs in time O(d*k^2) for general graphs, where d is the mean degree and k is the number of vertices to be explored. For graphs where exploring a new vertex is time-consuming, the running time is linear, O(k). We show that on computer-generated graphs this technique compares favorably to algorithms that require global knowledge. We also use this algorithm to extract meaningful local clustering information in the large recommender network of an online retailer and show the existence of mesoscopic structure.
Submitted 4 March, 2005;
originally announced March 2005.
-
On the Bias of Traceroute Sampling; or, Power-law Degree Distributions in Regular Graphs
Authors:
Dimitris Achlioptas,
Aaron Clauset,
David Kempe,
Cristopher Moore
Abstract:
Understanding the structure of the Internet graph is a crucial step for building accurate network models and designing efficient algorithms for Internet applications. Yet, obtaining its graph structure is a surprisingly difficult task, as edges cannot be explicitly queried. Instead, empirical studies rely on traceroutes to build what are essentially single-source, all-destinations, shortest-path trees. These trees only sample a fraction of the network's edges, and a recent paper by Lakhina et al. found empirically that the resulting sample is intrinsically biased. For instance, the observed degree distribution under traceroute sampling exhibits a power law even when the underlying degree distribution is Poisson.
In this paper, we study the bias of traceroute sampling systematically, and, for a very general class of underlying degree distributions, calculate the likely observed distributions explicitly. To do this, we use a continuous-time realization of the process of exposing the BFS tree of a random graph with a given degree distribution, calculate the expected degree distribution of the tree, and show that it is sharply concentrated. As example applications of our machinery, we show how traceroute sampling finds power-law degree distributions in both delta-regular and Poisson-distributed random graphs. Thus, our work puts the observations of Lakhina et al. on a rigorous footing, and extends them to nearly arbitrary degree distributions.
Submitted 29 March, 2006; v1 submitted 3 March, 2005;
originally announced March 2005.
-
Scale Invariance in Global Terrorism
Authors:
Aaron Clauset,
Maxwell Young
Abstract:
Traditional analyses of international terrorism have not sought to explain the emergence of rare but extremely severe events. Using the tools of extremal statistics to analyze the set of terrorist attacks worldwide between 1968 and 2004, as compiled by the National Memorial Institute for the Prevention of Terrorism (MIPT), we find that the relationship between the frequency and severity of terrorist attacks exhibits the "scale-free" property with an exponent close to two. This property is robust, even when we restrict our analysis to events from a single type of weapon or events within major industrialized nations. We also find that the distribution of event sizes has changed very little over the past 37 years, suggesting that scale invariance is an inherent feature of global terrorism.
Submitted 1 May, 2005; v1 submitted 3 February, 2005;
originally announced February 2005.
-
Accuracy and Scaling Phenomena in Internet Mapping
Authors:
Aaron Clauset,
Cristopher Moore
Abstract:
A great deal of effort has been spent measuring topological features of the Internet. However, it was recently argued that sampling based on taking paths or traceroutes through the network from a small number of sources introduces a fundamental bias in the observed degree distribution. We examine this bias analytically and experimentally. For Erdos-Renyi random graphs with mean degree c, we show analytically that traceroute sampling gives an observed degree distribution P(k) ~ 1/k for k < c, even though the underlying degree distribution is Poisson. For graphs whose degree distributions have power-law tails P(k) ~ k^-alpha, traceroute sampling from a small number of sources can significantly underestimate the value of alpha when the graph has a large excess (i.e., many more edges than vertices). We find that in order to obtain a good estimate of alpha it is necessary to use a number of sources which grows linearly in the average degree of the underlying graph. Based on these observations we comment on the accuracy of the published values of alpha for the Internet.
Submitted 4 October, 2004;
originally announced October 2004.
-
Finding community structure in very large networks
Authors:
Aaron Clauset,
M. E. J. Newman,
Cristopher Moore
Abstract:
The discovery and analysis of community structure in networks is a topic of considerable recent interest within the physics community, but most methods proposed so far are unsuitable for very large networks because of their computational cost. Here we present a hierarchical agglomeration algorithm for detecting community structure which is faster than many competing algorithms: its running time on a network with n vertices and m edges is O(m d log n) where d is the depth of the dendrogram describing the community structure. Many real-world networks are sparse and hierarchical, with m ~ n and d ~ log n, in which case our algorithm runs in essentially linear time, O(n log^2 n). As an example of the application of this algorithm we use it to analyze a network of items for sale on the web-site of a large online retailer, items in the network being linked if they are frequently purchased by the same buyer. The network has more than 400,000 vertices and 2 million edges. We show that our algorithm can extract meaningful communities from this network, revealing large-scale patterns present in the purchasing habits of customers.
Submitted 30 August, 2004; v1 submitted 9 August, 2004;
originally announced August 2004.
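The step from the general running time to the "essentially linear" bound quoted in the abstract above is a direct substitution of the two stated scalings, m ~ n (sparse) and d ~ log n (hierarchical):

```latex
\[
  T(n) = O(m \, d \, \log n), \qquad m \sim n, \quad d \sim \log n
  \;\Longrightarrow\;
  T(n) = O\bigl(n \cdot \log n \cdot \log n\bigr) = O\bigl(n \log^2 n\bigr).
\]
```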
-
Why Mapping the Internet is Hard
Authors:
Aaron Clauset,
Cristopher Moore
Abstract:
Despite great effort spent measuring topological features of large networks like the Internet, it was recently argued that sampling based on taking paths through the network (e.g., traceroutes) introduces a fundamental bias in the observed degree distribution. We examine this bias analytically and experimentally. For classic random graphs with mean degree c, we show analytically that traceroute sampling gives an observed degree distribution P(k) ~ 1/k for k < c, even though the underlying degree distribution is Poisson. For graphs whose degree distributions have power-law tails P(k) ~ k^-alpha, the accuracy of traceroute sampling is highly sensitive to the population of low-degree vertices. In particular, when the graph has a large excess (i.e., many more edges than vertices), traceroute sampling can significantly misestimate alpha.
Submitted 13 July, 2004;
originally announced July 2004.
-
Traceroute sampling makes random graphs appear to have power law degree distributions
Authors:
Aaron Clauset,
Cristopher Moore
Abstract:
The topology of the Internet has typically been measured by sampling traceroutes, which are roughly shortest paths from sources to destinations. The resulting measurements have been used to infer that the Internet's degree distribution is scale-free; however, many of these measurements have relied on sampling traceroutes from a small number of sources. It was recently argued that sampling in this way can introduce a fundamental bias in the degree distribution, for instance, causing random (Erdos-Renyi) graphs to appear to have power law degree distributions. We explain this phenomenon analytically using differential equations to model the growth of a breadth-first tree in a random graph G(n,p=c/n) of average degree c, and show that sampling from a single source gives an apparent power law degree distribution P(k) ~ 1/k for k < c.
Submitted 8 February, 2004; v1 submitted 29 December, 2003;
originally announced December 2003.
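The sampling experiment described in the abstract above can be mimicked numerically: draw a random graph G(n, p = c/n), build a breadth-first tree from a single source as a stand-in for traceroute paths, and tabulate the degrees observed in the tree rather than in the full graph. The sketch below is illustrative only. The function names and the values of n and c are assumptions, and the quadratic-time graph generator is chosen for brevity rather than speed.

```python
import random
from collections import Counter, deque

def gnp(n, p, rng):
    """Adjacency lists of an Erdos-Renyi random graph G(n, p)."""
    adj = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    return adj

def bfs_tree_degrees(adj, source):
    """Degree of each reached node within the breadth-first tree from source."""
    degree = Counter({source: 0})
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                # (u, v) is a tree edge: v is first discovered via u.
                seen.add(v)
                degree[u] += 1
                degree[v] += 1
                queue.append(v)
    return degree

rng = random.Random(1)
n, c = 1500, 10
adj = gnp(n, c / n, rng)
tree_degree = bfs_tree_degrees(adj, source=0)
# Histogram: how many reached nodes have each observed (tree) degree.
observed_hist = Counter(tree_degree.values())
true_hist = Counter(len(neighbors) for neighbors in adj)
```

Comparing `observed_hist` with `true_hist` on log-log axes, over several sources and larger n, is one way to look for the apparent 1/k behavior for k < c that the abstract derives analytically. A single small run like this one is too noisy to confirm it.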
-
How Do Networks Become Navigable?
Authors:
Aaron Clauset,
Cristopher Moore
Abstract:
Networks created and maintained by social processes, such as the human friendship network and the World Wide Web, appear to exhibit the property of navigability: namely, not only do short paths exist between any pair of nodes, but such paths can easily be found using only local information. It has been shown that for networks with an underlying metric, algorithms using only local information perform extremely well if there is a power-law distribution of link lengths. However, it is not clear why or how real networks might develop this distribution. In this paper we define a decentralized "rewiring" process, inspired by surfers on the Web, in which each surfer attempts to travel from their home page to a random destination, and updates the outgoing link from their home page if this journey takes too long. We show that this process does indeed cause the link length distribution to converge to a power law, achieving a routing time of O(log^2 n) on networks of size n. We also study finite-size effects on the optimal exponent, and show that it converges polylogarithmically slowly as the lattice size goes to infinity.
Submitted 13 October, 2003; v1 submitted 17 September, 2003;
originally announced September 2003.