-
Fast algorithms to improve fair information access in networks
Authors:
Dennis Robert Windham,
Caroline J. Wendt,
Alex Crane,
Madelyn J Warr,
Freda Shi,
Sorelle A. Friedler,
Blair D. Sullivan,
Aaron Clauset
Abstract:
We consider the problem of selecting $k$ seed nodes in a network to maximize the minimum probability of activation under an independent cascade beginning at these seeds. The motivation is to promote fairness by ensuring that even the least advantaged members of the network have good access to information. Our problem can be viewed as a variant of the classic influence maximization objective, but it appears somewhat more difficult to solve: only heuristics are known. Moreover, the scalability of these methods is sharply constrained by the need to repeatedly estimate access probabilities.
We design and evaluate a suite of $10$ new scalable algorithms which crucially do not require probability estimation. To facilitate comparison with the state-of-the-art, we make three more contributions which may be of broader interest. We introduce a principled method of selecting a pairwise information transmission parameter used in experimental evaluations, as well as a new performance metric which allows for comparison of algorithms across a range of values for the parameter $k$. Finally, we provide a new benchmark corpus of $174$ networks drawn from $6$ domains. Our algorithms retain most of the performance of the state-of-the-art while reducing running time by orders of magnitude. Specifically, a meta-learner approach is on average only $20\%$ less effective than the state-of-the-art on held-out data, but about $75$ to $130$ times faster. Further, the meta-learner's performance exceeds the state-of-the-art on about $20\%$ of networks, and the magnitude of its running time advantage is maintained on much larger networks.
Submitted 19 February, 2025; v1 submitted 4 September, 2024;
originally announced September 2024.
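Illustrative note on the entry above: the objective it describes, the minimum over nodes of the probability of activation under an independent cascade, can be estimated by plain Monte Carlo simulation. The sketch below is a minimal illustration and not the authors' code. The function names, the adjacency-dictionary input, the single uniform transmission probability `p`, and the trial count are all assumptions made for this example. It shows the kind of repeated probability estimation that the abstract says limits the scalability of existing methods.

```python
import random
from collections import deque


def simulate_cascade(adj, seeds, p, rng):
    """Run one independent cascade and return the set of activated nodes."""
    active = set(seeds)
    frontier = deque(seeds)
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            # A newly active node gets one chance to activate each inactive neighbor.
            if v not in active and rng.random() < p:
                active.add(v)
                frontier.append(v)
    return active


def min_access_probability(adj, seeds, p, trials=2000, seed=0):
    """Estimate the minimum activation probability over all nodes (hypothetical helper)."""
    rng = random.Random(seed)
    counts = {u: 0 for u in adj}
    for _ in range(trials):
        for u in simulate_cascade(adj, seeds, p, rng):
            counts[u] += 1
    # The fairness objective looks at the worst-off node, not the average.
    return min(c / trials for c in counts.values())


# Small undirected path graph 0 - 1 - 2, used only as a toy input.
path = {0: [1], 1: [0, 2], 2: [1]}
estimate = min_access_probability(path, seeds=[0], p=0.5)
```

On this toy path with seed node 0, the worst-off node is node 2, which is reached only when both transmissions succeed, so the estimate should be near 0.25 for `p = 0.5`.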
-
Link Prediction Accuracy on Real-World Networks Under Non-Uniform Missing Edge Patterns
Authors:
Xie He,
Amir Ghasemian,
Eun Lee,
Alice Schwarze,
Aaron Clauset,
Peter J. Mucha
Abstract:
Real-world network datasets are typically obtained in ways that fail to capture all edges. The patterns of missing data are often non-uniform as they reflect biases and other shortcomings of different data collection methods. Nevertheless, uniform missing data is a common assumption made when no additional information is available about the underlying missing-edge pattern, and link prediction methods are frequently tested against uniformly missing edges. To investigate the impact of different missing-edge patterns on link prediction accuracy, we employ 9 link prediction algorithms from 4 different families to analyze 20 different missing-edge patterns that we categorize into 5 groups. Our comparative simulation study, spanning 250 real-world network datasets from 6 different domains, provides a detailed picture of the significant variations in the performance of different link prediction algorithms in these different settings. With this study, we aim to provide a guide for future researchers to help them select a link prediction algorithm that is well suited to their sampled network data, considering the data collection process and application domain.
Submitted 30 April, 2024; v1 submitted 26 January, 2024;
originally announced January 2024.
-
Scientific productivity as a random walk
Authors:
Sam Zhang,
Nicholas LaBerge,
Samuel F. Way,
Daniel B. Larremore,
Aaron Clauset
Abstract:
The expectation that scientific productivity follows regular patterns over a career underpins many scholarly evaluations, including hiring, promotion and tenure, awards, and grant funding. However, recent studies of individual productivity patterns reveal a puzzle: on the one hand, the average number of papers published per year robustly follows the "canonical trajectory" of a rapid rise to an early peak followed by a gradual decline, but on the other hand, only about 20% of individual productivity trajectories follow this pattern. We resolve this puzzle by modeling scientific productivity as a parameterized random walk, showing that the canonical pattern can be explained as a decrease in the variance in changes to productivity in the early-to-mid career. By empirically characterizing the variable structure of 2,085 productivity trajectories of computer science faculty at 205 PhD-granting institutions, spanning 29,119 publications over 1980--2016, we (i) discover remarkably simple patterns in both early-career and year-to-year changes to productivity, and (ii) show that a random walk model of productivity both reproduces the canonical trajectory in the average productivity and captures much of the diversity of individual-level trajectories. These results highlight the fundamental role of a panoply of contingent factors in shaping individual scientific productivity, opening up new avenues for characterizing how systemic incentives and opportunities can be directed for aggregate effect.
Submitted 13 March, 2025; v1 submitted 8 September, 2023;
originally announced September 2023.
-
An Open-Source Cultural Consensus Approach to Name-Based Gender Classification
Authors:
Ian Van Buskirk,
Aaron Clauset,
Daniel B. Larremore
Abstract:
Name-based gender classification has enabled hundreds of otherwise infeasible scientific studies of gender. Yet, the lack of standardization, proliferation of ad hoc methods, reliance on paid services, understudied limitations, and conceptual debates cast a shadow over many applications. To address these problems we develop and evaluate an ensemble-based open-source method built on publicly available data of empirical name-gender associations. Our method integrates 36 distinct sources, spanning over 150 countries and more than a century, via a meta-learning algorithm inspired by Cultural Consensus Theory (CCT). We also construct a taxonomy with which names themselves can be classified. We find that our method's performance is competitive with paid services and that our method, and others, approach the upper limits of performance; we show that conditioning estimates on additional metadata (e.g. cultural context), further combining methods, or collecting additional name-gender association data is unlikely to meaningfully improve performance. This work definitively shows that name-based gender classification can be a reliable part of scientific research and provides a pair of tools, a classification method and a taxonomy of names, that realize this potential.
Submitted 2 August, 2022;
originally announced August 2022.
-
Labor advantages drive the greater productivity of faculty at elite universities
Authors:
Sam Zhang,
K. Hunter Wapman,
Daniel B. Larremore,
Aaron Clauset
Abstract:
Faculty at prestigious institutions dominate scientific discourse, with the small proportion of researchers at elite universities producing a disproportionate share of all research publications. Environmental prestige is known to drive such epistemic disparity, but the mechanisms by which it causes increased faculty productivity remain unknown. Here we combine employment, publication, and federal survey data for 78,802 tenure-track faculty at 262 PhD-granting institutions in the American university system from 2008 to 2017 to show through multiple lines of evidence that the greater availability of funded graduate and postdoctoral labor at more prestigious institutions drives the environmental effect of prestige on productivity. In particular, we show that greater environmental prestige leads to larger faculty-led research groups, which drive higher faculty productivity, primarily in disciplines with research group collaboration norms. In contrast, we show that productivity does not increase substantially with prestige for either faculty papers published without group members or group members themselves. The disproportionate scientific productivity of elite researchers is thus largely explained by their substantial labor advantage, indicating a more limited role for prestige itself in predicting scientific contributions.
Submitted 12 April, 2022;
originally announced April 2022.
-
Subfield prestige and gender inequality in computing
Authors:
Nicholas LaBerge,
K. Hunter Wapman,
Allison C. Morgan,
Sam Zhang,
Daniel B. Larremore,
Aaron Clauset
Abstract:
Women and people of color remain dramatically underrepresented among computing faculty, and improvements in demographic diversity are slow and uneven. Effective diversification strategies depend on quantifying the correlates, causes, and trends of diversity in the field. But field-level demographic changes are driven by subfield hiring dynamics because faculty searches are typically at the subfield level. Here, we quantify and forecast variations in the demographic composition of the subfields of computing using a comprehensive database of training and employment records for 6882 tenure-track faculty from 269 PhD-granting computing departments in the United States, linked with 327,969 publications. We find that subfield prestige correlates with gender inequality, such that faculty working in computing subfields with more women tend to hold positions at less prestigious institutions. In contrast, we find no significant evidence of racial or socioeconomic differences by subfield. Tracking representation over time, we find steady progress toward gender equality in all subfields, but more prestigious subfields tend to be roughly 25 years behind the less prestigious subfields in gender representation. These results illustrate how the choice of subfield in a faculty search can shape a department's gender diversity.
Submitted 9 May, 2022; v1 submitted 1 January, 2022;
originally announced January 2022.
-
Sampling random graphs with specified degree sequences
Authors:
Upasana Dutta,
Bailey K. Fosdick,
Aaron Clauset
Abstract:
The configuration model is a standard tool for uniformly generating random graphs with a specified degree sequence, and is often used as a null model to evaluate how much of an observed network's structure can be explained by its degree structure alone. A Markov chain Monte Carlo (MCMC) algorithm, based on a degree-preserving double-edge swap, provides an asymptotic solution to sample from the configuration model. However, accurately and efficiently detecting this Markov chain's convergence on its stationary distribution remains an unsolved problem. Here, we provide a solution to detect convergence and sample from the configuration model. We develop an algorithm, based on the assortativity of the sampled graphs, for estimating the gap between effectively independent MCMC states, and a computationally efficient gap-estimation heuristic derived from analyzing a corpus of 509 empirical networks. We provide a convergence detection method based on the Dickey-Fuller Generalized Least Squares test, which we show is more accurate and efficient than three alternative Markov chain convergence tests.
Submitted 29 May, 2023; v1 submitted 25 May, 2021;
originally announced May 2021.
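Illustrative note on the entry above: the degree-preserving double-edge swap it mentions can be sketched in a few lines. This is a minimal illustration for simple undirected graphs (no self-loops, no multi-edges) and is not the authors' implementation. The function names, the edge-list input, and the fixed number of swap attempts are assumptions made for this example. The sketch covers only the swap move itself, not the convergence detection or the gap estimation that the abstract describes.

```python
import random
from collections import Counter


def degree_sequence(edges):
    """Return a node -> degree mapping for an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def double_edge_swap(edges, n_swaps, seed=0):
    """Attempt n_swaps degree-preserving swaps on a simple undirected graph."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        # Pick one of the two possible rewirings with equal probability.
        if rng.random() < 0.5:
            c, d = d, c
        # Proposal: replace (a, b) and (c, d) with (a, d) and (c, b).
        if a == d or c == b:
            continue  # would create a self-loop; the chain stays in place
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in edge_set or new2 in edge_set:
            continue  # would create a multi-edge; the chain stays in place
        edge_set.discard(frozenset((a, b)))
        edge_set.discard(frozenset((c, d)))
        edge_set.update((new1, new2))
        edge_list[i] = (a, d)
        edge_list[j] = (c, b)
    return edge_list


# Toy input: a 5-cycle plus one chord.
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]
shuffled = double_edge_swap(toy_edges, n_swaps=200, seed=1)
```

Rejected proposals leave the graph unchanged rather than being redrawn, so every attempt counts as one step of the chain. Every node keeps its degree after any number of attempts.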
-
The Dynamics of Faculty Hiring Networks
Authors:
Eun Lee,
Aaron Clauset,
Daniel B. Larremore
Abstract:
Faculty hiring networks (who hires whose graduates as faculty) exhibit steep hierarchies, which can reinforce both social and epistemic inequalities in academia. Understanding the mechanisms driving these patterns would inform efforts to diversify the academy and shed new light on the role of hiring in shaping which scientific discoveries are made. Here, we investigate the degree to which structural mechanisms can explain hierarchy and other network characteristics observed in empirical faculty hiring networks. We study a family of adaptive rewiring network models, which reinforce institutional prestige within the hierarchy in five distinct ways. Each mechanism determines the probability that a new hire comes from a particular institution according to that institution's prestige score, which is inferred from the hiring network's existing structure. We find that structural inequalities and centrality patterns in real hiring networks are best reproduced by a mechanism of global placement power, in which a new hire is drawn from a particular institution in proportion to the number of previously drawn hires anywhere. On the other hand, network measures of biased visibility are better recapitulated by a mechanism of local placement power, in which a new hire is drawn from a particular institution in proportion to the number of its previous hires already present at the hiring institution. These contrasting results suggest that the underlying structural mechanism reinforcing hierarchies in faculty hiring networks is a mixture of global and local preference for institutional prestige. Under these dynamics, we show that each institution's position in the hierarchy is remarkably stable, due to a dynamic competition that overwhelmingly favors more prestigious institutions.
Submitted 6 May, 2021;
originally announced May 2021.
-
Examining the consumption of radical content on YouTube
Authors:
Homa Hosseinmardi,
Amir Ghasemian,
Aaron Clauset,
Markus Mobius,
David M. Rothschild,
Duncan J. Watts
Abstract:
Although it is under-studied relative to other social media platforms, YouTube is arguably the largest and most engaging online media consumption platform in the world. Recently, YouTube's scale has fueled concerns that YouTube users are being radicalized via a combination of biased recommendations and ostensibly apolitical anti-woke channels, both of which have been claimed to direct attention to radical political content. Here we test this hypothesis using a representative panel of more than 300,000 Americans and their individual-level browsing behavior, on and off YouTube, from January 2016 through December 2019. Using a labeled set of political news channels, we find that news consumption on YouTube is dominated by mainstream and largely centrist sources. Consumers of far-right content, while more engaged than average, represent a small and stable percentage of news consumers. However, consumption of anti-woke content, defined in terms of its opposition to progressive intellectual and political agendas, grew steadily in popularity and is correlated with consumption of far-right content off-platform. We find no evidence that engagement with far-right content is caused by YouTube recommendations systematically, nor do we find clear evidence that anti-woke channels serve as a gateway to the far right. Rather, consumption of political content on YouTube appears to reflect individual preferences that extend across the web as a whole.
Submitted 14 February, 2022; v1 submitted 25 November, 2020;
originally announced November 2020.
-
Stacking Models for Nearly Optimal Link Prediction in Complex Networks
Authors:
Amir Ghasemian,
Homa Hosseinmardi,
Aram Galstyan,
Edoardo M. Airoldi,
Aaron Clauset
Abstract:
Most real-world networks are incompletely observed. Algorithms that can accurately predict which links are missing can dramatically speed up the collection of network data and improve the validity of network models. Many algorithms now exist for predicting missing links, given a partially observed network, but it has remained unknown whether a single best predictor exists, how link predictability varies across methods and networks from different domains, and how close to optimality current methods are. We answer these questions by systematically evaluating 203 individual link predictor algorithms, representing three popular families of methods, applied to a large corpus of 548 structurally diverse networks from six scientific domains. We first show that individual algorithms exhibit a broad diversity of prediction errors, such that no one predictor or family is best, or worst, across all realistic inputs. We then exploit this diversity via meta-learning to construct a series of "stacked" models that combine predictors into a single algorithm. Applied to a broad range of synthetic networks, for which we may analytically calculate optimal performance, these stacked models achieve optimal or nearly optimal levels of accuracy. Applied to real-world networks, stacked models are also superior, but their accuracy varies strongly by domain, suggesting that link prediction may be fundamentally easier in social networks than in biological or technological networks. These results indicate that the state-of-the-art for link prediction comes from combining individual algorithms, which achieves nearly optimal predictions. We close with a brief discussion of limitations and opportunities for further improvement of these results.
Submitted 17 September, 2019;
originally announced September 2019.
-
Environmental Changes and the Dynamics of Musical Identity
Authors:
Samuel F. Way,
Santiago Gil,
Ian Anderson,
Aaron Clauset
Abstract:
Musical tastes reflect our unique values and experiences, our relationships with others, and the places where we live. But as each of these things changes, do our tastes also change to reflect the present, or remain fixed, reflecting our past? Here, we investigate how where a person lives shapes their musical preferences, using geographic relocation to construct quasi-natural experiments that measure short- and long-term effects. Analyzing comprehensive data on over 16 million users on Spotify, we show that relocation within the United States has only a small impact on individuals' tastes, which remain more similar to those of their past environments. We then show that the age gap between a person and the music they consume indicates that adolescence, and likely their environment during these years, shapes their lifelong musical tastes. Our results demonstrate the robustness of individuals' musical identity, and shed new light on the development of preferences.
Submitted 9 April, 2019;
originally announced April 2019.
-
Predicting the outcomes of policy diffusion from U.S. states to federal law
Authors:
Nora Connor,
Aaron Clauset
Abstract:
In the United States, national policies often begin as state laws, which then spread from state to state until they gain momentum to become enacted as a national policy. However, not every state policy reaches the national level. Previous work has suggested that state-level policies are more likely to become national policies depending on their geographic origin, their category of legislation, or some characteristic of their initiating states, such as wealth, urbanicity, or ideological liberalism. Here, we tested these hypotheses by divorcing the set of traits from the states' identities and building predictive forecasting models of state policies becoming national policies. Using a large, longitudinal data set of state-level policies and their traits, we train models to predict (i) whether policies become national policy, and (ii) how many states must pass a given policy before it becomes national. Using these models as components, we then develop a logistic growth model to forecast when a currently spreading state-level policy is likely to pass at the national level. Our results indicate that traits of initiating states are not systematically correlated with becoming national policy and they predict neither how many states must enact a policy before it becomes national nor whether it ultimately becomes a national law. In contrast, the cumulative number of state-level adoptions of a policy is reasonably predictive of when a policy becomes national. For the policies of same-sex marriage and methamphetamine precursor laws, we investigate how well the logistic growth model could forecast the probable time horizon for true national action. We close with a data-driven forecast of when marijuana legalization and "stand your ground" laws will become national policy.
Submitted 21 October, 2018;
originally announced October 2018.
-
Thermodynamics of the Minimum Description Length on Community Detection
Authors:
Juan Ignacio Perotti,
Claudio Juan Tessone,
Aaron Clauset,
Guido Caldarelli
Abstract:
Modern statistical modeling is an important complement to the more traditional approach of physics where Complex Systems are studied by means of extremely simple idealized models. The Minimum Description Length (MDL) is a principled approach to statistical modeling combining Occam's razor with Information Theory for the selection of models providing the most concise descriptions. In this work, we introduce the Boltzmannian MDL (BMDL), a formalization of the principle of MDL with a parametric complexity conveniently formulated as the free-energy of an artificial thermodynamic system. In this way, we leverage the rich theoretical and technical background of statistical mechanics to show the crucial importance that phase transitions and other thermodynamic concepts have on the problem of statistical modeling from an information theoretic point of view. For example, we provide information theoretic justifications of why a high-temperature series expansion can be used to compute systematic approximations of the BMDL when the formalism is used to model data, and why statistically significant model selections can be identified with ordered phases when the BMDL is used to model models. To test the introduced formalism, we compute approximations of BMDL for the problem of community detection in complex networks, where we obtain a principled MDL derivation of the Girvan-Newman (GN) modularity and the Zhang-Moore (ZM) community detection method. Here, by means of analytical estimations and numerical experiments on synthetic and empirical networks, we find that BMDL-based correction terms of the GN modularity improve the quality of the detected communities and we also find an information theoretic justification of why the ZM criterion for estimation of the number of network communities is better than alternative approaches such as the bare minimization of a free energy.
Submitted 18 June, 2018;
originally announced June 2018.
-
Prestige drives epistemic inequality in the diffusion of scientific ideas
Authors:
Allison C. Morgan,
Dimitrios J. Economou,
Samuel F. Way,
Aaron Clauset
Abstract:
The spread of ideas in the scientific community is often viewed as a competition, in which good ideas spread further because of greater intrinsic fitness, and publication venue and citation counts correlate with importance and impact. However, relatively little is known about how structural factors influence the spread of ideas, and specifically how where an idea originates might influence how it spreads. Here, we investigate the role of faculty hiring networks, which embody the set of researcher transitions from doctoral to faculty institutions, in shaping the spread of ideas in computer science, and the importance of where in the network an idea originates. We consider comprehensive data on the hiring events of 5032 faculty at all 205 Ph.D.-granting departments of computer science in the U.S. and Canada, and on the timing and titles of 200,476 associated publications. Analyzing five popular research topics, we show empirically that faculty hiring can and does facilitate the spread of ideas in science. Having established such a mechanism, we then analyze its potential consequences using epidemic models to simulate the generic spread of research ideas and quantify the impact of where an idea originates on its long-term diffusion across the network. We find that research from prestigious institutions spreads more quickly and completely than work of similar quality originating from less prestigious institutions. Our analyses establish the theoretical trade-offs between university prestige and the quality of ideas necessary for efficient circulation. Our results establish faculty hiring as an underlying mechanism that drives the persistent epistemic advantage observed for elite institutions, and provide a theoretical lower bound for the impact of structural inequality in shaping the spread of ideas in science.
Submitted 22 October, 2018; v1 submitted 24 May, 2018;
originally announced May 2018.
-
Automatically assembling a full census of an academic field
Authors:
Allison C. Morgan,
Samuel F. Way,
Aaron Clauset
Abstract:
The composition of the scientific workforce shapes the direction of scientific research, directly through the selection of questions to investigate, and indirectly through its influence on the training of future scientists. In most fields, however, complete census information is difficult to obtain, complicating efforts to study workforce dynamics and the effects of policy. This is particularly true in computer science, which lacks a single, all-encompassing directory or professional organization. A full census of computer science would serve many purposes, not the least of which is a better understanding of the trends and causes of unequal representation in computing. Previous academic census efforts have relied on narrow or biased samples, or on professional society membership rolls. A full census can be constructed directly from online departmental faculty directories, but doing so by hand is prohibitively expensive and time-consuming. Here, we introduce a topical web crawler for automating the collection of faculty information from web-based department rosters, and demonstrate the resulting system on the 205 PhD-granting computer science departments in the U.S. and Canada. This method constructs a complete census of the field within a few minutes, and achieves over 99% precision and recall. We conclude by comparing the resulting 2017 census to a hand-curated 2011 census to quantify turnover and retention in computer science, in general and for female faculty in particular, demonstrating the types of analysis made possible by automated census construction.
Submitted 26 April, 2018; v1 submitted 8 April, 2018;
originally announced April 2018.
-
Evaluating Overfit and Underfit in Models of Network Community Structure
Authors:
Amir Ghasemian,
Homa Hosseinmardi,
Aaron Clauset
Abstract:
A common data mining task on networks is community detection, which seeks an unsupervised decomposition of a network into structural groups based on statistical regularities in the network's connectivity. Although many methods exist, the No Free Lunch theorem for community detection implies that each makes some kind of tradeoff, and no algorithm can be optimal on all inputs. Thus, different algorithms will over or underfit on different inputs, finding more, fewer, or just different communities than is optimal, and evaluation methods that use a metadata partition as a ground truth will produce misleading conclusions about general accuracy. Here, we present a broad evaluation of over and underfitting in community detection, comparing the behavior of 16 state-of-the-art community detection algorithms on a novel and structurally diverse corpus of 406 real-world networks. We find that (i) algorithms vary widely both in the number of communities they find and in their corresponding composition, given the same input, (ii) algorithms can be clustered into distinct high-level groups based on similarities of their outputs on real-world networks, and (iii) these differences induce wide variation in accuracy on link prediction and link description tasks. We introduce a new diagnostic for evaluating overfitting and underfitting in practice, and use it to roughly divide community detection methods into general and specialized learning algorithms. Across methods and inputs, Bayesian techniques based on the stochastic block model and a minimum description length approach to regularization represent the best general learning approach, but can be outperformed under specific circumstances. These results introduce both a theoretically principled approach to evaluate over and underfitting in models of network community structure and a realistic benchmark by which new methods may be evaluated and compared.
Submitted 16 April, 2019; v1 submitted 28 February, 2018;
originally announced February 2018.
-
Scale-free networks are rare
Authors:
Anna D. Broido,
Aaron Clauset
Abstract:
A central claim in modern network science is that real-world networks are typically "scale free," meaning that the fraction of nodes with degree $k$ follows a power law, decaying like $k^{-\alpha}$, often with $2 < \alpha < 3$. However, empirical evidence for this belief derives from a relatively small number of real-world networks. We test the universality of scale-free structure by applying state-of-the-art statistical tools to a large corpus of nearly 1000 network data sets drawn from social, biological, technological, and informational sources. We fit the power-law model to each degree distribution, test its statistical plausibility, and compare it via a likelihood ratio test to alternative, non-scale-free models, e.g., the log-normal. Across domains, we find that scale-free networks are rare, with only 4% exhibiting the strongest-possible evidence of scale-free structure and 52% exhibiting the weakest-possible evidence. Furthermore, evidence of scale-free structure is not uniformly distributed across sources: social networks are at best weakly scale free, while a handful of technological and biological networks can be called strongly scale free. These results undermine the universality of scale-free networks and reveal that real-world networks exhibit a rich structural diversity that will likely require new ideas and mechanisms to explain.
Submitted 8 January, 2018;
originally announced January 2018.
-
Characterizing the structural diversity of complex networks across domains
Authors:
Kansuke Ikehara,
Aaron Clauset
Abstract:
The structure of complex networks has been of interest in many scientific and engineering disciplines for decades. A number of studies in the field have focused on finding common properties among different kinds of networks, such as heavy-tailed degree distributions, small-worldness, and modular structure, and have tried to establish a theory of structural universality in complex networks. However, there is no comprehensive study of network structure across a diverse set of domains that explains the structural diversity we observe in real-world networks. In this paper, we study 986 real-world networks from diverse domains, ranging from ecological food webs to online social networks, along with 575 networks generated from four popular network models. Our study uses machine learning techniques, such as random forests and confusion matrices, to show the relationships among network domains in terms of network structure. Our results indicate that there are some partitions of network categories in which networks are hard to distinguish based purely on network structure. We find that these partitions of network categories tend to have similar underlying functions, constraints, and/or generative mechanisms, even though networks in the same partition have different origins, e.g., biological processes or human engineering. This suggests that the origin of a network, whether biological, technological, or social, may not be a decisive factor in the formation of similar network structure. Our findings shed light on possible directions along which we could uncover the hidden principles behind the structural diversity of complex networks.
Submitted 30 October, 2017;
originally announced October 2017.
-
The misleading narrative of the canonical faculty productivity trajectory
Authors:
Samuel F. Way,
Allison C. Morgan,
Aaron Clauset,
Daniel B. Larremore
Abstract:
A scientist may publish tens or hundreds of papers over a career, but these contributions are not evenly spaced in time. Sixty years of studies on career productivity patterns in a variety of fields suggest an intuitive and universal pattern: productivity tends to rise rapidly to an early peak and then gradually declines. Here, we test the universality of this conventional narrative by analyzing the structures of individual faculty productivity time series, constructed from over 200,000 publications and matched with hiring data for 2453 tenure-track faculty in all 205 Ph.D.-granting computer science departments in the U.S. and Canada. Unlike prior studies, which considered only some faculty or some institutions, or lacked common career reference points, here we combine a large bibliographic dataset with comprehensive information on career transitions that covers an entire field of study. We show that the conventional narrative confidently describes only one fifth of faculty, regardless of department prestige or researcher gender, and the remaining four fifths of faculty exhibit a rich diversity of productivity patterns. To explain this diversity, we introduce a simple model of productivity trajectories, and explore correlations between its parameters and researcher covariates, showing that departmental prestige predicts overall individual productivity and the timing of the transition from first- to last-author publications. These results demonstrate the unpredictability of productivity over time, and open the door for new efforts to understand how environmental and individual factors shape scientific productivity.
Submitted 17 October, 2017; v1 submitted 24 December, 2016;
originally announced December 2016.
-
The ground truth about metadata and community detection in networks
Authors:
Leto Peel,
Daniel B. Larremore,
Aaron Clauset
Abstract:
Across many scientific domains, there is a common need to automatically extract a simplified view or coarse-graining of how a complex system's components interact. This general task is called community detection in networks and is analogous to searching for clusters in independent vector data. It is common to evaluate the performance of community detection algorithms by their ability to find so-called "ground truth" communities. This works well in synthetic networks with planted communities because such networks' links are formed explicitly based on those known communities. However, there are no planted communities in real-world networks. Instead, it is standard practice to treat some observed discrete-valued node attributes, or metadata, as ground truth. Here, we show that metadata are not the same as ground truth, and that treating them as such induces severe theoretical and practical problems. We prove that no algorithm can uniquely solve community detection, and we prove a general No Free Lunch theorem for community detection, which implies that there can be no algorithm that is optimal for all possible community detection tasks. However, community detection remains a powerful tool and node metadata still have value, so a careful exploration of their relationship with network structure can yield insights of genuine worth. We illustrate this point by introducing two statistical techniques that can quantify the relationship between metadata and community structure for a broad class of models. We demonstrate these techniques using both synthetic and real-world networks, and for multiple types of metadata and community structure.
Submitted 3 May, 2017; v1 submitted 20 August, 2016;
originally announced August 2016.
-
Gender, Productivity, and Prestige in Computer Science Faculty Hiring Networks
Authors:
Samuel F. Way,
Daniel B. Larremore,
Aaron Clauset
Abstract:
Women are dramatically underrepresented in computer science at all levels in academia and account for just 15% of tenure-track faculty. Understanding the causes of this gender imbalance would inform both policies intended to rectify it and employment decisions by departments and individuals. Progress in this direction, however, is complicated by the complexity and decentralized nature of faculty hiring and the non-independence of hires. Using comprehensive data on both hiring outcomes and scholarly productivity for 2659 tenure-track faculty across 205 Ph.D.-granting departments in North America, we investigate the multi-dimensional nature of gender inequality in computer science faculty hiring through a network model of the hiring process. Overall, we find that hiring outcomes are most directly affected by (i) the relative prestige between hiring and placing institutions and (ii) the scholarly productivity of the candidates. After including these and other features, the addition of gender did not significantly reduce modeling error. However, gender differences do exist, e.g., in scholarly productivity, postdoctoral training rates, and in career movements up the rankings of universities, suggesting that the effects of gender are indirectly incorporated into hiring decisions through gender's covariates. Furthermore, we find evidence that more highly ranked departments recruit female faculty at higher than expected rates, which appears to inhibit similar efforts by lower ranked departments. These findings illustrate the subtle nature of gender inequality in faculty hiring networks and provide new insights into the underrepresentation of women in computer science.
Submitted 2 February, 2016;
originally announced February 2016.
-
Structure and inference in annotated networks
Authors:
M. E. J. Newman,
Aaron Clauset
Abstract:
For many networks of scientific interest we know both the connections of the network and information about the network nodes, such as the age or gender of individuals in a social network, geographic location of nodes in the Internet, or cellular function of nodes in a gene regulatory network. Here we demonstrate how this "metadata" can be used to improve our analysis and understanding of network structure. We focus in particular on the problem of community detection in networks and develop a mathematically principled approach that combines a network and its metadata to detect communities more accurately than can be done with either alone. Crucially, the method does not assume that the metadata are correlated with the communities we are trying to find. Instead the method learns whether a correlation exists and correctly uses or ignores the metadata depending on whether they contain useful information. The learned correlations are also of interest in their own right, allowing us to make predictions about the community membership of nodes whose network connections are unknown. We demonstrate our method on synthetic networks with known structure and on real-world networks, large and small, drawn from social, biological, and technological domains.
Submitted 14 July, 2015;
originally announced July 2015.
-
Eigenvector-Based Centrality Measures for Temporal Networks
Authors:
Dane Taylor,
Sean A. Myers,
Aaron Clauset,
Mason A. Porter,
Peter J. Mucha
Abstract:
Numerous centrality measures have been developed to quantify the importances of nodes in time-independent networks, and many of them can be expressed as the leading eigenvector of some matrix. With the increasing availability of network data that changes in time, it is important to extend such eigenvector-based centrality measures to time-dependent networks. In this paper, we introduce a principled generalization of network centrality measures that is valid for any eigenvector-based centrality. We consider a temporal network with N nodes as a sequence of T layers that describe the network during different time windows, and we couple centrality matrices for the layers into a supra-centrality matrix of size NTxNT whose dominant eigenvector gives the centrality of each node i at each time t. We refer to this eigenvector and its components as a joint centrality, as it reflects the importances of both the node i and the time layer t. We also introduce the concepts of marginal and conditional centralities, which facilitate the study of centrality trajectories over time. We find that the strength of coupling between layers is important for determining multiscale properties of centrality, such as localization phenomena and the time scale of centrality changes. In the strong-coupling regime, we derive expressions for time-averaged centralities, which are given by the zeroth-order terms of a singular perturbation expansion. We also study first-order terms to obtain first-order-mover scores, which concisely describe the magnitude of nodes' centrality changes over time. As examples, we apply our method to three empirical temporal networks: the United States Ph.D. exchange in mathematics, costarring relationships among top-billed actors during the Golden Age of Hollywood, and citations of decisions from the United States Supreme Court.
Submitted 21 September, 2016; v1 submitted 5 July, 2015;
originally announced July 2015.
-
Detectability thresholds and optimal algorithms for community structure in dynamic networks
Authors:
Amir Ghasemian,
Pan Zhang,
Aaron Clauset,
Cristopher Moore,
Leto Peel
Abstract:
We study the fundamental limits on learning latent community structure in dynamic networks. Specifically, we study dynamic stochastic block models where nodes change their community membership over time, but where edges are generated independently at each time step. In this setting (which is a special case of several existing models), we are able to derive the detectability threshold exactly, as a function of the rate of change and the strength of the communities. Below this threshold, we claim that no algorithm can identify the communities better than chance. We then give two algorithms that are optimal in the sense that they succeed all the way down to this limit. The first uses belief propagation (BP), which gives asymptotically optimal accuracy, and the second is a fast spectral clustering algorithm, based on linearizing the BP equations. We verify our analytic and algorithmic results via numerical simulation, and close with a brief discussion of extensions and open questions.
Submitted 19 June, 2015;
originally announced June 2015.
-
Predicting sports scoring dynamics with restoration and anti-persistence
Authors:
Leto Peel,
Aaron Clauset
Abstract:
Professional team sports provide an excellent domain for studying the dynamics of social competitions. These games are constructed with simple, well-defined rules and payoffs that admit a high-dimensional set of possible actions and nontrivial scoring dynamics. The resulting gameplay and efforts to predict its evolution are the object of great interest to both sports professionals and enthusiasts. In this paper, we consider two online prediction problems for team sports: given a partially observed game, who will score next, and ultimately, who will win? We present novel interpretable generative models of within-game scoring that allow for dependence on lead size (restoration) and on the last team to score (anti-persistence). We then apply these models to comprehensive within-game scoring data for four sports leagues over a ten-year period. By assessing these models' relative goodness-of-fit, we shed new light on the underlying mechanisms driving the observed scoring dynamics of each sport. Furthermore, in both predictive tasks, our models consistently outperform baseline models, and they make quantitative assessments of the latent team skill over time.
Submitted 22 April, 2015;
originally announced April 2015.
-
Assembling thefacebook: Using heterogeneity to understand online social network assembly
Authors:
Abigail Z. Jacobs,
Samuel F. Way,
Johan Ugander,
Aaron Clauset
Abstract:
Online social networks represent a popular and diverse class of social media systems. Despite this variety, each of these systems undergoes a general process of online social network assembly, which represents the complicated and heterogeneous changes that transform newly born systems into mature platforms. However, little is known about this process. For example, how much of a network's assembly is driven by simple growth? How does a network's structure change as it matures? How does network structure vary with adoption rates and user heterogeneity, and do these properties play different roles at different points in the assembly? We investigate these and other questions using a unique dataset of online connections among the roughly one million users at the first 100 colleges admitted to Facebook, captured just 20 months after its launch. We first show that different vintages and adoption rates across this population of networks reveal temporal dynamics of the assembly process, and that assembly is only loosely related to network growth. We then exploit natural experiments embedded in this dataset and complementary data obtained via Internet archaeology to show that different subnetworks matured at different rates toward similar end states. These results shed light on the processes and patterns of online social network assembly, and may facilitate more effective design for online social systems.
Submitted 31 May, 2015; v1 submitted 23 March, 2015;
originally announced March 2015.
-
A unified view of generative models for networks: models, methods, opportunities, and challenges
Authors:
Abigail Z. Jacobs,
Aaron Clauset
Abstract:
Research on probabilistic models of networks now spans a wide variety of fields, including physics, sociology, biology, statistics, and machine learning. These efforts have produced a diverse ecology of models and methods. Despite this diversity, many of these models share a common underlying structure: pairwise interactions (edges) are generated with probability conditional on latent vertex attributes. Differences between models generally stem from different philosophical choices about how to learn from data or different empirically-motivated goals. The highly interdisciplinary nature of work on these generative models, however, has inhibited the development of a unified view of their similarities and differences. For instance, novel theoretical models and optimization techniques developed in machine learning are largely unknown within the social and biological sciences, which have instead emphasized model interpretability. Here, we describe a unified view of generative models for networks that draws together many of these disparate threads and highlights the fundamental similarities and differences that span these fields. We then describe a number of opportunities and challenges for future work that are revealed by this view.
Submitted 14 November, 2014;
originally announced November 2014.
-
Learning Latent Block Structure in Weighted Networks
Authors:
Christopher Aicher,
Abigail Z. Jacobs,
Aaron Clauset
Abstract:
Community detection is an important task in network analysis, in which we aim to learn a network partition that groups together vertices with similar community-level connectivity patterns. By finding such groups of vertices with similar structural roles, we extract a compact representation of the network's large-scale structure, which can facilitate its scientific interpretation and the prediction of unknown or future interactions. Popular approaches, including the stochastic block model, assume edges are unweighted, which limits their utility by throwing away potentially useful information. We introduce the "weighted stochastic block model" (WSBM), which generalizes the stochastic block model to networks with edge weights drawn from any exponential family distribution. This model learns from both the presence and weight of edges, allowing it to discover structure that would otherwise be hidden when weights are discarded or thresholded. We describe a Bayesian variational algorithm for efficiently approximating this model's posterior distribution over latent block structures. We then evaluate the WSBM's performance on both edge-existence and edge-weight prediction tasks for a set of real-world weighted networks. In all cases, the WSBM performs as well as or better than the best alternatives on these tasks.
Submitted 3 June, 2014; v1 submitted 1 April, 2014;
originally announced April 2014.
-
Efficiently inferring community structure in bipartite networks
Authors:
Daniel B. Larremore,
Aaron Clauset,
Abigail Z. Jacobs
Abstract:
Bipartite networks are a common type of network data in which there are two types of vertices, and only vertices of different types can be connected. While bipartite networks exhibit community structure like their unipartite counterparts, existing approaches to bipartite community detection have drawbacks, including implicit parameter choices, loss of information through one-mode projections, and lack of interpretability. Here we solve the community detection problem for bipartite networks by formulating a bipartite stochastic block model, which explicitly includes vertex type information and may be trivially extended to $k$-partite networks. This bipartite stochastic block model yields a projection-free and statistically principled method for community detection that makes clear assumptions and parameter choices and yields interpretable results. We demonstrate this model's ability to efficiently and accurately find community structure in synthetic bipartite networks with known structure and in real-world bipartite networks with unknown structure, and we characterize its performance in practical contexts.
Submitted 10 July, 2014; v1 submitted 12 March, 2014;
originally announced March 2014.
-
Detecting change points in the large-scale structure of evolving networks
Authors:
Leto Peel,
Aaron Clauset
Abstract:
Interactions among people or objects are often dynamic in nature and can be represented as a sequence of networks, each providing a snapshot of the interactions over a brief period of time. An important task in analyzing such evolving networks is change-point detection, in which we both identify the times at which the large-scale pattern of interactions changes fundamentally and quantify how large and what kind of change occurred. Here, we formalize for the first time the network change-point detection problem within an online probabilistic learning framework and introduce a method that can reliably solve it. This method combines a generalized hierarchical random graph model with a Bayesian hypothesis test to quantitatively determine if, when, and precisely how a change point has occurred. We analyze the detectability of our method using synthetic data with known change points of different types and magnitudes, and show that this method is more accurate than several previously used alternatives. Applied to two high-resolution evolving social networks, this method identifies a sequence of change points that align with known external "shocks" to these networks.
Submitted 14 November, 2014; v1 submitted 4 March, 2014;
originally announced March 2014.
-
Scoring dynamics across professional team sports: tempo, balance and predictability
Authors:
Sears Merritt,
Aaron Clauset
Abstract:
Despite growing interest in quantifying and modeling the scoring dynamics within professional sports games, relatively little is known about what patterns or principles, if any, cut across different sports. Using a comprehensive data set of scoring events in nearly a dozen consecutive seasons of college and professional (American) football, professional hockey, and professional basketball, we identify several common patterns in scoring dynamics. Across these sports, scoring tempo (when scoring events occur) closely follows a common Poisson process, with a sport-specific rate. Similarly, scoring balance (how often a team wins an event) follows a common Bernoulli process, with a parameter that effectively varies with the size of the lead. Combining these processes within a generative model of gameplay, we find they both reproduce the observed dynamics in all four sports and accurately predict game outcomes. These results demonstrate common dynamical patterns underlying within-game scoring dynamics across professional team sports, and suggest specific mechanisms for driving them. We close with a brief discussion of the implications of our results for several popular hypotheses about sports dynamics.
Submitted 20 March, 2014; v1 submitted 16 October, 2013;
originally announced October 2013.
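The two ingredients named in this abstract, a Poisson process for tempo and a Bernoulli process for balance, can be combined in a few lines. The following is a minimal sketch under stated assumptions, not the authors' model: `simulate_game`, `rate`, `duration`, and `win_prob` are hypothetical names, every event is worth one point, and the lead-dependent win probability is supplied by the caller.

```python
import random

def simulate_game(rate, duration, win_prob, rng):
    """Simulate one game as a list of (event_time, lead_after_event).

    Assumptions (illustrative only): events arrive as a Poisson process with
    `rate` events per unit time, each event is worth one point, and team A
    wins an event with probability win_prob(current_lead).
    """
    t, lead, events = 0.0, 0, []
    while True:
        # Exponential gaps between events are equivalent to Poisson arrivals.
        t += rng.expovariate(rate)
        if t >= duration:
            break
        # Bernoulli trial decides which team wins this scoring event.
        lead += 1 if rng.random() < win_prob(lead) else -1
        events.append((t, lead))
    return events

if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical balance rule: the leading team is slightly less likely to score.
    restoring = lambda lead: min(0.9, max(0.1, 0.5 - 0.01 * lead))
    game = simulate_game(rate=0.05, duration=2880.0, win_prob=restoring, rng=rng)
    print(len(game), "events; final lead:", game[-1][1] if game else 0)
```

Passing a constant `win_prob` gives a plain random walk in the lead, while a rule that depends on `lead` mimics the abstract's remark that the Bernoulli parameter varies with the size of the lead.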
-
Social Network Dynamics in a Massive Online Game: Network Turnover, Non-densification, and Team Engagement in Halo Reach
Authors:
Sears Merritt,
Aaron Clauset
Abstract:
Online multiplayer games are a popular form of social interaction, used by hundreds of millions of individuals. However, little is known about the social networks within these online games, or how they evolve over time. Understanding human social dynamics within massive online games can shed new light on social interactions in general and inform the development of more engaging systems. Here, we study a novel, large friendship network, inferred from nearly 18 billion social interactions over 44 weeks between 17 million individuals in the popular online game Halo: Reach. This network is one of the largest, most detailed temporal interaction networks studied to date, and provides a novel perspective on the dynamics of online friendship networks, as opposed to mere interaction graphs. Initially, this network exhibits strong structural turnover and decays rapidly from a peak size. In the following period, however, both network size and turnover stabilize, producing a dynamic structural equilibrium. In contrast to other studies, we find that the Halo friendship network is non-densifying: both the mean degree and the average pairwise distance are stable, suggesting that densification cannot occur when maintaining friendships is costly. Finally, players with greater long-term engagement exhibit stronger local clustering, suggesting a group-level social engagement process. These results demonstrate the utility of online games for studying social networks, shed new light on empirical temporal graph patterns, and clarify the claims of universality of network densification.
Submitted 18 June, 2013;
originally announced June 2013.
-
Adapting the Stochastic Block Model to Edge-Weighted Networks
Authors:
Christopher Aicher,
Abigail Z. Jacobs,
Aaron Clauset
Abstract:
We generalize the stochastic block model to the important case in which edges are annotated with weights drawn from an exponential family distribution. This generalization introduces several technical difficulties for model estimation, which we solve using a Bayesian approach. We introduce a variational algorithm that efficiently approximates the model's posterior distribution for dense graphs. In specific numerical experiments on edge-weighted networks, this weighted stochastic block model outperforms the common approach of first applying a single threshold to all weights and then applying the classic stochastic block model, which can obscure latent block structure in networks. This model will enable the recovery of latent structure in a broader range of network data than was previously possible.
Submitted 24 May, 2013;
originally announced May 2013.
-
Environmental structure and competitive scoring advantages in team competitions
Authors:
Sears Merritt,
Aaron Clauset
Abstract:
In most professional sports, the structure of the environment is kept neutral so that scoring imbalances may be attributed to differences in team skill. It thus remains unknown what impact structural heterogeneities can have on scoring dynamics and producing competitive advantages. Applying a generative model of scoring dynamics to roughly 10 million team competitions drawn from an online game, we quantify the relationship between a competition's structure and its scoring dynamics. Despite wide structural variations, we find the same three-phase pattern in the tempo of events observed in many sports. Tempo and balance are highly predictable from a competition's structural features alone, and teams exploit environmental heterogeneities for sustained competitive advantage. The most balanced competitions are associated with specific environmental heterogeneities, not with equally skilled teams. These results shed new light on the principles of balanced competition, and illustrate the potential of online game data for investigating social dynamics and competition.
Submitted 3 April, 2013;
originally announced April 2013.
-
Detecting Friendship Within Dynamic Online Interaction Networks
Authors:
Sears Merritt,
Abigail Z. Jacobs,
Winter Mason,
Aaron Clauset
Abstract:
In many complex social systems, the timing and frequency of interactions between individuals are observable but friendship ties are hidden. Recovering these hidden ties, particularly for casual users who are relatively less active, would enable a wide variety of friendship-aware applications in domains where labeled data are often unavailable, including online advertising and national security. Here, we investigate the accuracy of multiple statistical features, based either purely on temporal interaction patterns or on the cooperative nature of the interactions, for automatically extracting latent social ties. Using self-reported friendship and non-friendship labels derived from an anonymous online survey, we learn highly accurate predictors for recovering hidden friendships within a massive online data set encompassing 18 billion interactions among 17 million individuals of the popular online game Halo: Reach. We find that the accuracy of many features improves as more data accumulates, and cooperative features are generally reliable. However, periodicities in interaction time series are sufficient to correctly classify 95% of ties, even for casual users. These results clarify the nature of friendship in online social environments and suggest new opportunities and new privacy concerns for friendship-aware applications that do not require the disclosure of private friendship information.
Submitted 25 March, 2013;
originally announced March 2013.
-
Persistence and periodicity in a dynamic proximity network
Authors:
Aaron Clauset,
Nathan Eagle
Abstract:
The topology of social networks can be understood as being inherently dynamic, with edges having a distinct position in time. Most characterizations of dynamic networks discretize time by converting temporal information into a sequence of network "snapshots" for further analysis. Here we study a highly resolved data set of a dynamic proximity network of 66 individuals. We show that the topology of this network evolves over a very broad distribution of time scales, that its behavior is characterized by strong periodicities driven by external calendar cycles, and that the conversion of inherently continuous-time data into a sequence of snapshots can produce highly biased estimates of network structure. We suggest that dynamic social networks exhibit a natural time scale $\Delta_{\mathrm{nat}}$, and that the best conversion of such dynamic data to a discrete sequence of networks is done at this natural rate.
Submitted 30 November, 2012;
originally announced November 2012.
-
Estimating the historical and future probabilities of large terrorist events
Authors:
Aaron Clauset,
Ryan Woodard
Abstract:
Quantities with right-skewed distributions are ubiquitous in complex social systems, including political conflict, economics and social networks, and these systems sometimes produce extremely large events. For instance, the 9/11 terrorist events produced nearly 3000 fatalities, nearly six times more than the next largest event. But, was this enormous loss of life statistically unlikely given modern terrorism's historical record? Accurately estimating the probability of such an event is complicated by the large fluctuations in the empirical distribution's upper tail. We present a generic statistical algorithm for making such estimates, which combines semi-parametric models of tail behavior and a nonparametric bootstrap. Applied to a global database of terrorist events, we estimate the worldwide historical probability of observing at least one 9/11-sized or larger event since 1968 to be 11-35%. These results are robust to conditioning on global variations in economic development, domestic versus international events, the type of weapon used and a truncated history that stops at 1998. We then use this procedure to make a data-driven statistical forecast of at least one similar event over the next decade.
Submitted 8 January, 2014; v1 submitted 1 September, 2012;
originally announced September 2012.
-
Friends FTW! Friendship, Collaboration and Competition in Halo: Reach
Authors:
Winter Mason,
Aaron Clauset
Abstract:
How important are friendships in determining success by individuals and teams in complex collaborative environments? By combining a novel data set containing the dynamics of millions of ad hoc teams from the popular multiplayer online first person shooter Halo: Reach with survey data on player demographics, play style, psychometrics and friendships derived from an anonymous online survey, we investigate the impact of friendship on collaborative and competitive performance. In addition to finding significant differences in player behavior across these variables, we find that friendships exert a strong influence, leading to both improved individual and team performance--even after controlling for the overall expertise of the team--and increased pro-social behaviors. Players also structure their in-game activities around social opportunities, and as a result hidden friendship ties can be accurately inferred directly from behavioral time series. Virtual environments that enable such friendship effects will thus likely see improved collaboration and competition.
Submitted 25 February, 2013; v1 submitted 10 March, 2012;
originally announced March 2012.
-
Adapting to Non-stationarity with Growing Expert Ensembles
Authors:
Cosma Rohilla Shalizi,
Abigail Z. Jacobs,
Kristina Lisa Klinkner,
Aaron Clauset
Abstract:
When dealing with time series with complex non-stationarities, low retrospective regret on individual realizations is a more appropriate goal than low prospective risk in expectation. Online learning algorithms provide powerful guarantees of this form, and have often been proposed for use with non-stationary processes because of their ability to switch between different forecasters or "experts". However, existing methods assume that the set of experts whose forecasts are to be combined are all given at the start, which is not plausible when dealing with a genuinely historical or evolutionary system. We show how to modify the "fixed shares" algorithm for tracking the best expert to cope with a steadily growing set of experts, obtained by fitting new models to new data as it becomes available, and obtain regret bounds for the growing ensemble.
Submitted 28 June, 2011; v1 submitted 4 March, 2011;
originally announced March 2011.
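For readers unfamiliar with the "fixed shares" idea mentioned in this abstract, the sketch below shows one common uniform-sharing formulation of the update, followed by a purely hypothetical rule for admitting a new expert. The abstract does not say how newcomers are weighted, so `add_expert`, `new_mass`, `eta`, and `alpha` are assumptions for illustration and should not be read as the paper's method.

```python
import math

def fixed_share_step(weights, losses, eta, alpha):
    """One round of a uniform-sharing Fixed Share update.

    1. Exponential-weights loss update with learning rate `eta`.
    2. Normalize, then redistribute a fraction `alpha` of the mass uniformly,
       so that no expert's weight can collapse to zero.
    """
    updated = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(updated)
    updated = [w / total for w in updated]
    n = len(updated)
    return [(1 - alpha) * w + alpha / n for w in updated]

def add_expert(weights, new_mass):
    """Hypothetical growth rule: the newcomer gets `new_mass`, others are rescaled."""
    return [w * (1 - new_mass) for w in weights] + [new_mass]

def combine(weights, predictions):
    """Weighted-average forecast of the current ensemble."""
    return sum(w * p for w, p in zip(weights, predictions))

if __name__ == "__main__":
    w = [0.5, 0.5]
    w = fixed_share_step(w, losses=[0.1, 0.9], eta=1.0, alpha=0.05)
    w = add_expert(w, new_mass=0.1)   # a newly fitted model joins the pool
    print(w, combine(w, [1.0, 2.0, 3.0]))
```

The sharing step is what lets the ensemble switch quickly to a different expert after a regime change; the growth rule only needs to keep the weights a valid probability vector.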
-
Structural Inference of Hierarchies in Networks
Authors:
Aaron Clauset,
Cristopher Moore,
M. E. J. Newman
Abstract:
One property of networks that has received comparatively little attention is hierarchy, i.e., the property of having vertices that cluster together in groups, which then join to form groups of groups, and so forth, up through all levels of organization in the network. Here, we give a precise definition of hierarchical structure, give a generic model for generating arbitrary hierarchical structure in a random graph, and describe a statistically principled way to learn the set of hierarchical features that most plausibly explain a particular real-world network. By applying this approach to two example networks, we demonstrate its advantages for the interpretation of network data, the annotation of graphs with edge, vertex and community properties, and the generation of generic null models for further hypothesis testing.
Submitted 9 October, 2006;
originally announced October 2006.
-
On the Bias of Traceroute Sampling; or, Power-law Degree Distributions in Regular Graphs
Authors:
Dimitris Achlioptas,
Aaron Clauset,
David Kempe,
Cristopher Moore
Abstract:
Understanding the structure of the Internet graph is a crucial step for building accurate network models and designing efficient algorithms for Internet applications. Yet, obtaining its graph structure is a surprisingly difficult task, as edges cannot be explicitly queried. Instead, empirical studies rely on traceroutes to build what are essentially single-source, all-destinations, shortest-path trees. These trees only sample a fraction of the network's edges, and a recent paper by Lakhina et al. found empirically that the resulting sample is intrinsically biased. For instance, the observed degree distribution under traceroute sampling exhibits a power law even when the underlying degree distribution is Poisson.
In this paper, we study the bias of traceroute sampling systematically, and, for a very general class of underlying degree distributions, calculate the likely observed distributions explicitly. To do this, we use a continuous-time realization of the process of exposing the BFS tree of a random graph with a given degree distribution, calculate the expected degree distribution of the tree, and show that it is sharply concentrated. As example applications of our machinery, we show how traceroute sampling finds power-law degree distributions in both delta-regular and Poisson-distributed random graphs. Thus, our work puts the observations of Lakhina et al. on a rigorous footing, and extends them to nearly arbitrary degree distributions.
Submitted 29 March, 2006; v1 submitted 3 March, 2005;
originally announced March 2005.
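The sampling process described in this abstract, a single-source shortest-path tree standing in for a set of traceroutes, is easy to reproduce on a toy graph. The sketch below is an illustration under stated assumptions rather than the paper's analysis: `random_graph` draws a simple G(n, m) random graph as a stand-in for a Poisson-degree network, and `bfs_tree_degrees` records each node's degree inside the BFS tree, which is what a traceroute-style sample would observe. Both function names and all parameters are hypothetical.

```python
import random
from collections import Counter, deque

def random_graph(n, mean_degree, rng):
    """Simple random graph with n nodes and n * mean_degree / 2 distinct edges."""
    m = int(n * mean_degree / 2)
    adj = {v: set() for v in range(n)}
    edges = 0
    while edges < m:
        u, v = rng.randrange(n), rng.randrange(n)
        if u != v and v not in adj[u]:   # no self-loops, no duplicate edges
            adj[u].add(v)
            adj[v].add(u)
            edges += 1
    return adj

def bfs_tree_degrees(adj, source):
    """Degree of every reached node within the BFS tree rooted at `source`."""
    tree_deg = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in tree_deg:
                tree_deg[v] = 1      # the edge back to its parent
                tree_deg[u] += 1     # the parent gains one child
                queue.append(v)
    return tree_deg

if __name__ == "__main__":
    rng = random.Random(7)
    adj = random_graph(n=5000, mean_degree=10, rng=rng)
    observed = bfs_tree_degrees(adj, source=0)
    true_hist = Counter(len(adj[v]) for v in observed)
    seen_hist = Counter(observed.values())
    for k in sorted(seen_hist)[:10]:
        print(k, true_hist.get(k, 0), seen_hist[k])
```

Comparing the two histograms shows how many edges the tree misses; an observed degree can never exceed the true degree, so the sampled distribution is shifted toward small values.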
-
Accuracy and Scaling Phenomena in Internet Mapping
Authors:
Aaron Clauset,
Cristopher Moore
Abstract:
A great deal of effort has been spent measuring topological features of the Internet. However, it was recently argued that sampling based on taking paths or traceroutes through the network from a small number of sources introduces a fundamental bias in the observed degree distribution. We examine this bias analytically and experimentally. For Erdős-Rényi random graphs with mean degree c, we show analytically that traceroute sampling gives an observed degree distribution P(k) ~ 1/k for k < c, even though the underlying degree distribution is Poisson. For graphs whose degree distributions have power-law tails P(k) ~ k^-alpha, traceroute sampling from a small number of sources can significantly underestimate the value of alpha when the graph has a large excess (i.e., many more edges than vertices). We find that in order to obtain a good estimate of alpha it is necessary to use a number of sources which grows linearly in the average degree of the underlying graph. Based on these observations we comment on the accuracy of the published values of alpha for the Internet.
△ Less
Submitted 4 October, 2004;
originally announced October 2004.