Scientific productivity as a random walk

Authors: Sam Zhang, Nicholas LaBerge, Samuel F. Way, Daniel B. Larremore, Aaron Clauset

Abstract: The expectation that scientific productivity follows regular patterns over a career underpins many scholarly evaluations, including hiring, promotion and tenure, awards, and grant funding. However, recent studies of individual productivity patterns reveal a puzzle: on the one hand, the average number of papers published per year robustly follows the "canonical trajectory" of a rapid rise to an early peak followed by a gradual decline, but on the other hand, only about 20% of individual productivity trajectories follow this pattern. We resolve this puzzle by modeling scientific productivity as a parameterized random walk, showing that the canonical pattern can be explained as a decrease in the variance in changes to productivity in the early-to-mid career. By empirically characterizing the variable structure of 2,085 productivity trajectories of computer science faculty at 205 PhD-granting institutions, spanning 29,119 publications over 1980--2016, we (i) discover remarkably simple patterns in both early-career and year-to-year changes to productivity, and (ii) show that a random walk model of productivity both reproduces the canonical trajectory in the average productivity and captures much of the diversity of individual-level trajectories. These results highlight the fundamental role of a panoply of contingent factors in shaping individual scientific productivity, opening up new avenues for characterizing how systemic incentives and opportunities can be directed for aggregate effect.

Submitted 13 March, 2025; v1 submitted 8 September, 2023; originally announced September 2023.

MSC Class: 62P25

arXiv:2001.11818 [pdf, other]

doi 10.1103/PhysRevE.102.032309

Community Detection in Bipartite Networks with Stochastic Blockmodels

Authors: Tzu-Chi Yen, Daniel B. Larremore

Abstract: In bipartite networks, community structures are restricted to being disassortative, in that nodes of one type are grouped according to common patterns of connection with nodes of the other type. This makes the stochastic block model (SBM), a highly flexible generative model for networks with block structure, an intuitive choice for bipartite community detection. However, typical formulations of the SBM do not make use of the special structure of bipartite networks. Here we introduce a Bayesian nonparametric formulation of the SBM and a corresponding algorithm to efficiently find communities in bipartite networks which parsimoniously chooses the number of communities. The biSBM improves community detection results over general SBMs when data are noisy, improves the model resolution limit by a factor of $\sqrt{2}$, and expands our understanding of the complicated optimization landscape associated with community detection tasks. A direct comparison of certain terms of the prior distributions in the biSBM and a related high-resolution hierarchical SBM also reveals a counterintuitive regime of community detection problems, populated by smaller and sparser networks, where nonhierarchical models outperform their more flexible counterpart.

Submitted 29 September, 2020; v1 submitted 22 January, 2020; originally announced January 2020.

Comments: 17 pages, 6 figures. Code is available at https://github.com/junipertcy/bipartiteSBM and a documentation at https://docs.netscied.tw/bipartiteSBM/index.html

Journal ref: Phys. Rev. E 102, 032309 (2020)

arXiv:1608.05878 [pdf, other]

doi 10.1126/sciadv.1602548

The ground truth about metadata and community detection in networks

Authors: Leto Peel, Daniel B. Larremore, Aaron Clauset

Abstract: Across many scientific domains, there is a common need to automatically extract a simplified view or coarse-graining of how a complex system's components interact. This general task is called community detection in networks and is analogous to searching for clusters in independent vector data. It is common to evaluate the performance of community detection algorithms by their ability to find so-called "ground truth" communities. This works well in synthetic networks with planted communities because such networks' links are formed explicitly based on those known communities. However, there are no planted communities in real world networks. Instead, it is standard practice to treat some observed discrete-valued node attributes, or metadata, as ground truth. Here, we show that metadata are not the same as ground truth, and that treating them as such induces severe theoretical and practical problems. We prove that no algorithm can uniquely solve community detection, and we prove a general No Free Lunch theorem for community detection, which implies that there can be no algorithm that is optimal for all possible community detection tasks. However, community detection remains a powerful tool and node metadata still have value so a careful exploration of their relationship with network structure can yield insights of genuine worth. We illustrate this point by introducing two statistical techniques that can quantify the relationship between metadata and community structure for a broad class of models. We demonstrate these techniques using both synthetic and real-world networks, and for multiple types of metadata and community structure.

Submitted 3 May, 2017; v1 submitted 20 August, 2016; originally announced August 2016.

Comments: 27 pages, 10 figures, 11 tables

Journal ref: Science Advances 3(5) e1602548, 2017

arXiv:1608.00607 [pdf, other]

Configuring Random Graph Models with Fixed Degree Sequences

Authors: Bailey K. Fosdick, Daniel B. Larremore, Joel Nishimura, Johan Ugander

Abstract: Random graph null models have found widespread application in diverse research communities analyzing network datasets, including social, information, and economic networks, as well as food webs, protein-protein interactions, and neuronal networks. The most popular family of random graph null models, called configuration models, are defined as uniform distributions over a space of graphs with a fixed degree sequence. Commonly, properties of an empirical network are compared to properties of an ensemble of graphs from a configuration model in order to quantify whether empirical network properties are meaningful or whether they are instead a common consequence of the particular degree sequence. In this work we study the subtle but important decisions underlying the specification of a configuration model, and investigate the role these choices play in graph sampling procedures and a suite of applications. We place particular emphasis on the importance of specifying the appropriate graph labeling (stub-labeled or vertex-labeled) under which to consider a null model, a choice that closely connects the study of random graphs to the study of random contingency tables. We show that the choice of graph labeling is inconsequential for studies of simple graphs, but can have a significant impact on analyses of multigraphs or graphs with self-loops. The importance of these choices is demonstrated through a series of three vignettes, analyzing network datasets under many different configuration models and observing substantial differences in study conclusions under different models. We argue that in each case, only one of the possible configuration models is appropriate. While our work focuses on undirected static networks, it aims to guide the study of directed networks, dynamic networks, and all other network contexts that are suitably studied through the lens of random graph null models.

Submitted 10 October, 2017; v1 submitted 1 August, 2016; originally announced August 2016.

Comments: To appear in SIAM Review, June 2018. Code available at github.com/joelnish/double-edge-swap-mcmc. v3 fixed minor typos

arXiv:1602.00795 [pdf, other]

Gender, Productivity, and Prestige in Computer Science Faculty Hiring Networks

Authors: Samuel F. Way, Daniel B. Larremore, Aaron Clauset

Abstract: Women are dramatically underrepresented in computer science at all levels in academia and account for just 15% of tenure-track faculty. Understanding the causes of this gender imbalance would inform both policies intended to rectify it and employment decisions by departments and individuals. Progress in this direction, however, is complicated by the complexity and decentralized nature of faculty hiring and the non-independence of hires. Using comprehensive data on both hiring outcomes and scholarly productivity for 2659 tenure-track faculty across 205 Ph.D.-granting departments in North America, we investigate the multi-dimensional nature of gender inequality in computer science faculty hiring through a network model of the hiring process. Overall, we find that hiring outcomes are most directly affected by (i) the relative prestige between hiring and placing institutions and (ii) the scholarly productivity of the candidates. After including these, and other features, the addition of gender did not significantly reduce modeling error. However, gender differences do exist, e.g., in scholarly productivity, postdoctoral training rates, and in career movements up the rankings of universities, suggesting that the effects of gender are indirectly incorporated into hiring decisions through gender's covariates. Furthermore, we find evidence that more highly ranked departments recruit female faculty at higher than expected rates, which appears to inhibit similar efforts by lower ranked departments. These findings illustrate the subtle nature of gender inequality in faculty hiring networks and provide new insights to the underrepresentation of women in computer science.

Submitted 2 February, 2016; originally announced February 2016.

Comments: 11 pages, 7 figures, 5 tables

Journal ref: Proc. 2016 World Wide Web Conference (WWW), 1169-1179 (2016)

arXiv:1403.2933 [pdf, other]

doi 10.1103/PhysRevE.90.012805

Efficiently inferring community structure in bipartite networks

Authors: Daniel B. Larremore, Aaron Clauset, Abigail Z. Jacobs

Abstract: Bipartite networks are a common type of network data in which there are two types of vertices, and only vertices of different types can be connected. While bipartite networks exhibit community structure like their unipartite counterparts, existing approaches to bipartite community detection have drawbacks, including implicit parameter choices, loss of information through one-mode projections, and lack of interpretability. Here we solve the community detection problem for bipartite networks by formulating a bipartite stochastic block model, which explicitly includes vertex type information and may be trivially extended to $k$-partite networks. This bipartite stochastic block model yields a projection-free and statistically principled method for community detection that makes clear assumptions and parameter choices and yields interpretable results. We demonstrate this model's ability to efficiently and accurately find community structure in synthetic bipartite networks with known structure and in real-world bipartite networks with unknown structure, and we characterize its performance in practical contexts.

Submitted 10 July, 2014; v1 submitted 12 March, 2014; originally announced March 2014.

Comments: 12 pages, 9 figures

Journal ref: Physical Review E 90(1): 012805 (2014)

Showing 1–6 of 6 results for author: Larremore, D B