-
How should the advent of large language models affect the practice of science?
Authors:
Marcel Binz,
Stephan Alaniz,
Adina Roskies,
Balazs Aczel,
Carl T. Bergstrom,
Colin Allen,
Daniel Schad,
Dirk Wulff,
Jevin D. West,
Qiong Zhang,
Richard M. Shiffrin,
Samuel J. Gershman,
Ven Popov,
Emily M. Bender,
Marco Marelli,
Matthew M. Botvinick,
Zeynep Akata,
Eric Schulz
Abstract:
Large language models (LLMs) are being increasingly incorporated into scientific workflows. However, we have yet to fully grasp the implications of this integration. How should the advent of large language models affect the practice of science? For this opinion piece, we have invited four diverse groups of scientists to reflect on this query, sharing their perspectives and engaging in debate. Schulz et al. make the argument that working with LLMs is not fundamentally different from working with human collaborators, while Bender et al. argue that LLMs are often misused and over-hyped, and that their limitations warrant a focus on more specialized, easily interpretable tools. Marelli et al. emphasize the importance of transparent attribution and responsible use of LLMs. Finally, Botvinick and Gershman advocate that humans should retain responsibility for determining the scientific roadmap. To facilitate the discussion, the four perspectives are complemented with a response from each group. By putting these different perspectives in conversation, we aim to bring attention to important considerations within the academic community regarding the adoption of LLMs and their impact on both current and future scientific practices.
Submitted 5 December, 2023;
originally announced December 2023.
-
Rationalizing risk aversion in science
Authors:
Kevin Gross,
Carl T. Bergstrom
Abstract:
Scientific research requires taking risks, as the most cautious approaches are unlikely to lead to the most rapid progress. Yet much funded scientific research plays it safe and funding agencies bemoan the difficulty of attracting high-risk, high-return research projects. Why don't the incentives for scientific discovery adequately impel researchers toward such projects? Here we adapt an economic contracting model to explore how the unobservability of risk and effort discourages risky research. The model considers a hidden-action problem, in which the scientific community must reward discoveries in a way that encourages effort and risk-taking while simultaneously protecting researchers' livelihoods against the vicissitudes of scientific chance. Its challenge when doing so is that incentives to motivate effort clash with incentives to motivate risk-taking, because a failed project may be evidence of a risky undertaking but could also be the result of simple sloth. As a result, the incentives needed to encourage effort actively discourage risk-taking. Scientists respond by working on safe projects that generate evidence of effort but that don't move science forward as rapidly as riskier projects would. A social planner who prizes scientific productivity above researchers' well-being could remedy the problem by rewarding major discoveries richly enough to induce high-risk research, but scientists would be worse off for it. Because the scientific community is approximately self-governing and constructs its own reward schedule, the incentives that researchers are willing to impose on themselves are inadequate to motivate the scientific risks that would best expedite scientific progress.
Submitted 27 February, 2024; v1 submitted 23 June, 2023;
originally announced June 2023.
-
Challenges in cybersecurity: Lessons from biological defense systems
Authors:
Edward Schrom,
Ann Kinzig,
Stephanie Forrest,
Andrea L. Graham,
Simon A. Levin,
Carl T. Bergstrom,
Carlos Castillo-Chavez,
James P. Collins,
Rob J. de Boer,
Adam Doupé,
Roya Ensafi,
Stuart Feldman,
Bryan T. Grenfell,
Alex Halderman,
Silvie Huijben,
Carlo Maley,
Melanie Moses,
Alan S. Perelson,
Charles Perrings,
Joshua Plotkin,
Jennifer Rexford,
Mohit Tiwari
Abstract:
We explore the commonalities between methods for assuring the security of computer systems (cybersecurity) and the mechanisms that have evolved through natural selection to protect vertebrates against pathogens, and how insights derived from studying the evolution of natural defenses can inform the design of more effective cybersecurity systems. More generally, security challenges are crucial for the maintenance of a wide range of complex adaptive systems, including financial systems, and again lessons learned from the study of the evolution of natural defenses can provide guidance for the protection of such systems.
Submitted 21 July, 2021;
originally announced July 2021.
-
Why ex post peer review encourages high-risk research while ex ante review discourages it
Authors:
Kevin Gross,
Carl T. Bergstrom
Abstract:
Peer review is an integral component of contemporary science. While peer review focuses attention on promising and interesting science, it also encourages scientists to pursue some questions at the expense of others. Here, we use ideas from forecasting assessment to examine how two modes of peer review -- ex ante review of proposals for future work and ex post review of completed science -- motivate scientists to favor some questions instead of others. Our main result is that ex ante and ex post peer review push investigators toward distinct sets of scientific questions. This tension arises because ex post review allows an investigator to leverage her own scientific beliefs to generate results that others will find surprising, whereas ex ante review does not. Moreover, ex ante review will favor different research questions depending on whether reviewers rank proposals in anticipation of changes to their own personal beliefs, or to the beliefs of their peers. The tension between ex ante and ex post review puts investigators in a bind, because most researchers need to find projects that will survive both. By unpacking the tension between these two modes of review, we can understand how they shape the landscape of science and how changes to peer review might shift scientific activity in unforeseen directions.
Submitted 28 September, 2021; v1 submitted 24 June, 2021;
originally announced June 2021.
-
Gender-based homophily in collaborations across a heterogeneous scholarly landscape
Authors:
Y. Samuel Wang,
Carole J. Lee,
Jevin D. West,
Carl T. Bergstrom,
Elena A. Erosheva
Abstract:
In this article, we investigate the role of gender in collaboration patterns by analyzing gender-based homophily -- the tendency for researchers to co-author with individuals of the same gender. We develop and apply novel methodology to the corpus of JSTOR articles, a broad scholarly landscape, which we analyze at various levels of granularity. Most notably, for a precise analysis of gender homophily, we develop methodology which explicitly accounts for the fact that the data comprises heterogeneous intellectual communities and that not all authorships are exchangeable. In particular, we distinguish three phenomena which may affect the distribution of observed gender homophily in collaborations: a structural component that is due to demographics and non-gendered authorship norms of a scholarly community, a compositional component which is driven by varying gender representation across sub-disciplines and time, and a behavioral component which we define as the remainder of observed gender homophily after its structural and compositional components have been taken into account. Using minimal modeling assumptions, the methodology we develop allows us to test for behavioral homophily. We find that statistically significant behavioral homophily can be detected across the JSTOR corpus and show that this finding is robust to missing gender indicators in our data. In a secondary analysis, we show that the proportion of women representation in a field is positively associated with the probability of finding statistically significant behavioral homophily.
Submitted 16 June, 2022; v1 submitted 3 September, 2019;
originally announced September 2019.
-
Why scatter plots suggest causality, and what we can do about it
Authors:
Carl T. Bergstrom,
Jevin D. West
Abstract:
Scatter plots carry an implicit if subtle message about causality. Whether we look at functions of one variable in pure mathematics, plots of experimental measurements as a function of the experimental conditions, or scatter plots of predictor and response variables, the value plotted on the vertical axis is by convention assumed to be determined or influenced by the value on the horizontal axis. This is a problem for the public understanding of scientific results and perhaps also for professional scientists' interpretations of scatter plots. To avoid suggesting a causal relationship between the x and y values in a scatter plot, we propose a new type of data visualization, the diamond plot. Diamond plots are essentially 45 degree rotations of ordinary scatter plots; by visually jarring the viewer they clearly indicate that she should not draw the usual distinction between independent/predictor variable and dependent/response variable. Instead, she should see the relationship as purely correlative.
Submitted 25 September, 2018;
originally announced September 2018.
-
Contest models highlight inherent inefficiencies of scientific funding competitions
Authors:
Kevin Gross,
Carl T. Bergstrom
Abstract:
Scientific research funding is allocated largely through a system of soliciting and ranking competitive grant proposals. In these competitions, the proposals themselves are not the deliverables that the funder seeks, but instead are used by the funder to screen for the most promising research ideas. Consequently, some of the funding program's impact on science is squandered because applying researchers must spend time writing proposals instead of doing science. To what extent does the community's aggregate investment in proposal preparation negate the scientific impact of the funding program? Are there alternative mechanisms for awarding funds that advance science more efficiently? We use the economic theory of contests to analyze how efficiently grant proposal competitions advance science, and compare them with recently proposed, partially randomized alternatives such as lotteries. We find that the effort researchers waste in writing proposals may be comparable to the total scientific value of the research that the funding supports, especially when only a few proposals can be funded. Moreover, when professional pressures motivate investigators to seek funding for reasons that extend beyond the value of the proposed science (e.g., promotion, prestige), the entire program can actually hamper scientific progress when the number of awards is small. We suggest that lost efficiency may be restored either by partial lotteries for funding, or by funding researchers based on past scientific success instead of proposals for future work.
Submitted 2 January, 2019; v1 submitted 10 April, 2018;
originally announced April 2018.
-
Publication bias and the canonization of false facts
Authors:
Silas B. Nissen,
Tali Magidson,
Kevin Gross,
Carl T. Bergstrom
Abstract:
In the process of scientific inquiry, certain claims accumulate enough support to be established as facts. Unfortunately, not every claim accorded the status of fact turns out to be true. In this paper, we model the dynamic process by which claims are canonized as fact through repeated experimental confirmation. The community's confidence in a claim constitutes a Markov process: each successive published result shifts the degree of belief, until sufficient evidence accumulates to accept the claim as fact or to reject it as false. In our model, publication bias --- in which positive results are published preferentially over negative ones --- influences the distribution of published results. We find that when readers do not know the degree of publication bias and thus cannot condition on it, false claims often can be canonized as facts. Unless a sufficient fraction of negative results are published, the scientific process will do a poor job at discriminating false from true claims. This problem is exacerbated when scientists engage in p-hacking, data dredging, and other behaviors that increase the rate at which false positives are published. If negative results become easier to publish as a claim approaches acceptance as a fact, however, true and false claims can be more readily distinguished. To the degree that the model accurately represents current scholarly practice, there will be serious concern about the validity of purported facts in some areas of scientific research.
Submitted 20 November, 2016; v1 submitted 2 September, 2016;
originally announced September 2016.
-
Men Set Their Own Cites High: Gender and Self-citation across Fields and over Time
Authors:
Molly M. King,
Carl T. Bergstrom,
Shelley J. Correll,
Jennifer Jacquet,
Jevin D. West
Abstract:
How common is self-citation in scholarly publication, and does the practice vary by gender? Using novel methods and a data set of 1.5 million research papers in the scholarly database JSTOR published between 1779 and 2011, the authors find that nearly 10 percent of references are self-citations by a paper's authors. The findings also show that between 1779 and 2011, men cited their own papers 56 percent more than did women. In the last two decades of data, men self-cited 70 percent more than women. Women are also more than 10 percentage points more likely than men to not cite their own previous work at all. While these patterns could result from differences in the number of papers that men and women authors have published rather than gender-specific patterns of self-citation behavior, this gender gap in self-citation rates has remained stable over the last 50 years, despite increased representation of women in academia. The authors break down self-citation patterns by academic field and number of authors and comment on potential mechanisms behind these observations. These findings have important implications for scholarly visibility and cumulative advantage in academic careers.
Submitted 12 December, 2017; v1 submitted 30 June, 2016;
originally announced July 2016.
-
Static Ranking of Scholarly Papers using Article-Level Eigenfactor (ALEF)
Authors:
Ian Wesley-Smith,
Carl T. Bergstrom,
Jevin D. West
Abstract:
Microsoft Research hosted the 2016 WSDM Cup Challenge based on the Microsoft Academic Graph. The goal was to provide static rankings for the articles that make up the graph, with the rankings to be evaluated against those of human judges. While the Microsoft Academic Graph provided metadata about many aspects of each scholarly document, we focused more narrowly on citation data and used this contest as an opportunity to test the Article Level Eigenfactor (ALEF), a novel citation-based ranking algorithm, and evaluate its performance against competing algorithms that drew upon multiple facets of the data from a large, real world dataset (122M papers and 757M citations). Our final submission to this contest was scored at 0.676, earning second place.
Submitted 27 June, 2016;
originally announced June 2016.
-
Why Scientists Chase Big Problems: Individual Strategy and Social Optimality
Authors:
Carl T. Bergstrom,
Jacob G. Foster,
Yangbo Song
Abstract:
Scientists pursue collective knowledge, but they also seek personal recognition from their peers. When scientists decide whether or not to work on a big new problem, they weigh the potential rewards of a major discovery against the costs of setting aside other projects. These self-interested choices can potentially spread researchers across problems in an efficient manner, but efficiency is not guaranteed. We use simple economic models to understand such decisions and their collective consequences. Academic science differs from industrial R&D in that academics often share partial solutions to gain reputation. This convention of Open Science is thought to accelerate collective discovery, but we find that it need not do so. The ability to share partial results influences which scientists work on a particular problem; consequently, Open Science can slow down the solution of a problem if it deters entry by important actors.
Submitted 23 July, 2016; v1 submitted 19 May, 2016;
originally announced May 2016.
-
Adaptive behavior can produce maladaptive anxiety due to individual differences in experience
Authors:
Frazer Meacham,
Carl T. Bergstrom
Abstract:
Normal anxiety is considered an adaptive response to the possible presence of danger, but is susceptible to dysregulation. Anxiety disorders are prevalent at high frequency in contemporary human societies, yet impose substantial disability upon their sufferers. This raises a puzzle: why has evolution left us vulnerable to anxiety disorders? We develop a signal detection model in which individuals must learn how to calibrate their anxiety responses: they need to learn which cues indicate danger in the environment. We derive the optimal strategy for doing so, and find that individuals face an inevitable exploration-exploitation tradeoff between obtaining a better estimate of the level of risk on one hand, and maximizing current payoffs on the other. Because of this tradeoff, a subset of the population can become trapped in a state of self-perpetuating over-sensitivity to threatening stimuli, even when individuals learn optimally. This phenomenon arises because when individuals become too cautious, they stop sampling the environment and fail to correct their misperceptions, whereas when individuals become too careless they continue to sample the environment and soon discover their mistakes. Thus, over-sensitivity to threats becomes common whereas under-sensitivity becomes rare. We suggest that this process may be involved in the development of excessive anxiety in humans.
Submitted 7 June, 2016; v1 submitted 13 January, 2015;
originally announced January 2015.
-
Defensive complexity and the phylogenetic conservation of immune control
Authors:
Erick Chastain,
Rustom Antia,
Carl T. Bergstrom
Abstract:
One strategy for winning a coevolutionary struggle is to evolve rapidly. Most of the literature on host-pathogen coevolution focuses on this phenomenon, and looks for consequent evidence of coevolutionary arms races. An alternative strategy, less often considered in the literature, is to deter rapid evolutionary change by the opponent. To study how this can be done, we construct an evolutionary game between a controller that must process information, and an adversary that can tamper with this information processing. In this game, a species can foil its antagonist by processing information in a way that is hard for the antagonist to manipulate. We show that the structure of the information processing system induces a fitness landscape on which the adversary population evolves. Complex processing logic can carve long, deep fitness valleys that slow adaptive evolution in the adversary population. We suggest that this type of defensive complexity on the part of the vertebrate adaptive immune system may be an important element of coevolutionary dynamics between pathogens and their vertebrate hosts. Furthermore, we cite evidence that the immune control logic is phylogenetically conserved in mammalian lineages. Thus our model of defensive complexity suggests a new hypothesis for the lower rates of evolution for immune control logic compared to other immune structures.
Submitted 12 November, 2012;
originally announced November 2012.
-
The role of gender in scholarly authorship
Authors:
Jevin D. West,
Jennifer Jacquet,
Molly M. King,
Shelley J. Correll,
Carl T. Bergstrom
Abstract:
Gender disparities appear to be decreasing in academia according to a number of metrics, such as grant funding, hiring, acceptance at scholarly journals, and productivity, and it might be tempting to think that gender inequity will soon be a problem of the past. However, a large-scale analysis based on over eight million papers across the natural sciences, social sciences, and humanities reveals a number of understated and persistent ways in which gender inequities remain. For instance, even where raw publication counts seem to be equal between genders, close inspection reveals that, in certain fields, men predominate in the prestigious first and last author positions. Moreover, women are significantly underrepresented as authors of single-authored papers. Academics should be aware of the subtle ways that gender disparities can appear in scholarly authorship.
Submitted 7 November, 2012;
originally announced November 2012.
-
Defensive complexity in antagonistic coevolution
Authors:
Erick Chastain,
Rustom Antia,
Carl T. Bergstrom
Abstract:
One strategy for winning a coevolutionary struggle is to evolve rapidly. Most of the literature on host-pathogen coevolution focuses on this phenomenon, and looks for consequent evidence of coevolutionary arms races. An alternative strategy, less often considered in the literature, is to deter rapid evolutionary change by the opponent. To study how this can be done, we construct an evolutionary game between a controller that must process information, and an adversary that can tamper with this information processing. In this game, a species can foil its antagonist by processing information in a way that is hard for the antagonist to manipulate. We show that the structure of the information processing system induces a fitness landscape on which the adversary population evolves, and that complex processing logic is required to make that landscape rugged. Drawing on the rich literature concerning rates of evolution on rugged landscapes, we show how a species can slow adaptive evolution in the adversary population. We suggest that this type of defensive complexity on the part of the vertebrate adaptive immune system may be an important element of coevolutionary dynamics between pathogens and their vertebrate hosts.
Submitted 15 December, 2014; v1 submitted 20 March, 2012;
originally announced March 2012.
-
Nodal dynamics, not degree distributions, determine the structural controllability of complex networks
Authors:
Noah J. Cowan,
Erick J. Chastain,
Daril A. Vilhena,
James S. Freudenberg,
Carl T. Bergstrom
Abstract:
Structural controllability has been proposed as an analytical framework for making predictions regarding the control of complex networks across myriad disciplines in the physical and life sciences (Liu et al., Nature:473(7346):167-173, 2011). Although the integration of control theory and network analysis is important, we argue that the application of the structural controllability framework to most if not all real-world networks leads to the conclusion that a single control input, applied to the power dominating set (PDS), is all that is needed for structural controllability. This result is consistent with the well-known fact that controllability and its dual observability are generic properties of systems. We argue that more important than issues of structural controllability are the questions of whether a system is almost uncontrollable, whether it is almost unobservable, and whether it possesses almost pole-zero cancellations.
Submitted 21 April, 2012; v1 submitted 13 June, 2011;
originally announced June 2011.
-
Multilevel compression of random walks on networks reveals hierarchical organization in large integrated systems
Authors:
M. Rosvall,
C. T. Bergstrom
Abstract:
To comprehend the hierarchical organization of large integrated systems, we introduce the hierarchical map equation, which reveals multilevel structures in networks. In this information-theoretic approach, we exploit the duality between compression and pattern detection; by compressing a description of a random walker as a proxy for real flow on a network, we find regularities in the network that induce this system-wide flow. Finding the shortest multilevel description of the random walker therefore gives us the best hierarchical clustering of the network, the optimal number of levels and modular partition at each level, with respect to the dynamics on the network. With a novel search algorithm, we extract and illustrate the rich multilevel organization of several large social and biological networks. For example, from the global air traffic network we uncover countries and continents, and from the pattern of scientific communication we reveal more than 100 scientific fields organized in four major disciplines: life sciences, physical sciences, ecology and earth sciences, and social sciences. In general, we find shallow hierarchical structures in globally interconnected systems, such as neural networks, and rich multilevel organizations in systems with highly separated regions, such as road networks.
Submitted 10 April, 2011; v1 submitted 3 October, 2010;
originally announced October 2010.
-
The map equation
Authors:
M. Rosvall,
D. Axelsson,
C. T. Bergstrom
Abstract:
Many real-world networks are so large that we must simplify their structure before we can extract useful information about the systems they represent. As the tools for doing these simplifications proliferate within the network literature, researchers would benefit from some guidelines about which of the so-called community detection algorithms are most appropriate for the structures they are studying and the questions they are asking. Here we show that different methods highlight different aspects of a network's structure and that the sort of information that we seek to extract about the system must guide us in our decision. For example, many community detection algorithms, including the popular modularity maximization approach, infer module assignments from an underlying model of the network formation process. However, we are not always as interested in how a system's network structure was formed, as we are in how a network's extant structure influences the system's behavior. To see how structure influences current behavior, we will recognize that links in a network induce movement across the network and result in system-wide interdependence. In doing so, we explicitly acknowledge that most networks carry flow. To highlight and simplify the network structure with respect to this flow, we use the map equation. We present an intuitive derivation of this flow-based and information-theoretic method and provide an interactive on-line application that anyone can use to explore the mechanics of the map equation. We also describe an algorithm and provide source code to efficiently decompose large weighted and directed networks based on the map equation.
Submitted 23 September, 2009; v1 submitted 7 June, 2009;
originally announced June 2009.
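The flow-based idea in the abstract above can be illustrated with a small sketch. It is not the authors' source code: it assumes the standard two-level form of the map equation, L(M) = q H(Q) + sum_i p_i H(P_i), restricted to an undirected, unweighted network, where the random walker visits a node in proportion to its degree. The function names `entropy` and `map_equation` and the example graph are hypothetical, chosen only for illustration.

```python
import math
from collections import defaultdict


def entropy(weights):
    """Shannon entropy (bits) of the distribution obtained by normalizing weights."""
    total = sum(weights)
    if total <= 0:
        return 0.0
    return -sum((w / total) * math.log2(w / total) for w in weights if w > 0)


def map_equation(edges, module_of):
    """Two-level description length, in bits per step, of a random walk.

    Assumes an undirected, unweighted graph given as a list of (u, v) pairs,
    and a dict mapping every node to a module label.
    """
    degree = defaultdict(int)
    exit_links = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if module_of[u] != module_of[v]:
            # An edge between modules carries exit flow for both of them.
            exit_links[module_of[u]] += 1
            exit_links[module_of[v]] += 1

    two_m = 2 * len(edges)
    # Stationary visit rate of each node: degree / (2 * number of edges).
    visit = {node: d / two_m for node, d in degree.items()}
    modules = {module_of[node] for node in degree}
    # Rate at which the walker leaves each module.
    exit_rate = {m: exit_links[m] / two_m for m in modules}

    # Index codebook: used only when the walker switches modules.
    total_exit = sum(exit_rate.values())
    length = total_exit * entropy(list(exit_rate.values()))

    # Module codebooks: one codeword per node plus one exit codeword.
    for m in modules:
        rates = [exit_rate[m]] + [visit[n] for n in degree if module_of[n] == m]
        length += sum(rates) * entropy(rates)
    return length


# Two triangles joined by a single edge (hypothetical example).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
one_module = {n: "a" for n in range(6)}
two_modules = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

print("one module :", map_equation(edges, one_module))
print("two modules:", map_equation(edges, two_modules))
```

In this toy graph, splitting the two triangles into separate modules gives a shorter description than treating the network as one module, which is the sense in which compression and module detection coincide. The sketch only scores a given partition; searching over partitions, and handling weighted or directed links, is outside its scope.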
-
Mapping change in large networks
Authors:
M. Rosvall,
C. T. Bergstrom
Abstract:
Change is a fundamental ingredient of interaction patterns in biology, technology, the economy, and science itself: Interactions within and between organisms change; transportation patterns by air, land, and sea all change; the global financial flow changes; and the frontiers of scientific research change. Networks and clustering methods have become important tools to comprehend instances of these large-scale structures, but without methods to distinguish between real trends and noisy data, these approaches are not useful for studying how networks change. Only if we can assign significance to the partitioning of single networks can we distinguish meaningful structural changes from random fluctuations. Here we show that bootstrap resampling accompanied by significance clustering provides a solution to this problem. To connect changing structures with the changing function of networks, we highlight and summarize the significant structural changes with alluvial diagrams and realize de Solla Price's vision of mapping change in science: studying the citation pattern between about 7000 scientific journals over the past decade, we find that neuroscience has transformed from an interdisciplinary specialty to a mature and stand-alone discipline.
Submitted 6 October, 2010; v1 submitted 5 December, 2008;
originally announced December 2008.
-
The transmission sense of information
Authors:
C. T. Bergstrom,
M. Rosvall
Abstract:
Biologists rely heavily on the language of information, coding, and transmission that is commonplace in the field of information theory as developed by Claude Shannon, but there is open debate about whether such language is anything more than facile metaphor. Philosophers of biology have argued that when biologists talk about information in genes and in evolution, they are not talking about the sort of information that Shannon's theory addresses. First, philosophers have suggested that Shannon theory is only useful for developing a shallow notion of correlation, the so-called "causal sense" of information. Second, they typically argue that in genetics and evolutionary biology, information language is used in a "semantic sense," whereas semantics are deliberately omitted from Shannon theory. Neither critique is well-founded. Here we propose an alternative to the causal and semantic senses of information: a transmission sense of information, in which an object X conveys information if the function of X is to reduce, by virtue of its sequence properties, uncertainty on the part of an agent who observes X. The transmission sense not only captures much of what biologists intend when they talk about information in genes, but also brings Shannon's theory back to the fore. By taking the viewpoint of a communications engineer and focusing on the decision problem of how information is to be packaged for transport, this approach resolves several problems that have plagued the information concept in biology, and highlights a number of important features of the way that information is encoded, stored, and transmitted as genetic sequence.
Submitted 22 October, 2008;
originally announced October 2008.
-
Differences in Impact Factor Across Fields and Over Time
Authors:
Benjamin M. Althouse,
Jevin D. West,
Theodore Bergstrom,
Carl T. Bergstrom
Abstract:
The bibliometric measure impact factor is a leading indicator of journal influence, and impact factors are routinely used in making decisions ranging from selecting journal subscriptions to allocating research funding to deciding tenure cases. Yet journal impact factors have increased gradually over time, and moreover impact factors vary widely across academic disciplines. Here we quantify inflation over time and differences across fields in impact factor scores and determine the sources of these differences. We find that the average number of citations in reference lists has increased gradually, and this is the predominant factor responsible for the inflation of impact factor scores over time. Field-specific variation in the fraction of citations to literature indexed by Thomson Scientific's Journal Citation Reports is the single greatest contributor to differences among the impact factors of journals in different fields. The growth rate of the scientific literature as a whole, and cross-field differences in net size and growth rate of individual fields, have had very little influence on impact factor inflation or on cross-field differences in impact factor.
Submitted 18 April, 2008;
originally announced April 2008.
-
Maps of random walks on complex networks reveal community structure
Authors:
M. Rosvall,
C. T. Bergstrom
Abstract:
To comprehend the multipartite organization of large-scale biological and social systems, we introduce a new information theoretic approach that reveals community structure in weighted and directed networks. The method decomposes a network into modules by optimally compressing a description of information flows on the network. The result is a map that both simplifies and highlights the regularities in the structure and their relationships. We illustrate the method by making a map of scientific communication as captured in the citation patterns of more than 6000 journals. We discover a multicentric organization with fields that vary dramatically in size and degree of integration into the network of science. Along the backbone of the network -- including physics, chemistry, molecular biology, and medicine -- information flows bidirectionally, but the map reveals a directional pattern of citation from the applied fields to the basic sciences.
Submitted 12 November, 2007; v1 submitted 4 July, 2007;
originally announced July 2007.
-
An information-theoretic framework for resolving community structure in complex networks
Authors:
Martin Rosvall,
Carl T. Bergstrom
Abstract:
To understand the structure of a large-scale biological, social, or technological network, it can be helpful to decompose the network into smaller subunits or modules. In this article, we develop an information-theoretic foundation for the concept of modularity in networks. We identify the modules of which the network is composed by finding an optimal compression of its topology, capitalizing on regularities in its structure. We explain the advantages of this approach and illustrate them by partitioning a number of real-world and model networks.
Submitted 2 May, 2007; v1 submitted 5 December, 2006;
originally announced December 2006.
-
The fitness value of information
Authors:
Carl T. Bergstrom,
Michael Lachmann
Abstract:
Biologists measure information in different ways. Neurobiologists and researchers in bioinformatics often measure information using information-theoretic measures such as Shannon's entropy or mutual information. Behavioral biologists and evolutionary ecologists more commonly use decision-theoretic measures, such as the value of information, which assess the worth of information to a decision maker. Here we show that these two kinds of measures are intimately related in the context of biological evolution. We present a simple model of evolution in an uncertain environment, and calculate the increase in Darwinian fitness that is made possible by information about the environmental state. This fitness increase -- the fitness value of information -- is a composite of both Shannon's mutual information and the decision-theoretic value of information. Furthermore, we show that in certain cases the fitness value of responding to a cue is exactly equal to the mutual information between the cue and the environment. In general the Shannon entropy of the environment, which seemingly fails to take anything about organismal fitness into account, nonetheless imposes an upper bound on the fitness value of information.
Submitted 3 October, 2005;
originally announced October 2005.
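The equality between fitness gain and mutual information mentioned in the abstract above can be checked numerically in one simple case. The sketch below is not taken from the paper: it assumes a Kelly-style model in which each phenotype survives only in its matching environment, so that the optimal bet-hedging strategy allocates offspring to phenotypes in proportion to the (conditional) probability of each environment. The joint distribution, the payoffs, and the function names are hypothetical.

```python
import math


def marginals(joint):
    """Marginal distributions of environment (rows) and cue (columns)."""
    p_env = [sum(row) for row in joint]
    p_cue = [sum(col) for col in zip(*joint)]
    return p_env, p_cue


def mutual_information(joint):
    """Mutual information (bits) between environment and cue."""
    p_env, p_cue = marginals(joint)
    return sum(
        p * math.log2(p / (p_env[e] * p_cue[c]))
        for e, row in enumerate(joint)
        for c, p in enumerate(row)
        if p > 0
    )


def growth_rate(joint, payoff, use_cue):
    """Expected log2 growth rate under proportional bet-hedging.

    Assumption: a phenotype has fitness payoff[e] in its matching
    environment e and zero fitness in every other environment.
    """
    p_env, p_cue = marginals(joint)
    rate = 0.0
    for e, row in enumerate(joint):
        for c, p in enumerate(row):
            if p == 0:
                continue
            # Fraction of offspring given the phenotype suited to environment e.
            bet = p / p_cue[c] if use_cue else p_env[e]
            rate += p * math.log2(payoff[e] * bet)
    return rate


# joint[e][c] = P(environment = e, cue = c); the cue is right 80% of the time.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
payoff = [2.0, 3.0]

gain = growth_rate(joint, payoff, True) - growth_rate(joint, payoff, False)
print("fitness value of the cue:", gain)
print("mutual information      :", mutual_information(joint))
```

Under these assumptions the two printed values agree, and neither can exceed the entropy of the environment (1 bit here). With nonzero fitness in mismatched environments the gain generally differs from the mutual information, which is why the abstract restricts the equality to certain cases.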