-
Breaking the Code: Multi-level Learning in the Eurovision Song Contest
Authors:
Luís A. Nunes Amaral,
Arthur Capozzi,
Dirk Helbing
Abstract:
Organizations learn from the market, political, and societal responses to their actions. While in some cases both the actions and responses take place in an open manner, in many others, some aspects may be hidden from external observers. The Eurovision Song Contest offers an interesting example to study organizational learning at two levels: organizers and participants. We find evidence for changes in the rules of the Contest in response to undesired outcomes such as runaway winners. We also find strong evidence of participant learning in the characteristics of competing songs over the 70 years of the Contest. English has been adopted as the lingua franca of the competing songs and pop has become the standard genre. The number of words in the lyrics has also grown in response to this collective learning. Remarkably, we find evidence that four participating countries have chosen to ignore the "lesson" that English lyrics increase winning probability. This choice is consistent with utility functions that award greater value to featuring national language than to winning the Contest. Indeed, we find evidence that some countries, but not Germany, appear to be less susceptible to "peer" pressure. These observations appear to be valid beyond Eurovision.
Submitted 15 May, 2025;
originally announced May 2025.
-
Centrality anomalies in complex networks as a result of model over-simplification
Authors:
Luiz G. A. Alves,
Alberto Aleta,
Francisco A. Rodrigues,
Yamir Moreno,
Luis A. Nunes Amaral
Abstract:
Tremendous advances have been made in our understanding of the properties and evolution of complex networks. These advances were initially driven by information-poor empirical networks and theoretical analysis of unweighted and undirected graphs. Recently, information-rich empirical data on complex networks have supported the development of more sophisticated models that include edge directionality and weight properties, and multiple layers. Many studies still focus on unweighted, undirected descriptions of networks, prompting an essential question: how to identify when a model is simpler than it must be? Here, we argue that the presence of centrality anomalies in complex networks is a result of model over-simplification. Specifically, we investigate the well-known anomaly in betweenness centrality for transportation networks, according to which highly connected nodes are not necessarily the most central. Using a broad class of network models with weights and spatial constraints and four large data sets of transportation networks, we show that the unweighted projection of the structure of these networks can exhibit a significant fraction of anomalous nodes compared to a random null model. However, the weighted projection of these networks, compared with an appropriate null model, significantly reduces the fraction of anomalies observed, suggesting that centrality anomalies are a symptom of model over-simplification. Because lack of information-rich data is a common challenge when dealing with complex networks and can cause anomalies that misestimate the role of nodes in the system, we argue that sufficiently sophisticated models should be used when anomalies are detected.
Submitted 13 March, 2020; v1 submitted 2 February, 2019;
originally announced February 2019.
-
A new evaluation framework for topic modeling algorithms based on synthetic corpora
Authors:
Hanyu Shi,
Martin Gerlach,
Isabel Diersen,
Doug Downey,
Luis A. N. Amaral
Abstract:
Topic models are in widespread use in natural language processing and beyond. Here, we propose a new framework for the evaluation of probabilistic topic modeling algorithms based on synthetic corpora containing an unambiguously defined ground truth topic structure. The major innovation of our approach is the ability to quantify the agreement between the planted and inferred topic structures by comparing the assigned topic labels at the level of the tokens. In experiments, our approach yields novel insights about the relative strengths of topic models as corpus characteristics vary, and the first evidence of an "undetectable phase" for topic models when the planted structure is weak. We also establish the practical relevance of the insights gained for synthetic corpora by predicting the performance of topic modeling algorithms in classification tasks in real-world corpora.
Submitted 28 January, 2019;
originally announced January 2019.
-
The Distribution of the Asymptotic Number of Citations to Sets of Publications by a Researcher or From an Academic Department Are Consistent With a Discrete Lognormal Model
Authors:
João A. G. Moreira,
Xiao Han T. Zeng,
Luís A. Nunes Amaral
Abstract:
How to quantify the impact of a researcher's or an institution's body of work is a matter of increasing importance to scientists, funding agencies, and hiring committees. The use of bibliometric indicators, such as the h-index or the Journal Impact Factor, has become widespread despite their known limitations. We argue that most existing bibliometric indicators are inconsistent, biased, and, worst of all, susceptible to manipulation. Here, we pursue a principled approach to the development of an indicator to quantify the scientific impact of both individual researchers and research institutions, grounded in the functional form of the distribution of the asymptotic number of citations. We validate our approach using the publication records of 1,283 researchers from seven scientific and engineering disciplines and the chemistry departments at the 106 U.S. research institutions classified as "very high research activity". Our approach has three distinct advantages. First, it accurately captures the overall scientific impact of researchers at all career stages, as measured by asymptotic citation counts. Second, unlike other measures, our indicator is resistant to manipulation and rewards publication quality over quantity. Third, our approach captures the time-evolution of the scientific impact of research institutions.
Submitted 2 November, 2015;
originally announced November 2015.
-
A high-reproducibility and high-accuracy method for automated topic classification
Authors:
Andrea Lancichinetti,
M. Irmak Sirer,
Jane X. Wang,
Daniel Acuna,
Konrad Körding,
Luís A. Nunes Amaral
Abstract:
Much of human knowledge sits in large databases of unstructured text. Leveraging this knowledge requires algorithms that extract and record metadata on unstructured text documents. Assigning topics to documents will enable intelligent search, statistical characterization, and meaningful classification. Latent Dirichlet allocation (LDA) is the state-of-the-art in topic classification. Here, we perform a systematic theoretical and numerical analysis that demonstrates that current optimization techniques for LDA often yield results which are not accurate in inferring the most suitable model parameters. Adapting approaches for community detection in networks, we propose a new algorithm which displays high-reproducibility and high-accuracy, and also has high computational efficiency. We apply it to a large set of documents in the English Wikipedia and reveal its hierarchical structure. Our algorithm promises to make "big data" text analysis systems more reliable.
Submitted 3 February, 2014;
originally announced February 2014.
-
Correlations between user voting data, budget, and box office for films in the Internet Movie Database
Authors:
Max Wasserman,
Satyam Mukherjee,
Konner Scott,
Xiao Han T. Zeng,
Filippo Radicchi,
Luís A. N. Amaral
Abstract:
The Internet Movie Database (IMDb) is one of the most-visited websites in the world and the premier source for information on films. Like Wikipedia, much of IMDb's information is user contributed. IMDb also allows users to voice their opinion on the quality of films through voting. We investigate whether there is a connection between this user voting data and certain economic film characteristics. To this end, we perform distribution and correlation analysis on a set of films chosen to mitigate effects of bias due to the language and country of origin of films. We show that production budget, box office gross, and total number of user votes for films are consistent with double-log normal distributions for certain time periods. Both total gross and user votes are consistent with a double-log normal distribution from the late 1980s onward, while for budget, it extends from 1935 to 1979. In addition, we find a strong correlation between number of user votes and the economic statistics, particularly budget. Remarkably, we find no evidence for a correlation between number of votes and average user rating. As previous studies have found a strong correlation between production budget and marketing expenses, our results suggest that total user votes is an indicator of a film's prominence or notability, which can be quantified by its promotional costs.
Submitted 16 January, 2014; v1 submitted 13 December, 2013;
originally announced December 2013.
-
The Possible Role of Resource Requirements and Academic Career-Choice Risk on Gender Differences in Publication Rate and Impact
Authors:
Jordi Duch,
Xiao Han T. Zeng,
Marta Sales-Pardo,
Filippo Radicchi,
Shayna Otis,
Teresa K. Woodruff,
Luis A. Nunes Amaral
Abstract:
Many studies demonstrate that there is still a significant gender bias, especially at higher career levels, in many areas including science, technology, engineering, and mathematics (STEM). We investigated field-dependent, gender-specific effects of the selective pressures individuals experience as they pursue a career in academia within seven STEM disciplines. We built a unique database that comprises 437,787 publications authored by 4,292 faculty members at top United States research universities. Our analyses reveal that gender differences in publication rate and impact are discipline-specific. Our results also support two hypotheses. First, the widely-reported lower publication rates of female faculty are correlated with the amount of research resources typically needed in the discipline considered, and thus may be explained by the lower level of institutional support historically received by females. Second, in disciplines where pursuing an academic position incurs greater career risk, female faculty tend to have a greater fraction of higher impact publications than males. Our findings have significant, field-specific, policy implications for achieving diversity at the faculty level within the STEM disciplines.
Submitted 13 December, 2012;
originally announced December 2012.
-
Rationality, irrationality and escalating behavior in lowest unique bid auctions
Authors:
Filippo Radicchi,
Andrea Baronchelli,
Luis A. N. Amaral
Abstract:
Information technology has revolutionized the traditional structure of markets. The removal of geographical and time constraints has fostered the growth of online auction markets, which now include millions of economic agents worldwide and annual transaction volumes in the billions of dollars. Here, we analyze bid histories of a little-studied type of online auction: the lowest unique bid auction. Similarly to what has been reported for foraging animals searching for scarce food, we find that agents adopt Lévy flight search strategies in their exploration of "bid space". The Lévy regime, which is characterized by a power-law decaying probability distribution of step lengths, holds over nearly three orders of magnitude. We develop a quantitative model for lowest unique bid online auctions that reveals that agents use nearly optimal bidding strategies. However, agents participating in these auctions do not optimize their financial gain. Indeed, as long as there are many auction participants, a rational profit-optimizing agent would choose not to participate in these auction markets.
Submitted 18 January, 2012; v1 submitted 2 May, 2011;
originally announced May 2011.
-
Characterizing Individual Communication Patterns
Authors:
R. Dean Malmgren,
Jake M. Hofman,
Luis A. N. Amaral,
Duncan J. Watts
Abstract:
The increasing availability of electronic communication data, such as that arising from e-mail exchange, presents social and information scientists with new possibilities for characterizing individual behavior and, by extension, identifying latent structure in human populations. Here, we propose a model of individual e-mail communication that is sufficiently rich to capture meaningful variability across individuals, while remaining simple enough to be interpretable. We show that the model, a cascading non-homogeneous Poisson process, can be formulated as a double-chain hidden Markov model, allowing us to use an efficient inference algorithm to estimate the model parameters from observed data. We then apply this model to two e-mail data sets consisting of 404 and 6,164 users, respectively, that were collected from two universities in different countries and years. We find that the resulting best-estimate parameter distributions for both data sets are surprisingly similar, indicating that at least some features of communication dynamics generalize beyond specific contexts. We also find that variability of individual behavior over time is significantly less than variability across the population, suggesting that individuals can be classified into persistent "types". We conclude that communication patterns may prove useful as an additional class of attribute data, complementing demographic and network data, for user classification and outlier detection--a point that we illustrate with an interpretable clustering of users based on their inferred model parameters.
Submitted 1 May, 2009;
originally announced May 2009.
-
A Poissonian explanation for heavy-tails in e-mail communication
Authors:
R. Dean Malmgren,
Daniel B. Stouffer,
Adilson E. Motter,
Luis A. N. Amaral
Abstract:
Patterns of deliberate human activity and behavior are of utmost importance in areas as diverse as disease spread, resource allocation, and emergency response. Because of its widespread availability and use, e-mail correspondence provides an attractive proxy for studying human activity. Recently, it was reported that the probability density for the inter-event time $\tau$ between consecutively sent e-mails decays asymptotically as $\tau^{-\alpha}$, with $\alpha \approx 1$. The slower than exponential decay of the inter-event time distribution suggests that deliberate human activity is inherently non-Poissonian. Here, we demonstrate that the approximate power-law scaling of the inter-event time distribution is a consequence of circadian and weekly cycles of human activity. We propose a cascading non-homogeneous Poisson process which explicitly integrates these periodic patterns in activity with an individual's tendency to continue participating in an activity. Using standard statistical techniques, we show that our model is consistent with the empirical data. Our findings may also provide insight into the origins of heavy-tailed distributions in other complex systems.
Submitted 5 January, 2009;
originally announced January 2009.