-
The incel lexicon: Deciphering the emergent cryptolect of a global misogynistic community
Authors:
Kelly Gothard,
David Rushing Dewhurst,
Joshua R. Minot,
Jane Lydia Adams,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
Evolving out of a gender-neutral framing of an involuntary celibate identity, the concept of 'incels' has come to refer to an online community of men who bear antipathy towards themselves, women, and society at large for their perceived inability to find and maintain sexual relationships. By exploring incel language use on Reddit, a global online message board, we contextualize the incel community's online expressions of misogyny and real-world acts of violence perpetrated against women. After assembling around three million comments from incel-themed Reddit channels, we analyze the temporal dynamics of a data-driven rank ordering of the glossary of phrases belonging to an emergent incel lexicon. Our study reveals the generation and normalization of an extensive coded misogynist vocabulary in service of the group's identity.
Submitted 25 May, 2021;
originally announced May 2021.
-
Structural time series grammar over variable blocks
Authors:
David Rushing Dewhurst
Abstract:
A structural time series model additively decomposes into generative, semantically-meaningful components, each of which depends on a vector of parameters. We demonstrate that considering each generative component together with its vector of parameters as a single latent structural time series node can simplify reasoning about collections of structural time series components. We then introduce a formal grammar over structural time series nodes and parameter vectors. Valid sentences in the grammar can be interpreted as generative structural time series models. An extension of the grammar can also express structural time series models that include changepoints, though these models are necessarily not generative. We demonstrate a preliminary implementation of the language generated by this grammar. We close with a discussion of possible future work.
Submitted 15 September, 2020;
originally announced September 2020.
-
Long-term word frequency dynamics derived from Twitter are corrupted: A bespoke approach to detecting and removing pathologies in ensembles of time series
Authors:
P. S. Dodds,
J. R. Minot,
M. V. Arnold,
T. Alshaabi,
J. L. Adams,
D. R. Dewhurst,
A. J. Reagan,
C. M. Danforth
Abstract:
Maintaining the integrity of long-term data collection is an essential scientific practice. As a field evolves, so too will that field's measurement instruments and data storage systems, as they are invented, improved upon, and made obsolete. For data streams generated by opaque sociotechnical systems which may have episodic and unknown internal rule changes, detecting and accounting for shifts in historical datasets requires vigilance and creative analysis. Here, we show that around 10% of day-scale word usage frequency time series for Twitter collected in real time for a set of roughly 10,000 frequently used words for over 10 years come from tweets with, in effect, corrupted language labels. We describe how we uncovered problematic signals while comparing word usage over varying time frames. We locate time points where Twitter switched on or off different kinds of language identification algorithms, and where data formats may have changed. We then show how we create a statistic for identifying and removing words with pathological time series. While our resulting process for removing 'bad' time series from ensembles of time series is particular, the approach leading to its construction may be generalizable.
Submitted 27 August, 2020; v1 submitted 25 August, 2020;
originally announced August 2020.
-
Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter
Authors:
Thayer Alshaabi,
Jane L. Adams,
Michael V. Arnold,
Joshua R. Minot,
David R. Dewhurst,
Andrew J. Reagan,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
In real time, social media data strongly imprints world events, popular culture, and day-to-day conversations by millions of ordinary people at a scale that is scarcely conventionalized and recorded. Vitally, and absent from many standard corpora such as books and news archives, sharing and commenting mechanisms are native to social media platforms, enabling us to quantify social amplification (i.e., popularity) of trending storylines and contemporary cultural phenomena. Here, we describe Storywrangler, a natural language processing instrument designed to carry out an ongoing, day-scale curation of over 100 billion tweets containing roughly 1 trillion 1-grams from 2008 to 2021. For each day, we break tweets into unigrams, bigrams, and trigrams spanning over 100 languages. We track n-gram usage frequencies, and generate Zipf distributions, for words, hashtags, handles, numerals, symbols, and emojis. We make the data set available through an interactive time series viewer, and as downloadable time series and daily distributions. Although Storywrangler leverages Twitter data, our method of extracting and tracking dynamic changes of n-grams can be extended to any similar social media platform. We showcase a few examples of the many possible avenues of study we aim to enable, including how social amplification can be visualized through 'contagiograms'. We also present some example case studies that bridge n-gram time series with disparate data sources to explore sociotechnical dynamics of famous individuals, box office success, and social unrest.
Submitted 16 July, 2021; v1 submitted 25 July, 2020;
originally announced July 2020.
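As a rough illustration of the n-gram counting and rank ordering described in the Storywrangler abstract, the following is a minimal sketch rather than the authors' pipeline. It assumes plain whitespace tokenization and lowercasing; the function names `ngrams` and `rank_ngrams` and the sample messages are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rank_ngrams(messages, n):
    """Count n-grams over all messages and return rows of
    (rank, n-gram, count, relative frequency), most frequent first.
    Ties are broken alphabetically so the ordering is deterministic."""
    counts = Counter()
    for message in messages:
        # Assumption: whitespace tokenization; a real pipeline would need
        # language-aware parsing of hashtags, handles, and emojis.
        counts.update(ngrams(message.lower().split(), n))
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (rank, gram, count, count / total)
        for rank, (gram, count) in enumerate(ranked, start=1)
    ]


# Hypothetical one-day sample of messages.
sample = ["the cat sat", "the cat ran", "a dog"]
unigram_table = rank_ngrams(sample, 1)
bigram_table = rank_ngrams(sample, 2)
```

Plotting count against rank on log-log axes for such a table gives the kind of Zipf distribution the abstract refers to.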
-
The sociospatial factors of death: Analyzing effects of geospatially-distributed variables in a Bayesian mortality model for Hong Kong
Authors:
Thayer Alshaabi,
David Rushing Dewhurst,
James P. Bagrow,
Peter Sheridan Dodds,
Christopher M. Danforth
Abstract:
Human mortality is in part a function of multiple socioeconomic factors that differ both spatially and temporally. Adjusting for other covariates, the human lifespan is positively associated with household wealth. However, the extent to which mortality in a geographical region is a function of socioeconomic factors in both that region and its neighbors is unclear. There is also little information on the temporal components of this relationship. Using the districts of Hong Kong over multiple census years as a case study, we demonstrate that there are differences in how wealth indicator variables are associated with longevity in (a) areas that are affluent but neighbored by socially deprived districts versus (b) wealthy areas surrounded by similarly wealthy districts. We also show that the inclusion of spatially-distributed variables reduces uncertainty in mortality rate predictions in each census year when compared with a baseline model. Our results suggest that geographic mortality models should incorporate nonlocal information (e.g., spatial neighbors) to lower the variance of their mortality estimates, and point to a more in-depth analysis of sociospatial spillover effects on mortality rates.
Submitted 25 January, 2021; v1 submitted 15 June, 2020;
originally announced June 2020.
-
Divergent modes of online collective attention to the COVID-19 pandemic are associated with future caseload variance
Authors:
David Rushing Dewhurst,
Thayer Alshaabi,
Michael V. Arnold,
Joshua R. Minot,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
Using a random 10% sample of tweets authored from 2019-09-01 through 2020-04-30, we analyze the dynamic behavior of words (1-grams) used on Twitter to describe the ongoing COVID-19 pandemic. Across 24 languages, we find two distinct dynamic regimes: One characterizing the rise and subsequent collapse in collective attention to the initial Coronavirus outbreak in late January, and a second that represents March COVID-19-related discourse. Aggregating countries by dominant language use, we find that volatility in the first dynamic regime is associated with future volatility in new cases of COVID-19 roughly three weeks (average 22.49 ± 3.26 days) later. Our results suggest that surveillance of change in usage of epidemiology-related words on social media may be useful in forecasting later change in disease case numbers, but we emphasize that our current findings are not causal or necessarily predictive.
Submitted 19 May, 2020; v1 submitted 7 April, 2020;
originally announced April 2020.
-
Hurricanes and hashtags: Characterizing online collective attention for natural disasters
Authors:
Michael V. Arnold,
David Rushing Dewhurst,
Thayer Alshaabi,
Joshua R. Minot,
Jane L. Adams,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
We study collective attention paid towards hurricanes through the lens of n-grams on Twitter, a social media platform with global reach. Using hurricane name mentions as a proxy for awareness, we find that the exogenous temporal dynamics are remarkably similar across storms, but that overall collective attention varies widely even among storms causing comparable deaths and damage. We construct 'hurricane attention maps' and observe that hurricanes causing deaths on (or economic damage to) the continental United States generate substantially more attention in English language tweets than those that do not. We find that a hurricane's Saffir-Simpson wind scale category assignment is strongly associated with the amount of attention it receives. Higher category storms receive higher proportional increases of attention per proportional increases in number of deaths or dollars of damage than lower category storms. The most damaging and deadly storms of the 2010s, Hurricanes Harvey and Maria, generated the most attention and were remembered the longest, respectively. On average, a category 5 storm receives 4.6 times more attention than a category 1 storm causing the same number of deaths and economic damage.
Submitted 31 March, 2020;
originally announced March 2020.
-
How the world's collective attention is being paid to a pandemic: COVID-19 related n-gram time series for 24 languages on Twitter
Authors:
T. Alshaabi,
J. R. Minot,
M. V. Arnold,
J. L. Adams,
D. R. Dewhurst,
A. J. Reagan,
R. Muhamad,
C. M. Danforth,
P. S. Dodds
Abstract:
In confronting the global spread of the coronavirus disease COVID-19 pandemic we must have coordinated medical, operational, and political responses. In all efforts, data is crucial. Fundamentally, and in the possible absence of a vaccine for 12 to 18 months, we need universal, well-documented testing for both the presence of the disease as well as confirmed recovery through serological tests for antibodies, and we need to track major socioeconomic indices. But we also need auxiliary data of all kinds, including data related to how populations are talking about the unfolding pandemic through news and stories. To in part help on the social media side, we curate a set of 2000 day-scale time series of 1- and 2-grams across 24 languages on Twitter that are most 'important' for April 2020 with respect to April 2019. We determine importance through our allotaxonometric instrument, rank-turbulence divergence. We make some basic observations about some of the time series, including a comparison to numbers of confirmed deaths due to COVID-19 over time. We broadly observe across all languages a peak for the language-specific word for 'virus' in January 2020 followed by a decline through February and then a surge through March and April. The world's collective attention dropped away while the virus spread out from China. We host the time series on Gitlab, updating them on a daily basis while relevant. Our main intent is for other researchers to use these time series to enhance whatever analyses may be of use during the pandemic as well as for retrospective investigations.
Submitted 6 January, 2021; v1 submitted 27 March, 2020;
originally announced March 2020.
-
The growing amplification of social media: Measuring temporal and social contagion dynamics for over 150 languages on Twitter for 2009-2020
Authors:
Thayer Alshaabi,
David R. Dewhurst,
Joshua R. Minot,
Michael V. Arnold,
Jane L. Adams,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
Working from a dataset of 118 billion messages running from the start of 2009 to the end of 2019, we identify and explore the relative daily use of over 150 languages on Twitter. We find that eight languages comprise 80% of all tweets, with English, Japanese, Spanish, and Portuguese being the most dominant. To quantify social spreading in each language over time, we compute the 'contagion ratio': The balance of retweets to organic messages. We find that for the most common languages on Twitter there is a growing tendency, though not universal, to retweet rather than share new content. By the end of 2019, the contagion ratios for half of the top 30 languages, including English and Spanish, had reached above 1 -- the naive contagion threshold. In 2019, the top 5 languages with the highest average daily ratios were, in order, Thai (7.3), Hindi, Tamil, Urdu, and Catalan, while the bottom 5 were Russian, Swedish, Esperanto, Cebuano, and Finnish (0.26). Further, we show that over time, the contagion ratios for most common languages are growing more strongly than those of rare languages.
Submitted 8 March, 2021; v1 submitted 7 March, 2020;
originally announced March 2020.
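The 'contagion ratio' defined in the abstract above (the balance of retweets to organic messages) can be sketched as follows. This is an illustrative example only: the function name `contagion_ratio`, the language codes, and the daily counts are hypothetical, chosen so that two of the ratios match the 7.3 and 0.26 figures quoted in the abstract.

```python
def contagion_ratio(retweet_count, organic_count):
    """Ratio of retweeted messages to organic (newly authored) messages.

    A value above 1 means retweets outnumber organic messages, which the
    abstract calls the naive contagion threshold.
    """
    if organic_count <= 0:
        raise ValueError("organic_count must be positive")
    return retweet_count / organic_count


# Hypothetical (retweets, organic messages) counts for one day, by language.
daily_counts = {
    "th": (7300, 1000),
    "en": (1200, 1000),
    "fi": (260, 1000),
}

ratios = {
    lang: contagion_ratio(retweets, organic)
    for lang, (retweets, organic) in daily_counts.items()
}

# Languages whose ratio exceeds the naive contagion threshold of 1.
above_threshold = sorted(lang for lang, ratio in ratios.items() if ratio > 1.0)
```

Averaging such daily ratios over a year would give per-language figures comparable in form to those reported in the abstract.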
-
Evolving ab initio trading strategies in heterogeneous environments
Authors:
David Rushing Dewhurst,
Yi Li,
Alexander Bogdan,
Jasmine Geng
Abstract:
Securities markets are quintessential complex adaptive systems in which heterogeneous agents compete in an attempt to maximize returns. Species of trading agents are also subject to evolutionary pressure as entire classes of strategies become obsolete and new classes emerge. Using an agent-based model of interacting heterogeneous agents as a flexible environment that can endogenously model many diverse market conditions, we subject deep neural networks to evolutionary pressure to create dominant trading agents. After analyzing the performance of these agents and noting the emergence of anomalous superdiffusion through the evolutionary process, we construct a method to turn high-fitness agents into trading algorithms. We backtest these trading algorithms on real high-frequency foreign exchange data, demonstrating that elite trading algorithms are consistently profitable in a variety of market conditions -- even though these algorithms had never before been exposed to real financial data. These results provide evidence to suggest that developing ab initio trading strategies by repeated simulation and evolution in a mechanistic market model may be a practical alternative to explicitly training models with past observed market data.
Submitted 19 December, 2019;
originally announced December 2019.
-
Fame and Ultrafame: Measuring and comparing daily levels of 'being talked about' for United States' presidents, their rivals, God, countries, and K-pop
Authors:
Peter Sheridan Dodds,
Joshua R. Minot,
Michael V. Arnold,
Thayer Alshaabi,
Jane Lydia Adams,
David Rushing Dewhurst,
Andrew J. Reagan,
Christopher M. Danforth
Abstract:
When building a global brand of any kind -- a political actor, clothing style, or belief system -- developing widespread awareness is a primary goal. Short of knowing any of the stories or products of a brand, being talked about in whatever fashion -- raw fame -- is, as Oscar Wilde would have it, better than not being talked about at all. Here, we measure, examine, and contrast the day-to-day raw fame dynamics on Twitter for US Presidents and major US Presidential candidates from 2008 to 2020: Barack Obama, John McCain, Mitt Romney, Hillary Clinton, Donald Trump, and Joe Biden. We assign 'lexical fame' to be the number and (Zipfian) rank of the (lowercased) mentions made for each individual across all languages. We show that all five political figures have at some point reached extraordinary volume levels of what we define to be 'lexical ultrafame': An overall rank of approximately 300 or less which is largely the realm of function words and demarcated by the highly stable rank of 'god'. By this measure, 'trump' has become enduringly ultrafamous, from the 2016 election on. We use typical ranks for country names and function words as standards to improve perception of scale. We quantify relative fame rates and find that in the eight weeks leading up to the 2008 and 2012 elections, 'obama' held a 1000:757 volume ratio over 'mccain' and 1000:892 over 'romney', well short of the 1000:544 and 1000:504 volumes favoring 'trump' over 'hillary' and 'biden' in the eight weeks leading up to the 2016 and 2020 elections. Finally, we track how only one other entity has more sustained ultrafame than 'trump' on Twitter: The K-pop (Korean pop) band BTS. We chart the dramatic rise of BTS, finding their Twitter handle '@bts_twt' has been able to compete with 'a' and 'the'. Our findings for BTS more generally point to K-pop's growing economic, social, and political power.
Submitted 29 October, 2021; v1 submitted 30 September, 2019;
originally announced October 2019.
-
The shocklet transform: A decomposition method for the identification of local, mechanism-driven dynamics in sociotechnical time series
Authors:
David Rushing Dewhurst,
Thayer Alshaabi,
Dilan Kiley,
Michael V. Arnold,
Joshua R. Minot,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
We introduce a qualitative, shape-based, timescale-independent time-domain transform used to extract local dynamics from sociotechnical time series -- termed the Discrete Shocklet Transform (DST) -- and an associated similarity search routine, the Shocklet Transform And Ranking (STAR) algorithm, that indicates time windows during which panels of time series display qualitatively-similar anomalous behavior. After distinguishing our algorithms from other methods used in anomaly detection and time series similarity search, such as the matrix profile, seasonal-hybrid ESD, and discrete wavelet transform-based procedures, we demonstrate the DST's ability to identify mechanism-driven dynamics at a wide range of timescales and its relative insensitivity to functional parameterization. As an application, we analyze a sociotechnical data source (usage frequencies for a subset of words on Twitter) and highlight our algorithms' utility by using them to extract both a typology of mechanistic local dynamics and a data-driven narrative of socially-important events as perceived by English-language Twitter.
Submitted 18 December, 2019; v1 submitted 27 June, 2019;
originally announced June 2019.
-
Selection mechanisms affect volatility in evolving markets
Authors:
David Rushing Dewhurst,
Michael Vincent Arnold,
Colin Michael Van Oort
Abstract:
Financial asset markets are sociotechnical systems whose constituent agents are subject to evolutionary pressure as unprofitable agents exit the marketplace and more profitable agents continue to trade assets. Using a population of evolving zero-intelligence agents and a frequent batch auction price-discovery mechanism as substrate, we analyze the role played by evolutionary selection mechanisms in determining macro-observable market statistics. In particular, we show that selection mechanisms incorporating a local fitness-proportionate component are associated with high correlation between a micro, risk-aversion parameter and a commonly-used macro-volatility statistic, while a purely quantile-based selection mechanism shows significantly less correlation.
Submitted 27 April, 2019; v1 submitted 13 December, 2018;
originally announced December 2018.