Search | arXiv e-print repository

arXiv:2105.12006 [pdf, other]

The incel lexicon: Deciphering the emergent cryptolect of a global misogynistic community

Authors: Kelly Gothard, David Rushing Dewhurst, Joshua R. Minot, Jane Lydia Adams, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: Evolving out of a gender-neutral framing of an involuntary celibate identity, the concept of `incels' has come to refer to an online community of men who bear antipathy towards themselves, women, and society-at-large for their perceived inability to find and maintain sexual relationships. By exploring incel language use on Reddit, a global online message board, we contextualize the incel community… ▽ More Evolving out of a gender-neutral framing of an involuntary celibate identity, the concept of `incels' has come to refer to an online community of men who bear antipathy towards themselves, women, and society-at-large for their perceived inability to find and maintain sexual relationships. By exploring incel language use on Reddit, a global online message board, we contextualize the incel community's online expressions of misogyny and real-world acts of violence perpetrated against women. After assembling around three million comments from incel-themed Reddit channels, we analyze the temporal dynamics of a data driven rank ordering of the glossary of phrases belonging to an emergent incel lexicon. Our study reveals the generation and normalization of an extensive coded misogynist vocabulary in service of the group's identity. △ Less

Submitted 25 May, 2021; originally announced May 2021.

Comments: 18 pages, 11 figures

arXiv:2009.06865 [pdf, other]

Structural time series grammar over variable blocks

Authors: David Rushing Dewhurst

Abstract: A structural time series model additively decomposes into generative, semantically-meaningful components, each of which depends on a vector of parameters. We demonstrate that considering each generative component together with its vector of parameters as a single latent structural time series node can simplify reasoning about collections of structural time series components. We then introduce a fo… ▽ More A structural time series model additively decomposes into generative, semantically-meaningful components, each of which depends on a vector of parameters. We demonstrate that considering each generative component together with its vector of parameters as a single latent structural time series node can simplify reasoning about collections of structural time series components. We then introduce a formal grammar over structural time series nodes and parameter vectors. Valid sentences in the grammar can be interpreted as generative structural time series models. An extension of the grammar can also express structural time series models that include changepoints, though these models are necessarily not generative. We demonstrate a preliminary implementation of the language generated by this grammar. We close with a discussion of possible future work. △ Less

Submitted 15 September, 2020; originally announced September 2020.

Comments: 7 pages, 2 figures, 1 table

ACM Class: D.3; G.3; I.2

arXiv:2008.11305 [pdf, other]

Long-term word frequency dynamics derived from Twitter are corrupted: A bespoke approach to detecting and removing pathologies in ensembles of time series

Authors: P. S. Dodds, J. R. Minot, M. V. Arnold, T. Alshaabi, J. L. Adams, D. R. Dewhurst, A. J. Reagan, C. M. Danforth

Abstract: Maintaining the integrity of long-term data collection is an essential scientific practice. As a field evolves, so too will that field's measurement instruments and data storage systems, as they are invented, improved upon, and made obsolete. For data streams generated by opaque sociotechnical systems which may have episodic and unknown internal rule changes, detecting and accounting for shifts in… ▽ More Maintaining the integrity of long-term data collection is an essential scientific practice. As a field evolves, so too will that field's measurement instruments and data storage systems, as they are invented, improved upon, and made obsolete. For data streams generated by opaque sociotechnical systems which may have episodic and unknown internal rule changes, detecting and accounting for shifts in historical datasets requires vigilance and creative analysis. Here, we show that around 10\% of day-scale word usage frequency time series for Twitter collected in real time for a set of roughly 10,000 frequently used words for over 10 years come from tweets with, in effect, corrupted language labels. We describe how we uncovered problematic signals while comparing word usage over varying time frames. We locate time points where Twitter switched on or off different kinds of language identification algorithms, and where data formats may have changed. We then show how we create a statistic for identifying and removing words with pathological time series. While our resulting process for removing `bad' time series from ensembles of time series is particular, the approach leading to its construction may be generalizeable. △ Less

Submitted 27 August, 2020; v1 submitted 25 August, 2020; originally announced August 2020.

Comments: 8 pages, 5 figures

arXiv:2007.12988 [pdf, other]

doi 10.1126/sciadv.abe6534

Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter

Authors: Thayer Alshaabi, Jane L. Adams, Michael V. Arnold, Joshua R. Minot, David R. Dewhurst, Andrew J. Reagan, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: In real-time, social media data strongly imprints world events, popular culture, and day-to-day conversations by millions of ordinary people at a scale that is scarcely conventionalized and recorded. Vitally, and absent from many standard corpora such as books and news archives, sharing and commenting mechanisms are native to social media platforms, enabling us to quantify social amplification (i.… ▽ More In real-time, social media data strongly imprints world events, popular culture, and day-to-day conversations by millions of ordinary people at a scale that is scarcely conventionalized and recorded. Vitally, and absent from many standard corpora such as books and news archives, sharing and commenting mechanisms are native to social media platforms, enabling us to quantify social amplification (i.e., popularity) of trending storylines and contemporary cultural phenomena. Here, we describe Storywrangler, a natural language processing instrument designed to carry out an ongoing, day-scale curation of over 100 billion tweets containing roughly 1 trillion 1-grams from 2008 to 2021. For each day, we break tweets into unigrams, bigrams, and trigrams spanning over 100 languages. We track n-gram usage frequencies, and generate Zipf distributions, for words, hashtags, handles, numerals, symbols, and emojis. We make the data set available through an interactive time series viewer, and as downloadable time series and daily distributions. Although Storywrangler leverages Twitter data, our method of extracting and tracking dynamic changes of n-grams can be extended to any similar social media platform. We showcase a few examples of the many possible avenues of study we aim to enable including how social amplification can be visualized through 'contagiograms'. We also present some example case studies that bridge n-gram time series with disparate data sources to explore sociotechnical dynamics of famous individuals, box office success, and social unrest. △ Less

Submitted 16 July, 2021; v1 submitted 25 July, 2020; originally announced July 2020.

Comments: Main text: 15 pages, 6 figures; Supplementary text: 23 pages, 11 figures, 15 tables. Website: https://storywrangling.org/

Journal ref: Sci.Adv. 7 eabe6534 (2021)

arXiv:2006.08527 [pdf, other]

doi 10.1371/journal.pone.0247795

The sociospatial factors of death: Analyzing effects of geospatially-distributed variables in a Bayesian mortality model for Hong Kong

Authors: Thayer Alshaabi, David Rushing Dewhurst, James P. Bagrow, Peter Sheridan Dodds, Christopher M. Danforth

Abstract: Human mortality is in part a function of multiple socioeconomic factors that differ both spatially and temporally. Adjusting for other covariates, the human lifespan is positively associated with household wealth. However, the extent to which mortality in a geographical region is a function of socioeconomic factors in both that region and its neighbors is unclear. There is also little information… ▽ More Human mortality is in part a function of multiple socioeconomic factors that differ both spatially and temporally. Adjusting for other covariates, the human lifespan is positively associated with household wealth. However, the extent to which mortality in a geographical region is a function of socioeconomic factors in both that region and its neighbors is unclear. There is also little information on the temporal components of this relationship. Using the districts of Hong Kong over multiple census years as a case study, we demonstrate that there are differences in how wealth indicator variables are associated with longevity in (a) areas that are affluent but neighbored by socially deprived districts versus (b) wealthy areas surrounded by similarly wealthy districts. We also show that the inclusion of spatially-distributed variables reduces uncertainty in mortality rate predictions in each census year when compared with a baseline model. Our results suggest that geographic mortality models should incorporate nonlocal information (e.g., spatial neighbors) to lower the variance of their mortality estimates, and point to a more in-depth analysis of sociospatial spillover effects on mortality rates. △ Less

Submitted 25 January, 2021; v1 submitted 15 June, 2020; originally announced June 2020.

Comments: 26 pages (15 main, 11 appendix), 22 figures (6 main, 11 appendix), 2 tables

arXiv:2004.03516 [pdf, other]

Divergent modes of online collective attention to the COVID-19 pandemic are associated with future caseload variance

Authors: David Rushing Dewhurst, Thayer Alshaabi, Michael V. Arnold, Joshua R. Minot, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: Using a random 10% sample of tweets authored from 2019-09-01 through 2020-04-30, we analyze the dynamic behavior of words (1-grams) used on Twitter to describe the ongoing COVID-19 pandemic. Across 24 languages, we find two distinct dynamic regimes: One characterizing the rise and subsequent collapse in collective attention to the initial Coronavirus outbreak in late January, and a second that rep… ▽ More Using a random 10% sample of tweets authored from 2019-09-01 through 2020-04-30, we analyze the dynamic behavior of words (1-grams) used on Twitter to describe the ongoing COVID-19 pandemic. Across 24 languages, we find two distinct dynamic regimes: One characterizing the rise and subsequent collapse in collective attention to the initial Coronavirus outbreak in late January, and a second that represents March COVID-19-related discourse. Aggregating countries by dominant language use, we find that volatility in the first dynamic regime is associated with future volatility in new cases of COVID-19 roughly three weeks (average 22.49 $\pm$ 3.26 days) later. Our results suggest that surveillance of change in usage of epidemiology-related words on social media may be useful in forecasting later change in disease case numbers, but we emphasize that our current findings are not causal or necessarily predictive. △ Less

Submitted 19 May, 2020; v1 submitted 7 April, 2020; originally announced April 2020.

Comments: 12 + 4 pages, 11 + 4 figures, code + data + figures will soon be available at http://compstorylab.org/covid19ngrams/

arXiv:2003.14291 [pdf, other]

doi 10.1371/journal.pone.0251762

Hurricanes and hashtags: Characterizing online collective attention for natural disasters

Authors: Michael V. Arnold, David Rushing Dewhurst, Thayer Alshaabi, Joshua R. Minot, Jane L. Adams, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: We study collective attention paid towards hurricanes through the lens of $n$-grams on Twitter, a social media platform with global reach. Using hurricane name mentions as a proxy for awareness, we find that the exogenous temporal dynamics are remarkably similar across storms, but that overall collective attention varies widely even among storms causing comparable deaths and damage. We construct `… ▽ More We study collective attention paid towards hurricanes through the lens of $n$-grams on Twitter, a social media platform with global reach. Using hurricane name mentions as a proxy for awareness, we find that the exogenous temporal dynamics are remarkably similar across storms, but that overall collective attention varies widely even among storms causing comparable deaths and damage. We construct `hurricane attention maps' and observe that hurricanes causing deaths on (or economic damage to) the continental United States generate substantially more attention in English language tweets than those that do not. We find that a hurricane's Saffir-Simpson wind scale category assignment is strongly associated with the amount of attention it receives. Higher category storms receive higher proportional increases of attention per proportional increases in number of deaths or dollars of damage, than lower category storms. The most damaging and deadly storms of the 2010s, Hurricanes Harvey and Maria, generated the most attention and were remembered the longest, respectively. On average, a category 5 storm receives 4.6 times more attention than a category 1 storm causing the same number of deaths and economic damage. △ Less

Submitted 31 March, 2020; originally announced March 2020.

Comments: 31 pages (14 main, 17 Supplemental), 19 figures (5 main, 14 appendix)

arXiv:2003.12614 [pdf, other]

doi 10.1371/journal.pone.0244476

How the world's collective attention is being paid to a pandemic: COVID-19 related n-gram time series for 24 languages on Twitter

Authors: T. Alshaabi, J. R. Minot, M. V. Arnold, J. L. Adams, D. R. Dewhurst, A. J. Reagan, R. Muhamad, C. M. Danforth, P. S. Dodds

Abstract: In confronting the global spread of the coronavirus disease COVID-19 pandemic we must have coordinated medical, operational, and political responses. In all efforts, data is crucial. Fundamentally, and in the possible absence of a vaccine for 12 to 18 months, we need universal, well-documented testing for both the presence of the disease as well as confirmed recovery through serological tests for… ▽ More In confronting the global spread of the coronavirus disease COVID-19 pandemic we must have coordinated medical, operational, and political responses. In all efforts, data is crucial. Fundamentally, and in the possible absence of a vaccine for 12 to 18 months, we need universal, well-documented testing for both the presence of the disease as well as confirmed recovery through serological tests for antibodies, and we need to track major socioeconomic indices. But we also need auxiliary data of all kinds, including data related to how populations are talking about the unfolding pandemic through news and stories. To in part help on the social media side, we curate a set of 2000 day-scale time series of 1- and 2-grams across 24 languages on Twitter that are most 'important' for April 2020 with respect to April 2019. We determine importance through our allotaxonometric instrument, rank-turbulence divergence. We make some basic observations about some of the time series, including a comparison to numbers of confirmed deaths due to COVID-19 over time. We broadly observe across all languages a peak for the language-specific word for 'virus' in January 2020 followed by a decline through February and then a surge through March and April. The world's collective attention dropped away while the virus spread out from China. We host the time series on Gitlab, updating them on a daily basis while relevant. Our main intent is for other researchers to use these time series to enhance whatever analyses that may be of use during the pandemic as well as for retrospective investigations. △ Less

Submitted 6 January, 2021; v1 submitted 27 March, 2020; originally announced March 2020.

Comments: 13 pages, 6 figures, 3 tables, website: http://compstorylab.org/covid19ngrams/

arXiv:2003.03667 [pdf, other]

doi 10.1140/epjds/s13688-021-00271-0

The growing amplification of social media: Measuring temporal and social contagion dynamics for over 150 languages on Twitter for 2009-2020

Authors: Thayer Alshaabi, David R. Dewhurst, Joshua R. Minot, Michael V. Arnold, Jane L. Adams, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: Working from a dataset of 118 billion messages running from the start of 2009 to the end of 2019, we identify and explore the relative daily use of over 150 languages on Twitter. We find that eight languages comprise 80% of all tweets, with English, Japanese, Spanish, and Portuguese being the most dominant. To quantify social spreading in each language over time, we compute the 'contagion ratio':… ▽ More Working from a dataset of 118 billion messages running from the start of 2009 to the end of 2019, we identify and explore the relative daily use of over 150 languages on Twitter. We find that eight languages comprise 80% of all tweets, with English, Japanese, Spanish, and Portuguese being the most dominant. To quantify social spreading in each language over time, we compute the 'contagion ratio': The balance of retweets to organic messages. We find that for the most common languages on Twitter there is a growing tendency, though not universal, to retweet rather than share new content. By the end of 2019, the contagion ratios for half of the top 30 languages, including English and Spanish, had reached above 1 -- the naive contagion threshold. In 2019, the top 5 languages with the highest average daily ratios were, in order, Thai (7.3), Hindi, Tamil, Urdu, and Catalan, while the bottom 5 were Russian, Swedish, Esperanto, Cebuano, and Finnish (0.26). Further, we show that over time, the contagion ratios for most common languages are growing more strongly than those of rare languages. △ Less

Submitted 8 March, 2021; v1 submitted 7 March, 2020; originally announced March 2020.

Comments: 26 pages (15 main, 11 appendix), 13 figures (6 main, 7 appendix), and 4 online appendices available at http://compstorylab.org/storywrangler/papers/tlid/

arXiv:1912.09524 [pdf, other]

Evolving ab initio trading strategies in heterogeneous environments

Authors: David Rushing Dewhurst, Yi Li, Alexander Bogdan, Jasmine Geng

Abstract: Securities markets are quintessential complex adaptive systems in which heterogeneous agents compete in an attempt to maximize returns. Species of trading agents are also subject to evolutionary pressure as entire classes of strategies become obsolete and new classes emerge. Using an agent-based model of interacting heterogeneous agents as a flexible environment that can endogenously model many di… ▽ More Securities markets are quintessential complex adaptive systems in which heterogeneous agents compete in an attempt to maximize returns. Species of trading agents are also subject to evolutionary pressure as entire classes of strategies become obsolete and new classes emerge. Using an agent-based model of interacting heterogeneous agents as a flexible environment that can endogenously model many diverse market conditions, we subject deep neural networks to evolutionary pressure to create dominant trading agents. After analyzing the performance of these agents and noting the emergence of anomalous superdiffusion through the evolutionary process, we construct a method to turn high-fitness agents into trading algorithms. We backtest these trading algorithms on real high-frequency foreign exchange data, demonstrating that elite trading algorithms are consistently profitable in a variety of market conditions---even though these algorithms had never before been exposed to real financial data. These results provide evidence to suggest that developing \textit{ab initio} trading strategies by repeated simulation and evolution in a mechanistic market model may be a practical alternative to explicitly training models with past observed market data. △ Less

Submitted 19 December, 2019; originally announced December 2019.

Comments: 20 pages (10 main body, 10 appendix), 11 figures (6 main body, 5 appendix), open-source matching engine implementation available at https://gitlab.com/daviddewhurst/verdantcurve

arXiv:1910.00149 [pdf, other]

Fame and Ultrafame: Measuring and comparing daily levels of `being talked about' for United States' presidents, their rivals, God, countries, and K-pop

Authors: Peter Sheridan Dodds, Joshua R. Minot, Michael V. Arnold, Thayer Alshaabi, Jane Lydia Adams, David Rushing Dewhurst, Andrew J. Reagan, Christopher M. Danforth

Abstract: When building a global brand of any kind -- a political actor, clothing style, or belief system -- developing widespread awareness is a primary goal. Short of knowing any of the stories or products of a brand, being talked about in whatever fashion -- raw fame -- is, as Oscar Wilde would have it, better than not being talked about at all. Here, we measure, examine, and contrast the day-to-day raw… ▽ More When building a global brand of any kind -- a political actor, clothing style, or belief system -- developing widespread awareness is a primary goal. Short of knowing any of the stories or products of a brand, being talked about in whatever fashion -- raw fame -- is, as Oscar Wilde would have it, better than not being talked about at all. Here, we measure, examine, and contrast the day-to-day raw fame dynamics on Twitter for US Presidents and major US Presidential candidates from 2008 to 2020: Barack Obama, John McCain, Mitt Romney, Hillary Clinton, Donald Trump, and Joe Biden. We assign ``lexical fame'' to be the number and (Zipfian) rank of the (lowercased) mentions made for each individual across all languages. We show that all five political figures have at some point reached extraordinary volume levels of what we define to be ``lexical ultrafame'': An overall rank of approximately 300 or less which is largely the realm of function words and demarcated by the highly stable rank of `god'. By this measure, `trump' has become enduringly ultrafamous, from the 2016 election on. We use typical ranks for country names and function words as standards to improve perception of scale. We quantify relative fame rates and find that in the eight weeks leading up the 2008 and 2012 elections, `obama' held a 1000:757 volume ratio over `mccain' and 1000:892 over `romney', well short of the 1000:544 and 1000:504 volumes favoring `trump' over `hillary' and `biden' in the 8 weeks leading up to the 2016 and 2020 elections. Finally, we track how only one other entity has more sustained ultrafame than `trump' on Twitter: The K-pop (Korean pop) band BTS. We chart the dramatic rise of BTS, finding their Twitter handle `@bts\_twt' has been able to compete with `a' and `the'. Our findings for BTS more generally point to K-pop's growing economic, social, and political power. △ Less

Submitted 29 October, 2021; v1 submitted 30 September, 2019; originally announced October 2019.

Comments: 31 pages (21 pages main text, 10 pages appendix), 8 figures (7 in main text, 1 in appendix), 10 tables (1 in main text, 9 in appendix)

arXiv:1906.11710 [pdf, other]

The shocklet transform: A decomposition method for the identification of local, mechanism-driven dynamics in sociotechnical time series

Authors: David Rushing Dewhurst, Thayer Alshaabi, Dilan Kiley, Michael V. Arnold, Joshua R. Minot, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: We introduce a qualitative, shape-based, timescale-independent time-domain transform used to extract local dynamics from sociotechnical time series---termed the Discrete Shocklet Transform (DST)---and an associated similarity search routine, the Shocklet Transform And Ranking (STAR) algorithm, that indicates time windows during which panels of time series display qualitatively-similar anomalous be… ▽ More We introduce a qualitative, shape-based, timescale-independent time-domain transform used to extract local dynamics from sociotechnical time series---termed the Discrete Shocklet Transform (DST)---and an associated similarity search routine, the Shocklet Transform And Ranking (STAR) algorithm, that indicates time windows during which panels of time series display qualitatively-similar anomalous behavior. After distinguishing our algorithms from other methods used in anomaly detection and time series similarity search, such as the matrix profile, seasonal-hybrid ESD, and discrete wavelet transform-based procedures, we demonstrate the DST's ability to identify mechanism-driven dynamics at a wide range of timescales and its relative insensitivity to functional parameterization. As an application, we analyze a sociotechnical data source (usage frequencies for a subset of words on Twitter) and highlight our algorithms' utility by using them to extract both a typology of mechanistic local dynamics and a data-driven narrative of socially-important events as perceived by English-language Twitter. △ Less

Submitted 18 December, 2019; v1 submitted 27 June, 2019; originally announced June 2019.

Comments: 29 pages (20 body, 9 appendix), 20 figures (13 body, 7 appendix), three online appendices available at http://compstorylab.org/shocklets/ (two displaying interactive visualizations and one containing over 10,000 figures), open-source implementation of STAR algorithm and discrete shocklet transform available at https://gitlab.com/compstorylab/discrete-shocklet-transform

arXiv:1812.05657 [pdf, other]

doi 10.1145/3321707.3321734

Selection mechanisms affect volatility in evolving markets

Authors: David Rushing Dewhurst, Michael Vincent Arnold, Colin Michael Van Oort

Abstract: Financial asset markets are sociotechnical systems whose constituent agents are subject to evolutionary pressure as unprofitable agents exit the marketplace and more profitable agents continue to trade assets. Using a population of evolving zero-intelligence agents and a frequent batch auction price-discovery mechanism as substrate, we analyze the role played by evolutionary selection mechanisms i… ▽ More Financial asset markets are sociotechnical systems whose constituent agents are subject to evolutionary pressure as unprofitable agents exit the marketplace and more profitable agents continue to trade assets. Using a population of evolving zero-intelligence agents and a frequent batch auction price-discovery mechanism as substrate, we analyze the role played by evolutionary selection mechanisms in determining macro-observable market statistics. In particular, we show that selection mechanisms incorporating a local fitness-proportionate component are associated with high correlation between a micro, risk-aversion parameter and a commonly-used macro-volatility statistic, while a purely quantile-based selection mechanism shows significantly less correlation. △ Less

Submitted 27 April, 2019; v1 submitted 13 December, 2018; originally announced December 2018.

Comments: 9 pages, 7 figures, to appear in proceedings of GECCO 2019 as a full paper

Journal ref: In Proceedings of the Genetic and Evolutionary Computation Conference (2019) 90-98

Showing 1–13 of 13 results for author: Dewhurst, D R