-
Digital Gatekeepers: Google's Role in Curating Hashtags and Subreddits
Authors:
Amrit Poudel,
Yifan Ding,
Jurgen Pfeffer,
Tim Weninger
Abstract:
Search engines play a crucial role as digital gatekeepers, shaping the visibility of Web and social media content through algorithmic curation. This study investigates how search engines like Google selectively promote or suppress certain hashtags and subreddits, impacting the information users encounter. By comparing search engine results with nonsampled data from Reddit and Twitter/X, we reveal systematic biases in content visibility. Google's algorithms tend to suppress subreddits and hashtags related to sexually explicit material, conspiracy theories, advertisements, and cryptocurrencies, while promoting content associated with higher engagement. These findings suggest that Google's gatekeeping practices influence public discourse by curating the social media narratives available to users.
Submitted 17 June, 2025;
originally announced June 2025.
-
Citations and Trust in LLM Generated Responses
Authors:
Yifan Ding,
Matthew Facciani,
Amrit Poudel,
Ellen Joyce,
Salvador Aguinaga,
Balaji Veeramani,
Sanmitra Bhattacharya,
Tim Weninger
Abstract:
Question answering systems are rapidly advancing, but their opaque nature may impact user trust. We explored trust through an anti-monitoring framework, where trust is predicted to be correlated with the presence of citations and inversely related to checking citations. We tested this hypothesis with a live question-answering experiment that presented text responses generated using a commercial chatbot along with varying citations (zero, one, or five), both relevant and random, and recorded if participants checked the citations and their self-reported trust in the generated responses. We found a significant increase in trust when citations were present, a result that held true even when the citations were random; we also found a significant decrease in trust when participants checked the citations. These results highlight the importance of citations in enhancing trust in AI-generated content.
Submitted 2 January, 2025;
originally announced January 2025.
-
MedCodER: A Generative AI Assistant for Medical Coding
Authors:
Krishanu Das Baksi,
Elijah Soba,
John J. Higgins,
Ravi Saini,
Jaden Wood,
Jane Cook,
Jack Scott,
Nirmala Pudota,
Tim Weninger,
Edward Bowen,
Sanmitra Bhattacharya
Abstract:
Medical coding is essential for standardizing clinical data and communication but is often time-consuming and prone to errors. Traditional Natural Language Processing (NLP) methods struggle with automating coding due to the large label space, lengthy text inputs, and the absence of supporting evidence annotations that justify code selection. Recent advancements in Generative Artificial Intelligence (AI) offer promising solutions to these challenges. In this work, we introduce MedCodER, a Generative AI framework for automatic medical coding that leverages extraction, retrieval, and re-ranking techniques as core components. MedCodER achieves a micro-F1 score of 0.60 on International Classification of Diseases (ICD) code prediction, significantly outperforming state-of-the-art methods. Additionally, we present a new dataset containing medical records annotated with disease diagnoses, ICD codes, and supporting evidence texts (https://doi.org/10.5281/zenodo.13308316). Ablation tests confirm that MedCodER's performance depends on the integration of each of its aforementioned components, as performance declines when these components are evaluated in isolation.
Submitted 18 September, 2024;
originally announced September 2024.
-
Fear and Loathing on the Frontline: Decoding the Language of Othering by Russia-Ukraine War Bloggers
Authors:
Patrick Gerard,
William Theisen,
Tim Weninger,
Kristina Lerman
Abstract:
Othering, the act of portraying outgroups as fundamentally different from the ingroup, often escalates into framing them as existential threats--fueling intergroup conflict and justifying exclusion and violence. These dynamics are alarmingly pervasive, spanning from the extreme historical examples of genocides against minorities in Germany and Rwanda to the ongoing violence and rhetoric targeting migrants in the US and Europe. While concepts like hate speech and fear speech have been explored in existing literature, they capture only part of this broader and more nuanced dynamic which can often be harder to detect, particularly in online speech and propaganda. To address this challenge, we introduce a novel computational framework that leverages large language models (LLMs) to quantify othering across diverse contexts, extending beyond traditional linguistic indicators of hostility. Applying the model to real-world data from Telegram war bloggers and political discussions on Gab reveals how othering escalates during conflicts, interacts with moral language, and garners significant attention, particularly during periods of crisis. Our framework, designed to offer deeper insights into othering dynamics, combines with a rapid adaptation process to provide essential tools for mitigating othering's adverse impacts on social cohesion.
Submitted 19 September, 2024;
originally announced September 2024.
-
Modeling Information Narrative Detection and Evolution on Telegram during the Russia-Ukraine War
Authors:
Patrick Gerard,
Svitlana Volkova,
Louis Penafiel,
Kristina Lerman,
Tim Weninger
Abstract:
Following the Russian Federation's full-scale invasion of Ukraine in February 2022, a multitude of information narratives emerged within both pro-Russian and pro-Ukrainian communities online. As the conflict progresses, so too do the information narratives, constantly adapting and influencing local and global community perceptions and attitudes. This dynamic nature of the evolving information environment (IE) underscores a critical need to fully discern how narratives evolve and affect online communities. Existing research, however, often fails to capture information narrative evolution, overlooking both the fluid nature of narratives and the internal mechanisms that drive their evolution. Recognizing this, we introduce a novel approach designed to both model narrative evolution and uncover the underlying mechanisms driving them. In this work we perform a comparative discourse analysis across communities on Telegram covering the initial three months following the invasion. First, we uncover substantial disparities in narratives and perceptions between pro-Russian and pro-Ukrainian communities. Then, we probe deeper into prevalent narratives of each group, identifying key themes and examining the underlying mechanisms fueling their evolution. Finally, we explore influences and factors that may shape the development and spread of narratives.
Submitted 11 September, 2024;
originally announced September 2024.
-
Learning from Litigation: Graphs and LLMs for Retrieval and Reasoning in eDiscovery
Authors:
Sounak Lahiri,
Sumit Pai,
Tim Weninger,
Sanmitra Bhattacharya
Abstract:
Electronic Discovery (eDiscovery) requires identifying relevant documents from vast collections for legal production requests. While artificial intelligence (AI) and natural language processing (NLP) have improved document review efficiency, current methods still struggle with legal entities, citations, and complex legal artifacts. To address these challenges, we introduce DISCOvery Graph (DISCOG), an emerging system that integrates knowledge graphs for enhanced document ranking and classification, augmented by LLM-driven reasoning. DISCOG outperforms strong baselines in F1-score, precision, and recall across both balanced and imbalanced datasets. In real-world deployments, it has reduced litigation-related document review costs by approximately 98%, demonstrating significant business impact.
Submitted 13 June, 2025; v1 submitted 29 May, 2024;
originally announced May 2024.
-
Reputation Transfer in the Twitter Diaspora
Authors:
Kristina Radivojevic,
DJ Adams,
Griffin Laszlo,
Felixander Kery,
Tim Weninger
Abstract:
Social media platforms have witnessed a dynamic landscape of user migration in recent years, fueled by changes in ownership, policy, and user preferences. This paper explores the phenomenon of user migration from established platforms like X/Twitter to emerging alternatives such as Threads, Mastodon, and Truth Social. Leveraging a large dataset from X/Twitter, we investigate the extent of user departure from X/Twitter and the destinations they migrate to. Additionally, we examine whether a user's reputation on one platform correlates with their reputation on another, shedding light on the transferability of digital reputation across social media ecosystems. Overall, we find that users with a large following on X/Twitter are more likely to migrate to another platform; and that their reputation on X/Twitter is highly correlated with reputations on Threads, but not Mastodon or Truth Social.
Submitted 20 May, 2024;
originally announced May 2024.
-
SCHENO: Measuring Schema vs. Noise in Graphs
Authors:
Justus Isaiah Hibshman,
Adnan Hoq,
Tim Weninger
Abstract:
Real-world data is typically a noisy manifestation of a core pattern (schema), and the purpose of data mining algorithms is to uncover that pattern, thereby splitting (i.e. decomposing) the data into schema and noise. We introduce SCHENO, a principled evaluation metric for the goodness of a schema-noise decomposition of a graph. SCHENO captures how schematic the schema is, how noisy the noise is, and how well the combination of the two represents the original graph data. We visually demonstrate what this metric prioritizes in small graphs, then show that if SCHENO is used as the fitness function for a simple optimization strategy, we can uncover a wide variety of patterns. Finally, we evaluate several well-known graph mining algorithms with this metric; we find that although they produce patterns, those patterns are not always the best representation of the input data.
Submitted 4 February, 2025; v1 submitted 20 April, 2024;
originally announced April 2024.
-
Span-Oriented Information Extraction -- A Unifying Perspective on Information Extraction
Authors:
Yifan Ding,
Michael Yankoski,
Tim Weninger
Abstract:
Information Extraction refers to a collection of tasks within Natural Language Processing (NLP) that identify sub-sequences within text and their labels. These tasks have been used for many years to extract relevant information and to link free text to structured data. However, the heterogeneity among information extraction tasks impedes progress in this area. We therefore offer a unifying perspective centered on what we define to be spans in text. We then re-orient these seemingly incongruous tasks into this unified perspective and then re-present the wide assortment of information extraction tasks as variants of the same basic Span-Oriented Information Extraction task.
Submitted 18 March, 2024;
originally announced March 2024.
-
An Avalanche of Images on Telegram Preceded Russia's Full-Scale Invasion of Ukraine
Authors:
William Theisen,
Michael Yankoski,
Kristina Hook,
Ernesto Verdeja,
Walter Scheirer,
Tim Weninger
Abstract:
Governments use propaganda, including through visual content -- or Politically Salient Image Patterns (PSIP) -- on social media, to influence and manipulate public opinion. In the present work, we collected the Telegram post-history of 989 Russian milbloggers to better understand the social and political narratives that circulated online in the months surrounding Russia's 2022 full-scale invasion of Ukraine. Overall, we found an 8,925% increase (p<0.001) in the number of posts and a 5,352% increase (p<0.001) in the number of images posted by these accounts in the two weeks prior to the invasion. We also observed a similar increase in the number and intensity of politically salient manipulated images that circulated on Telegram. Although this paper does not evaluate malice or coordination in these activities, we do conclude with a call for further research into the role that manipulated visual media has in the lead-up to instability events and armed conflict.
Submitted 15 July, 2024; v1 submitted 22 February, 2024;
originally announced February 2024.
-
ChatEL: Entity Linking with Chatbots
Authors:
Yifan Ding,
Qingkai Zeng,
Tim Weninger
Abstract:
Entity Linking (EL) is an essential and challenging task in natural language processing that seeks to link some text representing an entity within a document or sentence with its corresponding entry in a dictionary or knowledge base. Most existing approaches focus on creating elaborate contextual models that look for clues in the words surrounding the entity-text to help solve the linking problem. Although these fine-tuned language models tend to work, they can be unwieldy, difficult to train, and do not transfer well to other domains. Fortunately, Large Language Models (LLMs) like GPT provide a highly advanced solution to the problems inherent in EL models, but simple, naive prompts to LLMs do not work well. In the present work, we define ChatEL, which is a three-step framework to prompt LLMs to return accurate results. Overall, the ChatEL framework improves the average F1 performance across 10 datasets by more than 2%. Finally, a thorough error analysis shows many instances in which the ground truth labels were actually incorrect, and the labels predicted by ChatEL were actually correct. This indicates that the quantitative results presented in this paper may be a conservative estimate of the actual performance. All data and code are available as an open-source package on GitHub at https://github.com/yifding/In_Context_EL.
Submitted 20 February, 2024;
originally announced February 2024.
-
EntGPT: Entity Linking with Generative Large Language Models
Authors:
Yifan Ding,
Amrit Poudel,
Qingkai Zeng,
Tim Weninger,
Balaji Veeramani,
Sanmitra Bhattacharya
Abstract:
Entity Linking in natural language processing seeks to match text entities to their corresponding entries in a dictionary or knowledge base. Traditional approaches rely on contextual models, which can be complex, hard to train, and have limited transferability across different domains. Generative large language models like GPT offer a promising alternative but often underperform with naive prompts. In this study, we introduce EntGPT, employing advanced prompt engineering to enhance EL tasks. Our three-step hard-prompting method (EntGPT-P) significantly boosts the micro-F_1 score by up to 36% over vanilla prompts, achieving competitive performance across 10 datasets without supervised fine-tuning. Additionally, our instruction tuning method (EntGPT-I) improves micro-F_1 scores by 2.1% on average in supervised EL tasks and outperforms several baseline models in six Question Answering tasks. Our methods are compatible with both open-source and proprietary LLMs. All data and code are available on GitHub at https://github.com/yifding/In_Context_EL.
Submitted 22 May, 2025; v1 submitted 9 February, 2024;
originally announced February 2024.
-
Navigating the Post-API Dilemma | Search Engine Results Pages Present a Biased View of Social Media Data
Authors:
Amrit Poudel,
Tim Weninger
Abstract:
Recent decisions to discontinue access to social media APIs are having detrimental effects on Internet research and the field of computational social science as a whole. This lack of access to data has been dubbed the Post-API era of Internet research. Fortunately, popular search engines have the means to crawl, capture, and surface social media data on their Search Engine Results Pages (SERP) if provided the proper search query, and may provide a solution to this dilemma. In the present work we ask: does SERP provide a complete and unbiased sample of social media data? Is SERP a viable alternative to direct API-access? To answer these questions, we perform a comparative analysis between (Google) SERP results and nonsampled data from Reddit and Twitter/X. We find that SERP results are highly biased in favor of popular posts; against political, pornographic, and vulgar posts; are more positive in their sentiment; and have large topical gaps. Overall, we conclude that SERP is not a viable alternative to social media API access.
Submitted 27 November, 2024; v1 submitted 27 January, 2024;
originally announced January 2024.
-
TK-KNN: A Balanced Distance-Based Pseudo Labeling Approach for Semi-Supervised Intent Classification
Authors:
Nicholas Botzer,
David Vasquez,
Tim Weninger,
Issam Laradji
Abstract:
The ability to detect intent in dialogue systems has become increasingly important in modern technology. These systems often generate a large amount of unlabeled data, and manually labeling this data requires substantial human effort. Semi-supervised methods attempt to remedy this cost by using a model trained on a few labeled examples and then by assigning pseudo-labels to a further subset of unlabeled examples that have a model prediction confidence higher than a certain threshold. However, one particularly perilous consequence of these methods is the risk of picking an imbalanced set of examples across classes, which could lead to poor labels. In the present work, we describe Top-K K-Nearest Neighbor (TK-KNN), which uses a more robust pseudo-labeling approach based on distance in the embedding space while maintaining a balanced set of pseudo-labeled examples across classes through a ranking-based approach. Experiments on several datasets show that TK-KNN outperforms existing models, particularly when labeled data is scarce on popular datasets such as CLINC150 and Banking77. Code is available at https://github.com/ServiceNow/tk-knn
Submitted 17 October, 2023;
originally announced October 2023.
-
Entity Graphs for Exploring Online Discourse
Authors:
Nicholas Botzer,
Tim Weninger
Abstract:
Vast amounts of human communication occur online. These digital traces of natural human communication along with recent advances in natural language processing technology provide for computational analysis of these discussions. In the study of social networks the typical perspective is to view users as nodes and concepts as flowing through and among the user-nodes within the social network. In the present work we take the opposite perspective: we extract and organize massive amounts of group discussion into a concept space we call an entity graph where concepts and entities are static and human communicators move about the concept space via their conversations. Framed by this perspective we performed several experiments and comparative analyses on large volumes of online discourse from Reddit. In quantitative experiments, we found that discourse was difficult to predict, especially as the conversation carried on. We also developed an interactive tool to visually inspect conversation trails over the entity graph; although they were difficult to predict, we found that conversations, in general, tended to diverge to a vast swath of topics initially, but then tended to converge to simple and popular concepts as the conversation progressed. An application of the spreading activation function from the field of cognitive psychology also provided compelling visual narratives from the data.
Submitted 6 April, 2023;
originally announced April 2023.
-
Dynamic Vertex Replacement Grammars
Authors:
Daniel Gonzalez Cedre,
Justus Isaiah Hibshman,
Timothy La Fond,
Grant Boquet,
Tim Weninger
Abstract:
Context-free graph grammars have shown a remarkable ability to model structures in real-world relational data. However, graph grammars lack the ability to capture time-changing phenomena since the left-to-right transitions of a production rule do not represent temporal change. In the present work, we describe dynamic vertex-replacement grammars (DyVeRG), which generalize vertex replacement grammars in the time domain by providing a formal framework for updating a learned graph grammar in accordance with modifications to its underlying data. We show that DyVeRG grammars can be learned from, and used to generate, real-world dynamic graphs faithfully while remaining human-interpretable. We also demonstrate their ability to forecast by computing dyvergence scores, a novel graph similarity measurement exposed by this framework.
Submitted 21 March, 2023; v1 submitted 20 March, 2023;
originally announced March 2023.
-
Truth Social Dataset
Authors:
Patrick Gerard,
Nicholas Botzer,
Tim Weninger
Abstract:
Formally announced to the public following former President Donald Trump's bans and suspensions from mainstream social networks in early 2022 after his role in the January 6 Capitol Riots, Truth Social was launched as an "alternative" social media platform that claims to be a refuge for free speech, offering a platform for those disaffected by the content moderation policies of the existing, mainstream social networks. The subsequent rise of Truth Social has been driven largely by hard-line supporters of the former president as well as those affected by the content moderation of other social networks. These distinct qualities, combined with its status as the main mouthpiece of the former president, position Truth Social as a particularly influential social media platform and give rise to several research questions. However, outside of a handful of news reports, little is known about the new social media platform partially due to a lack of well-curated data. In the current work, we describe a dataset of over 823,000 posts to Truth Social and a social network with over 454,000 distinct users. In addition to the dataset itself, we also present some basic analysis of its content, certain temporal features, and its network.
Submitted 20 March, 2023;
originally announced March 2023.
-
Inherent Limits on Topology-Based Link Prediction
Authors:
Justus I. Hibshman,
Tim Weninger
Abstract:
Link prediction systems (e.g. recommender systems) typically use graph topology as one of their main sources of information. However, automorphisms and related properties of graphs beget inherent limits in predictability. We calculate hard upper bounds on how well graph topology alone enables link prediction for a wide variety of real-world graphs. We find that in the sparsest of these graphs the upper bounds are surprisingly low, thereby demonstrating that prediction systems on sparse graph data are inherently limited and require information in addition to the graph topology.
Submitted 26 June, 2023; v1 submitted 20 January, 2023;
originally announced January 2023.
-
H-EMD: A Hierarchical Earth Mover's Distance Method for Instance Segmentation
Authors:
Peixian Liang,
Yizhe Zhang,
Yifan Ding,
Jianxu Chen,
Chinedu S. Madukoma,
Tim Weninger,
Joshua D. Shrout,
Danny Z. Chen
Abstract:
Deep learning (DL) based semantic segmentation methods have achieved excellent performance in biomedical image segmentation, producing high quality probability maps to allow extraction of rich instance information to facilitate good instance segmentation. While numerous efforts were put into developing new DL semantic segmentation models, less attention was paid to a key issue of how to effectively explore their probability maps to attain the best possible instance segmentation. We observe that probability maps by DL semantic segmentation models can be used to generate many possible instance candidates, and accurate instance segmentation can be achieved by selecting from them a set of "optimized" candidates as output instances. Further, the generated instance candidates form a well-behaved hierarchical structure (a forest), which allows selecting instances in an optimized manner. Hence, we propose a novel framework, called hierarchical earth mover's distance (H-EMD), for instance segmentation in biomedical 2D+time videos and 3D images, which judiciously incorporates consistent instance selection with semantic-segmentation-generated probability maps. H-EMD contains two main stages. (1) Instance candidate generation: capturing instance-structured information in probability maps by generating many instance candidates in a forest structure. (2) Instance candidate selection: selecting instances from the candidate set for final instance segmentation. We formulate a key instance selection problem on the instance candidate forest as an optimization problem based on the earth mover's distance (EMD), and solve it by integer linear programming. Extensive experiments on eight biomedical video or 3D datasets demonstrate that H-EMD consistently boosts DL semantic segmentation models and is highly competitive with state-of-the-art methods.
Submitted 2 June, 2022;
originally announced June 2022.
-
MEWS: Real-time Social Media Manipulation Detection and Analysis
Authors:
Trenton W. Ford,
William Theisen,
Michael Yankoski,
Tom Henry,
Farah Khashman,
Katherine R. Dearstyne,
Tim Weninger
Abstract:
This article presents a beta-version of MEWS (Misinformation Early Warning System). It describes the various aspects of the ingestion, manipulation detection, and graphing algorithms employed to determine--in near real-time--the relationships between social media images as they emerge and spread on social media platforms. By combining these various technologies into a single processing pipeline, MEWS can identify manipulated media items as they arise and identify when these particular items begin trending on individual social media platforms or even across multiple platforms. The emergence of a novel manipulation followed by rapid diffusion of the manipulated content suggests a disinformation campaign.
Submitted 12 May, 2022; v1 submitted 11 May, 2022;
originally announced May 2022.
-
Subreddit Links Drive Community Creation and User Engagement on Reddit
Authors:
Rachel Krohn,
Tim Weninger
Abstract:
On Reddit, individual subreddits are used to organize content and connect users. One mode of interaction is the subreddit link, which occurs when a user makes a direct reference to a subreddit in another community. Based on the ubiquity of these references, we have undertaken a study on subreddit links on Reddit, with the goal of understanding their impact on both the referenced subreddit, and on the subreddit landscape as a whole. By way of an extensive observational study along with several natural experiments using the entire history of Reddit, we were able to determine that (1) subreddit links are a significant driver of new subreddit creation; (2) subreddit links (2a) substantially drive activity in the referenced subreddit, and (2b) are frequently created in response to high levels of activity in the referenced subreddit; and (3) the graph of subreddit links has become less dense and more treelike over time. We conclude with a discussion of how these results confirm, add to, and in some cases conflict with existing theories on information-seeking behavior and self-organizing behavior in online social systems.
Submitted 18 March, 2022;
originally announced March 2022.
-
Motif Mining: Finding and Summarizing Remixed Image Content
Authors:
William Theisen,
Daniel Gonzalez Cedre,
Zachariah Carmichael,
Daniel Moreira,
Tim Weninger,
Walter Scheirer
Abstract:
On the internet, images are no longer static; they have become dynamic content. Thanks to the availability of smartphones with cameras and easy-to-use editing software, images can be remixed (i.e., redacted, edited, and recombined with other content) on-the-fly and with a world-wide audience that can repeat the process. From digital art to memes, the evolution of images through time is now an important topic of study for digital humanists, social scientists, and media forensics specialists. However, because typical data sets in computer vision are composed of static content, the development of automated algorithms to analyze remixed content has been limited. In this paper, we introduce the idea of Motif Mining - the process of finding and summarizing remixed image content in large collections of unlabeled and unsorted data. We formalize this idea and introduce a reference implementation. Experiments are conducted on three meme-style data sets, including a newly collected set associated with the information war in the Russo-Ukrainian conflict. The proposed motif mining approach is able to identify related remixed content that, when compared to similar approaches, more closely aligns with the preferences and expectations of human observers.
Submitted 17 March, 2022; v1 submitted 15 March, 2022;
originally announced March 2022.
-
Attributed Graph Modeling with Vertex Replacement Grammars
Authors:
Satyaki Sikdar,
Neil Shah,
Tim Weninger
Abstract:
Recent work at the intersection of formal language theory and graph theory has explored graph grammars for graph modeling. However, existing models and formalisms can only operate on homogeneous (i.e., untyped or unattributed) graphs. We relax this restriction and introduce the Attributed Vertex Replacement Grammar (AVRG), which can be efficiently extracted from heterogeneous (i.e., typed, colored, or attributed) graphs. Unlike current state-of-the-art methods, which train enormous models over complicated deep neural architectures, the AVRG model is unsupervised and interpretable. It is based on context-free string grammars and works by encoding graph rewriting rules into a graph grammar containing graphlets and instructions on how they fit together. We show that the AVRG can encode succinct models of input graphs yet faithfully preserve their structure and assortativity properties. Experiments on large real-world datasets show that graphs generated from the AVRG model exhibit substructures and attribute configurations that match those found in the input networks.
Submitted 12 October, 2021;
originally announced October 2021.
-
Pilot Study Suggests Online Media Literacy Programming Reduces Belief in False News in Indonesia
Authors:
Pamela Bilo Thomas,
Clark Hogan-Taylor,
Michael Yankoski,
Tim Weninger
Abstract:
Amidst the threat of digital misinformation, we offer a pilot study regarding the efficacy of an online social media literacy campaign aimed at empowering individuals in Indonesia with skills to help them identify misinformation. We found that users who engaged with our online training materials and educational videos were more likely to identify misinformation than those in our control group (total $N$=1000). Given the promising results of our preliminary study, we plan to expand efforts in this area, and build upon lessons learned from this pilot study.
Submitted 16 July, 2021;
originally announced July 2021.
-
Posthoc Verification and the Fallibility of the Ground Truth
Authors:
Yifan Ding,
Nicholas Botzer,
Tim Weninger
Abstract:
Classifiers commonly make use of pre-annotated datasets, wherein a model is evaluated by pre-defined metrics on a held-out test set typically made of human-annotated labels. Metrics used in these evaluations are tied to the availability of well-defined ground truth labels, and these metrics typically do not allow for inexact matches. These noisy ground truth labels and strict evaluation metrics may compromise the validity and realism of evaluation results. In the present work, we discuss these concerns and conduct a systematic posthoc verification experiment on the entity linking (EL) task. Unlike traditional methodologies, which ask annotators to provide free-form annotations, we ask annotators to verify the correctness of annotations after the fact (i.e., posthoc). Compared to pre-annotation evaluation, state-of-the-art EL models performed extremely well according to the posthoc evaluation methodology. Posthoc validation also permits the validation of the ground truth dataset. Surprisingly, we find predictions from EL models had a similar or higher verification rate than the ground truth. We conclude with a discussion on these findings and recommendations for future evaluations.
Submitted 2 June, 2021;
originally announced June 2021.
-
Survey Equivalence: A Procedure for Measuring Classifier Accuracy Against Human Labels
Authors:
Paul Resnick,
Yuqing Kong,
Grant Schoenebeck,
Tim Weninger
Abstract:
In many classification tasks, the ground truth is either noisy or subjective. Examples include: which of two alternative paper titles is better? is this comment toxic? what is the political leaning of this news article? We refer to such tasks as survey settings because the ground truth is defined through a survey of one or more human raters. In survey settings, conventional measurements of classifier accuracy such as precision, recall, and cross-entropy confound the quality of the classifier with the level of agreement among human raters. Thus, they have no meaningful interpretation on their own. We describe a procedure that, given a dataset with predictions from a classifier and K ratings per item, rescales any accuracy measure into one that has an intuitive interpretation. The key insight is to score the classifier not against the best proxy for the ground truth, such as a majority vote of the raters, but against a single human rater at a time. That score can be compared to other predictors' scores, in particular predictors created by combining labels from several other human raters. The survey equivalence of any classifier is the minimum number of raters needed to produce the same expected score as that found for the classifier.
Submitted 2 June, 2021;
originally announced June 2021.
-
Competition Dynamics in the Meme Ecosystem
Authors:
Trenton Ford,
Rachel Krohn,
Tim Weninger
Abstract:
The creation and sharing of memes is a common modality of online social interactions. The goal of the present work is to better understand the collective dynamics of memes in this accelerating and competitive environment. By taking an ecological perspective and tracking the meme-text from 352 popular memes over the entirety of Reddit, we are able to show that the frequency of memes has scaled almost exactly with the total amount of content created over the past decade. This means that as more data is posted, an equal proportion of memes are posted. One consequence of limited human attention in the face of a growing number of memes is that the diversity of these memes has decreased at the community level, albeit slightly, in the same period. Another consequence is that the average lifespan of a meme has decreased dramatically, which is further evidence of an increase in competition and a decreasing collective attention span.
Submitted 7 February, 2021;
originally announced February 2021.
-
Analysis of Moral Judgement on Reddit
Authors:
Nicholas Botzer,
Shawn Gu,
Tim Weninger
Abstract:
Moral outrage has become synonymous with social media in recent years. However, the preponderance of academic analysis on social media websites has focused on hate speech and misinformation. This paper focuses on analyzing moral judgements rendered on social media by capturing the moral judgements that are passed in the subreddit /r/AmITheAsshole on Reddit. Using the labels associated with each judgement we train a classifier that can take a comment and determine whether it judges the user who made the original post to have positive or negative moral valence. Then, we use this classifier to investigate an assortment of website traits surrounding moral judgements in ten other subreddits, including where negative moral users like to post and their posting patterns. Our findings also indicate that posts that are judged in a positive manner will score higher.
Submitted 19 January, 2021;
originally announced January 2021.
-
Behavior Change in Response to Subreddit Bans and External Events
Authors:
Pamela Bilo Thomas,
Daniel Riehm,
Maria Glenski,
Tim Weninger
Abstract:
As more people flock to social media to connect with others and form virtual communities, it is important to research how members of these groups interact to understand human behavior on the Web. In response to an increase in hate speech, harassment and other antisocial behaviors, many social media companies have implemented different content and user moderation policies. On Reddit, for example, communities, i.e., subreddits, are occasionally banned for violating these policies. We study the effect of these regulatory actions as well as when a community experiences a significant external event like a political election or a market crash. Overall, we find that most subreddit bans prompt a small, but statistically significant, number of active users to leave the platform; the effect of external events varies with the type of event. We conclude with a discussion on the effectiveness of the bans and wider implications for online content moderation.
Submitted 5 January, 2021;
originally announced January 2021.
-
Reddit Entity Linking Dataset
Authors:
Nicholas Botzer,
Yifan Ding,
Tim Weninger
Abstract:
We introduce and make publicly available an entity linking dataset from Reddit that contains 17,316 linked entities, each annotated by three human annotators and then grouped into Gold, Silver, and Bronze to indicate inter-annotator agreement. We analyze the different errors and disagreements made by annotators and suggest three types of corrections to the raw data. Finally, we tested existing entity linking models that are trained and tuned on text from non-social media datasets. We find that, although these existing entity linking models perform very well on their original datasets, they perform poorly on this social media dataset. We also show that the majority of these errors can be attributed to poor performance on the mention detection subtask. These results indicate the need for better entity linking models that can be applied to the enormous amount of social media text.
Submitted 25 February, 2021; v1 submitted 4 January, 2021;
originally announced January 2021.
-
HetSeq: Distributed GPU Training on Heterogeneous Infrastructure
Authors:
Yifan Ding,
Nicholas Botzer,
Tim Weninger
Abstract:
Modern deep learning systems like PyTorch and TensorFlow are able to train enormous models with billions (or trillions) of parameters on a distributed infrastructure. These systems require that the internal nodes have the same memory capacity and compute performance. Unfortunately, most organizations, especially universities, have a piecemeal approach to purchasing computer systems resulting in a heterogeneous infrastructure, which cannot be used to compute large models. The present work describes HetSeq, a software package adapted from the popular PyTorch package that provides the capability to train large neural network models on heterogeneous infrastructure. Experiments with transformer translation and BERT language models show that HetSeq scales over heterogeneous systems. HetSeq can be easily extended to other models like image classification. The package and its supporting documentation are publicly available at https://github.com/yifding/hetseq.
Submitted 25 September, 2020;
originally announced September 2020.
-
The Infinity Mirror Test for Graph Models
Authors:
Satyaki Sikdar,
Daniel Gonzalez Cedre,
Trenton W. Ford,
Tim Weninger
Abstract:
Graph models, like other machine learning models, have implicit and explicit biases built-in, which often impact performance in nontrivial ways. The model's faithfulness is often measured by comparing the newly generated graph against the source graph using any number or combination of graph properties. Differences in the size or topology of the generated graph, therefore, indicate a loss in the model. Yet, in many systems, errors encoded in loss functions are subtle and not well understood. In the present work, we introduce the Infinity Mirror test for analyzing the robustness of graph models. This straightforward stress test works by repeatedly fitting a model to its own outputs. A hypothetically perfect graph model would have no deviation from the source graph; however, the model's implicit biases and assumptions are exaggerated by the Infinity Mirror test, exposing potential issues that were previously obscured. Through an analysis of thousands of experiments on synthetic and real-world graphs, we show that several conventional graph models degenerate in exciting and informative ways. We believe that the observed degenerative patterns are clues to the future development of better graph models.
Submitted 3 January, 2022; v1 submitted 18 September, 2020;
originally announced September 2020.
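The stress test described in this abstract, repeatedly fitting a model to its own outputs, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simple Erdős–Rényi random graph as a stand-in model, and the function names (`fit_density`, `generate`, `infinity_mirror`) are hypothetical. It only tracks the edge count across generations as one example of a graph property that may drift.

```python
import random
from itertools import combinations


def fit_density(n_nodes, edges):
    """Fit the stand-in model: estimate a single edge probability p."""
    possible = n_nodes * (n_nodes - 1) // 2
    return len(edges) / possible if possible else 0.0


def generate(n_nodes, p, rng):
    """Sample a new undirected graph in which each pair is linked with probability p."""
    return {pair for pair in combinations(range(n_nodes), 2) if rng.random() < p}


def infinity_mirror(n_nodes, edges, generations, seed=0):
    """Refit the model to its own output each generation; return the edge counts."""
    rng = random.Random(seed)
    current = set(edges)
    history = [len(current)]
    for _ in range(generations):
        p = fit_density(n_nodes, current)      # fit to the previous output
        current = generate(n_nodes, p, rng)    # generate the next graph
        history.append(len(current))
    return history


# Hypothetical source graph: a ring on 20 nodes.
ring = {(i, (i + 1) % 20) if i < 19 else (0, 19) for i in range(20)}
history = infinity_mirror(20, ring, generations=5)
```

Comparing `history[0]` with later entries shows how far the regenerated graphs move away from the source on this one property; a richer test would compare several properties and use a more expressive model.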
-
Joint Subgraph-to-Subgraph Transitions -- Generalizing Triadic Closure for Powerful and Interpretable Graph Modeling
Authors:
Justus Hibshman,
Daniel Gonzalez Cedre,
Satyaki Sikdar,
Tim Weninger
Abstract:
We generalize triadic closure, along with previous generalizations of triadic closure, under an intuitive umbrella generalization: the Subgraph-to-Subgraph Transition (SST). We present algorithms and code to model graph evolution in terms of collections of these SSTs. We then use the SST framework to create link prediction models for both static and temporal, directed and undirected graphs which produce highly interpretable results. Quantitatively, our models match out-of-the-box performance of state-of-the-art graph neural network models, thereby validating the correctness and meaningfulness of our interpretable results.
Submitted 17 February, 2022; v1 submitted 14 September, 2020;
originally announced September 2020.
-
Library Adoption Dynamics in Software Teams
Authors:
Pamela Bilo Thomas,
Rachel Krohn,
Tim Weninger
Abstract:
When a group of people strives to understand new information, struggle ensues as various ideas compete for attention. Steep learning curves are surmounted as teams learn together. To understand how these team dynamics play out in software development, we explore Git logs, which provide a complete change history of software repositories. In these repositories, we observe code additions, which represent successfully implemented ideas, and code deletions, which represent ideas that have failed or been superseded. By examining the patterns between these commit types, we can begin to understand how teams adopt new information. We specifically study what happens after a software library is adopted by a project, i.e. when a library is used for the first time in the project. We find that a variety of factors, including team size, library popularity, and prevalence on Stack Overflow are associated with how quickly teams learn and successfully adopt new software libraries.
Submitted 28 February, 2020;
originally announced March 2020.
-
Automatic Discovery of Political Meme Genres with Diverse Appearances
Authors:
William Theisen,
Joel Brogan,
Pamela Bilo Thomas,
Daniel Moreira,
Pascal Phoa,
Tim Weninger,
Walter Scheirer
Abstract:
Forms of human communication are not static -- we expect some evolution in the way information is conveyed over time because of advances in technology. One example of this phenomenon is the image-based meme, which has emerged as a dominant form of political messaging in the past decade. While originally used to spread jokes on social media, memes are now having an outsized impact on public perception of world events. A significant challenge in automatic meme analysis has been the development of a strategy to match memes from within a single genre when the appearances of the images vary. Such variation is especially common in memes exhibiting mimicry, for example, when voters perform a common hand gesture to signal their support for a candidate. In this paper we introduce a scalable automated visual recognition pipeline for discovering political meme genres of diverse appearance. This pipeline can ingest meme images from a social network, apply computer vision-based techniques to extract local features and index new images into a database, and then organize the memes into related genres. To validate this approach, we perform a large case study on the 2019 Indonesian Presidential Election using a new dataset of over two million images collected from Twitter and Instagram. Results show that this approach can discover new meme genres with visually diverse images that share common stylistic elements, paving the way forward for further work in semantic analysis and content attribution.
Submitted 10 September, 2020; v1 submitted 16 January, 2020;
originally announced January 2020.
-
Representation Learning in Heterogeneous Professional Social Networks with Ambiguous Social Connections
Authors:
Baoxu Shi,
Jaewon Yang,
Tim Weninger,
Jing How,
Qi He
Abstract:
Network representations have been shown to improve performance within a variety of tasks, including classification, clustering, and link prediction. However, most models either focus on moderate-sized, homogeneous networks or require a significant amount of auxiliary input to be provided by the user. Moreover, few works have studied network representations in real-world heterogeneous social networks, which have ambiguous social connections and are often incomplete. In the present work, we investigate the problem of learning low-dimensional node representations in heterogeneous professional social networks (HPSNs), which are incomplete and have ambiguous social connections. We present a general heterogeneous network representation learning model called Star2Vec that learns entity and person embeddings jointly using a social connection strength-aware biased random walk combined with a node-structure expansion function. Experiments on LinkedIn's Economic Graph and publicly available snapshots of Facebook's network show that Star2Vec outperforms existing methods on members' industry and social circle classification, skill and title clustering, and member-entity link predictions. We also conducted large-scale case studies to demonstrate practical applications of the Star2Vec embeddings trained on LinkedIn's Economic Graph such as next career move, alternative career suggestions, and general entity similarity searches.
Submitted 23 October, 2019;
originally announced October 2019.
-
Towards Interpretable Graph Modeling with Vertex Replacement Grammars
Authors:
Justus Hibshman,
Satyaki Sikdar,
Tim Weninger
Abstract:
An enormous amount of real-world data exists in the form of graphs. Oftentimes, interesting patterns that describe the complex dynamics of these graphs are captured in the form of frequently reoccurring substructures. Recent work at the intersection of formal language theory and graph theory has explored the use of graph grammars for graph modeling and pattern mining. However, existing formulations do not extract meaningful and easily interpretable patterns from the data. The present work addresses this limitation by extracting a special type of vertex replacement grammar, which we call a KT grammar, according to the Minimum Description Length (MDL) heuristic. In experiments on synthetic and real-world datasets, we show that KT-grammars can be efficiently extracted from a graph and that these grammars encode meaningful patterns that represent the dynamics of the real-world system.
Submitted 18 October, 2019;
originally announced October 2019.
-
Modelling Online Comment Threads from their Start
Authors:
Rachel Krohn,
Tim Weninger
Abstract:
The social Web is a widely used platform for online discussion. Across social media, users can start discussions by posting a topical image, URL, or message. Upon seeing this initial post, other users may add their own comments to the post, or to another user's comment. The resulting online discourse produces a comment thread, which constitutes an enormous portion of modern online communication. Comment threads are often viewed as trees: nodes represent the post and its comments, while directed edges represent reply-to relationships. The goal of the present work is to predict the size and shape of these comment threads. Existing models do this by observing the first several comments and then fitting a predictive model. However, most comment threads are relatively small, and waiting for data to materialize runs counter to the goal of the prediction task. We therefore introduce the Comment Thread Prediction Model (CTPM) that accurately predicts the size and shape of a comment thread using only the text of the initial post, allowing for the prediction of new posts without observable comments. We find that the CTPM significantly outperforms existing models and competitive baselines on thousands of Reddit discussions from nine varied subreddits, particularly for new posts.
Submitted 18 October, 2019;
originally announced October 2019.
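The tree view of a comment thread described in this abstract (nodes for the post and its comments, directed reply-to edges) can be sketched as follows. This illustrates only the data structure and two simple shape measures, size and depth. It is not the CTPM itself; the function name `thread_shape` and the example identifiers are hypothetical.

```python
from collections import defaultdict


def thread_shape(parent_of, root="post"):
    """Return (size, depth) of a thread given as a child -> parent mapping.

    Size counts the root post plus all comments; depth is the longest
    reply chain measured in edges from the root.
    """
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)

    size, depth = 0, 0
    stack = [(root, 0)]
    while stack:
        node, level = stack.pop()
        size += 1
        depth = max(depth, level)
        stack.extend((c, level + 1) for c in children[node])
    return size, depth


# Hypothetical thread: two top-level comments and one reply chain under c1.
replies = {"c1": "post", "c2": "post", "c3": "c1", "c4": "c3"}
size, depth = thread_shape(replies)  # size 5, depth 3
```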
-
Massive Multi-Agent Data-Driven Simulations of the GitHub Ecosystem
Authors:
Jim Blythe,
John Bollenbacher,
Di Huang,
Pik-Mai Hui,
Rachel Krohn,
Diogo Pacheco,
Goran Muric,
Anna Sapienza,
Alexey Tregubov,
Yong-Yeol Ahn,
Alessandro Flammini,
Kristina Lerman,
Filippo Menczer,
Tim Weninger,
Emilio Ferrara
Abstract:
Simulating and predicting planetary-scale techno-social systems poses heavy computational and modeling challenges. The DARPA SocialSim program set the challenge to model the evolution of GitHub, a large collaborative software-development ecosystem, using massive multi-agent simulations. We describe our best performing models and our agent-based simulation framework, which we are currently extending to allow simulating other planetary-scale techno-social systems. The challenge problem measured participants' ability, given 30 months of meta-data on user activity on GitHub, to predict the next months' activity as measured by a broad range of metrics applied to ground truth, using agent-based simulation. The challenge required scaling to a simulation of roughly 3 million agents producing a combined 30 million actions, acting on 6 million repositories with commodity hardware. It was also important to use the data optimally to predict the agents' next moves. We describe the agent framework and the data analysis employed by one of the winning teams in the challenge. Six different agent models were tested based on a variety of machine learning and statistical methods. While no single method proved the most accurate on every metric, the most broadly successful agents sampled from a stationary probability distribution of actions and repositories for each agent. Two reasons for the success of these agents were their use of a distinct characterization of each agent, and the fact that GitHub users change their behavior relatively slowly.
Submitted 15 August, 2019;
originally announced August 2019.
-
Modeling Graphs with Vertex Replacement Grammars
Authors:
Satyaki Sikdar,
Justus Hibshman,
Tim Weninger
Abstract:
One of the principal goals of graph modeling is to capture the building blocks of network data in order to study various physical and natural phenomena. Recent work at the intersection of formal language theory and graph theory has explored the use of graph grammars for graph modeling. However, existing graph grammar formalisms, like Hyperedge Replacement Grammars, can only operate on small tree-like graphs. The present work relaxes this restriction by revising a different graph grammar formalism called Vertex Replacement Grammars (VRGs). We show that a variant of the VRG called Clustering-based Node Replacement Grammar (CNRG) can be efficiently extracted from many hierarchical clusterings of a graph. We show that CNRGs encode a succinct model of the graph, yet faithfully preserve the structure of the original graph. In experiments on large real-world datasets, we show that graphs generated from the CNRG model exhibit a diverse range of properties that are similar to those found in the original networks.
Submitted 11 September, 2019; v1 submitted 10 August, 2019;
originally announced August 2019.
-
Dynamics of Team Library Adoptions: An Exploration of GitHub Commit Logs
Authors:
Pamela Bilo Thomas,
Rachel Krohn,
Tim Weninger
Abstract:
When a group of people strives to understand new information, struggle ensues as various ideas compete for attention. Steep learning curves are surmounted as teams learn together. To understand how these team dynamics play out in software development, we explore Git logs, which provide a complete change history of software repositories. In these repositories, we observe code additions, which represent successfully implemented ideas, and code deletions, which represent ideas that have failed or been superseded. By examining the patterns between these commit types, we can begin to understand how teams adopt new information. We specifically study what happens after a software library is adopted by a project, i.e., when a library is used for the first time in the project. We find that a variety of factors, including team size, library popularity, and prevalence on Stack Overflow, are associated with how quickly teams learn and successfully adopt new software libraries.
Submitted 10 July, 2019;
originally announced July 2019.
-
Improved Forecasting of Cryptocurrency Price using Social Signals
Authors:
Maria Glenski,
Tim Weninger,
Svitlana Volkova
Abstract:
Social media signals have been successfully used to develop large-scale predictive and anticipatory analytics, for example, to forecast stock market prices and influenza outbreaks. Recently, social data has been explored to forecast price fluctuations of cryptocurrencies, which are a novel disruptive technology with significant political and economic implications. In this paper we leverage and contrast the predictive power of social signals, specifically user behavior and communication patterns, from multiple social platforms (GitHub and Reddit) to forecast prices for three cryptocurrencies with high developer and community interest: Bitcoin, Ethereum, and Monero. We evaluate the performance of neural network models that rely on long short-term memory units (LSTMs) trained on historical price data and social data against price-only LSTMs and baseline autoregressive integrated moving average (ARIMA) models, commonly used to predict stock prices. Our results not only demonstrate that social signals reduce error when forecasting daily coin price, but also show that the language used in comments within the official communities on Reddit (r/Bitcoin, r/Ethereum, and r/Monero) is the best predictor overall. We observe that models are more accurate in forecasting price one day ahead for Bitcoin (4% root mean squared percent error) compared to Ethereum (7%) and Monero (8%).
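The error figures above are reported as root mean squared percent error. The abstract does not spell out the formula, so the Python sketch below assumes the common definition: the square root of the mean squared relative error, expressed as a percentage. The function name `rmspe` and the price values are hypothetical and not taken from the paper.

```python
import math


def rmspe(actual, predicted):
    """Root mean squared percent error, in percent.

    Assumes the common definition
        100 * sqrt(mean(((actual - predicted) / actual) ** 2)),
    which may differ in detail from the one used in the paper.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_relative_errors = [
        ((a - p) / a) ** 2 for a, p in zip(actual, predicted)
    ]
    mean_squared = sum(squared_relative_errors) / len(squared_relative_errors)
    return 100.0 * math.sqrt(mean_squared)


# Made-up daily prices and one-day-ahead forecasts.
actual_prices = [100.0, 200.0]
forecast_prices = [110.0, 180.0]
print(round(rmspe(actual_prices, forecast_prices), 6))
```

Because each error is divided by the actual price, the measure is scale-free, which is what allows a comparison across coins with very different price levels.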
Submitted 1 July, 2019;
originally announced July 2019.
-
Propagation from Deceptive News Sources: Who Shares, How Much, How Evenly, and How Quickly?
Authors:
Maria Glenski,
Tim Weninger,
Svitlana Volkova
Abstract:
As people rely on social media as their primary sources of news, the spread of misinformation has become a significant concern. In this large-scale study of news in social media we analyze eleven million posts and investigate propagation behavior of users that directly interact with news accounts identified as spreading trusted versus malicious content. Unlike previous work, which looks at specific rumors, topics, or events, we consider all content propagated by various news sources. Moreover, we analyze and contrast population versus sub-population behavior (by demographics) when spreading misinformation, and distinguish between two types of propagation, i.e., direct retweets and mentions. Our evaluation examines how evenly, how many, how quickly, and which users propagate content from various types of news sources on Twitter.
Our analysis has identified several key differences in propagation behavior from trusted versus suspicious news sources. These include high inequity in the diffusion rate based on the source of disinformation, with a small group of highly active users responsible for the majority of disinformation spread overall and within each demographic. Analysis by demographics showed that users with lower annual income and education share more from disinformation sources compared to their counterparts. News content is shared significantly more quickly from trusted, conspiracy, and disinformation sources compared to clickbait and propaganda. Older users propagate news from trusted sources more quickly than younger users, but they share from suspicious sources after longer delays. Finally, users who interact with clickbait and conspiracy sources are likely to share from propaganda accounts, but not the other way around.
Submitted 9 December, 2018;
originally announced December 2018.
-
GuessTheKarma: A Game to Assess Social Rating Systems
Authors:
Maria Glenski,
Greg Stoddard,
Paul Resnick,
Tim Weninger
Abstract:
Popularity systems, like Twitter retweets, Reddit upvotes, and Pinterest pins, have the potential to guide people toward posts that others liked. That, however, creates a feedback loop that reduces their informativeness: items marked as more popular get more attention, so that additional upvotes and retweets may simply reflect the increased attention and not independent information about the fraction of people that like the items. How much information remains? For example, how confident can we be that more people prefer item A to item B if item A had hundreds of upvotes on Reddit and item B had only a few? We investigate this question using an Internet game called GuessTheKarma that collects independent preference judgments (N=20,674) for 400 pairs of images, approximately 50 per pair. Unlike the rating systems that dominate social media services, GuessTheKarma is devoid of social and ranking effects that influence ratings. Overall, Reddit scores were not very good predictors of the true population preferences for items as measured by GuessTheKarma: the image with the higher score was preferred by a majority of independent raters only 68% of the time. However, when one image had a low score and the other was one of the highest scoring in its subreddit, the higher scoring image was preferred nearly 90% of the time by the majority of independent raters. Similarly, Imgur view counts for the images were poor predictors except when there were orders of magnitude differences between the pairs. We conclude that popularity systems marked by feedback loops may convey a strong signal about population preferences, but only when comparing items that received vastly different popularity scores.
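The headline statistic here is the share of image pairs in which the higher-scoring image was also the one preferred by a majority of independent raters. The Python sketch below shows one way to compute such a share. The tuple layout, the function name, the tie handling (tied pairs are skipped), and the example numbers are all assumptions made for illustration; the abstract does not describe the authors' exact procedure.

```python
def majority_agreement(pairs):
    """Fraction of pairs where the higher-scored item won the majority vote.

    Each pair is (score_a, score_b, votes_for_a, votes_for_b).
    Pairs with tied scores or tied votes are skipped, which is an
    assumption of this sketch rather than a rule from the paper.
    """
    agree = 0
    counted = 0
    for score_a, score_b, votes_a, votes_b in pairs:
        if score_a == score_b or votes_a == votes_b:
            continue  # no clear winner on one side; skip the pair
        counted += 1
        if (score_a > score_b) == (votes_a > votes_b):
            agree += 1
    return agree / counted if counted else float("nan")


# Made-up pairs: (Reddit score A, Reddit score B, votes for A, votes for B).
example_pairs = [
    (500, 3, 40, 10),   # higher score, majority agrees
    (20, 15, 22, 28),   # higher score, majority disagrees
    (7, 90, 12, 38),    # higher score is B, majority agrees
]
print(majority_agreement(example_pairs))
```

Restricting `example_pairs` to pairs with very large score gaps before calling the function would mirror the comparison the abstract draws between the overall rate and the rate for extreme score differences.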
Submitted 3 September, 2018;
originally announced September 2018.
-
How Humans versus Bots React to Deceptive and Trusted News Sources: A Case Study of Active Users
Authors:
Maria Glenski,
Tim Weninger,
Svitlana Volkova
Abstract:
Society's reliance on social media as a primary source of news has spawned a renewed focus on the spread of misinformation. In this work, we identify the differences in how social media accounts identified as bots react to news sources of varying credibility, regardless of the veracity of the content those sources have shared. We analyze bot and human responses annotated using a fine-grained model that labels responses as being an answer, appreciation, agreement, disagreement, an elaboration, humor, or a negative reaction. We present key findings of our analysis into the prevalence of bots, the variety and speed of bot and human reactions, and the disparity in authorship of reaction tweets between these two sub-populations. We observe that bots are responsible for 9-15% of the reactions to sources of any given type but comprise only 7-10% of accounts responsible for reaction-tweets; trusted news sources have the highest proportion of humans who reacted; bots respond with significantly shorter delays than humans when posting answer-reactions in response to sources identified as propaganda. Finally, we report significantly different inequality levels in reaction rates for accounts identified as bots versus those that are not.
Submitted 13 July, 2018;
originally announced July 2018.
-
Growing Better Graphs With Latent-Variable Probabilistic Graph Grammars
Authors:
Xinyi Wang,
Salvador Aguinaga,
Tim Weninger,
David Chiang
Abstract:
Recent work in graph models has found that probabilistic hyperedge replacement grammars (HRGs) can be extracted from graphs and used to generate new random graphs with graph properties and substructures close to the original. In this paper, we show how to add latent variables to the model, trained using Expectation-Maximization, to generate still better graphs, that is, ones that generalize better to the test data. We evaluate the new method by separating training and test graphs, building the model on the former and measuring the likelihood of the latter, as a more stringent test of how well the model can generalize to new graphs. On this metric, we find that our latent-variable HRGs consistently outperform several existing graph models and provide interesting insights into the building blocks of real world networks.
Submitted 11 June, 2018;
originally announced June 2018.
-
Identifying and Understanding User Reactions to Deceptive and Trusted Social News Sources
Authors:
Maria Glenski,
Tim Weninger,
Svitlana Volkova
Abstract:
In the age of social news, it is important to understand the types of reactions that are evoked from news sources with various levels of credibility. In the present work we seek to better understand how users react to trusted and deceptive news sources across two popular, and very different, social media platforms. To that end, (1) we develop a model to classify user reactions into one of nine types, such as answer, elaboration, and question, and (2) we measure the speed and the type of reaction for trusted and deceptive news sources for 10.8M Twitter posts and 6.2M Reddit comments. We show that there are significant differences in the speed and the type of reactions between trusted and deceptive news sources on Twitter, but far smaller differences on Reddit.
Submitted 30 May, 2018;
originally announced May 2018.
-
Visualizing the Flow of Discourse with a Concept Ontology
Authors:
Baoxu Shi,
Tim Weninger
Abstract:
Understanding and visualizing human discourse has long been a challenging task. Although recent work on argument mining has shown success in classifying the role of various sentences, the task of recognizing concepts and understanding the ways in which they are discussed remains challenging. Given an email thread or a transcript of a group discussion, our task is to extract the relevant concepts and understand how they are referenced and re-referenced throughout the discussion. In the present work, we present a preliminary approach for extracting and visualizing group discourse by adapting Wikipedia's category hierarchy to be an external concept ontology. From a user study, we found that our method achieved better results than four strong alternative approaches, and we illustrate our visualization method based on the extracted discourse flows.
Submitted 23 February, 2018;
originally announced February 2018.
-
Learning Hyperedge Replacement Grammars for Graph Generation
Authors:
Salvador Aguinaga,
David Chiang,
Tim Weninger
Abstract:
The discovery and analysis of network patterns are central to the scientific enterprise. In the present work, we developed and evaluated a new approach that learns the building blocks of graphs that can be used to understand and generate new realistic graphs. Our key insight is that a graph's clique tree encodes robust and precise information. We show that a Hyperedge Replacement Grammar (HRG) can be extracted from the clique tree, and we develop a fixed-size graph generation algorithm that can be used to produce new graphs of a specified size. In experiments on large real-world graphs, we show that graphs generated from the HRG approach exhibit a diverse range of properties that are similar to those found in the original networks. In addition to graph properties like degree or eigenvector centrality, what a graph "looks like" ultimately depends on small details in local graph substructures that are difficult to define at a global level. We show that the HRG model can also preserve these local substructures when generating new graphs.
Submitted 23 February, 2018; v1 submitted 20 February, 2018;
originally announced February 2018.
-
Open-World Knowledge Graph Completion
Authors:
Baoxu Shi,
Tim Weninger
Abstract:
Knowledge Graphs (KGs) have been applied to many tasks including Web search, link prediction, recommendation, natural language processing, and entity linking. However, most KGs are far from complete and are growing at a rapid pace. To address these problems, Knowledge Graph Completion (KGC) has been proposed to improve KGs by filling in their missing connections. Unlike existing methods, which hold a closed-world assumption, i.e., where KGs are fixed and new entities cannot be easily added, in the present work we relax this assumption and propose a new open-world KGC task. As a first attempt to solve this task we introduce an open-world KGC model called ConMask. This model learns embeddings of the entity's name and parts of its text description to connect unseen entities to the KG. To mitigate the presence of noisy text descriptions, ConMask uses relationship-dependent content masking to extract relevant snippets and then trains a fully convolutional neural network to fuse the extracted snippets with entities in the KG. Experiments on large data sets, both old and new, show that ConMask performs well in the open-world KGC task and even outperforms existing KGC models on the standard closed-world KGC task.
Submitted 9 November, 2017;
originally announced November 2017.