-
Alice and the Caterpillar: A more descriptive null model for assessing data mining results
Authors:
Giulia Preti,
Gianmarco De Francisci Morales,
Matteo Riondato
Abstract:
We introduce novel null models for assessing the results obtained from observed binary transactional and sequence datasets, using statistical hypothesis testing. Our null models maintain more properties of the observed dataset than existing ones. Specifically, they preserve the Bipartite Joint Degree Matrix of the bipartite (multi-)graph corresponding to the dataset, which ensures that the number of caterpillars, i.e., paths of length three, is preserved, in addition to other properties considered by other models. We describe Alice, a suite of Markov chain Monte Carlo algorithms for sampling datasets from our null models, based on a carefully defined set of states and efficient operations to move between them. The results of our experimental evaluation show that Alice mixes fast and scales well, and that our null model finds different significant results than ones previously considered in the literature.
Submitted 11 June, 2025;
originally announced June 2025.
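As a rough illustration of the property named in the abstract above, the sketch below computes a Bipartite Joint Degree Matrix for a tiny binary transactional dataset and derives the caterpillar count from it. This is not the Alice sampler. The function names `bjdm` and `caterpillars` and the toy dataset are hypothetical. The counting identity assumes a simple bipartite graph (no repeated edges), so it does not cover the multigraph (sequence) case.

```python
from collections import Counter


def bjdm(transactions):
    """Map (transaction length, item support) -> number of edges with those endpoint degrees.

    Hypothetical helper: each transaction is a set of items, i.e. a left-side
    vertex of the bipartite graph, and each item is a right-side vertex.
    """
    item_degree = Counter(item for t in transactions for item in set(t))
    matrix = Counter()
    for t in transactions:
        items = set(t)
        for item in items:
            matrix[(len(items), item_degree[item])] += 1
    return matrix


def caterpillars(matrix):
    """Count paths with three edges in a simple bipartite graph from its BJDM.

    Every edge (u, v) is the middle edge of (deg(u) - 1) * (deg(v) - 1) such paths.
    """
    return sum(count * (i - 1) * (j - 1) for (i, j), count in matrix.items())


# Toy dataset (an assumption for illustration only).
dataset = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
matrix = bjdm(dataset)
n_caterpillars = caterpillars(matrix)
```

Because the count depends only on the entries of the matrix, any dataset with the same Bipartite Joint Degree Matrix has the same number of caterpillars, which is the preservation property the abstract describes.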
-
Early linguistic fingerprints of online users who engage with conspiracy communities
Authors:
Francesco Corso,
Giuseppe Russo,
Francesco Pierri,
Gianmarco De Francisci Morales
Abstract:
Online social media platforms are often seen as catalysts for radicalization, as they provide spaces where extreme beliefs can take root and spread, sometimes leading to real-world consequences. Conspiracy theories represent a specific form of radicalization that is notoriously resistant to online moderation strategies. One explanation for this resilience is the presence of a "conspiratorial mindset", a cognitive framework that fundamentally shapes how conspiracy believers perceive reality. However, the role of this mindset in driving online user behavior remains poorly understood. In this study, we analyze the psycholinguistic patterns of Reddit users who become active in a prominent conspiracy community by examining their activity in mainstream communities, which allows us to isolate linguistic markers for the presence of a conspiratorial mindset. We find that conspiracy-engaged individuals exhibit distinct psycholinguistic fingerprints, setting them apart from the general user population. Crucially, this signal is already evident in their online activity prior to joining the conspiracy community, allowing us to predict their involvement years in advance. These findings suggest that individuals who adopt conspiracy beliefs do not radicalize through community involvement, but possess a pre-existing conspiratorial mindset, which predisposes them to seek out and join extreme communities. By challenging the view that online social media platforms actively radicalize users into conspiracy theory beliefs, our findings suggest that standard moderation strategies have limited impact on curbing radicalization, and highlight the need for more targeted, supportive interventions that encourage disengagement from extremist narratives. Ultimately, this work contributes to fostering safer online and offline environments for public discourse.
Submitted 5 June, 2025;
originally announced June 2025.
-
The Dark Side of Digital Twins: Adversarial Attacks on AI-Driven Water Forecasting
Authors:
Mohammadhossein Homaei,
Victor Gonzalez Morales,
Oscar Mogollon-Gutierrez,
Andres Caro
Abstract:
Digital twins (DTs) are improving water distribution systems by using real-time data, analytics, and prediction models to optimize operations. This paper presents a DT platform designed for a Spanish water supply network, utilizing Long Short-Term Memory (LSTM) networks to predict water consumption. However, machine learning models are vulnerable to adversarial attacks, such as the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD). These attacks manipulate critical model parameters, injecting subtle distortions that degrade forecasting accuracy. To further exploit these vulnerabilities, we introduce a Learning Automata (LA) and Random LA-based approach that dynamically adjusts perturbations, making adversarial attacks more difficult to detect. Experimental results show that this approach significantly impacts prediction reliability, causing the Mean Absolute Percentage Error (MAPE) to rise from 26% to over 35%. Moreover, adaptive attack strategies amplify this effect, highlighting cybersecurity risks in AI-driven DTs. These findings emphasize the urgent need for robust defenses, including adversarial training, anomaly detection, and secure data pipelines.
Submitted 28 April, 2025;
originally announced April 2025.
-
Smart Water Security with AI and Blockchain-Enhanced Digital Twins
Authors:
Mohammadhossein Homaei,
Victor Gonzalez Morales,
Oscar Mogollon Gutierrez,
Ruben Molano Gomez,
Andres Caro
Abstract:
Water distribution systems in rural areas face serious challenges such as a lack of real-time monitoring, vulnerability to cyberattacks, and unreliable data handling. This paper presents an integrated framework that combines LoRaWAN-based data acquisition, a machine learning-driven Intrusion Detection System (IDS), and a blockchain-enabled Digital Twin (BC-DT) platform for secure and transparent water management. The IDS filters anomalous or spoofed data using a Long Short-Term Memory (LSTM) Autoencoder and Isolation Forest before validated data is logged via smart contracts on a private Ethereum blockchain using Proof of Authority (PoA) consensus. The verified data feeds into a real-time DT model supporting leak detection, consumption forecasting, and predictive maintenance. Experimental results demonstrate that the system achieves over 80 transactions per second (TPS) with under 2 seconds of latency while remaining cost-effective and scalable for up to 1,000 smart meters. This work demonstrates a practical and secure architecture for decentralized water infrastructure in under-connected rural environments.
Submitted 28 April, 2025;
originally announced April 2025.
-
On the Inference of Sociodemographics on Reddit
Authors:
Federico Cinus,
Corrado Monti,
Paolo Bajardi,
Gianmarco De Francisci Morales
Abstract:
Inference of sociodemographic attributes of social media users is an essential step for computational social science (CSS) research to link online and offline behavior. However, there is a lack of systematic evaluation and clear guidelines for optimal methodologies for this task on Reddit, one of today's largest social media platforms. In this study, we fill this gap by comparing state-of-the-art (SOTA) and probabilistic models.
To this end, first we collect a novel data set of more than 850k self-declarations on age, gender, and partisan affiliation from Reddit comments. Then, we systematically compare alternatives to the widely used embedding-based model and labeling techniques for the definition of the ground-truth. We do so on two tasks: (i) predicting binary labels (classification); and (ii) predicting the prevalence of a demographic class among a set of users (quantification).
Our findings reveal that Naive Bayes models not only offer transparency and interpretability by design but also consistently outperform the SOTA. Specifically, they achieve an improvement in ROC AUC of up to 19% and maintain a mean absolute error (MAE) below 15% in quantification for large-scale data settings. Finally, we discuss best practices for researchers in CSS, emphasizing coverage, interpretability, reliability, and scalability.
The code and model weights used for the experiments are publicly available (https://anonymous.4open.science/r/SDI-submission-5234).
Submitted 7 February, 2025;
originally announced February 2025.
-
Adaptive Sampling to Reduce Epistemic Uncertainty Using Prediction Interval-Generation Neural Networks
Authors:
Giorgio Morales,
John Sheppard
Abstract:
Obtaining high certainty in predictive models is crucial for making informed and trustworthy decisions in many scientific and engineering domains. However, extensive experimentation required for model accuracy can be both costly and time-consuming. This paper presents an adaptive sampling approach designed to reduce epistemic uncertainty in predictive models. Our primary contribution is the development of a metric that estimates potential epistemic uncertainty leveraging prediction interval-generation neural networks. This estimation relies on the distance between the predicted upper and lower bounds and the observed data at the tested positions and their neighboring points. Our second contribution is the proposal of a batch sampling strategy based on Gaussian processes (GPs). A GP is used as a surrogate model of the networks trained at each iteration of the adaptive sampling process. Using this GP, we design an acquisition function that selects a combination of sampling locations to maximize the reduction of epistemic uncertainty across the domain. We test our approach on three unidimensional synthetic problems and a multi-dimensional dataset based on an agricultural field for selecting experimental fertilizer rates. The results demonstrate that our method consistently converges faster to minimum epistemic uncertainty levels compared to Normalizing Flows Ensembles, MC-Dropout, and simple GPs.
Submitted 13 December, 2024;
originally announced December 2024.
-
GPU Sharing with Triples Mode
Authors:
Chansup Byun,
Albert Reuther,
LaToya Anderson,
William Arcand,
Bill Bergeron,
David Bestor,
Alexander Bonn,
Daniel Burrill,
Vijay Gadepally,
Michael Houle,
Matthew Hubbell,
Hayden Jananthan,
Michael Jones,
Piotr Luszczek,
Peter Michaleas,
Lauren Milechin,
Guillermo Morales,
Julie Mullen,
Andrew Prout,
Antonio Rosa,
Charles Yee,
Jeremy Kepner
Abstract:
There is a tremendous amount of interest in AI/ML technologies due to the proliferation of generative AI applications such as ChatGPT. This trend has significantly increased demand on GPUs, which are the workhorses for training AI models. Due to the high cost and limited supply of GPUs, it has become of interest to optimize GPU usage in HPC centers. MIT Lincoln Laboratory Supercomputing Center (LLSC) has developed an easy-to-use GPU sharing feature supported by LLSC-developed tools including LLsub and LLMapReduce. This approach overcomes some of the limitations of the existing methods for GPU sharing. It allows users to apply GPU sharing whenever possible while they are developing their AI/ML models, performing parametric studies on their AI models, or executing other GPU applications. Based on our initial experimental results, GPU sharing with triples mode is easy to use and achieved significant improvement in GPU usage and throughput performance for certain types of AI applications.
Submitted 29 October, 2024;
originally announced October 2024.
-
LLload: An Easy-to-Use HPC Utilization Tool
Authors:
Chansup Byun,
Albert Reuther,
Julie Mullen,
LaToya Anderson,
William Arcand,
Bill Bergeron,
David Bestor,
Alexander Bonn,
Daniel Burrill,
Vijay Gadepally,
Michael Houle,
Matthew Hubbell,
Hayden Jananthan,
Michael Jones,
Piotr Luszczek,
Peter Michaleas,
Lauren Milechin,
Guillermo Morales,
Andrew Prout,
Antonio Rosa,
Charles Yee,
Jeremy Kepner
Abstract:
The increasing use and cost of high performance computing (HPC) requires new easy-to-use tools to enable HPC users and HPC systems engineers to transparently understand the utilization of resources. The MIT Lincoln Laboratory Supercomputing Center (LLSC) has developed a simple command, LLload, to monitor and characterize HPC workloads. LLload plays an important role in identifying opportunities for better utilization of compute resources. LLload can be used to monitor jobs both programmatically and interactively. LLload can characterize users' jobs using various LLload options to achieve better efficiency. This information can be used to inform the user to optimize HPC workloads and improve both CPU and GPU utilization. This includes improvements using judicious oversubscription of the computing resources. Preliminary results suggest significant improvement in GPU utilization and overall throughput performance with GPU overloading in some cases. By enabling users to observe and fix incorrect job submission and/or inappropriate execution setups, LLload can increase the resource usage and improve the overall throughput performance. LLload is a light-weight, easy-to-use tool for both HPC users and HPC systems engineers to monitor HPC workloads to improve system utilization and efficiency.
Submitted 28 October, 2024;
originally announced October 2024.
-
Causal Modeling of Climate Activism on Reddit
Authors:
Jacopo Lenti,
Luca Maria Aiello,
Corrado Monti,
Gianmarco De Francisci Morales
Abstract:
Climate activism is crucial in stimulating collective societal and behavioral change towards sustainable practices through political pressure. Although multiple factors contribute to the participation in activism, their complex relationships and the scarcity of data on their interactions have restricted most prior research to studying them in isolation, thus preventing the development of a quantitative, causal understanding of why people approach activism. In this work, we develop a comprehensive causal model of how and why Reddit users engage with activist communities driving mass climate protests (mainly the 2019 Earth Strike, Fridays for Future, and Extinction Rebellion). Our framework, based on Stochastic Variational Inference applied to Bayesian Networks, learns the causal pathways over multiple time periods. Distinct from previous studies, our approach uses large-scale and fine-grained longitudinal data (2016 to 2022) to jointly model the roles of sociodemographic makeup, experience of extreme weather events, exposure to climate-related news, and social influence through online interactions. We find that among users interested in climate change, participation in online activist communities is indeed influenced by direct interactions with activists and largely by recent exposure to media coverage of climate protests. Among people aware of climate change, left-leaning people from lower socioeconomic backgrounds are particularly represented in online activist groups. Our findings offer empirical validation for theories of media influence and critical mass, and lay the foundations to inform interventions and future studies to foster public participation in collective action.
Submitted 14 October, 2024;
originally announced October 2024.
-
HPC with Enhanced User Separation
Authors:
Andrew Prout,
Albert Reuther,
Michael Houle,
Michael Jones,
Peter Michaleas,
LaToya Anderson,
William Arcand,
Bill Bergeron,
David Bestor,
Alex Bonn,
Daniel Burrill,
Chansup Byun,
Vijay Gadepally,
Matthew Hubbell,
Hayden Jananthan,
Piotr Luszczek,
Lauren Milechin,
Guillermo Morales,
Julie Mullen,
Antonio Rosa,
Charles Yee,
Jeremy Kepner
Abstract:
HPC systems used for research run a wide variety of software and workflows. This software is often written or modified by users to meet the needs of their research projects, and rarely is built with security in mind. In this paper we explore several of the key techniques that MIT Lincoln Laboratory Supercomputing Center has deployed on its systems to manage the security implications of these workflows by providing enforced separation for processes, filesystem access, network traffic, and accelerators to make every user feel like they are running on a personal HPC.
Submitted 16 September, 2024;
originally announced September 2024.
-
Anonymized Network Sensing Graph Challenge
Authors:
Hayden Jananthan,
Michael Jones,
William Arcand,
David Bestor,
William Bergeron,
Daniel Burrill,
Aydin Buluc,
Chansup Byun,
Timothy Davis,
Vijay Gadepally,
Daniel Grant,
Michael Houle,
Matthew Hubbell,
Piotr Luszczek,
Peter Michaleas,
Lauren Milechin,
Chasen Milner,
Guillermo Morales,
Andrew Morris,
Julie Mullen,
Ritesh Patel,
Alex Pentland,
Sandeep Pisharody,
Andrew Prout,
Albert Reuther, et al. (4 additional authors not shown)
Abstract:
The MIT/IEEE/Amazon GraphChallenge encourages community approaches to developing new solutions for analyzing graphs and sparse data derived from social media, sensor feeds, and scientific data to discover relationships between events as they unfold in the field. The anonymized network sensing Graph Challenge seeks to enable large, open, community-based approaches to protecting networks. Many large-scale networking problems can only be solved with community access to very broad data sets with the highest regard for privacy and strong community buy-in. Such approaches often require community-based data sharing. In the broader networking community (commercial, federal, and academia) anonymized source-to-destination traffic matrices with standard data sharing agreements have emerged as a data product that can meet many of these requirements. This challenge provides an opportunity to highlight novel approaches for optimizing the construction and analysis of anonymized traffic matrices using over 100 billion network packets derived from the largest Internet telescope in the world (CAIDA). This challenge specifies the anonymization, construction, and analysis of these traffic matrices. A GraphBLAS reference implementation is provided, but the use of GraphBLAS is not required in this Graph Challenge. As with prior Graph Challenges the goal is to provide a well-defined context for demonstrating innovation. Graph Challenge participants are free to select (with accompanying explanation) the Graph Challenge elements that are appropriate for highlighting their innovations.
Submitted 12 September, 2024;
originally announced September 2024.
-
Bifurcation Identification for Ultrasound-driven Robotic Cannulation
Authors:
Cecilia G. Morales,
Dhruv Srikanth,
Jack H. Good,
Keith A. Dufendach,
Artur Dubrawski
Abstract:
In trauma and critical care settings, rapid and precise intravascular access is key to patients' survival. Our research aims at ensuring this access, even when skilled medical personnel are not readily available. Vessel bifurcations are anatomical landmarks that can guide the safe placement of catheters or needles during medical procedures. Although ultrasound is advantageous in navigating anatomical landmarks in emergency scenarios due to its portability and safety, to our knowledge no existing algorithm can autonomously extract vessel bifurcations using ultrasound images. This is primarily due to the limited availability of ground truth data, in particular, data from live subjects, needed for training and validating reliable models. Researchers often resort to using data from anatomical phantoms or simulations. We introduce BIFURC, Bifurcation Identification for Ultrasound-driven Robot Cannulation, a novel algorithm that identifies vessel bifurcations and provides optimal needle insertion sites for an autonomous robotic cannulation system. BIFURC integrates expert knowledge with deep learning techniques to efficiently detect vessel bifurcations within the femoral region and can be trained on a limited amount of in-vivo data. We evaluated our algorithm using a medical phantom as well as real-world experiments involving live pigs. In all cases, BIFURC consistently identified bifurcation points and needle insertion locations in alignment with those identified by expert clinicians.
Submitted 10 September, 2024;
originally announced September 2024.
-
What is Normal? A Big Data Observational Science Model of Anonymized Internet Traffic
Authors:
Jeremy Kepner,
Hayden Jananthan,
Michael Jones,
William Arcand,
David Bestor,
William Bergeron,
Daniel Burrill,
Aydin Buluc,
Chansup Byun,
Timothy Davis,
Vijay Gadepally,
Daniel Grant,
Michael Houle,
Matthew Hubbell,
Piotr Luszczek,
Lauren Milechin,
Chasen Milner,
Guillermo Morales,
Andrew Morris,
Julie Mullen,
Ritesh Patel,
Alex Pentland,
Sandeep Pisharody,
Andrew Prout,
Albert Reuther, et al. (4 additional authors not shown)
Abstract:
Understanding what is normal is a key aspect of protecting a domain. Other domains invest heavily in observational science to develop models of normal behavior to better detect anomalies. Recent advances in high performance graph libraries, such as the GraphBLAS, coupled with supercomputers enable processing of the trillions of observations required. We leverage this approach to synthesize low-parameter observational models of anonymized Internet traffic with a high regard for privacy.
Submitted 4 September, 2024;
originally announced September 2024.
-
Polaris: Sampling from the Multigraph Configuration Model with Prescribed Color Assortativity
Authors:
Giulia Preti,
Matteo Riondato,
Aristides Gionis,
Gianmarco De Francisci Morales
Abstract:
We introduce Polaris, a network null model for colored multi-graphs that preserves the Joint Color Matrix. Polaris is specifically designed for studying network polarization, where vertices belong to a side in a debate or a partisan group, represented by a vertex color, and relations have different strengths, represented by an integer-valued edge multiplicity. The key feature of Polaris is preserving the Joint Color Matrix (JCM) of the multigraph, which specifies the number of edges connecting vertices of any two given colors. The JCM is the basic property that determines color assortativity, a fundamental aspect in studying homophily and segregation in polarized networks. By using Polaris, network scientists can test whether a phenomenon is entirely explained by the JCM of the observed network or whether other phenomena might be at play. Technically, our null model is an extension of the configuration model: an ensemble of colored multigraphs characterized by the same degree sequence and the same JCM. To sample from this ensemble, we develop a suite of Markov Chain Monte Carlo algorithms, collectively named Polaris-*. It includes Polaris-B, an adaptation of a generic Metropolis-Hastings algorithm, and Polaris-C, a faster, specialized algorithm with higher acceptance probabilities. This new null model and the associated algorithms provide a more nuanced toolset for examining polarization in social networks, thus enabling statistically sound conclusions.
Submitted 18 December, 2024; v1 submitted 2 September, 2024;
originally announced September 2024.
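To make the Joint Color Matrix described in the abstract above concrete, here is a minimal sketch that tallies edges by the colors of their endpoints. It is not the Polaris sampler. The function name `joint_color_matrix`, the edge-list layout, and the toy graph are hypothetical, and counting each edge with its multiplicity is an assumption about how the matrix is defined for multigraphs.

```python
from collections import Counter


def joint_color_matrix(edges, color):
    """Map an unordered color pair -> number of edges joining those colors.

    Hypothetical helper: `edges` holds (u, v, multiplicity) triples and
    `color` maps each vertex to its color. Multiplicities are summed
    (an assumption for illustration).
    """
    jcm = Counter()
    for u, v, multiplicity in edges:
        # Sort the pair so (red, blue) and (blue, red) share one entry.
        pair = tuple(sorted((color[u], color[v])))
        jcm[pair] += multiplicity
    return jcm


# Toy colored multigraph (an assumption for illustration only).
color = {1: "red", 2: "red", 3: "blue", 4: "blue"}
edges = [(1, 2, 2), (1, 3, 1), (2, 4, 3), (3, 4, 1)]
jcm = joint_color_matrix(edges, color)
```

In this toy graph, four of the seven edge units cross colors, two join red vertices, and one joins blue vertices. A null model that fixes these counts, together with the degree sequence, holds color assortativity constant while other structure varies.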
-
Moral Judgments in Online Discourse are not Biased by Gender
Authors:
Lorenzo Betti,
Paolo Bajardi,
Gianmarco De Francisci Morales
Abstract:
The interaction between social norms and gender roles prescribes gender-specific behaviors that influence moral judgments. Here, we study how moral judgments are biased by the gender of the protagonist of a story. Using data from r/AITA, a Reddit community with 17 million members who share first-hand experiences seeking community judgment on their behavior, we employ machine learning techniques to match stories describing similar situations that differ only by the protagonist's gender. We find no direct causal effect of the protagonist's gender on the received moral judgments, except for stories about "friendship and relationships", where male protagonists receive more negative judgments. Our findings complement existing correlational studies and suggest that gender roles may exert greater influence in specific social contexts. These results have implications for understanding sociological constructs and highlight potential biases in data used to train large language models.
Submitted 23 August, 2024;
originally announced August 2024.
-
Enhanced Uncertainty Estimation in Ultrasound Image Segmentation with MSU-Net
Authors:
Rohini Banerjee,
Cecilia G. Morales,
Artur Dubrawski
Abstract:
Efficient intravascular access in trauma and critical care significantly impacts patient outcomes. However, the availability of skilled medical personnel in austere environments is often limited. Autonomous robotic ultrasound systems can aid in needle insertion for medication delivery and support non-experts in such tasks. Despite advances in autonomous needle insertion, inaccuracies in vessel segmentation predictions pose risks. Understanding the uncertainty of predictive models in ultrasound imaging is crucial for assessing their reliability. We introduce MSU-Net, a novel multistage approach for training an ensemble of U-Nets to yield accurate ultrasound image segmentation maps. We demonstrate substantial improvements, 18.1% over a single Monte Carlo U-Net, enhancing uncertainty evaluations, model transparency, and trustworthiness. By highlighting areas of model certainty, MSU-Net can guide safe needle insertions, empowering non-experts to accomplish such tasks.
Submitted 30 July, 2024;
originally announced July 2024.
-
Conspiracy theories and where to find them on TikTok
Authors:
Francesco Corso,
Francesco Pierri,
Gianmarco De Francisci Morales
Abstract:
TikTok has skyrocketed in popularity over recent years, especially among younger audiences. However, there are public concerns about the potential of this platform to promote and amplify harmful content. This study presents the first systematic analysis of conspiracy theories on TikTok. By leveraging the official TikTok Research API we collect a longitudinal dataset of 1.5M videos shared in the U.S. over three years. We estimate a lower bound on the prevalence of conspiratorial videos (up to 1000 new videos per month) and evaluate the effects of TikTok's Creativity Program for monetization, observing an overall increase in video duration regardless of content. Lastly, we evaluate the capabilities of state-of-the-art open-weight Large Language Models to identify conspiracy theories from audio transcriptions of videos. While these models achieve high precision in detecting harmful content (up to 96%), their overall performance remains comparable to fine-tuned traditional models such as RoBERTa. Our findings suggest that Large Language Models can serve as an effective tool for supporting content moderation strategies aimed at reducing the spread of harmful content on TikTok.
Submitted 19 May, 2025; v1 submitted 17 July, 2024;
originally announced July 2024.
-
LLload: Simplifying Real-Time Job Monitoring for HPC Users
Authors:
Chansup Byun,
Julia Mullen,
Albert Reuther,
William Arcand,
William Bergeron,
David Bestor,
Daniel Burrill,
Vijay Gadepally,
Michael Houle,
Matthew Hubbell,
Hayden Jananthan,
Michael Jones,
Peter Michaleas,
Guillermo Morales,
Andrew Prout,
Antonio Rosa,
Charles Yee,
Jeremy Kepner,
Lauren Milechin
Abstract:
One of the more complex tasks for researchers using HPC systems is performance monitoring and tuning of their applications. Developing a practice of continuous performance improvement, both for speed-up and efficient use of resources, is essential to the long-term success of both the HPC practitioner and the research project. Profiling tools provide a nice view of the performance of an application but often have a steep learning curve and rarely provide an easy-to-interpret view of resource utilization. Lower-level tools such as top and htop provide a view of resource utilization for those familiar and comfortable with Linux but present a barrier for newer HPC practitioners. To expand the existing profiling and job monitoring options, the MIT Lincoln Laboratory Supercomputing Center created LLload, a tool that captures a snapshot of the resources being used by a job on a per-user basis. LLload is a tool built from standard HPC tools that provides an easy way for a researcher to track resource usage of active jobs. We explain how the tool was designed and implemented and provide insight into how it is used to aid new researchers in developing their performance monitoring skills as well as guide researchers in their resource requests.
Submitted 1 July, 2024;
originally announced July 2024.
-
Univariate Skeleton Prediction in Multivariate Systems Using Transformers
Authors:
Giorgio Morales,
John W. Sheppard
Abstract:
Symbolic regression (SR) methods attempt to learn mathematical expressions that approximate the behavior of an observed system. However, when dealing with multivariate systems, they often fail to identify the functional form that explains the relationship between each variable and the system's response. To begin to address this, we propose an explainable neural SR method that generates univariate symbolic skeletons that aim to explain how each variable influences the system's response. By analyzing multiple sets of data generated artificially, where one input variable varies while others are fixed, relationships are modeled separately for each input variable. The response of such artificial data sets is estimated using a regression neural network (NN). Finally, the multiple sets of input-response pairs are processed by a pre-trained Multi-Set Transformer that solves a problem we termed Multi-Set Skeleton Prediction and outputs a univariate symbolic skeleton. Thus, such skeletons represent explanations of the function approximated by the regression NN. Experimental results demonstrate that this method learns skeleton expressions matching the underlying functions and outperforms two GP-based and two neural SR methods.
Submitted 25 June, 2024;
originally announced June 2024.
-
Counterfactual Analysis of Neural Networks Used to Create Fertilizer Management Zones
Authors:
Giorgio Morales,
John Sheppard
Abstract:
In Precision Agriculture, the utilization of management zones (MZs) that take into account within-field variability facilitates effective fertilizer management. This approach enables the optimization of nitrogen (N) rates to maximize crop yield production and enhance agronomic use efficiency. However, existing works often neglect the consideration of responsivity to fertilizer as a factor influencing MZ determination. In response to this gap, we present an MZ clustering method based on fertilizer responsivity. We build upon the statement that the responsivity of a given site to the fertilizer rate is described by the shape of its corresponding N fertilizer-yield response (N-response) curve. Thus, we generate N-response curves for all sites within the field using a convolutional neural network (CNN). The shape of the approximated N-response curves is then characterized using functional principal component analysis. Subsequently, a counterfactual explanation (CFE) method is applied to discern the impact of various variables on MZ membership. The genetic algorithm-based CFE solves a multi-objective optimization problem and aims to identify the minimum combination of features needed to alter a site's cluster assignment. Results from two yield prediction datasets indicate that the features with the greatest influence on MZ membership are associated with terrain characteristics that either facilitate or impede fertilizer runoff, such as terrain slope or topographic aspect.
Submitted 15 March, 2024;
originally announced March 2024.
-
Variational Inference of Parameters in Opinion Dynamics Models
Authors:
Jacopo Lenti,
Fabrizio Silvestri,
Gianmarco De Francisci Morales
Abstract:
Despite the frequent use of agent-based models (ABMs) for studying social phenomena, parameter estimation remains a challenge, often relying on costly simulation-based heuristics. This work uses variational inference to estimate the parameters of an opinion dynamics ABM, by transforming the estimation problem into an optimization task that can be solved directly.
Our proposal relies on probabilistic generative ABMs (PGABMs): we start by synthesizing a probabilistic generative model from the ABM rules. Then, we transform the inference process into an optimization problem suitable for automatic differentiation. In particular, we use the Gumbel-Softmax reparameterization for categorical agent attributes and stochastic variational inference for parameter estimation. Furthermore, we explore the trade-offs of using variational distributions with different complexity: normal distributions and normalizing flows.
We validate our method on a bounded confidence model with agent roles (leaders and followers). Our approach estimates both macroscopic (bounded confidence intervals and backfire thresholds) and microscopic (200 categorical, agent-level roles) parameters more accurately than simulation-based and MCMC methods. Consequently, our technique enables experts to tune and validate their ABMs against real-world observations, thus providing insights into human behavior in social systems via data-driven analysis.
Submitted 8 March, 2024;
originally announced March 2024.
-
Higher-order null models as a lens for social systems
Authors:
Giulia Preti,
Adriano Fazzone,
Giovanni Petri,
Gianmarco De Francisci Morales
Abstract:
Despite the widespread adoption of higher-order mathematical structures such as hypergraphs, methodological tools for their analysis lag behind those for traditional graphs. This work addresses a critical gap in this context by proposing two micro-canonical random null models for directed hypergraphs: the Directed Hypergraph Configuration Model (DHCM) and the Directed Hypergraph JOINT Model (DHJM). These models preserve essential structural properties of directed hypergraphs such as node in- and out-degree sequences and hyperedge head and tail size sequences, or their joint tensor. We also describe two efficient MCMC algorithms, NuDHy-Degs and NuDHy-JOINT, to sample random hypergraphs from these ensembles.
To showcase the interdisciplinary applicability of the proposed null models, we present three distinct use cases in sociology, epidemiology, and economics. First, we reveal the oscillatory behavior of increased homophily in opposition parties in the US Congress over a 40-year span, emphasizing the role of higher-order structures in quantifying political group homophily. Second, we investigate non-linear contagion in contact hyper-networks, demonstrating that disparities between simulations and theoretical predictions can be explained by considering higher-order joint degree distributions. Last, we examine the economic complexity of countries in the global trade network, showing that local network properties preserved by NuDHy explain the main structural economic complexity indexes.
This work advances the development of null models for directed hypergraphs, addressing the intricate challenges posed by their complex entity relations, and providing a versatile suite of tools for researchers across various domains.
Submitted 17 September, 2024; v1 submitted 28 February, 2024;
originally announced February 2024.
-
What we can learn from TikTok through its Research API
Authors:
Francesco Corso,
Francesco Pierri,
Gianmarco De Francisci Morales
Abstract:
TikTok is a social media platform that has gained immense popularity over the last few years, particularly among younger demographics, due to the viral trends and challenges shared worldwide. The recent release of a free Research API opens the door to collecting data on posted videos, associated comments, and user activities. Our study focuses on evaluating the reliability of the results returned by the Research API, by collecting and analyzing a random sample of TikTok videos posted in a span of 6 years. Our preliminary results are instrumental for future research that aims to study the platform, highlighting caveats on the geographical distribution of videos and on the global prevalence of viral and conspiratorial hashtags.
Submitted 4 April, 2024; v1 submitted 21 February, 2024;
originally announced February 2024.
-
Navigating Multidimensional Ideologies with Reddit's Political Compass: Economic Conflict and Social Affinity
Authors:
Ernesto Colacrai,
Federico Cinus,
Gianmarco De Francisci Morales,
Michele Starnini
Abstract:
The prevalent perspective in quantitative research on opinion dynamics flattens the landscape of the online political discourse into a traditional left–right dichotomy. While this approach helps simplify the analysis and modeling effort, it also neglects the intrinsic multidimensional richness of ideologies. In this study, we analyze social interactions on Reddit, under the lens of a multi-dimensional ideological framework: the political compass. We examine over 8 million comments posted on the subreddits /r/PoliticalCompass and /r/PoliticalCompassMemes during 2020–2022. By leveraging their self-declarations, we disentangle the ideological dimensions of users into economic (left–right) and social (libertarian–authoritarian) axes. In addition, we characterize users by their demographic attributes (age, gender, and affluence).
We find significant homophily for interactions along the social axis of the political compass and demographic attributes. Compared to a null model, interactions among individuals of similar ideology surpass expectations by 6%. In contrast, we uncover a significant heterophily along the economic axis: left/right interactions exceed expectations by 10%. Furthermore, heterophilic interactions are characterized by a higher language toxicity than homophilic interactions, which hints at a conflictual discourse between every opposite ideology. Our results help reconcile apparent contradictions in recent literature, which found a superposition of homophilic and heterophilic interactions in online political discussions. By disentangling such interactions into the economic and social axes we pave the way for a deeper understanding of opinion dynamics on social media.
Submitted 24 January, 2024;
originally announced January 2024.
-
Extracting the Multiscale Causal Backbone of Brain Dynamics
Authors:
Gabriele D'Acunto,
Francesco Bonchi,
Gianmarco De Francisci Morales,
Giovanni Petri
Abstract:
The bulk of the research effort on brain connectivity revolves around statistical associations among brain regions, which do not directly relate to the causal mechanisms governing brain dynamics. Here we propose the multiscale causal backbone (MCB) of brain dynamics, shared by a set of individuals across multiple temporal scales, and devise a principled methodology to extract it.
Our approach leverages recent advances in multiscale causal structure learning and optimizes the trade-off between the model fit and its complexity. Empirical assessment on synthetic data shows the superiority of our methodology over a baseline based on canonical functional connectivity networks. When applied to resting-state fMRI data, we find sparse MCBs for both the left and right brain hemispheres. Thanks to its multiscale nature, our approach shows that at low-frequency bands, causal dynamics are driven by brain regions associated with high-level cognitive functions; at higher frequencies instead, nodes related to sensory processing play a crucial role. Finally, our analysis of individual multiscale causal structures confirms the existence of a causal fingerprint of brain connectivity, thus supporting the existing extensive research in brain connectivity fingerprinting from a causal perspective.
Submitted 19 March, 2024; v1 submitted 31 October, 2023;
originally announced November 2023.
-
Measuring Behavior Change with Observational Studies: a Review
Authors:
Arianna Pera,
Gianmarco de Francisci Morales,
Luca Maria Aiello
Abstract:
Exploring behavioral change in the digital age is imperative for societal progress in the context of 21st-century challenges. We analyzed 148 articles (2000-2023) and built a map that categorizes behaviors and change detection methodologies, platforms of reference, and theoretical frameworks that characterize online behavior change. Our findings uncover a focus on sentiment shifts, an emphasis on API-restricted platforms, and limited theory integration. We call for methodologies able to capture a wider range of behavioral types, diverse data sources, and stronger theory-practice alignment in the study of online behavioral change.
Submitted 2 November, 2023; v1 submitted 30 October, 2023;
originally announced October 2023.
-
Systematic discrepancies in the delivery of political ads on Facebook and Instagram
Authors:
Dominik Bär,
Francesco Pierri,
Gianmarco De Francisci Morales,
Stefan Feuerriegel
Abstract:
Political advertising on social media has become a central element in election campaigns. However, granular information about political advertising on social media was previously unavailable, thus raising concerns regarding fairness, accountability, and transparency in the electoral process. In this paper, we analyze targeted political advertising on social media via a unique, large-scale dataset of over 80000 political ads from Meta during the 2021 German federal election, with more than 1.1 billion impressions. For each political ad, our dataset records granular information about targeting strategies, spending, and actual impressions. We then study (i) the prevalence of targeted ads across the political spectrum; (ii) the discrepancies between targeted and actual audiences due to algorithmic ad delivery; and (iii) which targeting strategies on social media attain a wide reach at low cost. We find that targeted ads are prevalent across the entire political spectrum. Moreover, there are considerable discrepancies between targeted and actual audiences, and systematic differences in the reach of political ads (in impressions-per-EUR) among parties, where the algorithm favors ads from populists over others.
Submitted 24 June, 2024; v1 submitted 15 October, 2023;
originally announced October 2023.
-
Likelihood-Based Methods Improve Parameter Estimation in Opinion Dynamics Models
Authors:
Jacopo Lenti,
Corrado Monti,
Gianmarco De Francisci Morales
Abstract:
We show that a maximum likelihood approach for parameter estimation in agent-based models (ABMs) of opinion dynamics outperforms the typical simulation-based approach. Simulation-based approaches simulate the model repeatedly in search of a set of parameters that generates data similar enough to the observed one. In contrast, likelihood-based approaches derive a likelihood function that connects the unknown parameters to the observed data in a statistically principled way. We compare these two approaches on the well-known bounded-confidence model of opinion dynamics. We do so on three realistic scenarios of increasing complexity depending on data availability: (i) fully observed opinions and interactions, (ii) partially observed interactions, (iii) observed interactions with noisy proxies of the opinions. We highlight how identifying observed and latent variables is fundamental for connecting the model to the data. To realize the likelihood-based approach, we first cast the model into a probabilistic generative guise that supports a proper data likelihood. Then, we describe the three scenarios via probabilistic graphical models and show the nuances that go into translating the model. Finally, we implement the resulting probabilistic models in an automatic differentiation framework (PyTorch). This step enables easy and efficient maximum likelihood estimation via gradient descent. Our experimental results show that the maximum likelihood estimates are up to 4x more accurate and require up to 200x less computational time.
Submitted 5 October, 2023; v1 submitted 4 October, 2023;
originally announced October 2023.
-
Mapping of Internet "Coastlines" via Large Scale Anonymized Network Source Correlations
Authors:
Hayden Jananthan,
Jeremy Kepner,
Michael Jones,
William Arcand,
David Bestor,
William Bergeron,
Chansup Byun,
Timothy Davis,
Vijay Gadepally,
Daniel Grant,
Michael Houle,
Matthew Hubbell,
Anna Klein,
Lauren Milechin,
Guillermo Morales,
Andrew Morris,
Julie Mullen,
Ritesh Patel,
Alex Pentland,
Sandeep Pisharody,
Andrew Prout,
Albert Reuther,
Antonio Rosa,
Siddharth Samsi,
Tyler Trigg
, et al. (3 additional authors not shown)
Abstract:
Expanding the scientific tools available to protect computer networks can be aided by a deeper understanding of the underlying statistical distributions of network traffic and their potential geometric interpretations. Analyses of large scale network observations provide a unique window into studying those underlying statistics. Newly developed GraphBLAS hypersparse matrices and D4M associative array technologies enable the efficient anonymized analysis of network traffic on the scale of trillions of events. This work analyzes over 100,000,000,000 anonymized packets from the largest Internet telescope (CAIDA) and over 10,000,000 anonymized sources from the largest commercial honeyfarm (GreyNoise). Neither CAIDA nor GreyNoise actively emits Internet traffic, and both provide distinct observations of unsolicited Internet traffic (primarily botnets and scanners). Analysis of these observations confirms the previously observed Cauchy-like distributions describing temporal correlations between Internet sources. The Gull lighthouse problem is a well-known geometric characterization of the standard Cauchy distribution and motivates a potential geometric interpretation for Internet observations. This work generalizes the Gull lighthouse problem to accommodate larger classes of coastlines, deriving a closed-form solution for the resulting probability distributions, stating and examining the inverse problem of identifying an appropriate coastline given a continuous probability distribution, identifying a geometric heuristic for solving this problem computationally, and applying that heuristic to examine the temporal geometry of different subsets of network observations. Application of this method to the CAIDA and GreyNoise data reveals a difference of several orders of magnitude between known benign and other traffic, which can lead to potentially novel ways to protect networks.
Submitted 30 September, 2023;
originally announced October 2023.
-
Narratives of War: Ukrainian Memetic Warfare on Twitter
Authors:
Yelena Mejova,
Arthur Capozzi,
Corrado Monti,
Gianmarco De Francisci Morales
Abstract:
The 2022 Russian invasion of Ukraine has seen an intensification in the use of social media by governmental actors in cyber warfare. Wartime communication via memes has been a successful strategy used not only by independent accounts such as @uamemesforces, but also, for the first time in a full-scale interstate war, by official Ukrainian government accounts such as @Ukraine and @DefenceU. We study this prominent example of memetic warfare through the lens of its narratives, and find them to be a key component of success: tweets with a 'victim' narrative garner twice as many retweets. However, malevolent narratives focusing on the enemy resonate more than those about heroism or victims with countries providing more assistance to Ukraine. Our findings present a nuanced examination of Ukraine's influence operations and of the worldwide response to it, thus contributing new insights into the evolution of socio-technical systems in times of war.
Submitted 20 January, 2025; v1 submitted 15 September, 2023;
originally announced September 2023.
-
pPython Performance Study
Authors:
Chansup Byun,
William Arcand,
David Bestor,
Bill Bergeron,
Vijay Gadepally,
Michael Houle,
Matthew Hubbell,
Hayden Jananthan,
Michael Jones,
Anna Klein,
Peter Michaleas,
Lauren Milechin,
Guillermo Morales,
Julie Mullen,
Andrew Prout,
Albert Reuther,
Antonio Rosa,
Siddharth Samsi,
Charles Yee,
Jeremy Kepner
Abstract:
pPython seeks to provide a parallel capability that provides good speed-up without sacrificing the ease of programming in Python by implementing partitioned global array semantics (PGAS) on top of a simple file-based messaging library (PythonMPI) in pure Python. pPython follows a SPMD (single program multiple data) model of computation. pPython runs on a single node (e.g., a laptop) running Windows, Linux, or MacOS operating systems or on any combination of heterogeneous systems that support Python, including on a cluster through a Slurm scheduler interface so that pPython can be executed in a massively parallel computing environment. It is interesting to see what performance pPython can achieve compared to the traditional socket-based MPI communication because of its unique file-based messaging implementation. In this paper, we present the point-to-point and collective communication performances of pPython and compare them with those obtained by using mpi4py with OpenMPI. For large messages, pPython demonstrates performance comparable to that of mpi4py.
Submitted 7 September, 2023;
originally announced September 2023.
-
Deployment of Real-Time Network Traffic Analysis using GraphBLAS Hypersparse Matrices and D4M Associative Arrays
Authors:
Michael Jones,
Jeremy Kepner,
Andrew Prout,
Timothy Davis,
William Arcand,
David Bestor,
William Bergeron,
Chansup Byun,
Vijay Gadepally,
Micheal Houle,
Matthew Hubbell,
Hayden Jananthan,
Anna Klein,
Lauren Milechin,
Guillermo Morales,
Julie Mullen,
Ritesh Patel,
Sandeep Pisharody,
Albert Reuther,
Antonio Rosa,
Siddharth Samsi,
Charles Yee,
Peter Michaleas
Abstract:
Matrix/array analysis of networks can provide significant insight into their behavior and aid in their operation and protection. Prior work has demonstrated the analytic, performance, and compression capabilities of GraphBLAS (graphblas.org) hypersparse matrices and D4M (d4m.mit.edu) associative arrays (a mathematical superset of matrices). Obtaining the benefits of these capabilities requires integrating them into operational systems, which comes with its own unique challenges. This paper describes two examples of real-time operational implementations. The first is an operational GraphBLAS implementation that constructs anonymized hypersparse matrices on a high-bandwidth network tap. The second is an operational D4M implementation that analyzes daily cloud gateway logs. The architectures of these implementations are presented. Detailed measurements of the resources and the performance are collected and analyzed. The implementations are capable of meeting their operational requirements using modest computational resources (a couple of processing cores). GraphBLAS is well-suited for low-level analysis of high-bandwidth connections with relatively structured network data. D4M is well-suited for higher-level analysis of more unstructured data. This work demonstrates that these technologies can be implemented in operational settings.
Submitted 8 December, 2023; v1 submitted 4 September, 2023;
originally announced September 2023.
-
Focusing and Calibration of Large Scale Network Sensors using GraphBLAS Anonymized Hypersparse Matrices
Authors:
Jeremy Kepner,
Michael Jones,
Phil Dykstra,
Chansup Byun,
Timothy Davis,
Hayden Jananthan,
William Arcand,
David Bestor,
William Bergeron,
Vijay Gadepally,
Micheal Houle,
Matthew Hubbell,
Anna Klein,
Lauren Milechin,
Guillermo Morales,
Julie Mullen,
Ritesh Patel,
Alex Pentland,
Sandeep Pisharody,
Andrew Prout,
Albert Reuther,
Antonio Rosa,
Siddharth Samsi,
Tyler Trigg,
Charles Yee
, et al. (1 additional authors not shown)
Abstract:
Defending community-owned cyberspace requires community-based efforts. Large-scale network observations that uphold the highest regard for privacy are key to protecting our shared cyberspace. Deployment of the necessary network sensors requires careful sensor placement, focusing, and calibration with significant volumes of network observations. This paper demonstrates novel focusing and calibration procedures on a multi-billion packet dataset using high-performance GraphBLAS anonymized hypersparse matrices. The run-time performance on a real-world dataset confirms previously observed real-time processing rates for high-bandwidth links while achieving significant data compression. The output of the analysis demonstrates the effectiveness of these procedures at focusing the traffic matrix and revealing the underlying stable heavy-tail statistical distributions that are necessary for anomaly detection. A simple model of the corresponding probability of detection ($p_{\rm d}$) and probability of false alarm ($p_{\rm fa}$) for these distributions highlights the criticality of network sensor focusing and calibration. Once a sensor is properly focused and calibrated, it is in a position to carry out two of the central tenets of good cybersecurity: (1) continuous observation of the network and (2) minimizing unbrokered network connections.
Submitted 4 September, 2023;
originally announced September 2023.
-
An impossibility result for Markov Chain Monte Carlo sampling from micro-canonical bipartite graph ensembles
Authors:
Giulia Preti,
Gianmarco De Francisci Morales,
Matteo Riondato
Abstract:
Markov Chain Monte Carlo (MCMC) algorithms are commonly used to sample from graph ensembles. Two graphs are neighbors in the state space if one can be obtained from the other with only a few modifications, e.g., edge rewirings. For many common ensembles, e.g., those preserving the degree sequences of bipartite graphs, rewiring operations involving two edges are sufficient to create a fully-connected state space, and they can be performed efficiently. We show that, for ensembles of bipartite graphs with fixed degree sequences and number of butterflies ($K_{2,2}$ bicliques), there is no universal constant c such that a rewiring of at most c edges at every step is sufficient for any such ensemble to be fully connected. Our proof relies on an explicit construction of a family of pairs of graphs with the same degree sequences and number of butterflies, with each pair indexed by a natural number c, and such that any sequence of rewiring operations transforming one graph into the other must include at least one rewiring operation involving at least c edges. Whether rewiring this many edges is sufficient to guarantee the full connectivity of the state space of any such ensemble remains an open question. Our result implies the impossibility of developing efficient, graph-agnostic MCMC algorithms for these ensembles, as the necessity to rewire an impractically large number of edges may hinder taking a step on the state space.
Submitted 10 September, 2024; v1 submitted 21 August, 2023;
originally announced August 2023.
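The two-edge rewiring mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' construction: the function names (`degree_sequences`, `count_butterflies`, `swap`, `random_swap`) and the toy edge set are assumptions made for illustration. The sketch shows that swapping the endpoints of two edges keeps both degree sequences fixed, while the number of butterflies may change, which is why such swaps alone do not stay inside a butterfly-preserving ensemble.

```python
import random
from collections import Counter
from itertools import combinations


def degree_sequences(edges):
    """Return the degree counters of the left and right node sets."""
    left = Counter(u for u, _ in edges)
    right = Counter(v for _, v in edges)
    return left, right


def count_butterflies(edges):
    """Count K_{2,2} bicliques: pairs of left nodes sharing pairs of right nodes."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
    total = 0
    for a, b in combinations(sorted(nbrs), 2):
        common = len(nbrs[a] & nbrs[b])
        total += common * (common - 1) // 2
    return total


def swap(edges, e1, e2):
    """Rewire (a,x),(b,y) into (a,y),(b,x); return edges unchanged if invalid."""
    (a, x), (b, y) = e1, e2
    if a == b or x == y or (a, y) in edges or (b, x) in edges:
        return edges  # the swap would be a no-op or create a multi-edge
    return (edges - {e1, e2}) | {(a, y), (b, x)}


def random_swap(edges, rng):
    """Pick two distinct edges uniformly at random and attempt a swap."""
    e1, e2 = rng.sample(sorted(edges), 2)
    return swap(edges, e1, e2)


# Hypothetical toy graph: left nodes are letters, right nodes are integers.
edges = frozenset({("a", 1), ("a", 2), ("b", 1), ("b", 2), ("c", 3)})
swapped = swap(edges, ("a", 1), ("c", 3))

print(degree_sequences(swapped) == degree_sequences(edges))  # True
print(count_butterflies(edges), count_butterflies(swapped))  # 1 0
```

In this toy case a single swap destroys the only butterfly, so a sampler restricted to the butterfly-preserving ensemble would have to reject it.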
-
Hyper-distance Oracles in Hypergraphs
Authors:
Giulia Preti,
Gianmarco De Francisci Morales,
Francesco Bonchi
Abstract:
We study point-to-point distance estimation in hypergraphs, where the query is parameterized by a positive integer s, which defines the required level of overlap for two hyperedges to be considered adjacent. To answer s-distance queries, we first explore an oracle based on the line graph of the given hypergraph and discuss its limitations: the main one is that the line graph is typically orders of magnitude larger than the original hypergraph. We then introduce HypED, a landmark-based oracle with a predefined size, built directly on the hypergraph, thus avoiding constructing the line graph. Our framework can approximately answer vertex-to-vertex, vertex-to-hyperedge, and hyperedge-to-hyperedge s-distance queries for any value of s. A key observation at the basis of our framework is that, as s increases, the hypergraph becomes more fragmented. We show how this can be exploited to improve the placement of landmarks, by identifying the s-connected components of the hypergraph. For this task, we devise an efficient algorithm based on the union-find technique and a dynamic inverted index. We experimentally evaluate HypED on several real-world hypergraphs and prove its versatility in answering s-distance queries for different values of s. Our framework allows answering such queries in fractions of a millisecond, while allowing fine-grained control of the trade-off between index size and approximation error at creation time. Finally, we prove the usefulness of the s-distance oracle in two applications, namely, hypergraph-based recommendation and the approximation of the s-closeness centrality of vertices and hyperedges in the context of protein-to-protein interactions.
Submitted 19 March, 2024; v1 submitted 5 June, 2023;
originally announced June 2023.
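As a rough illustration of the s-connected components mentioned in the abstract above, the following sketch groups hyperedges with a plain union-find and a static inverted index from vertices to hyperedges. It is a simplified stand-in, not the HypED algorithm: the function name `s_connected_components` and the toy hypergraph are assumptions, and the pairwise overlap counting used here would not scale the way the paper's dynamic index is meant to.

```python
from collections import defaultdict
from itertools import combinations


def s_connected_components(hyperedges, s):
    """Group hyperedge ids so that s-adjacent hyperedges (overlap >= s) are together."""
    parent = list(range(len(hyperedges)))

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    # Inverted index: vertex -> ids of the hyperedges containing it.
    index = defaultdict(list)
    for eid, edge in enumerate(hyperedges):
        for v in set(edge):
            index[v].append(eid)

    # Count shared vertices for every pair of hyperedges that co-occur in the index.
    overlap = defaultdict(int)
    for eids in index.values():
        for i, j in combinations(eids, 2):
            overlap[(i, j)] += 1

    for (i, j), size in overlap.items():
        if size >= s:
            union(i, j)

    groups = defaultdict(list)
    for eid in range(len(hyperedges)):
        groups[find(eid)].append(eid)
    return sorted(groups.values())


# Hypothetical toy hypergraph with four hyperedges.
H = [{1, 2, 3}, {2, 3, 4}, {4, 5}, {6}]
print(s_connected_components(H, 1))  # [[0, 1, 2], [3]]
print(s_connected_components(H, 2))  # [[0, 1], [2], [3]]
```

The two calls show the fragmentation effect the abstract describes: raising s from 1 to 2 splits the first component.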
-
Counterfactual Explanations of Neural Network-Generated Response Curves
Authors:
Giorgio Morales,
John Sheppard
Abstract:
Response curves exhibit the magnitude of the response of a sensitive system to a varying stimulus. However, the response of such systems may be sensitive to multiple stimuli (i.e., input features) that are not necessarily independent. As a consequence, the shape of response curves generated for a selected input feature (referred to as "active feature") might depend on the values of the other input features (referred to as "passive features"). In this work, we consider the case of systems whose response is approximated using regression neural networks. We propose to use counterfactual explanations (CFEs) for the identification of the features with the highest relevance on the shape of response curves generated by neural network black boxes. CFEs are generated by a genetic algorithm-based approach that solves a multi-objective optimization problem. In particular, given a response curve generated for an active feature, a CFE finds the minimum combination of passive features that need to be modified to alter the shape of the response curve. We tested our method on a synthetic dataset with 1-D inputs and two crop yield prediction datasets with 2-D inputs. The relevance ranking of features and feature combinations obtained on the synthetic dataset coincided with the analysis of the equation that was used to generate the problem. Results obtained on the yield prediction datasets revealed that the impact on fertilizer responsivity of passive features depends on the terrain characteristics of each field.
Submitted 13 April, 2023; v1 submitted 8 April, 2023;
originally announced April 2023.
-
Authority without Care: Moral Values behind the Mask Mandate Response
Authors:
Yelena Mejova,
Kyrieki Kalimeri,
Gianmarco De Francisci Morales
Abstract:
Face masks are one of the cheapest and most effective non-pharmaceutical interventions available against airborne diseases such as COVID-19. Unfortunately, they have been met with resistance by a substantial fraction of the populace, especially in the U.S. In this study, we uncover the latent moral values that underpin the response to the mask mandate, and paint them against the country's political backdrop. We monitor the discussion about masks on Twitter, which involves almost 600k users in a time span of 7 months. By using a combination of graph mining, natural language processing, topic modeling, content analysis, and time series analysis, we characterize the responses to the mask mandate of both those in favor and against them. We base our analysis on the theoretical frameworks of Moral Foundation Theory and Hofstede's cultural dimensions. Our results show that, while the anti-mask stance is associated with a conservative political leaning, the moral values expressed by its adherents diverge from the ones typically used by conservatives. In particular, the expected emphasis on the values of authority and purity is accompanied by an atypical dearth of in-group loyalty. We find that after the mandate, both pro- and anti-mask sides decrease their emphasis on care about others, and increase their attention on authority and fairness, further politicizing the issue. In addition, the mask mandate reverses the expression of Individualism-Collectivism between the two sides, with an increase of individualism in the anti-mask narrative, and a decrease in the pro-mask one. We argue that monitoring the dynamics of moral positioning is crucial for designing effective public health campaigns that are sensitive to the underlying values of the target audience.
Submitted 30 March, 2023; v1 submitted 16 March, 2023;
originally announced March 2023.
-
Evidence of Demographic rather than Ideological Segregation in News Discussion on Reddit
Authors:
Corrado Monti,
Jacopo D'Ignazi,
Michele Starnini,
Gianmarco De Francisci Morales
Abstract:
We evaluate homophily and heterophily among ideological and demographic groups in a typical opinion formation context: online discussions of current news. We analyze user interactions across five years in the r/news community on Reddit, one of the most visited websites in the United States. Then, we estimate demographic and ideological attributes of these users. Thanks to a comparison with a carefully-crafted network null model, we establish which pairs of attributes foster interactions and which ones inhibit them.
Individuals prefer to engage with the opposite ideological side, which contradicts the echo chamber narrative. Instead, demographic groups are homophilic, as individuals tend to interact within their own group - even in an online setting where such attributes are not directly observable. In particular, we observe age and income segregation consistently across years: users tend to avoid interactions when belonging to different groups. These results persist after controlling for the degree of interest by each demographic group in different news topics. Our findings align with the theory that affective polarization - the difficulty in socializing across political boundaries - is more connected with an increasingly divided society, rather than ideological echo chambers on social media.
We publicly release our anonymized data set and all the code to reproduce our results: https://github.com/corradomonti/demographic-homophily
Submitted 5 July, 2023; v1 submitted 15 February, 2023;
originally announced February 2023.
-
The Thin Ideology of Populist Advertising on Facebook during the 2019 EU Elections
Authors:
Arthur Capozzi,
Gianmarco De Francisci Morales,
Yelena Mejova,
Corrado Monti,
André Panisson
Abstract:
Social media has been an important tool in the expansion of the populist message, and it is thought to have contributed to the electoral success of populist parties in the past decade. This study compares how populist parties advertised on Facebook during the 2019 European Parliamentary election. In particular, we examine commonalities and differences in which audiences they reach and on which issues they focus. By using data from Meta (previously Facebook) Ad Library, we analyze 45k ad campaigns by 39 parties, both populist and mainstream, in Germany, the United Kingdom, Italy, Spain, and Poland. While populist parties represent just over 20% of the total expenditure on political ads, they account for 40% of the total impressions, most of which come from Eurosceptic and far-right parties, thus hinting at a competitive advantage for populist parties on Facebook. We further find that ads posted by populist parties are more likely to reach male audiences, and sometimes much older ones. In terms of issues, populist politicians focus on monetary policy, state bureaucracy and reforms, and security, while the focus on EU and Brexit is on par with non-populist, mainstream parties. However, issue preferences are largely country-specific, thus supporting the view in political science that populism is a "thin ideology" that does not have a universal, coherent policy agenda. This study illustrates the usefulness of publicly available advertising data for monitoring the populist outreach to, and engagement with, millions of potential voters, while outlining the limitations of currently available data.
Submitted 8 February, 2023;
originally announced February 2023.
-
Dual Accuracy-Quality-Driven Neural Network for Prediction Interval Generation
Authors:
Giorgio Morales,
John W. Sheppard
Abstract:
Accurate uncertainty quantification is necessary to enhance the reliability of deep learning models in real-world applications. In the case of regression tasks, prediction intervals (PIs) should be provided along with the deterministic predictions of deep learning models. Such PIs are useful or "high-quality" as long as they are sufficiently narrow and capture most of the probability density. In this paper, we present a method to learn prediction intervals for regression-based neural networks automatically in addition to the conventional target predictions. In particular, we train two companion neural networks: one that uses one output, the target estimate, and another that uses two outputs, the upper and lower bounds of the corresponding PI. Our main contribution is the design of a novel loss function for the PI-generation network that takes into account the output of the target-estimation network and has two optimization objectives: minimizing the mean prediction interval width and ensuring the PI integrity using constraints that maximize the prediction interval probability coverage implicitly. Furthermore, we introduce a self-adaptive coefficient that balances both objectives within the loss function, which alleviates the task of fine-tuning. Experiments using a synthetic dataset, eight benchmark datasets, and a real-world crop yield prediction dataset showed that our method was able to maintain a nominal probability coverage and produce significantly narrower PIs without detriment to its target estimation accuracy when compared to those PIs generated by three state-of-the-art neural-network-based methods. In other words, our method was shown to produce higher-quality PIs.
Submitted 21 March, 2024; v1 submitted 13 December, 2022;
originally announced December 2022.
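The two quantities that the abstract above trades off, mean prediction interval width and probability coverage, can be computed as in the sketch below. This covers only the evaluation metrics, not the paper's loss function or its self-adaptive coefficient; the function names `picp` and `mpiw` and the sample numbers are assumptions chosen for illustration.

```python
def picp(y_true, lower, upper):
    """Prediction interval coverage probability: share of targets inside [lower, upper]."""
    inside = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return inside / len(y_true)


def mpiw(lower, upper):
    """Mean prediction interval width: average of (upper - lower)."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)


# Hypothetical targets and interval bounds for four samples.
y_true = [1.0, 2.0, 3.0, 4.0]
lower = [0.5, 1.5, 3.5, 3.0]
upper = [1.5, 2.5, 4.5, 5.0]

print(picp(y_true, lower, upper))  # 0.75 (the third target falls below its interval)
print(mpiw(lower, upper))          # 1.25
```

A high-quality set of intervals, in the abstract's sense, keeps `picp` at or above the nominal level while making `mpiw` as small as possible.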
-
The language of opinion change on social media under the lens of communicative action
Authors:
Corrado Monti,
Luca Maria Aiello,
Gianmarco De Francisci Morales,
Francesco Bonchi
Abstract:
Which messages are more effective at inducing a change of opinion in the listener? We approach this question within the frame of Habermas' theory of communicative action, which posits that the illocutionary intent of the message (its pragmatic meaning) is the key. Thanks to recent advances in natural language processing, we are able to operationalize this theory by extracting the latent social dimensions of a message, namely archetypes of social intent of language, that come from social exchange theory. We identify key ingredients to opinion change by looking at more than 46k posts and more than 3.5M comments on Reddit's r/ChangeMyView, a debate forum where people try to change each other's opinion and explicitly mark opinion-changing comments with a special flag called "delta". Comments that express no intent are about 77% less likely to change the mind of the recipient, compared to comments that convey at least one social dimension. Among the various social dimensions, the ones that are most likely to produce an opinion change are knowledge, similarity, and trust, which resonates with Habermas' theory of communicative action. We also find other new important dimensions, such as appeals to power or empathetic expressions of support. Finally, in line with theories of constructive conflict, yet contrary to the popular characterization of conflict as the bane of modern social media, our findings show that voicing conflict in the context of a structured public debate can promote integration, especially when it is used to counter another conflictive stance. By leveraging recent advances in natural language processing, our work provides an empirical framework for Habermas' theory, finds concrete examples of its effects in the wild, and suggests its possible extension with a more faceted understanding of intent interpreted as social dimensions of language.
Submitted 31 October, 2022;
originally announced October 2022.
-
Python Implementation of the Dynamic Distributed Dimensional Data Model
Authors:
Hayden Jananthan,
Lauren Milechin,
Michael Jones,
William Arcand,
William Bergeron,
David Bestor,
Chansup Byun,
Michael Houle,
Matthew Hubbell,
Vijay Gadepally,
Anna Klein,
Peter Michaleas,
Guillermo Morales,
Julie Mullen,
Andrew Prout,
Albert Reuther,
Antonio Rosa,
Siddharth Samsi,
Charles Yee,
Jeremy Kepner
Abstract:
Python has become a standard scientific computing language with fast-growing support of machine learning and data analysis modules, as well as an increasing usage of big data. The Dynamic Distributed Dimensional Data Model (D4M) offers a highly composable, unified data model with strong performance built to handle big data fast and efficiently. In this work we present an implementation of D4M in Python. $D4M.py$ implements all foundational functionality of D4M and includes Accumulo and SQL database support via Graphulo. We describe the mathematical background and motivation, an explanation of the approaches made for its fundamental functions and building blocks, and performance results which compare $D4M.py$'s performance to D4M-MATLAB and D4M.jl.
Submitted 22 November, 2022; v1 submitted 1 September, 2022;
originally announced September 2022.
-
Learning Multiscale Non-stationary Causal Structures
Authors:
Gabriele D'Acunto,
Gianmarco De Francisci Morales,
Paolo Bajardi,
Francesco Bonchi
Abstract:
This paper addresses a gap in the current state of the art by providing a solution for modeling causal relationships that evolve over time and occur at different time scales. Specifically, we introduce the multiscale non-stationary directed acyclic graph (MN-DAG), a framework for modeling multivariate time series data. Our contribution is twofold. Firstly, we expose a probabilistic generative model by leveraging results from spectral and causality theories. Our model allows sampling an MN-DAG according to user-specified priors on the time-dependence and multiscale properties of the causal graph. Secondly, we devise a Bayesian method named Multiscale Non-stationary Causal Structure Learner (MN-CASTLE) that uses stochastic variational inference to estimate MN-DAGs. The method also exploits information from the local partial correlation between time series over different time resolutions. The data generated from an MN-DAG reproduces well-known features of time series in different domains, such as volatility clustering and serial correlation. Additionally, we show the superior performance of MN-CASTLE on synthetic data with different multiscale and non-stationary properties compared to baseline models. Finally, we apply MN-CASTLE to identify the drivers of the natural gas prices in the US market. Causal relationships have strengthened during the COVID-19 outbreak and the Russian invasion of Ukraine, a fact that baseline methods fail to capture. MN-CASTLE identifies the causal impact of critical economic drivers on natural gas prices, such as seasonal factors, economic uncertainty, oil prices, and gas storage deviations.
Submitted 17 November, 2023; v1 submitted 31 August, 2022;
originally announced August 2022.
-
pPython for Parallel Python Programming
Authors:
Chansup Byun,
William Arcand,
David Bestor,
Bill Bergeron,
Vijay Gadepally,
Michael Houle,
Matthew Hubbell,
Hayden Jananthan,
Michael Jones,
Kurt Keville,
Anna Klein,
Peter Michaleas,
Lauren Milechin,
Guillermo Morales,
Julie Mullen,
Andrew Prout,
Albert Reuther,
Antonio Rosa,
Siddharth Samsi,
Charles Yee,
Jeremy Kepner
Abstract:
pPython seeks to provide a parallel capability that provides good speed-up without sacrificing the ease of programming in Python by implementing partitioned global array semantics (PGAS) on top of a simple file-based messaging library (PythonMPI) in pure Python. The core data structure in pPython is a distributed numerical array whose distribution onto multiple processors is specified with a map construct. Communication operations between distributed arrays are abstracted away from the user and pPython transparently supports redistribution between any block-cyclic-overlapped distributions in up to four dimensions. pPython follows a SPMD (single program multiple data) model of computation. pPython runs on any combination of heterogeneous systems that support Python, including Windows, Linux, and MacOS operating systems. In addition to running transparently on single-node (e.g., a laptop), pPython provides a scheduler interface, so that pPython can be executed in a massively parallel computing environment. The initial implementation uses the Slurm scheduler. Performance of pPython on the HPC Challenge benchmark suite demonstrates both ease of programming and scalability.
Submitted 31 August, 2022;
originally announced August 2022.
-
On the Relation Between Opinion Change and Information Consumption on Reddit
Authors:
Flavio Petruzzellis,
Corrado Monti,
Gianmarco De Francisci Morales,
Francesco Bonchi
Abstract:
While much attention has been devoted to the causes of opinion change, little is known about its consequences. Our study sheds light on the relationship between a user's opinion change episode and subsequent behavioral change on the online social media platform Reddit. In particular, we look at r/ChangeMyView, an online community dedicated to debating one's own opinions. Interestingly, this forum adopts a well-codified schema for explicitly self-reporting opinion change. Starting from this ground truth, we analyze changes in future online information consumption behavior that arise after a self-reported opinion change on sociopolitical topics, operationalized in this work as participation in sociopolitical subreddits. Such a participation profile is important as it represents one's information diet, and is a reliable proxy for, e.g., political affiliation or health choices.
We find that people who report an opinion change are significantly more likely to change their future participation in a specific subset of online communities. We characterize which communities are more likely to be abandoned after opinion change, and find a significant association (r=0.46) between propaganda-like language used in a community and the increase in chances of leaving it. We find comparable results (r=0.39) for the opposite direction, i.e., joining a community. This finding suggests how propagandistic communities act as a first gateway to internalize a shift in one's sociopolitical opinion. Finally, we show that the textual content of the discussion associated with opinion change is indicative of which communities are going to be subject to a participation change. In fact, a predictive model based only on the opinion change post is able to pinpoint these communities with an AP@5 of 0.20, similar to what can be reached by using all the past history of participation in communities.
Submitted 25 July, 2022;
originally announced July 2022.
-
Open Arms: Open-Source Arms, Hands & Control
Authors:
David Hanson,
Alishba Imran,
Gerardo Morales,
Vytas Krisciunas,
Aditya Sagi,
Aman Malali,
Rushali Mohbe,
Raviteja Upadrashta
Abstract:
Open Arms is a novel open-source platform of realistic human-like robotic hands and arms hardware with 28 degrees of freedom (DoF), designed to extend the capabilities and accessibility of humanoid robotic grasping and manipulation. The Open Arms framework includes an open SDK and development environment, simulation tools, and application development tools to build and operate Open Arms. This paper describes the hands' controls, sensing, mechanisms, aesthetic design, and manufacturing, and their real-world applications with a teleoperated nursing robot. From 2015 to 2022, the authors have designed and established the manufacturing of Open Arms as a low-cost, high-functionality robotic arms hardware and software framework to serve both humanoid robot applications and the urgent demand for low-cost prosthetics, as part of the Hanson Robotics Sophia Robot platform. Using the techniques of consumer product manufacturing, we set out to define modular, low-cost techniques for approximating the dexterity and sensitivity of human hands. To demonstrate the dexterity and control of our hands, we present a Generative Grasping Residual CNN (GGR-CNN) model that can generate robust antipodal grasps from input images of various objects at real-time speeds (22 ms). We achieved state-of-the-art accuracy of 92.4% using our model architecture on the standard Cornell Grasping Dataset, which contains a diverse set of household objects.
Submitted 15 July, 2022; v1 submitted 20 May, 2022;
originally announced May 2022.
-
On learning agent-based models from data
Authors:
Corrado Monti,
Marco Pangallo,
Gianmarco De Francisci Morales,
Francesco Bonchi
Abstract:
Agent-Based Models (ABMs) are used in several fields to study the evolution of complex systems from micro-level assumptions. However, ABMs typically cannot estimate agent-specific (or "micro") variables: this is a major limitation which prevents ABMs from harnessing micro-level data availability and which greatly limits their predictive power. In this paper, we propose a protocol to learn the latent micro-variables of an ABM from data. The first step of our protocol is to reduce an ABM to a probabilistic model, characterized by a computationally tractable likelihood. This reduction follows two general design principles: balance of stochasticity and data availability, and replacement of unobservable discrete choices with differentiable approximations. Then, our protocol proceeds by maximizing the likelihood of the latent variables via a gradient-based expectation maximization algorithm. We demonstrate our protocol by applying it to an ABM of the housing market, in which agents with different incomes bid higher prices to live in high-income neighborhoods. We demonstrate that the obtained model allows accurate estimates of the latent variables, while preserving the general behavior of the ABM. We also show that our estimates can be used for out-of-sample forecasting. Our protocol can be seen as an alternative to black-box data assimilation methods, one that forces the modeler to lay bare the assumptions of the model, to think about the inferential process, and to spot potential identification problems.
Submitted 23 November, 2022; v1 submitted 10 May, 2022;
originally announced May 2022.
-
Modeling Political Activism around Gun Debate via Social Media
Authors:
Yelena Mejova,
Jisun An,
Gianmarco De Francisci Morales,
Haewoon Kwak
Abstract:
The United States has some of the highest rates of gun violence among developed countries. Yet, there is disagreement about the extent to which firearms should be regulated. In this study, we employ social media signals to examine the predictors of offline political activism, at both the population and individual levels. We show that it is possible to classify the stance of users on the gun issue, especially accurately when network information is available. Alongside socioeconomic variables, network information such as the relative size of the two sides of the debate is also predictive of state-level gun policy. At the individual level, we build a statistical model using network, content, and psycho-linguistic features that predicts real-life political action, and explore the most predictive linguistic features. Thus, we argue that, alongside demographics and socioeconomic indicators, social media provides useful signals in the holistic modeling of political engagement around the gun debate.
Submitted 30 April, 2022;
originally announced May 2022.
-
FreSCo: Mining Frequent Patterns in Simplicial Complexes
Authors:
Giulia Preti,
Gianmarco De Francisci Morales,
Francesco Bonchi
Abstract:
Simplicial complexes are a generalization of graphs that model higher-order relations. In this paper, we introduce simplicial patterns -- that we call simplets -- and generalize the task of frequent pattern mining from the realm of graphs to that of simplicial complexes. Our task is particularly challenging due to the enormous search space and the need for higher-order isomorphism. We show that finding the occurrences of simplets in a complex can be reduced to a bipartite graph isomorphism problem, in linear time and at most quadratic space. We then propose an anti-monotonic frequency measure that allows us to start the exploration from small simplets and stop expanding a simplet as soon as its frequency falls below the minimum frequency threshold. Equipped with these ideas and a clever data structure, we develop a memory-conscious algorithm that, by carefully exploiting the relationships among the simplices in the complex and among the simplets, achieves efficiency and scalability for our complex mining task. Our algorithm, FreSCo, comes in two flavors: it can compute the exact frequency of the simplets or, more quickly, it can determine whether a simplet is frequent, without having to compute the exact frequency. Experimental results prove the ability of FreSCo to mine frequent simplets in complexes of various sizes and dimensions, and the significance of the simplets with respect to the traditional graph patterns.
Submitted 26 January, 2022; v1 submitted 20 January, 2022;
originally announced January 2022.
-
Two-dimensional Deep Regression for Early Yield Prediction of Winter Wheat
Authors:
Giorgio Morales,
John W. Sheppard
Abstract:
Crop yield prediction is one of the tasks of Precision Agriculture that can be automated based on multi-source periodic observations of the fields. We tackle the yield prediction problem using a Convolutional Neural Network (CNN) trained on data that combines radar satellite imagery and on-ground information. We present a CNN architecture called Hyper3DNetReg that takes in a multi-channel input image and outputs a two-dimensional raster, where each pixel represents the predicted yield value of the corresponding input pixel. We utilize radar data acquired from the Sentinel-1 satellites, while the on-ground data correspond to a set of six raster features: nitrogen rate applied, precipitation, slope, elevation, topographic position index (TPI), and aspect. We use data collected during the early stage of the winter wheat growing season (March) to predict yield values during the harvest season (August). We present experiments over four fields of winter wheat and show that our proposed methodology yields better results than five compared methods, including multiple linear regression, an ensemble of feedforward networks using AdaBoost, a stacked autoencoder, and two other CNN architectures.
Submitted 15 November, 2021;
originally announced November 2021.