-
Modeling the 2022 Mpox Outbreak with a Mechanistic Network Model
Authors:
Emma G. Crenshaw,
Jukka-Pekka Onnela
Abstract:
We implemented a dynamic agent-based network model to simulate the spread of mpox in a United States-based MSM population. This model allowed us to implement data-informed dynamic network evolution to simulate realistic disease spreading and behavioral adaptations. We found that behavior change, the reduction in one-time partnerships, and widespread vaccination are effective in preventing the transmission of mpox and that earlier intervention has a greater effect, even when only a high-risk portion of the population participates. With no intervention, 16% of the population was infected (25th and 75th percentiles of simulations: 15.3%, 16.6%). With vaccination and behavior change in only the 25% of individuals most likely to have a one-time partner, cumulative infections were reduced by 30%, or a total reduction of nearly 500 infections. Earlier intervention further reduces cumulative infections; beginning vaccination a year before the outbreak results in only 5.5% of men being infected, averting 950 infections or nearly 10% of the total population in our model. We also show that sustained partnerships drive the early outbreak, while one-time partnerships drive transmission after the initial weeks. The median effective reproductive number, Rt, at t = 0 days is 1.30 for casual partnerships, 1.00 for main, and 0.6 for one-time. By t = 28 days, the median Rt for one-time partnerships has more than doubled to 1.48, while it has decreased for casual and main partnerships to 0.46 and 0.29, respectively. With the ability to model individuals' behavior, mechanistic networks are particularly well suited to studying sexually transmitted infections, the spread and control of which are often governed by individual-level action. Our results contribute valuable insights into the role of different interventions and relationship types in mpox transmission dynamics.
Submitted 8 May, 2025;
originally announced May 2025.
-
Individual causal effect estimation accounting for latent disease state modification among bipolar participants in mobile health studies
Authors:
Charlotte R. Fowler,
Xiaoxuan Cai,
Habiballah Rahimi-Eichi,
Lisa Dixon,
Justin T. Baker,
Jukka-Pekka Onnela,
Linda Valeri
Abstract:
Individuals with bipolar disorder tend to cycle through disease states such as depression and mania. The heterogeneous nature of disease across states complicates the evaluation of interventions for bipolar disorder patients, as varied interventional success is observed within and across individuals. In fact, we hypothesize that disease state acts as an effect modifier for the causal effect of a given intervention on health outcomes. To address this dilemma, we propose an N-of-1 approach using an adapted autoregressive hidden Markov model, applied to longitudinal mobile health data collected from individuals with bipolar disorder. This method allows us to identify a latent variable from mobile health data to be treated as an effect modifier between the exposure and outcome of interest while allowing for missing data in the outcome. A counterfactual approach is employed for causal inference and to obtain a g-formula estimator to recover said effect. The performance of the proposed method is compared with a naive approach across extensive simulations and application to a multi-year smartphone study of bipolar patients, evaluating the individual effect of digital social activity on sleep duration across different latent disease states.
Submitted 4 February, 2025; v1 submitted 14 January, 2025;
originally announced January 2025.
-
Glucodensity Functional Profiles Outperform Traditional Continuous Glucose Monitoring Metrics
Authors:
Marcos Matabuena,
Rahul Ghosal,
Javier Enrique Aguilar,
Robert Wagner,
Carmen Fernández Merino,
Juan Sánchez Castro,
Vadim Zipunnikov,
Jukka-Pekka Onnela,
Francisco Gude
Abstract:
Continuous glucose monitoring (CGM) data has revolutionized the management of type 1 diabetes, particularly when integrated with insulin pumps to mitigate clinical events such as hypoglycemia. Recently, there has been growing interest in utilizing CGM devices in clinical studies involving healthy and diabetes populations. However, efficiently exploiting the high temporal resolution of CGM profiles remains a significant challenge. Numerous indices -- such as time-in-range metrics and glucose variability measures -- have been proposed, but evidence suggests these metrics overlook critical aspects of glucose dynamic homeostasis. As an alternative method, this paper explores the clinical value of glucodensity metrics in capturing glucose dynamics -- specifically the speed and acceleration of CGM time series -- as new biomarkers for predicting long-term glucose outcomes. Our results demonstrate significant information gains, exceeding 20\% in terms of adjusted $R^2$, in forecasting glycosylated hemoglobin (HbA1c) and fasting plasma glucose (FPG) at five and eight years from baseline AEGIS data, compared to traditional non-CGM and CGM glucose biomarkers. These findings underscore the importance of incorporating more complex CGM functional metrics, such as the glucodensity approach, to fully capture continuous glucose fluctuations across different time-scale resolutions.
Submitted 1 October, 2024;
originally announced October 2024.
-
Connecting Mass-action Models and Network Models for Infectious Diseases
Authors:
Thien-Minh Le,
Jukka-Pekka Onnela
Abstract:
Infectious disease modeling is used to forecast epidemics and assess the effectiveness of intervention strategies. Although the core assumption of mass-action models, that of a homogeneously mixed population, is often implausible, they are nevertheless routinely used in studying epidemics and provide useful insights. Network models can account for the heterogeneous mixing of populations, which is especially important for studying sexually transmitted diseases. Despite the abundance of research on mass-action and network models, the relationship between them is not well understood. Here, we attempt to bridge the gap by first identifying a spreading rule that results in an exact match between disease spreading on a fully connected network and the classic mass-action models. We then propose a method for mapping epidemic spread on arbitrary networks to a form similar to that of mass-action models. We also provide a theoretical justification for the procedure. Finally, we show the advantages of the proposed methods using synthetic data that is based on an empirical network. These findings help us understand when mass-action models and network models are expected to provide similar results and identify reasons why they do not.
Submitted 27 August, 2024;
originally announced August 2024.
-
Causal estimands and identification of time-varying effects in non-stationary time series from N-of-1 mobile device data
Authors:
Xiaoxuan Cai,
Li Zeng,
Charlotte Fowler,
Lisa Dixon,
Dost Ongur,
Justin T. Baker,
Jukka-Pekka Onnela,
Linda Valeri
Abstract:
Mobile technology (mobile phones and wearable devices) generates continuous data streams encompassing outcomes, exposures and covariates, presented as intensive longitudinal or multivariate time series data. The high frequency of measurements enables granular and dynamic evaluation of treatment effects, revealing their persistence and accumulation over time. Existing methods predominantly focus on contemporaneous, temporal-average, or population-average effects, assuming stationarity or invariance of treatment effects over time, which are inadequate both conceptually and statistically to capture dynamic treatment effects in personalized mobile health data. We here propose new causal estimands for multivariate time series in N-of-1 studies. These estimands summarize how time-varying exposures impact outcomes in both the short and long term. We propose identifiability assumptions and a g-formula estimator that accounts for exposure-outcome and outcome-covariate feedback. The g-formula employs a state space model framework to accommodate time-varying behavior of treatment effects in non-stationary time series. We apply the proposed method to a multi-year smartphone observational study of bipolar patients and estimate the dynamic effect of phone-based communication on mood of patients with bipolar disorder in an N-of-1 setting. Our approach reveals substantial heterogeneity in treatment effects over time and across individuals. A simulation-based strategy is also proposed for the development of a short-term, dynamic, and personalized treatment recommendation based on a patient's past information, in combination with a novel positivity diagnostics plot, validating proper causal inference in time series data.
Submitted 24 July, 2024;
originally announced July 2024.
-
Temporal Configuration Model: Statistical Inference and Spreading Processes
Authors:
Thien-Minh Le,
Hali Hambridge,
Jukka-Pekka Onnela
Abstract:
We introduce a family of parsimonious network models that are intended to generalize the configuration model to temporal settings. We present consistent estimators for the model parameters and perform numerical simulations to illustrate the properties of the estimators on finite samples. We also develop analytical solutions for basic and effective reproductive numbers for the early stage of discrete-time SIR spreading process. We apply three distinct temporal configuration models to empirical student proximity networks and compare their performance.
Submitted 16 July, 2024;
originally announced July 2024.
-
Conformal uncertainty quantification using kernel depth measures in separable Hilbert spaces
Authors:
Marcos Matabuena,
Rahul Ghosal,
Pavlo Mozharovskyi,
Oscar Hernan Madrid Padilla,
Jukka-Pekka Onnela
Abstract:
Depth measures have gained popularity in the statistical literature for defining level sets in complex data structures like multivariate data, functional data, and graphs. Despite their versatility, integrating depth measures into regression modeling for establishing prediction regions remains underexplored. To address this gap, we propose a novel method utilizing a model-free uncertainty quantification algorithm based on conditional depth measures and conditional kernel mean embeddings. This enables the creation of tailored prediction and tolerance regions in regression models handling complex statistical responses and predictors in separable Hilbert spaces. Our focus in this paper is exclusively on examples where the response is a functional data object. To enhance practicality, we introduce a conformal prediction algorithm, providing non-asymptotic guarantees in the derived prediction region. Additionally, we establish both conditional and unconditional consistency results and fast convergence rates in some special homoscedastic cases. We evaluate the model's finite-sample performance in extensive simulation studies with different function objects as probability distributions and functional data. Finally, we apply the approach in a digital health application related to physical activity, aiming to offer personalized recommendations in the U.S. population based on individuals' characteristics.
Submitted 22 May, 2024;
originally announced May 2024.
-
Screening for Diabetes Mellitus in the U.S. Population Using Neural Network Models and Complex Survey Designs
Authors:
Marcos Matabuena,
Juan C. Vidal,
Rahul Ghosal,
Jukka-Pekka Onnela
Abstract:
Complex survey designs are commonly employed in many medical cohorts. In such scenarios, developing case-specific predictive risk score models that reflect the unique characteristics of the study design is essential for minimizing selective biases in the statistical results. The objectives of this paper are to: (i) propose a general predictive framework for regression and classification using neural network (NN) modeling that incorporates survey weights into the estimation process; (ii) introduce an uncertainty quantification algorithm for model prediction tailored to data from complex survey designs; and (iii) apply this method to develop robust risk score models for assessing the risk of Diabetes Mellitus in the US population, utilizing data from the NHANES 2011-2014 cohort. The results indicate that models of varying complexity, each utilizing a different set of variables, demonstrate different discriminative power for predicting diabetes (with different economic cost), yet yield generalizable results at the population level. Although the focus is on diabetes, this NN predictive framework is adaptable for developing clinical models across a diverse range of diseases and medical cohorts. The software and data used in this paper are publicly available on GitHub.
Submitted 25 March, 2025; v1 submitted 28 March, 2024;
originally announced March 2024.
-
kNN Algorithm for Conditional Mean and Variance Estimation with Automated Uncertainty Quantification and Variable Selection
Authors:
Marcos Matabuena,
Juan C. Vidal,
Oscar Hernan Madrid Padilla,
Jukka-Pekka Onnela
Abstract:
In this paper, we introduce a kNN-based regression method that synergizes the scalability and adaptability of traditional non-parametric kNN models with a novel variable selection technique. This method focuses on accurately estimating the conditional mean and variance of random response variables, thereby effectively characterizing conditional distributions across diverse scenarios. Our approach incorporates a robust uncertainty quantification mechanism, leveraging our prior estimation work on conditional mean and variance. The employment of kNN ensures scalable computational efficiency in predicting intervals and statistical accuracy in line with optimal non-parametric rates. Additionally, we introduce a new kNN semi-parametric algorithm for estimating ROC curves, accounting for covariates. For selecting the smoothing parameter k, we propose an algorithm with theoretical guarantees. Incorporation of variable selection enhances the performance of the method significantly over conventional kNN techniques in various modeling tasks. We validate the approach through simulations in low, moderate, and high-dimensional covariate spaces. The algorithm's effectiveness is particularly notable in biomedical applications as demonstrated in two case studies. Concluding with a theoretical analysis, we highlight the consistency and convergence rate of our method over traditional kNN models, particularly when the underlying regression model takes values in a low-dimensional space.
Submitted 2 February, 2024;
originally announced February 2024.
-
Approximate Inference for Longitudinal Mechanistic HIV Contact Networks
Authors:
Octavious Smiley,
Till Hoffmann,
Jukka-Pekka Onnela
Abstract:
Network models are increasingly used to study infectious disease spread. Exponential Random Graph models have a history in this area, with scalable inference methods now available. An alternative approach uses mechanistic network models. Mechanistic network models directly capture individual behaviors, making them suitable for studying sexually transmitted diseases. Combining mechanistic models with Approximate Bayesian Computation allows flexible modeling using domain-specific interaction rules among agents, avoiding network model oversimplifications. These models are ideal for longitudinal settings as they explicitly incorporate network evolution over time. We implemented a discrete-time version of a previously published continuous-time model of evolving contact networks for men who have sex with men (MSM) and proposed an ABC-based approximate inference scheme for it. As expected, we found that a two-wave longitudinal study design improves the accuracy of inference compared to a cross-sectional design. However, the gains in precision in collecting data twice, up to 18%, depend on the spacing of the two waves and are sensitive to the choice of summary statistics. In addition to methodological developments, our results inform the design of future longitudinal network studies in sexually transmitted diseases, specifically in terms of what data to collect from participants and when to do so.
Submitted 9 January, 2024;
originally announced January 2024.
-
Network Layout Algorithm with Covariate Smoothing
Authors:
Octavious Smiley,
Till Hoffmann,
Jukka-Pekka Onnela
Abstract:
Network science explores intricate connections among objects and is employed in diverse domains such as social interactions, fraud detection, and disease spread. Visualization of networks facilitates conceptualizing research questions and forming scientific hypotheses. Networks, as mathematical high-dimensional objects, require dimensionality reduction for (planar) visualization. Visualizing empirical networks presents additional challenges. They often contain false positive (spurious) and false negative (missing) edges. Traditional visualization methods do not account for errors in observation, potentially biasing interpretations. Moreover, contemporary network data includes rich nodal attributes. However, traditional methods neglect these attributes when computing node locations. Our visualization approach aims to leverage nodal attribute richness to compensate for network data limitations. We employ a statistical model estimating the probability of edge connections between nodes based on their covariates. We enhance the Fruchterman-Reingold algorithm to incorporate estimated dyad connection probabilities, allowing practitioners to balance reliance on observed versus estimated edges. We explore optimal smoothing levels, offering a natural way to include relevant nodal information in layouts. Results demonstrate the effectiveness of our method in achieving robust network visualization, providing insights for improved analysis.
Submitted 9 January, 2024;
originally announced January 2024.
-
Nonparametric Additive Value Functions: Interpretable Reinforcement Learning with an Application to Surgical Recovery
Authors:
Patrick Emedom-Nnamdi,
Timothy R. Smith,
Jukka-Pekka Onnela,
Junwei Lu
Abstract:
We propose a nonparametric additive model for estimating interpretable value functions in reinforcement learning. Learning effective adaptive clinical interventions that rely on digital phenotyping features is a major concern for medical practitioners. With respect to spine surgery, different post-operative recovery recommendations concerning patient mobilization can lead to significant variation in patient recovery. While reinforcement learning has achieved widespread success in domains such as games, recent methods heavily rely on black-box methods, such as neural networks. Unfortunately, these methods hinder the ability to examine the contribution each feature makes in producing the final suggested decision. While such interpretations are easily provided in classical algorithms such as Least Squares Policy Iteration, basic linearity assumptions prevent learning higher-order flexible interactions between features. In this paper, we present a novel method that offers a flexible technique for estimating action-value functions without making explicit parametric assumptions regarding their additive functional form. This nonparametric estimation strategy relies on incorporating local kernel regression and basis expansion to obtain a sparse, additive representation of the action-value function. Under this approach, we are able to locally approximate the action-value function and retrieve the nonlinear, independent contribution of select features as well as joint feature pairs. We validate the proposed approach with a simulation study, and, in an application to spine disease, uncover recovery recommendations that are in line with related clinical knowledge.
Submitted 24 August, 2023;
originally announced August 2023.
-
Scalable Gaussian Process Inference with Stan
Authors:
Till Hoffmann,
Jukka-Pekka Onnela
Abstract:
Gaussian processes (GPs) are sophisticated distributions to model functional data. Whilst theoretically appealing, they are computationally cumbersome except for small datasets. We implement two methods for scaling GP inference in Stan: First, a general sparse approximation using a directed acyclic dependency graph; second, a fast, exact method for regularly spaced data modeled by GPs with stationary kernels using the fast Fourier transform. Based on benchmark experiments, we offer guidance for practitioners to decide between different methods and parameterizations. We consider two real-world examples to illustrate the package. The implementation follows Stan's design and exposes performant inference through a familiar interface. Full posterior inference for ten thousand data points is feasible on a laptop in less than 20 seconds. Details on how to get started using the popular interfaces cmdstanpy for Python and cmdstanr for R are provided.
Submitted 10 January, 2024; v1 submitted 20 January, 2023;
originally announced January 2023.
-
A Generalized Estimating Equation Approach to Network Regression
Authors:
Riddhi Pratim Ghosh,
Jukka-Pekka Onnela,
Ian Barnett
Abstract:
Regression models applied to network data where node attributes are the dependent variables pose a methodological challenge. As has been well studied, naive regression neither properly accounts for community structure, nor does it account for the dependent variable acting as both model outcome and covariate. To address this methodological gap, we propose a network regression model motivated by the important observation that controlling for community structure can, when a network is modular, significantly account for meaningful correlation between observations induced by network connections. We propose a generalized estimating equation (GEE) approach to learn model parameters based on clusters defined through any single-membership community detection algorithm applied to the observed network. We provide a necessary condition on the network size and edge formation probabilities to establish the asymptotic normality of the model parameters under the assumption that the graph structure is a stochastic block model. We evaluate the performance of our approach through simulations and apply it to estimate the joint impact of baseline covariates and network effects on COVID-19 incidence rate among countries connected by a network of commercial airline traffic. We find that, while the network effect has some influence at the beginning of the pandemic, the percentage of urban population has more influence on the incidence rate than the network effect after the travel ban was in effect.
Submitted 14 February, 2024; v1 submitted 11 January, 2023;
originally announced January 2023.
-
Testing unit root non-stationarity in the presence of missing data in univariate time series of mobile health studies
Authors:
Charlotte Fowler,
Xiaoxuan Cai,
Justin T. Baker,
Jukka-Pekka Onnela,
Linda Valeri
Abstract:
The use of digital devices to collect data in mobile health (mHealth) studies introduces a novel application of time series methods, with the constraint of potential data missing at random (MAR) or missing not at random (MNAR). In time series analysis, testing for stationarity is an important preliminary step to inform appropriate later analyses. The augmented Dickey-Fuller (ADF) test was developed to test the null hypothesis of unit root non-stationarity, under no missing data. Beyond recommendations under data missing completely at random (MCAR) for complete case analysis or last observation carried forward imputation, researchers have not extended unit root non-stationarity testing to a context with more complex missing data mechanisms. Multiple imputation with chained equations, Kalman smoothing imputation, and linear interpolation have also been proposed for time series data; however, such methods impose constraints on the autocorrelation structure, and thus impact unit root testing. We propose maximum likelihood estimation and multiple imputation using state space model approaches to adapt the ADF test to a context with missing data. We further develop sensitivity analysis techniques to examine the impact of MNAR data. We evaluate the performance of existing and proposed methods across different missing mechanisms in extensive simulations and in their application to a multi-year smartphone study of bipolar patients.
Submitted 10 October, 2022;
originally announced October 2022.
-
State space model multiple imputation for missing data in non-stationary multivariate time series with application in digital Psychiatry
Authors:
Xiaoxuan Cai,
Xinru Wang,
Li Zeng,
Habiballah Rahimi Eichi,
Dost Ongur,
Lisa Dixon,
Justin T. Baker,
Jukka-Pekka Onnela,
Linda Valeri
Abstract:
Mobile technology enables unprecedented continuous monitoring of an individual's behavior, social interactions, symptoms, and other health conditions, presenting an enormous opportunity for therapeutic advancements and scientific discoveries regarding the etiology of psychiatric illness. Continuous collection of mobile data results in the generation of a new type of data: entangled multivariate time series of outcome, exposure, and covariates. Missing data is a pervasive problem in biomedical and social science research, and the Ecological Momentary Assessment (EMA) using mobile devices in psychiatric research is no exception. However, the complex structure of multivariate time series introduces new challenges in handling missing data for proper causal inference. Data imputation is commonly recommended to enhance data utility and estimation efficiency. The majority of available imputation methods are either designed for longitudinal data with limited follow-up times or for stationary time series, which are incompatible with potentially non-stationary time series. In the field of psychiatry, non-stationary data are frequently encountered as symptoms and treatment regimens may experience dramatic changes over time. To address missing data in possibly non-stationary multivariate time series, we propose a novel multiple imputation strategy based on the state space model (SSMmp) and a more computationally efficient variant (SSMimpute). We demonstrate their advantages over other widely used missing data strategies by evaluating their theoretical properties and empirical performance in simulations of both stationary and non-stationary time series, subject to various missing mechanisms. We apply the SSMimpute to investigate the association between social network size and negative mood using a multi-year observational smartphone study of bipolar patients, controlling for confounding variables.
Submitted 12 April, 2023; v1 submitted 28 June, 2022;
originally announced June 2022.
-
Unifying Summary Statistic Selection for Approximate Bayesian Computation
Authors:
Till Hoffmann,
Jukka-Pekka Onnela
Abstract:
Extracting low-dimensional summary statistics from large datasets is essential for efficient (likelihood-free) inference. We characterize different classes of summaries and demonstrate their importance for correctly analysing dimensionality reduction algorithms. We demonstrate that minimizing the expected posterior entropy (EPE) under the prior predictive distribution of the model subsumes many existing methods. They are equivalent to or are special or limiting cases of minimizing the EPE. We offer a unifying framework for obtaining informative summaries, provide concrete recommendations for practitioners, and propose a practical method to obtain high-fidelity summaries whose utility we demonstrate for both benchmark and practical examples.
Submitted 25 April, 2025; v1 submitted 5 June, 2022;
originally announced June 2022.
-
Combining Accelerometer and Gyroscope Data in Smartphone-Based Activity Recognition using Movelets
Authors:
Emily Huang,
Kebin Yan,
Jukka-Pekka Onnela
Abstract:
Physical activity patterns can be informative about a patient's health status. Traditionally, activity data have been gathered using patient self-report. However, these subjective data can suffer from bias and are difficult to collect over long time periods. Smartphones offer an opportunity to address these challenges. The smartphone has built-in sensors that can be programmed to collect data objectively, unobtrusively, and continuously. Due to their widespread adoption, smartphones are also accessible to most of the population. A main challenge in smartphone-based activity recognition is extracting information optimally from multiple sensors to identify the unique features of different activities. In our study, we analyze data collected by the accelerometer and gyroscope, which measure the phone's acceleration and angular velocity, respectively. We propose an extension to the "movelet method" that jointly incorporates both sensors. We also apply this joint-sensor method to a data set we collected previously. The findings show that combining data from the two sensors can result in more accurate activity recognition than using each sensor alone. For example, the joint-sensor method reduces errors of the gyroscope-only method in differentiating between standing and sitting. It also reduces errors of the accelerometer-only method in classifying vigorous activities.
Submitted 5 February, 2022; v1 submitted 2 September, 2021;
originally announced September 2021.
-
Maximum likelihood estimation for mechanistic network models
Authors:
Jonathan Larson,
Jukka-Pekka Onnela
Abstract:
Mechanistic network models specify the mechanisms by which networks grow and change, allowing researchers to investigate complex systems using both simulation and analytical techniques. Unfortunately, it is difficult to write likelihoods for instances of graphs generated with mechanistic models because of a combinatorial explosion in outcomes of repeated applications of the mechanism. Thus it is near impossible to estimate the parameters using maximum likelihood estimation. In this paper, we propose treating node sequence in a growing network model as an additional parameter, or as a missing random variable, and maximizing over the resulting likelihood. We develop this framework in the context of a simple mechanistic network model, used to study gene duplication and divergence, and test a variety of algorithms for maximizing the likelihood in simulated graphs. We also run the best-performing algorithm on a human protein-protein interaction network and four non-human protein-protein interaction networks. Although we focus on a specific mechanistic network model here, the proposed framework is more generally applicable to reversible models.
Submitted 18 July, 2023; v1 submitted 16 June, 2021;
originally announced June 2021.
-
Inferring the minimum spanning tree from a sample network
Authors:
Jonathan Larson,
Jukka-Pekka Onnela
Abstract:
Minimum spanning trees (MSTs) are used in a variety of fields, from computer science to geography. Infectious disease researchers have used them to infer the transmission pathway of certain pathogens. However, these are often the MSTs of sample networks, not population networks, and surprisingly little is known about what can be inferred about a population MST from a sample MST. We prove that if $n$ nodes (the sample) are selected uniformly at random from a complete graph with $N$ nodes and unique edge weights (the population), the probability that an edge is in the population graph's MST given that it is in the sample graph's MST is $\frac{n}{N}$. We use simulation to investigate this conditional probability for $G(N,p)$ graphs, Barabási-Albert (BA) graphs, graphs whose nodes are distributed in $\mathbb{R}^2$ according to a bivariate standard normal distribution, and an empirical HIV genetic distance network. Broadly, results for the complete, $G(N,p)$, and normal graphs are similar, and results for the BA and empirical HIV graphs are similar. We recommend that researchers use an edge-weighted random walk to sample nodes from the population so that they maximize the probability that an edge is in the population MST given that it is in the sample MST.
Submitted 12 May, 2021; v1 submitted 19 February, 2021;
originally announced February 2021.
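The $\frac{n}{N}$ result stated in the abstract for complete graphs can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes independent uniform random edge weights (which are unique with probability one), and the function names `kruskal_mst` and `estimate_conditional_probability` and the sizes `N=20`, `n=8` are illustrative choices.

```python
import random
from itertools import combinations


def kruskal_mst(nodes, weight):
    """Return the MST edge set of the complete graph on `nodes` (Kruskal + union-find)."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    mst = set()
    # Edges are stored as (u, v) with u < v, matching the keys of `weight`.
    for u, v in sorted(combinations(sorted(nodes), 2), key=lambda e: weight[e]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            mst.add((u, v))
    return mst


def estimate_conditional_probability(N, n, trials, seed=0):
    """Estimate P(edge in population MST | edge in sample MST) by simulation."""
    rng = random.Random(seed)
    population = list(range(N))
    hits = total = 0
    for _ in range(trials):
        # Assumption: i.i.d. uniform weights on the complete population graph.
        weight = {e: rng.random() for e in combinations(population, 2)}
        population_mst = kruskal_mst(population, weight)
        sample = rng.sample(population, n)  # uniform sample of n nodes
        sample_mst = kruskal_mst(sample, weight)
        hits += len(sample_mst & population_mst)
        total += len(sample_mst)
    return hits / total


estimate = estimate_conditional_probability(N=20, n=8, trials=300)
print(f"simulated: {estimate:.3f}, theoretical n/N: {8 / 20:.3f}")
```

The simulated proportion should fall close to $n/N = 0.4$, up to Monte Carlo error. The same harness could be pointed at other graph families by replacing the weight dictionary, although the abstract reports that the conditional probability behaves differently for Barabási-Albert and empirical graphs.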
-
Cost-based feature selection for network model choice
Authors:
Louis Raynal,
Till Hoffmann,
Jukka-Pekka Onnela
Abstract:
Selecting a small set of informative features from a large number of possibly noisy candidates is a challenging problem with many applications in machine learning and approximate Bayesian computation. In practice, the cost of computing informative features also needs to be considered. This is particularly important for networks because the computational costs of individual features can span several orders of magnitude. We addressed this issue for the network model selection problem using two approaches. First, we adapted nine feature selection methods to account for the cost of features. We show for two classes of network models that the cost can be reduced by two orders of magnitude without considerably affecting classification accuracy (proportion of correctly identified models). Second, we selected features using pilot simulations with smaller networks. This approach reduced the computational cost by a factor of 50 without affecting classification accuracy. To demonstrate the utility of our approach, we applied it to three different yeast protein interaction networks and identified the best-fitting duplication divergence model.
Submitted 1 September, 2022; v1 submitted 19 January, 2021;
originally announced January 2021.
-
Scalable Approximate Bayesian Computation for Growing Network Models via Extrapolated and Sampled Summaries
Authors:
Louis Raynal,
Sixing Chen,
Antonietta Mira,
Jukka-Pekka Onnela
Abstract:
Approximate Bayesian computation (ABC) is a simulation-based likelihood-free method applicable to both model selection and parameter estimation. ABC parameter estimation requires the ability to forward simulate datasets from a candidate model, but because the sizes of the observed and simulated datasets usually need to match, this can be computationally expensive. Additionally, since ABC inference is based on comparisons of summary statistics computed on the observed and simulated data, using computationally expensive summary statistics can lead to further losses in efficiency. ABC has recently been applied to the family of mechanistic network models, an area that has traditionally lacked tools for inference and model choice. Mechanistic models of network growth repeatedly add nodes to a network until it reaches the size of the observed network, which may be of the order of millions of nodes. With ABC, this process can quickly become computationally prohibitive due to the resource intensive nature of network simulations and evaluation of summary statistics. We propose two methodological developments to enable the use of ABC for inference in models for large growing networks. First, to save time needed for forward simulating model realizations, we propose a procedure to extrapolate (via both least squares and Gaussian processes) summary statistics from small to large networks. Second, to reduce computation time for evaluating summary statistics, we use sample-based rather than census-based summary statistics. We show that the ABC posterior obtained through this approach, which adds two additional layers of approximation to the standard ABC, is similar to a classic ABC posterior. Although we deal with growing network models, both extrapolated summaries and sampled summaries are expected to be relevant in other ABC settings where the data are generated incrementally.
Submitted 9 November, 2020;
originally announced November 2020.
-
Affiliation network model of HIV transmission in MSM
Authors:
Jonathan Larson,
Jukka-Pekka Onnela
Abstract:
Black men who have sex with men (MSM) in the U.S. are more likely to be HIV-positive than White MSM. Intentional and unintentional segregation of Black from non-Black MSM in sex partner meeting places may perpetuate this disparity, a fact that is ignored by current HIV risk indices, which mainly focus on individual behaviors and not systemic factors. This paper capitalizes on recent studies in which the venues where MSM meet their sex partners are known. Connecting individuals and venues leads to so-called affiliation networks; we propose a model for how HIV might spread along these networks, and we formulate a new risk index based on this model. We test this new risk index on an affiliation network of 466 African-American MSM in Chicago, and in simulation. The new risk index works well when there are two groups of people, one with higher HIV prevalence than the other, with limited overlap in where they meet their sex partners.
Submitted 12 May, 2021; v1 submitted 13 April, 2020;
originally announced April 2020.
-
Efficient vaccination strategies for epidemic control using network information
Authors:
Yingrui Yang,
Ashley McKhann,
Sixing Chen,
Guy Harling,
Jukka-Pekka Onnela
Abstract:
Network-based interventions against epidemic spread are most powerful when the full network structure is known. However, in practice, resource constraints require decisions to be made based on partial network information. We investigated how the accuracy of network data available at individual and village levels affected network-based vaccination effectiveness. We simulated a Susceptible-Infected-Recovered process on empirical social networks from 75 villages. First, we used regression to predict the percentage of individuals ever infected based on village-level network. Second, we simulated vaccinating 10 percent of each of the 75 empirical village networks at baseline, selecting vaccinees through one of five network-based approaches: random individuals; random contacts of random individuals; random high-degree individuals; highest degree individuals; or most central individuals. The first three approaches require only sample data; the latter two require full network data. We also simulated imposing a limit on how many contacts an individual can nominate (Fixed Choice Design, FCD), which reduces the data collection burden but generates only partially observed networks. We found mean and standard deviation of the degree distribution to strongly predict cumulative incidence. In simulations, the Nomination method reduced cumulative incidence by one-sixth compared to Random vaccination; full network methods reduced infection by two-thirds. The High Degree approach had intermediate effectiveness. Surprisingly, FCD truncating individuals' degrees at three was as effective as using complete networks. Using even partial network information to prioritize vaccines at either the village or individual level substantially improved epidemic outcomes. Such approaches may be feasible and effective in outbreak settings, and full ascertainment of network structure may not be required.
Submitted 18 March, 2019;
originally announced March 2019.
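Three of the vaccinee-selection approaches named in the abstract (random individuals, random contacts of random individuals, and highest-degree individuals) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the study's implementation: the network is a plain dictionary mapping each node to a set of neighbours, the function names are hypothetical, and the toy hub-and-spoke network stands in for the empirical village data.

```python
import random


def random_vaccinees(adj, k, rng):
    """Pick k individuals uniformly at random (needs only a list of people)."""
    return sorted(rng.sample(sorted(adj), k))


def nominated_vaccinees(adj, k, rng):
    """Pick k distinct random contacts of random individuals (sample data only)."""
    nominatable = {v for neighbours in adj.values() for v in neighbours}
    if k > len(nominatable):
        raise ValueError("k exceeds the number of individuals who can be nominated")
    nodes = sorted(adj)
    chosen = set()
    while len(chosen) < k:
        person = rng.choice(nodes)
        if adj[person]:  # skip isolated individuals, who cannot nominate anyone
            chosen.add(rng.choice(sorted(adj[person])))
    return sorted(chosen)


def highest_degree_vaccinees(adj, k):
    """Pick the k highest-degree individuals (requires full network data)."""
    return sorted(adj, key=lambda v: (-len(adj[v]), v))[:k]


def mean_degree(adj, group):
    """Average number of contacts among the selected individuals."""
    return sum(len(adj[v]) for v in group) / len(group)


# Hypothetical toy network: one hub (node 0) connected to nine leaves.
adj = {0: set(range(1, 10))}
adj.update({leaf: {0} for leaf in range(1, 10)})

rng = random.Random(0)
for name, group in [
    ("random", random_vaccinees(adj, 3, rng)),
    ("nomination", nominated_vaccinees(adj, 3, rng)),
    ("highest degree", highest_degree_vaccinees(adj, 3)),
]:
    print(f"{name}: vaccinees={group}, mean degree={mean_degree(adj, group):.2f}")
```

Nomination tends to reach better-connected individuals than uniform sampling because a person with many contacts is more likely to be named by someone, which is why it can help without requiring the full network.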
-
A Bootstrap Method for Goodness of Fit and Model Selection with a Single Observed Network
Authors:
Sixing Chen,
Jukka-Pekka Onnela
Abstract:
Network models are applied in numerous domains where data can be represented as a system of interactions among pairs of actors. While both statistical and mechanistic network models are increasingly capable of capturing various dependencies amongst these actors, these dependencies imply a lack of independence. This poses statistical challenges for analyzing such data, especially when there is only a single observed network, and often leads to intractable likelihoods regardless of the modeling paradigm, which limit the application of existing statistical methods for networks. We explore a subsampling bootstrap procedure to serve as the basis for goodness of fit and model selection with a single observed network that circumvents the intractability of such likelihoods. Our approach is based on flexible resampling distributions formed from the single observed network, allowing for finer and higher dimensional comparisons than simply point estimates of quantities of interest. We include worked examples for model selection, with simulation, and assessment of goodness of fit, with duplication-divergence model fits for yeast (S. cerevisiae) protein-protein interaction data from the literature. The proposed procedure produces a flexible resampling distribution that can be based on any statistics of one's choosing and can be employed regardless of choice of model.
Submitted 28 June, 2018;
originally announced June 2018.
-
Connected but Segregated: Social Networks in Rural Villages
Authors:
Felipe Montes,
Roberto C. Jimenez,
Jukka-Pekka Onnela
Abstract:
There is an increased appreciation for, and utilization of, social networks to disseminate various kinds of interventions in a target population. Homophily, the tendency of people to be similar to those they interact with, can create within-group cohesion but at the same time can also lead to societal segregation. In public health, social segregation can form barriers to the spread of health interventions from one group to another. We analyzed the structure of social networks in 75 villages in Karnataka, India, both at the level of individuals and network communities. We found all villages to be strongly segregated at the community level, especially along the lines of caste and sex, whereas other socioeconomic variables, such as age and education, were only weakly associated with these groups in the network. While the studied networks are densely connected, our results indicate that the villages are highly segregated.
Submitted 29 May, 2018;
originally announced May 2018.
-
Bayesian method for inferring the impact of geographical distance on intensity of communication
Authors:
Fei Li,
Jukka-Pekka Onnela,
Victor DeGruttola
Abstract:
Both theoretical models and empirical findings suggest that the intensity of communication among groups of people declines with their degree of geographical separation. There is some evidence that rather than decaying uniformly with distance, the intensity of communication might decline at different rates for shorter and longer distances. Using Bayesian LASSO for model selection, we introduce a statistical model for estimating the rate of communication decline with geographic distance that allows for discontinuities in this rate. We apply our method to an anonymized mobile phone communication dataset. Our results are potentially useful in settings where understanding social and spatial mixing of people is important, such as in cluster randomized trials design.
Submitted 23 May, 2018;
originally announced May 2018.
-
Flexible model selection for mechanistic network models
Authors:
Sixing Chen,
Antonietta Mira,
Jukka-Pekka Onnela
Abstract:
Network models are applied across many domains where data can be represented as a network. Two prominent paradigms for modeling networks are statistical models (probabilistic models for the observed network) and mechanistic models (models for network growth and/or evolution). Mechanistic models are better suited for incorporating domain knowledge, studying the effects of interventions (such as changes to specific mechanisms), and forward simulation, but they typically have intractable likelihoods. As such, and in stark contrast to statistical models, there is a relative dearth of research on model selection for such models despite the otherwise large body of extant work. In this paper, we propose a simulator-based procedure for mechanistic network model selection that borrows aspects from Approximate Bayesian Computation (ABC) along with a means to quantify the uncertainty in the selected model. To select the most suitable network model, we consider and assess the performance of several learning algorithms, most notably the so-called Super Learner, which makes our framework less sensitive to the choice of a particular learning algorithm. Our approach takes advantage of the ease of forward simulation from mechanistic network models to circumvent their intractable likelihoods. The overall process is flexible and widely applicable. Our simulation results demonstrate the approach's ability to accurately discriminate between competing mechanistic models. Finally, we showcase our approach with a protein-protein interaction network model from the literature for yeast (Saccharomyces cerevisiae).
Submitted 19 June, 2019; v1 submitted 31 March, 2018;
originally announced April 2018.
-
Generalizations of Edge Overlap to Weighted and Directed Networks
Authors:
Heather Mattie,
Jukka-Pekka Onnela
Abstract:
With the increasing availability of behavioral data from diverse digital sources, such as social media sites and cell phones, it is now possible to obtain detailed information about the structure, strength, and directionality of social interactions in varied settings. While most metrics of network structure have traditionally been defined for unweighted and undirected networks only, the richness of current network data calls for extending these metrics to weighted and directed networks. One fundamental metric in social networks is edge overlap, the proportion of friends shared by two connected individuals. Here we extend definitions of edge overlap to weighted and directed networks, and present closed-form expressions for the mean and variance of each version for the Erdős-Rényi random graph and its weighted and directed counterparts. We apply these results to social network data collected in rural villages in southern Karnataka, India. We use our analytical results to quantify the extent to which the average overlap of the empirical social network deviates from that of corresponding random graphs and compare the values of overlap across networks. Our novel definitions allow the calculation of edge overlap for more complex networks and our derivations provide a statistically rigorous way for comparing edge overlap across networks.
Submitted 30 November, 2020; v1 submitted 19 December, 2017;
originally announced December 2017.
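For the unweighted, undirected baseline, edge overlap as described in the abstract (the proportion of friends shared by two connected individuals) can be computed as below. This sketch assumes one common formulation, $O_{ij} = n_{ij} / \big((k_i - 1) + (k_j - 1) - n_{ij}\big)$, where $n_{ij}$ is the number of common neighbours and $k_i$, $k_j$ are the degrees. The weighted and directed extensions proposed in the paper are not reproduced here, and the function name and toy graph are illustrative.

```python
def edge_overlap(adj, i, j):
    """Edge overlap of the tie (i, j) in an unweighted, undirected network.

    `adj` maps each node to the set of its neighbours.
    """
    if j not in adj[i] or i not in adj[j]:
        raise ValueError("overlap is defined only for connected pairs")
    shared = len(adj[i] & adj[j])  # number of common neighbours, n_ij
    # Neighbours of either endpoint, excluding i and j themselves.
    denominator = (len(adj[i]) - 1) + (len(adj[j]) - 1) - shared
    # Convention assumed here: an isolated edge has overlap 0.
    return shared / denominator if denominator > 0 else 0.0


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

for i, j in [("a", "b"), ("a", "c"), ("c", "d")]:
    print(f"overlap({i}, {j}) = {edge_overlap(adj, i, j):.2f}")
```

In the toy graph, the tie between `a` and `b` has overlap 1 because their only other contact is shared, while the tie between `c` and `d` has overlap 0 because `d` has no other contacts.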
-
ABCpy: A High-Performance Computing Perspective to Approximate Bayesian Computation
Authors:
Ritabrata Dutta,
Marcel Schoengens,
Lorenzo Pacchiardi,
Avinash Ummadisingu,
Nicole Widmer,
Pierre Künzli,
Jukka-Pekka Onnela,
Antonietta Mira
Abstract:
ABCpy is a highly modular scientific library for Approximate Bayesian Computation (ABC) written in Python. The main contribution of this paper is to document a software engineering effort that enables domain scientists to easily apply ABC to their research without being ABC experts; using ABCpy they can easily run large parallel simulations without much knowledge about parallelization. Further, ABCpy enables ABC experts to easily develop new inference schemes and evaluate them in a standardized environment and to extend the library with new algorithms. These benefits come mainly from the modularity of ABCpy. We give an overview of the design of ABCpy and provide a performance evaluation concentrating on parallelization. This points us towards the inherent imbalance in some of the ABC algorithms. We develop a dynamic scheduling MPI implementation to mitigate this issue and evaluate the various ABC algorithms according to their adaptability towards high-performance computing.
Submitted 17 December, 2021; v1 submitted 13 November, 2017;
originally announced November 2017.
-
The Social Bow Tie
Authors:
Heather Mattie,
Kenth Engø-Monsen,
Rich Ling,
Jukka-Pekka Onnela
Abstract:
Understanding tie strength in social networks, and the factors that influence it, has received much attention in myriad disciplines for decades. Several models incorporating indicators of tie strength have been proposed and used to quantify relationships in social networks, and a standard set of structural network metrics has been applied to predominantly online social media sites to predict tie strength. Here, we introduce the concept of the "social bow tie" framework, a small subgraph of the network that consists of a collection of nodes and ties that surround a tie of interest, forming a topological structure that resembles a bow tie. We also define several intuitive and interpretable metrics that quantify properties of the bow tie. We use random forests and regression models to predict categorical and continuous measures of tie strength from different properties of the bow tie, including nodal attributes. We also investigate what aspects of the bow tie are most predictive of tie strength in two distinct social networks: a collection of 75 rural villages in India and a nationwide call network of European mobile phone users. Our results indicate that several of the bow tie metrics are highly predictive of tie strength, and we find that the more the social circles of two individuals overlap, the stronger their tie, consistent with previous findings. However, we also find that the more tightly-knit their non-overlapping social circles, the weaker the tie. This new finding complements our current understanding of what drives the strength of ties in social networks.
Submitted 12 October, 2017; v1 submitted 11 October, 2017;
originally announced October 2017.
-
Bayesian Inference of Spreading Processes on Networks
Authors:
Ritabrata Dutta,
Antonietta Mira,
Jukka-Pekka Onnela
Abstract:
Infectious diseases are studied to understand their spreading mechanisms, to evaluate control strategies and to predict the risk and course of future outbreaks. Because people only interact with a small number of individuals, and because the structure of these interactions matters for spreading processes, the pairwise relationships between individuals in a population can be usefully represented by a network. Although the underlying processes of transmission are different, the network approach can be used to study the spread of pathogens in a contact network or the spread of rumors in an online social network. We study simulated simple and complex epidemics on synthetic networks and on two empirical networks, a social / contact network in an Indian village and an online social network in the U.S. Our goal is to learn simultaneously about the spreading process parameters and the source node (first infected node) of the epidemic, given a fixed and known network structure, and observations about state of nodes at several points in time. Our inference scheme is based on approximate Bayesian computation (ABC), an inference technique for complex models with likelihood functions that are either expensive to evaluate or analytically intractable. ABC enables us to adopt a Bayesian approach to the problem despite the posterior distribution being very complex. Our method is agnostic about the topology of the network and the nature of the spreading process. It generally performs well and, somewhat counter-intuitively, the inference problem appears to be easier on more heterogeneous network topologies, which enhances its future applicability to real-world settings where few networks have homogeneous topologies.
Submitted 21 May, 2018; v1 submitted 26 September, 2017;
originally announced September 2017.
-
Leveraging contact network structure in the design of cluster randomized trials
Authors:
Guy Harling,
Rui Wang,
Jukka-Pekka Onnela,
Victor De Gruttola
Abstract:
Background: In settings where proof-of-principle trials have succeeded but the effectiveness of different forms of implementation remains uncertain, trials that not only generate information about intervention effects but also provide public health benefit would be useful. Cluster randomized trials (CRT) capture both direct and indirect intervention effects; the latter depends heavily on contact networks within and across clusters. We propose a novel class of connectivity-informed trial designs that leverages information about such networks in order to improve public health impact and preserve ability to detect intervention effects.
Methods: We consider CRTs in which the order of enrollment is based on the total number of ties between individuals across clusters (based either on the total number of inter-cluster connections or on connections only to untreated clusters). We include options analogous both to traditional Parallel and Stepped Wedge designs. We also allow for control clusters to be "held-back" from re-randomization for some period. We investigate the performance of these designs, in terms of both epidemic control and power to detect vaccine effect, by simulating vaccination trials during an SEIR-type epidemic using a network-structured agent-based model.
Results: In our simulations, connectivity-informed designs have lower peak infectiousness than comparable traditional designs and reduce cumulative incidence by 20%, but with little impact on time to end of epidemic and reduced power to detect differences in incidence across clusters. However, even a brief "holdback" period restores most of the power lost compared to traditional approaches.
Conclusion: Incorporating information about cluster connectivity in design of CRTs can increase their public health impact, especially in acute outbreak settings, with modest cost in power to detect an effective intervention.
Submitted 28 October, 2016;
originally announced October 2016.
-
Feature-Based Classification of Networks
Authors:
Ian Barnett,
Nishant Malik,
Marieke L. Kuijjer,
Peter J. Mucha,
Jukka-Pekka Onnela
Abstract:
Network representations of systems from various scientific and societal domains are neither completely random nor fully regular, but instead appear to contain recurring structural building blocks. These features tend to be shared by networks belonging to the same broad class, such as the class of social networks or the class of biological networks. At a finer scale of classification within each such class, networks describing more similar systems tend to have more similar features. This occurs presumably because networks representing similar purposes or constructions would be expected to be generated by a shared set of domain specific mechanisms, and it should therefore be possible to classify these networks into categories based on their features at various structural levels. Here we describe and demonstrate a new, hybrid approach that combines manual selection of features of potential interest with existing automated classification methods. In particular, selecting well-known and well-studied features that have been used throughout social network analysis and network science and then classifying with methods such as random forests that are of special utility in the presence of feature collinearity, we find that we achieve higher accuracy, in shorter computation time, with greater interpretability of the network classification results.
Submitted 19 October, 2016;
originally announced October 2016.
-
Leveraging Contact Network Information in Clustered Randomized Studies of Contagion Processes
Authors:
Maxwell H Wang,
Patrick Staples,
Mélanie Prague,
Victor De Gruttola,
Jukka-Pekka Onnela
Abstract:
In a randomized study, leveraging covariates related to the outcome (e.g. disease status) may produce less variable estimates of the effect of exposure. For contagion processes operating on a contact network, transmission can only occur through ties that connect affected and unaffected individuals; the outcome of such a process is known to depend intimately on the structure of the network. In this paper, we investigate the use of contact network features as efficiency covariates in exposure effect estimation. Using augmented generalized estimating equations (GEE), we estimate how gains in efficiency depend on the network structure and spread of the contagious agent or behavior. We apply this approach to simulated randomized trials using a stochastic compartmental contagion model on a collection of model-based contact networks and compare the bias, power, and variance of the estimated exposure effects using an assortment of network covariate adjustment strategies. We also demonstrate the use of network-augmented GEEs on a clustered randomized trial evaluating the effects of wastewater monitoring on COVID-19 cases in residential buildings at the University of California San Diego.
Submitted 14 February, 2023; v1 submitted 30 September, 2016;
originally announced October 2016.
-
Inferring Mobility Measures from GPS Traces with Missing Data
Authors:
Ian Barnett,
Jukka-Pekka Onnela
Abstract:
With increasing availability of smartphones with GPS capabilities, large-scale studies relating individual-level mobility patterns to a wide variety of patient-centered outcomes, from mood disorders to surgical recovery, are becoming a reality. Similar past studies have been small in scale and have provided wearable GPS devices to subjects. These devices typically collect mobility traces continuously without significant gaps in the data, and consequently the problem of data missingness has been safely ignored. Leveraging subjects' own smartphones makes it possible to scale up and extend the duration of these types of studies, but at the same time introduces a substantial challenge: to preserve a smartphone's battery, GPS can be active only for a small portion of the time, frequently less than $10\%$, leading to a tremendous missing data problem. We introduce a principled statistical approach, based on weighted resampling of the observed data, to impute the missing mobility traces, which we then summarize using different mobility measures. We compare the strengths of our approach to linear interpolation, a popular approach for dealing with missing data, both analytically and through simulation of missingness for empirical data. We conclude that our imputation approach better mirrors human mobility both theoretically and over a sample of GPS mobility traces from 182 individuals in the Geolife data set, where, relative to linear interpolation, imputation resulted in a 10-fold reduction in the error averaged across all mobility features.
Submitted 19 April, 2018; v1 submitted 20 June, 2016;
originally announced June 2016.
-
Social and Spatial Clustering of People at Humanity's Largest Gathering
Authors:
Ian Barnett,
Tarun Khanna,
Jukka-Pekka Onnela
Abstract:
Macroscopic behavior of scientific and societal systems results from the aggregation of microscopic behaviors of their constituent elements, but connecting the macroscopic with the microscopic in human behavior has traditionally been difficult. Manifestations of homophily, the notion that individuals tend to interact with others who resemble them, have been observed in many small and intermediate size settings. However, whether this behavior translates to truly macroscopic levels, and what its consequences may be, remains unknown. Here, we use call detail records (CDRs) to examine the population dynamics and manifestations of social and spatial homophily at a macroscopic level among the residents of 23 states of India at the Kumbh Mela, a 3-month-long Hindu festival. We estimate that the festival was attended by 61 million people, making it the largest gathering in the history of humanity. While we find strong overall evidence for both types of homophily for residents of different states, participants from low-representation states show considerably stronger propensity for both social and spatial homophily than those from high-representation states. These manifestations of homophily are amplified on crowded days, such as the peak day of the festival, which we estimate was attended by 25 million people. Our findings confirm that homophily, which here likely arises from social influence, permeates all scales of human behavior.
Submitted 23 May, 2016;
originally announced May 2016.
-
Impact of degree truncation on the spread of a contagious process on networks
Authors:
Guy Harling,
Jukka-Pekka Onnela
Abstract:
Understanding how person-to-person contagious processes spread through a population requires accurate information on connections between population members. However, such connectivity data, when collected via interview, is often incomplete due to partial recall, respondent fatigue or study design, e.g., fixed choice designs (FCD) truncate out-degree by limiting the number of contacts each respondent can report. Past research has shown how FCD truncation affects network properties, but its implications for predicted speed and size of spreading processes remain largely unexplored. To study the impact of degree truncation on spreading processes, we generated collections of synthetic networks containing specific properties (degree distribution, degree-assortativity, clustering), and also used empirical social network data from 75 villages in Karnataka, India. We simulated FCD using various truncation thresholds and ran a susceptible-infectious-recovered (SIR) process on each network. We found that spreading processes propagated on truncated networks resulted in slower and smaller epidemics, with a sudden decrease in prediction accuracy at a level of truncation that varied by network type. Our results have implications beyond FCD to truncation due to any limited sampling from a larger network. We conclude that knowledge of network structure is important for understanding the accuracy of predictions of process spread on degree truncated networks.
Submitted 10 February, 2016;
originally announced February 2016.
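The degree-truncation experiment summarized in the abstract above (truncate reported contacts, then run an SIR process on the result) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the ring-lattice test network, the rule that a tie survives if either endpoint reports it, the discrete-time SIR dynamics, and the names `ring_lattice`, `truncate_out_degree`, and `simulate_sir` are all hypothetical.

```python
import random

def ring_lattice(n, half_width):
    """Toy undirected network: node i is tied to its half_width nearest
    neighbors on each side of a ring (assumes no self-loops arise)."""
    adjacency = {i: set() for i in range(n)}
    for i in range(n):
        for d in range(1, half_width + 1):
            j = (i + d) % n
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency

def truncate_out_degree(adjacency, k, rng):
    """Fixed choice design: each respondent reports at most k contacts,
    chosen uniformly at random. Assumption: a tie is kept if either
    endpoint reports it."""
    truncated = {node: set() for node in adjacency}
    for node, neighbors in adjacency.items():
        candidates = sorted(neighbors)
        reported = candidates if len(candidates) <= k else rng.sample(candidates, k)
        for other in reported:
            truncated[node].add(other)
            truncated[other].add(node)
    return truncated

def simulate_sir(adjacency, seed, beta, gamma, rng):
    """Discrete-time SIR; returns the final epidemic size (ever infected).
    Requires gamma > 0 so that the epidemic terminates."""
    infected, recovered = {seed}, set()
    while infected:
        newly_infected = set()
        for node in infected:
            for neighbor in adjacency[node]:
                susceptible = neighbor not in infected and neighbor not in recovered
                if susceptible and rng.random() < beta:
                    newly_infected.add(neighbor)
        newly_recovered = {node for node in infected if rng.random() < gamma}
        infected = (infected | newly_infected) - newly_recovered
        recovered |= newly_recovered
    return len(recovered)

# Compare final sizes on the full and truncated networks (toy parameters).
rng = random.Random(42)
full_network = ring_lattice(200, 3)
truncated_network = truncate_out_degree(full_network, 2, rng)
size_full = simulate_sir(full_network, 0, 0.3, 0.2, rng)
size_truncated = simulate_sir(truncated_network, 0, 0.3, 0.2, rng)
```

Repeating the two simulations over many seeds and truncation thresholds `k` would give distributions of epidemic size that can be compared between the full and truncated networks.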
-
Incorporating Contact Network Structure in Cluster Randomized Trials
Authors:
Patrick C. Staples,
Elizabeth L. Ogburn,
Jukka-Pekka Onnela
Abstract:
Whenever possible, the efficacy of a new treatment, such as a drug or behavioral intervention, is investigated by randomly assigning some individuals to a treatment condition and others to a control condition, and comparing the outcomes between the two groups. Often, when the treatment aims to slow an infectious disease, groups or clusters of individuals are assigned en masse to each treatment arm. The structure of interactions within and between clusters can reduce the power of the trial, i.e. the probability of correctly detecting a real treatment effect. We investigate the relationships among power, within-cluster structure, between-cluster mixing, and infectivity by simulating an infectious process on a collection of clusters. We demonstrate that current power calculations may be conservative for low levels of between-cluster mixing, but failing to account for moderate or high amounts can result in severely underpowered studies. Power also depends on within-cluster network structure for certain kinds of infectious spreading. Infections that spread opportunistically through very highly connected individuals have unpredictable infectious breakouts, which makes it harder to distinguish between random variation and real treatment effects. Our approach can be used before conducting a trial to assess power using network information if it is available, and we demonstrate how empirical data can inform the extent of between-cluster mixing.
Submitted 30 April, 2015;
originally announced May 2015.
-
Change Point Detection in Correlation Networks
Authors:
Ian Barnett,
Jukka-Pekka Onnela
Abstract:
Many systems of interacting elements can be conceptualized as networks, where network nodes represent the elements and network ties represent interactions between the elements. In systems where the underlying network evolves in time, it is useful to determine the points in time where the network structure changes significantly as these may correspond also to functional change points. We propose a method for detecting these change points in correlation networks that, unlike previous change point detection methods designed for time series data, requires no distributional assumptions. We investigate the difficulty of change point detection near the boundaries of data in correlation networks and demonstrate the power of our method and a competing method through simulation. We also show the generalizable nature of our method by applying it to stock price data as well as fMRI data.
Submitted 4 May, 2015; v1 submitted 3 October, 2014;
originally announced October 2014.