-
Optimization perspective on raking
Authors:
Ariane Ducellier,
Alexander Hsu,
Parkes Kendrick,
Bill Gustafson,
Laura Dwyer-Lindgren,
Christopher Murray,
Peng Zheng,
Aleksandr Aravkin
Abstract:
Raking is widely used in survey inference and global health models to adjust the observations in contingency tables to given marginals, in the latter case reconciling estimates between models with different granularities. We review the convex optimization foundation of raking and focus on a dual perspective that simplifies and streamlines prior raking extensions and provides new functionality, enabling a unified approach to n-dimensional raking, raking with differential weights, ensuring bounds on estimates are respected, raking to margins either as hard constraints or as aggregate observations, handling missing data, and allowing efficient uncertainty propagation. The dual perspective also enables a uniform, fast, and scalable matrix-free optimization approach for all of these extensions. All of the methods are implemented in an open-source Python package with an intuitive user interface, installable from PyPI (https://pypi.org/project/raking/), and we illustrate the capabilities using synthetic data and real mortality estimates.
Submitted 8 May, 2025; v1 submitted 29 July, 2024;
originally announced July 2024.
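As a rough illustration of what raking a contingency table to given marginals means, the sketch below implements classical two-dimensional raking by iterative proportional fitting in NumPy. It is not the API of the `raking` package and not the dual, matrix-free algorithm the abstract describes; the function name `rake_2d`, its parameters, and the example numbers are all hypothetical.

```python
import numpy as np


def rake_2d(table, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Classical 2D raking via iterative proportional fitting (illustrative only).

    Alternately rescales rows and columns of a strictly positive table
    until its margins match the requested targets.
    """
    x = np.asarray(table, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    if not np.isclose(row_targets.sum(), col_targets.sum()):
        raise ValueError("row and column targets must have the same total")
    for _ in range(max_iter):
        # Scale each row so the row sums match the row targets.
        x *= (row_targets / x.sum(axis=1))[:, None]
        # Scale each column so the column sums match the column targets.
        x *= (col_targets / x.sum(axis=0))[None, :]
        # Column sums are now exact; stop once the row sums agree as well.
        if np.abs(x.sum(axis=1) - row_targets).max() < tol:
            break
    return x


# Hypothetical example: a 2x2 table raked to new row and column totals.
observed = np.array([[10.0, 20.0], [30.0, 40.0]])
raked = rake_2d(observed, row_targets=[40.0, 60.0], col_targets=[35.0, 65.0])
print(raked.sum(axis=1), raked.sum(axis=0))
```

Because each step only multiplies whole rows or whole columns by a constant, the cross-product ratios of the original table are preserved while the margins move to the targets.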
-
Digital Twin Generators for Disease Modeling
Authors:
Nameyeh Alam,
Jake Basilico,
Daniele Bertolini,
Satish Casie Chetty,
Heather D'Angelo,
Ryan Douglas,
Charles K. Fisher,
Franklin Fuller,
Melissa Gomes,
Rishabh Gupta,
Alex Lang,
Anton Loukianov,
Rachel Mak-McCully,
Cary Murray,
Hanalei Pham,
Susanna Qiao,
Elena Ryapolova-Webb,
Aaron Smith,
Dimitri Theoharatos,
Anil Tolwani,
Eric W. Tramel,
Anna Vidovszky,
Judy Viduya,
Jonathan R. Walsh
Abstract:
A patient's digital twin is a computational model that describes the evolution of their health over time. Digital twins have the potential to revolutionize medicine by enabling individual-level computer simulations of human health, which can be used to conduct more efficient clinical trials or to recommend personalized treatment options. Due to the overwhelming complexity of human biology, machine learning approaches that leverage large datasets of historical patients' longitudinal health records to generate patients' digital twins are more tractable than potential mechanistic models. In this manuscript, we describe a neural network architecture that can learn conditional generative models of clinical trajectories, which we call Digital Twin Generators (DTGs), that can create digital twins of individual patients. We show that the same neural network architecture can be trained to generate accurate digital twins for patients across 13 different indications simply by changing the training set and tuning hyperparameters. By introducing a general-purpose architecture, we aim to unlock the ability to scale machine learning approaches to larger datasets and across more indications so that a digital twin could be created for any patient in the world.
Submitted 2 May, 2024;
originally announced May 2024.
-
Robust Nonparametric Stochastic Frontier Analysis
Authors:
Peng Zheng,
Nahom Worku,
Marlena Bannick,
Joseph Dielemann,
Marcia Weaver,
Christopher Murray,
Aleksandr Aravkin
Abstract:
Benchmarking tools, including stochastic frontier analysis (SFA), data envelopment analysis (DEA), and its stochastic extension (StoNED), are core tools in economics used to estimate an efficiency envelope and production inefficiencies from data. The problem appears in a wide range of fields -- for example, in global health the frontier can quantify efficiency of interventions and funding of health initiatives. Despite their wide use, classic benchmarking approaches have key limitations that preclude even wider applicability. Here we propose a robust non-parametric stochastic frontier meta-analysis (SFMA) approach that fills these gaps. First, we use flexible basis splines and shape constraints to model the frontier function, so specifying a functional form of the frontier as in classic SFA is no longer necessary. Second, the user can specify relative errors on input datapoints, enabling population-level analyses. Third, we develop a likelihood-based trimming strategy to robustify the approach to outliers, which otherwise break available benchmarking methods. We provide a custom optimization algorithm for fast and reliable performance. We implement the approach and algorithm in an open-source Python package `sfma`. Synthetic and real examples show the new capabilities of the method, and are used to compare SFMA to state-of-the-art benchmarking packages that implement DEA, SFA, and StoNED.
Submitted 4 April, 2024;
originally announced April 2024.
-
Addressing Bias in Active Learning with Depth Uncertainty Networks... or Not
Authors:
Chelsea Murray,
James U. Allingham,
Javier Antorán,
José Miguel Hernández-Lobato
Abstract:
Farquhar et al. [2021] show that correcting for active learning bias with underparameterised models leads to improved downstream performance. For overparameterised models such as NNs, however, correction leads either to decreased or unchanged performance. They suggest that this is due to an "overfitting bias" which offsets the active learning bias. We show that depth uncertainty networks operate in a low overfitting regime, much like underparameterised models. They should therefore see an increase in performance with bias correction. Surprisingly, they do not. We propose that this negative result, as well as the results of Farquhar et al. [2021], can be explained via the lens of the bias-variance decomposition of generalisation error.
Submitted 13 December, 2021;
originally announced December 2021.
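For reference, the bias-variance decomposition that the abstract above invokes takes the following textbook form under squared loss. The notation is an assumption made here for illustration and is not taken from the paper: $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, and $\hat{f}$ is the model fitted on a random training set, with expectations taken over training sets and noise.

```latex
\mathbb{E}\!\left[\bigl(y - \hat{f}(x)\bigr)^2\right]
  = \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```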
-
Depth Uncertainty Networks for Active Learning
Authors:
Chelsea Murray,
James U. Allingham,
Javier Antorán,
José Miguel Hernández-Lobato
Abstract:
In active learning, the size and complexity of the training dataset change over time. Simple models that are well specified by the amount of data available at the start of active learning might suffer from bias as more points are actively sampled. Flexible models that might be well suited to the full dataset can suffer from overfitting towards the start of active learning. We tackle this problem using Depth Uncertainty Networks (DUNs), a BNN variant in which the depth of the network, and thus its complexity, is inferred. We find that DUNs outperform other BNN variants on several active learning tasks. Importantly, we show that on the tasks in which DUNs perform best they present notably less overfitting than baselines.
Submitted 4 May, 2022; v1 submitted 13 December, 2021;
originally announced December 2021.
-
Symptom extraction from the narratives of personal experiences with COVID-19 on Reddit
Authors:
Curtis Murray,
Lewis Mitchell,
Jonathan Tuke,
Mark Mackay
Abstract:
Social media discussion of COVID-19 provides a rich source of information on how the virus affects people's lives that is qualitatively different from traditional public health datasets. In particular, when individuals self-report their experiences over the course of the virus on social media, it can allow for identification of the emotions each stage of symptoms engenders in the patient. Posts to the Reddit forum r/COVID19Positive contain first-hand accounts from COVID-19 positive patients, giving insight into personal struggles with the virus. These posts often feature a temporal structure indicating the number of days after developing symptoms the text refers to. Using topic modelling and sentiment analysis, we quantify the change in discussion of COVID-19 throughout individuals' experiences for the first 14 days since symptom onset. Discourse on early symptoms such as fever, cough, and sore throat was concentrated towards the beginning of the posts, while language indicating breathing issues peaked around ten days. Some conversation around critical cases was also identified and appeared at a roughly constant rate. We identified two clear clusters of positive and negative emotions associated with the evolution of these symptoms and mapped their relationships. Our results provide a perspective on the patient experience of COVID-19 that complements other medical data streams and can potentially reveal when mental health issues might appear.
Submitted 20 May, 2020;
originally announced May 2020.
-
Trimmed Constrained Mixed Effects Models: Formulations and Algorithms
Authors:
Peng Zheng,
Ryan Barber,
Reed J. D. Sorensen,
Christopher J. L. Murray,
Aleksandr Y. Aravkin
Abstract:
Mixed effects (ME) models inform a vast array of problems in the physical and social sciences, and are pervasive in meta-analysis. We consider ME models where the random effects component is linear. We then develop an efficient approach for a broad problem class that allows nonlinear measurements, priors, and constraints, and finds robust estimates in all of these cases using trimming in the associated marginal likelihood.
The software accompanying this paper is disseminated as an open-source Python package called LimeTr. LimeTr is able to recover results more accurately in the presence of outliers compared to available packages for both standard longitudinal analysis and meta-analysis, and is also more computationally efficient than competing robust alternatives. Supplementary materials that reproduce the simulations, as well as run LimeTr and third-party code, are available online. We also present analyses of global health data, where we use advanced functionality of LimeTr, including constraints to impose monotonicity and concavity for dose-response relationships. Nonlinear observation models allow new analyses in place of classic approximations, such as log-linear models. Robust extensions in all analyses ensure that spurious data points do not drive our understanding of either mean relationships or between-study heterogeneity.
Submitted 27 October, 2020; v1 submitted 23 September, 2019;
originally announced September 2019.