-
Analysis of Diagnostics (Part II): Prevalence, Linear Independence, and Unsupervised Learning
Authors:
Paul N. Patrone,
Raquel A. Binder,
Catherine S. Forconi,
Ann M. Moormann,
Anthony J. Kearsley
Abstract:
This is the second manuscript in a two-part series that uses diagnostic testing to understand the connection between prevalence (i.e. number of elements in a class), uncertainty quantification (UQ), and classification theory. Part I considered the context of supervised machine learning (ML) and established a duality between prevalence and the concept of relative conditional probability. The key idea of that analysis was to train a family of discriminative classifiers by minimizing a sum of prevalence-weighted empirical risk functions. The resulting outputs can be interpreted as relative probability level-sets, which thereby yield uncertainty estimates in the class labels. This procedure also demonstrated that certain discriminative and generative ML models are equivalent. Part II considers the extent to which these results can be extended to tasks in unsupervised learning through recourse to ideas in linear algebra. We first observe that the distribution of an impure population, for which the class of a corresponding sample is unknown, can be parameterized in terms of a prevalence. This motivates us to introduce the concept of linearly independent populations, which have different but unknown prevalence values. Using this, we identify an isomorphism between classifiers defined in terms of impure and pure populations. In certain cases, this also leads to a nonlinear system of equations whose solution yields the prevalence values of the linearly independent populations, fully realizing unsupervised learning as a generalization of supervised learning. We illustrate our methods in the context of synthetic data and a research-use-only SARS-CoV-2 enzyme-linked immunosorbent assay (ELISA).
Submitted 28 August, 2024;
originally announced August 2024.
-
Prevalence estimation methods for time-dependent antibody kinetics of infected and vaccinated individuals: a graph-theoretic approach
Authors:
Prajakta Bedekar,
Rayanne A. Luke,
Anthony J. Kearsley
Abstract:
Immune events such as infection, vaccination, and a combination of the two result in distinct time-dependent antibody responses in affected individuals. These responses and event prevalences combine non-trivially to govern antibody levels sampled from a population. Time-dependence and disease prevalence pose considerable modeling challenges that need to be addressed to provide a rigorous mathematical underpinning of the underlying biology. We propose a time-inhomogeneous Markov chain model for event-to-event transitions coupled with a probabilistic framework for antibody kinetics and demonstrate its use in a setting in which individuals can be infected or vaccinated but not both. We prove the equivalence of this approach to the framework developed in our previous work. Synthetic data are used to demonstrate the modeling process and conduct prevalence estimation via transition probability matrices. This approach is well suited to modeling sequences of infections and vaccinations, or personal trajectories in a population, making it an important first step towards a mathematical characterization of reinfection, vaccination boosting, and cross-events of infection after vaccination or vice versa.
Submitted 13 April, 2024;
originally announced April 2024.
-
Analysis of Diagnostics (Part I): Prevalence, Uncertainty Quantification, and Machine Learning
Authors:
Paul N. Patrone,
Raquel A. Binder,
Catherine S. Forconi,
Ann M. Moormann,
Anthony J. Kearsley
Abstract:
Diagnostic testing provides a unique setting for studying and developing tools in classification theory. In such contexts, the concept of prevalence, i.e. the number of individuals with a given condition, is fundamental, both as an inherent quantity of interest and as a parameter that controls classification accuracy. This manuscript is the first in a two-part series that studies deeper connections between classification theory and prevalence, showing how the latter establishes a more complete theory of uncertainty quantification (UQ) for certain types of machine learning (ML). We motivate this analysis via a lemma demonstrating that general classifiers minimizing a prevalence-weighted error contain the same probabilistic information as Bayes-optimal classifiers, which depend on conditional probability densities. This leads us to study relative probability level-sets $B^\star (q)$, which are reinterpreted as both classification boundaries and useful tools for quantifying uncertainty in class labels. To realize this in practice, we also propose a numerical homotopy algorithm that estimates the $B^\star (q)$ by minimizing a prevalence-weighted empirical error. The successes and shortcomings of this method motivate us to revisit properties of the level sets, and we deduce that the corresponding classifiers obey a useful monotonicity property that stabilizes the numerics and points to important extensions to UQ of ML. Throughout, we validate our methods in the context of synthetic data and a research-use-only SARS-CoV-2 enzyme-linked immunosorbent assay (ELISA).
Submitted 28 August, 2024; v1 submitted 30 August, 2023;
originally announced September 2023.
-
Optimal classification and generalized prevalence estimates for diagnostic settings with more than two classes
Authors:
Rayanne A. Luke,
Anthony J. Kearsley,
Paul N. Patrone
Abstract:
An accurate multiclass classification strategy is crucial to interpreting antibody tests. However, traditional methods based on confidence intervals or receiver operating characteristics lack clear extensions to settings with more than two classes. We address this problem by developing a multiclass classification based on probabilistic modeling and optimal decision theory that minimizes the convex combination of false classification rates. The classification process is challenging when the relative fraction of the population in each class, or generalized prevalence, is unknown. Thus, we also develop a method for estimating the generalized prevalence of test data that is independent of classification. We validate our approach on serological data with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) naïve, previously infected, and vaccinated classes. Synthetic data are used to demonstrate that (i) prevalence estimates are unbiased and converge to true values and (ii) our procedure applies to arbitrary measurement dimensions. In contrast to the binary problem, the multiclass setting offers wide-reaching utility as the most general framework and provides new insight into prevalence estimation best practices.
Submitted 5 October, 2022;
originally announced October 2022.
-
Prevalence Estimation and Optimal Classification Methods to Account for Time Dependence in Antibody Levels
Authors:
Prajakta Bedekar,
Anthony J. Kearsley,
Paul N. Patrone
Abstract:
Serology testing can identify past infection by quantifying the immune response of an infected individual, providing important public health guidance. Individual immune responses are time-dependent, which is reflected in antibody measurements. Moreover, the probability of obtaining a particular measurement changes due to prevalence as the disease progresses. Taking into account these personal and population-level effects, we develop a mathematical model that suggests a natural adaptive scheme for estimating prevalence as a function of time. We then combine the estimated prevalence with optimal decision theory to develop a time-dependent probabilistic classification scheme that minimizes error. We validate this analysis by using a combination of real-world and synthetic SARS-CoV-2 data and discuss the type of longitudinal studies needed to execute this scheme in real-world settings.
Submitted 3 August, 2022;
originally announced August 2022.
-
Modeling in higher dimensions to improve diagnostic testing accuracy: theory and examples for multiplex saliva-based SARS-CoV-2 antibody assays
Authors:
Rayanne A. Luke,
Anthony J. Kearsley,
Nora Pisanic,
Yukari C. Manabe,
David L. Thomas,
Christopher D. Heaney,
Paul N. Patrone
Abstract:
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has emphasized the importance and challenges of correctly interpreting antibody test results. Identification of positive and negative samples requires a classification strategy with low error rates, which is hard to achieve when the corresponding measurement values overlap. Additional uncertainty arises when classification schemes fail to account for complicated structure in data. We address these problems through a mathematical framework that combines high dimensional data modeling and optimal decision theory. Specifically, we show that appropriately increasing the dimension of data better separates positive and negative populations and reveals nuanced structure that can be described in terms of mathematical models. We combine these models with optimal decision theory to yield a classification scheme that better separates positive and negative samples relative to traditional methods such as confidence intervals (CIs) and receiver operating characteristics. We validate the usefulness of this approach in the context of a multiplex salivary SARS-CoV-2 immunoglobulin G assay dataset. This example illustrates how our analysis: (i) improves the assay accuracy (e.g. lowers classification errors by up to 42 % compared to CI methods); (ii) reduces the number of indeterminate samples when an inconclusive class is permissible (e.g. by 40 % compared to the original analysis of the example multiplex dataset); and (iii) decreases the number of antigens needed to classify samples. Our work showcases the power of mathematical modeling in diagnostic classification and highlights a method that can be adopted broadly in public health and clinical settings.
Submitted 9 November, 2022; v1 submitted 28 June, 2022;
originally announced June 2022.
-
Optimal Decision Theory for Diagnostic Testing: Minimizing Indeterminate Classes with Applications to Saliva-Based SARS-CoV-2 Antibody Assays
Authors:
Paul N. Patrone,
Prajakta Bedekar,
Nora Pisanic,
Yukari C. Manabe,
David L. Thomas,
Christopher D. Heaney,
Anthony J. Kearsley
Abstract:
In diagnostic testing, establishing an indeterminate class is an effective way to identify samples that cannot be accurately classified. However, such approaches also make testing less efficient and must be balanced against overall assay performance. We address this problem by reformulating data classification in terms of a constrained optimization problem that (i) minimizes the probability of labeling samples as indeterminate while (ii) ensuring that the remaining ones are classified with an average target accuracy X. We show that the solution to this problem is expressed in terms of a bathtub principle that holds out those samples with the lowest local accuracy up to an X-dependent threshold. To illustrate the usefulness of this analysis, we apply it to a multiplex, saliva-based SARS-CoV-2 antibody assay and demonstrate up to a 30 % reduction in the number of indeterminate samples relative to more traditional approaches.
Submitted 31 January, 2022;
originally announced February 2022.
-
Classification Under Uncertainty: Data Analysis for Diagnostic Antibody Testing
Authors:
Paul N. Patrone,
Anthony J. Kearsley
Abstract:
Formulating accurate and robust classification strategies is a key challenge of developing diagnostic and antibody tests. Methods that do not explicitly account for disease prevalence and uncertainty therein can lead to significant classification errors. We present a novel method that leverages optimal decision theory to address this problem. As a preliminary step, we develop an analysis that uses an assumed prevalence and conditional probability models of diagnostic measurement outcomes to define optimal (in the sense of minimizing rates of false positives and false negatives) classification domains. Critically, we demonstrate how this strategy can be generalized to a setting in which the prevalence is unknown by either: (i) defining a third class of hold-out samples that require further testing; or (ii) using an adaptive algorithm to estimate prevalence prior to defining classification domains. We also provide examples for a recently published SARS-CoV-2 serology test and discuss how measurement uncertainty (e.g. associated with instrumentation) can be incorporated into the analysis. We find that our new strategy decreases classification error by up to an order of magnitude relative to more traditional methods based on confidence intervals. Moreover, it establishes a theoretical foundation for generalizing techniques such as receiver operating characteristics (ROC) by connecting them to the broader field of optimization.
Submitted 9 April, 2021; v1 submitted 17 December, 2020;
originally announced December 2020.
-
The Role of Data Analysis in Uncertainty Quantification: Case Studies for Materials Modeling
Authors:
Paul N. Patrone,
Anthony J. Kearsley,
Andrew M. Dienstfrey
Abstract:
In computational materials science, mechanical properties are typically extracted from simulations by means of analysis routines that seek to mimic their experimental counterparts. However, simulated data often exhibit uncertainties that can propagate into final predictions in unexpected ways. Thus, modelers require data analysis tools that (i) address the problems posed by simulated data, and (ii) facilitate uncertainty quantification. In this manuscript, we discuss three case studies in materials modeling where careful data analysis can be leveraged to address specific instances of these issues. As a unifying theme, we highlight the idea that attention to physical and mathematical constraints surrounding the generation of computational data can significantly enhance its analysis.
Submitted 5 December, 2017;
originally announced December 2017.