-
Asymmetric Interactions Shape Survival During Population Range Expansions
Authors:
Jason M. Gray,
Rowan J. Barker-Clarke,
Jacob G. Scott,
Michael Hinczewski
Abstract:
An organism that is newly introduced into an existing population has a survival probability that is dependent on both the population density of its environment and the competition it experiences with the members of that population. Expanding populations naturally form regions of high and low density, and simultaneously experience ecological interactions both internally and at the boundary of their range. For this reason, systems of expanding populations are ideal for studying the combination of density and ecological effects. Conservation ecologists have been studying the ability of an invasive species to establish for some time, attributing success to both ecological and spatial factors. Similar behaviors have been observed in spatially structured cell populations, such as those found in cancerous tumors and bacterial biofilms. In these scenarios, novel organisms may be the introduction of a new mutation or bacterial species with some form of drug resistance, leading to the possibility of treatment failure. In order to gain insight into the relationship between population density and ecological interactions, we study an expanding population of interacting wild-type cells and mutant cells. We simulate these interactions in time and study the spatially dependent probability for a mutant to survive or to take over the front of the population wave (gene surfing). Additionally, we develop a mathematical model that describes this survival probability and find agreement when the payoff for the mutant is positive (corresponding to cooperation, exploitation, or commensalism). By knowing the types of interactions, our model provides insight into the spatial distribution of survival probability. Conversely, given a spatial distribution of survival probabilities, our model provides insight into the types of interactions that were involved to generate it.
Submitted 14 December, 2024;
originally announced December 2024.
-
Conditional diffusions for amortized neural posterior estimation
Authors:
Tianyu Chen,
Vansh Bansal,
James G. Scott
Abstract:
Neural posterior estimation (NPE), a simulation-based computational approach for Bayesian inference, has shown great success in approximating complex posterior distributions. Existing NPE methods typically rely on normalizing flows, which approximate a distribution by composing many simple, invertible transformations. But flow-based models, while state of the art for NPE, are known to suffer from several limitations, including training instability and sharp trade-offs between representational power and computational cost. In this work, we demonstrate the effectiveness of conditional diffusions coupled with high-capacity summary networks for amortized NPE. Conditional diffusions address many of the challenges faced by flow-based methods. Our results show that, across a highly varied suite of benchmarking problems for NPE architectures, diffusions offer improved stability, superior accuracy, and faster training times, even with simpler, shallower models. Building on prior work on diffusions for NPE, we show that these gains persist across a variety of different summary network architectures. Code is available at https://github.com/TianyuCodings/cDiff.
Submitted 12 March, 2025; v1 submitted 24 October, 2024;
originally announced October 2024.
-
Inferring Density-Dependent Population Dynamics Mechanisms through Rate Disambiguation for Logistic Birth-Death Processes
Authors:
Linh Huynh,
Jacob G. Scott,
Peter J. Thomas
Abstract:
Density dependence is important in the ecology and evolution of microbial and cancer cells. Typically, we can only measure net growth rates, but the underlying density-dependent mechanisms that give rise to the observed dynamics can manifest in birth processes, death processes, or both. Therefore, we utilize the mean and variance of cell number fluctuations to separately identify birth and death rates from time series that follow stochastic birth-death processes with logistic growth. Our method provides a novel perspective on stochastic parameter identifiability, which we validate by analyzing the accuracy in terms of the discretization bin size. We apply our method to the scenario where a homogeneous cell population goes through three stages: (1) grows naturally to its carrying capacity, (2) is treated with a drug that reduces its carrying capacity, and (3) overcomes the drug effect to restore its original carrying capacity. In each stage, we disambiguate whether it happens through the birth process, death process, or some combination of the two, which contributes to understanding drug resistance mechanisms. In the case of limited data sets, we provide an alternative method based on maximum likelihood and solve a constrained nonlinear optimization problem to identify the most likely density dependence parameter for a given cell number time series. Our methods can be applied to other biological systems at different scales to disambiguate density-dependent mechanisms underlying the same net growth rate.
Submitted 10 May, 2022;
originally announced May 2022.
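The mean-and-variance idea in the abstract above can be illustrated with a minimal sketch. It assumes the standard short-time approximation for a birth-death process: over a small step dt at population size n, the change in cell number has mean close to (b - d) * n * dt and variance close to (b + d) * n * dt, where b and d are per-capita rates. The function name and its arguments are hypothetical and are not taken from the paper, whose binning and maximum-likelihood procedures are not reproduced here.

```python
from statistics import fmean, pvariance


def estimate_birth_death(n, increments, dt):
    """Estimate per-capita birth and death rates at population size n.

    Assumption (not from the paper's code): over a short step dt the
    change in cell number has mean ~ (b - d) * n * dt and
    variance ~ (b + d) * n * dt.
    """
    if n <= 0 or dt <= 0:
        raise ValueError("n and dt must be positive")
    if len(increments) < 2:
        raise ValueError("need at least two increments")
    mean = fmean(increments)          # estimates (b - d) * n * dt
    var = pvariance(increments)       # estimates (b + d) * n * dt
    scale = 2.0 * n * dt
    birth = (var + mean) / scale      # solve the two equations for b
    death = (var - mean) / scale      # ... and for d
    return birth, death
```

Calling estimate_birth_death once per population-size bin gives separate birth and death curves, even when only their difference (the net growth rate) is visible in the mean trajectory.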
-
Accessing United States Bulk Patent Data with patentpy and patentr
Authors:
James Yu,
Hayley Beltz,
Milind Y. Desai,
Péter Érdi,
Jacob G. Scott,
Raoul R. Wadhwa
Abstract:
The United States Patent and Trademark Office (USPTO) provides publicly accessible bulk data files containing information for all patents from 1976 onward. However, the format of these files changes over time and is memory-inefficient, which can pose issues for individual researchers. Here, we introduce the patentpy and patentr packages for the Python and R programming languages. They allow users to programmatically fetch bulk data from the USPTO website and access it locally in a cleaned, rectangular format. Research depending on United States patent data would benefit from the use of patentpy and patentr. We describe package implementation, quality control mechanisms, and present use cases highlighting simple, yet effective, applications of this software.
Submitted 18 July, 2021;
originally announced July 2021.
-
Exploring complex networks with the ICON R package
Authors:
Raoul R. Wadhwa,
Jacob G. Scott
Abstract:
We introduce ICON, an R package that contains 1075 complex network datasets in a standard edgelist format. All provided datasets have associated citations and have been indexed by the Colorado Index of Complex Networks - also referred to as ICON. In addition to supplying a large and diverse corpus of useful real-world networks, ICON also implements an S3 generic to work with the network and ggnetwork R packages for network analysis and visualization, respectively. Sample code in this report also demonstrates how ICON can be used in conjunction with the igraph package. Currently, the Comprehensive R Archive Network hosts ICON v0.4.0. We hope that ICON will serve as a standard corpus for complex network research and prevent redundant work that would be otherwise necessary by individual research groups. The open source code for ICON and for this reproducible report can be found at https://github.com/rrrlw/ICON.
Submitted 28 October, 2020;
originally announced October 2020.
-
Monotone function estimation in the presence of extreme data coarsening: Analysis of preeclampsia and birth weight in urban Uganda
Authors:
Jennifer E. Starling,
Catherine E. Aiken,
Jared S. Murray,
Annettee Nakimuli,
James G. Scott
Abstract:
This paper proposes a Bayesian hierarchical model to characterize the relationship between birth weight and maternal pre-eclampsia across gestation at a large maternity hospital in urban Uganda. Key scientific questions we investigate include: 1) how pre-eclampsia compares to other maternal-fetal covariates as a predictor of birth weight; and 2) whether the impact of pre-eclampsia on birth weight varies across gestation. Our model addresses several key statistical challenges: it correctly encodes the prior medical knowledge that birth weight should vary smoothly and monotonically with gestational age, yet it also avoids assumptions about functional form along with assumptions about how birth weight varies with other covariates. Our model also accounts for the fact that a high proportion (83%) of birth weights in our data set are rounded to the nearest 100 grams. Such extreme data coarsening is rare in maternity hospitals in high-resource obstetric settings but common for data sets collected in low- and middle-income countries (LMICs); this introduces a substantial extra layer of uncertainty into the problem and is a major reason why we adopt a Bayesian approach.
Our proposed non-parametric regression model, which we call Projective Smooth BART (psBART), builds upon the highly successful Bayesian Additive Regression Tree (BART) framework. This model captures complex nonlinear relationships and interactions, induces smoothness and monotonicity in a single target covariate, and provides a full posterior for uncertainty quantification. The results of our analysis show that pre-eclampsia is a dominant predictor of birth weight in this urban Ugandan setting, and therefore an important risk factor for perinatal mortality.
Submitted 14 December, 2019;
originally announced December 2019.
-
Controlling the speed and trajectory of evolution with counterdiabatic driving
Authors:
Shamreen Iram,
Emily Dolson,
Joshua Chiel,
Julia Pelesko,
Nikhil Krishnan,
Özenç Güngör,
Benjamin Kuznets-Speck,
Sebastian Deffner,
Efe Ilker,
Jacob G. Scott,
Michael Hinczewski
Abstract:
The pace and unpredictability of evolution are critically relevant in a variety of modern challenges: combating drug resistance in pathogens and cancer, understanding how species respond to environmental perturbations like climate change, and developing artificial selection approaches for agriculture. Great progress has been made in quantitative modeling of evolution using fitness landscapes, allowing a degree of prediction for future evolutionary histories. Yet fine-grained control of the speed and the distributions of these trajectories remains elusive. We propose an approach to achieve this using ideas originally developed in a completely different context: counterdiabatic driving to control the behavior of quantum states for applications like quantum computing and manipulating ultra-cold atoms. Implementing these ideas for the first time in a biological context, we show how a set of external control parameters (i.e. varying drug concentrations / types, temperature, nutrients) can guide the probability distribution of genotypes in a population along a specified path and time interval. This level of control, allowing empirical optimization of evolutionary speed and trajectories, has myriad potential applications, from enhancing adaptive therapies for diseases, to the development of thermotolerant crops in preparation for climate change, to accelerating bioengineering methods built on evolutionary models, like directed evolution of biomolecules.
Submitted 3 June, 2020; v1 submitted 8 December, 2019;
originally announced December 2019.
-
How Likely are Ride-share Drivers to Earn a Living Wage? Large-scale Spatio-temporal Density Smoothing with the Graph-fused Elastic Net
Authors:
Mauricio Tec,
Natalia Zuniga-Garcia,
Randy B. Machemehl,
James G. Scott
Abstract:
Ride-sourcing or transportation network companies (TNCs) provide on-demand transportation service for compensation, connecting drivers of personal vehicles with passengers through smartphone applications. In this study, we consider the problem of estimating a spatiotemporally varying probability distribution for the productivity of a TNC driver, using data on more than 1.2 million TNC trips in Austin, Texas. We propose a graph-based smoothing approach that allows for distinct spatial and temporal dynamics, including different degrees of smoothness, spatio-temporal interactions, and interpolation in regions with little or no data. For such a goal, we introduce the Graph-fused Elastic Net (GFEN) and use it in combination with a dyadic tree decomposition for density estimation. In addition, we present an optimization-driven approach for fast point estimates scalable to massive graphs. Bayesian inference and uncertainty quantification with MCMC are also illustrated. The main results demonstrate that the optimization strategy is an effective exploration tool for selecting adequate regularization schemes using Bayesian optimization of the cross-validation loss. Two key empirical findings made possible by our method include: 1) the probability that a TNC driver can expect to earn a living wage in Austin exhibits high variability in space and time, from as low as 25% to as high as 85%; and 2) some drivers suffer considerable "tail risk", with the bottom 10% of the earnings distribution falling below $10 per hour -- grossly below a living wage in Austin for a single adult -- for specific times and locations. All code and data for the paper are publicly available, as a Shiny app for visualizing the results and a software package in Julia for implementing the GFEN.
Submitted 9 July, 2021; v1 submitted 19 November, 2019;
originally announced November 2019.
-
Targeted Smooth Bayesian Causal Forests: An analysis of heterogeneous treatment effects for simultaneous versus interval medical abortion regimens over gestation
Authors:
Jennifer E. Starling,
Jared S. Murray,
Patricia A. Lohr,
Abigail R. A. Aiken,
Carlos M. Carvalho,
James G. Scott
Abstract:
We introduce Targeted Smooth Bayesian Causal Forests (tsBCF), a nonparametric Bayesian approach for estimating heterogeneous treatment effects which vary smoothly over a single covariate in the observational data setting. The tsBCF method induces smoothness by parameterizing terminal tree nodes with smooth functions, and allows for separate regularization of treatment effects versus prognostic effect of control covariates. Smoothing parameters for prognostic and treatment effects can be chosen to reflect prior knowledge or tuned in a data-dependent way.
We use tsBCF to analyze a new clinical protocol for early medical abortion. Our aim is to assess relative effectiveness of simultaneous versus interval administration of mifepristone and misoprostol over the first nine weeks of gestation. The model reflects our expectation that the relative effectiveness varies smoothly over gestation, but not necessarily over other covariates. We demonstrate the performance of the tsBCF method on benchmarking experiments. Software for tsBCF is available at https://github.com/jestarling/tsbcf/.
Submitted 23 February, 2020; v1 submitted 22 May, 2019;
originally announced May 2019.
-
A flat persistence diagram for improved visualization of persistent homology
Authors:
Raoul R. Wadhwa,
Andrew Dhawan,
Drew F. K. Williamson,
Jacob G. Scott
Abstract:
Visualization in the emerging field of topological data analysis has progressed from persistence barcodes and persistence diagrams to display of two-parameter persistent homology. Although persistence barcodes and diagrams have permitted insight into the geometry underlying complex datasets, visualization of even single-parameter persistent homology has significant room for improvement. Here, we propose a modification to the conventional persistence diagram - the flat persistence diagram - that more efficiently displays information relevant to persistent homology and simultaneously corrects for visual bias present in the former. Flat persistence diagrams display equivalent information as their predecessor, while providing researchers with an intuitive horizontal reference axis in contrast to the usual diagonal reference line. Reducing visual bias through the use of appropriate graphical displays not only provides more accurate, but also deeper insights into the topology that underlies complex datasets. Introducing flat persistence diagrams into widespread use would bring researchers one step closer to practical application of topological data analysis.
Submitted 5 January, 2019; v1 submitted 11 December, 2018;
originally announced December 2018.
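As a rough illustration of the change of axes described in the abstract above, the sketch below maps each (birth, death) pair of a conventional persistence diagram to (birth, persistence), with persistence taken as death minus birth. This is an assumption about the transform based on the abstract's description of a horizontal reference axis; the function name is hypothetical and is not the authors' software.

```python
def flatten_persistence_diagram(pairs):
    """Map (birth, death) pairs to (birth, persistence) pairs.

    Assumption: persistence = death - birth, so features that lie on the
    diagonal of a conventional diagram land on the horizontal axis.
    """
    flat = []
    for birth, death in pairs:
        if death < birth:
            raise ValueError("death must not precede birth")
        flat.append((birth, death - birth))
    return flat
```

Under this mapping, short-lived features sit near the horizontal axis and long-lived ones sit high above it, so feature lifetimes can be read directly as heights.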
-
Optimizing adaptive cancer therapy: dynamic programming and evolutionary game theory
Authors:
Mark Gluzman,
Jacob G. Scott,
Alexander Vladimirsky
Abstract:
Recent clinical trials have shown that adaptive drug therapy can be more efficient than a standard MTD-based policy in the treatment of cancer patients. The adaptive therapy paradigm is not based on a preset schedule; instead, the doses are administered based on the current state of the tumor. But the adaptive treatment policies examined so far have been largely ad hoc. In this paper we propose a method for systematically optimizing the rules of adaptive policies based on an Evolutionary Game Theory model of cancer dynamics. Given a set of treatment objectives, we use the framework of dynamic programming to find the optimal treatment strategies. In particular, we optimize the total drug usage and time to recovery by solving a Hamilton-Jacobi-Bellman equation based on a mathematical model of tumor evolution. We compare the adaptive/optimal treatment strategy with the MTD-based treatment policy. We show that optimal treatment strategies can dramatically decrease the total amount of drugs prescribed as well as increase the fraction of initial tumor states from which recovery is possible. We also examine the optimization trade-offs between the total administered drugs and recovery time. Adaptive therapy combined with optimal control theory is a promising concept in cancer treatment and should be integrated into clinical trial design.
Submitted 10 December, 2018; v1 submitted 4 December, 2018;
originally announced December 2018.
-
Optimal post-selection inference for sparse signals: a nonparametric empirical-Bayes approach
Authors:
Spencer Woody,
Oscar Hernan Madrid Padilla,
James G. Scott
Abstract:
Many recently developed Bayesian methods have focused on sparse signal detection. However, much less work has been done addressing the natural follow-up question: how to make valid inferences for the magnitude of those signals after selection. Ordinary Bayesian credible intervals suffer from selection bias, owing to the fact that the target of inference is chosen adaptively. Existing Bayesian approaches for correcting this bias produce credible intervals with poor frequentist properties, while existing frequentist approaches require sacrificing the benefits of shrinkage typical in Bayesian methods, resulting in confidence intervals that are needlessly wide. We address this gap by proposing a nonparametric empirical-Bayes approach for constructing optimal selection-adjusted confidence sets. Our method produces confidence sets that are as short as possible on average, while both adjusting for selection and maintaining exact frequentist coverage uniformly over the parameter space. Our main theoretical result establishes an important consistency property of our procedure: that under mild conditions, it asymptotically converges to the results of an oracle-Bayes analysis in which the prior distribution of signal sizes is known exactly. Across a series of examples, the method outperforms existing frequentist techniques for post-selection inference, producing confidence sets that are notably shorter but with the same coverage guarantee.
Submitted 13 November, 2020; v1 submitted 25 October, 2018;
originally announced October 2018.
-
Evaluation of Ride-Sourcing Search Frictions and Driver Productivity: A Spatial Denoising Approach
Authors:
Natalia Zuniga-Garcia,
Mauricio Tec,
James G. Scott,
Natalia Ruiz-Juri,
Randy B. Machemehl
Abstract:
This paper considers the problem of measuring spatial and temporal variation in driver productivity on ride-sourcing trips. This variation is especially important from a driver's perspective: if a platform's drivers experience systematic disparities in earnings because of variation in their riders' destinations, they may perceive the pricing model as inequitable. This perception can exacerbate search frictions if it leads drivers to avoid locations where they believe they may be assigned "unlucky" fares. To characterize any such systematic disparities in productivity, we develop an analytic framework with three key components. First, we propose a productivity metric that looks two consecutive trips ahead, thus capturing the effect on expected earnings of market conditions at drivers' drop-off locations. Second, we develop a natural experiment by analyzing trips with a common origin but varying destinations, thus isolating purely spatial effects on productivity. Third, we apply a spatial denoising method that allows us to work with raw spatial information exhibiting high levels of noise and sparsity, without having to aggregate data into large, low-resolution spatial zones. By applying our framework to data on more than 1.4 million rides in Austin, Texas, we find significant spatial variation in ride-sourcing driver productivity and search frictions. Drivers at the same location experienced disparities in productivity after being dispatched on trips with different destinations, with origin-based surge pricing increasing these earnings disparities. Our results show that trip distance is the dominant factor in driver productivity: short trips yielded lower productivity, even when ending in areas with high demand. These findings suggest that new pricing strategies are required to minimize random disparities in driver earnings.
Submitted 11 October, 2019; v1 submitted 26 September, 2018;
originally announced September 2018.
-
BART with Targeted Smoothing: An analysis of patient-specific stillbirth risk
Authors:
Jennifer E. Starling,
Jared S. Murray,
Carlos M. Carvalho,
Radek K. Bukowski,
James G. Scott
Abstract:
This article introduces BART with Targeted Smoothing, or tsBART, a new Bayesian tree-based model for nonparametric regression. The goal of tsBART is to introduce smoothness over a single target covariate t, while not necessarily requiring smoothness over other covariates x. TsBART is based on the Bayesian Additive Regression Trees (BART) model, an ensemble of regression trees. TsBART extends BART by parameterizing each tree's terminal nodes with smooth functions of t, rather than independent scalars. Like BART, tsBART captures complex nonlinear relationships and interactions among the predictors. But unlike BART, tsBART guarantees that the response surface will be smooth in the target covariate. This improves interpretability and helps regularize the estimate.
After introducing and benchmarking the tsBART model, we apply it to our motivating example: pregnancy outcomes data from the National Center for Health Statistics. Our aim is to provide patient-specific estimates of stillbirth risk across gestational age (t), based on maternal and fetal risk factors (x). Obstetricians expect stillbirth risk to vary smoothly over gestational age, but not necessarily over other covariates, and tsBART has been designed precisely to reflect this structural knowledge. The results of our analysis show the clear superiority of the tsBART model for quantifying stillbirth risk, thereby providing patients and doctors with better information for managing the risk of perinatal mortality. All methods described here are implemented in the R package tsbart.
Submitted 3 June, 2019; v1 submitted 19 May, 2018;
originally announced May 2018.
-
Socioeconomic bias in influenza surveillance
Authors:
Samuel V. Scarpino,
James G. Scott,
Rosalind M. Eggo,
Bruce Clements,
Nedialko B. Dimitrov,
Lauren Ancel Meyers
Abstract:
Individuals in low socioeconomic brackets are considered at-risk for developing influenza-related complications and often exhibit higher than average influenza-related hospitalization rates. This disparity has been attributed to various factors, including restricted access to preventative and therapeutic health care, limited sick leave, and household structure. Adequate influenza surveillance in these at-risk populations is a critical precursor to accurate risk assessments and effective intervention. However, the United States of America's primary national influenza surveillance system (ILINet) monitors outpatient healthcare providers, which may be largely inaccessible to lower socioeconomic populations. Recent initiatives to incorporate internet-source and hospital electronic medical records data into surveillance systems seek to improve the timeliness, coverage, and accuracy of outbreak detection and situational awareness. Here, we use a flexible statistical framework for integrating multiple surveillance data sources to evaluate the adequacy of traditional (ILINet) and next generation (BioSense 2.0 and Google Flu Trends) data for situational awareness of influenza across poverty levels. We find that zip codes in the highest poverty quartile are a critical blind-spot for ILINet that the integration of next generation data fails to ameliorate.
Submitted 1 April, 2018;
originally announced April 2018.
-
Interpretable Low-Dimensional Regression via Data-Adaptive Smoothing
Authors:
Wesley Tansey,
Jesse Thomason,
James G. Scott
Abstract:
We consider the problem of estimating a regression function in the common situation where the number of features is small, where interpretability of the model is a high priority, and where simple linear or additive models fail to provide adequate performance. To address this problem, we present Maximum Variance Total Variation denoising (MVTV), an approach that is conceptually related both to CART and to the more recent CRISP algorithm, a state-of-the-art alternative method for interpretable nonlinear regression. MVTV divides the feature space into blocks of constant value and fits the value of all blocks jointly via a convex optimization routine. Our method is fully data-adaptive, in that it incorporates highly robust routines for tuning all hyperparameters automatically. We compare our approach against CART and CRISP via both a complexity-accuracy tradeoff metric and a human study, demonstrating that MVTV is a more powerful and interpretable method.
Submitted 6 August, 2017;
originally announced August 2017.
-
Evolutionary dynamics of incubation periods
Authors:
Bertrand Ottino-Loffler,
Jacob G. Scott,
Steven H. Strogatz
Abstract:
The incubation period of a disease is the time between an initiating pathologic event and the onset of symptoms. For typhoid fever, polio, measles, leukemia and many other diseases, the incubation period is highly variable. Some affected people take much longer than average to show symptoms, leading to a distribution of incubation periods that is right skewed and often approximately lognormal. Although this statistical pattern was discovered more than sixty years ago, it remains an open question to explain its ubiquity. Here we propose an explanation based on evolutionary dynamics on graphs. For simple models of a mutant or pathogen invading a network-structured population of healthy cells, we show that skewed distributions of incubation periods emerge for a wide range of assumptions about invader fitness, competition dynamics, and network structure. The skewness stems from stochastic mechanisms associated with two classic problems in probability theory: the coupon collector and the random walk. Unlike previous explanations that rely crucially on heterogeneity, our results hold even for homogeneous populations. Thus, we predict that two equally healthy individuals subjected to equal doses of equally pathogenic agents may, by chance alone, show remarkably different time courses of disease.
Submitted 30 May, 2017;
originally announced May 2017.
-
GapTV: Accurate and Interpretable Low-Dimensional Regression and Classification
Authors:
Wesley Tansey,
James G. Scott
Abstract:
We consider the problem of estimating a regression function in the common situation where the number of features is small, where interpretability of the model is a high priority, and where simple linear or additive models fail to provide adequate performance. To address this problem, we present GapTV, an approach that is conceptually related both to CART and to the more recent CRISP algorithm, a state-of-the-art alternative method for interpretable nonlinear regression. GapTV divides the feature space into blocks of constant value and fits the value of all blocks jointly via a convex optimization routine. Our method is fully data-adaptive, in that it incorporates highly robust routines for tuning all hyperparameters automatically. We compare our approach against CART and CRISP and demonstrate that GapTV finds a much better trade-off between accuracy and interpretability.
Submitted 23 February, 2017;
originally announced February 2017.
-
Deep Nonparametric Estimation of Discrete Conditional Distributions via Smoothed Dyadic Partitioning
Authors:
Wesley Tansey,
Karl Pichotta,
James G. Scott
Abstract:
We present an approach to deep estimation of discrete conditional probability distributions. Such models have several applications, including generative modeling of audio, image, and video data. Our approach combines two main techniques: dyadic partitioning and graph-based smoothing of the discrete space. By recursively decomposing each dimension into a series of binary splits and smoothing over the resulting distribution using graph-based trend filtering, we impose a strict structure to the model and achieve much higher sample efficiency. We demonstrate the advantages of our model through a series of benchmarks on both synthetic and real-world datasets, in some cases reducing the error by nearly half in comparison to other popular methods in the literature. All of our models are implemented in Tensorflow and publicly available at https://github.com/tansey/sdp .
Submitted 28 February, 2017; v1 submitted 23 February, 2017;
originally announced February 2017.
-
Takeover times for a simple model of network infection
Authors:
Bertrand Ottino-Löffler,
Jacob G. Scott,
Steven H. Strogatz
Abstract:
We study a stochastic model of infection spreading on a network. At each time step a node is chosen at random, along with one of its neighbors. If the node is infected and the neighbor is susceptible, the neighbor becomes infected. How many time steps $T$ does it take to completely infect a network of $N$ nodes, starting from a single infected node? An analogy to the classic "coupon collector" problem of probability theory reveals that the takeover time $T$ is dominated by extremal behavior, either when there are only a few infected nodes near the start of the process or a few susceptible nodes near the end. We show that for $N \gg 1$, the takeover time $T$ is distributed as a Gumbel for the star graph; as the sum of two Gumbels for a complete graph and an Erdős-Rényi random graph; as a normal for a one-dimensional ring and a two-dimensional lattice; and as a family of intermediate skewed distributions for $d$-dimensional lattices with $d \ge 3$ (these distributions approach the sum of two Gumbels as $d$ approaches infinity). Connections to evolutionary dynamics, cancer, incubation periods of infectious diseases, first-passage percolation, and other spreading phenomena in biology and physics are discussed.
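The update rule described in this abstract can be simulated directly. The following is a minimal sketch, not the authors' code: the function names `takeover_time` and `complete_graph`, the graph size of 30 nodes, and the 200 repetitions are all illustrative assumptions.

```python
import random

def takeover_time(neighbors, start=0, rng=None):
    """Count time steps until every node is infected.

    neighbors: dict mapping node -> list of adjacent nodes (connected graph).
    """
    rng = rng or random.Random()
    nodes = list(neighbors)
    infected = {start}
    steps = 0
    while len(infected) < len(nodes):
        steps += 1
        node = rng.choice(nodes)                # pick a node uniformly at random
        neighbor = rng.choice(neighbors[node])  # then one of its neighbors
        # Infection spreads only from an infected node to a susceptible neighbor.
        if node in infected and neighbor not in infected:
            infected.add(neighbor)
    return steps

def complete_graph(n):
    """Hypothetical helper: adjacency lists of the complete graph on n nodes."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}

rng = random.Random(0)  # fixed seed so the run is reproducible
graph = complete_graph(30)
times = [takeover_time(graph, rng=rng) for _ in range(200)]
mean_time = sum(times) / len(times)
print("mean takeover time on a 30-node complete graph:", mean_time)
```

A histogram of `times` gives an empirical view of the takeover-time distribution; swapping `complete_graph` for a ring or star adjacency list allows a rough comparison of the graph families mentioned in the abstract.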
Submitted 2 February, 2017;
originally announced February 2017.
-
Sequential nonparametric tests for a change in distribution: an application to detecting radiological anomalies
Authors:
Oscar Hernan Madrid Padilla,
Alex Athey,
Alex Reinhart,
James G. Scott
Abstract:
We propose a sequential nonparametric test for detecting a change in distribution, based on windowed Kolmogorov-Smirnov statistics. The approach is simple, robust, highly computationally efficient, easy to calibrate, and requires no parametric assumptions about the underlying null and alternative distributions. We show that both the false-alarm rate and the power of our procedure are amenable to rigorous analysis, and that the method outperforms existing sequential testing procedures in practice. We then apply the method to the problem of detecting radiological anomalies, using data collected from measurements of the background gamma-radiation spectrum on a large university campus. In this context, the proposed method leads to substantial improvements in time-to-detection for the kind of radiological anomalies of interest in law-enforcement and border-security applications.
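The general idea of a windowed Kolmogorov-Smirnov monitor can be illustrated with a short sketch. This is an assumption-laden toy, not the procedure from the paper: the function names `ks_statistic` and `first_alarm`, the window length of 50, and the fixed threshold of 0.5 are hypothetical, and the paper's calibration of the false-alarm rate is not reproduced here.

```python
import bisect
import random

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:  # the supremum is attained at one of the data points
        fa = bisect.bisect_right(a, x) / len(a)
        fb = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d

def first_alarm(stream, reference, window, threshold):
    """Return the first time t at which the latest `window` points differ from
    the reference sample by more than `threshold`, or None if no alarm fires."""
    for t in range(window, len(stream) + 1):
        if ks_statistic(reference, stream[t - window:t]) > threshold:
            return t
    return None

rng = random.Random(0)
reference = [rng.gauss(0, 1) for _ in range(200)]   # background sample
stream = [rng.gauss(0, 1) for _ in range(150)]      # no change yet
stream += [rng.gauss(3, 1) for _ in range(150)]     # distribution shifts at t = 150
alarm = first_alarm(stream, reference, window=50, threshold=0.5)
print("alarm raised after", alarm, "observations")
```

In practice the threshold would be chosen to meet a target false-alarm rate rather than fixed by hand as it is in this sketch.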
Submitted 22 December, 2016;
originally announced December 2016.
-
Diet2Vec: Multi-scale analysis of massive dietary data
Authors:
Wesley Tansey,
Edward W. Lowe Jr.,
James G. Scott
Abstract:
Smart phone apps that enable users to easily track their diets have become widespread in the last decade. This has created an opportunity to discover new insights into obesity and weight loss by analyzing the eating habits of the users of such apps. In this paper, we present diet2vec: an approach to modeling latent structure in a massive database of electronic diet journals. Through an iterative contract-and-expand process, our model learns real-valued embeddings of users' diets, as well as embeddings for individual foods and meals. We demonstrate the effectiveness of our approach on a real dataset of 55K users of the popular diet-tracking app LoseIt (http://www.loseit.com/). To the best of our knowledge, this is the largest fine-grained diet tracking study in the history of nutrition and obesity research. Our results suggest that diet2vec finds interpretable results at all levels, discovering intuitive representations of foods, meals, and diets.
Submitted 1 December, 2016;
originally announced December 2016.
-
The DFS Fused Lasso: Linear-Time Denoising over General Graphs
Authors:
Oscar Hernan Madrid Padilla,
James G. Scott,
James Sharpnack,
Ryan J. Tibshirani
Abstract:
The fused lasso, also known as (anisotropic) total variation denoising, is widely used for piecewise constant signal estimation with respect to a given undirected graph. The fused lasso estimate is highly nontrivial to compute when the underlying graph is large and has an arbitrary structure. But for a special graph structure, namely, the chain graph, the fused lasso (or simply, 1d fused lasso) can be computed in linear time. In this paper, we establish a surprising connection between the total variation of a generic signal defined over an arbitrary graph, and the total variation of this signal over a chain graph induced by running depth-first search (DFS) over the nodes of the graph. Specifically, we prove that for any signal, its total variation over the induced chain graph is no more than twice its total variation over the original graph. This connection leads to several interesting theoretical and computational conclusions. Denoting by $m$ and $n$ the number of edges and nodes, respectively, of the graph in question, our result implies that for an underlying signal with total variation $t$ over the graph, the fused lasso achieves a mean squared error rate of $t^{2/3} n^{-2/3}$. Moreover, precisely the same mean squared error rate is achieved by running the 1d fused lasso on the induced chain graph from running DFS. Importantly, the latter estimator is simple and computationally cheap, requiring only $O(m)$ operations for constructing the DFS-induced chain and $O(n)$ operations for computing the 1d fused lasso solution over this chain. Further, for trees that have bounded max degree, the error rate of $t^{2/3} n^{-2/3}$ cannot be improved, in the sense that it is the minimax rate for signals that have total variation $t$ over the tree.
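The factor-of-two bound stated in this abstract can be checked numerically on a small example. The sketch below is illustrative only: the helper names (`dfs_order`, `grid_graph`, `graph_tv`, `chain_tv`), the 8 x 8 grid, and the Gaussian test signal are assumptions, and the code does not compute the fused lasso itself, only the two total variations being compared.

```python
import random

def dfs_order(neighbors, root=0):
    """Return nodes in the order depth-first search first visits them."""
    order, seen, stack = [], set(), [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push unvisited neighbors; the first listed neighbor is explored first.
        stack.extend(v for v in reversed(neighbors[node]) if v not in seen)
    return order

def grid_graph(rows, cols):
    """Hypothetical helper: adjacency lists of a rows x cols lattice."""
    nbrs = {r * cols + c: [] for r in range(rows) for c in range(cols)}
    for r in range(rows):
        for c in range(cols):
            u = r * cols + c
            if c + 1 < cols:
                nbrs[u].append(u + 1)
                nbrs[u + 1].append(u)
            if r + 1 < rows:
                nbrs[u].append(u + cols)
                nbrs[u + cols].append(u)
    return nbrs

def graph_tv(neighbors, signal):
    """Total variation over the graph; each undirected edge is counted once."""
    return sum(abs(signal[u] - signal[v])
               for u in neighbors for v in neighbors[u] if u < v)

def chain_tv(order, signal):
    """Total variation over the chain that links consecutive nodes in `order`."""
    return sum(abs(signal[a] - signal[b]) for a, b in zip(order, order[1:]))

rng = random.Random(1)
graph = grid_graph(8, 8)
theta = [rng.gauss(0, 1) for _ in range(64)]  # arbitrary signal on the nodes
order = dfs_order(graph)
tv_graph = graph_tv(graph, theta)
tv_chain = chain_tv(order, theta)
print("chain TV:", tv_chain, "  2 x graph TV:", 2 * tv_graph)
```

Building the chain costs one pass over the edges, which matches the $O(m)$ construction cost mentioned in the abstract.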
Submitted 1 March, 2017; v1 submitted 11 August, 2016;
originally announced August 2016.
-
Cancer treatment scheduling and dynamic heterogeneity in social dilemmas of tumour acidity and vasculature
Authors:
Artem Kaznatcheev,
Robert Vander Velde,
Jacob G. Scott,
David Basanta
Abstract:
Background: Tumours are diverse ecosystems with persistent heterogeneity in various cancer hallmarks like self-sufficiency of growth factor production for angiogenesis and reprogramming of energy-metabolism for aerobic glycolysis. This heterogeneity has consequences for diagnosis, treatment, and disease progression.
Methods: We introduce the double goods game to study the dynamics of these traits using evolutionary game theory. We model glycolytic acid production as a public good for all tumour cells and oxygen from vascularization via VEGF production as a club good benefiting non-glycolytic tumour cells. This results in three viable phenotypic strategies: glycolytic, angiogenic, and aerobic non-angiogenic.
Results: We classify the dynamics into three qualitatively distinct regimes: (1) fully glycolytic, (2) fully angiogenic, or (3) polyclonal in all three cell types. The third regime allows for dynamic heterogeneity even with linear goods, something that was not possible in prior public good models that considered glycolysis or growth-factor production in isolation.
Conclusion: The cyclic dynamics of the polyclonal regime stress the importance of timing for anti-glycolysis treatments like lonidamine. The existence of qualitatively different dynamic regimes highlights the order effects of treatments. In particular, we consider the potential of vascular renormalization as a neoadjuvant therapy before follow up with interventions like buffer therapy.
Submitted 2 August, 2016;
originally announced August 2016.
-
Better Conditional Density Estimation for Neural Networks
Authors:
Wesley Tansey,
Karl Pichotta,
James G. Scott
Abstract:
The vast majority of the neural network literature focuses on predicting point values for a given set of response variables, conditioned on a feature vector. In many cases we need to model the full joint conditional distribution over the response variables rather than simply making point predictions. In this paper, we present two novel approaches to such conditional density estimation (CDE): Multiscale Nets (MSNs) and CDE Trend Filtering. Multiscale nets transform the CDE regression task into a hierarchical classification task by decomposing the density into a series of half-spaces and learning boolean probabilities of each split. CDE Trend Filtering applies a k-th order graph trend filtering penalty to the unnormalized logits of a multinomial classifier network, with each edge in the graph corresponding to a neighboring point on a discretized version of the density. We compare both methods against plain multinomial classifier networks and mixture density networks (MDNs) on a simulated dataset and three real-world datasets. The results suggest the two methods are complementary: MSNs work well in a high-data-per-feature regime and CDE-TF is well suited for few-samples-per-feature scenarios where overfitting is a primary concern.
Submitted 7 June, 2016;
originally announced June 2016.
-
A deconvolution path for mixtures
Authors:
Oscar Hernan Madrid Padilla,
Nicholas G. Polson,
James G. Scott
Abstract:
We propose a class of estimators for deconvolution in mixture models based on a simple two-step "bin-and-smooth" procedure applied to histogram counts. The method is both statistically and computationally efficient: by exploiting recent advances in convex optimization, we are able to provide a full deconvolution path that shows the estimate for the mixing distribution across a range of plausible degrees of smoothness, at far less cost than a full Bayesian analysis. This enables practitioners to conduct a sensitivity analysis with minimal effort. This is especially important for applied data analysis, given the ill-posed nature of the deconvolution problem. Our results establish the favorable theoretical properties of our estimator and show that it offers state-of-the-art performance when compared to benchmark methods across a range of scenarios.
Submitted 25 May, 2017; v1 submitted 20 November, 2015;
originally announced November 2015.
-
Nonparametric density estimation by histogram trend filtering
Authors:
Oscar Hernan Madrid Padilla,
James G. Scott
Abstract:
We propose a novel approach for density estimation called histogram trend filtering. Our estimator arises from looking at a surrogate Poisson model for counts of observations in a partition of the support of the data. We begin by showing consistency for a variational estimator for this density estimation problem. We then study a discrete estimator that can be efficiently found via convex optimization. We show that the estimator enjoys strong statistical guarantees, yet is much more practical and computationally efficient than other estimators that enjoy similar guarantees. Finally, in our simulation study the proposed method showed smaller averaged mean square error than competing methods. This favorable blend of properties makes histogram trend filtering an ideal candidate for use in routine data-analysis applications that call for a quick, efficient, accurate density estimate.
Submitted 6 February, 2016; v1 submitted 14 September, 2015;
originally announced September 2015.
-
Multiscale spatial density smoothing: an application to large-scale radiological survey and anomaly detection
Authors:
Wesley Tansey,
Alex Athey,
Alex Reinhart,
James G. Scott
Abstract:
We consider the problem of estimating a spatially varying density function, motivated by problems that arise in large-scale radiological survey and anomaly detection. In this context, the density functions to be estimated are the background gamma-ray energy spectra at sites spread across a large geographical area, such as nuclear production and waste-storage sites, military bases, medical facilities, university campuses, or the downtown of a city. Several challenges combine to make this a difficult problem. First, the spectral density at any given spatial location may have both smooth and non-smooth features. Second, the spatial correlation in these density functions is neither stationary nor locally isotropic. Finally, at some spatial locations, there is very little data. We present a method called multiscale spatial density smoothing that successfully addresses these challenges. The method is based on recursive dyadic partition of the sample space, and therefore shares much in common with other multiscale methods, such as wavelets and Pólya-tree priors. We describe an efficient algorithm for finding a maximum a posteriori (MAP) estimate that leverages recent advances in convex optimization for non-smooth functions.
We apply multiscale spatial density smoothing to real data collected on the background gamma-ray spectra at locations across a large university campus. The method exhibits state-of-the-art performance for spatial smoothing in density estimation, and it leads to substantial improvements in power when used in conjunction with existing methods for detecting the kinds of radiological anomalies that may have important consequences for public health and safety.
Submitted 16 September, 2016; v1 submitted 26 July, 2015;
originally announced July 2015.
-
A Fast and Flexible Algorithm for the Graph-Fused Lasso
Authors:
Wesley Tansey,
James G. Scott
Abstract:
We propose a new algorithm for solving the graph-fused lasso (GFL), a method for parameter estimation that operates under the assumption that the signal tends to be locally constant over a predefined graph structure. Our key insight is to decompose the graph into a set of trails which can then each be solved efficiently using techniques for the ordinary (1D) fused lasso. We leverage these trails in a proximal algorithm that alternates between closed form primal updates and fast dual trail updates. The resulting technique is both faster than previous GFL methods and more flexible in the choice of loss function and graph structure. Furthermore, we present two algorithms for constructing trail sets and show empirically that they offer a tradeoff between preprocessing time and convergence rate.
Submitted 1 June, 2015; v1 submitted 24 May, 2015;
originally announced May 2015.
-
Tensor decomposition with generalized lasso penalties
Authors:
Oscar Hernan Madrid Padilla,
James G. Scott
Abstract:
We present an approach for penalized tensor decomposition (PTD) that estimates smoothly varying latent factors in multi-way data. This generalizes existing work on sparse tensor decomposition and penalized matrix decompositions, in a manner parallel to the generalized lasso for regression and smoothing problems. Our approach presents many nontrivial challenges at the intersection of modeling and computation, which are studied in detail. An efficient coordinate-wise optimization algorithm for PTD is presented, and its convergence properties are characterized. The method is applied both to simulated data and real data on flu hospitalizations in Texas. These results show that our penalized tensor decomposition can offer major improvements on existing methods for analyzing multi-way data that exhibit smooth spatial or temporal features.
Submitted 12 May, 2016; v1 submitted 24 February, 2015;
originally announced February 2015.
-
Proximal Algorithms in Statistics and Machine Learning
Authors:
Nicholas G. Polson,
James G. Scott,
Brandon T. Willard
Abstract:
In this paper we develop proximal methods for statistical learning. Proximal point algorithms are useful in statistics and machine learning for obtaining optimization solutions for composite functions. Our approach exploits closed-form solutions of proximal operators and envelope representations based on the Moreau, Forward-Backward, Douglas-Rachford and Half-Quadratic envelopes. Envelope representations lead to novel proximal algorithms for statistical optimisation of composite objective functions which include both non-smooth and non-convex objectives. We illustrate our methodology with regularized Logistic and Poisson regression and non-convex bridge penalties with a fused lasso norm. We provide a discussion of convergence of non-descent algorithms with acceleration and for non-convex functions. Finally, we provide directions for future research.
Submitted 30 May, 2015; v1 submitted 10 February, 2015;
originally announced February 2015.
-
False discovery rate smoothing
Authors:
Wesley Tansey,
Oluwasanmi Koyejo,
Russell A. Poldrack,
James G. Scott
Abstract:
We present false discovery rate smoothing, an empirical-Bayes method for exploiting spatial structure in large multiple-testing problems. FDR smoothing automatically finds spatially localized regions of significant test statistics. It then relaxes the threshold of statistical significance within these regions, and tightens it elsewhere, in a manner that controls the overall false-discovery rate at a given level. This results in increased power and cleaner spatial separation of signals from noise. The approach requires solving a non-standard high-dimensional optimization problem, for which an efficient augmented-Lagrangian algorithm is presented. In simulation studies, FDR smoothing exhibits state-of-the-art performance at modest computational cost. In particular, it is shown to be far more robust than existing methods for spatially dependent multiple testing. We also apply the method to a data set from an fMRI experiment on spatial working memory, where it detects patterns that are much more biologically plausible than those detected by standard FDR-controlling methods. All code for FDR smoothing is publicly available in Python and R.
Submitted 14 November, 2016; v1 submitted 22 November, 2014;
originally announced November 2014.
-
Vertical-likelihood Monte Carlo
Authors:
Nicholas G. Polson,
James G. Scott
Abstract:
In this review, we address the use of Monte Carlo methods for approximating definite integrals of the form $Z = \int L(x) d P(x)$, where $L$ is a target function (often a likelihood) and $P$ a finite measure. We present vertical-likelihood Monte Carlo, which is an approach for designing the importance function $g(x)$ used in importance sampling. Our approach exploits a duality between two random variables: the random draw $X \sim g$, and the corresponding random likelihood ordinate $Y\equiv L(X)$ of the draw. It is natural to specify $g(x)$ and ask: what is the implied distribution of $Y$? In this paper, we take up the opposite question: what should the distribution of $Y$ be so that the implied importance function $g(x)$ is good for approximating $Z$? Our answer turns out to unite seven seemingly disparate classes of algorithms under the vertical-likelihood perspective: importance sampling, slice sampling, simulated annealing/tempering, the harmonic-mean estimator, the vertical-density sampler, nested sampling, and energy-level sampling (a suite of related methods from statistical physics). In particular, we give an alternate presentation of nested sampling, paying special attention to the connection between this method and the vertical-likelihood perspective articulated here. As an alternative to nested sampling, we describe an MCMC method based on re-weighted slice sampling. This method's convergence properties are studied, and two examples demonstrate the promise of the overall approach.
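As background for the role of the importance function $g(x)$, here is a minimal sketch of plain importance sampling for $Z = \int L(x) d P(x)$. It is not the vertical-likelihood method itself. The choices below are assumptions made for illustration: $P$ is a standard normal, $L(x) = e^{-x^2/2}$ (so that $Z = 1/\sqrt{2}$ exactly), $g$ is a normal with standard deviation 0.8, and the names `normal_pdf` and `importance_estimate` are hypothetical.

```python
import math
import random

def normal_pdf(x, sd=1.0):
    """Density of a zero-mean normal with standard deviation sd."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def importance_estimate(L, p_pdf, g_sample, g_pdf, n, rng):
    """Estimate Z = integral of L(x) p(x) dx by averaging L(x) p(x) / g(x), x ~ g."""
    total = 0.0
    for _ in range(n):
        x = g_sample(rng)
        total += L(x) * p_pdf(x) / g_pdf(x)
    return total / n

def L(x):
    # Toy likelihood chosen so that Z has a closed form under a N(0, 1) measure.
    return math.exp(-0.5 * x * x)

rng = random.Random(0)
sd_g = 0.8  # assumed importance function g = N(0, 0.8^2)
z_hat = importance_estimate(
    L,
    normal_pdf,
    lambda r: r.gauss(0, sd_g),
    lambda x: normal_pdf(x, sd_g),
    n=20000,
    rng=rng,
)
z_true = 1 / math.sqrt(2)
print("estimate:", z_hat, "  exact:", z_true)
```

In this toy problem the weights become constant when $g$ is proportional to $L(x)p(x)$, that is, when $g$ is a normal with variance $1/2$, which is why the choice of $g$ controls the variance of the estimate.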
Submitted 23 June, 2015; v1 submitted 11 September, 2014;
originally announced September 2014.
-
Mixtures, envelopes, and hierarchical duality
Authors:
Nicholas G. Polson,
James G. Scott
Abstract:
We develop a connection between mixture and envelope representations of objective functions that arise frequently in statistics. We refer to this connection using the term "hierarchical duality." Our results suggest an interesting and previously under-exploited relationship between marginalization and profiling, or equivalently between the Fenchel-Moreau theorem for convex functions and the Bernstein-Widder theorem for Laplace transforms. We give several different sets of conditions under which such a duality result obtains. We then extend existing work on envelope representations in several ways, including novel generalizations to variance-mean models and to multivariate Gaussian location models. This turns out to provide an elegant missing-data interpretation of the proximal gradient method, a widely used algorithm in machine learning. We show several statistical applications in which the proposed framework leads to easily implemented algorithms, including a robust version of the fused lasso, nonlinear quantile regression via trend filtering, and the binomial fused double Pareto model. Code for the examples is available on GitHub at https://github.com/jgscott/hierduals.
Submitted 22 February, 2015; v1 submitted 1 June, 2014;
originally announced June 2014.
-
Sampling Polya-Gamma random variates: alternate and approximate techniques
Authors:
Jesse Windle,
Nicholas G. Polson,
James G. Scott
Abstract:
Efficiently sampling from the Pólya-Gamma distribution, ${PG}(b,z)$, is an essential element of Pólya-Gamma data augmentation. Polson et al. (2013) show how to efficiently sample from the ${PG}(1,z)$ distribution. We build two new samplers that offer improved performance when sampling from the ${PG}(b,z)$ distribution and $b$ is not unity.
Submitted 2 May, 2014;
originally announced May 2014.
-
Priors for Random Count Matrices Derived from a Family of Negative Binomial Processes
Authors:
Mingyuan Zhou,
Oscar Hernan Madrid Padilla,
James G. Scott
Abstract:
We define a family of probability distributions for random count matrices with a potentially unbounded number of rows and columns. The three distributions we consider are derived from the gamma-Poisson, gamma-negative binomial, and beta-negative binomial processes. Because the models lead to closed-form Gibbs sampling update equations, they are natural candidates for nonparametric Bayesian priors over count matrices. A key aspect of our analysis is the recognition that, although the random count matrices within the family are defined by a row-wise construction, their columns can be shown to be i.i.d. This fact is used to derive explicit formulas for drawing all the columns at once. Moreover, by analyzing these matrices' combinatorial structure, we describe how to sequentially construct a column-i.i.d. random count matrix one row at a time, and derive the predictive distribution of a new row count vector with previously unseen features. We describe the similarities and differences between the three priors, and argue that the greater flexibility of the gamma- and beta- negative binomial processes, especially their ability to model over-dispersed, heavy-tailed count data, makes these well suited to a wide variety of real-world applications. As an example of our framework, we construct a naive-Bayes text classifier to categorize a count vector to one of several existing random count matrices of different categories. The classifier supports an unbounded number of features, and unlike most existing methods, it does not require a predefined finite vocabulary to be shared by all the categories, and needs neither feature selection nor parameter tuning. Both the gamma- and beta- negative binomial processes are shown to significantly outperform the gamma-Poisson process for document categorization, with comparable performance to other state-of-the-art supervised text classification algorithms.
Submitted 13 July, 2015; v1 submitted 12 April, 2014;
originally announced April 2014.
-
A filter-flow perspective of hematogenous metastasis offers a non-genetic paradigm for personalized cancer therapy
Authors:
Jacob G. Scott,
Alexander G. Fletcher,
Philip K. Maini,
Alexander R. A. Anderson,
Philip Gerlee
Abstract:
Research into mechanisms of hematogenous metastasis has largely become genetic in focus, attempting to understand the molecular basis of 'seed-soil' relationships. Preceding this biological mechanism is the physical process of dissemination of circulating tumour cells (CTCs). We utilize a 'filter-flow' paradigm to show that assumptions about CTC dynamics strongly affect metastatic efficiency: without data on CTC dynamics, any attempt to predict metastatic spread in individual patients is impossible.
Submitted 19 September, 2013;
originally announced September 2013.
-
Efficient Data Augmentation in Dynamic Models for Binary and Count Data
Authors:
Jesse Windle,
Carlos M. Carvalho,
James G. Scott,
Liang Sun
Abstract:
Dynamic linear models with Gaussian observations and Gaussian states lead to closed-form formulas for posterior simulation. However, these closed-form formulas break down when the response or state evolution ceases to be Gaussian. Dynamic, generalized linear models exemplify a class of models for which this is the case, and include, amongst other models, dynamic binomial logistic regression and dynamic negative binomial regression. Finding and appraising posterior simulation techniques for these models is important since modeling temporally correlated categories or counts is useful in a variety of disciplines, including ecology, economics, epidemiology, medicine, and neuroscience. In this paper, we present one such technique, Pólya-Gamma data augmentation, and compare it against two competing methods. We find that the Pólya-Gamma approach works well for dynamic logistic regression and for dynamic negative binomial regression when the count sizes are small. Supplementary files are provided for replicating the benchmarks.
Submitted 19 September, 2013; v1 submitted 3 August, 2013;
originally announced August 2013.
-
Edge effects in game theoretic dynamics of spatially structured tumours
Authors:
Artem Kaznatcheev,
Jacob G. Scott,
David Basanta
Abstract:
Background: Analysing tumour architecture for metastatic potential usually focuses on phenotypic differences due to cellular morphology or specific genetic mutations, but often ignores the cell's position within the heterogeneous substructure. Similar disregard for local neighborhood structure is common in mathematical models.
Methods: We view the dynamics of disease progression as an evolutionary game between cellular phenotypes. A typical assumption in this modeling paradigm is that the probability of a given phenotypic strategy interacting with another depends exclusively on the abundance of those strategies, without regard to local heterogeneities. We address this limitation by using the Ohtsuki-Nowak transform to introduce spatial structure to the go vs. grow game.
Results: We show that spatial structure can promote the invasive (go) strategy. By considering the change in neighbourhood size at a static boundary -- such as a blood-vessel, organ capsule, or basement membrane -- we show an edge effect that allows a tumour without invasive phenotypes in the bulk to have a polyclonal boundary with invasive cells. We present an example of this promotion of invasive (EMT positive) cells in a metastatic colony of prostate adenocarcinoma in bone marrow.
Interpretation: Pathologic analyses that do not distinguish between cells in the bulk and cells at a static edge of a tumour can underestimate the number of invasive cells. We expect our approach to extend to other evolutionary game models where interaction neighborhoods change at fixed system boundaries.
Submitted 21 January, 2015; v1 submitted 25 July, 2013;
originally announced July 2013.
-
False discovery rate regression: an application to neural synchrony detection in primary visual cortex
Authors:
James G. Scott,
Ryan C. Kelly,
Matthew A. Smith,
Pengcheng Zhou,
Robert E. Kass
Abstract:
Many approaches for multiple testing begin with the assumption that all tests in a given study should be combined into a global false-discovery-rate analysis. But this may be inappropriate for many of today's large-scale screening problems, where auxiliary information about each test is often available, and where a combined analysis can lead to poorly calibrated error rates within different subsets of the experiment. To address this issue, we introduce an approach called false-discovery-rate regression that directly uses this auxiliary information to inform the outcome of each test. The method can be motivated by a two-groups model in which covariates are allowed to influence the local false discovery rate, or equivalently, the posterior probability that a given observation is a signal. This poses many subtle issues at the interface between inference and computation, and we investigate several variations of the overall approach. Simulation evidence suggests that: (1) when covariate effects are present, FDR regression improves power for a fixed false-discovery rate; and (2) when covariate effects are absent, the method is robust, in the sense that it does not lead to inflated error rates. We apply the method to neural recordings from primary visual cortex. The goal is to detect pairs of neurons that exhibit fine-time-scale interactions, in the sense that they fire together more often than expected due to chance. Our method detects roughly 50% more synchronous pairs versus a standard FDR-controlling analysis. The companion R package FDRreg implements all methods described in the paper.
Submitted 8 June, 2014; v1 submitted 12 July, 2013;
originally announced July 2013.
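The two-groups model mentioned in the abstract above can be written out as a short worked equation. This is the standard two-groups mixture with a covariate-dependent mixing weight; the symbols $z_i$, $x_i$, $c(\cdot)$, $f_0$, $f_1$ and the logistic form of $c$ are assumptions introduced here for illustration and are not taken from the abstract.

$$
f(z_i \mid x_i) = \{1 - c(x_i)\}\, f_0(z_i) + c(x_i)\, f_1(z_i), \qquad c(x_i) = \frac{1}{1 + e^{-x_i^{\top}\beta}},
$$

$$
P(\text{signal} \mid z_i, x_i) = \frac{c(x_i)\, f_1(z_i)}{f(z_i \mid x_i)}, \qquad \mathrm{lfdr}(z_i, x_i) = 1 - P(\text{signal} \mid z_i, x_i) = \frac{\{1 - c(x_i)\}\, f_0(z_i)}{f(z_i \mid x_i)}.
$$

Under this sketch, $f_0$ is the null density of the test statistic, $f_1$ the alternative density, and the covariates enter only through the prior signal probability $c(x_i)$; setting $c$ to a constant recovers the usual global analysis.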
-
Expectation-maximization for logistic regression
Authors:
James G. Scott,
Liang Sun
Abstract:
We present a family of expectation-maximization (EM) algorithms for binary and negative-binomial logistic regression, drawing a sharp connection with the variational-Bayes algorithm of Jaakkola and Jordan (2000). Indeed, our results allow a version of this variational-Bayes approach to be re-interpreted as a true EM algorithm. We study several interesting features of the algorithm, and of this previously unrecognized connection with variational Bayes. We also generalize the approach to sparsity-promoting priors, and to an online method whose convergence properties are easily established. This latter method compares favorably with stochastic-gradient descent in situations with marked collinearity.
Submitted 31 May, 2013;
originally announced June 2013.
-
Mathematical modeling of the metastatic process
Authors:
Jacob G. Scott,
Philip Gerlee,
David Basanta,
Alexander G. Fletcher,
Philip K. Maini,
Alexander RA Anderson
Abstract:
Mathematical modeling in cancer has been growing in popularity and impact since its inception in 1932. The first theoretical mathematical modeling in cancer research was focused on understanding tumor growth laws and has grown to include the competition between healthy and normal tissue, carcinogenesis, therapy and metastasis. It is the latter topic, metastasis, on which we will focus this short review, specifically discussing various computational and mathematical models of different portions of the metastatic process, including: the emergence of the metastatic phenotype, the timing and size distribution of metastases, the factors that influence the dormancy of micrometastases and patterns of spread from a given primary tumor.
Submitted 21 May, 2013; v1 submitted 20 May, 2013;
originally announced May 2013.
-
Nonparametric Bayesian testing for monotonicity
Authors:
James G. Scott,
Thomas S. Shively,
Stephen G. Walker
Abstract:
This paper studies the problem of testing whether a function is monotone from a nonparametric Bayesian perspective. Two new families of tests are constructed. The first uses constrained smoothing splines, together with a hierarchical stochastic-process prior that explicitly controls the prior probability of monotonicity. The second uses regression splines, together with two proposals for the prior over the regression coefficients. The finite-sample performance of the tests is shown via simulation to improve upon existing frequentist and Bayesian methods. The asymptotic properties of the Bayes factor for comparing monotone versus non-monotone regression functions in a Gaussian model are also studied. Our results significantly extend those currently available, which chiefly focus on determining the dimension of a parametric linear model.
Submitted 1 June, 2014; v1 submitted 11 April, 2013;
originally announced April 2013.
-
A Markov chain model of evolution in asexually reproducing populations: insight and analytical tractability in the evolutionary process
Authors:
Daniel Nichol,
Peter Jeavons,
Robert Bonomo,
Philip K. Maini,
Jerome L. Paul,
Robert A. Gatenby,
Alexander R. A. Anderson,
Jacob G. Scott
Abstract:
The evolutionary process has been modelled in many ways using both stochastic and deterministic models. We develop an algebraic model of evolution in a population of asexually reproducing organisms in which we represent a stochastic walk in phenotype space, constrained to the edges of an underlying graph representing the genotype, with a time-homogeneous Markov Chain. We show its equivalence to a more standard, explicit stochastic model and show the algebraic model's superiority in computational efficiency. Because of this increase in efficiency, we offer the ability to simulate the evolution of much larger populations in more realistic genotype spaces. Further, we show how the algebraic properties of the Markov Chain model can give insight into the evolutionary process and allow for analysis using familiar linear algebraic methods.
Submitted 17 January, 2013;
originally announced January 2013.
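To make the idea in the abstract above concrete, here is a minimal sketch of a time-homogeneous Markov chain whose moves are constrained to the edges of a genotype graph. It is an illustration under stated assumptions, not the authors' model: the two-locus genotype space, the `fitness` values, and the rule "move to a fitter neighbour with probability proportional to the fitness gain" are all hypothetical.

```python
from itertools import product


def neighbours(g):
    """Genotypes one mutation away from g (edges of the genotype hypercube)."""
    return [g[:i] + (1 - g[i],) + g[i + 1:] for i in range(len(g))]


def transition_matrix(genotypes, fitness):
    """Row-stochastic matrix P; P[i][j] is the chance of moving from genotype i to j."""
    index = {g: k for k, g in enumerate(genotypes)}
    n = len(genotypes)
    P = [[0.0] * n for _ in range(n)]
    for g in genotypes:
        # Assumed rule: only strictly fitter neighbours are reachable.
        gains = {h: fitness[h] - fitness[g]
                 for h in neighbours(g) if fitness[h] > fitness[g]}
        if not gains:
            P[index[g]][index[g]] = 1.0  # local fitness peak: absorbing state
            continue
        total = sum(gains.values())
        for h, gain in gains.items():
            P[index[g]][index[h]] = gain / total
    return P


def step(dist, P):
    """One step of the chain: row vector dist times matrix P."""
    n = len(dist)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


genotypes = list(product((0, 1), repeat=2))  # (0,0), (0,1), (1,0), (1,1)
fitness = {(0, 0): 1.0, (0, 1): 1.5, (1, 0): 2.0, (1, 1): 3.0}  # hypothetical landscape
P = transition_matrix(genotypes, fitness)

dist = [1.0, 0.0, 0.0, 0.0]  # population starts at genotype (0, 0)
n_steps = 2
for _ in range(n_steps):
    dist = step(dist, P)
```

Because the chain is time-homogeneous, the distribution after any number of steps is the starting vector multiplied by a power of the same matrix `P`, which is what makes ordinary linear-algebraic tools applicable.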
-
Intrinsic cell factors that influence tumourigenicity in cancer stem cells - towards hallmarks of cancer stem cells
Authors:
Jacob G. Scott,
Prakash Chinnaiyan,
Alexander R. A. Anderson,
Anita Hjelmeland,
David Basanta
Abstract:
Since the discovery of a cancer initiating side population in solid tumours, studies focussing on the role of so-called cancer stem cells in cancer initiation and progression have abounded. The biological interrogation of these cells has yielded volumes of information about their behaviour, but there have not, as yet, been many actionable generalised theoretical conclusions. To address this point, we have created a hybrid, discrete/continuous computational cellular automaton model of a generalised stem-cell driven tissue and explored the phenotypic traits inherent in the inciting cell and the resultant tissue growth. We identify the regions in phenotype parameter space where these initiating cells are able to cause a disruption in homeostasis, leading to tissue overgrowth and tumour formation. As our parameters and model are non-specific, they could apply to any tissue cancer stem-cell and do not assume specific genetic mutations. In this way, our model suggests that targeting these phenotypic traits could represent generalizable strategies across cancer types and represents a first attempt to identify the hallmarks of cancer stem cells.
Submitted 20 August, 2013; v1 submitted 16 January, 2013;
originally announced January 2013.
-
A mathematical model of tumor self-seeding reveals secondary metastatic deposits as drivers of primary tumor growth
Authors:
Jacob G Scott,
David Basanta,
Alexander R. A. Anderson,
Philip Gerlee
Abstract:
Two models of circulating tumor cell (CTC) dynamics have been proposed to explain the phenomenon of tumor 'self-seeding', whereby CTCs repopulate the primary tumor and accelerate growth: Primary Seeding, where cells from a primary tumor shed into the vasculature and return back to the primary themselves; and Secondary Seeding, where cells from the primary first metastasize in a secondary tissue and form microscopic secondary deposits, which then shed cells into the vasculature returning to the primary. These two models are difficult to distinguish experimentally, yet the difference between them is of great importance both to our understanding of the metastatic process and to the design of methods of intervention. Therefore we developed a mathematical model to test the relative likelihood of these two phenomena in the subset of tumours whose shed CTCs first encounter the lung capillary bed, and show that Secondary Seeding is several orders of magnitude more likely than Primary Seeding. We suggest how this difference could affect tumour evolution, progression and therapy, and propose several possible methods of experimental validation.
Submitted 25 February, 2013; v1 submitted 23 May, 2012;
originally announced May 2012.
-
Bayesian inference for logistic models using Polya-Gamma latent variables
Authors:
Nicholas G. Polson,
James G. Scott,
Jesse Windle
Abstract:
We propose a new data-augmentation strategy for fully Bayesian inference in models with binomial likelihoods. The approach appeals to a new class of Polya-Gamma distributions, which are constructed in detail. A variety of examples are presented to show the versatility of the method, including logistic regression, negative binomial regression, nonlinear mixed-effects models, and spatial models for count data. In each case, our data-augmentation strategy leads to simple, effective methods for posterior inference that: (1) circumvent the need for analytic approximations, numerical integration, or Metropolis-Hastings; and (2) outperform other known data-augmentation strategies, both in ease of use and in computational efficiency. All methods, including an efficient sampler for the Polya-Gamma distribution, are implemented in the R package BayesLogit.
In the technical supplement appended to the end of the paper, we provide further details regarding the generation of Polya-Gamma random variables; the empirical benchmarks reported in the main manuscript; and the extension of the basic data-augmentation framework to contingency tables and multinomial outcomes.
Submitted 22 July, 2013; v1 submitted 1 May, 2012;
originally announced May 2012.
-
The partition problem: case studies in Bayesian screening for time-varying model structure
Authors:
Zesong Liu,
Jesse Windle,
James G. Scott
Abstract:
This paper presents two case studies of data sets where the main inferential goal is to characterize time-varying patterns in model structure. Both of these examples are seen to be general cases of the so-called "partition problem," where auxiliary information (in this case, time) defines a partition over sample space, and where different models hold for each element of the partition. In the first case study, we identify time-varying graphical structure in the covariance matrix of asset returns from major European equity indices from 2006--2010. This structure has important implications for quantifying the notion of financial contagion, a term often mentioned in the context of the European sovereign debt crisis of this period. In the second case study, we screen a large database of historical corporate performance in order to identify specific firms with impressively good (or bad) streaks of performance.
Submitted 2 November, 2011;
originally announced November 2011.
-
An empirical test for Eurozone contagion using an asset-pricing model with heavy-tailed stochastic volatility
Authors:
Nicholas G. Polson,
James G. Scott
Abstract:
This paper proposes an empirical test of financial contagion in European equity markets during the tumultuous period of 2008-2011. Our analysis shows that traditional GARCH and Gaussian stochastic-volatility models are unable to explain two key stylized features of global markets during presumptive contagion periods: shocks to aggregate market volatility can be sudden and explosive, and they are associated with specific directional biases in the cross-section of country-level returns. Our model repairs this deficit by assuming that the random shocks to volatility are heavy-tailed and correlated cross-sectionally, both with each other and with returns. The fundamental conclusion of our analysis is that great care is needed in modeling volatility if one wishes to characterize the relationship between volatility and contagion that is predicted by economic theory.
In analyzing daily data, we find evidence for significant contagion effects during the major EU crisis periods of May 2010 and August 2011, where contagion is defined as excess correlation in the residuals from a factor model incorporating global and regional market risk factors. Some of this excess correlation can be explained by quantifying the impact of shocks to aggregate volatility in the cross-section of expected returns - but only, it turns out, if one is extremely careful in accounting for the explosive nature of these shocks. We show that global markets have time-varying cross-sectional sensitivities to these shocks, and that high sensitivities strongly predict periods of financial crisis. Moreover, the pattern of temporal changes in correlation structure between volatility and returns is readily interpretable in terms of the major events of the periods in question.
Submitted 26 March, 2012; v1 submitted 26 October, 2011;
originally announced October 2011.
-
Default Bayesian analysis for multi-way tables: a data-augmentation approach
Authors:
Nicholas G. Polson,
James G. Scott
Abstract:
This paper proposes a strategy for regularized estimation in multi-way contingency tables, which are common in meta-analyses and multi-center clinical trials. Our approach is based on data augmentation, and appeals heavily to a novel class of Polya-Gamma distributions. Our main contributions are to build up the relevant distributional theory and to demonstrate three useful features of this data-augmentation scheme. First, it leads to simple EM and Gibbs-sampling algorithms for posterior inference, circumventing the need for analytic approximations, numerical integration, Metropolis--Hastings, or variational methods. Second, it allows modelers much more flexibility when choosing priors, which have traditionally come from the Dirichlet or logistic-normal family. For example, our approach allows users to incorporate Bayesian analogues of classical penalized-likelihood techniques (e.g. the lasso or bridge) in computing regularized estimates for log-odds ratios. Finally, our data-augmentation scheme naturally suggests a default strategy for prior selection based on the logistic-Z model, which is strongly related to Jeffreys' prior for a binomial proportion. To illustrate the method we focus primarily on the particular case of a meta-analysis/multi-center study (or a JxKxN table). But the general approach encompasses many other common situations, of which we will provide examples.
△ Less
Submitted 19 September, 2011;
originally announced September 2011.