-
The appeal of the gamma family distribution to protect the confidentiality of contingency tables
Authors:
James Jackson,
Robin Mitra,
Brian Francis,
Iain Dove
Abstract:
Administrative databases, such as the English School Census (ESC), are rich sources of information that are potentially useful for researchers. For such data sources to be made available, however, strict guarantees of privacy would be required. To achieve this, synthetic data methods can be used. Such methods, when protecting the confidentiality of tabular data (contingency tables), often utilise…
▽ More
Administrative databases, such as the English School Census (ESC), are rich sources of information that are potentially useful for researchers. For such data sources to be made available, however, strict guarantees of privacy would be required. To achieve this, synthetic data methods can be used. Such methods, when protecting the confidentiality of tabular data (contingency tables), often utilise the Poisson or Poisson-mixture distributions, such as the negative binomial (NBI). These distributions, however, are either equidispersed (in the case of the Poisson) or overdispersed (e.g. in the case of the NBI), which results in excessive noise being applied to large low-risk counts. This paper proposes the use of the (discretized) gamma family (GAF) distribution, which allows noise to be applied in a more bespoke fashion. Specifically, it allows less noise to be applied as cell counts become larger, providing an optimal balance in relation to the risk-utility trade-off. We illustrate the suitability of the GAF distribution on an administrative-type data set that is reminiscent of the ESC.
△ Less
Submitted 5 August, 2024;
originally announced August 2024.
-
Bernoulli amputation
Authors:
Marius Hofert,
James Jackson,
Niels Hagenbuch
Abstract:
An approach to amputation, the process of introducing missing values to a complete dataset, is presented. It allows to construct missingness indicators in a flexible and principled way via copulas and Bernoulli margins and to incorporate dependence in missingness patterns. Besides more classical missingness models such as missing completely at random, missing at random, and missing not at random,…
▽ More
An approach to amputation, the process of introducing missing values to a complete dataset, is presented. It allows to construct missingness indicators in a flexible and principled way via copulas and Bernoulli margins and to incorporate dependence in missingness patterns. Besides more classical missingness models such as missing completely at random, missing at random, and missing not at random, the approach is able to model structured missingness such as block missingness and, via mixtures, monotone missingness, which are patterns of missing data frequently found in real-life datasets. Properties such as joint missingness probabilities or missingness correlation are derived mathematically. The approach is demonstrated with mathematical examples and empirical illustrations in terms of a well-known dataset.
△ Less
Submitted 26 July, 2024;
originally announced July 2024.
-
Idiographic Personality Gaussian Process for Psychological Assessment
Authors:
Yehu Chen,
Muchen Xi,
Jacob Montgomery,
Joshua Jackson,
Roman Garnett
Abstract:
We develop a novel measurement framework based on a Gaussian process coregionalization model to address a long-lasting debate in psychometrics: whether psychological features like personality share a common structure across the population, vary uniquely for individuals, or some combination. We propose the idiographic personality Gaussian process (IPGP) framework, an intermediate model that accommo…
▽ More
We develop a novel measurement framework based on a Gaussian process coregionalization model to address a long-lasting debate in psychometrics: whether psychological features like personality share a common structure across the population, vary uniquely for individuals, or some combination. We propose the idiographic personality Gaussian process (IPGP) framework, an intermediate model that accommodates both shared trait structure across a population and "idiographic" deviations for individuals. IPGP leverages the Gaussian process coregionalization model to handle the grouped nature of battery responses, but adjusted to non-Gaussian ordinal data. We further exploit stochastic variational inference for efficient latent factor estimation required for idiographic modeling at scale. Using synthetic and real data, we show that IPGP improves both prediction of actual responses and estimation of individualized factor structures relative to existing benchmarks. In a third study, we show that IPGP also identifies unique clusters of personality taxonomies in real-world data, displaying great potential in advancing individualized approaches to psychological diagnosis and treatment.
△ Less
Submitted 6 July, 2024;
originally announced July 2024.
-
Obtaining $(ε,δ)$-differential privacy guarantees when using a Poisson mechanism to synthesize contingency tables
Authors:
James Jackson,
Robin Mitra,
Brian Francis,
Iain Dove
Abstract:
We show that differential privacy type guarantees can be obtained when using a Poisson synthesis mechanism to protect counts in contingency tables. Specifically, we show how to obtain $(ε, δ)$-probabilistic differential privacy guarantees via the Poisson distribution's cumulative distribution function. We demonstrate this empirically with the synthesis of an administrative-type confidential databa…
▽ More
We show that differential privacy type guarantees can be obtained when using a Poisson synthesis mechanism to protect counts in contingency tables. Specifically, we show how to obtain $(ε, δ)$-probabilistic differential privacy guarantees via the Poisson distribution's cumulative distribution function. We demonstrate this empirically with the synthesis of an administrative-type confidential database.
△ Less
Submitted 29 June, 2024;
originally announced July 2024.
-
A Complete Characterisation of Structured Missingness
Authors:
James Jackson,
Robin Mitra,
Niels Hagenbuch,
Sarah McGough,
Chris Harbron
Abstract:
Our capacity to process large complex data sources is ever-increasing, providing us with new, important applied research questions to address, such as how to handle missing values in large-scale databases. Mitra et al. (2023) noted the phenomenon of Structured Missingness (SM), which is where missingness has an underlying structure. Existing taxonomies for defining missingness mechanisms typically…
▽ More
Our capacity to process large complex data sources is ever-increasing, providing us with new, important applied research questions to address, such as how to handle missing values in large-scale databases. Mitra et al. (2023) noted the phenomenon of Structured Missingness (SM), which is where missingness has an underlying structure. Existing taxonomies for defining missingness mechanisms typically assume that variables' missingness indicator vectors $M_1$, $M_2$, ..., $M_p$ are independent after conditioning on the relevant portion of the data matrix $\mathbf{X}$. As this is often unsuitable for characterising SM in multivariate settings, we introduce a taxonomy for SM, where each ${M}_j$ can depend on $\mathbf{M}_{-j}$ (i.e., all missingness indicator vectors except ${M}_j$), in addition to $\mathbf{X}$. We embed this new framework within the well-established decomposition of mechanisms into MCAR, MAR, and MNAR (Rubin, 1976), allowing us to recast mechanisms into a broader setting, where we can consider the combined effect of $\mathbf{X}$ and $\mathbf{M}_{-j}$ on ${M}_j$. We also demonstrate, via simulations, the impact of SM on inference and prediction, and consider contextual instances of SM arising in a de-identified nationwide (US-based) clinico-genomic database (CGDB). We hope to stimulate interest in SM, and encourage timely research into this phenomenon.
△ Less
Submitted 5 July, 2023;
originally announced July 2023.
-
The Target Study: A Conceptual Model and Framework for Measuring Disparity
Authors:
John W. Jackson,
Yea-Jen Hsu,
Raquel C. Greer,
Romsai T. Boonyasai,
Chanelle J. Howe
Abstract:
We present a conceptual model to measure disparity--the target study--where social groups may be similarly situated (i.e., balanced) on allowable covariates. Our model, based on a sampling design, does not intervene to assign social group membership or alter allowable covariates. To address non-random sample selection, we extend our model to generalize or transport disparity or to assess disparity…
▽ More
We present a conceptual model to measure disparity--the target study--where social groups may be similarly situated (i.e., balanced) on allowable covariates. Our model, based on a sampling design, does not intervene to assign social group membership or alter allowable covariates. To address non-random sample selection, we extend our model to generalize or transport disparity or to assess disparity after an intervention on eligibility-related variables that eliminates forms of collider-stratification. To avoid bias from differential timing of enrollment, we aggregate time-specific study results by balancing calendar time of enrollment across social groups. To provide a framework for emulating our model, we discuss study designs, data structures, and G-computation and weighting estimators. We compare our sampling-based model to prominent decomposition-based models used in healthcare and algorithmic fairness. We provide R code for all estimators and apply our methods to measure health system disparities in hypertension control using electronic medical records.
△ Less
Submitted 11 January, 2025; v1 submitted 1 July, 2022;
originally announced July 2022.
-
On integrating the number of synthetic data sets $m$ into the 'a priori' synthesis approach
Authors:
James Edward Jackson,
Robin Mitra,
Brian Joseph Francis,
Iain Dove
Abstract:
Until recently, multiple synthetic data sets were always released to analysts, to allow valid inferences to be obtained. However, under certain conditions - including when saturated count models are used to synthesize categorical data - single imputation ($m=1$) is sufficient. Nevertheless, increasing $m$ causes utility to improve, but at the expense of higher risk, an example of the risk-utility…
▽ More
Until recently, multiple synthetic data sets were always released to analysts, to allow valid inferences to be obtained. However, under certain conditions - including when saturated count models are used to synthesize categorical data - single imputation ($m=1$) is sufficient. Nevertheless, increasing $m$ causes utility to improve, but at the expense of higher risk, an example of the risk-utility trade-off. The question, therefore, is: which value of $m$ is optimal with respect to the risk-utility trade-off? Moreover, the paper considers two ways of analysing categorical data sets: as they have a contingency table representation, multiple categorical data sets can be averaged before being analysed, as opposed to the usual way of averaging post-analysis. This paper also introduces a pair of metrics, $τ_3(k,d)$ and $τ_4(k,d)$, that are suited for assessing disclosure risk in multiple categorical synthetic data sets. Finally, the synthesis methods are demonstrated empirically.
△ Less
Submitted 12 May, 2022;
originally announced May 2022.
-
Using saturated count models for user-friendly synthesis of categorical data
Authors:
James Edward Jackson,
Robin Mitra,
Brian Joseph Francis,
Iain Dove
Abstract:
Over the past three decades, synthetic data methods for statistical disclosure control have continually evolved, but mainly within the domain of survey data sets. There are certain characteristics of administrative databases, such as their size, which present challenges from a synthesis perspective and require special attention. This paper, through the fitting of saturated count models, presents a…
▽ More
Over the past three decades, synthetic data methods for statistical disclosure control have continually evolved, but mainly within the domain of survey data sets. There are certain characteristics of administrative databases, such as their size, which present challenges from a synthesis perspective and require special attention. This paper, through the fitting of saturated count models, presents a synthesis method that is suitable for administrative databases that is tuned by two parameters. The method allows large categorical data sets to be synthesized quickly and allows risk and utility metrics to be satisfied a priori, that is, prior to synthetic data generation. The paper explores how the flexibility afforded by two-parameter count models (the negative binomial and Poisson-inverse Gaussian) can be utilised to protect respondents' - especially uniques' - privacy in synthetic data. Finally, an empirical example is carried out through the synthesis of a database which can be viewed as a good substitute to the English School Census.
△ Less
Submitted 12 May, 2022; v1 submitted 16 July, 2021;
originally announced July 2021.
-
Increasing the efficiency of Sequential Monte Carlo samplers through the use of approximately optimal L-kernels
Authors:
Peter L Green,
Robert E Moore,
Ryan J Jackson,
Jinglai Li,
Simon Maskell
Abstract:
By facilitating the generation of samples from arbitrary probability distributions, Markov Chain Monte Carlo (MCMC) is, arguably, \emph{the} tool for the evaluation of Bayesian inference problems that yield non-standard posterior distributions. In recent years, however, it has become apparent that Sequential Monte Carlo (SMC) samplers have the potential to outperform MCMC in a number of ways. SMC…
▽ More
By facilitating the generation of samples from arbitrary probability distributions, Markov Chain Monte Carlo (MCMC) is, arguably, \emph{the} tool for the evaluation of Bayesian inference problems that yield non-standard posterior distributions. In recent years, however, it has become apparent that Sequential Monte Carlo (SMC) samplers have the potential to outperform MCMC in a number of ways. SMC samplers are better suited to highly parallel computing architectures and also feature various tuning parameters that are not available to MCMC. One such parameter - the `L-kernel' - is a user-defined probability distribution that can be used to influence the efficiency of the sampler. In the current paper, the authors explain how to derive an expression for the L-kernel that minimises the variance of the estimates realised by an SMC sampler. Various approximation methods are then proposed to aid implementation of the proposed L-kernel. The improved performance of the resulting algorithm is demonstrated in multiple scenarios. For the examples shown in the current paper, the use of an approximately optimum L-kernel has reduced the variance of the SMC estimates by up to 99 % while also reducing the number of times that resampling was required by between 65 % and 70 %. Python code and code tests accompanying this manuscript are available through the Github repository \url{https://github.com/plgreenLIRU/SMC_approx_optL}.
△ Less
Submitted 24 April, 2020;
originally announced April 2020.
-
Data Vision: Learning to See Through Algorithmic Abstraction
Authors:
Samir Passi,
Steven J. Jackson
Abstract:
Learning to see through data is central to contemporary forms of algorithmic knowledge production. While often represented as a mechanical application of rules, making algorithms work with data requires a great deal of situated work. This paper examines how the often-divergent demands of mechanization and discretion manifest in data analytic learning environments. Drawing on research in CSCW and t…
▽ More
Learning to see through data is central to contemporary forms of algorithmic knowledge production. While often represented as a mechanical application of rules, making algorithms work with data requires a great deal of situated work. This paper examines how the often-divergent demands of mechanization and discretion manifest in data analytic learning environments. Drawing on research in CSCW and the social sciences, and ethnographic fieldwork in two data learning environments, we show how an algorithm's application is seen sometimes as a mechanical sequence of rules and at other times as an array of situated decisions. Casting data analytics as a rule-based (rather than rule-bound) practice, we show that effective data vision requires would-be analysts to straddle the competing demands of formal abstraction and empirical contingency. We conclude by discussing how the notion of data vision can help better leverage the role of human work in data analytic learning, research, and practice.
△ Less
Submitted 9 February, 2020;
originally announced February 2020.
-
Meaningful causal decompositions in health equity research: definition, identification, and estimation through a weighting framework
Authors:
John W. Jackson
Abstract:
Causal decomposition analyses can help build the evidence base for interventions that address health disparities (inequities). They ask how disparities in outcomes may change under hypothetical intervention. Through study design and assumptions, they can rule out alternate explanations such as confounding, selection-bias, and measurement error, thereby identifying potential targets for interventio…
▽ More
Causal decomposition analyses can help build the evidence base for interventions that address health disparities (inequities). They ask how disparities in outcomes may change under hypothetical intervention. Through study design and assumptions, they can rule out alternate explanations such as confounding, selection-bias, and measurement error, thereby identifying potential targets for intervention. Unfortunately, the literature on causal decomposition analysis and related methods have largely ignored equity concerns that actual interventionists would respect, limiting their relevance and practical value. This paper addresses these concerns by explicitly considering what covariates the outcome disparity and hypothetical intervention adjust for (so-called allowable covariates) and the equity value judgements these choices convey, drawing from the bioethics, biostatistics, epidemiology, and health services research literatures. From this discussion, we generalize decomposition estimands and formulae to incorporate allowable covariate sets, to reflect equity choices, while still allowing for adjustment of non-allowable covariates needed to satisfy causal assumptions. For these general formulae, we provide weighting-based estimators based on adaptations of ratio-of-mediator-probability and inverse-odds-ratio weighting. We discuss when these estimators reduce to already used estimators under certain equity value judgements, and a novel adaptation under other judgements.
△ Less
Submitted 15 September, 2020; v1 submitted 22 September, 2019;
originally announced September 2019.
-
Single-Channel Signal Separation and Deconvolution with Generative Adversarial Networks
Authors:
Qiuqiang Kong,
Yong Xu,
Wenwu Wang,
Philip J. B. Jackson,
Mark D. Plumbley
Abstract:
Single-channel signal separation and deconvolution aims to separate and deconvolve individual sources from a single-channel mixture and is a challenging problem in which no prior knowledge of the mixing filters is available. Both individual sources and mixing filters need to be estimated. In addition, a mixture may contain non-stationary noise which is unseen in the training set. We propose a synt…
▽ More
Single-channel signal separation and deconvolution aims to separate and deconvolve individual sources from a single-channel mixture and is a challenging problem in which no prior knowledge of the mixing filters is available. Both individual sources and mixing filters need to be estimated. In addition, a mixture may contain non-stationary noise which is unseen in the training set. We propose a synthesizing-decomposition (S-D) approach to solve the single-channel separation and deconvolution problem. In synthesizing, a generative model for sources is built using a generative adversarial network (GAN). In decomposition, both mixing filters and sources are optimized to minimize the reconstruction error of the mixture. The proposed S-D approach achieves a peak-to-noise-ratio (PSNR) of 18.9 dB and 15.4 dB in image inpainting and completion, outperforming a baseline convolutional neural network PSNR of 15.3 dB and 12.2 dB, respectively and achieves a PSNR of 13.2 dB in source separation together with deconvolution, outperforming a convolutive non-negative matrix factorization (NMF) baseline of 10.1 dB.
△ Less
Submitted 14 June, 2019;
originally announced June 2019.
-
An Empirical Study on Hyperparameters and their Interdependence for RL Generalization
Authors:
Xingyou Song,
Yilun Du,
Jacob Jackson
Abstract:
Recent results in Reinforcement Learning (RL) have shown that agents with limited training environments are susceptible to a large amount of overfitting across many domains. A key challenge for RL generalization is to quantitatively explain the effects of changing parameters on testing performance. Such parameters include architecture, regularization, and RL-dependent variables such as discount fa…
▽ More
Recent results in Reinforcement Learning (RL) have shown that agents with limited training environments are susceptible to a large amount of overfitting across many domains. A key challenge for RL generalization is to quantitatively explain the effects of changing parameters on testing performance. Such parameters include architecture, regularization, and RL-dependent variables such as discount factor and action stochasticity. We provide empirical results that show complex and interdependent relationships between hyperparameters and generalization. We further show that several empirical metrics such as gradient cosine similarity and trajectory-dependent metrics serve to provide intuition towards these results.
△ Less
Submitted 2 June, 2019;
originally announced June 2019.
-
Semi-Supervised Learning by Label Gradient Alignment
Authors:
Jacob Jackson,
John Schulman
Abstract:
We present label gradient alignment, a novel algorithm for semi-supervised learning which imputes labels for the unlabeled data and trains on the imputed labels. We define a semantically meaningful distance metric on the input space by mapping a point (x, y) to the gradient of the model at (x, y). We then formulate an optimization problem whose objective is to minimize the distance between the lab…
▽ More
We present label gradient alignment, a novel algorithm for semi-supervised learning which imputes labels for the unlabeled data and trains on the imputed labels. We define a semantically meaningful distance metric on the input space by mapping a point (x, y) to the gradient of the model at (x, y). We then formulate an optimization problem whose objective is to minimize the distance between the labeled and the unlabeled data in this space, and we solve it by gradient descent on the imputed labels. We evaluate label gradient alignment using the standardized architecture introduced by Oliver et al. (2018) and demonstrate state-of-the-art accuracy in semi-supervised CIFAR-10 classification.
△ Less
Submitted 6 February, 2019;
originally announced February 2019.
-
Rapid Prototyping Model for Healthcare Alternative Payment Models: Replicating the Federally Qualified Health Center Advanced Primary Care Practice Demonstration
Authors:
Jarrod Olson,
Amir Rahimi,
Po Hsu Allen Chen,
J. Elizabeth Jackson,
Tyler Coy,
Adrienne Cocci,
Nancy McMillan,
Jeff Geppert
Abstract:
Innovation in healthcare payment and service delivery utilizes high cost, high risk pilots paired with traditional program evaluations. Decision-makers are unable to reliably forecast the impacts of pilot interventions in this complex system, complicating the feasibility assessment of proposed healthcare models. We developed and validated a Discrete Event Simulation (DES) model of primary care for…
▽ More
Innovation in healthcare payment and service delivery utilizes high cost, high risk pilots paired with traditional program evaluations. Decision-makers are unable to reliably forecast the impacts of pilot interventions in this complex system, complicating the feasibility assessment of proposed healthcare models. We developed and validated a Discrete Event Simulation (DES) model of primary care for patients with Diabetes to allow rapid prototyping and assessment of models before pilot implementation. We replicated four outcomes from the Centers for Medicare and Medicaid Services Federally Qualified Health Center Advanced Primary Care Practice pilot. The DES model simulates a synthetic population's healthcare experience, including symptom onset, appointment scheduling, screening, and treatment, as well as the impact of physician training. A network of detailed event modules was developed from peer-reviewed literature. Synthetic patients' attributes modify the probability distributions for event outputs and direct them through an episode of care; attributes are in turn modified by patients' experiences. Our model replicates the direction of the effect of physician training on the selected outcomes, and the strength of the effect increases with the number of trainings. The simulated effect strength replicates the pilot results for eye exams and nephropathy screening, but over-estimates results for HbA1c and LDL screening. Our model will improve decision-makers' abilities to assess the feasibility of pilot success, with reproducible, literature-based systems models. Our model identifies intervention and healthcare system components to which outcomes are sensitive, so these aspects can be monitored and controlled during pilot implementation. More work is needed to improve replication of HbA1c and LDL screening, and to elaborate sub-models related to intervention components.
△ Less
Submitted 10 December, 2018;
originally announced December 2018.
-
Decomposition analysis to identify intervention targets for reducing disparities
Authors:
John W. Jackson,
Tyler J. VanderWeele
Abstract:
There has been considerable interest in using decomposition methods in epidemiology (mediation analysis) and economics (Oaxaca-Blinder decomposition) to understand how health disparities arise and how they might change upon intervention. It has not been clear when estimates from the Oaxaca-Blinder decomposition can be interpreted causally because its implementation does not explicitly address pote…
▽ More
There has been considerable interest in using decomposition methods in epidemiology (mediation analysis) and economics (Oaxaca-Blinder decomposition) to understand how health disparities arise and how they might change upon intervention. It has not been clear when estimates from the Oaxaca-Blinder decomposition can be interpreted causally because its implementation does not explicitly address potential confounding of target variables. While mediation analysis does explicitly adjust for confounders of target variables, it does so in a way that entails equalizing confounders across racial groups, which may not reflect the intended intervention. Revisiting prior analyses in the National Longitudinal Survey of Youth on disparities in wages, unemployment, incarceration, and overall health with test scores, taken as a proxy for educational attainment, as a target intervention, we propose and demonstrate a novel decomposition that controls for confounders of test scores (measures of childhood SES) while leaving their association with race intact. We compare this decomposition with others that use standardization (to equalize childhood SES alone), mediation analysis (to equalize test scores within levels of childhood SES), and one that equalizes both childhood SES and test scores. We also show how these decompositions, including our novel proposals, are equivalent to causal implementations of the Oaxaca-Blinder decomposition.
△ Less
Submitted 17 March, 2017;
originally announced March 2017.
-
A two-state mixed hidden Markov model for risky teenage driving behavior
Authors:
John C. Jackson,
Paul S. Albert,
Zhiwei Zhang
Abstract:
This paper proposes a joint model for longitudinal binary and count outcomes. We apply the model to a unique longitudinal study of teen driving where risky driving behavior and the occurrence of crashes or near crashes are measured prospectively over the first 18 months of licensure. Of scientific interest is relating the two processes and predicting crash and near crash outcomes. We propose a two…
▽ More
This paper proposes a joint model for longitudinal binary and count outcomes. We apply the model to a unique longitudinal study of teen driving where risky driving behavior and the occurrence of crashes or near crashes are measured prospectively over the first 18 months of licensure. Of scientific interest is relating the two processes and predicting crash and near crash outcomes. We propose a two-state mixed hidden Markov model whereby the hidden state characterizes the mean for the joint longitudinal crash/near crash outcomes and elevated g-force events which are a proxy for risky driving. Heterogeneity is introduced in both the conditional model for the count outcomes and the hidden process using a shared random effect. An estimation procedure is presented using the forward-backward algorithm along with adaptive Gaussian quadrature to perform numerical integration. The estimation procedure readily yields hidden state probabilities as well as providing for a broad class of predictors.
△ Less
Submitted 16 September, 2015;
originally announced September 2015.