-
Interpreting Network Differential Privacy
Authors:
Jonathan Hehir,
Xiaoyue Niu,
Aleksandra Slavkovic
Abstract:
How do we interpret the differential privacy (DP) guarantee for network data? We take a deep dive into a popular form of network DP ($\varepsilon$--edge DP) to find that many of its common interpretations are flawed. Drawing on prior work for privacy with correlated data, we interpret DP through the lens of adversarial hypothesis testing and demonstrate a gap between the pairs of hypotheses actual…
▽ More
How do we interpret the differential privacy (DP) guarantee for network data? We take a deep dive into a popular form of network DP ($\varepsilon$--edge DP) to find that many of its common interpretations are flawed. Drawing on prior work for privacy with correlated data, we interpret DP through the lens of adversarial hypothesis testing and demonstrate a gap between the pairs of hypotheses actually protected under DP (tests of complete networks) and the sorts of hypotheses implied to be protected by common claims (tests of individual edges). We demonstrate some conditions under which this gap can be bridged, while leaving some questions open. While some discussion is specific to edge DP, we offer selected results in terms of abstract DP definitions and provide discussion of the implications for other forms of network DP.
△ Less
Submitted 16 April, 2025;
originally announced April 2025.
-
Gaussian Differentially Private Human Faces Under a Face Radial Curve Representation
Authors:
Carlos Soto,
Matthew Reimherr,
Aleksandra Slavkovic,
Mark Shriver
Abstract:
In this paper we consider the problem of releasing a Gaussian Differentially Private (GDP) 3D human face. The human face is a complex structure with many features and inherently tied to one's identity. Protecting this data, in a formally private way, is important yet challenging given the dimensionality of the problem. We extend approximate DP techniques for functional data to the GDP framework. W…
▽ More
In this paper we consider the problem of releasing a Gaussian Differentially Private (GDP) 3D human face. The human face is a complex structure with many features and inherently tied to one's identity. Protecting this data, in a formally private way, is important yet challenging given the dimensionality of the problem. We extend approximate DP techniques for functional data to the GDP framework. We further propose a novel representation, face radial curves, of a 3D face as a set of functions and then utilize our proposed GDP functional data mechanism. To preserve the shape of the face while injecting noise we rely on tools from shape analysis for our novel representation of the face. We show that our method preserves the shape of the average face and injects less noise than traditional methods for the same privacy budget. Our mechanism consists of two primary components, the first is generally applicable to function value summaries (as are commonly found in nonparametric statistics or functional data analysis) while the second is general to disk-like surfaces and hence more applicable than just to human faces.
△ Less
Submitted 15 April, 2025; v1 submitted 10 September, 2024;
originally announced September 2024.
-
Assessing Utility of Differential Privacy for RCTs
Authors:
Soumya Mukherjee,
Aratrika Mustafi,
Aleksandra Slavković,
Lars Vilhuber
Abstract:
Randomized control trials, RCTs, have become a powerful tool for assessing the impact of interventions and policies in many contexts. They are considered the gold-standard for inference in the biomedical fields and in many social sciences. Researchers have published an increasing number of studies that rely on RCTs for at least part of the inference, and these studies typically include the respons…
▽ More
Randomized control trials, RCTs, have become a powerful tool for assessing the impact of interventions and policies in many contexts. They are considered the gold-standard for inference in the biomedical fields and in many social sciences. Researchers have published an increasing number of studies that rely on RCTs for at least part of the inference, and these studies typically include the response data collected, de-identified and sometimes protected through traditional disclosure limitation methods. In this paper, we empirically assess the impact of strong privacy-preservation methodology (with \ac{DP} guarantees), on published analyses from RCTs, leveraging the availability of replication packages (research compendia) in economics and policy analysis. We provide simulations studies and demonstrate how we can replicate the analysis in a published economics article on privacy-protected data under various parametrizations. We find that relatively straightforward DP-based methods allow for inference-valid protection of the published data, though computational issues may limit more complex analyses from using these methods. The results have applicability to researchers wishing to share RCT data, especially in the context of low- and middle-income countries, with strong privacy protection.
△ Less
Submitted 25 September, 2023;
originally announced September 2023.
-
Differentially Private Synthetic Heavy-tailed Data
Authors:
Tran Tran,
Matthew Reimherr,
Aleksandra Slavković
Abstract:
The U.S. Census Longitudinal Business Database (LBD) product contains employment and payroll information of all U.S. establishments and firms dating back to 1976 and is an invaluable resource for economic research. However, the sensitive information in LBD requires confidentiality measures that the U.S. Census in part addressed by releasing a synthetic version (SynLBD) of the data to protect firms…
▽ More
The U.S. Census Longitudinal Business Database (LBD) product contains employment and payroll information of all U.S. establishments and firms dating back to 1976 and is an invaluable resource for economic research. However, the sensitive information in LBD requires confidentiality measures that the U.S. Census in part addressed by releasing a synthetic version (SynLBD) of the data to protect firms' privacy while ensuring its usability for research activities, but without provable privacy guarantees. In this paper, we propose using the framework of differential privacy (DP) that offers strong provable privacy protection against arbitrary adversaries to generate synthetic heavy-tailed data with a formal privacy guarantee while preserving high levels of utility. We propose using the K-Norm Gradient Mechanism (KNG) with quantile regression for DP synthetic data generation. The proposed methodology offers the flexibility of the well-known exponential mechanism while adding less noise. We propose implementing KNG in a stepwise and sandwich order, such that new quantile estimation relies on previously sampled quantiles, to more efficiently use the privacy-loss budget. Generating synthetic heavy-tailed data with a formal privacy guarantee while preserving high levels of utility is a challenging problem for data curators and researchers. However, we show that the proposed methods can achieve better data utility relative to the original KNG at the same privacy-loss budget through a simulation study and an application to the Synthetic Longitudinal Business Database.
△ Less
Submitted 14 October, 2023; v1 submitted 5 September, 2023;
originally announced September 2023.
-
Shape And Structure Preserving Differential Privacy
Authors:
Carlos Soto,
Karthik Bharath,
Matthew Reimherr,
Aleksandra Slavkovic
Abstract:
It is common for data structures such as images and shapes of 2D objects to be represented as points on a manifold. The utility of a mechanism to produce sanitized differentially private estimates from such data is intimately linked to how compatible it is with the underlying structure and geometry of the space. In particular, as recently shown, utility of the Laplace mechanism on a positively cur…
▽ More
It is common for data structures such as images and shapes of 2D objects to be represented as points on a manifold. The utility of a mechanism to produce sanitized differentially private estimates from such data is intimately linked to how compatible it is with the underlying structure and geometry of the space. In particular, as recently shown, utility of the Laplace mechanism on a positively curved manifold, such as Kendall's 2D shape space, is significantly influences by the curvature. Focusing on the problem of sanitizing the Fréchet mean of a sample of points on a manifold, we exploit the characterisation of the mean as the minimizer of an objective function comprised of the sum of squared distances and develop a K-norm gradient mechanism on Riemannian manifolds that favors values that produce gradients close to the the zero of the objective function. For the case of positively curved manifolds, we describe how using the gradient of the squared distance function offers better control over sensitivity than the Laplace mechanism, and demonstrate this numerically on a dataset of shapes of corpus callosa. Further illustrations of the mechanism's utility on a sphere and the manifold of symmetric positive definite matrices are also presented.
△ Less
Submitted 21 September, 2022;
originally announced September 2022.
-
Perfect Spectral Clustering with Discrete Covariates
Authors:
Jonathan Hehir,
Xiaoyue Niu,
Aleksandra Slavkovic
Abstract:
Among community detection methods, spectral clustering enjoys two desirable properties: computational efficiency and theoretical guarantees of consistency. Most studies of spectral clustering consider only the edges of a network as input to the algorithm. Here we consider the problem of performing community detection in the presence of discrete node covariates, where network structure is determine…
▽ More
Among community detection methods, spectral clustering enjoys two desirable properties: computational efficiency and theoretical guarantees of consistency. Most studies of spectral clustering consider only the edges of a network as input to the algorithm. Here we consider the problem of performing community detection in the presence of discrete node covariates, where network structure is determined by a combination of a latent block model structure and homophily on the observed covariates. We propose a spectral algorithm that we prove achieves perfect clustering with high probability on a class of large, sparse networks with discrete covariates, effectively separating latent network structure from homophily on observed covariates. To our knowledge, our method is the first to offer a guarantee of consistent latent structure recovery using spectral clustering in the setting where edge formation is dependent on both latent and observed factors.
△ Less
Submitted 16 May, 2022;
originally announced May 2022.
-
Statistical Data Privacy: A Song of Privacy and Utility
Authors:
Aleksandra Slavkovic,
Jeremy Seeman
Abstract:
To quantify trade-offs between increasing demand for open data sharing and concerns about sensitive information disclosure, statistical data privacy (SDP) methodology analyzes data release mechanisms which sanitize outputs based on confidential data. Two dominant frameworks exist: statistical disclosure control (SDC), and more recent, differential privacy (DP). Despite framing differences, both SD…
▽ More
To quantify trade-offs between increasing demand for open data sharing and concerns about sensitive information disclosure, statistical data privacy (SDP) methodology analyzes data release mechanisms which sanitize outputs based on confidential data. Two dominant frameworks exist: statistical disclosure control (SDC), and more recent, differential privacy (DP). Despite framing differences, both SDC and DP share the same statistical problems at its core. For inference problems, we may either design optimal release mechanisms and associated estimators that satisfy bounds on disclosure risk, or we may adjust existing sanitized output to create new optimal estimators. Both problems rely on uncertainty quantification in evaluating risk and utility. In this review, we discuss the statistical foundations common to both SDC and DP, highlight major developments in SDP, and present exciting open research problems in private inference.
△ Less
Submitted 6 May, 2022;
originally announced May 2022.
-
Exact Privacy Guarantees for Markov Chain Implementations of the Exponential Mechanism with Artificial Atoms
Authors:
Jeremy Seeman,
Matthew Reimherr,
Aleksandra Slavkovic
Abstract:
Implementations of the exponential mechanism in differential privacy often require sampling from intractable distributions. When approximate procedures like Markov chain Monte Carlo (MCMC) are used, the end result incurs costs to both privacy and accuracy. Existing work has examined these effects asymptotically, but implementable finite sample results are needed in practice so that users can speci…
▽ More
Implementations of the exponential mechanism in differential privacy often require sampling from intractable distributions. When approximate procedures like Markov chain Monte Carlo (MCMC) are used, the end result incurs costs to both privacy and accuracy. Existing work has examined these effects asymptotically, but implementable finite sample results are needed in practice so that users can specify privacy budgets in advance and implement samplers with exact privacy guarantees. In this paper, we use tools from ergodic theory and perfect simulation to design exact finite runtime sampling algorithms for the exponential mechanism by introducing an intermediate modified target distribution using artificial atoms. We propose an additional modification of this sampling algorithm that maintains its $ε$-DP guarantee and has improved runtime at the cost of some utility. We then compare these methods in scenarios where we can explicitly calculate a $δ$ cost (as in $(ε, δ)$-DP) incurred when using standard MCMC techniques. Much as there is a well known trade-off between privacy and utility, we demonstrate that there is also a trade-off between privacy guarantees and runtime.
△ Less
Submitted 3 April, 2022;
originally announced April 2022.
-
Formal Privacy for Partially Private Data
Authors:
Jeremy Seeman,
Matthew Reimherr,
Aleksandra Slavkovic
Abstract:
Differential privacy (DP) quantifies privacy loss by analyzing noise injected into output statistics. For non-trivial statistics, this noise is necessary to ensure finite privacy loss. However, data curators frequently release collections of statistics where some use DP mechanisms and others are released as-is, i.e., without additional randomized noise. Consequently, DP alone cannot characterize t…
▽ More
Differential privacy (DP) quantifies privacy loss by analyzing noise injected into output statistics. For non-trivial statistics, this noise is necessary to ensure finite privacy loss. However, data curators frequently release collections of statistics where some use DP mechanisms and others are released as-is, i.e., without additional randomized noise. Consequently, DP alone cannot characterize the privacy loss attributable to the entire collection of releases. In this paper, we present a privacy formalism, $(ε, \{ Θ_z\}_{z \in \mathcal{Z}})$-Pufferfish ($ε$-TP for short when $\{ Θ_z\}_{z \in \mathcal{Z}}$ is implied), a collection of Pufferfish mechanisms indexed by realizations of a random variable $Z$ representing public information not protected with DP noise. First, we prove that this definition has similar properties to DP. Next, we introduce mechanisms for releasing partially private data (PPD) satisfying $ε$-TP and prove their desirable properties. We provide algorithms for sampling from the posterior of a parameter given PPD. We then compare this inference approach to the alternative where noisy statistics are deterministically combined with Z. We derive mild conditions under which using our algorithms offers both theoretical and computational improvements over this more common approach. Finally, we demonstrate all the effects above on a case study on COVID-19 data.
△ Less
Submitted 14 December, 2022; v1 submitted 3 April, 2022;
originally announced April 2022.
-
A Latent Class Modeling Approach for Generating Synthetic Data and Making Posterior Inferences from Differentially Private Counts
Authors:
Michelle Pistner Nixon,
Andrés F. Barrientos,
Jerome P. Reiter,
Aleksandra Slavković
Abstract:
Several algorithms exist for creating differentially private counts from contingency tables, such as two-way or three-way marginal counts. The resulting noisy counts generally do not correspond to a coherent contingency table, so that some post-processing step is needed if one wants the released counts to correspond to a coherent contingency table. We present a latent class modeling approach for p…
▽ More
Several algorithms exist for creating differentially private counts from contingency tables, such as two-way or three-way marginal counts. The resulting noisy counts generally do not correspond to a coherent contingency table, so that some post-processing step is needed if one wants the released counts to correspond to a coherent contingency table. We present a latent class modeling approach for post-processing differentially private marginal counts that can be used (i) to create differentially private synthetic data from the set of marginal counts, and (ii) to enable posterior inferences about the confidential counts. We illustrate the approach using a subset of the 2016 American Community Survey Public Use Microdata Sets and the 2004 National Long Term Care Survey.
△ Less
Submitted 25 January, 2022;
originally announced January 2022.
-
Perturbed M-Estimation: A Further Investigation of Robust Statistics for Differential Privacy
Authors:
Aleksandra Slavkovic,
Roberto Molinari
Abstract:
Differential Privacy (DP) provides an elegant mathematical framework for defining a provable disclosure risk in the presence of arbitrary adversaries; it guarantees that whether an individual is in a database or not, the results of a DP procedure should be similar in terms of their probability distribution. While DP mechanisms are provably effective in protecting privacy, they often negatively imp…
▽ More
Differential Privacy (DP) provides an elegant mathematical framework for defining a provable disclosure risk in the presence of arbitrary adversaries; it guarantees that whether an individual is in a database or not, the results of a DP procedure should be similar in terms of their probability distribution. While DP mechanisms are provably effective in protecting privacy, they often negatively impact the utility of the query responses, statistics and/or analyses that come as outputs from these mechanisms. To address this problem, we use ideas from the area of robust statistics which aims at reducing the influence of outlying observations on statistical inference. Based on the preliminary known links between differential privacy and robust statistics, we modify the objective perturbation mechanism by making use of a new bounded function and define a bounded M-Estimator with adequate statistical properties. The resulting privacy mechanism, named "Perturbed M-Estimation", shows important potential in terms of improved statistical utility of its outputs as suggested by some preliminary results. These results consequently support the need to further investigate the use of robust statistical tools for differential privacy.
△ Less
Submitted 5 August, 2021;
originally announced August 2021.
-
Consistent Spectral Clustering of Network Block Models under Local Differential Privacy
Authors:
Jonathan Hehir,
Aleksandra Slavkovic,
Xiaoyue Niu
Abstract:
The stochastic block model (SBM) and degree-corrected block model (DCBM) are network models often selected as the fundamental setting in which to analyze the theoretical properties of community detection methods. We consider the problem of spectral clustering of SBM and DCBM networks under a local form of edge differential privacy. Using a randomized response privacy mechanism called the edge-flip…
▽ More
The stochastic block model (SBM) and degree-corrected block model (DCBM) are network models often selected as the fundamental setting in which to analyze the theoretical properties of community detection methods. We consider the problem of spectral clustering of SBM and DCBM networks under a local form of edge differential privacy. Using a randomized response privacy mechanism called the edge-flip mechanism, we develop theoretical guarantees for differentially private community detection, demonstrating conditions under which this strong privacy guarantee can be upheld while achieving spectral clustering convergence rates that match the known rates without privacy. We prove the strongest theoretical results are achievable for dense networks (those with node degree linear in the number of nodes), while weak consistency is achievable under mild sparsity (node degree greater than $\sqrt{n}$). We empirically demonstrate our results on a number of network examples.
△ Less
Submitted 20 September, 2021; v1 submitted 26 May, 2021;
originally announced May 2021.
-
Privacy for Spatial Point Process Data
Authors:
Adam Walder,
Ephraim M. Hanks,
Aleksandra Slavković
Abstract:
In this work we develop methods for privatizing spatial location data, such as spatial locations of individual disease cases. We propose two novel Bayesian methods for generating synthetic location data based on log-Gaussian Cox processes (LGCPs). We show that conditional predictive ordinate (CPO) estimates can easily be obtained for point process data. We construct a novel risk metric that utiliz…
▽ More
In this work we develop methods for privatizing spatial location data, such as spatial locations of individual disease cases. We propose two novel Bayesian methods for generating synthetic location data based on log-Gaussian Cox processes (LGCPs). We show that conditional predictive ordinate (CPO) estimates can easily be obtained for point process data. We construct a novel risk metric that utilizes CPO estimates to evaluate individual disclosure risks. We adapt the propensity mean square error (pMSE) data utility metric for LGCPs. We demonstrate that our synthesis methods offer an improved risk vs. utility balance in comparison to radial synthesis with a case study of Dr. John Snow's cholera outbreak data.
△ Less
Submitted 28 April, 2020; v1 submitted 28 March, 2020;
originally announced March 2020.
-
Differentially Private Inference for Binomial Data
Authors:
Jordan Awan,
Aleksandra Slavkovic
Abstract:
We derive uniformly most powerful (UMP) tests for simple and one-sided hypotheses for a population proportion within the framework of Differential Privacy (DP), optimizing finite sample performance. We show that in general, DP hypothesis tests can be written in terms of linear constraints, and for exchangeable data can always be expressed as a function of the empirical distribution. Using this str…
▽ More
We derive uniformly most powerful (UMP) tests for simple and one-sided hypotheses for a population proportion within the framework of Differential Privacy (DP), optimizing finite sample performance. We show that in general, DP hypothesis tests can be written in terms of linear constraints, and for exchangeable data can always be expressed as a function of the empirical distribution. Using this structure, we prove a 'Neyman-Pearson lemma' for binomial data under DP, where the DP-UMP only depends on the sample sum. Our tests can also be stated as a post-processing of a random variable, whose distribution we coin ''Truncated-Uniform-Laplace'' (Tulap), a generalization of the Staircase and discrete Laplace distributions. Furthermore, we obtain exact $p$-values, which are easily computed in terms of the Tulap random variable.
Using the above techniques, we show that our tests can be applied to give uniformly most accurate one-sided confidence intervals and optimal confidence distributions. We also derive uniformly most powerful unbiased (UMPU) two-sided tests, which lead to uniformly most accurate unbiased (UMAU) two-sided confidence intervals. We show that our results can be applied to distribution-free hypothesis tests for continuous data. Our simulation results demonstrate that all our tests have exact type I error, and are more powerful than current techniques.
△ Less
Submitted 31 March, 2019;
originally announced April 2019.
-
Benefits and Pitfalls of the Exponential Mechanism with Applications to Hilbert Spaces and Functional PCA
Authors:
Jordan Awan,
Ana Kenney,
Matthew Reimherr,
Aleksandra Slavković
Abstract:
The exponential mechanism is a fundamental tool of Differential Privacy (DP) due to its strong privacy guarantees and flexibility. We study its extension to settings with summaries based on infinite dimensional outputs such as with functional data analysis, shape analysis, and nonparametric statistics. We show that one can design the mechanism with respect to a specific base measure over the outpu…
▽ More
The exponential mechanism is a fundamental tool of Differential Privacy (DP) due to its strong privacy guarantees and flexibility. We study its extension to settings with summaries based on infinite dimensional outputs such as with functional data analysis, shape analysis, and nonparametric statistics. We show that one can design the mechanism with respect to a specific base measure over the output space, such as a Guassian process. We provide a positive result that establishes a Central Limit Theorem for the exponential mechanism quite broadly. We also provide an apparent negative result, showing that the magnitude of the noise introduced for privacy is asymptotically non-negligible relative to the statistical estimation error. We develop an \ep-DP mechanism for functional principal component analysis, applicable in separable Hilbert spaces. We demonstrate its performance via simulations and applications to two datasets.
△ Less
Submitted 30 January, 2019;
originally announced January 2019.
-
pMSE Mechanism: Differentially Private Synthetic Data with Maximal Distributional Similarity
Authors:
Joshua Snoke,
Aleksandra Slavković
Abstract:
We propose a method for the release of differentially private synthetic datasets. In many contexts, data contain sensitive values which cannot be released in their original form in order to protect individuals' privacy. Synthetic data is a protection method that releases alternative values in place of the original ones, and differential privacy (DP) is a formal guarantee for quantifying the privac…
▽ More
We propose a method for the release of differentially private synthetic datasets. In many contexts, data contain sensitive values which cannot be released in their original form in order to protect individuals' privacy. Synthetic data is a protection method that releases alternative values in place of the original ones, and differential privacy (DP) is a formal guarantee for quantifying the privacy loss. We propose a method that maximizes the distributional similarity of the synthetic data relative to the original data using a measure known as the pMSE, while guaranteeing epsilon-differential privacy. Additionally, we relax common DP assumptions concerning the distribution and boundedness of the original data. We prove theoretical results for the privacy guarantee and provide simulations for the empirical failure rate of the theoretical results under typical computational limitations. We also give simulations for the accuracy of linear regression coefficients generated from the synthetic data compared with the accuracy of non-differentially private synthetic data and other differentially private methods. Additionally, our theoretical results extend a prior result for the sensitivity of the Gini Index to include continuous predictors.
△ Less
Submitted 23 May, 2018;
originally announced May 2018.
-
Differentially Private Uniformly Most Powerful Tests for Binomial Data
Authors:
Jordan Awan,
Aleksandra Slavkovic
Abstract:
We derive uniformly most powerful (UMP) tests for simple and one-sided hypotheses for a population proportion within the framework of Differential Privacy (DP), optimizing finite sample performance. We show that in general, DP hypothesis tests for exchangeable data can always be expressed as a function of the empirical distribution. Using this structure, we prove a `Neyman-Pearson lemma' for binom…
▽ More
We derive uniformly most powerful (UMP) tests for simple and one-sided hypotheses for a population proportion within the framework of Differential Privacy (DP), optimizing finite sample performance. We show that in general, DP hypothesis tests for exchangeable data can always be expressed as a function of the empirical distribution. Using this structure, we prove a `Neyman-Pearson lemma' for binomial data under DP, where the DP-UMP only depends on the sample sum. Our tests can also be stated as a post-processing of a random variable, whose distribution we coin "Truncated-Uniform-Laplace" (Tulap), a generalization of the Staircase and discrete Laplace distributions. Furthermore, we obtain exact p-values, which are easily computed in terms of the Tulap random variable. We show that our results also apply to distribution-free hypothesis tests for continuous data. Our simulation results demonstrate that our tests have exact type I error, and are more powerful than current techniques.
△ Less
Submitted 23 May, 2018;
originally announced May 2018.
-
Structure and Sensitivity in Differential Privacy: Comparing K-Norm Mechanisms
Authors:
Jordan Awan,
Aleksandra Slavkovic
Abstract:
Differential privacy (DP), provides a framework for provable privacy protection against arbitrary adversaries, while allowing the release of summary statistics and synthetic data. We address the problem of releasing a noisy real-valued statistic vector $T$, a function of sensitive data under DP, via the class of $K$-norm mechanisms with the goal of minimizing the noise added to achieve privacy. Fi…
▽ More
Differential privacy (DP), provides a framework for provable privacy protection against arbitrary adversaries, while allowing the release of summary statistics and synthetic data. We address the problem of releasing a noisy real-valued statistic vector $T$, a function of sensitive data under DP, via the class of $K$-norm mechanisms with the goal of minimizing the noise added to achieve privacy. First, we introduce the sensitivity space of $T$, which extends the concepts of sensitivity polytope and sensitivity hull to the setting of arbitrary statistics $T$. We then propose a framework consisting of three methods for comparing the $K$-norm mechanisms: 1) a multivariate extension of stochastic dominance, 2) the entropy of the mechanism, and 3) the conditional variance given a direction, to identify the optimal $K$-norm mechanism. In all of these criteria, the optimal $K$-norm mechanism is generated by the convex hull of the sensitivity space. Using our methodology, we extend the objective perturbation and functional mechanisms and apply these tools to logistic and linear regression, allowing for private releases of statistical results. Via simulations and an application to a housing price dataset, we demonstrate that our proposed methodology offers a substantial improvement in utility for the same level of risk.
△ Less
Submitted 31 October, 2024; v1 submitted 28 January, 2018;
originally announced January 2018.
-
Formal Privacy for Functional Data with Gaussian Perturbations
Authors:
Ardalan Mirshani,
Matthew Reimherr,
Aleksandra Slavkovic
Abstract:
Motivated by the rapid rise in statistical tools in Functional Data Analysis, we consider the Gaussian mechanism for achieving differential privacy with parameter estimates taking values in a, potentially infinite-dimensional, separable Banach space. Using classic results from probability theory, we show how densities over function spaces can be utilized to achieve the desired differential privacy…
▽ More
Motivated by the rapid rise in statistical tools in Functional Data Analysis, we consider the Gaussian mechanism for achieving differential privacy with parameter estimates taking values in a, potentially infinite-dimensional, separable Banach space. Using classic results from probability theory, we show how densities over function spaces can be utilized to achieve the desired differential privacy bounds. This extends prior results of Hall et al (2013) to a much broader class of statistical estimates and summaries, including "path level" summaries, nonlinear functionals, and full function releases. By focusing on Banach spaces, we provide a deeper picture of the challenges for privacy with complex data, especially the role regularization plays in balancing utility and privacy. Using an application to penalized smoothing, we explicitly highlight this balance in the context of mean function estimation. Simulations and an application to diffusion tensor imaging are briefly presented, with extensive additions included in a supplement.
△ Less
Submitted 27 January, 2019; v1 submitted 17 November, 2017;
originally announced November 2017.
-
Providing Accurate Models across Private Partitioned Data: Secure Maximum Likelihood Estimation
Authors:
Joshua Snoke,
Timothy R. Brick,
Aleksandra Slavkovic,
Michael D. Hunter
Abstract:
This paper focuses on the privacy paradigm of providing access to researchers to remotely carry out analyses on sensitive data stored behind firewalls. We address the situation where the analysis demands data from multiple physically separate databases which cannot be combined. Motivating this problem are analyses using multiple data sources that currently are only possible through extension work…
▽ More
This paper focuses on the privacy paradigm of providing access to researchers to remotely carry out analyses on sensitive data stored behind firewalls. We address the situation where the analysis demands data from multiple physically separate databases which cannot be combined. Motivating this problem are analyses using multiple data sources that currently are only possible through extension work creating a trusted user network. We develop and demonstrate a method for accurate calculation of the multivariate normal likelihood equation, for a set of parameters given the partitioned data, which can then be maximized to obtain estimates. These estimates are achieved without sharing any data or any true intermediate statistics of the data across firewalls. We show that under a certain set of assumptions our method for estimation across these partitions achieves identical results as estimation with the full data. Privacy is maintained by adding noise at each partition. This ensures each party receives noisy statistics, such that the noise cannot be removed until the last step to obtain a single value, the true total log-likelihood. Potential applications include all methods utilizing parameter estimation through maximizing the multivariate normal likelihood equation. We give detailed algorithms, along with available software, and both a real data example and simulations estimating structural equation models (SEMs) with partitioned data.
△ Less
Submitted 18 October, 2017;
originally announced October 2017.
-
Differentially Private Model Selection with Penalized and Constrained Likelihood
Authors:
Jing Lei,
Anne-Sophie Charest,
Aleksandra Slavkovic,
Adam Smith,
Stephen Fienberg
Abstract:
In statistical disclosure control, the goal of data analysis is twofold: The released information must provide accurate and useful statistics about the underlying population of interest, while minimizing the potential for an individual record to be identified. In recent years, the notion of differential privacy has received much attention in theoretical computer science, machine learning, and stat…
▽ More
In statistical disclosure control, the goal of data analysis is twofold: The released information must provide accurate and useful statistics about the underlying population of interest, while minimizing the potential for an individual record to be identified. In recent years, the notion of differential privacy has received much attention in theoretical computer science, machine learning, and statistics. It provides a rigorous and strong notion of protection for individuals' sensitive information. A fundamental question is how to incorporate differential privacy into traditional statistical inference procedures. In this paper we study model selection in multivariate linear regression under the constraint of differential privacy. We show that model selection procedures based on penalized least squares or likelihood can be made differentially private by a combination of regularization and randomization, and propose two algorithms to do so. We show that our private procedures are consistent under essentially the same conditions as the corresponding non-private procedures. We also find that under differential privacy, the procedure becomes more sensitive to the tuning parameters. We illustrate and evaluate our method using simulation studies and two real data examples.
△ Less
Submitted 14 July, 2016;
originally announced July 2016.
-
General and specific utility measures for synthetic data
Authors:
Joshua Snoke,
Gillian Raab,
Beata Nowok,
Chris Dibben,
Aleksandra Slavkovic
Abstract:
Data holders can produce synthetic versions of datasets when concerns about potential disclosure restrict the availability of the original records. This paper is concerned with methods to judge whether such synthetic data have a distribution that is comparable to that of the original data, what we will term general utility. We consider how general utility compares with specific utility, the simila…
▽ More
Data holders can produce synthetic versions of datasets when concerns about potential disclosure restrict the availability of the original records. This paper is concerned with methods to judge whether such synthetic data have a distribution that is comparable to that of the original data, what we will term general utility. We consider how general utility compares with specific utility, the similarity of results of analyses from the synthetic data and the original data. We adapt a previous general measure of data utility, the propensity score mean-squared-error (pMSE), to the specific case of synthetic data and derive its distribution for the case when the correct synthesis model is used to create the synthetic data. Our asymptotic results are confirmed by a simulation study. We also consider two specific utility measures, confidence interval overlap and standardized difference in summary statistics, which we compare with the general utility results. We present two examples examining this comparison of general and specific utility to real data syntheses and make recommendations for their use for evaluating synthetic data.
△ Less
Submitted 18 June, 2017; v1 submitted 22 April, 2016;
originally announced April 2016.
-
Private Posterior distributions from Variational approximations
Authors:
Vishesh Karwa,
Dan Kifer,
Aleksandra B. Slavković
Abstract:
Privacy preserving mechanisms such as differential privacy inject additional randomness in the form of noise in the data, beyond the sampling mechanism. Ignoring this additional noise can lead to inaccurate and invalid inferences. In this paper, we incorporate the privacy mechanism explicitly into the likelihood function by treating the original data as missing, with an end goal of estimating post…
▽ More
Privacy preserving mechanisms such as differential privacy inject additional randomness in the form of noise in the data, beyond the sampling mechanism. Ignoring this additional noise can lead to inaccurate and invalid inferences. In this paper, we incorporate the privacy mechanism explicitly into the likelihood function by treating the original data as missing, with an end goal of estimating posterior distributions over model parameters. This leads to a principled way of performing valid statistical inference using private data, however, the corresponding likelihoods are intractable. In this paper, we derive fast and accurate variational approximations to tackle such intractable likelihoods that arise due to privacy. We focus on estimating posterior distributions of parameters of the naive Bayes log-linear model, where the sufficient statistics of this model are shared using a differentially private interface. Using a simulation study, we show that the posterior approximations outperform the naive method of ignoring the noise addition mechanism.
△ Less
Submitted 24 November, 2015;
originally announced November 2015.
-
Sharing Social Network Data: Differentially Private Estimation of Exponential-Family Random Graph Models
Authors:
Vishesh Karwa,
Pavel N. Krivitsky,
Aleksandra B. Slavković
Abstract:
Motivated by a real-life problem of sharing social network data that contain sensitive personal information, we propose a novel approach to release and analyze synthetic graphs in order to protect privacy of individual relationships captured by the social network while maintaining the validity of statistical results. A case study using a version of the Enron e-mail corpus dataset demonstrates the…
▽ More
Motivated by a real-life problem of sharing social network data that contain sensitive personal information, we propose a novel approach to release and analyze synthetic graphs in order to protect privacy of individual relationships captured by the social network while maintaining the validity of statistical results. A case study using a version of the Enron e-mail corpus dataset demonstrates the application and usefulness of the proposed techniques in solving the challenging problem of maintaining privacy \emph{and} supporting open access to network data to ensure reproducibility of existing studies and discovering new scientific insights that can be obtained by analyzing such data. We use a simple yet effective randomized response mechanism to generate synthetic networks under $ε$-edge differential privacy, and then use likelihood based inference for missing data and Markov chain Monte Carlo techniques to fit exponential-family random graph models to the generated synthetic networks.
△ Less
Submitted 23 September, 2016; v1 submitted 9 November, 2015;
originally announced November 2015.
-
Differentially Private Exponential Random Graphs
Authors:
Vishesh Karwa,
Aleksandra B. Slavković,
Pavel Krivitsky
Abstract:
We propose methods to release and analyze synthetic graphs in order to protect privacy of individual relationships captured by the social network. Proposed techniques aim at fitting and estimating a wide class of exponential random graph models (ERGMs) in a differentially private manner, and thus offer rigorous privacy guarantees. More specifically, we use the randomized response mechanism to rele…
▽ More
We propose methods to release and analyze synthetic graphs in order to protect privacy of individual relationships captured by the social network. Proposed techniques aim at fitting and estimating a wide class of exponential random graph models (ERGMs) in a differentially private manner, and thus offer rigorous privacy guarantees. More specifically, we use the randomized response mechanism to release networks under $ε$-edge differential privacy. To maintain utility for statistical inference, treating the original graph as missing, we propose a way to use likelihood based inference and Markov chain Monte Carlo (MCMC) techniques to fit ERGMs to the produced synthetic networks. We demonstrate the usefulness of the proposed techniques on a real data example.
△ Less
Submitted 15 April, 2015; v1 submitted 16 September, 2014;
originally announced September 2014.
-
Scalable Privacy-Preserving Data Sharing Methodology for Genome-Wide Association Studies
Authors:
Fei Yu,
Stephen E. Fienberg,
Aleksandra Slavković,
Caroline Uhler
Abstract:
The protection of privacy of individual-level information in genome-wide association study (GWAS) databases has been a major concern of researchers following the publication of "an attack" on GWAS data by Homer et al. (2008) Traditional statistical methods for confidentiality and privacy protection of statistical databases do not scale well to deal with GWAS data, especially in terms of guarantees…
▽ More
The protection of privacy of individual-level information in genome-wide association study (GWAS) databases has been a major concern of researchers following the publication of "an attack" on GWAS data by Homer et al. (2008) Traditional statistical methods for confidentiality and privacy protection of statistical databases do not scale well to deal with GWAS data, especially in terms of guarantees regarding protection from linkage to external information. The more recent concept of differential privacy, introduced by the cryptographic community, is an approach that provides a rigorous definition of privacy with meaningful privacy guarantees in the presence of arbitrary external information, although the guarantees may come at a serious price in terms of data utility. Building on such notions, Uhler et al. (2013) proposed new methods to release aggregate GWAS data without compromising an individual's privacy. We extend the methods developed in Uhler et al. (2013) for releasing differentially-private $χ^2$-statistics by allowing for arbitrary number of cases and controls, and for releasing differentially-private allelic test statistics. We also provide a new interpretation by assuming the controls' data are known, which is a realistic assumption because some GWAS use publicly available data as controls. We assess the performance of the proposed methods through a risk-utility analysis on a real data set consisting of DNA samples collected by the Wellcome Trust Case Control Consortium and compare the methods with the differentially-private release mechanism proposed by Johnson and Shmatikov (2013).
△ Less
Submitted 21 January, 2014;
originally announced January 2014.
-
10 Simple Rules for the Care and Feeding of Scientific Data
Authors:
Alyssa Goodman,
Alberto Pepe,
Alexander W. Blocker,
Christine L. Borgman,
Kyle Cranmer,
Mercè Crosas,
Rosanne Di Stefano,
Yolanda Gil,
Paul Groth,
Margaret Hedstrom,
David W. Hogg,
Vinay Kashyap,
Ashish Mahabal,
Aneta Siemiginowska,
Aleksandra Slavkovic
Abstract:
This article offers a short guide to the steps scientists can take to ensure that their data and associated analyses continue to be of value and to be recognized. In just the past few years, hundreds of scholarly papers and reports have been written on questions of data sharing, data provenance, research reproducibility, licensing, attribution, privacy, and more, but our goal here is not to review…
▽ More
This article offers a short guide to the steps scientists can take to ensure that their data and associated analyses continue to be of value and to be recognized. In just the past few years, hundreds of scholarly papers and reports have been written on questions of data sharing, data provenance, research reproducibility, licensing, attribution, privacy, and more, but our goal here is not to review that literature. Instead, we present a short guide intended for researchers who want to know why it is important to "care for and feed" data, with some practical advice on how to do that.
△ Less
Submitted 9 January, 2014;
originally announced January 2014.
-
Fibers of multi-way contingency tables given conditionals: relation to marginals, cell bounds and Markov bases
Authors:
Aleksandra B. Slavković,
Xiaotian Zhu,
Sonja Petrović
Abstract:
A reference set, or a fiber, of a contingency table is the space of all realizations of the table under a given set of constraints such as marginal totals. Understanding the geometry of this space is a key problem in algebraic statistics, important for conducting exact conditional inference, calculating cell bounds, imputing missing cell values, and assessing the risk of disclosure of sensitive in…
▽ More
A reference set, or a fiber, of a contingency table is the space of all realizations of the table under a given set of constraints such as marginal totals. Understanding the geometry of this space is a key problem in algebraic statistics, important for conducting exact conditional inference, calculating cell bounds, imputing missing cell values, and assessing the risk of disclosure of sensitive information.
Motivated primarily by disclosure limitation problems where constraints can come from summary statistics other than the margins, in this paper we study the space $\mathcal{F_T}$ of all possible multi-way contingency tables for a given sample size and set of observed conditional frequencies. We show that this space can be decomposed according to different possible marginals, which, in turn, are encoded by the solution set of a linear Diophantine equation. We characterize the difference between two fibers: $\mathcal{F_T}$ and the space of tables for a given set of corresponding marginal totals. In particular, we solve a generalization of an open problem posed by Dobra et al. (2008). Our decomposition of $\mathcal{F_T}$ has two important consequences: (1) we derive new cell bounds, some including connections to Directed Acyclic Graphs, and (2) we describe a structure for the Markov bases for the space $\mathcal{F_T}$ that leads to a simplified calculation of Markov bases in this particular setting.
△ Less
Submitted 7 January, 2014;
originally announced January 2014.
-
Inference using noisy degrees: Differentially private $β$-model and synthetic graphs
Authors:
Vishesh Karwa,
Aleksandra Slavković
Abstract:
The $β$-model of random graphs is an exponential family model with the degree sequence as a sufficient statistic. In this paper, we contribute three key results. First, we characterize conditions that lead to a quadratic time algorithm to check for the existence of MLE of the $β$-model, and show that the MLE never exists for the degree partition $β$-model. Second, motivated by privacy problems wit…
▽ More
The $β$-model of random graphs is an exponential family model with the degree sequence as a sufficient statistic. In this paper, we contribute three key results. First, we characterize conditions that lead to a quadratic time algorithm to check for the existence of MLE of the $β$-model, and show that the MLE never exists for the degree partition $β$-model. Second, motivated by privacy problems with network data, we derive a differentially private estimator of the parameters of $β$-model, and show it is consistent and asymptotically normally distributed - it achieves the same rate of convergence as the nonprivate estimator. We present an efficient algorithm for the private estimator that can be used to release synthetic graphs. Our techniques can also be used to release degree distributions and degree partitions accurately and privately, and to perform inference from noisy degrees arising from contexts other than privacy. We evaluate the proposed estimator on real graphs and compare it with a current algorithm for releasing degree distributions and find that it does significantly better. Finally, our paper addresses shortcomings of current approaches to a fundamental problem of how to perform valid statistical inference from data released by privacy mechanisms, and lays a foundational groundwork on how to achieve optimal and private statistical inference in a principled manner by modeling the privacy mechanism; these principles should be applicable to a class of models beyond the $β$-model.
△ Less
Submitted 12 January, 2016; v1 submitted 21 May, 2012;
originally announced May 2012.
-
Privacy-Preserving Data Sharing for Genome-Wide Association Studies
Authors:
Caroline Uhler,
Aleksandra B. Slavkovic,
Stephen E. Fienberg
Abstract:
Traditional statistical methods for confidentiality protection of statistical databases do not scale well to deal with GWAS (genome-wide association studies) databases especially in terms of guarantees regarding protection from linkage to external information. The more recent concept of differential privacy, introduced by the cryptographic community, is an approach which provides a rigorous defini…
▽ More
Traditional statistical methods for confidentiality protection of statistical databases do not scale well to deal with GWAS (genome-wide association studies) databases especially in terms of guarantees regarding protection from linkage to external information. The more recent concept of differential privacy, introduced by the cryptographic community, is an approach which provides a rigorous definition of privacy with meaningful privacy guarantees in the presence of arbitrary external information, although the guarantees come at a serious price in terms of data utility. Building on such notions, we propose new methods to release aggregate GWAS data without compromising an individual's privacy. We present methods for releasing differentially private minor allele frequencies, chi-square statistics and p-values. We compare these approaches on simulated data and on a GWAS study of canine hair length involving 685 dogs. We also propose a privacy-preserving method for finding genome-wide associations based on a differentially-private approach to penalized logistic regression.
△ Less
Submitted 3 May, 2012;
originally announced May 2012.
-
Causal inference in transportation safety studies: Comparison of potential outcomes and causal diagrams
Authors:
Vishesh Karwa,
Aleksandra B. Slavković,
Eric T. Donnell
Abstract:
The research questions that motivate transportation safety studies are causal in nature. Safety researchers typically use observational data to answer such questions, but often without appropriate causal inference methodology. The field of causal inference presents several modeling frameworks for probing empirical data to assess causal relations. This paper focuses on exploring the applicability o…
▽ More
The research questions that motivate transportation safety studies are causal in nature. Safety researchers typically use observational data to answer such questions, but often without appropriate causal inference methodology. The field of causal inference presents several modeling frameworks for probing empirical data to assess causal relations. This paper focuses on exploring the applicability of two such modeling frameworks---Causal Diagrams and Potential Outcomes---for a specific transportation safety problem. The causal effects of pavement marking retroreflectivity on safety of a road segment were estimated. More specifically, the results based on three different implementations of these frameworks on a real data set were compared: Inverse Propensity Score Weighting with regression adjustment and Propensity Score Matching with regression adjustment versus Causal Bayesian Network. The effect of increased pavement marking retroreflectivity was generally found to reduce the probability of target nighttime crashes. However, we found that the magnitude of the causal effects estimated are sensitive to the method used and to the assumptions being violated.
△ Less
Submitted 25 July, 2011;
originally announced July 2011.
-
The space of compatible full conditionals is a unimodular toric variety
Authors:
Aleksandra B Slavkovic,
Seth Sullivant
Abstract:
The set of all m-tuples of compatible full conditional distributions on discrete random variables is an algebraic set whose defining ideal is a unimodular toric ideal. We identify the defining polynomials of these ideals with closed walks on a bipartite graph. Our algebraic characterization provides a natural generalization of the requirement that compatible conditionals have identical odds rati…
▽ More
The set of all m-tuples of compatible full conditional distributions on discrete random variables is an algebraic set whose defining ideal is a unimodular toric ideal. We identify the defining polynomials of these ideals with closed walks on a bipartite graph. Our algebraic characterization provides a natural generalization of the requirement that compatible conditionals have identical odds ratios and holds regardless of the patterns of zeros in the conditional arrays.
△ Less
Submitted 3 May, 2004;
originally announced May 2004.