-
Interpreting Network Differential Privacy
Authors:
Jonathan Hehir,
Xiaoyue Niu,
Aleksandra Slavkovic
Abstract:
How do we interpret the differential privacy (DP) guarantee for network data? We take a deep dive into a popular form of network DP ($\varepsilon$--edge DP) to find that many of its common interpretations are flawed. Drawing on prior work for privacy with correlated data, we interpret DP through the lens of adversarial hypothesis testing and demonstrate a gap between the pairs of hypotheses actual…
▽ More
How do we interpret the differential privacy (DP) guarantee for network data? We take a deep dive into a popular form of network DP ($\varepsilon$--edge DP) to find that many of its common interpretations are flawed. Drawing on prior work for privacy with correlated data, we interpret DP through the lens of adversarial hypothesis testing and demonstrate a gap between the pairs of hypotheses actually protected under DP (tests of complete networks) and the sorts of hypotheses implied to be protected by common claims (tests of individual edges). We demonstrate some conditions under which this gap can be bridged, while leaving some questions open. While some discussion is specific to edge DP, we offer selected results in terms of abstract DP definitions and provide discussion of the implications for other forms of network DP.
△ Less
Submitted 16 April, 2025;
originally announced April 2025.
-
Gaussian Differentially Private Human Faces Under a Face Radial Curve Representation
Authors:
Carlos Soto,
Matthew Reimherr,
Aleksandra Slavkovic,
Mark Shriver
Abstract:
In this paper we consider the problem of releasing a Gaussian Differentially Private (GDP) 3D human face. The human face is a complex structure with many features and inherently tied to one's identity. Protecting this data, in a formally private way, is important yet challenging given the dimensionality of the problem. We extend approximate DP techniques for functional data to the GDP framework. W…
▽ More
In this paper we consider the problem of releasing a Gaussian Differentially Private (GDP) 3D human face. The human face is a complex structure with many features and inherently tied to one's identity. Protecting this data, in a formally private way, is important yet challenging given the dimensionality of the problem. We extend approximate DP techniques for functional data to the GDP framework. We further propose a novel representation, face radial curves, of a 3D face as a set of functions and then utilize our proposed GDP functional data mechanism. To preserve the shape of the face while injecting noise we rely on tools from shape analysis for our novel representation of the face. We show that our method preserves the shape of the average face and injects less noise than traditional methods for the same privacy budget. Our mechanism consists of two primary components, the first is generally applicable to function value summaries (as are commonly found in nonparametric statistics or functional data analysis) while the second is general to disk-like surfaces and hence more applicable than just to human faces.
△ Less
Submitted 15 April, 2025; v1 submitted 10 September, 2024;
originally announced September 2024.
-
Assessing Utility of Differential Privacy for RCTs
Authors:
Soumya Mukherjee,
Aratrika Mustafi,
Aleksandra Slavković,
Lars Vilhuber
Abstract:
Randomized control trials, RCTs, have become a powerful tool for assessing the impact of interventions and policies in many contexts. They are considered the gold-standard for inference in the biomedical fields and in many social sciences. Researchers have published an increasing number of studies that rely on RCTs for at least part of the inference, and these studies typically include the respons…
▽ More
Randomized control trials, RCTs, have become a powerful tool for assessing the impact of interventions and policies in many contexts. They are considered the gold-standard for inference in the biomedical fields and in many social sciences. Researchers have published an increasing number of studies that rely on RCTs for at least part of the inference, and these studies typically include the response data collected, de-identified and sometimes protected through traditional disclosure limitation methods. In this paper, we empirically assess the impact of strong privacy-preservation methodology (with \ac{DP} guarantees), on published analyses from RCTs, leveraging the availability of replication packages (research compendia) in economics and policy analysis. We provide simulations studies and demonstrate how we can replicate the analysis in a published economics article on privacy-protected data under various parametrizations. We find that relatively straightforward DP-based methods allow for inference-valid protection of the published data, though computational issues may limit more complex analyses from using these methods. The results have applicability to researchers wishing to share RCT data, especially in the context of low- and middle-income countries, with strong privacy protection.
△ Less
Submitted 25 September, 2023;
originally announced September 2023.
-
Shape And Structure Preserving Differential Privacy
Authors:
Carlos Soto,
Karthik Bharath,
Matthew Reimherr,
Aleksandra Slavkovic
Abstract:
It is common for data structures such as images and shapes of 2D objects to be represented as points on a manifold. The utility of a mechanism to produce sanitized differentially private estimates from such data is intimately linked to how compatible it is with the underlying structure and geometry of the space. In particular, as recently shown, utility of the Laplace mechanism on a positively cur…
▽ More
It is common for data structures such as images and shapes of 2D objects to be represented as points on a manifold. The utility of a mechanism to produce sanitized differentially private estimates from such data is intimately linked to how compatible it is with the underlying structure and geometry of the space. In particular, as recently shown, utility of the Laplace mechanism on a positively curved manifold, such as Kendall's 2D shape space, is significantly influences by the curvature. Focusing on the problem of sanitizing the Fréchet mean of a sample of points on a manifold, we exploit the characterisation of the mean as the minimizer of an objective function comprised of the sum of squared distances and develop a K-norm gradient mechanism on Riemannian manifolds that favors values that produce gradients close to the the zero of the objective function. For the case of positively curved manifolds, we describe how using the gradient of the squared distance function offers better control over sensitivity than the Laplace mechanism, and demonstrate this numerically on a dataset of shapes of corpus callosa. Further illustrations of the mechanism's utility on a sphere and the manifold of symmetric positive definite matrices are also presented.
△ Less
Submitted 21 September, 2022;
originally announced September 2022.
-
Perfect Spectral Clustering with Discrete Covariates
Authors:
Jonathan Hehir,
Xiaoyue Niu,
Aleksandra Slavkovic
Abstract:
Among community detection methods, spectral clustering enjoys two desirable properties: computational efficiency and theoretical guarantees of consistency. Most studies of spectral clustering consider only the edges of a network as input to the algorithm. Here we consider the problem of performing community detection in the presence of discrete node covariates, where network structure is determine…
▽ More
Among community detection methods, spectral clustering enjoys two desirable properties: computational efficiency and theoretical guarantees of consistency. Most studies of spectral clustering consider only the edges of a network as input to the algorithm. Here we consider the problem of performing community detection in the presence of discrete node covariates, where network structure is determined by a combination of a latent block model structure and homophily on the observed covariates. We propose a spectral algorithm that we prove achieves perfect clustering with high probability on a class of large, sparse networks with discrete covariates, effectively separating latent network structure from homophily on observed covariates. To our knowledge, our method is the first to offer a guarantee of consistent latent structure recovery using spectral clustering in the setting where edge formation is dependent on both latent and observed factors.
△ Less
Submitted 16 May, 2022;
originally announced May 2022.
-
Statistical Data Privacy: A Song of Privacy and Utility
Authors:
Aleksandra Slavkovic,
Jeremy Seeman
Abstract:
To quantify trade-offs between increasing demand for open data sharing and concerns about sensitive information disclosure, statistical data privacy (SDP) methodology analyzes data release mechanisms which sanitize outputs based on confidential data. Two dominant frameworks exist: statistical disclosure control (SDC), and more recent, differential privacy (DP). Despite framing differences, both SD…
▽ More
To quantify trade-offs between increasing demand for open data sharing and concerns about sensitive information disclosure, statistical data privacy (SDP) methodology analyzes data release mechanisms which sanitize outputs based on confidential data. Two dominant frameworks exist: statistical disclosure control (SDC), and more recent, differential privacy (DP). Despite framing differences, both SDC and DP share the same statistical problems at its core. For inference problems, we may either design optimal release mechanisms and associated estimators that satisfy bounds on disclosure risk, or we may adjust existing sanitized output to create new optimal estimators. Both problems rely on uncertainty quantification in evaluating risk and utility. In this review, we discuss the statistical foundations common to both SDC and DP, highlight major developments in SDP, and present exciting open research problems in private inference.
△ Less
Submitted 6 May, 2022;
originally announced May 2022.
-
Exact Privacy Guarantees for Markov Chain Implementations of the Exponential Mechanism with Artificial Atoms
Authors:
Jeremy Seeman,
Matthew Reimherr,
Aleksandra Slavkovic
Abstract:
Implementations of the exponential mechanism in differential privacy often require sampling from intractable distributions. When approximate procedures like Markov chain Monte Carlo (MCMC) are used, the end result incurs costs to both privacy and accuracy. Existing work has examined these effects asymptotically, but implementable finite sample results are needed in practice so that users can speci…
▽ More
Implementations of the exponential mechanism in differential privacy often require sampling from intractable distributions. When approximate procedures like Markov chain Monte Carlo (MCMC) are used, the end result incurs costs to both privacy and accuracy. Existing work has examined these effects asymptotically, but implementable finite sample results are needed in practice so that users can specify privacy budgets in advance and implement samplers with exact privacy guarantees. In this paper, we use tools from ergodic theory and perfect simulation to design exact finite runtime sampling algorithms for the exponential mechanism by introducing an intermediate modified target distribution using artificial atoms. We propose an additional modification of this sampling algorithm that maintains its $ε$-DP guarantee and has improved runtime at the cost of some utility. We then compare these methods in scenarios where we can explicitly calculate a $δ$ cost (as in $(ε, δ)$-DP) incurred when using standard MCMC techniques. Much as there is a well known trade-off between privacy and utility, we demonstrate that there is also a trade-off between privacy guarantees and runtime.
△ Less
Submitted 3 April, 2022;
originally announced April 2022.
-
Formal Privacy for Partially Private Data
Authors:
Jeremy Seeman,
Matthew Reimherr,
Aleksandra Slavkovic
Abstract:
Differential privacy (DP) quantifies privacy loss by analyzing noise injected into output statistics. For non-trivial statistics, this noise is necessary to ensure finite privacy loss. However, data curators frequently release collections of statistics where some use DP mechanisms and others are released as-is, i.e., without additional randomized noise. Consequently, DP alone cannot characterize t…
▽ More
Differential privacy (DP) quantifies privacy loss by analyzing noise injected into output statistics. For non-trivial statistics, this noise is necessary to ensure finite privacy loss. However, data curators frequently release collections of statistics where some use DP mechanisms and others are released as-is, i.e., without additional randomized noise. Consequently, DP alone cannot characterize the privacy loss attributable to the entire collection of releases. In this paper, we present a privacy formalism, $(ε, \{ Θ_z\}_{z \in \mathcal{Z}})$-Pufferfish ($ε$-TP for short when $\{ Θ_z\}_{z \in \mathcal{Z}}$ is implied), a collection of Pufferfish mechanisms indexed by realizations of a random variable $Z$ representing public information not protected with DP noise. First, we prove that this definition has similar properties to DP. Next, we introduce mechanisms for releasing partially private data (PPD) satisfying $ε$-TP and prove their desirable properties. We provide algorithms for sampling from the posterior of a parameter given PPD. We then compare this inference approach to the alternative where noisy statistics are deterministically combined with Z. We derive mild conditions under which using our algorithms offers both theoretical and computational improvements over this more common approach. Finally, we demonstrate all the effects above on a case study on COVID-19 data.
△ Less
Submitted 14 December, 2022; v1 submitted 3 April, 2022;
originally announced April 2022.
-
Perturbed M-Estimation: A Further Investigation of Robust Statistics for Differential Privacy
Authors:
Aleksandra Slavkovic,
Roberto Molinari
Abstract:
Differential Privacy (DP) provides an elegant mathematical framework for defining a provable disclosure risk in the presence of arbitrary adversaries; it guarantees that whether an individual is in a database or not, the results of a DP procedure should be similar in terms of their probability distribution. While DP mechanisms are provably effective in protecting privacy, they often negatively imp…
▽ More
Differential Privacy (DP) provides an elegant mathematical framework for defining a provable disclosure risk in the presence of arbitrary adversaries; it guarantees that whether an individual is in a database or not, the results of a DP procedure should be similar in terms of their probability distribution. While DP mechanisms are provably effective in protecting privacy, they often negatively impact the utility of the query responses, statistics and/or analyses that come as outputs from these mechanisms. To address this problem, we use ideas from the area of robust statistics which aims at reducing the influence of outlying observations on statistical inference. Based on the preliminary known links between differential privacy and robust statistics, we modify the objective perturbation mechanism by making use of a new bounded function and define a bounded M-Estimator with adequate statistical properties. The resulting privacy mechanism, named "Perturbed M-Estimation", shows important potential in terms of improved statistical utility of its outputs as suggested by some preliminary results. These results consequently support the need to further investigate the use of robust statistical tools for differential privacy.
△ Less
Submitted 5 August, 2021;
originally announced August 2021.
-
Consistent Spectral Clustering of Network Block Models under Local Differential Privacy
Authors:
Jonathan Hehir,
Aleksandra Slavkovic,
Xiaoyue Niu
Abstract:
The stochastic block model (SBM) and degree-corrected block model (DCBM) are network models often selected as the fundamental setting in which to analyze the theoretical properties of community detection methods. We consider the problem of spectral clustering of SBM and DCBM networks under a local form of edge differential privacy. Using a randomized response privacy mechanism called the edge-flip…
▽ More
The stochastic block model (SBM) and degree-corrected block model (DCBM) are network models often selected as the fundamental setting in which to analyze the theoretical properties of community detection methods. We consider the problem of spectral clustering of SBM and DCBM networks under a local form of edge differential privacy. Using a randomized response privacy mechanism called the edge-flip mechanism, we develop theoretical guarantees for differentially private community detection, demonstrating conditions under which this strong privacy guarantee can be upheld while achieving spectral clustering convergence rates that match the known rates without privacy. We prove the strongest theoretical results are achievable for dense networks (those with node degree linear in the number of nodes), while weak consistency is achievable under mild sparsity (node degree greater than $\sqrt{n}$). We empirically demonstrate our results on a number of network examples.
△ Less
Submitted 20 September, 2021; v1 submitted 26 May, 2021;
originally announced May 2021.
-
Differentially Private Inference for Binomial Data
Authors:
Jordan Awan,
Aleksandra Slavkovic
Abstract:
We derive uniformly most powerful (UMP) tests for simple and one-sided hypotheses for a population proportion within the framework of Differential Privacy (DP), optimizing finite sample performance. We show that in general, DP hypothesis tests can be written in terms of linear constraints, and for exchangeable data can always be expressed as a function of the empirical distribution. Using this str…
▽ More
We derive uniformly most powerful (UMP) tests for simple and one-sided hypotheses for a population proportion within the framework of Differential Privacy (DP), optimizing finite sample performance. We show that in general, DP hypothesis tests can be written in terms of linear constraints, and for exchangeable data can always be expressed as a function of the empirical distribution. Using this structure, we prove a 'Neyman-Pearson lemma' for binomial data under DP, where the DP-UMP only depends on the sample sum. Our tests can also be stated as a post-processing of a random variable, whose distribution we coin ''Truncated-Uniform-Laplace'' (Tulap), a generalization of the Staircase and discrete Laplace distributions. Furthermore, we obtain exact $p$-values, which are easily computed in terms of the Tulap random variable.
Using the above techniques, we show that our tests can be applied to give uniformly most accurate one-sided confidence intervals and optimal confidence distributions. We also derive uniformly most powerful unbiased (UMPU) two-sided tests, which lead to uniformly most accurate unbiased (UMAU) two-sided confidence intervals. We show that our results can be applied to distribution-free hypothesis tests for continuous data. Our simulation results demonstrate that all our tests have exact type I error, and are more powerful than current techniques.
△ Less
Submitted 31 March, 2019;
originally announced April 2019.
-
Benefits and Pitfalls of the Exponential Mechanism with Applications to Hilbert Spaces and Functional PCA
Authors:
Jordan Awan,
Ana Kenney,
Matthew Reimherr,
Aleksandra Slavković
Abstract:
The exponential mechanism is a fundamental tool of Differential Privacy (DP) due to its strong privacy guarantees and flexibility. We study its extension to settings with summaries based on infinite dimensional outputs such as with functional data analysis, shape analysis, and nonparametric statistics. We show that one can design the mechanism with respect to a specific base measure over the outpu…
▽ More
The exponential mechanism is a fundamental tool of Differential Privacy (DP) due to its strong privacy guarantees and flexibility. We study its extension to settings with summaries based on infinite dimensional outputs such as with functional data analysis, shape analysis, and nonparametric statistics. We show that one can design the mechanism with respect to a specific base measure over the output space, such as a Guassian process. We provide a positive result that establishes a Central Limit Theorem for the exponential mechanism quite broadly. We also provide an apparent negative result, showing that the magnitude of the noise introduced for privacy is asymptotically non-negligible relative to the statistical estimation error. We develop an \ep-DP mechanism for functional principal component analysis, applicable in separable Hilbert spaces. We demonstrate its performance via simulations and applications to two datasets.
△ Less
Submitted 30 January, 2019;
originally announced January 2019.
-
Private Posterior distributions from Variational approximations
Authors:
Vishesh Karwa,
Dan Kifer,
Aleksandra B. Slavković
Abstract:
Privacy preserving mechanisms such as differential privacy inject additional randomness in the form of noise in the data, beyond the sampling mechanism. Ignoring this additional noise can lead to inaccurate and invalid inferences. In this paper, we incorporate the privacy mechanism explicitly into the likelihood function by treating the original data as missing, with an end goal of estimating post…
▽ More
Privacy preserving mechanisms such as differential privacy inject additional randomness in the form of noise in the data, beyond the sampling mechanism. Ignoring this additional noise can lead to inaccurate and invalid inferences. In this paper, we incorporate the privacy mechanism explicitly into the likelihood function by treating the original data as missing, with an end goal of estimating posterior distributions over model parameters. This leads to a principled way of performing valid statistical inference using private data, however, the corresponding likelihoods are intractable. In this paper, we derive fast and accurate variational approximations to tackle such intractable likelihoods that arise due to privacy. We focus on estimating posterior distributions of parameters of the naive Bayes log-linear model, where the sufficient statistics of this model are shared using a differentially private interface. Using a simulation study, we show that the posterior approximations outperform the naive method of ignoring the noise addition mechanism.
△ Less
Submitted 24 November, 2015;
originally announced November 2015.
-
Sharing Social Network Data: Differentially Private Estimation of Exponential-Family Random Graph Models
Authors:
Vishesh Karwa,
Pavel N. Krivitsky,
Aleksandra B. Slavković
Abstract:
Motivated by a real-life problem of sharing social network data that contain sensitive personal information, we propose a novel approach to release and analyze synthetic graphs in order to protect privacy of individual relationships captured by the social network while maintaining the validity of statistical results. A case study using a version of the Enron e-mail corpus dataset demonstrates the…
▽ More
Motivated by a real-life problem of sharing social network data that contain sensitive personal information, we propose a novel approach to release and analyze synthetic graphs in order to protect privacy of individual relationships captured by the social network while maintaining the validity of statistical results. A case study using a version of the Enron e-mail corpus dataset demonstrates the application and usefulness of the proposed techniques in solving the challenging problem of maintaining privacy \emph{and} supporting open access to network data to ensure reproducibility of existing studies and discovering new scientific insights that can be obtained by analyzing such data. We use a simple yet effective randomized response mechanism to generate synthetic networks under $ε$-edge differential privacy, and then use likelihood based inference for missing data and Markov chain Monte Carlo techniques to fit exponential-family random graph models to the generated synthetic networks.
△ Less
Submitted 23 September, 2016; v1 submitted 9 November, 2015;
originally announced November 2015.
-
Differentially Private Exponential Random Graphs
Authors:
Vishesh Karwa,
Aleksandra B. Slavković,
Pavel Krivitsky
Abstract:
We propose methods to release and analyze synthetic graphs in order to protect privacy of individual relationships captured by the social network. Proposed techniques aim at fitting and estimating a wide class of exponential random graph models (ERGMs) in a differentially private manner, and thus offer rigorous privacy guarantees. More specifically, we use the randomized response mechanism to rele…
▽ More
We propose methods to release and analyze synthetic graphs in order to protect privacy of individual relationships captured by the social network. Proposed techniques aim at fitting and estimating a wide class of exponential random graph models (ERGMs) in a differentially private manner, and thus offer rigorous privacy guarantees. More specifically, we use the randomized response mechanism to release networks under $ε$-edge differential privacy. To maintain utility for statistical inference, treating the original graph as missing, we propose a way to use likelihood based inference and Markov chain Monte Carlo (MCMC) techniques to fit ERGMs to the produced synthetic networks. We demonstrate the usefulness of the proposed techniques on a real data example.
△ Less
Submitted 15 April, 2015; v1 submitted 16 September, 2014;
originally announced September 2014.
-
10 Simple Rules for the Care and Feeding of Scientific Data
Authors:
Alyssa Goodman,
Alberto Pepe,
Alexander W. Blocker,
Christine L. Borgman,
Kyle Cranmer,
Mercè Crosas,
Rosanne Di Stefano,
Yolanda Gil,
Paul Groth,
Margaret Hedstrom,
David W. Hogg,
Vinay Kashyap,
Ashish Mahabal,
Aneta Siemiginowska,
Aleksandra Slavkovic
Abstract:
This article offers a short guide to the steps scientists can take to ensure that their data and associated analyses continue to be of value and to be recognized. In just the past few years, hundreds of scholarly papers and reports have been written on questions of data sharing, data provenance, research reproducibility, licensing, attribution, privacy, and more, but our goal here is not to review…
▽ More
This article offers a short guide to the steps scientists can take to ensure that their data and associated analyses continue to be of value and to be recognized. In just the past few years, hundreds of scholarly papers and reports have been written on questions of data sharing, data provenance, research reproducibility, licensing, attribution, privacy, and more, but our goal here is not to review that literature. Instead, we present a short guide intended for researchers who want to know why it is important to "care for and feed" data, with some practical advice on how to do that.
△ Less
Submitted 9 January, 2014;
originally announced January 2014.
-
Inference using noisy degrees: Differentially private $β$-model and synthetic graphs
Authors:
Vishesh Karwa,
Aleksandra Slavković
Abstract:
The $β$-model of random graphs is an exponential family model with the degree sequence as a sufficient statistic. In this paper, we contribute three key results. First, we characterize conditions that lead to a quadratic time algorithm to check for the existence of MLE of the $β$-model, and show that the MLE never exists for the degree partition $β$-model. Second, motivated by privacy problems wit…
▽ More
The $β$-model of random graphs is an exponential family model with the degree sequence as a sufficient statistic. In this paper, we contribute three key results. First, we characterize conditions that lead to a quadratic time algorithm to check for the existence of MLE of the $β$-model, and show that the MLE never exists for the degree partition $β$-model. Second, motivated by privacy problems with network data, we derive a differentially private estimator of the parameters of $β$-model, and show it is consistent and asymptotically normally distributed - it achieves the same rate of convergence as the nonprivate estimator. We present an efficient algorithm for the private estimator that can be used to release synthetic graphs. Our techniques can also be used to release degree distributions and degree partitions accurately and privately, and to perform inference from noisy degrees arising from contexts other than privacy. We evaluate the proposed estimator on real graphs and compare it with a current algorithm for releasing degree distributions and find that it does significantly better. Finally, our paper addresses shortcomings of current approaches to a fundamental problem of how to perform valid statistical inference from data released by privacy mechanisms, and lays a foundational groundwork on how to achieve optimal and private statistical inference in a principled manner by modeling the privacy mechanism; these principles should be applicable to a class of models beyond the $β$-model.
△ Less
Submitted 12 January, 2016; v1 submitted 21 May, 2012;
originally announced May 2012.
-
Privacy-Preserving Data Sharing for Genome-Wide Association Studies
Authors:
Caroline Uhler,
Aleksandra B. Slavkovic,
Stephen E. Fienberg
Abstract:
Traditional statistical methods for confidentiality protection of statistical databases do not scale well to deal with GWAS (genome-wide association studies) databases especially in terms of guarantees regarding protection from linkage to external information. The more recent concept of differential privacy, introduced by the cryptographic community, is an approach which provides a rigorous defini…
▽ More
Traditional statistical methods for confidentiality protection of statistical databases do not scale well to deal with GWAS (genome-wide association studies) databases especially in terms of guarantees regarding protection from linkage to external information. The more recent concept of differential privacy, introduced by the cryptographic community, is an approach which provides a rigorous definition of privacy with meaningful privacy guarantees in the presence of arbitrary external information, although the guarantees come at a serious price in terms of data utility. Building on such notions, we propose new methods to release aggregate GWAS data without compromising an individual's privacy. We present methods for releasing differentially private minor allele frequencies, chi-square statistics and p-values. We compare these approaches on simulated data and on a GWAS study of canine hair length involving 685 dogs. We also propose a privacy-preserving method for finding genome-wide associations based on a differentially-private approach to penalized logistic regression.
△ Less
Submitted 3 May, 2012;
originally announced May 2012.