-
But Can You Use It? Design Recommendations for Differentially Private Interactive Systems
Authors:
Liudas Panavas,
Joshua Snoke,
Erika Tyagi,
Claire McKay Bowen,
Aaron R. Williams
Abstract:
Accessing data collected by federal statistical agencies is essential for public policy research and improving evidence-based decision making, such as evaluating the effectiveness of social programs, understanding demographic shifts, or addressing public health challenges. Differentially private interactive systems, or validation servers, can form a crucial part of the data-sharing infrastructure.…
▽ More
Accessing data collected by federal statistical agencies is essential for public policy research and improving evidence-based decision making, such as evaluating the effectiveness of social programs, understanding demographic shifts, or addressing public health challenges. Differentially private interactive systems, or validation servers, can form a crucial part of the data-sharing infrastructure. They may allow researchers to query targeted statistics, providing flexible, efficient access to specific insights, reducing the need for broad data releases and supporting timely, focused research. However, they have not yet been practically implemented. While substantial theoretical work has been conducted on the privacy and accuracy guarantees of differentially private mechanisms, prior efforts have not considered usability as an explicit goal of interactive systems. This work outlines and considers the barriers to developing differentially private interactive systems for informing public policy and offers an alternative way forward. We propose balancing three design considerations: privacy assurance, statistical utility, and system usability, we develop recommendations for making differentially private interactive systems work in practice, we present an example architecture based on these recommendations, and we provide an outline of how to conduct the necessary user-testing. Our work seeks to move the practical development of differentially private interactive systems forward to better aid public policy making and spark future research.
△ Less
Submitted 16 December, 2024;
originally announced December 2024.
-
Advancing Microdata Privacy Protection: A Review of Synthetic Data
Authors:
Jingchen Hu,
Claire McKay Bowen
Abstract:
Synthetic data generation is a powerful tool for privacy protection when considering public release of record-level data files. Initially proposed about three decades ago, it has generated significant research and application interest. To meet the pressing demand of data privacy protection in a variety of contexts, the field needs more researchers and practitioners. This review provides a comprehe…
▽ More
Synthetic data generation is a powerful tool for privacy protection when considering public release of record-level data files. Initially proposed about three decades ago, it has generated significant research and application interest. To meet the pressing demand of data privacy protection in a variety of contexts, the field needs more researchers and practitioners. This review provides a comprehensive introduction to synthetic data, including technical details of their generation and evaluation. Our review also addresses the challenges and limitations of synthetic data, discusses practical applications, and provides thoughts for future work.
△ Less
Submitted 1 August, 2023;
originally announced August 2023.
-
A Feasibility Study of Differentially Private Summary Statistics and Regression Analyses with Evaluations on Administrative and Survey Data
Authors:
Andrés F. Barrientos,
Aaron R. Williams,
Joshua Snoke,
Claire McKay Bowen
Abstract:
Federal administrative data, such as tax data, are invaluable for research, but because of privacy concerns, access to these data is typically limited to select agencies and a few individuals. An alternative to sharing microlevel data is to allow individuals to query statistics without directly accessing the confidential data. This paper studies the feasibility of using differentially private (DP)…
▽ More
Federal administrative data, such as tax data, are invaluable for research, but because of privacy concerns, access to these data is typically limited to select agencies and a few individuals. An alternative to sharing microlevel data is to allow individuals to query statistics without directly accessing the confidential data. This paper studies the feasibility of using differentially private (DP) methods to make certain queries while preserving privacy. We also include new methodological adaptations to existing DP regression methods for using new data types and returning standard error estimates. We define feasibility as the impact of DP methods on analyses for making public policy decisions and the queries accuracy according to several utility metrics. We evaluate the methods using Internal Revenue Service data and public-use Current Population Survey data and identify how specific data features might challenge some of these methods. Our findings show that DP methods are feasible for simple, univariate statistics but struggle to produce accurate regression estimates and confidence intervals. To the best of our knowledge, this is the first comprehensive statistical study of DP regression methodology on real, complex datasets, and the findings have significant implications for the direction of a growing research field and public policy.
△ Less
Submitted 30 June, 2023; v1 submitted 22 October, 2021;
originally announced October 2021.
-
Comparative Study of Differentially Private Synthetic Data Algorithms from the NIST PSCR Differential Privacy Synthetic Data Challenge
Authors:
Claire McKay Bowen,
Joshua Snoke
Abstract:
Differentially private synthetic data generation offers a recent solution to release analytically useful data while preserving the privacy of individuals in the data. In order to utilize these algorithms for public policy decisions, policymakers need an accurate understanding of these algorithms' comparative performance. Correspondingly, data practitioners require standard metrics for evaluating t…
▽ More
Differentially private synthetic data generation offers a recent solution to release analytically useful data while preserving the privacy of individuals in the data. In order to utilize these algorithms for public policy decisions, policymakers need an accurate understanding of these algorithms' comparative performance. Correspondingly, data practitioners require standard metrics for evaluating the analytic qualities of the synthetic data. In this paper, we present an in-depth evaluation of several differentially private synthetic data algorithms using actual differentially private synthetic data sets created by contestants in the 2018-2019 National Institute of Standards and Technology Public Safety Communications Research (NIST PSCR) Division's ``Differential Privacy Synthetic Data Challenge.'' We offer analyses of these algorithms based on both the accuracy of the data they created and their usability by potential data providers. We frame the methods used in the NIST PSCR data challenge within the broader differentially private synthetic data literature. We implement additional utility metrics, including two of our own, on the differentially private synthetic data and compare mechanism utility on three categories. Our comparative assessment of the differentially private data synthesis methods and the quality metrics shows the relative usefulness, the general strengths and weaknesses, and offers preferred choices of algorithms and metrics. Finally we describe the implications of our evaluation for policymakers seeking to implement differentially private synthetic data algorithms on future data products.
△ Less
Submitted 12 October, 2020; v1 submitted 28 November, 2019;
originally announced November 2019.
-
Differentially Private Data Release via Statistical Election to Partition Sequentially
Authors:
Claire McKay Bowen,
Fang Liu,
Binyue Su
Abstract:
Differential Privacy (DP) formalizes privacy in mathematical terms and provides a robust concept for privacy protection. DIfferentially Private Data Synthesis (DIPS) techniques produce and release synthetic individual-level data in the DP framework. One key challenge to developing DIPS methods is preservation of the statistical utility of synthetic data, especially in high-dimensional settings. We…
▽ More
Differential Privacy (DP) formalizes privacy in mathematical terms and provides a robust concept for privacy protection. DIfferentially Private Data Synthesis (DIPS) techniques produce and release synthetic individual-level data in the DP framework. One key challenge to developing DIPS methods is preservation of the statistical utility of synthetic data, especially in high-dimensional settings. We propose a new DIPS approach, STatistical Election to Partition Sequentially (STEPS) that partitions data by attributes according to their importance ranks according to either a practical or statistical importance measure. STEPS aims to achieve better original information preservation for the attributes with higher importance ranks and produce thus more useful synthetic data overall. We present an algorithm to implement the STEPS procedure and employ the privacy budget composability to ensure the overall privacy cost is controlled at the pre-specified value. We apply the STEPS procedure to both simulated data and the 2000-2012 Current Population Survey youth voter data. The results suggest STEPS can better preserve the population-level information and the original information for some analyses compared to PrivBayes, a modified Uniform histogram approach, and the flat Laplace sanitizer.
△ Less
Submitted 20 October, 2020; v1 submitted 18 March, 2018;
originally announced March 2018.
-
Comparative Study of Differentially Private Data Synthesis Methods
Authors:
Claire McKay Bowen,
Fang Liu
Abstract:
When sharing data among researchers or releasing data for public use, there is a risk of exposing sensitive information of individuals in the data set. Data synthesis (DS) is a statistical disclosure limitation technique for releasing synthetic data sets with pseudo individual records. Traditional DS techniques often rely on strong assumptions of a data intruder's behaviors and background knowledg…
▽ More
When sharing data among researchers or releasing data for public use, there is a risk of exposing sensitive information of individuals in the data set. Data synthesis (DS) is a statistical disclosure limitation technique for releasing synthetic data sets with pseudo individual records. Traditional DS techniques often rely on strong assumptions of a data intruder's behaviors and background knowledge to assess disclosure risk. Differential privacy (DP) formulates a theoretical approach for a strong and robust privacy guarantee in data release without having to model intruders' behaviors. Efforts have been made aiming to incorporate the DP concept in the DS process. In this paper, we examine current DIfferentially Private Data Synthesis (DIPS) techniques for releasing individual-level surrogate data for the original data, compare the techniques conceptually, and evaluate the statistical utility and inferential properties of the synthetic data via each DIPS technique through extensive simulation studies. Our work sheds light on the practical feasibility and utility of the various DIPS approaches, and suggests future research directions for DIPS.
△ Less
Submitted 8 January, 2019; v1 submitted 2 February, 2016;
originally announced February 2016.
-
Partitioning a Large Simulation as It Runs
Authors:
Kary Myers,
Earl Lawrence,
Michael Fugate,
Claire McKay Bowen,
Lawrence Ticknor,
Jon Woodring,
Joanne Wendelberger,
Jim Ahrens
Abstract:
As computer simulations continue to grow in size and complexity, they present a particularly challenging class of big data problems. Many application areas are moving toward exascale computing systems, systems that perform $10^{18}$ FLOPS (FLoating-point Operations Per Second) --- a billion billion calculations per second. Simulations at this scale can generate output that exceeds both the storage…
▽ More
As computer simulations continue to grow in size and complexity, they present a particularly challenging class of big data problems. Many application areas are moving toward exascale computing systems, systems that perform $10^{18}$ FLOPS (FLoating-point Operations Per Second) --- a billion billion calculations per second. Simulations at this scale can generate output that exceeds both the storage capacity and the bandwidth available for transfer to storage, making post-processing and analysis challenging. One approach is to embed some analyses in the simulation while the simulation is running --- a strategy often called in situ analysis --- to reduce the need for transfer to storage. Another strategy is to save only a reduced set of time steps rather than the full simulation. Typically the selected time steps are evenly spaced, where the spacing can be defined by the budget for storage and transfer. This paper combines both of these ideas to introduce an online in situ method for identifying a reduced set of time steps of the simulation to save. Our approach significantly reduces the data transfer and storage requirements, and it provides improved fidelity to the simulation to facilitate post-processing and reconstruction. We illustrate the method using a computer simulation that supported NASA's 2009 Lunar Crater Observation and Sensing Satellite mission.
△ Less
Submitted 23 September, 2015; v1 submitted 2 September, 2014;
originally announced September 2014.