-
Four checks for low-fidelity synthetic data: recommendations for disclosure control and quality evaluation
Authors:
Gillian M Raab,
Sophie McCall,
Liam Cavin
Abstract:
Confidential administrative data is usually only available to researchers within a trusted research environment (TRE). Recently, some UK groups have proposed that low-fidelity synthetic data (LFSD) be made available to researchers outside the TRE to allow code-testing and data discovery. There is a need for transparency so that those who access LFSD know how it has been created and what to expect from it. Relationships between variables are not maintained in LFSD, but a real or apparent data breach can still occur from its release. To be useful to researchers for preliminary analyses, LFSD needs to meet some minimum quality standards. Researchers who will use the LFSD need to have details of how it compares with the data they will access in the TRE clearly explained and documented. We propose that the following four checks should be run by data controllers before releasing LFSD to ensure it is well documented, useful and non-disclosive.
1. Labelling: To avoid an apparent data breach, steps must be taken to ensure that the SD is clearly identified as not being real data.
2. Disclosure: The LFSD should undergo disclosure risk evaluation as described below, and any risks identified should be mitigated.
3. Structure: The structure of the SD should be as similar as possible to the TRE data.
4. Documentation: Differences in the structure of the SD compared to data in the TRE must be documented, and the way(s) in which analyses of the SD are expected to differ from those of data in the TRE must be clarified.
We propose details of each of these below, but a strict, rule-based approach should not be used. Instead, the data holders should modify the rules to take account of the type of information that may be disclosed and the circumstances of the data release (to whom and under what conditions).
Submitted 18 March, 2025;
originally announced March 2025.
-
Privacy risk from synthetic data: practical proposals
Authors:
Gillian M Raab
Abstract:
This paper proposes and compares measures of identity and attribute disclosure risk for synthetic data. Data custodians can use the methods proposed here to inform the decision as to whether to release synthetic versions of confidential data. Different measures are evaluated on two data sets. Insight into the measures is obtained by examining the details of the records identified as posing a disclosure risk. This leads to methods to identify, and possibly exclude, apparently risky records where the identification or attribution would be expected by someone with background knowledge of the data. The methods described are available as part of the synthpop package for R.
Submitted 16 May, 2025; v1 submitted 6 September, 2024;
originally announced September 2024.
-
Practical privacy metrics for synthetic data
Authors:
Gillian M Raab,
Beata Nowok,
Chris Dibben
Abstract:
This paper explains how the synthpop package for R has been extended to include functions to calculate measures of identity and attribute disclosure risk for synthetic data that measure risks for the records used to create the synthetic data. The basic function, disclosure, calculates identity disclosure for a set of quasi-identifiers (keys) and attribute disclosure for one variable specified as a target from the same set of keys. The second function, disclosure.summary, is a wrapper for the first and presents summary results for a set of targets. This short paper explains the measures of disclosure risk and documents how they are calculated. We recommend two measures: $RepU$ (replicated uniques) for identity disclosure and $DiSCO$ (Disclosive in Synthetic Correct Original) for attribute disclosure. Both are expressed as a % of the original records and each can be compared to similar measures calculated from the original data. Experience with using the functions on real data found that some apparent disclosures could be identified as coming from relationships in the data that would be expected to be known to anyone familiar with its features. We flag cases when this seems to have occurred and provide means of excluding them.
Submitted 14 May, 2025; v1 submitted 24 June, 2024;
originally announced June 2024.
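The two recommended measures in the abstract above can be illustrated with a minimal Python sketch. It is not the synthpop implementation: the function names `rep_u` and `disco`, the toy records, and the exact definitions are assumptions based on one reading of the abstract. Here RepU counts original records whose key combination is unique in both the original and the synthetic data, and DiSCO counts original records whose keys appear in the synthetic data with a single target value that matches the original one. Both are returned as a percentage of the original records.

```python
from collections import Counter, defaultdict


def _key_of(record, keys):
    """Tuple of quasi-identifier values for one record."""
    return tuple(record[k] for k in keys)


def rep_u(original, synthetic, keys):
    """Assumed RepU: % of original records unique on `keys` in the original
    data whose key combination is also unique in the synthetic data."""
    orig_counts = Counter(_key_of(r, keys) for r in original)
    syn_counts = Counter(_key_of(r, keys) for r in synthetic)
    replicated = sum(
        1
        for r in original
        if orig_counts[_key_of(r, keys)] == 1 and syn_counts[_key_of(r, keys)] == 1
    )
    return 100.0 * replicated / len(original)


def disco(original, synthetic, keys, target):
    """Assumed DiSCO: % of original records whose keys occur in the synthetic
    data with exactly one target value, and that value is the correct one."""
    syn_targets = defaultdict(set)
    for r in synthetic:
        syn_targets[_key_of(r, keys)].add(r[target])
    hits = sum(
        1 for r in original if syn_targets.get(_key_of(r, keys)) == {r[target]}
    )
    return 100.0 * hits / len(original)


# Hypothetical toy data, invented for illustration only.
original = [
    {"age": "30-39", "sex": "F", "income": "high"},
    {"age": "30-39", "sex": "M", "income": "low"},
    {"age": "40-49", "sex": "F", "income": "low"},
    {"age": "40-49", "sex": "F", "income": "high"},
]
synthetic = [
    {"age": "30-39", "sex": "F", "income": "high"},
    {"age": "30-39", "sex": "M", "income": "high"},
    {"age": "30-39", "sex": "M", "income": "low"},
    {"age": "40-49", "sex": "F", "income": "low"},
]

rep_u_pct = rep_u(original, synthetic, ["age", "sex"])
disco_pct = disco(original, synthetic, ["age", "sex"], "income")
```

On this toy data only the first original record is a replicated unique, and two of the four original records have their income correctly and unambiguously implied by the synthetic data.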
-
Utility and Disclosure Risk for Differentially Private Synthetic Categorical Data
Authors:
Gillian M Raab
Abstract:
This paper introduces two methods of creating differentially private (DP) synthetic data that are now incorporated into the synthpop package for R. Both are suitable for synthesising categorical data, or numeric data grouped into categories. Ten data sets with varying characteristics were used to evaluate the methods. Measures of disclosiveness and of utility were defined and calculated. The first method is to add DP noise to a cross-tabulation of all the variables and create synthetic data by a multinomial sample from the resulting probabilities. While this method certainly reduced disclosure risk, it did not provide synthetic data of adequate quality for any of the data sets. The other method is to create a set of noisy marginal distributions that are made to agree with each other with an iterative proportional fitting algorithm and then to use the fitted probabilities as above. This provided usable synthetic data for most of these data sets at values of the differential privacy parameter $ε$ as low as 0.5. The relationship between the disclosure risk and $ε$ is illustrated for each of the data sets. Results show how the trade-off between disclosiveness and data utility depends on the characteristics of the data sets.
Submitted 26 June, 2022; v1 submitted 2 June, 2022;
originally announced June 2022.
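The first method described in the abstract above (noise added to a full cross-tabulation, followed by a multinomial sample) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the synthpop code: the abstract does not name the noise mechanism, so Laplace noise with scale `sensitivity / epsilon` is assumed, negative noisy counts are simply truncated at zero, and the function name `dp_crosstab_synthesis` and its parameters are hypothetical.

```python
import random
from collections import Counter
from itertools import product


def laplace(scale, rng):
    """Laplace(0, scale) draw, as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_crosstab_synthesis(records, levels, epsilon, n_syn, sensitivity=1.0, seed=0):
    """Sketch: noisy cross-tabulation followed by a multinomial sample.

    records     : list of tuples, one categorical value per variable
    levels      : list of lists giving every allowed level of each variable
    epsilon     : differential privacy parameter
    sensitivity : assumed L1 sensitivity of the table (1 for add/remove one record)
    """
    rng = random.Random(seed)
    cells = list(product(*levels))  # every cell of the table, including empty ones
    counts = Counter(records)
    scale = sensitivity / epsilon
    # Add noise to every cell, then truncate negative values at zero (an assumption).
    noisy = [max(counts[c] + laplace(scale, rng), 0.0) for c in cells]
    if sum(noisy) == 0:
        noisy = [1.0] * len(cells)  # fall back to uniform if everything was truncated
    # Multinomial sample of n_syn records with probabilities proportional to noisy counts.
    return rng.choices(cells, weights=noisy, k=n_syn)


# Hypothetical two-variable example.
levels = [["F", "M"], ["low", "mid", "high"]]
records = [("F", "low")] * 20 + [("M", "high")] * 15 + [("F", "mid")] * 5
syn = dp_crosstab_synthesis(records, levels, epsilon=0.5, n_syn=50, seed=1)
```

Noise is added to every cell, including cells that are empty in the original data, so the number of noisy cells grows with the product of the numbers of levels. That is one way to see why the abstract reports inadequate quality for this approach.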
-
Assessing, visualizing and improving the utility of synthetic data
Authors:
Gillian M Raab,
Beata Nowok,
Chris Dibben
Abstract:
The synthpop package for R (https://www.synthpop.org.uk) provides tools to allow data custodians to create synthetic versions of confidential microdata that can be distributed with fewer restrictions than the original. The synthesis can be customized to ensure that relationships evident in the real data are reproduced in the synthetic data. A number of measures have been proposed to assess this aspect, commonly known as the utility of the synthetic data. We show that all these measures, including those calculated from tabulations, can be derived from a propensity score model. The measures will be reviewed and compared, and relations between them illustrated. All the measures compared are highly correlated and some are shown to be identical. The method used to define the propensity score model is more important than the choice of measure. These measures and methods are incorporated into utility modules in the synthpop package that include methods to visualize the results and thus provide immediate feedback to allow the person creating the synthetic data to improve its quality. The utility functions were originally designed to be used for synthetic data objects of class synds, created by the synthpop functions syn() or syn.strata(), but they can now be used to compare one or more synthesised data sets with the original records, where the records are R data frames or lists of data frames.
Submitted 13 November, 2021; v1 submitted 26 September, 2021;
originally announced September 2021.
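The link between tabular utility measures and a propensity score model, mentioned in the abstract above, can be illustrated with a small Python sketch. It assumes the propensity score mean-squared error (pMSE), a widely used propensity-based measure, and assumes that the propensity for every record in a table cell is estimated as the synthetic share of that cell. The function name `tabular_pmse` and the toy data are hypothetical, and this is not the synthpop implementation.

```python
from collections import Counter


def tabular_pmse(original_cells, synthetic_cells):
    """pMSE computed from a tabulation.

    Each argument is a list with one cell label per record. The propensity of
    a record in cell j is estimated as s_j / (s_j + o_j), the synthetic share
    of that cell, and pMSE is the mean of (propensity - c)^2 over all records
    in the stacked data, where c is the overall synthetic share.
    """
    o = Counter(original_cells)
    s = Counter(synthetic_cells)
    n_total = len(original_cells) + len(synthetic_cells)
    c = len(synthetic_cells) / n_total
    total = 0.0
    for cell in set(o) | set(s):
        n_cell = o[cell] + s[cell]
        propensity = s[cell] / n_cell
        total += n_cell * (propensity - c) ** 2
    return total / n_total


# Hypothetical one-variable example with two cells.
orig = ["a"] * 6 + ["b"] * 4
syn = ["a"] * 4 + ["b"] * 6
pmse_value = tabular_pmse(orig, syn)
```

When the synthetic table matches the original exactly, every cell's propensity equals the overall synthetic share and the measure is zero. Larger values indicate that a record's cell helps to distinguish synthetic records from original ones.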
-
Guidelines for Producing Useful Synthetic Data
Authors:
Gillian M. Raab,
Beata Nowok,
Chris Dibben
Abstract:
We report on our experiences of helping staff of the Scottish Longitudinal Study to create synthetic extracts that can be released to users. In particular, we focus on how the synthesis process can be tailored to produce synthetic extracts that will provide users with similar results to those that would be obtained from the original data. We make recommendations for synthesis methods and illustrate how the staff creating synthetic extracts can evaluate their utility at the time they are being produced. We discuss measures of utility for synthetic data and show that one tabular utility measure is exactly equivalent to a measure calculated from a propensity score. The methods are illustrated by using the R package synthpop to create synthetic versions of data from the 1901 Census of Scotland.
Submitted 11 December, 2017;
originally announced December 2017.