-
Does Differentially Private Synthetic Data Lead to Synthetic Discoveries?
Authors:
Ileana Montoya Perez,
Parisa Movahedi,
Valtteri Nieminen,
Antti Airola,
Tapio Pahikkala
Abstract:
Background: Synthetic data has been proposed as a solution for sharing anonymized versions of sensitive biomedical datasets. Ideally, synthetic data should preserve the structure and statistical properties of the original data, while protecting the privacy of the individual subjects. Differential privacy (DP) is currently considered the gold standard approach for balancing this trade-off.
Objectives: To investigate the reliability of group differences identified by independent sample tests on DP-synthetic data. The evaluation is conducted in terms of the tests' Type I and Type II errors. The former quantifies the tests' validity, i.e., whether the probability of false discoveries is indeed below the significance level, and the latter indicates the tests' power in making real discoveries.
Methods: We evaluate the Mann-Whitney U test, Student's t-test, chi-squared test and median test on DP-synthetic data. The private synthetic datasets are generated from real-world data, including a prostate cancer dataset (n=500) and a cardiovascular dataset (n=70 000), as well as from bivariate and multivariate simulated data. Five different DP-synthetic data generation methods are evaluated, including two basic DP histogram release methods and the MWEM, Private-PGM, and DP GAN algorithms.
Conclusion: A large portion of the evaluation results showed dramatically inflated Type I errors, especially at privacy budget levels of $ε\leq 1$. This result calls for caution when releasing and analyzing DP-synthetic data: low p-values may be obtained in statistical tests simply as a byproduct of the noise added to protect privacy. A DP smoothed histogram-based synthetic data generation method was shown to produce valid Type I error for all privacy levels tested but required a large original dataset size and a modest privacy budget ($ε\geq 5$) in order to have reasonable Type II error.
Submitted 23 August, 2024; v1 submitted 20 March, 2024;
originally announced March 2024.
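The Type I error inflation described in the abstract above can be illustrated with a toy simulation. This is only a minimal sketch and not the authors' experimental setup: it assumes a 2x2 contingency table released with the Laplace mechanism (noise scale 1/ε, i.e. L1 sensitivity 1 under add/remove-one-record neighbours), with negative counts clipped to zero and rounded. The function names, the sample size `n=50`, and the number of trials are all hypothetical choices made for illustration.

```python
import math
import random


def laplace_noise(rng, scale):
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def chi2_pvalue_2x2(table):
    """Pearson chi-squared test of independence (1 d.o.f., no continuity correction)."""
    (a, b), (c, d) = table
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        return 1.0  # degenerate table: treat as no evidence against the null
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-squared distribution with 1 d.o.f.
    return math.erfc(math.sqrt(stat / 2.0))


def dp_histogram(table, epsilon, rng):
    """Laplace mechanism on the cell counts, then clip and round (assumed post-processing)."""
    scale = 1.0 / epsilon  # assumption: L1 sensitivity of the histogram is 1
    return [[max(0, round(cell + laplace_noise(rng, scale))) for cell in row]
            for row in table]


def estimate_type1_error(epsilon, n=50, trials=500, alpha=0.05, seed=0):
    """Fraction of null datasets whose DP-noised table is declared significant."""
    rng = random.Random(seed)
    rejections = 0
    for _trial in range(trials):
        # Null hypothesis holds: group and outcome are independent fair coins.
        table = [[0, 0], [0, 0]]
        for _record in range(n):
            table[rng.randrange(2)][rng.randrange(2)] += 1
        synthetic = dp_histogram(table, epsilon, rng)
        if chi2_pvalue_2x2(synthetic) < alpha:
            rejections += 1
    return rejections / trials


rate_strict = estimate_type1_error(epsilon=0.1)    # strong privacy, heavy noise
rate_loose = estimate_type1_error(epsilon=100.0)   # almost no noise
print(f"estimated Type I error: eps=0.1 -> {rate_strict:.3f}, eps=100 -> {rate_loose:.3f}")
```

In this toy setting the test treats the noisy counts as if they were real observations, so the privacy noise is mistaken for a group difference and the rejection rate at small ε rises well above the nominal level.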
-
Study of AGN contribution on morphological parameters of their host galaxies
Authors:
Tilahun Getachew-Woreta,
Mirjana Pović,
Josefa Masegosa,
Jaime Perea,
Zeleke Beyoro-Amado,
Isabel Marquez Perez
Abstract:
We tested how the AGN contribution (5% - 75% of the total flux) may affect different morphological parameters commonly used in galaxy classification. We carried out all analysis at $z\,\sim\,0$ and at higher redshifts that correspond to the COSMOS field. Using a local training sample of $>\,2000$ visually classified galaxies, we carried out all measurements with and without the central source and quantified how the contribution of a bright nuclear point source could affect different morphological parameters, such as: Abraham and Conselice-Bershady indices, Gini, Asymmetry, $M_{20}$ moment of light, and Smoothness. We found that concentration indexes are less sensitive to both redshift and brightness in comparison to the other parameters. We also found that all parameters change significantly with AGN contribution. At $z\sim0$, for AGN contributions of up to 10% the morphological classification will not be significantly affected, but for $\ge$25% of AGN contribution late-type spirals follow the range of parameters of elliptical galaxies and can therefore be misclassified as early types.
Submitted 5 April, 2020;
originally announced April 2020.
-
Tournament Leave-pair-out Cross-validation for Receiver Operating Characteristic (ROC) Analysis
Authors:
Ileana Montoya Perez,
Antti Airola,
Peter J. Boström,
Ivan Jambor,
Tapio Pahikkala
Abstract:
Receiver operating characteristic (ROC) analysis is widely used for evaluating diagnostic systems. Recent studies have shown that estimating the area under the ROC curve (AUC) with standard cross-validation methods suffers from a large bias. The leave-pair-out (LPO) cross-validation has been shown to correct this bias. However, while LPO produces an almost unbiased estimate of AUC, it does not provide a ranking of the data needed for plotting and analyzing the ROC curve. In this study, we propose a new method called tournament leave-pair-out (TLPO) cross-validation. This method extends LPO by creating a tournament from pair comparisons to produce a ranking for the data. TLPO preserves the advantage of LPO for estimating AUC, while it also allows performing ROC analyses. We have shown using both synthetic and real-world data that TLPO is as reliable as LPO for AUC estimation, and confirmed the bias in leave-one-out cross-validation on low-dimensional data. As a case study on ROC analysis, we also evaluate how reliably sensitivity and specificity can be estimated from TLPO ROC curves.
Submitted 24 January, 2024; v1 submitted 29 January, 2018;
originally announced January 2018.
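The tournament idea in the abstract above can be sketched roughly as follows. This is an illustrative reading of the description, not the authors' implementation: every pair of data points is held out in turn, a model is trained on the remaining points, and the held-out point with the higher predicted score wins the comparison; each point's number of wins then serves as its ranking score. The learner `train_mean_difference`, the helper names, and the example data are hypothetical and chosen only to keep the sketch self-contained.

```python
from itertools import combinations


def train_mean_difference(X, y):
    """Hypothetical toy learner: score(x) = dot(x, mean_positive - mean_negative)."""
    dim = len(X[0])
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    w = [sum(x[k] for x in pos) / len(pos) - sum(x[k] for x in neg) / len(neg)
         for k in range(dim)]
    return lambda x: sum(wk * xk for wk, xk in zip(w, x))


def tlpo_scores(X, y, train=train_mean_difference):
    """Hold out every pair, train on the rest, and count pairwise wins per point."""
    if min(sum(y), len(y) - sum(y)) < 3:
        # With fewer than 3 points in a class, a held-out pair could empty that class.
        raise ValueError("need at least 3 examples per class")
    n = len(X)
    wins = [0.0] * n
    for i, j in combinations(range(n), 2):
        keep = [k for k in range(n) if k != i and k != j]
        model = train([X[k] for k in keep], [y[k] for k in keep])
        score_i, score_j = model(X[i]), model(X[j])
        if score_i > score_j:
            wins[i] += 1.0
        elif score_j > score_i:
            wins[j] += 1.0
        else:  # tie: split the point
            wins[i] += 0.5
            wins[j] += 0.5
    return wins


def auc_from_scores(scores, y):
    """Wilcoxon-Mann-Whitney AUC: share of positive-negative pairs ranked correctly."""
    pos = [s for s, label in zip(scores, y) if label == 1]
    neg = [s for s, label in zip(scores, y) if label == 0]
    correct = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return correct / (len(pos) * len(neg))


# Small, well-separated one-dimensional example (made-up data).
X = [[0.0], [0.2], [0.4], [0.1], [2.0], [2.3], [1.9], [2.2]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
scores = tlpo_scores(X, y)
auc = auc_from_scores(scores, y)
print("tournament scores:", scores)
print("AUC from tournament ranking:", auc)
```

Because the tournament scores give a full ranking of the data, they can be thresholded to trace out an ROC curve, which a single pairwise AUC estimate does not allow.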
-
CALIFA: a diameter-selected sample for an integral field spectroscopy galaxy survey
Authors:
C. J. Walcher,
L. Wisotzki,
S. Bekeraité,
B. Husemann,
J. Iglesias-Páramo,
N. Backsmann,
J. Barrera Ballesteros,
C. Catalán-Torrecilla,
C. Cortijo,
A. del Olmo,
B. Garcia Lorenzo,
J. Falcón-Barroso,
L. Jilkova,
V. Kalinova,
D. Mast,
R. A. Marino,
J. Méndez-Abreu,
A. Pasquali,
S. F. Sánchez,
S. Trager,
S. Zibetti,
J. A. L. Aguerri,
J. Alves,
J. Bland-Hawthorn,
A. Boselli,
et al. (26 additional authors not shown)
Abstract:
We describe and discuss the selection procedure and statistical properties of the galaxy sample used by the Calar Alto Legacy Integral Field Area Survey (CALIFA), a public legacy survey of 600 galaxies using integral field spectroscopy. The CALIFA "mother sample" was selected from the Sloan Digital Sky Survey (SDSS) DR7 photometric catalogue to include all galaxies with an r-band isophotal major axis between 45" and 79.2" and with a redshift 0.005 < z < 0.03. The mother sample contains 939 objects, 600 of which will be observed in the course of the CALIFA survey. The selection of targets for observations is based solely on visibility and thus keeps the statistical properties of the mother sample. By comparison with a large set of SDSS galaxies, we find that the CALIFA sample is representative of galaxies over a luminosity range of -19 > Mr > -23.1 and over a stellar mass range between 10^9.7 and 10^11.4Msun. In particular, within these ranges, the diameter selection does not lead to any significant bias against - or in favour of - intrinsically large or small galaxies. Only below luminosities of Mr = -19 (or stellar masses < 10^9.7Msun) is there a prevalence of galaxies with larger isophotal sizes, especially of nearly edge-on late-type galaxies, but such galaxies form < 10% of the full sample. We estimate volume-corrected distribution functions in luminosities and sizes and show that these are statistically fully compatible with estimates from the full SDSS when accounting for large-scale structure. We also present a number of value-added quantities determined for the galaxies in the CALIFA sample. We explore different ways of characterizing the environments of CALIFA galaxies, finding that the sample covers environmental conditions from the field to genuine clusters. We finally consider the expected incidence of active galactic nuclei among CALIFA galaxies.
Submitted 10 July, 2014;
originally announced July 2014.