
doi 10.1055/a-2385-1355

Does Differentially Private Synthetic Data Lead to Synthetic Discoveries?

Authors: Ileana Montoya Perez, Parisa Movahedi, Valtteri Nieminen, Antti Airola, Tapio Pahikkala

Abstract: Background: Synthetic data has been proposed as a solution for sharing anonymized versions of sensitive biomedical datasets. Ideally, synthetic data should preserve the structure and statistical properties of the original data, while protecting the privacy of the individual subjects. Differential privacy (DP) is currently considered the gold standard approach for balancing this trade-off. Objectives: To investigate the reliability of group differences identified by independent sample tests on DP-synthetic data. The evaluation is conducted in terms of the tests' Type I and Type II errors. The former quantifies the tests' validity, i.e., whether the probability of false discoveries is indeed below the significance level, and the latter indicates the tests' power in making real discoveries. Methods: We evaluate the Mann-Whitney U test, Student's t-test, chi-squared test and median test on DP-synthetic data. The private synthetic datasets are generated from real-world data, including a prostate cancer dataset (n=500) and a cardiovascular dataset (n=70 000), as well as from bivariate and multivariate simulated data. Five different DP-synthetic data generation methods are evaluated, including two basic DP histogram release methods and MWEM, Private-PGM, and DP GAN algorithms. Conclusion: A large portion of the evaluation results showed dramatically inflated Type I errors, especially at privacy budget levels of $\epsilon \leq 1$. This result calls for caution when releasing and analyzing DP-synthetic data: low p-values may be obtained in statistical tests simply as a byproduct of the noise added to protect privacy. A DP smoothed histogram-based synthetic data generation method was shown to produce valid Type I error for all privacy levels tested but required a large original dataset size and a modest privacy budget ($\epsilon \geq 5$) in order to have reasonable Type II error.

Submitted 23 August, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

Journal ref: Methods Inf Med 2024; 63(01/02): 035-051

arXiv:2004.02250

doi 10.1017/S1743921320002793

Study of AGN contribution on morphological parameters of their host galaxies

Authors: Tilahun Getachew-Woreta, Mirjana Pović, Josefa Masegosa, Jaime Perea, Zeleke Beyoro-Amado, Isabel Marquez Perez

Abstract: We tested how the AGN contribution (5% - 75% of the total flux) may affect different morphological parameters commonly used in galaxy classification. We carried out all analyses at $z \sim 0$ and at higher redshifts that correspond to the COSMOS field. Using a local training sample of $>2000$ visually classified galaxies, we carried out all measurements with and without the central source and quantified how the contribution of a bright nuclear point source could affect different morphological parameters, such as the Abraham and Conselice-Bershady indices, Gini, Asymmetry, $M_{20}$ moment of light, and Smoothness. We found that concentration indices are less sensitive to both redshift and brightness in comparison to the other parameters. We also found that all parameters change significantly with AGN contribution. At $z \sim 0$, for an AGN contribution of up to 10% the morphological classification will not be significantly affected, but for an AGN contribution of $\ge$25% late-type spirals follow the range of parameters of elliptical galaxies and can therefore be misclassified as early types.

Submitted 5 April, 2020; originally announced April 2020.

Comments: Proceedings paper of the IAU symposium "Nuclear Activity in Galaxies Across Cosmic Time" (Ethiopia), accepted for publication by Cambridge University Press, eds. M. Pović, P. Marziani, J. Masegosa, H. Netzer, S. H. Negu, and S. B. Tessema

Report number: 2004.02250

arXiv:1801.09386

doi 10.1177/0962280218795190

Tournament Leave-pair-out Cross-validation for Receiver Operating Characteristic (ROC) Analysis

Authors: Ileana Montoya Perez, Antti Airola, Peter J. Boström, Ivan Jambor, Tapio Pahikkala

Abstract: Receiver operating characteristic (ROC) analysis is widely used for evaluating diagnostic systems. Recent studies have shown that estimating the area under the ROC curve (AUC) with standard cross-validation methods suffers from a large bias. The leave-pair-out (LPO) cross-validation has been shown to correct this bias. However, while LPO produces an almost unbiased estimate of AUC, it does not provide a ranking of the data needed for plotting and analyzing the ROC curve. In this study, we propose a new method called tournament leave-pair-out (TLPO) cross-validation. This method extends LPO by creating a tournament from pair comparisons to produce a ranking for the data. TLPO preserves the advantage of LPO for estimating AUC, while it also allows performing ROC analyses. We have shown using both synthetic and real-world data that TLPO is as reliable as LPO for AUC estimation, and confirmed the bias in leave-one-out cross-validation on low-dimensional data. As a case study on ROC analysis, we also evaluate how reliably sensitivity and specificity can be estimated from TLPO ROC curves.

Submitted 24 January, 2024; v1 submitted 29 January, 2018; originally announced January 2018.

Comments: 17 pages, 7 figures

Journal ref: Statistical Methods in Medical Research. 2019;28(10-11):2975-2991

arXiv:1407.2939

doi 10.1051/0004-6361/201424198

CALIFA: a diameter-selected sample for an integral field spectroscopy galaxy survey

Authors: C. J. Walcher, L. Wisotzki, S. Bekeraité, B. Husemann, J. Iglesias-Páramo, N. Backsmann, J. Barrera Ballesteros, C. Catalán-Torrecilla, C. Cortijo, A. del Olmo, B. Garcia Lorenzo, J. Falcón-Barroso, L. Jilkova, V. Kalinova, D. Mast, R. A. Marino, J. Méndez-Abreu, A. Pasquali, S. F. Sánchez, S. Trager, S. Zibetti, J. A. L. Aguerri, J. Alves, J. Bland-Hawthorn, A. Boselli, et al. (26 additional authors not shown)

Abstract: We describe and discuss the selection procedure and statistical properties of the galaxy sample used by the Calar Alto Legacy Integral Field Area Survey (CALIFA), a public legacy survey of 600 galaxies using integral field spectroscopy. The CALIFA "mother sample" was selected from the Sloan Digital Sky Survey (SDSS) DR7 photometric catalogue to include all galaxies with an r-band isophotal major axis between 45" and 79.2" and with a redshift 0.005 < z < 0.03. The mother sample contains 939 objects, 600 of which will be observed in the course of the CALIFA survey. The selection of targets for observations is based solely on visibility and thus keeps the statistical properties of the mother sample. By comparison with a large set of SDSS galaxies, we find that the CALIFA sample is representative of galaxies over a luminosity range of -19 > Mr > -23.1 and over a stellar mass range between 10^9.7 and 10^11.4 Msun. In particular, within these ranges, the diameter selection does not lead to any significant bias against - or in favour of - intrinsically large or small galaxies. Only below luminosities of Mr = -19 (or stellar masses < 10^9.7 Msun) is there a prevalence of galaxies with larger isophotal sizes, especially of nearly edge-on late-type galaxies, but such galaxies form < 10% of the full sample. We estimate volume-corrected distribution functions in luminosities and sizes and show that these are statistically fully compatible with estimates from the full SDSS when accounting for large-scale structure.
We also present a number of value-added quantities determined for the galaxies in the CALIFA sample. We explore different ways of characterizing the environments of CALIFA galaxies, finding that the sample covers environmental conditions from the field to genuine clusters. We finally consider the expected incidence of active galactic nuclei among CALIFA galaxies.

Submitted 10 July, 2014; originally announced July 2014.

Comments: 20 pages, 18 figures, A&A in press

Journal ref: A&A 569, A1 (2014)

Showing 1–4 of 4 results for author: Perez, I M