Programmable Synthetic Tabular Data Generation

Vero, Mark; Balunović, Mislav; Vechev, Martin

Computer Science > Machine Learning

arXiv:2307.03577v2 (cs)

[Submitted on 7 Jul 2023 (v1), revised 10 Jul 2023 (this version, v2), latest version 2 Jun 2024 (v4)]

Abstract: Large amounts of tabular data remain underutilized due to privacy, data quality, and data sharing limitations. While training a generative model that produces synthetic data resembling the original distribution addresses some of these issues, most applications impose additional constraints on the generated data. Existing synthetic data approaches are limited as they typically handle only specific constraints, e.g., differential privacy (DP) or increased fairness, and lack an accessible interface for declaring general specifications. In this work, we introduce ProgSyn, the first programmable synthetic tabular data generation algorithm that allows for comprehensive customization of the generated data. To ensure high data quality while adhering to custom specifications, ProgSyn pre-trains a generative model on the original dataset and fine-tunes it on a differentiable loss automatically derived from the provided specifications. These specifications can be declared programmatically using statistical and logical expressions, supporting a wide range of requirements (e.g., DP or fairness, among others). We conduct an extensive experimental evaluation of ProgSyn on a number of constraints, achieving a new state of the art on some, while remaining general. For instance, at the same fairness level we achieve 2.3% higher downstream accuracy than the state of the art in fair synthetic data generation on the Adult dataset. Overall, ProgSyn provides a versatile and accessible framework for generating constrained synthetic tabular data, allowing for specifications that generalize beyond the capabilities of prior work.
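
The pre-train then fine-tune idea in the abstract can be illustrated with a toy sketch. This is not ProgSyn's code or interface: the four-outcome "generator", the squared parity-gap penalty, the KL fidelity term, the finite-difference gradients, and every name and hyperparameter below (`parity_gap`, `fine_tune`, `weight`, `lr`, `steps`, the `reference` table) are assumptions chosen only to show how a fairness-style specification might be turned into a differentiable loss and minimized while staying close to the original distribution.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def parity_gap(probs):
    """P(y=1 | g=0) - P(y=1 | g=1) for a joint table ordered as
    (g=0,y=0), (g=0,y=1), (g=1,y=0), (g=1,y=1)."""
    p0 = probs[1] / (probs[0] + probs[1])
    p1 = probs[3] / (probs[2] + probs[3])
    return p0 - p1

def loss(logits, reference, weight):
    """Fidelity term plus a smooth penalty for the (hypothetical) specification."""
    probs = softmax(logits)
    # KL(reference || probs) keeps the tuned model near the "pre-trained" one.
    kl = sum(r * math.log(r / p) for r, p in zip(reference, probs))
    # Squared gap is smooth, unlike the absolute value.
    return kl + weight * parity_gap(probs) ** 2

def numeric_grad(f, x, eps=1e-6):
    """Central finite differences; stands in for automatic differentiation."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + eps] + x[i + 1:]
        down = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def fine_tune(reference, weight=10.0, lr=0.1, steps=2000):
    """Start from the reference distribution and run plain gradient descent."""
    logits = [math.log(r) for r in reference]  # the "pre-trained" starting point
    for _ in range(steps):
        g = numeric_grad(lambda z: loss(z, reference, weight), logits)
        logits = [z - lr * gi for z, gi in zip(logits, g)]
    return softmax(logits)

# Made-up joint distribution over (group, label); not taken from any dataset.
reference = [0.35, 0.15, 0.20, 0.30]
tuned = fine_tune(reference)
print("gap before:", abs(parity_gap(reference)))
print("gap after: ", abs(parity_gap(tuned)))
```

In this sketch, `weight` trades fidelity against the constraint: a larger value pushes the gap closer to zero at the cost of moving further from the reference table. A realistic system would replace the explicit table with a neural generator and the finite differences with automatic differentiation.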

Subjects:	Machine Learning (cs.LG); Databases (cs.DB); Programming Languages (cs.PL)
Cite as:	arXiv:2307.03577 [cs.LG]
	(or arXiv:2307.03577v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2307.03577

Submission history

From: Mark Vero
[v1] Fri, 7 Jul 2023 13:10:23 UTC (317 KB)
[v2] Mon, 10 Jul 2023 14:22:24 UTC (317 KB)
[v3] Thu, 15 Feb 2024 14:51:54 UTC (383 KB)
[v4] Sun, 2 Jun 2024 08:56:06 UTC (385 KB)
