Auto-Validate by-History: Auto-Program Data Quality Constraints to Validate Recurring Data Pipelines

Tu, Dezhan; He, Yeye; Cui, Weiwei; Ge, Song; Zhang, Haidong; Shi, Han; Zhang, Dongmei; Chaudhuri, Surajit

arXiv:2306.02421 (cs)

[Submitted on 4 Jun 2023]

Abstract: Data pipelines are widely employed in modern enterprises to power a variety of Machine-Learning (ML) and Business-Intelligence (BI) applications. Crucially, these pipelines are recurring (e.g., daily or hourly) in production settings, keeping data up to date so that ML models can be re-trained regularly and BI dashboards refreshed frequently. However, data quality (DQ) issues can often creep into recurring pipelines because of upstream schema and data drift over time. As modern enterprises operate thousands of recurring pipelines, data engineers today have to spend substantial effort manually monitoring and resolving DQ issues as part of their DataOps and MLOps practices.
Given the high human cost of managing large-scale pipeline operations, it is imperative to automate as much as possible. In this work, we propose Auto-Validate-by-History (AVH), which automatically detects DQ issues in recurring pipelines by leveraging rich statistics from historical executions. We formalize this as an optimization problem, and develop constant-factor approximation algorithms with provable precision guarantees. Extensive evaluations using 2000 production data pipelines at Microsoft demonstrate the effectiveness and efficiency of AVH.
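
To make the idea of validating a new run against historical executions concrete, here is a minimal Python sketch. It is not the AVH algorithm from the paper: the abstract does not specify which statistics or thresholds AVH uses, so the choice of metric (row count), the function names `fit_constraint` and `validate`, and the k-sigma rule are all assumptions made for illustration. The sketch derives an acceptable range for one metric from past runs and flags a new run that falls outside it.

```python
from statistics import mean, pstdev

def fit_constraint(history, k=3.0):
    """Derive an acceptable [low, high] range for one metric from past runs.

    Hypothetical helper: uses mean +/- k population standard deviations.
    By Chebyshev's inequality, at most 1/k**2 of values drawn from the
    same distribution fall outside this range, whatever its shape.
    """
    mu = mean(history)
    sigma = pstdev(history)
    return (mu - k * sigma, mu + k * sigma)

def validate(value, constraint):
    """Return True if the new run's metric satisfies the constraint."""
    low, high = constraint
    return low <= value <= high

# Illustrative (made-up) row counts from five previous daily runs.
row_counts = [1000, 1020, 980, 1010, 990]
constraint = fit_constraint(row_counts)

print(validate(1005, constraint))  # a run close to the historical counts
print(validate(200, constraint))   # a run with a sharp drop in rows
```

In this sketch the threshold `k` trades off false alarms against missed issues; the paper instead frames the selection of constraints as an optimization problem with precision guarantees, which this simple rule does not attempt to reproduce.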

Comments:	full version of a paper accepted to KDD 2023
Subjects:	Databases (cs.DB); Machine Learning (cs.LG)
Cite as:	arXiv:2306.02421 [cs.DB]
	(or arXiv:2306.02421v1 [cs.DB] for this version)
	https://doi.org/10.48550/arXiv.2306.02421

Submission history

From: Yeye He
[v1] Sun, 4 Jun 2023 17:53:30 UTC (12,481 KB)
