Turning the Tables: Biased, Imbalanced, Dynamic Tabular Datasets for ML Evaluation

Jesus, Sérgio; Pombal, José; Alves, Duarte; Cruz, André; Saleiro, Pedro; Ribeiro, Rita P.; Gama, João; Bizarro, Pedro

arXiv:2211.13358 (cs)

[Submitted on 24 Nov 2022 (v1), last revised 28 Nov 2022 (this version, v2)]

Abstract: Evaluating new techniques on realistic datasets plays a crucial role in the development of ML research and its broader adoption by practitioners. In recent years, there has been a significant increase in publicly available unstructured data resources for computer vision and NLP tasks. However, tabular data -- which is prevalent in many high-stakes domains -- has been lagging behind. To bridge this gap, we present Bank Account Fraud (BAF), the first publicly available, privacy-preserving, large-scale, realistic suite of tabular datasets. The suite was generated by applying state-of-the-art tabular data generation techniques to an anonymized, real-world bank account opening fraud detection dataset. This setting carries a set of challenges that are commonplace in real-world applications, including temporal dynamics and significant class imbalance. Additionally, to allow practitioners to stress test both the performance and the fairness of ML methods, each dataset variant of BAF contains specific types of data bias. With this resource, we aim to provide the research community with a more realistic, complete, and robust test bed to evaluate novel and existing methods.

Comments:	Accepted at NeurIPS 2022.
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2211.13358 [cs.LG]
	(or arXiv:2211.13358v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2211.13358

Submission history

From: Pedro Saleiro
[v1] Thu, 24 Nov 2022 00:03:29 UTC (245 KB)
[v2] Mon, 28 Nov 2022 11:17:46 UTC (245 KB)
