CACTI: Leveraging Copy Masking and Contextual Information to Improve Tabular Data Imputation

Gorla, Aditya; Wang, Ryan; Liu, Zhengtong; An, Ulzee; Sankararaman, Sriram

arXiv:2506.02306 (cs)

[Submitted on 2 Jun 2025]

Abstract: We present CACTI, a masked autoencoding approach for imputing tabular data that leverages both the structure in missingness patterns and contextual information. CACTI employs a novel median truncated copy masking training strategy that encourages the model to learn from empirical patterns of missingness, and it incorporates semantic relationships between features, captured by column names and text descriptions, to better represent feature dependence. These dual sources of inductive bias enable CACTI to outperform state-of-the-art methods across a diverse range of datasets and missingness conditions, with an average $R^2$ gain of 7.8% over the next best method (13.4%, 6.1%, and 5.3% under missing not at random, missing at random, and missing completely at random, respectively). Our results highlight the value of leveraging dataset-specific contextual information and missingness patterns to enhance imputation performance.
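
To make the copy-masking idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "copy masking" means hiding observed entries of one row according to the missingness pattern of another row in the same dataset, so that the training masks follow the empirical missingness structure. The function name `copy_mask`, the random donor assignment, and the toy matrix are all hypothetical. The median truncation step and the column-name and text-description context are not shown, because the abstract does not specify them.

```python
import numpy as np


def copy_mask(observed: np.ndarray, rng: np.random.Generator):
    """Build a training mask by borrowing missingness patterns from other rows.

    observed: boolean array (n_rows, n_cols), True where a value is present.
    Returns (hidden, donors):
      hidden: boolean array, True where an observed value is hidden from
              the model during training (it becomes a reconstruction target).
      donors: for each row, the index of the row whose pattern was copied.
    """
    n_rows = observed.shape[0]
    # Assumption: each row gets one donor row chosen by a random permutation.
    donors = rng.permutation(n_rows)
    donor_missing = ~observed[donors]
    # Only entries that are truly observed can be hidden and later scored.
    hidden = observed & donor_missing
    return hidden, donors


# Toy data (hypothetical): NaN marks a missing value.
X = np.array([
    [1.0, 2.0, np.nan],
    [4.0, np.nan, 6.0],
    [np.nan, 8.0, 9.0],
    [1.5, 2.5, 3.5],
])
observed = ~np.isnan(X)
rng = np.random.default_rng(0)

hidden, donors = copy_mask(observed, rng)
# Model input: additionally blank out the hidden entries.
X_in = np.where(hidden, np.nan, X)
```

In this sketch a masked autoencoder would receive `X_in` and be trained to reconstruct the entries of `X` selected by `hidden`. A row whose donor has no missing values, or whose donor is the row itself, contributes no hidden entries.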

Subjects:	Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:2506.02306 [cs.LG]
	(or arXiv:2506.02306v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2506.02306
Journal reference:	In Proc. 42nd International Conference on Machine Learning (ICML 2025 Spotlight)

Submission history

From: Aditya Gorla
[v1] Mon, 2 Jun 2025 22:50:22 UTC (1,283 KB)
