Something's Fishy In The Data Lake: A Critical Re-evaluation of Table Union Search Benchmarks

Boutaleb, Allaa; Amann, Bernd; Naacke, Hubert; Angarita, Rafael

arXiv:2505.21329 (cs)

[Submitted on 27 May 2025 (v1), last revised 28 May 2025 (this version, v2)]

Abstract: Recent table representation learning and data discovery methods tackle table union search (TUS) within data lakes, which involves identifying tables that can be unioned with a given query table to enrich its content. These methods are commonly evaluated using benchmarks that aim to assess semantic understanding in real-world TUS tasks. However, our analysis of prominent TUS benchmarks reveals several limitations that allow simple baselines to perform surprisingly well, often outperforming more sophisticated approaches. This suggests that current benchmark scores are heavily influenced by dataset-specific characteristics and fail to effectively isolate the gains from semantic understanding. To address this, we propose essential criteria for future benchmarks to enable a more realistic and reliable evaluation of progress in semantic table union search.

Comments:	Accepted @ ACL 2025's Table Representation Learning Workshop (TRL)
Subjects:	Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG)
Cite as:	arXiv:2505.21329 [cs.IR]
	(or arXiv:2505.21329v2 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.2505.21329

Submission history

From: Allaa Boutaleb
[v1] Tue, 27 May 2025 15:23:52 UTC (154 KB)
[v2] Wed, 28 May 2025 11:44:41 UTC (152 KB)
