Pylon: Semantic Table Union Search in Data Lakes

Cong, Tianji; Nargesian, Fatemeh; Jagadish, H. V.

arXiv:2301.04901 (cs)

[Submitted on 12 Jan 2023 (v1), last revised 13 Jan 2023 (this version, v2)]

Abstract: The large size and fast growth of data repositories, such as data lakes, have spurred the need for data discovery to help analysts find related data. The problem is challenging because (i) a user typically does not know what datasets exist in an enormous data repository; and (ii) there is usually no unified data model that captures the interrelationships between heterogeneous datasets from disparate sources. In this work, we address one important class of discovery needs: finding union-able tables.
The task is to find tables in a data lake that can be unioned with a given query table. The challenge is to recognize union-able columns even if they are represented differently. In this paper, we propose a data-driven learning approach that casts the problem as an unsupervised representation learning and embedding retrieval task. Our key idea is to exploit self-supervised contrastive learning to learn an embedding model that takes into account the indexing/search data structure and places the embeddings of columns with semantically similar values close together while pushing apart those of columns with semantically dissimilar values. We then find union-able tables based on similarities between their constituent columns in embedding space. On a real-world data lake, we demonstrate that our best-performing model achieves significant improvements in precision ($16\% \uparrow$), recall ($17\% \uparrow$), and query response time ($7\times$ faster) compared to the state of the art.
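
The retrieval step described in the abstract, ranking tables by the similarity of their constituent columns in embedding space, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy three-dimensional column embeddings, the function names `cosine`, `table_score`, and `rank_tables`, the greedy one-to-one column matching, and the `threshold` value. The paper's actual embedding model, index structure, and aggregation may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def table_score(query_cols, cand_cols, threshold=0.7):
    """Score a candidate table against a query table.

    Hypothetical aggregation: greedily match columns one-to-one in
    decreasing order of cosine similarity, keep only pairs at or above
    `threshold`, and sum the similarities of the matched pairs.
    """
    pairs = sorted(
        (
            (cosine(q, c), i, j)
            for i, q in enumerate(query_cols)
            for j, c in enumerate(cand_cols)
        ),
        reverse=True,
    )
    used_q, used_c, total = set(), set(), 0.0
    for sim, i, j in pairs:
        if sim < threshold:
            break  # remaining pairs are even less similar
        if i in used_q or j in used_c:
            continue  # each column may be matched at most once
        used_q.add(i)
        used_c.add(j)
        total += sim
    return total


def rank_tables(query_cols, lake, k=2):
    """Return the names of the k highest-scoring tables in `lake`."""
    scored = [(table_score(query_cols, cols), name) for name, cols in lake.items()]
    scored.sort(key=lambda t: (-t[0], t[1]))  # score descending, then name
    return [name for _, name in scored[:k]]


# Toy data: each table is a list of column embeddings (made-up vectors).
query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
lake = {
    "cities": [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]],  # both columns align
    "sensors": [[0.0, 0.0, 1.0]],                  # no column aligns
    "mixed": [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],   # one column aligns
}
print(rank_tables(query, lake))  # ['cities', 'mixed']
```

In this toy setup the brute-force scan over all tables stands in for the indexing/search data structure mentioned in the abstract; a real data lake would presumably use an approximate nearest-neighbor index over column embeddings instead.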

Comments:	Version submitted to the third round of ICDE 2023 on October 8, 2022
Subjects:	Databases (cs.DB); Information Retrieval (cs.IR)
Cite as:	arXiv:2301.04901 [cs.DB]
	(or arXiv:2301.04901v2 [cs.DB] for this version)
	https://doi.org/10.48550/arXiv.2301.04901

Submission history

From: Tianji Cong
[v1] Thu, 12 Jan 2023 09:51:48 UTC (2,588 KB)
[v2] Fri, 13 Jan 2023 08:07:27 UTC (472 KB)
