TableRAG: Million-Token Table Understanding with Language Models

Chen, Si-An; Miculicich, Lesly; Eisenschlos, Julian Martin; Wang, Zifeng; Wang, Zilong; Chen, Yanfei; Fujii, Yasuhisa; Lin, Hsuan-Tien; Lee, Chen-Yu; Pfister, Tomas

arXiv:2410.04739 (cs)

[Submitted on 7 Oct 2024 (v1), last revised 26 Dec 2024 (this version, v3)]

Abstract: Recent advancements in language models (LMs) have notably enhanced their ability to reason with tabular data, primarily through program-aided mechanisms that manipulate and analyze tables. However, these methods often require the entire table as input, leading to scalability challenges due to positional bias or context length constraints. In response to these challenges, we introduce TableRAG, a Retrieval-Augmented Generation (RAG) framework specifically designed for LM-based table understanding. TableRAG leverages query expansion combined with schema and cell retrieval to pinpoint crucial information before providing it to the LMs. This enables more efficient data encoding and precise retrieval, significantly reducing prompt lengths and mitigating information loss. We have developed two new million-token benchmarks from the Arcade and BIRD-SQL datasets to thoroughly evaluate TableRAG's effectiveness at scale. Our results demonstrate that TableRAG's retrieval design achieves the highest retrieval quality, leading to new state-of-the-art performance on large-scale table understanding.
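
The retrieval idea in the abstract (expanded queries, then schema and cell retrieval, then a short prompt) can be illustrated with a toy sketch. This is not the authors' implementation. Every name here (`overlap`, `top_k`, `build_prompt`, the sample table) is hypothetical, and two simplifications are assumed: the expanded queries are hard-coded where an LM would generate them, and token overlap replaces a real text encoder.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric tokens; a stand-in for a real text encoder."""
    return re.findall(r"[a-z0-9]+", str(text).lower())


def overlap(query, text):
    """Toy relevance score: number of shared tokens (an assumption)."""
    q, t = Counter(tokens(query)), Counter(tokens(text))
    return sum((q & t).values())


def schema_candidates(columns):
    """One retrievable item per column name."""
    return [(name, name) for name in columns]


def cell_candidates(columns, rows):
    """One retrievable item per distinct (column, value) pair."""
    seen = {}
    for row in rows:
        for col, val in zip(columns, row):
            seen.setdefault((col, val), f"{col} {val}")
    return [(text, key) for key, text in seen.items()]


def top_k(queries, candidates, k):
    """Keep the k best-matching items, scored by the best expanded query."""
    scored = []
    for text, payload in candidates:
        score = max(overlap(q, text) for q in queries)
        if score > 0:
            scored.append((score, payload))
    scored.sort(key=lambda s: -s[0])  # stable: ties keep table order
    return [payload for _, payload in scored[:k]]


def build_prompt(question, columns, rows, schema_queries, cell_queries, k=2):
    """Assemble a short prompt from retrieved columns and cells only."""
    cols = top_k(schema_queries, schema_candidates(columns), k)
    cells = top_k(cell_queries, cell_candidates(columns, rows), k)
    lines = ["Relevant columns: " + ", ".join(cols)]
    lines += [f"Relevant cell: {c} = {v}" for c, v in cells]
    lines.append("Question: " + question)
    return "\n".join(lines)


# Hypothetical sample table and hand-written "expanded" queries.
columns = ["product", "category", "price"]
rows = [
    ["wallet", "accessories", 25],
    ["laptop", "electronics", 900],
    ["belt", "accessories", 15],
]
question = "What is the average price for wallets?"
schema_queries = ["product", "price"]  # assumed LM output
cell_queries = ["wallet"]              # assumed LM output

prompt = build_prompt(question, columns, rows, schema_queries, cell_queries)
print(prompt)
```

In this sketch the prompt grows with the number of retrieved items `k`, not with the number of rows, which is the property the abstract attributes to retrieving schema and cells instead of passing the whole table.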

Comments:	Accepted to NeurIPS 2024
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Cite as:	arXiv:2410.04739 [cs.CL]
	(or arXiv:2410.04739v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2410.04739

Submission history

From: Si-An Chen
[v1] Mon, 7 Oct 2024 04:15:02 UTC (1,973 KB)
[v2] Tue, 24 Dec 2024 13:18:49 UTC (1,973 KB)
[v3] Thu, 26 Dec 2024 13:58:31 UTC (1,973 KB)
