Zero-Shot Document Understanding using Pseudo Table of Contents-Guided Retrieval-Augmented Generation

Jeong, Hyeon Seong; Jo, Sangwoo; Yoon, Byeong Hyun; Heo, Yoonseok; Jeong, Haedong; Kim, Taehoon

arXiv:2507.23217 (cs)

[Submitted on 31 Jul 2025]

Abstract: Understanding complex multimodal documents remains challenging because their structure is inconsistent and training data is limited. We introduce DocsRay, a training-free document understanding system that integrates pseudo Table of Contents (TOC) generation with hierarchical Retrieval-Augmented Generation (RAG). Our approach uses the native capabilities of multimodal Large Language Models (LLMs) to process documents containing diverse elements such as text, images, charts, and tables without requiring specialized models or additional training. DocsRay combines three key techniques: (1) a semantic structuring module using prompt-based LLM interactions to generate a hierarchical pseudo-TOC, (2) zero-shot multimodal analysis that converts diverse document elements into unified, text-centric representations using the inherent capabilities of multimodal LLMs, and (3) an efficient two-stage hierarchical retrieval system that reduces retrieval complexity from $O(N)$ to $O(S + k_1 \cdot N_s)$. Evaluated on documents averaging 49.4 pages and 20,971 textual tokens, DocsRay reduced query latency from 3.89 to 2.12 seconds, a 45% reduction. On the MMLongBench-Doc benchmark, DocsRay-Pro attains an accuracy of 64.7%, substantially surpassing previous state-of-the-art results.
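
The two-stage retrieval idea in the abstract can be illustrated with a minimal Python sketch. This is not the DocsRay implementation. It assumes that $S$ is the number of pseudo-TOC sections, $N_s$ is the number of chunks per section, and $k_1$ is the number of sections kept after the first stage; the abstract does not define these symbols. The function name `two_stage_retrieve`, the cosine scoring, the parameter `k2`, and the toy 2-D embeddings are all hypothetical choices made for illustration.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def two_stage_retrieve(query, sections, k1=1, k2=1):
    """Hypothetical coarse-to-fine retrieval over a sectioned document.

    sections: list of dicts with keys 'title', 'embedding', and
    'chunks' (a list of (text, embedding) pairs).
    Returns the top-k2 chunks and the number of similarity comparisons.
    """
    # Stage 1: score only the S section-level embeddings.
    ranked = sorted(
        sections, key=lambda s: cosine(query, s["embedding"]), reverse=True
    )
    top_sections = ranked[:k1]

    # Stage 2: score chunks only inside the k1 selected sections,
    # roughly k1 * N_s comparisons instead of all N chunks.
    candidates = [
        (cosine(query, emb), sec["title"], text)
        for sec in top_sections
        for text, emb in sec["chunks"]
    ]
    candidates.sort(key=lambda c: c[0], reverse=True)

    comparisons = len(sections) + len(candidates)
    return candidates[:k2], comparisons


# Toy document: S = 3 sections, N_s = 2 chunks each, N = 6 chunks in total.
sections = [
    {"title": "Methods", "embedding": [1.0, 0.0],
     "chunks": [("retrieval pipeline", [0.9, 0.1]),
                ("pseudo-TOC prompt", [0.8, 0.3])]},
    {"title": "Results", "embedding": [0.0, 1.0],
     "chunks": [("accuracy table", [0.1, 0.9]),
                ("latency numbers", [0.2, 0.8])]},
    {"title": "Appendix", "embedding": [-1.0, 0.0],
     "chunks": [("license", [-0.9, 0.1]),
                ("acknowledgements", [-0.8, -0.2])]},
]

hits, comparisons = two_stage_retrieve([1.0, 0.1], sections, k1=1, k2=1)
print(hits[0][2], comparisons)
```

In this toy setup the search performs $S + k_1 \cdot N_s = 3 + 1 \cdot 2 = 5$ comparisons, whereas a flat scan would score all 6 chunks. The gap widens as the number of chunks per document grows while $S$ and $k_1$ stay small.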

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2507.23217 [cs.LG]
	(or arXiv:2507.23217v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2507.23217

Submission history

From: Taehoon Kim
[v1] Thu, 31 Jul 2025 03:14:45 UTC (234 KB)
