Rethinking Graph-Based Document Classification: Learning Data-Driven Structures Beyond Heuristic Approaches

Bugueño, Margarita; de Melo, Gerard

arXiv:2508.00864 (cs)

[Submitted on 18 Jul 2025]

Abstract: In document classification, graph-based models effectively capture document structure, overcoming sequence length limitations and enhancing contextual understanding. However, most existing graph document representations rely on heuristics, domain-specific rules, or expert knowledge. Unlike previous approaches, we propose a method to learn data-driven graph structures, eliminating the need for manual design and reducing domain dependence. Our approach constructs homogeneous weighted graphs with sentences as nodes, while edges are learned via a self-attention model that identifies dependencies between sentence pairs. A statistical filtering strategy aims to retain only strongly correlated sentences, improving graph quality while reducing the graph size. Experiments on three document classification datasets demonstrate that learned graphs consistently outperform heuristic-based graphs, achieving higher accuracy and $F_1$ score. Furthermore, our study demonstrates the effectiveness of the statistical filtering in improving classification robustness. These results highlight the potential of automatic graph generation over traditional heuristic approaches and open new directions for broader applications in NLP.
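The graph construction described in the abstract (sentences as nodes, attention-derived edge weights, a statistical filter on edges) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the attention model or the filtering rule, so the sketch assumes plain, unlearned scaled dot-product attention over fixed sentence vectors and a hypothetical "mean plus `num_std` standard deviations" threshold. The function names `attention_weights` and `filter_edges` and the toy vectors are illustrative only.

```python
import math
import statistics


def attention_weights(embeddings):
    """Row-wise softmax of scaled dot products between sentence vectors.

    Assumption: stands in for the learned self-attention model, which the
    abstract does not describe in detail.
    """
    dim = len(embeddings[0])
    weights = []
    for query in embeddings:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in embeddings
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def filter_edges(weights, num_std=1.0):
    """Keep directed edges (i, j, w) whose weight is unusually high.

    Assumption: the threshold mean + num_std * std over off-diagonal
    weights is a placeholder for the paper's statistical filtering.
    """
    n = len(weights)
    if n < 2:
        return []  # a single sentence has no sentence pairs
    off_diag = [weights[i][j] for i in range(n) for j in range(n) if i != j]
    threshold = statistics.mean(off_diag) + num_std * statistics.pstdev(off_diag)
    return [
        (i, j, weights[i][j])
        for i in range(n)
        for j in range(n)
        if i != j and weights[i][j] > threshold
    ]


# Toy sentence vectors (hypothetical); real input would come from a sentence encoder.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_weights = attention_weights(toy_embeddings)
toy_edges = filter_edges(toy_weights, num_std=1.0)
for source, target, weight in toy_edges:
    print(f"sentence {source} -> sentence {target}: {weight:.3f}")
```

Because the threshold sits at or above the mean of the off-diagonal weights, the filter always drops at least one candidate edge, which mirrors the stated goal of reducing graph size. How aggressive the filter should be (`num_std` here) is a design choice this sketch leaves open.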

Comments:	7 pages, 3 figures, 3 tables. Appendix starts on page 10
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2508.00864 [cs.CL]
	(or arXiv:2508.00864v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2508.00864

Submission history

From: Margarita Bugueño
[v1] Fri, 18 Jul 2025 12:05:54 UTC (757 KB)
