ChapterBreak: A Challenge Dataset for Long-Range Language Models

Sun, Simeng; Thai, Katherine; Iyyer, Mohit

arXiv:2204.10878 (cs)

[Submitted on 22 Apr 2022]

Abstract: While numerous architectures for long-range language models (LRLMs) have recently been proposed, a meaningful evaluation of their discourse-level language understanding capabilities has not yet followed. To this end, we introduce ChapterBreak, a challenge dataset that provides an LRLM with a long segment from a narrative that ends at a chapter boundary and asks it to distinguish the beginning of the ground-truth next chapter from a set of negative segments sampled from the same narrative. A fine-grained human annotation reveals that our dataset contains many complex types of chapter transitions (e.g., parallel narratives, cliffhanger endings) that require processing global context to comprehend. Experiments on ChapterBreak show that existing LRLMs fail to effectively leverage long-range context, substantially underperforming a segment-level model trained directly for this task. We publicly release our ChapterBreak dataset to spur more principled future research into LRLMs.
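
The task described in the abstract can be read as a multiple-choice selection problem: score each candidate continuation against the prefix and check whether the top-scored one is the ground-truth next chapter. The sketch below illustrates that reading only. The names `Example`, `pick_continuation`, `accuracy`, and `overlap_score` are hypothetical and do not come from the released dataset or code, and the word-overlap scorer is a toy stand-in for a model. For an actual LRLM, one natural choice would be the likelihood of the candidate given the prefix, which is an assumption here and not something the abstract states.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Example:
    # Hypothetical container; field names are illustrative, not the dataset's schema.
    prefix: str                 # long narrative segment ending at a chapter boundary
    candidates: Sequence[str]   # ground-truth chapter beginning plus negative segments
    gold_index: int             # position of the ground-truth beginning in candidates


# A scorer maps (prefix, candidate) to a number; higher means "more plausible".
Scorer = Callable[[str, str], float]


def pick_continuation(example: Example, score: Scorer) -> int:
    """Return the index of the candidate that the scorer rates highest."""
    scores = [score(example.prefix, c) for c in example.candidates]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(examples: Sequence[Example], score: Scorer) -> float:
    """Fraction of examples where the top-scored candidate is the gold one."""
    if not examples:
        return 0.0
    correct = sum(pick_continuation(ex, score) == ex.gold_index for ex in examples)
    return correct / len(examples)


def overlap_score(prefix: str, candidate: str) -> float:
    """Toy stand-in for a model: share of candidate words that appear in the prefix."""
    seen = set(prefix.lower().split())
    words = candidate.lower().split()
    return sum(w in seen for w in words) / max(len(words), 1)
```

Any function with the `Scorer` signature can be passed to `accuracy`, so a model-based scorer could replace `overlap_score` without changing the evaluation loop.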

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2204.10878 [cs.CL]
	(or arXiv:2204.10878v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2204.10878

Submission history

From: Simeng Sun
[v1] Fri, 22 Apr 2022 18:20:23 UTC (6,487 KB)
