Can Large Language Models Outperform Non-Experts in Poetry Evaluation? A Comparative Study Using the Consensual Assessment Technique

Sawicki, Piotr; Grześ, Marek; Brown, Dan; Góes, Fabrício

arXiv:2502.19064 (cs)

[Submitted on 26 Feb 2025 (v1), last revised 4 Oct 2025 (this version, v2)]

Abstract: This study adapts the Consensual Assessment Technique (CAT) for Large Language Models (LLMs), introducing a novel methodology for poetry evaluation. Using a 90-poem dataset with a ground truth based on publication venue, we demonstrate that this approach allows LLMs to significantly surpass the performance of non-expert human judges. Our method, which leverages forced-choice ranking within small, randomized batches, enabled Claude-3-Opus to achieve a Spearman's Rank Correlation of 0.87 with the ground truth, dramatically outperforming the best human non-expert evaluation (SRC = 0.38). The LLM assessments also exhibited high inter-rater reliability, underscoring the methodology's robustness. These findings establish that LLMs, when guided by a comparative framework, can be effective and reliable tools for assessing poetry, paving the way for their broader application in other creative domains.
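The two mechanical ingredients named in the abstract, randomized small batches and Spearman's Rank Correlation against a ground truth, can be illustrated with a minimal sketch. This is not the authors' code: the function names `make_batches`, `average_ranks`, and `spearman`, the batch size, and the seed are all assumptions chosen for illustration, and the step where a judge ranks each batch is left out entirely.

```python
import random
from statistics import mean


def make_batches(items, batch_size, seed=0):
    """Shuffle items and split them into consecutive batches.

    batch_size and seed are illustrative assumptions, not values from the paper.
    The last batch may be smaller if len(items) is not a multiple of batch_size.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: the Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if either input is constant (zero rank variance).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

A possible use, under the same assumptions: call `make_batches(range(90), 15)` to obtain six randomized batches of poem indices, collect a ranking for each batch from a judge, aggregate the per-poem scores, and pass the aggregated scores together with the ground-truth labels to `spearman`. Tie-averaged ranks matter here because a ground truth based on publication venue assigns the same label to many poems.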

Comments:	18 pages, 3 figures. Accepted for publication at the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2502.19064 [cs.CL]
	(or arXiv:2502.19064v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.19064

Submission history

From: Piotr Sawicki
[v1] Wed, 26 Feb 2025 11:43:25 UTC (43 KB)
[v2] Sat, 4 Oct 2025 09:24:24 UTC (35 KB)
