Can We Hide Machines in the Crowd? Quantifying Equivalence in LLM-in-the-loop Annotation Tasks

He, Jiaman; Leng, Zikang; McKay, Dana; Spina, Damiano; Trippas, Johanne R.

doi:10.1145/3767695.3769508


arXiv:2510.06658 (cs)

[Submitted on 8 Oct 2025]


Authors:Jiaman He, Zikang Leng, Dana McKay, Damiano Spina, Johanne R. Trippas


Abstract: Many evaluations of large language models (LLMs) in text annotation focus primarily on the correctness of the output, typically comparing model-generated labels to human-annotated "ground truth" using standard performance metrics. In contrast, our study moves beyond effectiveness alone. We aim to explore how labeling decisions -- by both humans and LLMs -- can be statistically evaluated across individuals. Rather than treating LLMs purely as annotation systems, we approach LLMs as an alternative annotation mechanism that may be capable of mimicking the subjective judgments made by humans. To assess this, we develop a statistical evaluation method based on Krippendorff's $\alpha$, paired bootstrapping, and the Two One-Sided t-Tests (TOST) equivalence test procedure. This evaluation method tests whether an LLM can blend into a group of human annotators without being distinguishable.
We apply this approach to two datasets -- MovieLens 100K and PolitiFact -- and find that the LLM is statistically indistinguishable from a human annotator in the former ($p = 0.004$), but not in the latter ($p = 0.155$), highlighting task-dependent differences. The method also enables early evaluation on a small sample of human data to inform whether LLMs are suitable for large-scale annotation in a given application.
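The abstract names three ingredients (Krippendorff's $\alpha$, paired bootstrapping, and TOST) without giving implementation details, so the following is only a minimal sketch of how they could fit together, not the authors' procedure. It assumes nominal labels, no missing annotations, and an equivalence margin chosen by the analyst. It also swaps the t reference distribution for a normal approximation so that only the Python standard library is needed. All function names, the `swap_idx` parameter, and the toy data are hypothetical.

```python
import random
from collections import Counter
from itertools import permutations
from statistics import NormalDist, mean, stdev


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: list of per-item label lists, one label per annotator (no missing data).
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # items with a single label carry no agreement information
        # Every ordered pair of labels from different annotators counts 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        # Alpha is undefined when only one category occurs; returning 1.0
        # is a convention assumed for this sketch.
        return 1.0
    return 1.0 - (n - 1) * observed / expected


def paired_bootstrap_diffs(human, llm, swap_idx, n_boot=1000, seed=0):
    """Bootstrap items; each resample yields alpha(with LLM) - alpha(humans only).

    The LLM replaces the human annotator in column swap_idx (assumed design).
    """
    rng = random.Random(seed)
    mixed = [row[:swap_idx] + [llm[i]] + row[swap_idx + 1:]
             for i, row in enumerate(human)]
    n_items = len(human)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n_items) for _ in range(n_items)]
        # The same resampled items feed both groups, which makes the comparison paired.
        alpha_human = krippendorff_alpha_nominal([human[i] for i in idx])
        alpha_mixed = krippendorff_alpha_nominal([mixed[i] for i in idx])
        diffs.append(alpha_mixed - alpha_human)
    return diffs


def tost_p_value(diffs, margin):
    """Two one-sided tests against [-margin, +margin], normal approximation."""
    d = mean(diffs)
    se = stdev(diffs)  # bootstrap standard error of the alpha difference
    if se == 0:
        return 0.0 if abs(d) < margin else 1.0
    z = NormalDist()
    p_lower = 1.0 - z.cdf((d + margin) / se)  # H0: difference <= -margin
    p_upper = z.cdf((d - margin) / se)        # H0: difference >= +margin
    return max(p_lower, p_upper)


# Hypothetical toy data: 6 items, 3 human annotators, binary labels.
toy_human = [[1, 1, 1], [0, 0, 1], [1, 1, 0], [0, 0, 0], [1, 0, 1], [0, 0, 0]]
toy_llm = [1, 0, 1, 0, 1, 0]
toy_diffs = paired_bootstrap_diffs(toy_human, toy_llm, swap_idx=2, n_boot=200)
toy_p = tost_p_value(toy_diffs, margin=0.1)
```

Under this reading, a small TOST p-value rejects the hypothesis that swapping in the LLM shifts agreement by more than the margin, which is consistent with the abstract treating $p = 0.004$ as evidence of equivalence and $p = 0.155$ as a failure to establish it.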

Comments:	Accepted at SIGIR-AP 2025
Subjects:	Information Retrieval (cs.IR)
Cite as:	arXiv:2510.06658 [cs.IR]
	(or arXiv:2510.06658v1 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.2510.06658
Related DOI:	https://doi.org/10.1145/3767695.3769508

Submission history

From: Jiaman He
[v1] Wed, 8 Oct 2025 05:17:33 UTC (704 KB)
