The Topic Confusion Task: A Novel Scenario for Authorship Attribution

Altakrori, Malik H.; Cheung, Jackie Chi Kit; Fung, Benjamin C. M.

arXiv:2104.08530 (cs)

[Submitted on 17 Apr 2021 (v1), last revised 9 Sep 2021 (this version, v2)]

Abstract: Authorship attribution is the problem of identifying the most plausible author of an anonymous text from a set of candidate authors. Researchers have investigated same-topic and cross-topic scenarios of authorship attribution, which differ according to whether new, unseen topics are used in the testing phase. However, neither scenario allows us to explain whether errors are caused by a failure to capture authorship writing style or by a topic shift. Motivated by this, we propose the topic confusion task, where we switch the author-topic configuration between the training and testing sets. This setup allows us to distinguish two types of errors: those caused by the topic shift and those caused by the features' inability to capture the writing styles. We show that stylometric features with part-of-speech tags are the least susceptible to topic variations. We further show that combining them with other features leads to significantly lower topic confusion and higher attribution accuracy. Finally, we show that pretrained language models such as BERT and RoBERTa perform poorly on this task and are surpassed by simple features such as word-level n-grams.
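
The two error types described in the abstract can be illustrated with a minimal sketch. This is one reading of the setup, not the authors' implementation: it assumes each candidate author writes on exactly one topic in training and on a different topic in testing, and it treats a wrong prediction as topic confusion when the predicted author wrote on the test document's topic during training. The function name `classify_outcome`, the mapping `train_topic_of`, and the toy authors and topics are all hypothetical.

```python
from collections import Counter


def classify_outcome(true_author, predicted_author, test_topic, train_topic_of):
    """Label a single prediction as correct, topic confusion, or style error.

    train_topic_of maps each candidate author to the topic they wrote on
    in the training set (assumed: one training topic per author).
    """
    if predicted_author == true_author:
        return "correct"
    # The model picked the author it saw writing on this topic in training,
    # so the error is attributed to the topic shift.
    if train_topic_of.get(predicted_author) == test_topic:
        return "topic_confusion"
    # Any other wrong author: the features failed to capture writing style.
    return "style_error"


# Hypothetical training configuration: author -> topic.
train_topic_of = {"A": "politics", "B": "sports", "C": "culture"}

# Hypothetical test predictions as (true author, predicted author, test topic).
# In testing, the author-topic pairing is switched: A writes on sports, etc.
predictions = [
    ("A", "A", "sports"),   # correct
    ("A", "B", "sports"),   # B wrote on sports in training
    ("A", "C", "sports"),   # C never wrote on sports
    ("B", "C", "culture"),  # C wrote on culture in training
]

counts = Counter(
    classify_outcome(true, pred, topic, train_topic_of)
    for true, pred, topic in predictions
)
print(dict(counts))
```

Under these assumptions, a feature set that is robust to topic variation would show a low share of `topic_confusion` outcomes relative to the other two categories.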

Comments:	15 pages (9 + references and appendix), 6 figures, accepted to Findings of EMNLP 2021
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2104.08530 [cs.CL]
	(or arXiv:2104.08530v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2104.08530

Submission history

From: Malik Altakrori
[v1] Sat, 17 Apr 2021 12:50:58 UTC (145 KB)
[v2] Thu, 9 Sep 2021 15:28:52 UTC (398 KB)
