SiniticMTError: A Machine Translation Dataset with Error Annotations for Sinitic Languages

Liu, Hannah; Min, Junghyun; Cheung, Ethan Yue Heng; Hung, Shou-Yi; Wasti, Syed Mekael; Liang, Runtong; Qian, Shiyao; Zheng, Shizhao; Chan, Elsie; Lo, Ka Ieng Charlotte; Yip, Wing Yu; Tsai, Richard Tzong-Han; Lee, En-Shiun Annie

arXiv:2509.20557 (cs)

[Submitted on 24 Sep 2025]

Abstract: Despite major advances in machine translation (MT) in recent years, progress remains limited for many low-resource languages that lack large-scale training data and linguistic resources. Cantonese and Wu Chinese are two such Sinitic languages, even though each has more than 80 million speakers worldwide. In this paper, we introduce SiniticMTError, a novel dataset that builds on existing parallel corpora to provide error span, error type, and error severity annotations for machine-translated examples from English to Mandarin, Cantonese, and Wu Chinese. The dataset gives the MT community a resource for fine-tuning models with error detection capabilities, supporting research on translation quality estimation, error-aware generation, and low-resource language evaluation. We report our rigorous annotation process, carried out by native speakers, with analyses of inter-annotator agreement, iterative feedback, and patterns in error type and severity.

Comments:	Work in progress. 14 pages, 4 figures, 5 tables
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2509.20557 [cs.CL]
	(or arXiv:2509.20557v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2509.20557

Submission history

From: Junghyun Min
[v1] Wed, 24 Sep 2025 20:50:09 UTC (24,242 KB)
