Revisiting Metric Reliability for Fine-grained Evaluation of Machine Translation and Summarization in Indian Languages

Yari, Amir Hossein; Kulkarni, Kalmit; Khan, Ahmad Raza; Koto, Fajri

arXiv:2510.07061 (cs)

[Submitted on 8 Oct 2025]

Abstract: While automatic metrics drive progress in Machine Translation (MT) and Text Summarization (TS), existing metrics have been developed and validated almost exclusively for English and other high-resource languages. This narrow focus leaves Indian languages, spoken by over 1.5 billion people, largely overlooked, casting doubt on the universality of current evaluation practices. To address this gap, we introduce ITEM, a large-scale benchmark that systematically evaluates the alignment of 26 automatic metrics with human judgments across six major Indian languages, enriched with fine-grained annotations. Our extensive evaluation covers agreement with human judgments, sensitivity to outliers, language-specific reliability, inter-metric correlations, and resilience to controlled perturbations. It reveals four central findings: (1) LLM-based evaluators show the strongest alignment with human judgments at both segment and system levels; (2) outliers exert a significant impact on metric-human agreement; (3) in TS, metrics are more effective at capturing content fidelity, whereas in MT, they better reflect fluency; and (4) metrics differ in their robustness and sensitivity when subjected to diverse perturbations. Collectively, these findings offer critical guidance for advancing metric design and evaluation in Indian languages.
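The abstract reports metric-human agreement at both the segment level and the system level, and notes that outliers affect that agreement. The sketch below shows how these quantities are commonly computed. It is not the ITEM benchmark's code, and the abstract does not specify its correlation statistic or outlier rule. The sketch assumes Pearson correlation and a median-absolute-deviation (MAD) filter that drops outlier systems by their human scores, a convention from earlier MT metrics work. The helper names (`segment_level`, `system_level`, `drop_outliers`) and the toy scores are hypothetical.

```python
from math import sqrt
from statistics import mean, median


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def drop_outliers(values, threshold=2.5):
    """Return indices of values whose robust z-score (MAD-based) is within threshold.

    Assumed rule: z = |v - median| / (1.4826 * MAD); values with z > threshold are dropped.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return list(range(len(values)))
    return [i for i, v in enumerate(values) if abs(v - med) / (1.4826 * mad) <= threshold]


def segment_level(scores):
    """Correlate metric and human scores over all (metric, human) segment pairs pooled together."""
    pairs = [p for segs in scores.values() for p in segs]
    return pearson([m for m, _ in pairs], [h for _, h in pairs])


def system_level(scores, remove_outliers=False):
    """Correlate per-system mean metric scores with per-system mean human scores."""
    systems = list(scores)
    metric_means = [mean(m for m, _ in scores[s]) for s in systems]
    human_means = [mean(h for _, h in scores[s]) for s in systems]
    if remove_outliers:
        # Hypothetical choice: outlier systems are identified from human means only.
        keep = drop_outliers(human_means)
        metric_means = [metric_means[i] for i in keep]
        human_means = [human_means[i] for i in keep]
    return pearson(metric_means, human_means)


if __name__ == "__main__":
    # Toy data only: system -> list of (metric score, human score) per segment.
    toy = {
        "sys_a": [(0.61, 3.8), (0.55, 3.5), (0.70, 4.1)],
        "sys_b": [(0.58, 3.6), (0.52, 3.3), (0.66, 3.9)],
        "sys_c": [(0.63, 3.7), (0.57, 3.6), (0.68, 4.0)],
        "sys_d": [(0.60, 3.9), (0.54, 3.4), (0.65, 3.8)],
        "sys_e": [(0.20, 1.0), (0.25, 1.2), (0.18, 0.9)],  # a very weak system
    }
    print("segment-level r:", round(segment_level(toy), 3))
    print("system-level r (all systems):", round(system_level(toy), 3))
    print("system-level r (outliers removed):", round(system_level(toy, remove_outliers=True), 3))
```

Running the script prints three correlations for the toy data. With a single very weak system included, system-level correlation tends to be inflated, because that one point dominates the fit. Removing it shows how well the metric ranks the remaining, closer systems. This is one concrete way outliers can shift metric-human agreement. Whether the paper uses this filter, Kendall's tau, or another statistic is not stated in the abstract.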

Comments:	18 pages, 14 figures
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2510.07061 [cs.CL]
	(or arXiv:2510.07061v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2510.07061

Submission history

From: Amir Hossein Yari
[v1] Wed, 8 Oct 2025 14:27:02 UTC (21,559 KB)