NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist

Ni'mah, Iftitahu; Fang, Meng; Menkovski, Vlado; Pechenizkiy, Mykola

arXiv:2305.08566 (cs)

[Submitted on 15 May 2023 (v1), last revised 26 May 2023 (this version, v4)]

Abstract: In this study, we analyze automatic evaluation metrics for Natural Language Generation (NLG), specifically task-agnostic metrics and human-aligned metrics. Task-agnostic metrics, such as Perplexity, BLEU, and BERTScore, are cost-effective and highly adaptable to diverse NLG tasks, yet they correlate only weakly with human judgments. Human-aligned metrics (CTC, CtrlEval, UniEval) improve the correlation level by incorporating desirable human-like qualities as training objectives. However, their effectiveness at discerning system-level performance and the quality of system outputs remains unclear.
We present a metric preference checklist as a framework to assess the effectiveness of automatic metrics in three NLG tasks: Text Summarization, Dialogue Response Generation, and Controlled Generation. The proposed framework provides a means (i) to verify whether automatic metrics are faithful to human preference, regardless of their level of correlation with human judgments; and (ii) to inspect the strengths and limitations of NLG systems via pairwise evaluation. We show that automatic metrics provide better guidance than humans in discriminating system-level performance in the Text Summarization and Controlled Generation tasks. We also show that a multi-aspect human-aligned metric (UniEval) is not necessarily dominant over single-aspect human-aligned metrics (CTC, CtrlEval) and task-agnostic metrics (BLEU, BERTScore), particularly in Controlled Generation tasks.

Comments:	To appear at ACL 2023 Toronto (main conference). 9 pages (main), 1 page for Limitations and Ethics, 11 pages for Appendix
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2305.08566 [cs.CL]
	(or arXiv:2305.08566v4 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2305.08566

Submission history

From: Iftitahu Ni'mah
[v1] Mon, 15 May 2023 11:51:55 UTC (7,554 KB)
[v2] Wed, 17 May 2023 16:09:51 UTC (7,554 KB)
[v3] Thu, 18 May 2023 13:20:19 UTC (16,158 KB)
[v4] Fri, 26 May 2023 07:30:35 UTC (8,090 KB)
