-
Real-World Summarization: When Evaluation Reaches Its Limits
Authors:
Patrícia Schmidtová,
Ondřej Dušek,
Saad Mahamood
Abstract:
We examine evaluation of faithfulness to input data in the context of hotel highlights: brief LLM-generated summaries that capture unique features of accommodations. Through human evaluation campaigns involving categorical error assessment and span-level annotation, we compare traditional metrics, trainable methods, and LLM-as-a-judge approaches. Our findings reveal that simpler metrics like word overlap correlate surprisingly well with human judgments (Spearman rank correlation of 0.63), often outperforming more complex methods when applied to out-of-domain data. We further demonstrate that while LLMs can generate high-quality highlights, they prove unreliable for evaluation as they tend to severely under- or over-annotate. Our analysis of real-world business impacts shows that incorrect and non-checkable information poses the greatest risks. We also highlight challenges in crowdsourced evaluations.
Submitted 15 July, 2025;
originally announced July 2025.
-
How Important is 'Perfect' English for Machine Translation Prompts?
Authors:
Patrícia Schmidtová,
Niyati Bafna,
Seth Aycock,
Gianluca Vico,
Wiktor Kamzela,
Katharina Hämmerl,
Vilém Zouhar
Abstract:
Large language models (LLMs) have achieved top results in recent machine translation evaluations, but they are also known to be sensitive to errors and perturbations in their prompts. We systematically evaluate how both humanly plausible and synthetic errors in user prompts affect LLMs' performance on two related tasks: Machine translation and machine translation evaluation. We provide both a quantitative analysis and qualitative insights into how the models respond to increasing noise in the user prompt.
The prompt quality strongly affects the translation performance: With many errors, even a good prompt can underperform a minimal or poor prompt without errors. However, different noise types impact translation quality differently, with character-level and combined noisers degrading performance more than phrasal perturbations. Qualitative analysis reveals that lower prompt quality largely leads to poorer instruction following, rather than directly affecting translation quality itself. Further, LLMs can still translate in scenarios with overwhelming random noise that would make the prompt illegible to humans.
Submitted 30 August, 2025; v1 submitted 13 July, 2025;
originally announced July 2025.
-
Large Language Models as Span Annotators
Authors:
Zdeněk Kasner,
Vilém Zouhar,
Patrícia Schmidtová,
Ivan Kartáč,
Kristýna Onderková,
Ondřej Plátek,
Dimitra Gkatzia,
Saad Mahamood,
Ondřej Dušek,
Simone Balloccu
Abstract:
Span annotation is the task of localizing and classifying text spans according to custom guidelines. Annotated spans can be used to analyze and evaluate high-quality texts for which single-score metrics fail to provide actionable feedback. Until recently, span annotation was limited to human annotators or fine-tuned models. In this study, we show that large language models (LLMs) can serve as flexible and cost-effective span annotation backbones. To demonstrate their utility, we compare LLMs to skilled human annotators on three diverse span annotation tasks: evaluating data-to-text generation, identifying translation errors, and detecting propaganda techniques. We demonstrate that LLMs achieve inter-annotator agreement (IAA) comparable to human annotators at a fraction of the cost per output annotation. We also manually analyze model outputs, finding that LLMs make errors at a similar rate to human annotators. We release the dataset of more than 40k model and human annotations for further research.
Submitted 24 June, 2025; v1 submitted 11 April, 2025;
originally announced April 2025.
-
Automatic Metrics in Natural Language Generation: A Survey of Current Evaluation Practices
Authors:
Patrícia Schmidtová,
Saad Mahamood,
Simone Balloccu,
Ondřej Dušek,
Albert Gatt,
Dimitra Gkatzia,
David M. Howcroft,
Ondřej Plátek,
Adarsa Sivaprasad
Abstract:
Automatic metrics are extensively used to evaluate natural language processing systems. However, there has been increasing focus on how they are used and reported by practitioners within the field. In this paper, we have conducted a survey on the use of automatic metrics, focusing particularly on natural language generation (NLG) tasks. We inspect which metrics are used as well as why they are chosen and how their use is reported. Our findings from this survey reveal significant shortcomings, including inappropriate metric usage, lack of implementation details and missing correlations with human judgements. We conclude with recommendations that we believe authors should follow to enable more rigour within the field.
Submitted 17 August, 2024;
originally announced August 2024.
-
factgenie: A Framework for Span-based Evaluation of Generated Texts
Authors:
Zdeněk Kasner,
Ondřej Plátek,
Patrícia Schmidtová,
Simone Balloccu,
Ondřej Dušek
Abstract:
We present factgenie: a framework for annotating and visualizing word spans in textual model outputs. Annotations can capture various span-based phenomena such as semantic inaccuracies or irrelevant text. With factgenie, the annotations can be collected both from human crowdworkers and large language models. Our framework consists of a web interface for data visualization and gathering text annotations, powered by an easily extensible codebase.
Submitted 25 July, 2024;
originally announced July 2024.
-
Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs
Authors:
Simone Balloccu,
Patrícia Schmidtová,
Mateusz Lango,
Ondřej Dušek
Abstract:
Natural Language Processing (NLP) research is increasingly focusing on the use of Large Language Models (LLMs), with some of the most popular ones being either fully or partially closed-source. The lack of access to model details, especially regarding training data, has repeatedly raised concerns about data contamination among researchers. Several attempts have been made to address this issue, but they are limited to anecdotal evidence and trial and error. Additionally, they overlook the problem of indirect data leaking, where models are iteratively improved by using data coming from users. In this work, we conduct the first systematic analysis of work using OpenAI's GPT-3.5 and GPT-4, the most prominently used LLMs today, in the context of data contamination. By analysing 255 papers and considering OpenAI's data usage policy, we extensively document the amount of data leaked to these models during the first year after the model's release. We report that these models have been globally exposed to ~4.7M samples from 263 benchmarks. At the same time, we document a number of evaluation malpractices emerging in the reviewed papers, such as unfair or missing baseline comparisons and reproducibility issues. We release our results as a collaborative project on https://leak-llm.github.io/, where other researchers can contribute to our efforts.
Submitted 22 February, 2024; v1 submitted 6 February, 2024;
originally announced February 2024.
-
Three Ways of Using Large Language Models to Evaluate Chat
Authors:
Ondřej Plátek,
Vojtěch Hudeček,
Patricia Schmidtová,
Mateusz Lango,
Ondřej Dušek
Abstract:
This paper describes the systems submitted by team6 for ChatEval, the DSTC 11 Track 4 competition. We present three different approaches to predicting turn-level qualities of chatbot responses based on large language models (LLMs). We report improvement over the baseline using dynamic few-shot examples from a vector store for the prompts for ChatGPT. We also analyze the performance of the other two approaches and report needed improvements for future work. We developed the three systems over just two weeks, showing the potential of LLMs for this task. An ablation study conducted after the challenge deadline shows that the new Llama 2 models are closing the performance gap between ChatGPT and open-source LLMs. However, we find that the Llama 2 models do not benefit from few-shot examples in the same way as ChatGPT.
Submitted 12 August, 2023;
originally announced August 2023.
-
DialogueScript: Using Dialogue Agents to Produce a Script
Authors:
Patrícia Schmidtová,
Dávid Javorský,
Christián Mikláš,
Tomáš Musil,
Rudolf Rosa,
Ondřej Dušek
Abstract:
We present a novel approach to generating scripts by using agents with different personality types. To manage character interaction in the script, we employ simulated dramatic networks. Automatic and human evaluation on multiple criteria shows that our approach outperforms a vanilla-GPT2-based baseline. We further introduce a new metric to evaluate dialogue consistency based on natural language inference and demonstrate its validity.
Submitted 16 June, 2022;
originally announced June 2022.
-
THEaiTRE 1.0: Interactive generation of theatre play scripts
Authors:
Rudolf Rosa,
Tomáš Musil,
Ondřej Dušek,
Dominik Jurko,
Patrícia Schmidtová,
David Mareček,
Ondřej Bojar,
Tom Kocmi,
Daniel Hrbek,
David Košťák,
Martina Kinská,
Marie Nováková,
Josef Doležal,
Klára Vosecká,
Tomáš Studeník,
Petr Žabka
Abstract:
We present the first version of a system for interactive generation of theatre play scripts. The system is based on a vanilla GPT-2 model with several adjustments, targeting specific issues we encountered in practice. We also list other issues we encountered but plan to only solve in a future version of the system. The presented system was used to generate a theatre play script planned for premiere in February 2021.
Submitted 17 February, 2021;
originally announced February 2021.
-
THEaiTRE: Artificial Intelligence to Write a Theatre Play
Authors:
Rudolf Rosa,
Ondřej Dušek,
Tom Kocmi,
David Mareček,
Tomáš Musil,
Patrícia Schmidtová,
Dominik Jurko,
Ondřej Bojar,
Daniel Hrbek,
David Košťák,
Martina Kinská,
Josef Doležal,
Klára Vosecká
Abstract:
We present THEaiTRE, a starting project aimed at automatic generation of theatre play scripts. This paper reviews related work and drafts an approach we intend to follow. We plan to adopt generative neural language models and hierarchical generation approaches, supported by summarization and machine translation methods, and complemented with a human-in-the-loop approach.
Submitted 25 June, 2020;
originally announced June 2020.