La Leaderboard: A Large Language Model Leaderboard for Spanish Varieties and Languages of Spain and Latin America
Authors:
María Grandury,
Javier Aula-Blasco,
Júlia Falcão,
Clémentine Fourrier,
Miguel González,
Gonzalo Martínez,
Gonzalo Santamaría,
Rodrigo Agerri,
Nuria Aldama,
Luis Chiruzzo,
Javier Conde,
Helena Gómez,
Marta Guerrero,
Guido Ivetta,
Natalia López,
Flor Miriam Plaza-del-Arco,
María Teresa Martín-Valdivia,
Helena Montoro,
Carmen Muñoz,
Pedro Reviriego,
Leire Rosado,
Alejandro Vaca,
María Estrella Vallecillo-Rodríguez,
Jorge Vallego,
Irune Zubiaga
Abstract:
Leaderboards showcase the current capabilities and limitations of Large Language Models (LLMs). To motivate the development of LLMs that represent the linguistic and cultural diversity of the Spanish-speaking community, we present La Leaderboard, the first open-source leaderboard to evaluate generative LLMs in languages and language varieties of Spain and Latin America. La Leaderboard is a community-driven project that aims to establish an evaluation standard for everyone interested in developing LLMs for the Spanish-speaking community. This initial version combines 66 datasets in Basque, Catalan, Galician, and different Spanish varieties, showcasing the evaluation results of 50 models. To encourage community-driven development of leaderboards in other languages, we explain our methodology, including guidance on selecting the most suitable evaluation setup for each downstream task. In particular, we provide a rationale for using fewer few-shot examples than typically found in the literature, aiming to reduce environmental impact and facilitate access to reproducible results for a broader research community.
Submitted 1 July, 2025;
originally announced July 2025.
A LLM-Based Ranking Method for the Evaluation of Automatic Counter-Narrative Generation
Authors:
Irune Zubiaga,
Aitor Soroa,
Rodrigo Agerri
Abstract:
This paper proposes a novel approach to evaluate Counter Narrative (CN) generation using a Large Language Model (LLM) as an evaluator. We show that traditional automatic metrics correlate poorly with human judgements and fail to capture the nuanced relationship between generated CNs and human perception. To alleviate this, we introduce a model ranking pipeline based on pairwise comparisons of generated CNs from different models, organized in a tournament-style format. The proposed evaluation method achieves a high correlation with human preference, with a $ρ$ score of 0.88. As an additional contribution, we leverage LLMs as zero-shot CN generators and provide a comparative analysis of chat, instruct, and base models, exploring their respective strengths and limitations. Through meticulous evaluation, including fine-tuning experiments, we elucidate the differences in performance and responsiveness to domain-specific data. We conclude that chat-aligned models in zero-shot are the best option for carrying out the task, provided they do not refuse to generate an answer due to security concerns.
Submitted 4 November, 2024; v1 submitted 21 June, 2024;
originally announced June 2024.
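Illustrative sketch (not taken from the paper): the counter-narrative abstract above describes ranking models through pairwise comparisons organized as a tournament, then checking agreement with human preference via a $ρ$ score. The Python below shows one plausible reading of that idea, assuming a round-robin format and Spearman's rank correlation. The names `rank_models`, `spearman_rho`, and `length_judge` are hypothetical, and `length_judge` is only a toy stand-in for an LLM evaluator.

```python
from itertools import combinations
from typing import Callable, Dict, List


def length_judge(cn_a: str, cn_b: str) -> int:
    """Toy stand-in for an LLM judge (assumption): prefers the longer text.

    Returns 0 if the first candidate wins, 1 if the second wins.
    """
    return 0 if len(cn_a) >= len(cn_b) else 1


def rank_models(
    outputs: Dict[str, List[str]],
    judge: Callable[[str, str], int],
) -> List[str]:
    """Round-robin tournament: compare every pair of models on every item.

    `outputs` maps a model name to its generated counter-narratives, aligned
    by index so that position i answers the same input for every model.
    """
    wins = {name: 0 for name in outputs}
    for model_a, model_b in combinations(outputs, 2):
        for cn_a, cn_b in zip(outputs[model_a], outputs[model_b]):
            winner = model_a if judge(cn_a, cn_b) == 0 else model_b
            wins[winner] += 1
    # Most wins first; ties broken alphabetically for determinism.
    return sorted(wins, key=lambda name: (-wins[name], name))


def spearman_rho(ranking_x: List[str], ranking_y: List[str]) -> float:
    """Spearman's rho between two tie-free rankings of the same items."""
    n = len(ranking_x)
    position_y = {name: i for i, name in enumerate(ranking_y)}
    d_squared = sum((i - position_y[name]) ** 2 for i, name in enumerate(ranking_x))
    return 1 - 6 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    toy_outputs = {
        "model_a": ["short", "tiny"],
        "model_b": ["a longer reply", "another longer reply"],
        "model_c": ["medium one", "medium two"],
    }
    automatic = rank_models(toy_outputs, length_judge)
    human = ["model_b", "model_a", "model_c"]  # made-up reference ranking
    print(automatic, spearman_rho(automatic, human))
```

In practice the judge would be a call to an LLM prompted to pick the better of two counter-narratives, and the reference ranking would come from human annotators; both are replaced by toy values here.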