-
La Leaderboard: A Large Language Model Leaderboard for Spanish Varieties and Languages of Spain and Latin America
Authors:
María Grandury,
Javier Aula-Blasco,
Júlia Falcão,
Clémentine Fourrier,
Miguel González,
Gonzalo Martínez,
Gonzalo Santamaría,
Rodrigo Agerri,
Nuria Aldama,
Luis Chiruzzo,
Javier Conde,
Helena Gómez,
Marta Guerrero,
Guido Ivetta,
Natalia López,
Flor Miriam Plaza-del-Arco,
María Teresa Martín-Valdivia,
Helena Montoro,
Carmen Muñoz,
Pedro Reviriego,
Leire Rosado,
Alejandro Vaca,
María Estrella Vallecillo-Rodríguez,
Jorge Vallego,
Irune Zubiaga
Abstract:
Leaderboards showcase the current capabilities and limitations of Large Language Models (LLMs). To motivate the development of LLMs that represent the linguistic and cultural diversity of the Spanish-speaking community, we present La Leaderboard, the first open-source leaderboard to evaluate generative LLMs in languages and language varieties of Spain and Latin America. La Leaderboard is a community-driven project that aims to establish an evaluation standard for everyone interested in developing LLMs for the Spanish-speaking community. This initial version combines 66 datasets in Basque, Catalan, Galician, and different Spanish varieties, showcasing the evaluation results of 50 models. To encourage community-driven development of leaderboards in other languages, we explain our methodology, including guidance on selecting the most suitable evaluation setup for each downstream task. In particular, we provide a rationale for using fewer few-shot examples than typically found in the literature, aiming to reduce environmental impact and facilitate access to reproducible results for a broader research community.
Submitted 1 July, 2025;
originally announced July 2025.
-
Salamandra Technical Report
Authors:
Aitor Gonzalez-Agirre,
Marc Pàmies,
Joan Llop,
Irene Baucells,
Severino Da Dalt,
Daniel Tamayo,
José Javier Saiz,
Ferran Espuña,
Jaume Prats,
Javier Aula-Blasco,
Mario Mina,
Iñigo Pikabea,
Adrián Rubio,
Alexander Shvets,
Anna Sallés,
Iñaki Lacunza,
Jorge Palomar,
Júlia Falcão,
Lucía Tormo,
Luis Vasquez-Reina,
Montserrat Marimon,
Oriol Pareras,
Valle Ruiz-Fernández,
Marta Villegas
Abstract:
This work introduces Salamandra, a suite of open-source decoder-only large language models available in three different sizes: 2, 7, and 40 billion parameters. The models were trained from scratch on highly multilingual data that comprises text in 35 European languages and code. Our carefully curated corpus is made exclusively from open-access data compiled from a wide variety of sources. Along with the base models, supplementary checkpoints that were fine-tuned on public-domain instruction data are also released for chat applications. Additionally, we share our preliminary experiments on multimodality, which serve as a proof of concept to showcase potential applications for the Salamandra family. Our extensive evaluations on multilingual benchmarks reveal that Salamandra has strong capabilities, achieving competitive performance when compared to similarly sized open-source models. We provide comprehensive evaluation results on both standard downstream tasks and key aspects related to bias and safety. With this technical report, we intend to promote open science by sharing all the details behind our design choices, data curation strategy, and evaluation methodology. In addition, we deviate from the usual practice by making our training and evaluation scripts publicly accessible. We release all models under a permissive Apache 2.0 license in order to foster future research and facilitate commercial use, thereby contributing to the open-source ecosystem of large language models.
Submitted 13 February, 2025; v1 submitted 12 February, 2025;
originally announced February 2025.
-
MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models
Authors:
Guijin Son,
Dongkeun Yoon,
Juyoung Suk,
Javier Aula-Blasco,
Mano Aslan,
Vu Trong Kim,
Shayekh Bin Islam,
Jaume Prats-Cristià,
Lucía Tormo-Bañuelos,
Seungone Kim
Abstract:
As Large Language Models (LLMs) are now capable of producing fluent and coherent content in languages other than English, it is now imperative to precisely evaluate these non-English outputs. However, when assessing the outputs from multilingual LLMs, prior works often employed LLM-based evaluators that excel at assessing English outputs, without a thorough examination of whether these evaluators could effectively assess non-English text as well. Moreover, existing benchmarks to test evaluator LLMs (referred to as "meta-evaluation benchmarks") are mostly English-centric. To bridge this gap and examine whether evaluator LLMs can reliably assess the outputs of multilingual LLMs, we introduce MM-Eval, a multilingual meta-evaluation benchmark comprising five core subsets covering 18 languages and a Language Consistency subset spanning 122 languages. A core attribute of MM-Eval is that, instead of merely translating existing English meta-evaluation benchmarks, it is designed with multilingual-specific challenges in mind. Additionally, unlike existing meta-evaluation benchmarks that focus solely on ranking accuracy over pairwise data, MM-Eval also evaluates the consistency and fairness of absolute score values across a wide range of languages. Our results show that existing evaluator LLMs that excel in English contexts have considerable room for improvement when assessing non-English outputs. Furthermore, we find that evaluators are unfair and inconsistent when evaluating lower-resourced languages. Finally, we validate MM-Eval by measuring its correlation with Best-of-N rankings, finding a significantly stronger correlation compared to other meta-evaluation benchmarks. We publicly release our benchmark and code.
Submitted 29 March, 2025; v1 submitted 23 October, 2024;
originally announced October 2024.