-
ClinText-SP and RigoBERTa Clinical: a new set of open resources for Spanish Clinical NLP
Authors:
Guillem García Subies,
Álvaro Barbero Jiménez,
Paloma Martínez Fernández
Abstract:
We present a novel contribution to Spanish clinical natural language processing by introducing the largest publicly available clinical corpus, ClinText-SP, along with a state-of-the-art clinical encoder language model, RigoBERTa Clinical. Our corpus was meticulously curated from diverse open sources, including clinical cases from medical journals and annotated corpora from shared tasks, providing a rich and diverse dataset that was previously difficult to access. RigoBERTa Clinical, developed through domain-adaptive pretraining on this comprehensive dataset, significantly outperforms existing models on multiple clinical NLP benchmarks. By publicly releasing both the dataset and the model, we aim to empower the research community with robust resources that can drive further advancements in clinical NLP and ultimately contribute to improved healthcare applications.
Submitted 24 March, 2025;
originally announced March 2025.
-
RigoChat 2: an adapted language model to Spanish using a bounded dataset and reduced hardware
Authors:
Gonzalo Santamaría Gómez,
Guillem García Subies,
Pablo Gutiérrez Ruiz,
Mario González Valero,
Natàlia Fuertes,
Helena Montoro Zamorano,
Carmen Muñoz Sanz,
Leire Rosado Plaza,
Nuria Aldama García,
David Betancur Sánchez,
Kateryna Sushkova,
Marta Guerrero Nieto,
Álvaro Barbero Jiménez
Abstract:
Large Language Models (LLMs) have become a key element of modern artificial intelligence, demonstrating the ability to address a wide range of language processing tasks at unprecedented levels of accuracy without the need to collect problem-specific data. However, these versatile models face a significant challenge: both their training and inference processes require substantial computational resources, time, and memory. Consequently, optimizing these models to minimize such requirements is crucial. In this article, we demonstrate that, with minimal resources and in a remarkably short time, it is possible to enhance a state-of-the-art model for a given language task without compromising its overall capabilities, using a relatively small pretrained LLM as a basis. Specifically, we present our use case, RigoChat 2, illustrating how LLMs can be adapted to achieve superior results in Spanish-language tasks.
Submitted 11 March, 2025;
originally announced March 2025.
-
MEL: Legal Spanish Language Model
Authors:
David Betancur Sánchez,
Nuria Aldama García,
Álvaro Barbero Jiménez,
Marta Guerrero Nieto,
Patricia Marsà Morales,
Nicolás Serrano Salas,
Carlos García Hernán,
Pablo Haya Coll,
Elena Montiel Ponsoda,
Pablo Calleja Ibáñez
Abstract:
Legal texts, characterized by complex and specialized terminology, present a significant challenge for Language Models. Adding an underrepresented language, such as Spanish, to the mix makes it even more challenging. While pre-trained models like XLM-RoBERTa have shown capabilities in handling multilingual corpora, their performance on domain-specific documents remains underexplored. This paper presents the development and evaluation of MEL, a legal language model based on XLM-RoBERTa-large, fine-tuned on legal documents such as BOE (Boletín Oficial del Estado, the official Spanish report of laws) and congress texts. We detail the data collection, processing, training, and evaluation processes. Evaluation benchmarks show a significant improvement over baseline models in understanding the legal Spanish language. We also present case studies demonstrating the model's application to new legal texts, highlighting its potential to achieve top results across different NLP tasks.
Submitted 27 January, 2025;
originally announced January 2025.
-
3CEL: A corpus of legal Spanish contract clauses
Authors:
Nuria Aldama García,
Patricia Marsà Morales,
David Betancur Sánchez,
Álvaro Barbero Jiménez,
Marta Guerrero Nieto,
Pablo Haya Coll,
Patricia Martín Chozas,
Elena Montiel Ponsoda
Abstract:
Legal corpora for Natural Language Processing (NLP) are valuable and scarce resources in languages like Spanish for two main reasons: data accessibility and the availability of legal expert knowledge. INESData 2024 is a European Union-funded project led by the Universidad Politécnica de Madrid (UPM) and developed by Instituto de Ingeniería del Conocimiento (IIC) to create a series of state-of-the-art NLP resources applied to the legal/administrative domain in Spanish. The goal of this paper is to present the Corpus of Legal Spanish Contract Clauses (3CEL), which is a contract information extraction corpus developed within the framework of INESData 2024. 3CEL contains 373 manually annotated tenders using 19 defined categories (4 782 total tags) that identify key information for contract understanding and reviewing.
Submitted 27 January, 2025;
originally announced January 2025.
-
An evaluation of LLM code generation capabilities through graded exercises
Authors:
Álvaro Barbero Jiménez
Abstract:
Large Language Models have shown prominent capabilities in generating functional code from natural language descriptions. However, a standardized way to evaluate these capabilities in an objective and unbiased manner is still to be found. In this paper we review the current evaluation methods available to this end, and run a new evaluation of the performance of one state-of-the-art model (GPT4-o-mini) in solving curated coding challenges in 8 programming languages, obtained from Codewars, a software development community. Our analysis shows that the chance of success of the model has a positive correlation with the task difficulty, the popularity of the programming language being used and the time elapsed since the publication of the challenge. A further approximate explanatory analysis in terms of high-level features hints that while 46.6% of the model performance could be attributed to task difficulty, 37.4% seems to be related to leakage of the challenge solutions into the model training set, and the remaining 16% depends on the programming language. These results suggest that current evaluation methodologies might be overestimating the actual skill of Large Language Models for generating functional code.
Submitted 6 October, 2024;
originally announced October 2024.
-
A Survey of Spanish Clinical Language Models
Authors:
Guillem García Subies,
Álvaro Barbero Jiménez,
Paloma Martínez Fernández
Abstract:
This survey focuses on encoder Language Models for solving tasks in the clinical domain in the Spanish language. We review the contributions of 17 corpora focused mainly on clinical tasks, then list the most relevant Spanish Language Models and Spanish Clinical Language Models. We perform a thorough comparison of these models by benchmarking them over a curated subset of the available corpora, in order to find the best-performing ones; in total more than 3000 models were fine-tuned for this study. All the tested corpora and the best models are made publicly available in an accessible way, so that the results can be reproduced by independent teams or challenged in the future when new Spanish Clinical Language Models are created.
Submitted 4 August, 2023;
originally announced August 2023.
-
RLBoost: Boosting Supervised Models using Deep Reinforcement Learning
Authors:
Eloy Anguiano Batanero,
Ángela Fernández Pascual,
Álvaro Barbero Jiménez
Abstract:
Data quality evaluation is sometimes a task as important as collecting a large volume of data when it comes to generating accurate artificial intelligence models. In fact, being able to evaluate the data can lead to a larger database that is better suited to a particular problem, because we have the ability to filter out automatically obtained data of dubious quality. In this paper we present RLBoost, an algorithm that uses deep reinforcement learning strategies to evaluate a particular dataset and obtain a model capable of estimating the quality of any new data in order to improve the final predictive quality of a supervised learning model. This solution has the advantage of being agnostic regarding the supervised model used and, through multi-attention strategies, takes into account the data in its context and not only individually. The results of the article show that this model obtains better and more stable results than other state-of-the-art algorithms such as LOO, DataShapley or DVRL.
Submitted 23 May, 2023;
originally announced May 2023.
-
Mixture of Diffusers for scene composition and high resolution image generation
Authors:
Álvaro Barbero Jiménez
Abstract:
Diffusion methods have been proven to be very effective at generating images while conditioning on a text prompt. However, although the quality of the generated images is unprecedented, these methods seem to struggle when trying to generate specific image compositions. In this paper we present Mixture of Diffusers, an algorithm that builds on existing diffusion models to provide more detailed control over composition. By harmonizing several diffusion processes acting on different regions of a canvas, it allows generating larger images, where the location of each object and style is controlled by a separate diffusion process.
Submitted 5 February, 2023;
originally announced February 2023.
-
RigoBERTa: A State-of-the-Art Language Model For Spanish
Authors:
Alejandro Vaca Serrano,
Guillem Garcia Subies,
Helena Montoro Zamorano,
Nuria Aldama Garcia,
Doaa Samy,
David Betancur Sanchez,
Antonio Moreno Sandoval,
Marta Guerrero Nieto,
Alvaro Barbero Jimenez
Abstract:
This paper presents RigoBERTa, a State-of-the-Art Language Model for Spanish. RigoBERTa is trained over a well-curated corpus built from different subcorpora with key features. It follows the DeBERTa architecture, which has several advantages over other architectures of similar size such as BERT or RoBERTa. RigoBERTa's performance is assessed over 13 NLU tasks in comparison with other available Spanish language models, namely MarIA, BERTIN and BETO. RigoBERTa outperformed the three models in 10 out of the 13 tasks, achieving new "State-of-the-Art" results.
Submitted 3 June, 2022; v1 submitted 27 April, 2022;
originally announced May 2022.