-
EuroLLM-9B: Technical Report
Authors:
Pedro Henrique Martins,
João Alves,
Patrick Fernandes,
Nuno M. Guerreiro,
Ricardo Rei,
Amin Farajian,
Mateusz Klimaszewski,
Duarte M. Alves,
José Pombal,
Manuel Faysse,
Pierre Colombo,
François Yvon,
Barry Haddow,
José G. C. de Souza,
Alexandra Birch,
André F. T. Martins
Abstract:
This report presents EuroLLM-9B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-9B's development, including tokenizer design, architectural specifications, data filtering, and training procedures. We describe the pre-training data collection and filtering pipeline, including the creation of EuroFilter, an AI-based multilingual filter, as well as the design of EuroBlocks-Synthetic, a novel synthetic dataset for post-training that enhances language coverage for European languages. Evaluation results demonstrate EuroLLM-9B's competitive performance on multilingual benchmarks and machine translation tasks, establishing it as the leading open European-made LLM of its size. To support open research and adoption, we release all major components of this work, including the base and instruction-tuned models, the EuroFilter classifier, and the synthetic post-training dataset.
Submitted 4 June, 2025;
originally announced June 2025.
-
EuroBERT: Scaling Multilingual Encoders for European Languages
Authors:
Nicolas Boizard,
Hippolyte Gisserot-Boukhlef,
Duarte M. Alves,
André Martins,
Ayoub Hammal,
Caio Corro,
Céline Hudelot,
Emmanuel Malherbe,
Etienne Malaboeuf,
Fanny Jourdan,
Gabriel Hautreux,
João Alves,
Kevin El-Haddad,
Manuel Faysse,
Maxime Peyrard,
Nuno M. Guerreiro,
Patrick Fernandes,
Ricardo Rei,
Pierre Colombo
Abstract:
General-purpose multilingual vector representations, used in retrieval, regression and classification, are traditionally obtained from bidirectional encoder models. Despite their wide applicability, encoders have been recently overshadowed by advances in generative decoder-only models. However, many innovations driving this progress are not inherently tied to decoders. In this paper, we revisit the development of multilingual encoders through the lens of these advances, and introduce EuroBERT, a family of multilingual encoders covering European and widely spoken global languages. Our models outperform existing alternatives across a diverse range of tasks, spanning multilingual capabilities, mathematics, and coding, and natively support sequences of up to 8,192 tokens. We also examine the design decisions behind EuroBERT, offering insights into our dataset composition and training pipeline. We publicly release the EuroBERT models, including intermediate training checkpoints, together with our training framework.
Submitted 26 March, 2025; v1 submitted 7 March, 2025;
originally announced March 2025.
-
SPARC4 control system
Authors:
Denis Bernardes,
Orlando Verducci Junior,
Francisco Rodrigues,
Claudia Vilega Rodrigues,
Luciano Fraga,
Eder Martioli,
Clemens D. Gneiding,
André Luiz de Moura Alves,
Juliano Romão,
Laerte Andrade,
Leandro de Almeida,
Ana Carolina Mattiuci,
Flavio Felipe Ribeiro,
Wagner Schlindwein,
Jesulino Bispo dos Santos,
Francisco Jose Jablonski,
Julio Cesar Neves Campagnolo,
Rene Laporte
Abstract:
SPARC4 is a new astronomical instrument developed entirely by Brazilian institutions, currently installed on the 1.6-m Perkin-Elmer telescope of the Pico dos Dias Observatory. It allows the user to perform photometric or polarimetric observations simultaneously in the four SDSS bands (g, r, i, and z). In this paper, we describe the control system developed for SPARC4. This system is composed of the S4ACS, S4ICS, and S4GUI software and associated hardware. S4ACS is responsible for controlling the four EMCCD scientific cameras (one for each instrument band). S4ICS controls the sensors and motors responsible for the moving parts of SPARC4. Finally, S4GUI is the interface used to perform observations, which includes the choice of instrument configuration and image acquisition parameters. S4GUI communicates with the instrument subsystems and with some observatory facilities needed during the observations. Bench tests were performed to determine the overheads added by the SPARC4 control system in the acquisition of photometric and polarimetric series of images. In the photometric mode, SPARC4 allows the acquisition of a series of 1400 full-frame images, with a deadtime of 4.5 ms between images. In addition, several image series can be concatenated with a deadtime of 450 ms plus the readout time of the last image. For the polarimetric mode, measurements can be obtained with a deadtime of 1.41 s plus the image readout time between subsequent waveplate positions. For both photometric and polarimetric modes, the user can choose among operating modes with image readout times between 5.9 ms and 1.24 s, which ultimately define the instrument's temporal performance.
Submitted 24 December, 2024;
originally announced December 2024.
-
EuroLLM: Multilingual Language Models for Europe
Authors:
Pedro Henrique Martins,
Patrick Fernandes,
João Alves,
Nuno M. Guerreiro,
Ricardo Rei,
Duarte M. Alves,
José Pombal,
Amin Farajian,
Manuel Faysse,
Mateusz Klimaszewski,
Pierre Colombo,
Barry Haddow,
José G. C. de Souza,
Alexandra Birch,
André F. T. Martins
Abstract:
The quality of open-weight LLMs has seen significant improvement, yet they remain predominantly focused on English. In this paper, we introduce the EuroLLM project, aimed at developing a suite of open-weight multilingual LLMs capable of understanding and generating text in all official European Union languages, as well as several additional relevant languages. We outline the progress made to date, detailing our data collection and filtering process, the development of scaling laws, the creation of our multilingual tokenizer, and the data mix and modeling configurations. Additionally, we release our initial models: EuroLLM-1.7B and EuroLLM-1.7B-Instruct and report their performance on multilingual general benchmarks and machine translation.
Submitted 24 September, 2024;
originally announced September 2024.
-
Tower: An Open Multilingual Large Language Model for Translation-Related Tasks
Authors:
Duarte M. Alves,
José Pombal,
Nuno M. Guerreiro,
Pedro H. Martins,
João Alves,
Amin Farajian,
Ben Peters,
Ricardo Rei,
Patrick Fernandes,
Sweta Agrawal,
Pierre Colombo,
José G. C. de Souza,
André F. T. Martins
Abstract:
While general-purpose large language models (LLMs) demonstrate proficiency on multiple tasks within the domain of translation, approaches based on open LLMs are competitive only when specializing on a single task. In this paper, we propose a recipe for tailoring LLMs to multiple tasks present in translation workflows. We perform continued pretraining on a multilingual mixture of monolingual and parallel data, creating TowerBase, followed by finetuning on instructions relevant for translation processes, creating TowerInstruct. Our final model surpasses open alternatives on several tasks relevant to translation workflows and is competitive with general-purpose closed LLMs. To facilitate future research, we release the Tower models, our specialization dataset, an evaluation framework for LLMs focusing on the translation ecosystem, and a collection of model generations, including ours, on our benchmark.
Submitted 27 February, 2024;
originally announced February 2024.
-
CroissantLLM: A Truly Bilingual French-English Language Model
Authors:
Manuel Faysse,
Patrick Fernandes,
Nuno M. Guerreiro,
António Loison,
Duarte M. Alves,
Caio Corro,
Nicolas Boizard,
João Alves,
Ricardo Rei,
Pedro H. Martins,
Antoni Bigata Casademunt,
François Yvon,
André F. T. Martins,
Gautier Viaud,
Céline Hudelot,
Pierre Colombo
Abstract:
We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a custom tokenizer, and bilingual finetuning datasets. We release the training dataset, notably containing a French split with manually curated, high-quality, and varied data sources. To assess performance outside of English, we craft a novel benchmark, FrenchBench, consisting of an array of classification and generation tasks, covering various orthogonal aspects of model performance in the French language. Additionally, rooted in transparency and to foster further Large Language Model research, we release codebases and dozens of checkpoints across various model sizes, training data distributions, and training steps, as well as fine-tuned Chat models and strong translation models. We evaluate our model through the FMTI framework, and validate 81% of the transparency criteria, far beyond the scores of even most open initiatives. This work enriches the NLP landscape, breaking away from previous English-centric work in order to strengthen our understanding of multilinguality in language models.
Submitted 9 April, 2025; v1 submitted 1 February, 2024;
originally announced February 2024.
-
Experimenting with Large Language Models and vector embeddings in NASA SciX
Authors:
Sergi Blanco-Cuaresma,
Ioana Ciucă,
Alberto Accomazzi,
Michael J. Kurtz,
Edwin A. Henneken,
Kelly E. Lockhart,
Felix Grezes,
Thomas Allen,
Golnaz Shapurian,
Carolyn S. Grant,
Donna M. Thompson,
Timothy W. Hostetler,
Matthew R. Templeton,
Shinyi Chen,
Jennifer Koch,
Taylor Jacovich,
Daniel Chivvis,
Fernanda de Macedo Alves,
Jean-Claude Paquin,
Jennifer Bartlett,
Mugdha Polimera,
Stephanie Jarmak
Abstract:
Open-source Large Language Models enable projects such as NASA SciX (i.e., NASA ADS) to think outside the box and try alternative approaches for information retrieval and data augmentation, while respecting data copyright and users' privacy. However, when large language models are directly prompted with questions without any context, they are prone to hallucination. At NASA SciX we have developed an experiment where we created semantic vectors for our large collection of abstracts and full-text content, and we designed a prompt system to ask questions using contextual chunks from our system. Based on a non-systematic human evaluation, the experiment shows a lower degree of hallucination and better responses when using Retrieval Augmented Generation. Further exploration is required to design new features and data augmentation processes at NASA SciX that leverage this technology while respecting the high level of trust and quality that the project holds.
Submitted 21 December, 2023;
originally announced December 2023.
-
Steering Large Language Models for Machine Translation with Finetuning and In-Context Learning
Authors:
Duarte M. Alves,
Nuno M. Guerreiro,
João Alves,
José Pombal,
Ricardo Rei,
José G. C. de Souza,
Pierre Colombo,
André F. T. Martins
Abstract:
Large language models (LLMs) are a promising avenue for machine translation (MT). However, current LLM-based MT systems are brittle: their effectiveness highly depends on the choice of few-shot examples and they often require extra post-processing due to overgeneration. Alternatives such as finetuning on translation instructions are computationally expensive and may weaken in-context learning capabilities, due to overspecialization. In this paper, we provide a closer look at this problem. We start by showing that adapter-based finetuning with LoRA matches the performance of traditional finetuning while reducing the number of training parameters by a factor of 50. This method also outperforms few-shot prompting and eliminates the need for post-processing or in-context examples. However, we show that finetuning generally degrades few-shot performance, hindering adaptation capabilities. Finally, to obtain the best of both worlds, we propose a simple approach that incorporates few-shot examples during finetuning. Experiments on 10 language pairs show that our proposed approach recovers the original few-shot capabilities while keeping the added benefits of finetuning.
Submitted 20 October, 2023;
originally announced October 2023.
-
CometKiwi: IST-Unbabel 2022 Submission for the Quality Estimation Shared Task
Authors:
Ricardo Rei,
Marcos Treviso,
Nuno M. Guerreiro,
Chrysoula Zerva,
Ana C. Farinha,
Christine Maroti,
José G. C. de Souza,
Taisiya Glushkova,
Duarte M. Alves,
Alon Lavie,
Luisa Coheur,
André F. T. Martins
Abstract:
We present the joint contribution of IST and Unbabel to the WMT 2022 Shared Task on Quality Estimation (QE). Our team participated in all three subtasks: (i) Sentence and Word-level Quality Prediction; (ii) Explainable QE; and (iii) Critical Error Detection. For all tasks we build on top of the COMET framework, connecting it with the predictor-estimator architecture of OpenKiwi, and equipping it with a word-level sequence tagger and an explanation extractor. Our results suggest that incorporating references during pretraining improves performance across several language pairs on downstream tasks, and that jointly training with sentence and word-level objectives yields a further boost. Furthermore, combining attention and gradient information proved to be the top strategy for extracting good explanations of sentence-level QE models. Overall, our submissions achieved the best results for all three tasks for almost all language pairs by a considerable margin.
Submitted 13 September, 2022;
originally announced September 2022.
-
Novel C-dots/titanate nanotubular hybrid materials with enhanced optical and photocatalytic properties
Authors:
D. M. Alves,
J. V. Prata,
A. J. Silvestre,
O. C. Monteiro
Abstract:
Advanced nanomaterials with enhanced optical and photocatalytic properties for the photodegradation of organic pollutants, in particular pharmaceuticals and personal care products (PPCPs), were successfully prepared by a swift one-pot synthesis. Nanostructured materials were synthesized through an integrated hydrothermal procedure that generates titanate nanotubes (TNTs) with different carbon dots (C-dots) contents, from an amorphous titanium oxide-based source and cork industry wastewaters (CIWWs) as the carbon source. Their structural, microstructural, morphological, and optical properties were studied by XRD, TEM, and UV-Vis diffuse reflectance and photoluminescence spectroscopies. As intended, the hybrid C-dots/TNT nanomaterials extend their light absorption towards the red in comparison to pristine TNTs, enabling a more efficient use of light in photocatalysis by widening the TNTs' energy uptake range. The decrease in bandgap energy with increasing C-dots content appears to originate from intermediate energy states formed within the TNTs' forbidden band, resulting from Ti-O-C bonds established between the TNTs and the C-dots that form tails of states. The as-synthesized C-dots/TNT samples were tested in the photodegradation of caffeine as a model pollutant. Rewarding results were obtained, with the hybrid C-dots/TNT nanomaterials showing significantly enhanced photocatalytic ability toward caffeine degradation in comparison to pristine TNTs. Photocatalysis assays in the presence of scavengers and/or in the absence of oxygen were also performed to characterize the most reactive species formed during the semiconductor photo-activation process and thus to assess possible reaction pathways underpinning the photocatalytic activity of the hybrid C-dots/TNT nanomaterials.
Submitted 23 September, 2022; v1 submitted 17 January, 2022;
originally announced January 2022.
-
A continuous integration and web framework in support of the ATLAS Publication Process
Authors:
Juan Pedro Araque Espinosa,
Gabriel Baldi Levcovitz,
Riccardo-Maria Bianchi,
Ian Brock,
Tancredi Carli,
Nuno Filipe Castro,
Alessandra Ciocio,
Maurizio Colautti,
Ana Carolina Da Silva Menezes,
Gabriel De Oliveira da Fonseca,
Leandro Domingues Macedo Alves,
Andreas Hoecker,
Bruno Lange Ramos,
Gabriela Lemos Lúcidi Pinhão,
Carmen Maidantchik,
Fairouz Malek,
Robert McPherson,
Gianluca Picco,
Marcelo Teixeira Dos Santos
Abstract:
The ATLAS collaboration defines methods, establishes procedures, and organises advisory groups to manage the publication processes of scientific papers, conference papers, and public notes. All stages are managed through web systems, computing programs, and tools that are designed and developed by the collaboration. A framework called FENCE is integrated into the CERN GitLab software repository to automatically configure workspaces where each analysis can be documented by the analysis team and managed by the relevant coordinators. Continuous integration is used to guide the writers in applying consistent and correct formatting when preparing papers to be submitted to scientific journals. Additional software assures the correctness of other aspects of each paper, such as the lists of collaboration authors, funding agencies, and foundations. The framework and the workflow therein provide automatic and easy support to the researchers and facilitate each phase of the publication process, allowing authors to focus on the article contents. The framework and its integration with the most up-to-date and efficient tools have consequently provided a more professional and efficient automated work environment to the whole collaboration.
Submitted 28 January, 2021; v1 submitted 14 May, 2020;
originally announced May 2020.