
Showing 1–11 of 11 results for author: Alves, D M

  1. arXiv:2506.04079  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    EuroLLM-9B: Technical Report

    Authors: Pedro Henrique Martins, João Alves, Patrick Fernandes, Nuno M. Guerreiro, Ricardo Rei, Amin Farajian, Mateusz Klimaszewski, Duarte M. Alves, José Pombal, Manuel Faysse, Pierre Colombo, François Yvon, Barry Haddow, José G. C. de Souza, Alexandra Birch, André F. T. Martins

    Abstract: This report presents EuroLLM-9B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-9B's development, inclu…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: 56 pages

  2. arXiv:2503.05500  [pdf, other]

    cs.CL cs.AI

    EuroBERT: Scaling Multilingual Encoders for European Languages

    Authors: Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M. Alves, André Martins, Ayoub Hammal, Caio Corro, Céline Hudelot, Emmanuel Malherbe, Etienne Malaboeuf, Fanny Jourdan, Gabriel Hautreux, João Alves, Kevin El-Haddad, Manuel Faysse, Maxime Peyrard, Nuno M. Guerreiro, Patrick Fernandes, Ricardo Rei, Pierre Colombo

    Abstract: General-purpose multilingual vector representations, used in retrieval, regression and classification, are traditionally obtained from bidirectional encoder models. Despite their wide applicability, encoders have been recently overshadowed by advances in generative decoder-only models. However, many innovations driving this progress are not inherently tied to decoders. In this paper, we revisit th…

    Submitted 26 March, 2025; v1 submitted 7 March, 2025; originally announced March 2025.

    Comments: 28 pages, 8 figures, 13 tables

  3. arXiv:2412.18410  [pdf, other]

    astro-ph.IM

    SPARC4 control system

    Authors: Denis Bernardes, Orlando Verducci Junior, Francisco Rodrigues, Claudia Vilega Rodrigues, Luciano Fraga, Eder Martioli, Clemens D. Gneiding, André Luiz de Moura Alves, Juliano Romão, Laerte Andrade, Leandro de Almeida, Ana Carolina Mattiuci, Flavio Felipe Ribeiro, Wagner Schlindwein, Jesulino Bispo dos Santos, Francisco Jose Jablonski, Julio Cesar Neves Campagnolo, Rene Laporte

    Abstract: SPARC4 is a new astronomical instrument developed entirely by Brazilian institutions, currently installed on the 1.6-m Perkin-Elmer telescope of the Pico dos Dias Observatory. It allows the user to perform photometric or polarimetric observations simultaneously in the four SDSS bands (g, r, i, and z). In this paper, we describe the control system developed for SPARC4. This system is composed of S4…

    Submitted 24 December, 2024; originally announced December 2024.

    Comments: 20 pages, 14 figures, peer-reviewed paper

  4. arXiv:2409.16235  [pdf, other]

    cs.CL

    EuroLLM: Multilingual Language Models for Europe

    Authors: Pedro Henrique Martins, Patrick Fernandes, João Alves, Nuno M. Guerreiro, Ricardo Rei, Duarte M. Alves, José Pombal, Amin Farajian, Manuel Faysse, Mateusz Klimaszewski, Pierre Colombo, Barry Haddow, José G. C. de Souza, Alexandra Birch, André F. T. Martins

    Abstract: The quality of open-weight LLMs has seen significant improvement, yet they remain predominantly focused on English. In this paper, we introduce the EuroLLM project, aimed at developing a suite of open-weight multilingual LLMs capable of understanding and generating text in all official European Union languages, as well as several additional relevant languages. We outline the progress made to date,…

    Submitted 24 September, 2024; originally announced September 2024.

  5. arXiv:2402.17733  [pdf, other]

    cs.CL

    Tower: An Open Multilingual Large Language Model for Translation-Related Tasks

    Authors: Duarte M. Alves, José Pombal, Nuno M. Guerreiro, Pedro H. Martins, João Alves, Amin Farajian, Ben Peters, Ricardo Rei, Patrick Fernandes, Sweta Agrawal, Pierre Colombo, José G. C. de Souza, André F. T. Martins

    Abstract: While general-purpose large language models (LLMs) demonstrate proficiency on multiple tasks within the domain of translation, approaches based on open LLMs are competitive only when specializing on a single task. In this paper, we propose a recipe for tailoring LLMs to multiple tasks present in translation workflows. We perform continued pretraining on a multilingual mixture of monolingual and pa…

    Submitted 27 February, 2024; originally announced February 2024.

  6. arXiv:2402.00786  [pdf, other]

    cs.CL cs.LG

    CroissantLLM: A Truly Bilingual French-English Language Model

    Authors: Manuel Faysse, Patrick Fernandes, Nuno M. Guerreiro, António Loison, Duarte M. Alves, Caio Corro, Nicolas Boizard, João Alves, Ricardo Rei, Pedro H. Martins, Antoni Bigata Casademunt, François Yvon, André F. T. Martins, Gautier Viaud, Céline Hudelot, Pierre Colombo

    Abstract: We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a cust…

    Submitted 9 April, 2025; v1 submitted 1 February, 2024; originally announced February 2024.

  7. arXiv:2312.14211  [pdf, ps, other]

    cs.CL astro-ph.IM cs.AI

    Experimenting with Large Language Models and vector embeddings in NASA SciX

    Authors: Sergi Blanco-Cuaresma, Ioana Ciucă, Alberto Accomazzi, Michael J. Kurtz, Edwin A. Henneken, Kelly E. Lockhart, Felix Grezes, Thomas Allen, Golnaz Shapurian, Carolyn S. Grant, Donna M. Thompson, Timothy W. Hostetler, Matthew R. Templeton, Shinyi Chen, Jennifer Koch, Taylor Jacovich, Daniel Chivvis, Fernanda de Macedo Alves, Jean-Claude Paquin, Jennifer Bartlett, Mugdha Polimera, Stephanie Jarmak

    Abstract: Open-source Large Language Models enable projects such as NASA SciX (i.e., NASA ADS) to think out of the box and try alternative approaches for information retrieval and data augmentation, while respecting data copyright and users' privacy. However, when large language models are directly prompted with questions without any context, they are prone to hallucination. At NASA SciX we have developed a…

    Submitted 21 December, 2023; originally announced December 2023.

    Comments: To appear in the proceedings of the 33rd annual international Astronomical Data Analysis Software & Systems (ADASS XXXIII)

  8. arXiv:2310.13448  [pdf, other]

    cs.CL

    Steering Large Language Models for Machine Translation with Finetuning and In-Context Learning

    Authors: Duarte M. Alves, Nuno M. Guerreiro, João Alves, José Pombal, Ricardo Rei, José G. C. de Souza, Pierre Colombo, André F. T. Martins

    Abstract: Large language models (LLMs) are a promising avenue for machine translation (MT). However, current LLM-based MT systems are brittle: their effectiveness highly depends on the choice of few-shot examples and they often require extra post-processing due to overgeneration. Alternatives such as finetuning on translation instructions are computationally expensive and may weaken in-context learning capa…

    Submitted 20 October, 2023; originally announced October 2023.

    Comments: Accepted at EMNLP 2023 - Findings

  9. arXiv:2209.06243  [pdf, other]

    cs.CL cs.LG

    CometKiwi: IST-Unbabel 2022 Submission for the Quality Estimation Shared Task

    Authors: Ricardo Rei, Marcos Treviso, Nuno M. Guerreiro, Chrysoula Zerva, Ana C. Farinha, Christine Maroti, José G. C. de Souza, Taisiya Glushkova, Duarte M. Alves, Alon Lavie, Luisa Coheur, André F. T. Martins

    Abstract: We present the joint contribution of IST and Unbabel to the WMT 2022 Shared Task on Quality Estimation (QE). Our team participated on all three subtasks: (i) Sentence and Word-level Quality Prediction; (ii) Explainable QE; and (iii) Critical Error Detection. For all tasks we build on top of the COMET framework, connecting it with the predictor-estimator architecture of OpenKiwi, and equipping it w…

    Submitted 13 September, 2022; originally announced September 2022.

    Comments: WMT 2022 Quality Estimation shared task

  10. arXiv:2201.08243  [pdf]

    physics.app-ph cond-mat.mtrl-sci

    Novel C-dots/titanate nanotubular hybrid materials with enhanced optical and photocatalytic properties

    Authors: D. M. Alves, J. V. Prata, A. J. Silvestre, O. C. Monteiro

    Abstract: Advanced nanomaterials with enhanced optical and photocatalytic properties for the photodegradation of organic pollutants, in particular pharmaceuticals and personal care products (PPCPs), were successfully prepared by a swift one-pot synthesis. Nanostructured materials were synthesized through an integrated hydrothermal procedure which generates titanate nanotubes (TNTs) with different carbon dot…

    Submitted 23 September, 2022; v1 submitted 17 January, 2022; originally announced January 2022.

    Comments: 38 pages, 10 figures, 4 tables, Supplementary Information

  11. A continuous integration and web framework in support of the ATLAS Publication Process

    Authors: Juan Pedro Araque Espinosa, Gabriel Baldi Levcovitz, Riccardo-Maria Bianchi, Ian Brock, Tancredi Carli, Nuno Filipe Castro, Alessandra Ciocio, Maurizio Colautti, Ana Carolina Da Silva Menezes, Gabriel De Oliveira da Fonseca, Leandro Domingues Macedo Alves, Andreas Hoecker, Bruno Lange Ramos, Gabriela Lemos Lúcidi Pinhão, Carmen Maidantchik, Fairouz Malek, Robert McPherson, Gianluca Picco, Marcelo Teixeira Dos Santos

    Abstract: The ATLAS collaboration defines methods, establishes procedures, and organises advisory groups to manage the publication processes of scientific papers, conference papers, and public notes. All stages are managed through web systems, computing programs, and tools that are designed and developed by the collaboration. A framework called FENCE is integrated into the CERN GitLab software repository, t…

    Submitted 28 January, 2021; v1 submitted 14 May, 2020; originally announced May 2020.

    Comments: 22 pages in total, 11 figures, submitted to JINST. All figures including auxiliary figures are available at https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PAPERS/GENR-2018-01/

    Report number: CERN-OPEN-2020-007