Showing 1–1 of 1 results for author: Salas, P G d P

  1. arXiv:2207.06814

    cs.CL cs.AI

    BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling

    Authors: Javier de la Rosa, Eduardo G. Ponferrada, Paulo Villegas, Pablo Gonzalez de Prado Salas, Manu Romero, María Grandury

    Abstract: The pre-training of large language models usually requires massive amounts of resources, both in terms of computation and data. Frequently used web sources such as Common Crawl might contain enough noise to make this pre-training sub-optimal. In this work, we experiment with different sampling methods from the Spanish version of mC4, and present a novel data-centric technique which we name…

    Submitted 14 July, 2022; originally announced July 2022.

    Comments: Published at Procesamiento del Lenguaje Natural

    Journal ref: Procesamiento del Lenguaje Natural, 68 (2022): 13-23