Showing 1–18 of 18 results for author: de la Clergerie, E

  1. arXiv:2506.20331  [pdf, ps, other]

    cs.CL cs.LG

    Biomed-Enriched: A Biomedical Dataset Enriched with LLMs for Pretraining and Extracting Rare and Hidden Content

    Authors: Rian Touchent, Nathan Godey, Eric de la Clergerie

    Abstract: We introduce Biomed-Enriched, a biomedical text dataset constructed from PubMed via a two-stage annotation process. In the first stage, a large language model annotates 400K paragraphs from PubMed scientific articles, assigning scores for their type (review, study, clinical case, other), domain (clinical, biomedical, other), and educational quality. The educational quality score (rated 1 to 5) est…

    Submitted 25 June, 2025; originally announced June 2025.

    Comments: Dataset link: https://hf.co/datasets/almanach/Biomed-Enriched

  2. arXiv:2503.02812  [pdf, other]

    cs.CL cs.AI

    Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression

    Authors: Nathan Godey, Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini, Éric de la Clergerie, Benoît Sagot

    Abstract: Autoregressive language models rely on a Key-Value (KV) Cache, which avoids re-computing past hidden states during generation, making it faster. As model sizes and context lengths grow, the KV Cache becomes a significant memory bottleneck, which calls for compression methods that limit its size during generation. In this paper, we discover surprising properties of Query (Q) and Key (K) vectors tha…

    Submitted 4 March, 2025; originally announced March 2025.

  3. arXiv:2411.08868  [pdf, other]

    cs.CL

    CamemBERT 2.0: A Smarter French Language Model Aged to Perfection

    Authors: Wissam Antoun, Francis Kulumba, Rian Touchent, Éric de la Clergerie, Benoît Sagot, Djamé Seddah

    Abstract: French language models, such as CamemBERT, have been widely adopted across industries for natural language processing (NLP) tasks, with models like CamemBERT seeing over 4 million downloads per month. However, these models face challenges due to temporal concept drift, where outdated training data leads to a decline in performance, especially when encountering new topics and terminology. This issu…

    Submitted 13 November, 2024; originally announced November 2024.

  4. arXiv:2406.06589  [pdf, other]

    cs.CL cs.AI

    PatentEval: Understanding Errors in Patent Generation

    Authors: You Zuo, Kim Gerdes, Eric Villemonte de La Clergerie, Benoît Sagot

    Abstract: In this work, we introduce a comprehensive error typology specifically designed for evaluating two distinct tasks in machine-generated patent texts: claims-to-abstract generation, and the generation of the next claim given previous ones. We have also developed a benchmark, PatentEval, for systematically assessing language models in this context. Our study includes a comparative analysis, annotated…

    Submitted 25 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

    Journal ref: NAACL2024 - 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Jun 2024, Mexico City, Mexico

  5. arXiv:2404.07647  [pdf, other]

    cs.CL

    Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck

    Authors: Nathan Godey, Éric de la Clergerie, Benoît Sagot

    Abstract: Recent advances in language modeling consist in pretraining highly parameterized neural networks on extremely large web-mined text corpora. Training and inference with such models can be costly in practice, which incentivizes the use of smaller counterparts. However, it has been observed that smaller models can suffer from saturation, characterized as a drop in performance at some advanced point i…

    Submitted 11 April, 2024; originally announced April 2024.

  6. arXiv:2402.19406  [pdf, other]

    cs.CL cs.AI

    On the Scaling Laws of Geographical Representation in Language Models

    Authors: Nathan Godey, Éric de la Clergerie, Benoît Sagot

    Abstract: Language models have long been shown to embed geographical information in their hidden representations. This line of work has recently been revisited by extending this result to Large Language Models (LLMs). In this paper, we propose to fill the gap between well-established and recent literature by observing how geographical knowledge evolves when scaling language models. We show that geographical…

    Submitted 4 March, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

    Comments: Accepted at LREC-COLING 2024

  7. arXiv:2401.12143  [pdf, other]

    cs.CL

    Anisotropy Is Inherent to Self-Attention in Transformers

    Authors: Nathan Godey, Éric de la Clergerie, Benoît Sagot

    Abstract: The representation degeneration problem is a phenomenon that is widely observed among self-supervised learning methods based on Transformers. In NLP, it takes the form of anisotropy, a singular property of hidden representations which makes them unexpectedly close to each other in terms of angular distance (cosine-similarity). Some recent works tend to show that anisotropy is a consequence of opti…

    Submitted 24 January, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

    Comments: Proceedings of EACL 2024. A previous version of the paper, published as arXiv:2306.07656, was presented at ACL-SRW 2023 (non-archival)

  8. arXiv:2309.08351  [pdf, other]

    cs.CL

    Headless Language Models: Learning without Predicting with Contrastive Weight Tying

    Authors: Nathan Godey, Éric de la Clergerie, Benoît Sagot

    Abstract: Self-supervised pre-training of language models usually consists in predicting probability distributions over extensive token vocabularies. In this study, we propose an innovative method that shifts away from probability prediction and instead focuses on reconstructing input embeddings in a contrastive fashion via Contrastive Weight Tying (CWT). We apply this approach to pretrain Headless Languag…

    Submitted 15 September, 2023; originally announced September 2023.

  9. arXiv:2306.15550  [pdf, other]

    cs.CL cs.AI

    CamemBERT-bio: Leveraging Continual Pre-training for Cost-Effective Models on French Biomedical Data

    Authors: Rian Touchent, Laurent Romary, Eric de la Clergerie

    Abstract: Clinical data in hospitals are increasingly accessible for research through clinical data warehouses. However, these documents are unstructured, and it is therefore necessary to extract information from medical reports to conduct clinical studies. Transfer learning with BERT-like models such as CamemBERT has allowed major advances for French, especially for named entity recognition. However, these m…

    Submitted 3 April, 2024; v1 submitted 27 June, 2023; originally announced June 2023.

    Comments: Accepted to LREC-COLING 2024

  10. arXiv:2306.07656  [pdf, other]

    cs.CL

    Is Anisotropy Inherent to Transformers?

    Authors: Nathan Godey, Éric de la Clergerie, Benoît Sagot

    Abstract: The representation degeneration problem is a phenomenon that is widely observed among self-supervised learning methods based on Transformers. In NLP, it takes the form of anisotropy, a singular property of hidden representations which makes them unexpectedly close to each other in terms of angular distance (cosine-similarity). Some recent works tend to show that anisotropy is a consequence of opti…

    Submitted 13 June, 2023; originally announced June 2023.

    Comments: ACL-SRW 2023 (Poster)

  11. arXiv:2212.07284  [pdf, other]

    cs.CL

    MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling

    Authors: Nathan Godey, Roman Castagné, Éric de la Clergerie, Benoît Sagot

    Abstract: Static subword tokenization algorithms have been an essential component of recent works on language modeling. However, their static nature results in important flaws that degrade the models' downstream performance and robustness. In this work, we propose MANTa, a Module for Adaptive Neural TokenizAtion. MANTa is a differentiable tokenizer trained end-to-end with the language model. The resulting s…

    Submitted 14 December, 2022; originally announced December 2022.

    Comments: EMNLP 2022 Findings (https://aclanthology.org/2022.findings-emnlp.207/)

  12. arXiv:2104.07560  [pdf, other]

    cs.CL

    Rethinking Automatic Evaluation in Sentence Simplification

    Authors: Thomas Scialom, Louis Martin, Jacopo Staiano, Éric Villemonte de la Clergerie, Benoît Sagot

    Abstract: Automatic evaluation remains an open research question in Natural Language Generation. In the context of Sentence Simplification, this is particularly challenging: the task by nature requires replacing complex words with simpler ones that share the same meaning. This limits the effectiveness of n-gram based metrics like BLEU. Going hand in hand with the recent advances in NLG, new metrics have b…

    Submitted 16 April, 2021; v1 submitted 15 April, 2021; originally announced April 2021.

    Comments: updated affiliation and link to data

  13. arXiv:2012.01942  [pdf, other]

    cs.CL cs.AI cs.DS cs.LG

    Clustering-based Automatic Construction of Legal Entity Knowledge Base from Contracts

    Authors: Fuqi Song, Éric de la Clergerie

    Abstract: In contract analysis and contract automation, a knowledge base (KB) of legal entities is fundamental for performing tasks such as contract verification, contract generation and contract analytics. However, such a KB does not always exist, nor can it be produced in a short time. In this paper, we propose a clustering-based approach to automatically generate a reliable knowledge base of legal entities fr…

    Submitted 7 December, 2020; v1 submitted 18 November, 2020; originally announced December 2020.

    Comments: 4 pages, 3 figures

  14. arXiv:2005.00352  [pdf, other]

    cs.CL cs.LG

    MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases

    Authors: Louis Martin, Angela Fan, Éric de la Clergerie, Antoine Bordes, Benoît Sagot

    Abstract: Progress in sentence simplification has been hindered by a lack of labeled parallel simplification data, particularly in languages other than English. We introduce MUSS, a Multilingual Unsupervised Sentence Simplification system that does not require labeled simplification data. MUSS uses a novel approach to sentence simplification that trains strong models using sentence-level paraphrase data ins…

    Submitted 16 April, 2021; v1 submitted 1 May, 2020; originally announced May 2020.

  15. CamemBERT: a Tasty French Language Model

    Authors: Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, Benoît Sagot

    Abstract: Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models (in all languages except English) very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based lan…

    Submitted 21 May, 2020; v1 submitted 10 November, 2019; originally announced November 2019.

    Comments: ACL 2020 long paper. Web site: https://camembert-model.fr

    Journal ref: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, July 2020, Online

  16. arXiv:1910.02677  [pdf, other]

    cs.CL

    Controllable Sentence Simplification

    Authors: Louis Martin, Benoît Sagot, Éric de la Clergerie, Antoine Bordes

    Abstract: Text simplification aims at making a text easier to read and understand by simplifying grammar and structure while keeping the underlying information identical. It is often considered an all-purpose generic task where the same simplification is suitable for all; however, multiple audiences can benefit from simplified text in different ways. We adapt a discrete parametrization mechanism that provide…

    Submitted 20 April, 2020; v1 submitted 7 October, 2019; originally announced October 2019.

    Comments: Code and models: https://github.com/facebookresearch/access

  17. arXiv:1901.10746  [pdf, other]

    cs.CL

    Reference-less Quality Estimation of Text Simplification Systems

    Authors: Louis Martin, Samuel Humeau, Pierre-Emmanuel Mazaré, Antoine Bordes, Éric Villemonte de La Clergerie, Benoît Sagot

    Abstract: The evaluation of text simplification (TS) systems remains an open challenge. As the task has common points with machine translation (MT), TS is often evaluated using MT metrics such as BLEU. However, such metrics require high quality reference data, which is rarely available for TS. TS has the advantage over MT of being a monolingual task, which allows for direct comparisons to be made between th…

    Submitted 30 January, 2019; originally announced January 2019.

    Journal ref: 1st Workshop on Automatic Text Adaptation (ATA), Nov 2018, Tilburg, Netherlands. https://www.ida.liu.se/~evere22/ATA-18/

  18. arXiv:1111.3152  [pdf]

    cs.CL

    Évaluation de lexiques syntaxiques par leur intégration dans l'analyseur syntaxique FRMG

    Authors: Elsa Tolone, Éric De La Clergerie, Benoît Sagot

    Abstract: In this paper, we evaluate various French lexica with the parser FRMG: the Lefff, LGLex, the lexicon built from the tables of the French Lexicon-Grammar, the lexicon DICOVALENCE and a new version of the verbal entries of the Lefff, obtained by merging with DICOVALENCE and partial manual validation. For this, all these lexica have been converted to the format of the Lefff, the Alexina format. The evalu…

    Submitted 14 November, 2011; originally announced November 2011.

    Comments: 30th International Conference on Lexicon and Grammar (30ème Colloque international sur le Lexique et la Grammaire, LGC'11), Nicosia, Cyprus (2011)