Showing 1–2 of 2 results for author: Tastet, J

Searching in archive cs.

  1. arXiv:2409.17312

    cs.CL cs.LG

    BabyLlama-2: Ensemble-Distilled Models Consistently Outperform Teachers With Limited Data

    Authors: Jean-Loup Tastet, Inar Timiryasov

    Abstract: We present BabyLlama-2, a 345 million parameter model distillation-pretrained from two teachers on a 10 million word corpus for the BabyLM competition. On BLiMP and SuperGLUE benchmarks, BabyLlama-2 outperforms baselines trained on both 10 and 100 million word datasets with the same data mix, as well as its teacher models. Through an extensive hyperparameter sweep, we demonstrate that the advantag…

    Submitted 25 September, 2024; originally announced September 2024.

    Comments: 9 pages, 3 figures, 5 tables, submitted to the BabyLM Challenge (CoNLL 2024 Shared Task)

    ACM Class: I.2.7

  2. arXiv:2308.02019

    cs.CL

    Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty

    Authors: Inar Timiryasov, Jean-Loup Tastet

    Abstract: We present our submission to the BabyLM challenge, whose goal was to improve the sample efficiency of language models. We trained an ensemble consisting of a GPT-2 and small LLaMA models on the developmentally-plausible, 10M-word BabyLM dataset, then distilled it into a small, 58M-parameter LLaMA model, which exceeds in performance both of its teachers as well as a similar model trained without di…

    Submitted 24 October, 2023; v1 submitted 3 August, 2023; originally announced August 2023.

    Comments: 11 pages, 4 figures, 4 tables, submitted to the BabyLM Challenge and accepted as archival full paper (CoNLL–CMCL 2023 Shared Task), checkpoint available at https://huggingface.co/timinar/baby-llama-58m, training code available at https://github.com/timinar/BabyLlama

    ACM Class: I.2.7