-
BabyLlama-2: Ensemble-Distilled Models Consistently Outperform Teachers With Limited Data
Abstract: We present BabyLlama-2, a 345 million parameter model distillation-pretrained from two teachers on a 10 million word corpus for the BabyLM competition. On BLiMP and SuperGLUE benchmarks, BabyLlama-2 outperforms baselines trained on both 10 and 100 million word datasets with the same data mix, as well as its teacher models. Through an extensive hyperparameter sweep, we demonstrate that the advantag…
Submitted 25 September, 2024; originally announced September 2024.
Comments: 9 pages, 3 figures, 5 tables, submitted to the BabyLM Challenge (CoNLL 2024 Shared Task)
ACM Class: I.2.7
-
Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty
Abstract: We present our submission to the BabyLM challenge, whose goal was to improve the sample efficiency of language models. We trained an ensemble consisting of a GPT-2 and small LLaMA models on the developmentally-plausible, 10M-word BabyLM dataset, then distilled it into a small, 58M-parameter LLaMA model, which exceeds in performance both of its teachers as well as a similar model trained without di…
Submitted 24 October, 2023; v1 submitted 3 August, 2023; originally announced August 2023.
Comments: 11 pages, 4 figures, 4 tables, submitted to the BabyLM Challenge and accepted as archival full paper (CoNLL–CMCL 2023 Shared Task), checkpoint available at https://huggingface.co/timinar/baby-llama-58m, training code available at https://github.com/timinar/BabyLlama
ACM Class: I.2.7