Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian

Hoffmann, Michael; John, Jophin; Schweter, Stefan; Ramakrishnan, Gokul; Mak, Hoi-Fong; Zhang, Alice; Gaynullin, Dmitry; Hammer, Nicolay J.

Computer Science > Computation and Language

arXiv:2509.05668 (cs)

[Submitted on 6 Sep 2025]

Abstract: We present Llama-GENBA-10B, a trilingual foundation model addressing English-centric bias in large language models. Built on Llama 3.1-8B and scaled to 10B parameters, Llama-GENBA-10B is continuously pretrained on 164B tokens (82B English, 82B German, and 80M Bavarian), balancing resources while preventing English dominance. Targeted at the German NLP community, the model also promotes Bavarian as a low-resource language. Development tackled four challenges: (1) curating a multilingual corpus despite Bavarian scarcity, (2) creating a unified tokenizer for English, German, and Bavarian, (3) optimizing architecture and language-ratio hyperparameters for cross-lingual transfer, and (4) establishing the first standardized trilingual evaluation suite by translating German benchmarks into Bavarian. Evaluations show that Llama-GENBA-10B achieves strong cross-lingual performance, with the fine-tuned variant surpassing Apertus-8B-2509 and gemma-2-9b in Bavarian and establishing itself as the best model in its class for this language, while also outperforming EuroLLM in English and matching its results in German. Training on the Cerebras CS-2 demonstrated efficient large-scale multilingual pretraining with documented energy use, offering a blueprint for inclusive foundation models that integrate low-resource languages.

Comments:	Michael Hoffmann and Jophin John contributed equally to this work
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2509.05668 [cs.CL]
	(or arXiv:2509.05668v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2509.05668

Submission history

From: Michael Hoffmann
[v1] Sat, 6 Sep 2025 10:12:52 UTC (490 KB)
