Showing 1–1 of 1 results for author: Mak, W W Y

  1. arXiv:2410.18836  [pdf, other]

    cs.CL cs.AI

    From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages

    Authors: Artur Kiulian, Anton Polishko, Mykola Khandoga, Yevhen Kostiuk, Guillermo Gabrielli, Łukasz Gagała, Fadi Zaraket, Qusai Abu Obaida, Hrishikesh Garud, Wendy Wing Yee Mak, Dmytro Chaplynskyi, Selma Belhadj Amor, Grigol Peradze

    Abstract: In this paper, we propose a model-agnostic, cost-effective approach to developing bilingual base large language models (LLMs) to support English and any target language. The method includes vocabulary expansion, initialization of new embeddings, model training, and evaluation. We performed our experiments with three languages, each using a non-Latin script: Ukrainian, Arabic, and Georgian. Our ap…

    Submitted 24 October, 2024; originally announced October 2024.