Showing 1–2 of 2 results for author: Thior, F

  1. arXiv:2502.15916  [pdf, ps, other]

    cs.CL

    The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages

    Authors: Jenalea Rajab, Anuoluwapo Aremu, Everlyn Asiko Chimoto, Dale Dunbar, Graham Morrissey, Fadel Thior, Luandrie Potgieter, Jessica Ojo, Atnafu Lambebo Tonja, Maushami Chetty, Wilhelmina Ndapewa Onyothi Nekoto, Pelonomi Moiloa, Jade Abbott, Vukosi Marivate, Benjamin Rosman

    Abstract: This paper presents the Esethu Framework, a sustainable data curation framework specifically designed to empower local communities and ensure equitable benefit-sharing from their linguistic resource. This framework is supported by the Esethu license, a novel community-centric data license. As a proof of concept, we introduce the Vuk'uzenzele isiXhosa Speech Dataset (ViXSD), an open-source corpus d…

    Submitted 12 June, 2025; v1 submitted 21 February, 2025; originally announced February 2025.

  2. arXiv:2408.17024  [pdf, other]

    cs.CL

    InkubaLM: A small language model for low-resource African languages

    Authors: Atnafu Lambebo Tonja, Bonaventure F. P. Dossou, Jessica Ojo, Jenalea Rajab, Fadel Thior, Eric Peter Wairagala, Anuoluwapo Aremu, Pelonomi Moiloa, Jade Abbott, Vukosi Marivate, Benjamin Rosman

    Abstract: High-resource language models often fall short in the African context, where there is a critical need for models that are efficient, accessible, and locally relevant, even amidst significant computing and data constraints. This paper introduces InkubaLM, a small language model with 0.4 billion parameters, which achieves performance comparable to models with significantly larger parameter counts an…

    Submitted 3 September, 2024; v1 submitted 30 August, 2024; originally announced August 2024.