DocuMint: Docstring Generation for Python using Small Language Models
Authors:
Bibek Poudel,
Adam Cook,
Sekou Traore,
Shelah Ameli
Abstract:
Effective communication, specifically through documentation, is the beating heart of collaboration among contributors in software development. Recent advancements in language models (LMs) have enabled the introduction of a new type of actor in that ecosystem: LM-powered assistants capable of code generation, optimization, and maintenance. Our study investigates the efficacy of small language models (SLMs) for generating high-quality docstrings by assessing accuracy, conciseness, and clarity. We benchmark performance quantitatively through mathematical formulas and qualitatively through human evaluation on a Likert scale. Further, we introduce DocuMint, a large-scale supervised fine-tuning dataset with 100,000 samples. In quantitative experiments, Llama 3 8B achieved the best performance across all metrics, with conciseness and clarity scores of 0.605 and 64.88, respectively. However, under human evaluation, CodeGemma 7B achieved the highest overall score, with an average of 8.3 out of 10 across all metrics. Fine-tuning the CodeGemma 2B model on the DocuMint dataset led to significant improvements in performance across all metrics, with gains of up to 22.5% in conciseness. The fine-tuned model and the dataset are available on Hugging Face, and the code is available in the repository.
Submitted 16 May, 2024;
originally announced May 2024.
MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African Languages
Authors:
Cheikh M. Bamba Dione,
David Adelani,
Peter Nabende,
Jesujoba Alabi,
Thapelo Sindane,
Happy Buzaaba,
Shamsuddeen Hassan Muhammad,
Chris Chinenye Emezue,
Perez Ogayo,
Anuoluwapo Aremu,
Catherine Gitau,
Derguene Mbaye,
Jonathan Mukiibi,
Blessing Sibanda,
Bonaventure F. P. Dossou,
Andiswa Bukula,
Rooweither Mabuya,
Allahsera Auguste Tapo,
Edwin Munkoh-Buabeng,
Victoire Memdjokam Koagne,
Fatoumata Ouoba Kabore,
Amelia Taylor,
Godson Kalipe,
Tebogo Macucwa,
Vukosi Marivate,
et al. (19 additional authors not shown)
Abstract:
In this paper, we present MasakhaPOS, the largest part-of-speech (POS) dataset for 20 typologically diverse African languages. We discuss the challenges in annotating POS for these languages using the Universal Dependencies (UD) guidelines. We conducted extensive POS baseline experiments using conditional random fields and several multilingual pre-trained language models. We applied various cross-lingual transfer models trained with data available in UD. Evaluating on the MasakhaPOS dataset, we show that choosing the best transfer language(s) in both single-source and multi-source setups greatly improves the POS tagging performance of the target languages, in particular when combined with cross-lingual parameter-efficient fine-tuning methods. Crucially, transferring knowledge from a language that matches the target's language family and morphosyntactic properties appears to be more effective for POS tagging in unseen languages.
Submitted 23 May, 2023;
originally announced May 2023.