-
Contextual Text Embeddings for Twi
Authors:
Paul Azunre,
Salomey Osei,
Salomey Addo,
Lawrence Asamoah Adu-Gyamfi,
Stephen Moore,
Bernard Adabankah,
Bernard Opoku,
Clara Asare-Nyarko,
Samuel Nyarko,
Cynthia Amoaba,
Esther Dansoa Appiah,
Felix Akwerh,
Richard Nii Lante Lawson,
Joel Budu,
Emmanuel Debrah,
Nana Boateng,
Wisdom Ofori,
Edwin Buabeng-Munkoh,
Franklin Adjei,
Isaac Kojo Essel Ampomah,
Joseph Otoo,
Reindorf Borkor,
Standylove Birago Mensah,
Lucien Mensah,
Mark Amoako Marcel,
et al. (2 additional authors not shown)
Abstract:
Transformer-based language models have been changing the modern Natural Language Processing (NLP) landscape for high-resource languages such as English, Chinese, and Russian. However, this technology does not yet exist for any Ghanaian language. In this paper, we introduce the first such models for Twi, or Akan, the most widely spoken Ghanaian language. The specific contribution of this research work is the development of several pretrained transformer language models for the Akuapem and Asante dialects of Twi, paving the way for advances in application areas such as Named Entity Recognition (NER), Neural Machine Translation (NMT), Sentiment Analysis (SA) and Part-of-Speech (POS) tagging. Specifically, we introduce four different flavours of ABENA (A BERT model Now in Akan), which is fine-tuned on a set of Akan corpora, and BAKO (BERT with Akan Knowledge only), which is trained from scratch. We open-source the model through the Hugging Face model hub and demonstrate its use via a simple sentiment classification example.
Submitted 31 March, 2021; v1 submitted 29 March, 2021;
originally announced March 2021.
-
English-Twi Parallel Corpus for Machine Translation
Authors:
Paul Azunre,
Salomey Osei,
Salomey Addo,
Lawrence Asamoah Adu-Gyamfi,
Stephen Moore,
Bernard Adabankah,
Bernard Opoku,
Clara Asare-Nyarko,
Samuel Nyarko,
Cynthia Amoaba,
Esther Dansoa Appiah,
Felix Akwerh,
Richard Nii Lante Lawson,
Joel Budu,
Emmanuel Debrah,
Nana Boateng,
Wisdom Ofori,
Edwin Buabeng-Munkoh,
Franklin Adjei,
Isaac Kojo Essel Ampomah,
Joseph Otoo,
Reindorf Borkor,
Standylove Birago Mensah,
Lucien Mensah,
Mark Amoako Marcel,
et al. (2 additional authors not shown)
Abstract:
We present a parallel machine translation training corpus for English and Akuapem Twi of 25,421 sentence pairs. We used a transformer-based translator to generate initial translations in Akuapem Twi, which were later verified and corrected where necessary by native speakers to eliminate any occurrence of translationese. In addition, 697 higher-quality crowd-sourced sentences are provided for use as an evaluation set for downstream Natural Language Processing (NLP) tasks. The typical use case for the larger human-verified dataset is further training of machine translation models in Akuapem Twi. The higher-quality crowd-sourced set of 697 sentences is recommended as a test set for English-to-Twi and Twi-to-English machine translation models. Furthermore, the Twi part of the crowd-sourced data may also be used for other tasks, such as representation learning and classification. We fine-tune the transformer translation model on the training corpus and report benchmarks on the crowd-sourced test set.
Submitted 1 April, 2021; v1 submitted 29 March, 2021;
originally announced March 2021.
-
NLP for Ghanaian Languages
Authors:
Paul Azunre,
Salomey Osei,
Salomey Addo,
Lawrence Asamoah Adu-Gyamfi,
Stephen Moore,
Bernard Adabankah,
Bernard Opoku,
Clara Asare-Nyarko,
Samuel Nyarko,
Cynthia Amoaba,
Esther Dansoa Appiah,
Felix Akwerh,
Richard Nii Lante Lawson,
Joel Budu,
Emmanuel Debrah,
Nana Boateng,
Wisdom Ofori,
Edwin Buabeng-Munkoh,
Franklin Adjei,
Isaac Kojo Essel Ampomah,
Joseph Otoo,
Reindorf Borkor,
Standylove Birago Mensah,
Lucien Mensah,
Mark Amoako Marcel,
et al. (2 additional authors not shown)
Abstract:
NLP Ghana is an open-source non-profit organization aiming to advance the development and adoption of state-of-the-art NLP techniques and digital language tools for Ghanaian languages and problems. In this paper, we first present the motivation and necessity for the efforts of the organization by introducing some popular Ghanaian languages and presenting the state of NLP in Ghana. We then present the NLP Ghana organization and outline its aims, scope of work, some of the methods employed and the contributions made thus far to the NLP community in Ghana.
Submitted 1 April, 2021; v1 submitted 29 March, 2021;
originally announced March 2021.
-
Gated Task Interaction Framework for Multi-task Sequence Tagging
Authors:
Isaac K. E. Ampomah,
Sally McClean,
Zhiwei Lin,
Glenn Hawe
Abstract:
Recent studies have shown that neural models can achieve high performance on several sequence labelling/tagging problems without the explicit use of linguistic features such as part-of-speech (POS) tags. These models are trained only using the character-level and the word embedding vectors as inputs. Others have shown that linguistic features can improve the performance of neural models on tasks such as chunking and named entity recognition (NER). However, the change in performance depends on the degree of semantic relatedness between the linguistic features and the target task; in some instances, linguistic features can have a negative impact on performance. This paper presents an approach to jointly learn these linguistic features along with the target sequence labelling tasks with a new multi-task learning (MTL) framework called Gated Tasks Interaction (GTI) network for solving multiple sequence tagging tasks. The GTI network exploits the relations between the multiple tasks via neural gate modules. These gate modules control the flow of information between the different tasks. Experiments on benchmark datasets for chunking and NER show that our framework outperforms other competitive baselines trained with and without external training resources.
Submitted 28 September, 2019;
originally announced September 2019.
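The abstract above describes gate modules that control how much information flows from one task to another. The following is only a minimal sketch of that general idea, not the GTI network itself: the function `gated_mix`, the additive combination, and the toy vectors are all assumptions made for illustration, since the listing does not give the paper's equations.

```python
import math


def sigmoid(x):
    """Logistic function, used here as the gate activation."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_mix(h_target, h_aux, weights, bias):
    """Blend an auxiliary-task vector into a target-task vector through a gate.

    Assumed (hypothetical) formulation, one gate value per dimension:
        gate_i = sigmoid(weights[i] . [h_target; h_aux] + bias[i])
        out_i  = h_target[i] + gate_i * h_aux[i]
    """
    joint = h_target + h_aux  # list concatenation of the two task vectors
    out = []
    for i, (t, a) in enumerate(zip(h_target, h_aux)):
        score = sum(w * x for w, x in zip(weights[i], joint)) + bias[i]
        out.append(t + sigmoid(score) * a)
    return out


# Toy hidden states for one token: a target task (e.g. NER) and an
# auxiliary task (e.g. POS tagging). Values are made up.
h_ner = [0.5, -1.0]
h_pos = [2.0, 4.0]
zero_w = [[0.0] * 4, [0.0] * 4]

# A large positive bias opens the gate; a large negative bias closes it.
open_gate = gated_mix(h_ner, h_pos, zero_w, bias=[50.0, 50.0])
closed_gate = gated_mix(h_ner, h_pos, zero_w, bias=[-50.0, -50.0])
```

With the gate open, the auxiliary vector is added almost in full; with it closed, the target vector passes through nearly unchanged. In a trained model the weights and biases would be learned, so the gate could suppress auxiliary features when they hurt the target task.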
-
On the Performance of Filters for Reduction of Speckle Noise in SAR Images off the Coast of the Gulf of Guinea
Authors:
Griffith S. Klogo,
Akpeko Gasonoo,
Isaac K. E. Ampomah
Abstract:
The use of Synthetic Aperture Radar (SAR) imagery to monitor oil spills is among the methods that have been proposed for the West African sub-region. With the increase in the number of oil exploration companies in Ghana (and her neighbors) and the rise in coastal activities in the sub-region, there is a need for proper monitoring of the environmental impact of these socio-economic activities. Detection of oil spills and near real-time information about them are fundamental to reducing their environmental impact. SAR images are prone to noise, predominantly speckle noise around the coastal areas. This paper evaluates the performance of the mean and median filters used in the preprocessing stage of most image processing algorithms to reduce speckle noise in SAR images.
Submitted 9 December, 2013;
originally announced December 2013.
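The mean and median filters named in the abstract above can be illustrated on a toy example. This is a minimal sketch, not the paper's evaluation: the helper `window_filter`, the 3x3 window, the border handling, and the 5x5 patch are assumptions made for illustration. A single bright outlier stands in for a noisy pixel; real speckle is multiplicative and would need a proper noise model.

```python
from statistics import mean, median


def window_filter(image, reducer, size=3):
    """Apply `reducer` over a size x size neighbourhood of every pixel.

    Border pixels use only the part of the window that lies inside the image.
    """
    rows, cols = len(image), len(image[0])
    half = size // 2
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
            ]
            out_row.append(reducer(window))
        out.append(out_row)
    return out


# Hypothetical 5x5 patch: a flat region of intensity 10 with one bright outlier.
patch = [[10] * 5 for _ in range(5)]
patch[2][2] = 200

mean_filtered = window_filter(patch, mean)      # outlier is smeared over neighbours
median_filtered = window_filter(patch, median)  # outlier is rejected outright
```

On this patch the median filter restores the flat background everywhere, while the mean filter spreads the outlier's energy across the surrounding 3x3 neighbourhood. That difference in behaviour is the kind of trade-off a comparison of the two filters would measure.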