-
Hybrid Extractive Abstractive Summarization for Multilingual Sentiment Analysis
Authors:
Mikhail Krasitskii,
Grigori Sidorov,
Olga Kolesnikova,
Liliana Chanona Hernandez,
Alexander Gelbukh
Abstract:
We propose a hybrid approach for multilingual sentiment analysis that combines extractive and abstractive summarization to address the limitations of standalone methods. The model integrates TF-IDF-based extraction with a fine-tuned XLM-R abstractive module, enhanced by dynamic thresholding and cultural adaptation. Experiments across 10 languages show significant improvements over baselines, achie…
▽ More
We propose a hybrid approach for multilingual sentiment analysis that combines extractive and abstractive summarization to address the limitations of standalone methods. The model integrates TF-IDF-based extraction with a fine-tuned XLM-R abstractive module, enhanced by dynamic thresholding and cultural adaptation. Experiments across 10 languages show significant improvements over baselines, achieving 0.90 accuracy for English and 0.84 for low-resource languages. The approach also demonstrates 22% greater computational efficiency than traditional methods. Practical applications include real-time brand monitoring and cross-cultural discourse analysis. Future work will focus on optimization for low-resource languages via 8-bit quantization.
△ Less
Submitted 7 June, 2025;
originally announced June 2025.
-
Explainable AI: XAI-Guided Context-Aware Data Augmentation
Authors:
Melkamu Abay Mersha,
Mesay Gemeda Yigezu,
Atnafu Lambebo Tonja,
Hassan Shakil,
Samer Iskander,
Olga Kolesnikova,
Jugal Kalita
Abstract:
Explainable AI (XAI) has emerged as a powerful tool for improving the performance of AI models, going beyond providing model transparency and interpretability. The scarcity of labeled data remains a fundamental challenge in developing robust and generalizable AI models, particularly for low-resource languages. Conventional data augmentation techniques introduce noise, cause semantic drift, disrupt…
▽ More
Explainable AI (XAI) has emerged as a powerful tool for improving the performance of AI models, going beyond providing model transparency and interpretability. The scarcity of labeled data remains a fundamental challenge in developing robust and generalizable AI models, particularly for low-resource languages. Conventional data augmentation techniques introduce noise, cause semantic drift, disrupt contextual coherence, lack control, and lead to overfitting. To address these challenges, we propose XAI-Guided Context-Aware Data Augmentation. This novel framework leverages XAI techniques to modify less critical features while selectively preserving most task-relevant features. Our approach integrates an iterative feedback loop, which refines augmented data over multiple augmentation cycles based on explainability-driven insights and the model performance gain. Our experimental results demonstrate that XAI-SR-BT and XAI-PR-BT improve the accuracy of models on hate speech and sentiment analysis tasks by 6.6% and 8.1%, respectively, compared to the baseline, using the Amharic dataset with the XLM-R model. XAI-SR-BT and XAI-PR-BT outperform existing augmentation techniques by 4.8% and 5%, respectively, on the same dataset and model. Overall, XAI-SR-BT and XAI-PR-BT consistently outperform both baseline and conventional augmentation techniques across all tasks and models. This study provides a more controlled, interpretable, and context-aware solution to data augmentation, addressing critical limitations of existing augmentation techniques and offering a new paradigm shift for leveraging XAI techniques to enhance AI model training.
△ Less
Submitted 3 June, 2025;
originally announced June 2025.
-
GS_DravidianLangTech@2025: Women Targeted Abusive Texts Detection on Social Media
Authors:
Girma Yohannis Bade,
Zahra Ahani,
Olga Kolesnikova,
José Luis Oropeza,
Grigori Sidorov
Abstract:
The increasing misuse of social media has become a concern; however, technological solutions are being developed to moderate its content effectively. This paper focuses on detecting abusive texts targeting women on social media platforms. Abusive speech refers to communication intended to harm or incite hatred against vulnerable individuals or groups. Specifically, this study aims to identify abus…
▽ More
The increasing misuse of social media has become a concern; however, technological solutions are being developed to moderate its content effectively. This paper focuses on detecting abusive texts targeting women on social media platforms. Abusive speech refers to communication intended to harm or incite hatred against vulnerable individuals or groups. Specifically, this study aims to identify abusive language directed toward women. To achieve this, we utilized logistic regression and BERT as base models to train datasets sourced from DravidianLangTech@2025 for Tamil and Malayalam languages. The models were evaluated on test datasets, resulting in a 0.729 macro F1 score for BERT and 0.6279 for logistic regression in Tamil and Malayalam, respectively.
△ Less
Submitted 31 March, 2025;
originally announced April 2025.
-
Multilingual Sentiment Analysis of Summarized Texts: A Cross-Language Study of Text Shortening Effects
Authors:
Mikhail Krasitskii,
Grigori Sidorov,
Olga Kolesnikova,
Liliana Chanona Hernandez,
Alexander Gelbukh
Abstract:
Summarization significantly impacts sentiment analysis across languages with diverse morphologies. This study examines extractive and abstractive summarization effects on sentiment classification in English, German, French, Spanish, Italian, Finnish, Hungarian, and Arabic. We assess sentiment shifts post-summarization using multilingual transformers (mBERT, XLM-RoBERTa, T5, and BART) and language-…
▽ More
Summarization significantly impacts sentiment analysis across languages with diverse morphologies. This study examines extractive and abstractive summarization effects on sentiment classification in English, German, French, Spanish, Italian, Finnish, Hungarian, and Arabic. We assess sentiment shifts post-summarization using multilingual transformers (mBERT, XLM-RoBERTa, T5, and BART) and language-specific models (FinBERT, AraBERT). Results show extractive summarization better preserves sentiment, especially in morphologically complex languages, while abstractive summarization improves readability but introduces sentiment distortion, affecting sentiment accuracy. Languages with rich inflectional morphology, such as Finnish, Hungarian, and Arabic, experience greater accuracy drops than English or German. Findings emphasize the need for language-specific adaptations in sentiment analysis and propose a hybrid summarization approach balancing readability and sentiment preservation. These insights benefit multilingual sentiment applications, including social media monitoring, market analysis, and cross-lingual opinion mining.
△ Less
Submitted 31 March, 2025;
originally announced April 2025.
-
Advancing Sentiment Analysis in Tamil-English Code-Mixed Texts: Challenges and Transformer-Based Solutions
Authors:
Mikhail Krasitskii,
Olga Kolesnikova,
Liliana Chanona Hernandez,
Grigori Sidorov,
Alexander Gelbukh
Abstract:
The sentiment analysis task in Tamil-English code-mixed texts has been explored using advanced transformer-based models. Challenges from grammatical inconsistencies, orthographic variations, and phonetic ambiguities have been addressed. The limitations of existing datasets and annotation gaps have been examined, emphasizing the need for larger and more diverse corpora. Transformer architectures, i…
▽ More
The sentiment analysis task in Tamil-English code-mixed texts has been explored using advanced transformer-based models. Challenges from grammatical inconsistencies, orthographic variations, and phonetic ambiguities have been addressed. The limitations of existing datasets and annotation gaps have been examined, emphasizing the need for larger and more diverse corpora. Transformer architectures, including XLM-RoBERTa, mT5, IndicBERT, and RemBERT, have been evaluated in low-resource, code-mixed environments. Performance metrics have been analyzed, highlighting the effectiveness of specific models in handling multilingual sentiment classification. The findings suggest that further advancements in data augmentation, phonetic normalization, and hybrid modeling approaches are required to enhance accuracy. Future research directions for improving sentiment analysis in code-mixed texts have been proposed.
△ Less
Submitted 29 March, 2025;
originally announced March 2025.
-
Enhancing Multi-Label Emotion Analysis and Corresponding Intensities for Ethiopian Languages
Authors:
Tadesse Destaw Belay,
Dawit Ketema Gete,
Abinew Ali Ayele,
Olga Kolesnikova,
Grigori Sidorov,
Seid Muhie Yimam
Abstract:
In this digital world, people freely express their emotions using different social media platforms. As a result, modeling and integrating emotion-understanding models are vital for various human-computer interaction tasks such as decision-making, product and customer feedback analysis, political promotions, marketing research, and social media monitoring. As users express different emotions simult…
▽ More
In this digital world, people freely express their emotions using different social media platforms. As a result, modeling and integrating emotion-understanding models are vital for various human-computer interaction tasks such as decision-making, product and customer feedback analysis, political promotions, marketing research, and social media monitoring. As users express different emotions simultaneously in a single instance, annotating emotions in a multilabel setting such as the EthioEmo (Belay et al., 2025) dataset effectively captures this dynamic. Additionally, incorporating intensity, or the degree of emotion, is crucial, as emotions can significantly differ in their expressive strength and impact. This intensity is significant for assessing whether further action is necessary in decision-making processes, especially concerning negative emotions in applications such as healthcare and mental health studies. To enhance the EthioEmo dataset, we include annotations for the intensity of each labeled emotion. Furthermore, we evaluate various state-of-the-art encoder-only Pretrained Language Models (PLMs) and decoder-only Large Language Models (LLMs) to provide comprehensive benchmarking.
△ Less
Submitted 23 March, 2025;
originally announced March 2025.
-
CULEMO: Cultural Lenses on Emotion -- Benchmarking LLMs for Cross-Cultural Emotion Understanding
Authors:
Tadesse Destaw Belay,
Ahmed Haj Ahmed,
Alvin Grissom II,
Iqra Ameer,
Grigori Sidorov,
Olga Kolesnikova,
Seid Muhie Yimam
Abstract:
NLP research has increasingly focused on subjective tasks such as emotion analysis. However, existing emotion benchmarks suffer from two major shortcomings: (1) they largely rely on keyword-based emotion recognition, overlooking crucial cultural dimensions required for deeper emotion understanding, and (2) many are created by translating English-annotated data into other languages, leading to pote…
▽ More
NLP research has increasingly focused on subjective tasks such as emotion analysis. However, existing emotion benchmarks suffer from two major shortcomings: (1) they largely rely on keyword-based emotion recognition, overlooking crucial cultural dimensions required for deeper emotion understanding, and (2) many are created by translating English-annotated data into other languages, leading to potentially unreliable evaluation. To address these issues, we introduce Cultural Lenses on Emotion (CuLEmo), the first benchmark designed to evaluate culture-aware emotion prediction across six languages: Amharic, Arabic, English, German, Hindi, and Spanish. CuLEmo comprises 400 crafted questions per language, each requiring nuanced cultural reasoning and understanding. We use this benchmark to evaluate several state-of-the-art LLMs on culture-aware emotion prediction and sentiment analysis tasks. Our findings reveal that (1) emotion conceptualizations vary significantly across languages and cultures, (2) LLMs performance likewise varies by language and cultural context, and (3) prompting in English with explicit country context often outperforms in-language prompts for culture-aware emotion and sentiment understanding. The dataset and evaluation code are publicly available.
△ Less
Submitted 27 May, 2025; v1 submitted 11 March, 2025;
originally announced March 2025.
-
Online Social Support Detection in Spanish Social Media Texts
Authors:
Moein Shahiki Tash,
Luis Ramos,
Zahra Ahani,
Raul Monroy,
Olga kolesnikova,
Hiram Calvo,
Grigori Sidorov
Abstract:
The advent of social media has transformed communication, enabling individuals to share their experiences, seek support, and participate in diverse discussions. While extensive research has focused on identifying harmful content like hate speech, the recognition and promotion of positive and supportive interactions remain largely unexplored. This study proposes an innovative approach to detecting…
▽ More
The advent of social media has transformed communication, enabling individuals to share their experiences, seek support, and participate in diverse discussions. While extensive research has focused on identifying harmful content like hate speech, the recognition and promotion of positive and supportive interactions remain largely unexplored. This study proposes an innovative approach to detecting online social support in Spanish-language social media texts. We introduce the first annotated dataset specifically created for this task, comprising 3,189 YouTube comments classified as supportive or non-supportive. To address data imbalance, we employed GPT-4o to generate paraphrased comments and create a balanced dataset. We then evaluated social support classification using traditional machine learning models, deep learning architectures, and transformer-based models, including GPT-4o, but only on the unbalanced dataset. Subsequently, we utilized a transformer model to compare the performance between the balanced and unbalanced datasets. Our findings indicate that the balanced dataset yielded improved results for Task 2 (Individual and Group) and Task 3 (Nation, Other, LGBTQ, Black Community, Women, Religion), whereas GPT-4o performed best for Task 1 (Social Support and Non-Support). This study highlights the significance of fostering a supportive online environment and lays the groundwork for future research in automated social support detection.
△ Less
Submitted 9 February, 2025;
originally announced February 2025.
-
Comparative Approaches to Sentiment Analysis Using Datasets in Major European and Arabic Languages
Authors:
Mikhail Krasitskii,
Olga Kolesnikova,
Liliana Chanona Hernandez,
Grigori Sidorov,
Alexander Gelbukh
Abstract:
This study explores transformer-based models such as BERT, mBERT, and XLM-R for multi-lingual sentiment analysis across diverse linguistic structures. Key contributions include the identification of XLM-R superior adaptability in morphologically complex languages, achieving accuracy levels above 88%. The work highlights fine-tuning strategies and emphasizes their significance for improving sentime…
▽ More
This study explores transformer-based models such as BERT, mBERT, and XLM-R for multi-lingual sentiment analysis across diverse linguistic structures. Key contributions include the identification of XLM-R superior adaptability in morphologically complex languages, achieving accuracy levels above 88%. The work highlights fine-tuning strategies and emphasizes their significance for improving sentiment classification in underrepresented languages.
△ Less
Submitted 21 January, 2025;
originally announced January 2025.
-
Advanced Machine Learning Techniques for Social Support Detection on Social Media
Authors:
Olga Kolesnikova,
Moein Shahiki Tash,
Zahra Ahani,
Ameeta Agrawal,
Raul Monroy,
Grigori Sidorov
Abstract:
The widespread use of social media highlights the need to understand its impact, particularly the role of online social support. This study uses a dataset focused on online social support, which includes binary and multiclass classifications of social support content on social media. The classification of social support is divided into three tasks. The first task focuses on distinguishing between…
▽ More
The widespread use of social media highlights the need to understand its impact, particularly the role of online social support. This study uses a dataset focused on online social support, which includes binary and multiclass classifications of social support content on social media. The classification of social support is divided into three tasks. The first task focuses on distinguishing between supportive and non-supportive. The second task aims to identify whether the support is directed toward an individual or a group. The third task categorizes the specific type of social support, grouping it into categories such as Nation, LGBTQ, Black people, Women, Religion, and Other (if it does not fit into the previously mentioned categories). To address data imbalances in these tasks, we employed K-means clustering for balancing the dataset and compared the results with the original unbalanced data. Using advanced machine learning techniques, including transformers and zero-shot learning approaches with GPT3, GPT4, and GPT4-o, we predict social support levels in various contexts. The effectiveness of the dataset is evaluated using baseline models across different learning approaches, with transformer-based methods demonstrating superior performance. Additionally, we achieved a 0.4\% increase in the macro F1 score for the second task and a 0.7\% increase for the third task, compared to previous work utilizing traditional machine learning with psycholinguistic and unigram-based TF-IDF values.
△ Less
Submitted 6 January, 2025;
originally announced January 2025.
-
Evaluating the Capabilities of Large Language Models for Multi-label Emotion Understanding
Authors:
Tadesse Destaw Belay,
Israel Abebe Azime,
Abinew Ali Ayele,
Grigori Sidorov,
Dietrich Klakow,
Philipp Slusallek,
Olga Kolesnikova,
Seid Muhie Yimam
Abstract:
Large Language Models (LLMs) show promising learning and reasoning abilities. Compared to other NLP tasks, multilingual and multi-label emotion evaluation tasks are under-explored in LLMs. In this paper, we present EthioEmo, a multi-label emotion classification dataset for four Ethiopian languages, namely, Amharic (amh), Afan Oromo (orm), Somali (som), and Tigrinya (tir). We perform extensive expe…
▽ More
Large Language Models (LLMs) show promising learning and reasoning abilities. Compared to other NLP tasks, multilingual and multi-label emotion evaluation tasks are under-explored in LLMs. In this paper, we present EthioEmo, a multi-label emotion classification dataset for four Ethiopian languages, namely, Amharic (amh), Afan Oromo (orm), Somali (som), and Tigrinya (tir). We perform extensive experiments with an additional English multi-label emotion dataset from SemEval 2018 Task 1. Our evaluation includes encoder-only, encoder-decoder, and decoder-only language models. We compare zero and few-shot approaches of LLMs to fine-tuning smaller language models. The results show that accurate multi-label emotion classification is still insufficient even for high-resource languages such as English, and there is a large gap between the performance of high-resource and low-resource languages. The results also show varying performance levels depending on the language and model type. EthioEmo is available publicly to further improve the understanding of emotions in language models and how people convey emotions through various languages.
△ Less
Submitted 3 January, 2025; v1 submitted 17 December, 2024;
originally announced December 2024.
-
Stress Detection on Code-Mixed Texts in Dravidian Languages using Machine Learning
Authors:
L. Ramos,
M. Shahiki-Tash,
Z. Ahani,
A. Eponon,
O. Kolesnikova,
H. Calvo
Abstract:
Stress is a common feeling in daily life, but it can affect mental well-being in some situations, the development of robust detection models is imperative. This study introduces a methodical approach to the stress identification in code-mixed texts for Dravidian languages. The challenge encompassed two datasets, targeting Tamil and Telugu languages respectively. This proposal underscores the impor…
▽ More
Stress is a common feeling in daily life, but it can affect mental well-being in some situations, the development of robust detection models is imperative. This study introduces a methodical approach to the stress identification in code-mixed texts for Dravidian languages. The challenge encompassed two datasets, targeting Tamil and Telugu languages respectively. This proposal underscores the importance of using uncleaned text as a benchmark to refine future classification methodologies, incorporating diverse preprocessing techniques. Random Forest algorithm was used, featuring three textual representations: TF-IDF, Uni-grams of words, and a composite of (1+2+3)-Grams of characters. The approach achieved a good performance for both linguistic categories, achieving a Macro F1-score of 0.734 in Tamil and 0.727 in Telugu, overpassing results achieved with different complex techniques such as FastText and Transformer models. The results underscore the value of uncleaned data for mental state detection and the challenges classifying code-mixed texts for stress, indicating the potential for improved performance through cleaning data, other preprocessing techniques, or more complex models.
△ Less
Submitted 8 October, 2024;
originally announced October 2024.
-
Ethio-Fake: Cutting-Edge Approaches to Combat Fake News in Under-Resourced Languages Using Explainable AI
Authors:
Mesay Gemeda Yigezu,
Melkamu Abay Mersha,
Girma Yohannis Bade,
Jugal Kalita,
Olga Kolesnikova,
Alexander Gelbukh
Abstract:
The proliferation of fake news has emerged as a significant threat to the integrity of information dissemination, particularly on social media platforms. Misinformation can spread quickly due to the ease of creating and disseminating content, affecting public opinion and sociopolitical events. Identifying false information is therefore essential to reducing its negative consequences and maintainin…
▽ More
The proliferation of fake news has emerged as a significant threat to the integrity of information dissemination, particularly on social media platforms. Misinformation can spread quickly due to the ease of creating and disseminating content, affecting public opinion and sociopolitical events. Identifying false information is therefore essential to reducing its negative consequences and maintaining the reliability of online news sources. Traditional approaches to fake news detection often rely solely on content-based features, overlooking the crucial role of social context in shaping the perception and propagation of news articles. In this paper, we propose a comprehensive approach that integrates social context-based features with news content features to enhance the accuracy of fake news detection in under-resourced languages. We perform several experiments utilizing a variety of methodologies, including traditional machine learning, neural networks, ensemble learning, and transfer learning. Assessment of the outcomes of the experiments shows that the ensemble learning approach has the highest accuracy, achieving a 0.99 F1 score. Additionally, when compared with monolingual models, the fine-tuned model with the target language outperformed others, achieving a 0.94 F1 score. We analyze the functioning of the models, considering the important features that contribute to model performance, using explainable AI techniques.
△ Less
Submitted 3 October, 2024;
originally announced October 2024.
-
Exploring Sentiment Dynamics and Predictive Behaviors in Cryptocurrency Discussions by Few-Shot Learning with Large Language Models
Authors:
Moein Shahiki Tash,
Zahra Ahani,
Mohim Tash,
Olga Kolesnikova,
Grigori Sidorov
Abstract:
This study performs analysis of Predictive statements, Hope speech, and Regret Detection behaviors within cryptocurrency-related discussions, leveraging advanced natural language processing techniques. We introduce a novel classification scheme named "Prediction statements," categorizing comments into Predictive Incremental, Predictive Decremental, Predictive Neutral, or Non-Predictive categories.…
▽ More
This study performs analysis of Predictive statements, Hope speech, and Regret Detection behaviors within cryptocurrency-related discussions, leveraging advanced natural language processing techniques. We introduce a novel classification scheme named "Prediction statements," categorizing comments into Predictive Incremental, Predictive Decremental, Predictive Neutral, or Non-Predictive categories. Employing GPT-4o, a cutting-edge large language model, we explore sentiment dynamics across five prominent cryptocurrencies: Cardano, Binance, Matic, Fantom, and Ripple. Our analysis reveals distinct patterns in predictive sentiments, with Matic demonstrating a notably higher propensity for optimistic predictions. Additionally, we investigate hope and regret sentiments, uncovering nuanced interplay between these emotions and predictive behaviors. Despite encountering limitations related to data volume and resource availability, our study reports valuable discoveries concerning investor behavior and sentiment trends within the cryptocurrency market, informing strategic decision-making and future research endeavors.
△ Less
Submitted 4 September, 2024;
originally announced September 2024.
-
Analyzing Emotional Trends from X platform using SenticNet: A Comparative Analysis with Cryptocurrency Price
Authors:
Moein Shahiki Tash,
Zahra Ahani,
Olga Kolesnikova,
Grigori Sidorov
Abstract:
This study delves into the relationship between emotional trends from X platform data and the market dynamics of well-known cryptocurrencies Cardano, Binance, Fantom, Matic, and Ripple over the period from October 2022 to March 2023. Leveraging SenticNet, we identified emotions like Fear and Anxiety, Rage and Anger, Grief and Sadness, Delight and Pleasantness, Enthusiasm and Eagerness, and Delight…
▽ More
This study delves into the relationship between emotional trends from X platform data and the market dynamics of well-known cryptocurrencies Cardano, Binance, Fantom, Matic, and Ripple over the period from October 2022 to March 2023. Leveraging SenticNet, we identified emotions like Fear and Anxiety, Rage and Anger, Grief and Sadness, Delight and Pleasantness, Enthusiasm and Eagerness, and Delight and Joy. Following data extraction, we segmented each month into bi-weekly intervals, replicating this process for price data obtained from Finance-Yahoo. Consequently, a comparative analysis was conducted, establishing connections between emotional trends observed across bi-weekly intervals and cryptocurrency prices, uncovering significant correlations between emotional sentiments and coin valuations.
△ Less
Submitted 5 May, 2024;
originally announced May 2024.
-
NLP Progress in Indigenous Latin American Languages
Authors:
Atnafu Lambebo Tonja,
Fazlourrahman Balouchzahi,
Sabur Butt,
Olga Kolesnikova,
Hector Ceballos,
Alexander Gelbukh,
Thamar Solorio
Abstract:
The paper focuses on the marginalization of indigenous language communities in the face of rapid technological advancements. We highlight the cultural richness of these languages and the risk they face of being overlooked in the realm of Natural Language Processing (NLP). We aim to bridge the gap between these communities and researchers, emphasizing the need for inclusive technological advancemen…
▽ More
The paper focuses on the marginalization of indigenous language communities in the face of rapid technological advancements. We highlight the cultural richness of these languages and the risk they face of being overlooked in the realm of Natural Language Processing (NLP). We aim to bridge the gap between these communities and researchers, emphasizing the need for inclusive technological advancements that respect indigenous community perspectives. We show the NLP progress of indigenous Latin American languages and the survey that covers the status of indigenous languages in Latin America, their representation in NLP, and the challenges and innovations required for their preservation and development. The paper contributes to the current literature in understanding the need and progress of NLP for indigenous communities of Latin America, specifically low-resource and indigenous communities in general.
△ Less
Submitted 12 May, 2024; v1 submitted 8 April, 2024;
originally announced April 2024.
-
EthioMT: Parallel Corpus for Low-resource Ethiopian Languages
Authors:
Atnafu Lambebo Tonja,
Olga Kolesnikova,
Alexander Gelbukh,
Jugal Kalita
Abstract:
Recent research in natural language processing (NLP) has achieved impressive performance in tasks such as machine translation (MT), news classification, and question-answering in high-resource languages. However, the performance of MT leaves much to be desired for low-resource languages. This is due to the smaller size of available parallel corpora in these languages, if such corpora are available…
▽ More
Recent research in natural language processing (NLP) has achieved impressive performance in tasks such as machine translation (MT), news classification, and question-answering in high-resource languages. However, the performance of MT leaves much to be desired for low-resource languages. This is due to the smaller size of available parallel corpora in these languages, if such corpora are available at all. NLP in Ethiopian languages suffers from the same issues due to the unavailability of publicly accessible datasets for NLP tasks, including MT. To help the research community and foster research for Ethiopian languages, we introduce EthioMT -- a new parallel corpus for 15 languages. We also create a new benchmark by collecting a dataset for better-researched languages in Ethiopia. We evaluate the newly collected corpus and the benchmark dataset for 23 Ethiopian languages using transformer and fine-tuning approaches.
△ Less
Submitted 28 March, 2024;
originally announced March 2024.
-
EthioLLM: Multilingual Large Language Models for Ethiopian Languages with Task Evaluation
Authors:
Atnafu Lambebo Tonja,
Israel Abebe Azime,
Tadesse Destaw Belay,
Mesay Gemeda Yigezu,
Moges Ahmed Mehamed,
Abinew Ali Ayele,
Ebrahim Chekol Jibril,
Michael Melese Woldeyohannis,
Olga Kolesnikova,
Philipp Slusallek,
Dietrich Klakow,
Shengwu Xiong,
Seid Muhie Yimam
Abstract:
Large language models (LLMs) have gained popularity recently due to their outstanding performance in various downstream Natural Language Processing (NLP) tasks. However, low-resource languages are still lagging behind current state-of-the-art (SOTA) developments in the field of NLP due to insufficient resources to train LLMs. Ethiopian languages exhibit remarkable linguistic diversity, encompassin…
▽ More
Large language models (LLMs) have gained popularity recently due to their outstanding performance in various downstream Natural Language Processing (NLP) tasks. However, low-resource languages are still lagging behind current state-of-the-art (SOTA) developments in the field of NLP due to insufficient resources to train LLMs. Ethiopian languages exhibit remarkable linguistic diversity, encompassing a wide array of scripts, and are imbued with profound religious and cultural significance. This paper introduces EthioLLM -- multilingual large language models for five Ethiopian languages (Amharic, Ge'ez, Afan Oromo, Somali, and Tigrinya) and English, and Ethiobenchmark -- a new benchmark dataset for various downstream NLP tasks. We evaluate the performance of these models across five downstream NLP tasks. We open-source our multilingual language models, new benchmark datasets for various downstream tasks, and task-specific fine-tuned language models and discuss the performance of the models. Our dataset and models are available at the https://huggingface.co/EthioNLP repository.
△ Less
Submitted 23 June, 2024; v1 submitted 20 March, 2024;
originally announced March 2024.
-
First Attempt at Building Parallel Corpora for Machine Translation of Northeast India's Very Low-Resource Languages
Authors:
Atnafu Lambebo Tonja,
Melkamu Mersha,
Ananya Kalita,
Olga Kolesnikova,
Jugal Kalita
Abstract:
This paper presents the creation of initial bilingual corpora for thirteen very low-resource languages of India, all from Northeast India. It also presents the results of initial translation efforts in these languages. It creates the first-ever parallel corpora for these languages and provides initial benchmark neural machine translation results for these languages. We intend to extend these corpo…
▽ More
This paper presents the creation of initial bilingual corpora for thirteen very low-resource languages of India, all from Northeast India. It also presents the results of initial translation efforts in these languages. It creates the first-ever parallel corpora for these languages and provides initial benchmark neural machine translation results for these languages. We intend to extend these corpora to include a large number of low-resource Indian languages and integrate the effort with our prior work with African and American-Indian languages to create corpora covering a large number of languages from across the world.
△ Less
Submitted 7 December, 2023;
originally announced December 2023.
-
SpaDeLeF: A Dataset for Hierarchical Classification of Lexical Functions for Collocations in Spanish
Authors:
Yevhen Kostiuk,
Grigori Sidorov,
Olga Kolesnikova
Abstract:
In natural language processing (NLP), lexical function is a concept to unambiguously represent semantic and syntactic features of words and phrases in text first crafted in the Meaning-Text Theory. Hierarchical classification of lexical functions involves organizing these features into a tree-like hierarchy of categories or labels. This is a challenging task as it requires a good understanding of…
▽ More
In natural language processing (NLP), lexical function is a concept to unambiguously represent semantic and syntactic features of words and phrases in text first crafted in the Meaning-Text Theory. Hierarchical classification of lexical functions involves organizing these features into a tree-like hierarchy of categories or labels. This is a challenging task as it requires a good understanding of the context and the relationships among words and phrases in text. It also needs large amounts of labeled data to train language models effectively. In this paper, we present a dataset of most frequent Spanish verb-noun collocations and sentences where they occur, each collocation is assigned to one of 37 lexical functions defined as classes for a hierarchical classification task. Each class represents a relation between the noun and the verb in a collocation involving their semantic and syntactic features. We combine the classes in a tree-based structure, and introduce classification objectives for each level of the structure. The dataset was created by dependency tree parsing and matching of the phrases in Spanish news. We provide baselines and data splits for each objective.
△ Less
Submitted 7 November, 2023;
originally announced November 2023.
-
Automatic Translation of Hate Speech to Non-hate Speech in Social Media Texts
Authors:
Yevhen Kostiuk,
Atnafu Lambebo Tonja,
Grigori Sidorov,
Olga Kolesnikova
Abstract:
In this paper, we investigate the issue of hate speech by presenting a novel task of translating hate speech into non-hate speech text while preserving its meaning. As a case study, we use Spanish texts. We provide a dataset and several baselines as a starting point for further research in the task. We evaluated our baseline results using multiple metrics, including BLEU scores. The aim of this st…
▽ More
In this paper, we investigate the issue of hate speech by presenting a novel task of translating hate speech into non-hate speech text while preserving its meaning. As a case study, we use Spanish texts. We provide a dataset and several baselines as a starting point for further research in the task. We evaluated our baseline results using multiple metrics, including BLEU scores. The aim of this study is to contribute to the development of more effective methods for reducing the spread of hate speech in online communities.
△ Less
Submitted 2 June, 2023;
originally announced June 2023.
-
Enhancing Translation for Indigenous Languages: Experiments with Multilingual Models
Authors:
Atnafu Lambebo Tonja,
Hellina Hailu Nigatu,
Olga Kolesnikova,
Grigori Sidorov,
Alexander Gelbukh,
Jugal Kalita
Abstract:
This paper describes CIC NLP's submission to the AmericasNLP 2023 Shared Task on machine translation systems for indigenous languages of the Americas. We present the system descriptions for three methods. We used two multilingual models, namely M2M-100 and mBART50, and one bilingual (one-to-one) -- Helsinki NLP Spanish-English translation model, and experimented with different transfer learning se…
▽ More
This paper describes CIC NLP's submission to the AmericasNLP 2023 Shared Task on machine translation systems for indigenous languages of the Americas. We present the system descriptions for three methods. We used two multilingual models, namely M2M-100 and mBART50, and one bilingual (one-to-one) -- Helsinki NLP Spanish-English translation model, and experimented with different transfer learning setups. We experimented with 11 languages from America and report the setups we used as well as the results we achieved. Overall, the mBART setup was able to improve upon the baseline for three out of the eleven languages.
△ Less
Submitted 27 May, 2023;
originally announced May 2023.
-
Parallel Corpus for Indigenous Language Translation: Spanish-Mazatec and Spanish-Mixtec
Authors:
Atnafu Lambebo Tonja,
Christian Maldonado-Sifuentes,
David Alejandro Mendoza Castillo,
Olga Kolesnikova,
Noé Castro-Sánchez,
Grigori Sidorov,
Alexander Gelbukh
Abstract:
In this paper, we present a parallel Spanish-Mazatec and Spanish-Mixtec corpus for machine translation (MT) tasks, where Mazatec and Mixtec are two indigenous Mexican languages. We evaluated the usability of the collected corpus using three different approaches: transformer, transfer learning, and fine-tuning pre-trained multilingual MT models. Fine-tuning the Facebook M2M100-48 model outperformed…
▽ More
In this paper, we present a parallel Spanish-Mazatec and Spanish-Mixtec corpus for machine translation (MT) tasks, where Mazatec and Mixtec are two indigenous Mexican languages. We evaluated the usability of the collected corpus using three different approaches: transformer, transfer learning, and fine-tuning pre-trained multilingual MT models. Fine-tuning the Facebook M2M100-48 model outperformed the other approaches, with BLEU scores of 12.09 and 22.25 for Mazatec-Spanish and Spanish-Mazatec translations, respectively, and 16.75 and 22.15 for Mixtec-Spanish and Spanish-Mixtec translations, respectively. The findings show that the dataset size (9,799 sentences in Mazatec and 13,235 sentences in Mixtec) affects translation performance and that indigenous languages work better when used as target languages. The findings emphasize the importance of creating parallel corpora for indigenous languages and fine-tuning models for low-resource translation tasks. Future research will investigate zero-shot and few-shot learning approaches to further improve translation performance in low-resource settings. The dataset and scripts are available at \url{https://github.com/atnafuatx/Machine-Translation-Resources}
△ Less
Submitted 27 May, 2023;
originally announced May 2023.
-
Natural Language Processing in Ethiopian Languages: Current State, Challenges, and Opportunities
Authors:
Atnafu Lambebo Tonja,
Tadesse Destaw Belay,
Israel Abebe Azime,
Abinew Ali Ayele,
Moges Ahmed Mehamed,
Olga Kolesnikova,
Seid Muhie Yimam
Abstract:
This survey delves into the current state of natural language processing (NLP) for four Ethiopian languages: Amharic, Afaan Oromo, Tigrinya, and Wolaytta. Through this paper, we identify key challenges and opportunities for NLP research in Ethiopia. Furthermore, we provide a centralized repository on GitHub that contains publicly available resources for various NLP tasks in these languages. This r…
▽ More
This survey delves into the current state of natural language processing (NLP) for four Ethiopian languages: Amharic, Afaan Oromo, Tigrinya, and Wolaytta. Through this paper, we identify key challenges and opportunities for NLP research in Ethiopia. Furthermore, we provide a centralized repository on GitHub that contains publicly available resources for various NLP tasks in these languages. This repository can be updated periodically with contributions from other researchers. Our objective is to identify research gaps and disseminate the information to NLP researchers interested in Ethiopian languages and encourage future research in this domain.
△ Less
Submitted 25 March, 2023;
originally announced March 2023.
-
Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts
Authors:
Atnafu Lambebo Tonja,
Mesay Gemeda Yigezu,
Olga Kolesnikova,
Moein Shahiki Tash,
Grigori Sidorov,
Alexander Gelbuk
Abstract:
Using code-mixed data in natural language processing (NLP) research currently gets a lot of attention. Language identification of social media code-mixed text has been an interesting problem of study in recent years due to the advancement and influences of social media in communication. This paper presents the Instituto Politécnico Nacional, Centro de Investigación en Computación (CIC) team's syst…
▽ More
Using code-mixed data in natural language processing (NLP) research currently gets a lot of attention. Language identification of social media code-mixed text has been an interesting problem of study in recent years due to the advancement and influences of social media in communication. This paper presents the Instituto Politécnico Nacional, Centro de Investigación en Computación (CIC) team's system description paper for the CoLI-Kanglish shared task at ICON2022. In this paper, we propose the use of a Transformer based model for word-level language identification in code-mixed Kannada English texts. The proposed model on the CoLI-Kenglish dataset achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
△ Less
Submitted 25 November, 2022;
originally announced November 2022.
-
The Effect of Normalization for Bi-directional Amharic-English Neural Machine Translation
Authors:
Tadesse Destaw Belay,
Atnafu Lambebo Tonja,
Olga Kolesnikova,
Seid Muhie Yimam,
Abinew Ali Ayele,
Silesh Bogale Haile,
Grigori Sidorov,
Alexander Gelbukh
Abstract:
Machine translation (MT) is one of the main tasks in natural language processing whose objective is to translate texts automatically from one natural language to another. Nowadays, using deep neural networks for MT tasks has received great attention. These networks require lots of data to learn abstract representations of the input and store it in continuous vectors. This paper presents the first…
▽ More
Machine translation (MT) is one of the main tasks in natural language processing whose objective is to translate texts automatically from one natural language to another. Nowadays, using deep neural networks for MT tasks has received great attention. These networks require lots of data to learn abstract representations of the input and store it in continuous vectors. This paper presents the first relatively large-scale Amharic-English parallel sentence dataset. Using these compiled data, we build bi-directional Amharic-English translation models by fine-tuning the existing Facebook M2M100 pre-trained model achieving a BLEU score of 37.79 in Amharic-English 32.74 in English-Amharic translation. Additionally, we explore the effects of Amharic homophone normalization on the machine translation task. The results show that the normalization of Amharic homophone characters increases the performance of Amharic-English machine translation in both directions.
△ Less
Submitted 27 October, 2022;
originally announced October 2022.