-
Morphological evaluation of subwords vocabulary used by BETO language model
Authors:
Óscar García-Sierra,
Ana Fernández-Pampillón Cesteros,
Miguel Ortega-Martín
Abstract:
Subword tokenization algorithms used by Large Language Models are significantly more efficient and can independently build the necessary vocabulary of words and subwords without human intervention. However, those subwords do not always align with real morphemes, potentially impacting the models' performance, though it remains uncertain when this might occur. In previous research, we proposed a method to assess the morphological quality of vocabularies, focusing on the overlap between these vocabularies and the morphemes of a given language. Our evaluation method was built on three quality measures (relevance, cohesion, and morphological accuracy) and a procedure for their assessment. By applying this method to vocabularies created by three subword tokenization algorithms (BPE, WordPiece, and Unigram), we concluded that these vocabularies generally exhibit very low morphological quality. In this article, we apply this evaluation to the tokenizer of BETO, a BERT language model trained on large Spanish corpora. This evaluation, along with our previous results, helped us conclude that its vocabulary has low morphological quality, and we also found that training the tokenizer on a larger corpus does not improve the morphological quality of the generated vocabulary. Additionally, this evaluation helps clarify which algorithm the tokenizer uses, namely WordPiece, given the inconsistencies between the authors' claims and the model's configuration.
Submitted 3 October, 2024;
originally announced October 2024.
-
Building another Spanish dictionary, this time with GPT-4
Authors:
Miguel Ortega-Martín,
Óscar García-Sierra,
Alfonso Ardoiz,
Juan Carlos Armenteros,
Ignacio Garrido,
Jorge Álvarez,
Camilo Torrón,
Iñigo Galdeano,
Ignacio Arranz,
Oleg Vorontsov,
Adrián Alonso
Abstract:
We present the "Spanish Built Factual Freectianary 2.0" (Spanish-BFF-2) as the second iteration of an AI-generated Spanish dictionary. Previously, we developed the inaugural version of this unique free dictionary employing GPT-3. In this study, we aim to improve the dictionary by using GPT-4-turbo instead. Furthermore, we explore improvements made to the initial version and compare the performance of both models.
Submitted 17 June, 2024;
originally announced June 2024.
-
RADIA -- Radio Advertisement Detection with Intelligent Analytics
Authors:
Jorge Álvarez,
Juan Carlos Armenteros,
Camilo Torrón,
Miguel Ortega-Martín,
Alfonso Ardoiz,
Óscar García,
Ignacio Arranz,
Íñigo Galdeano,
Ignacio Garrido,
Adrián Alonso,
Fernando Bayón,
Oleg Vorontsov
Abstract:
Radio advertising remains an integral part of modern marketing strategies, valued for its appeal and its potential for targeted reach. However, the dynamic nature of radio airtime and the rising trend of multiple radio spots necessitate an efficient system for monitoring advertisement broadcasts. This study investigates a novel automated radio advertisement detection technique incorporating advanced speech recognition and text classification algorithms. RadIA's approach surpasses traditional methods by eliminating the need for prior knowledge of the broadcast content. This contribution allows for detecting impromptu and newly introduced advertisements, providing a comprehensive solution for advertisement detection in radio broadcasting. Experimental results show that the resulting model, trained on carefully segmented and tagged text data, achieves an F1-macro score of 87.76 against a theoretical maximum of 89.33. This paper provides insights into the choice of hyperparameters and their impact on the model's performance. The study demonstrates the system's potential to ensure compliance with advertising broadcast contracts and to support competitive surveillance. This research could fundamentally change how radio advertising is monitored and open new doors for marketing optimization.
Submitted 6 March, 2024;
originally announced March 2024.
-
Spanish Built Factual Freectianary (Spanish-BFF): the first AI-generated free dictionary
Authors:
Miguel Ortega-Martín,
Óscar García-Sierra,
Alfonso Ardoiz,
Juan Carlos Armenteros,
Jorge Álvarez,
Adrián Alonso
Abstract:
Dictionaries are one of the oldest and most used linguistic resources. Building them is a complex task that, to the best of our knowledge, has yet to be explored with generative Large Language Models (LLMs). We introduce the "Spanish Built Factual Freectianary" (Spanish-BFF) as the first Spanish AI-generated dictionary. This first-of-its-kind free dictionary uses GPT-3. We also define future steps we aim to follow to improve this initial contribution to the field, such as adding more languages.
Submitted 28 February, 2023; v1 submitted 24 February, 2023;
originally announced February 2023.
-
Linguistic ambiguity analysis in ChatGPT
Authors:
Miguel Ortega-Martín,
Óscar García-Sierra,
Alfonso Ardoiz,
Jorge Álvarez,
Juan Carlos Armenteros,
Adrián Alonso
Abstract:
Linguistic ambiguity is and has always been one of the main challenges in Natural Language Processing (NLP) systems. Modern Transformer architectures like BERT, T5 or, more recently, InstructGPT have achieved impressive improvements in many NLP fields, but there is still plenty of work to do. Motivated by the uproar caused by ChatGPT, in this paper we provide an introduction to linguistic ambiguity, its varieties and their relevance in modern NLP, and perform an extensive empirical analysis. ChatGPT's strengths and weaknesses are revealed, as well as strategies to get the most out of this model.
Submitted 20 February, 2023; v1 submitted 13 February, 2023;
originally announced February 2023.