-
Nullpointer at ArAIEval Shared Task: Arabic Propagandist Technique Detection with Token-to-Word Mapping in Sequence Tagging
Authors:
Abrar Abir,
Kemal Oflazer
Abstract:
This paper investigates the optimization of propaganda technique detection in Arabic text, including tweets \& news paragraphs, from ArAIEval shared task 1. Our approach involves fine-tuning the AraBERT v2 model with a neural network classifier for sequence tagging. Experimental results show relying on the first token of the word for technique prediction produces the best performance. In addition,…
▽ More
This paper investigates the optimization of propaganda technique detection in Arabic text, including tweets \& news paragraphs, from ArAIEval shared task 1. Our approach involves fine-tuning the AraBERT v2 model with a neural network classifier for sequence tagging. Experimental results show relying on the first token of the word for technique prediction produces the best performance. In addition, incorporating genre information as a feature further enhances the model's performance. Our system achieved a score of 25.41, placing us 4$^{th}$ on the leaderboard. Subsequent post-submission improvements further raised our score to 26.68.
△ Less
Submitted 1 July, 2024;
originally announced July 2024.
-
Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model
Authors:
Leonie Weissweiler,
Valentin Hofmann,
Anjali Kantharuban,
Anna Cai,
Ritam Dutt,
Amey Hengle,
Anubha Kabra,
Atharva Kulkarni,
Abhishek Vijayakumar,
Haofei Yu,
Hinrich Schütze,
Kemal Oflazer,
David R. Mortensen
Abstract:
Large language models (LLMs) have recently reached an impressive level of linguistic capability, prompting comparisons with human language skills. However, there have been relatively few systematic inquiries into the linguistic capabilities of the latest generation of LLMs, and those studies that do exist (i) ignore the remarkable ability of humans to generalize, (ii) focus only on English, and (i…
▽ More
Large language models (LLMs) have recently reached an impressive level of linguistic capability, prompting comparisons with human language skills. However, there have been relatively few systematic inquiries into the linguistic capabilities of the latest generation of LLMs, and those studies that do exist (i) ignore the remarkable ability of humans to generalize, (ii) focus only on English, and (iii) investigate syntax or semantics and overlook other capabilities that lie at the heart of human language, like morphology. Here, we close these gaps by conducting the first rigorous analysis of the morphological capabilities of ChatGPT in four typologically varied languages (specifically, English, German, Tamil, and Turkish). We apply a version of Berko's (1958) wug test to ChatGPT, using novel, uncontaminated datasets for the four examined languages. We find that ChatGPT massively underperforms purpose-built systems, particularly in English. Overall, our results -- through the lens of morphology -- cast a new light on the linguistic capabilities of ChatGPT, suggesting that claims of human-like language skills are premature and misleading.
△ Less
Submitted 26 October, 2023; v1 submitted 23 October, 2023;
originally announced October 2023.
-
Event Coreference Resolution Using Neural Network Classifiers
Authors:
Arun Pandian,
Lamana Mulaffer,
Kemal Oflazer,
Amna AlZeyara
Abstract:
This paper presents a neural network classifier approach to detecting both within- and cross- document event coreference effectively using only event mention based features. Our approach does not (yet) rely on any event argument features such as semantic roles or spatiotemporal arguments. Experimental results on the ECB+ dataset show that our approach produces F1 scores that significantly outperfo…
▽ More
This paper presents a neural network classifier approach to detecting both within- and cross- document event coreference effectively using only event mention based features. Our approach does not (yet) rely on any event argument features such as semantic roles or spatiotemporal arguments. Experimental results on the ECB+ dataset show that our approach produces F1 scores that significantly outperform the state-of-the-art methods for both within-document and cross-document event coreference resolution when we use B3 and CEAFe evaluation measures, but gets worse F1 score with the MUC measure. However, when we use the CoNLL measure, which is the average of these three scores, our approach has slightly better F1 for within- document event coreference resolution but is significantly better for cross-document event coreference resolution.
△ Less
Submitted 9 October, 2018;
originally announced October 2018.
-
MADARi: A Web Interface for Joint Arabic Morphological Annotation and Spelling Correction
Authors:
Ossama Obeid,
Salam Khalifa,
Nizar Habash,
Houda Bouamor,
Wajdi Zaghouani,
Kemal Oflazer
Abstract:
In this paper, we introduce MADARi, a joint morphological annotation and spelling correction system for texts in Standard and Dialectal Arabic. The MADARi framework provides intuitive interfaces for annotating text and managing the annotation process of a large number of sizable documents. Morphological annotation includes indicating, for a word, in context, its baseword, clitics, part-of-speech,…
▽ More
In this paper, we introduce MADARi, a joint morphological annotation and spelling correction system for texts in Standard and Dialectal Arabic. The MADARi framework provides intuitive interfaces for annotating text and managing the annotation process of a large number of sizable documents. Morphological annotation includes indicating, for a word, in context, its baseword, clitics, part-of-speech, lemma, gloss, and dialect identification. MADARi has a suite of utilities to help with annotator productivity. For example, annotators are provided with pre-computed analyses to assist them in their task and reduce the amount of work needed to complete it. MADARi also allows annotators to query a morphological analyzer for a list of possible analyses in multiple dialects or look up previously submitted analyses. The MADARi management interface enables a lead annotator to easily manage and organize the whole annotation process remotely and concurrently. We describe the motivation, design and implementation of this interface; and we present details from a user study working with this system.
△ Less
Submitted 25 August, 2018;
originally announced August 2018.
-
Morphological Disambiguation by Voting Constraints
Authors:
Kemal Oflazer,
Gokhan Tur
Abstract:
We present a constraint-based morphological disambiguation system in which individual constraints vote on matching morphological parses, and disambiguation of all the tokens in a sentence is performed at the end by selecting parses that receive the highest votes. This constraint application paradigm makes the outcome of the disambiguation independent of the rule sequence, and hence relieves the…
▽ More
We present a constraint-based morphological disambiguation system in which individual constraints vote on matching morphological parses, and disambiguation of all the tokens in a sentence is performed at the end by selecting parses that receive the highest votes. This constraint application paradigm makes the outcome of the disambiguation independent of the rule sequence, and hence relieves the rule developer from worrying about potentially conflicting rule sequencing. Our results for disambiguating Turkish indicate that using about 500 constraint rules and some additional simple statistics, we can attain a recall of 95-96% and a precision of 94-95% with about 1.01 parses per token. Our system is implemented in Prolog and we are currently investigating an efficient implementation based on finite state transducers.
△ Less
Submitted 25 April, 1997;
originally announced April 1997.
-
Tactical Generation in a Free Constituent Order Language
Authors:
Dilek Zeynep Hakkani,
Kemal Oflazer,
Ilyas Cicekli
Abstract:
This paper describes tactical generation in Turkish, a free constituent order language, in which the order of the constituents may change according to the information structure of the sentences to be generated. In the absence of any information regarding the information structure of a sentence (i.e., topic, focus, background, etc.), the constituents of the sentence obey a default order, but the…
▽ More
This paper describes tactical generation in Turkish, a free constituent order language, in which the order of the constituents may change according to the information structure of the sentences to be generated. In the absence of any information regarding the information structure of a sentence (i.e., topic, focus, background, etc.), the constituents of the sentence obey a default order, but the order is almost freely changeable, depending on the constraints of the text flow or discourse. We have used a recursively structured finite state machine for handling the changes in constituent order, implemented as a right-linear grammar backbone. Our implementation environment is the GenKit system, developed at Carnegie Mellon University--Center for Machine Translation. Morphological realization has been implemented using an external morphological analysis/generation component which performs concrete morpheme selection and handles morphographemic processes.
△ Less
Submitted 5 May, 1996;
originally announced May 1996.
-
Error-tolerant Tree Matching
Authors:
Kemal Oflazer
Abstract:
This paper presents an efficient algorithm for retrieving from a database of trees, all trees that match a given query tree approximately, that is, within a certain error tolerance. It has natural language processing applications in searching for matches in example-based translation systems, and retrieval from lexical databases containing entries of complex feature structures. The algorithm has…
▽ More
This paper presents an efficient algorithm for retrieving from a database of trees, all trees that match a given query tree approximately, that is, within a certain error tolerance. It has natural language processing applications in searching for matches in example-based translation systems, and retrieval from lexical databases containing entries of complex feature structures. The algorithm has been implemented on SparcStations, and for large randomly generated synthetic tree databases (some having tens of thousands of trees) it can associatively search for trees with a small error, in a matter of tenths of a second to few seconds.
△ Less
Submitted 17 April, 1996; v1 submitted 11 April, 1996;
originally announced April 1996.
-
A Constraint-based Case Frame Lexicon
Authors:
Kemal Oflazer,
Okan Yilmaz
Abstract:
We present a constraint-based case frame lexicon architecture for bi-directional mapping between a syntactic case frame and a semantic frame. The lexicon uses a semantic sense as the basic unit and employs a multi-tiered constraint structure for the resolution of syntactic information into the appropriate senses and/or idiomatic usage. Valency changing transformations such as morphologically mar…
▽ More
We present a constraint-based case frame lexicon architecture for bi-directional mapping between a syntactic case frame and a semantic frame. The lexicon uses a semantic sense as the basic unit and employs a multi-tiered constraint structure for the resolution of syntactic information into the appropriate senses and/or idiomatic usage. Valency changing transformations such as morphologically marked passivized or causativized forms are handled via lexical rules that manipulate case frames templates. The system has been implemented in a typed-feature system and applied to Turkish.
△ Less
Submitted 11 April, 1996;
originally announced April 1996.
-
Combining Hand-crafted Rules and Unsupervised Learning in Constraint-based Morphological Disambiguation
Authors:
Kemal Oflazer,
Gokhan Tur
Abstract:
This paper presents a constraint-based morphological disambiguation approach that is applicable languages with complex morphology--specifically agglutinative languages with productive inflectional and derivational morphological phenomena. In certain respects, our approach has been motivated by Brill's recent work, but with the observation that his transformational approach is not directly applic…
▽ More
This paper presents a constraint-based morphological disambiguation approach that is applicable languages with complex morphology--specifically agglutinative languages with productive inflectional and derivational morphological phenomena. In certain respects, our approach has been motivated by Brill's recent work, but with the observation that his transformational approach is not directly applicable to languages like Turkish. Our system combines corpus independent hand-crafted constraint rules, constraint rules that are learned via unsupervised learning from a training corpus, and additional statistical information from the corpus to be morphologically disambiguated. The hand-crafted rules are linguistically motivated and tuned to improve precision without sacrificing recall. The unsupervised learning process produces two sets of rules: (i) choose rules which choose morphological parses of a lexical item satisfying constraint effectively discarding other parses, and (ii) delete rules, which delete parses satisfying a constraint. Our approach also uses a novel approach to unknown word processing by employing a secondary morphological processor which recovers any relevant inflectional and derivational information from a lexical item whose root is unknown. With this approach, well below 1 percent of the tokens remains as unknown in the texts we have experimented with. Our results indicate that by combining these hand-crafted,statistical and learned information sources, we can attain a recall of 96 to 97 percent with a corresponding precision of 93 to 94 percent, and ambiguity of 1.02 to 1.03 parses per token.
△ Less
Submitted 12 April, 1996; v1 submitted 11 April, 1996;
originally announced April 1996.
-
A Constraint-based Case Frame Lexicon Architecture
Authors:
Kemal Oflazer,
Okan Yilmaz
Abstract:
In Turkish, (and possibly in many other languages) verbs often convey several meanings (some totally unrelated) when they are used with subjects, objects, oblique objects, adverbial adjuncts, with certain lexical, morphological, and semantic features, and co-occurrence restrictions. In addition to the usual sense variations due to selectional restrictions on verbal arguments, in most cases, the…
▽ More
In Turkish, (and possibly in many other languages) verbs often convey several meanings (some totally unrelated) when they are used with subjects, objects, oblique objects, adverbial adjuncts, with certain lexical, morphological, and semantic features, and co-occurrence restrictions. In addition to the usual sense variations due to selectional restrictions on verbal arguments, in most cases, the meaning conveyed by a case frame is idiomatic and not compositional, with subtle constraints. In this paper, we present an approach to building a constraint-based case frame lexicon for use in natural language processing in Turkish, whose prototype we have implemented under the TFS system developed at Univ. of Stuttgart.
A number of observations that we have made on Turkish have indicated that we need something beyond the traditional transitive and intransitive distinction, and utilize a framework where verb valence is considered as the obligatory co-existence of an arbitrary subset of possible arguments along with the obligatory exclusion of certain others, relative to a verb sense. Additional morphological lexical and semantic constraints on the syntactic constituents organized as a 5-tier constraint hierarchy, are utilized to map a given syntactic structure case-fraame to a specific verb sense.
△ Less
Submitted 21 July, 1995;
originally announced July 1995.
-
Error-tolerant Finite State Recognition with Applications to Morphological Analysis and Spelling Correction
Authors:
Kemal Oflazer
Abstract:
Error-tolerant recognition enables the recognition of strings that deviate mildly from any string in the regular set recognized by the underlying finite state recognizer. Such recognition has applications in error-tolerant morphological processing, spelling correction, and approximate string matching in information retrieval. After a description of the concepts and algorithms involved, we give e…
▽ More
Error-tolerant recognition enables the recognition of strings that deviate mildly from any string in the regular set recognized by the underlying finite state recognizer. Such recognition has applications in error-tolerant morphological processing, spelling correction, and approximate string matching in information retrieval. After a description of the concepts and algorithms involved, we give examples from two applications: In the context of morphological analysis, error-tolerant recognition allows misspelled input word forms to be corrected, and morphologically analyzed concurrently. We present an application of this to error-tolerant analysis of agglutinative morphology of Turkish words. The algorithm can be applied to morphological analysis of any language whose morphology is fully captured by a single (and possibly very large) finite state transducer, regardless of the word formation processes and morphographemic phenomena involved. In the context of spelling correction, error-tolerant recognition can be used to enumerate correct candidate forms from a given misspelled string within a certain edit distance. Again, it can be applied to any language with a word list comprising all inflected forms, or whose morphology is fully described by a finite state transducer. We present experimental results for spelling correction for a number of languages. These results indicate that such recognition works very efficiently for candidate generation in spelling correction for many European languages such as English, Dutch, French, German, Italian (and others) with very large word lists of root and inflected forms (some containing well over 200,000 forms), generating all candidate solutions within 10 to 45 milliseconds (with edit distance 1) on a SparcStation 10/41. For spelling correction in Turkish, error-tolerant
△ Less
Submitted 21 July, 1995; v1 submitted 28 April, 1995;
originally announced April 1995.
-
Using a Corpus for Teaching Turkish Morphology
Authors:
H. Altay Guvenir,
Kemal Oflazer
Abstract:
This paper reports on the preliminary phase of our ongoing research towards developing an intelligent tutoring environment for Turkish grammar. One of the components of this environment is a corpus search tool which, among other aspects of the language, will be used to present the learner sample sentences along with their morphological analyses. Following a brief introduction to the Turkish langua…
▽ More
This paper reports on the preliminary phase of our ongoing research towards developing an intelligent tutoring environment for Turkish grammar. One of the components of this environment is a corpus search tool which, among other aspects of the language, will be used to present the learner sample sentences along with their morphological analyses. Following a brief introduction to the Turkish language and its morphology, the paper describes the morphological analysis and ambiguity resolution used to construct the corpus used in the search tool. Finally, implementation issues and details involving the user interface of the tool are discussed.
△ Less
Submitted 1 March, 1995;
originally announced March 1995.
-
Spelling Correction in Agglutinative Languages
Authors:
Kemal Oflazer
Abstract:
This paper presents an approach to spelling correction in agglutinative languages that is based on two-level morphology and a dynamic programming based search algorithm. Spelling correction in agglutinative languages is significantly different than in languages like English. The concept of a word in such languages is much wider that the entries found in a dictionary, owing to {}~productive word…
▽ More
This paper presents an approach to spelling correction in agglutinative languages that is based on two-level morphology and a dynamic programming based search algorithm. Spelling correction in agglutinative languages is significantly different than in languages like English. The concept of a word in such languages is much wider that the entries found in a dictionary, owing to {}~productive word formation by derivational and inflectional affixations. After an overview of certain issues and relevant mathematical preliminaries, we formally present the problem and our solution. We then present results from our experiments with spelling correction in Turkish, a Ural--Altaic agglutinative language. Our results indicate that we can find the intended correct word in 95\% of the cases and offer it as the first candidate in 74\% of the cases, when the edit distance is 1.
△ Less
Submitted 7 October, 1994; v1 submitted 6 October, 1994;
originally announced October 1994.
-
Tagging and Morphological Disambiguation of Turkish Text
Authors:
Kemal Oflazer,
Ilker Kuruoz
Abstract:
Automatic text tagging is an important component in higher level analysis of text corpora, and its output can be used in many natural language processing applications. In languages like Turkish or Finnish, with agglutinative morphology, morphological disambiguation is a very crucial process in tagging, as the structures of many lexical forms are morphologically ambiguous. This paper describes a…
▽ More
Automatic text tagging is an important component in higher level analysis of text corpora, and its output can be used in many natural language processing applications. In languages like Turkish or Finnish, with agglutinative morphology, morphological disambiguation is a very crucial process in tagging, as the structures of many lexical forms are morphologically ambiguous. This paper describes a POS tagger for Turkish text based on a full-scale two-level specification of Turkish morphology that is based on a lexicon of about 24,000 root words. This is augmented with a multi-word and idiomatic construct recognizer, and most importantly morphological disambiguator based on local neighborhood constraints, heuristics and limited amount of statistical information. The tagger also has functionality for statistics compilation and fine tuning of the morphological analyzer, such as logging erroneous morphological parses, commonly used roots, etc. Preliminary results indicate that the tagger can tag about 98-99\% of the texts accurately with very minimal user intervention. Furthermore for sentences morphologically disambiguated with the tagger, an LFG parser developed for Turkish, generates, on the average, 50\% less ambiguous parses and parses almost 2.5 times faster. The tagging functionality is not specific to Turkish, and can be applied to any language with a proper morphological analysis interface.
△ Less
Submitted 29 July, 1994;
originally announced July 1994.
-
Parsing Turkish with the Lexical Functional Grammar Formalism
Authors:
Zelal Gungordu,
Kemal Oflazer
Abstract:
This paper describes our work on parsing Turkish using the lexical-functional grammar formalism. This work represents the first significant effort for parsing Turkish. Our implementation is based on Tomita's parser developed at Carnegie-Mellon University Center for Machine Translation. The grammar covers a substantial subset of Turkish including simple and complex sentences, and deals with a rea…
▽ More
This paper describes our work on parsing Turkish using the lexical-functional grammar formalism. This work represents the first significant effort for parsing Turkish. Our implementation is based on Tomita's parser developed at Carnegie-Mellon University Center for Machine Translation. The grammar covers a substantial subset of Turkish including simple and complex sentences, and deals with a reasonable amount of word order freeness. The complex agglutinative morphology of Turkish lexical structures is handled using a separate two-level morphological analyzer. After a discussion of key relevant issues regarding Turkish grammar, we discuss aspects of our system and present results from our implementation. Our initial results suggest that our system can parse about 82\% of the sentences directly and almost all the remaining with very minor pre-editing.
△ Less
Submitted 2 June, 1994;
originally announced June 1994.