Search | arXiv e-print repository

arXiv:2407.01360 [pdf, other]

Nullpointer at ArAIEval Shared Task: Arabic Propagandist Technique Detection with Token-to-Word Mapping in Sequence Tagging

Authors: Abrar Abir, Kemal Oflazer

Abstract: This paper investigates the optimization of propaganda technique detection in Arabic text, including tweets & news paragraphs, from ArAIEval shared task 1. Our approach involves fine-tuning the AraBERT v2 model with a neural network classifier for sequence tagging. Experimental results show relying on the first token of the word for technique prediction produces the best performance. In addition, incorporating genre information as a feature further enhances the model's performance. Our system achieved a score of 25.41, placing us 4th on the leaderboard. Subsequent post-submission improvements further raised our score to 26.68.

Submitted 1 July, 2024; originally announced July 2024.

Comments: To appear in proceedings of 2024 Arabic NLP Conference
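
The first-token strategy mentioned in the abstract above can be illustrated with a minimal sketch. It assumes token-level predictions are already available and that each sub-token carries the index of the word it came from (`None` for special tokens), similar to what Hugging Face fast tokenizers expose through `word_ids()`. The function name and the label strings are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional


def first_token_word_labels(
    word_ids: List[Optional[int]], token_labels: List[str]
) -> List[str]:
    """Collapse token-level labels to word level by keeping, for each word,
    the label predicted on its first sub-token."""
    if len(word_ids) != len(token_labels):
        raise ValueError("word_ids and token_labels must have the same length")

    first_label: Dict[int, str] = {}
    for wid, label in zip(word_ids, token_labels):
        if wid is None:
            continue  # special tokens such as [CLS] / [SEP] belong to no word
        # setdefault keeps the label of the first sub-token seen for this word
        first_label.setdefault(wid, label)

    # Return labels in word order
    return [first_label[wid] for wid in sorted(first_label)]


# Hypothetical example: three words split into 2, 1, and 3 sub-tokens
word_ids = [None, 0, 0, 1, 2, 2, 2, None]
token_labels = ["O", "Loaded_Language", "O", "O", "Name_Calling", "O", "O", "O"]
print(first_token_word_labels(word_ids, token_labels))
```

Labels predicted on the later sub-tokens of a word are simply ignored, which is the design choice the abstract reports as performing best.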

arXiv:2310.15113 [pdf]

Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model

Authors: Leonie Weissweiler, Valentin Hofmann, Anjali Kantharuban, Anna Cai, Ritam Dutt, Amey Hengle, Anubha Kabra, Atharva Kulkarni, Abhishek Vijayakumar, Haofei Yu, Hinrich Schütze, Kemal Oflazer, David R. Mortensen

Abstract: Large language models (LLMs) have recently reached an impressive level of linguistic capability, prompting comparisons with human language skills. However, there have been relatively few systematic inquiries into the linguistic capabilities of the latest generation of LLMs, and those studies that do exist (i) ignore the remarkable ability of humans to generalize, (ii) focus only on English, and (iii) investigate syntax or semantics and overlook other capabilities that lie at the heart of human language, like morphology. Here, we close these gaps by conducting the first rigorous analysis of the morphological capabilities of ChatGPT in four typologically varied languages (specifically, English, German, Tamil, and Turkish). We apply a version of Berko's (1958) wug test to ChatGPT, using novel, uncontaminated datasets for the four examined languages. We find that ChatGPT massively underperforms purpose-built systems, particularly in English. Overall, our results -- through the lens of morphology -- cast a new light on the linguistic capabilities of ChatGPT, suggesting that claims of human-like language skills are premature and misleading.

Submitted 26 October, 2023; v1 submitted 23 October, 2023; originally announced October 2023.

Comments: EMNLP 2023

arXiv:1810.04216 [pdf, other]

Event Coreference Resolution Using Neural Network Classifiers

Authors: Arun Pandian, Lamana Mulaffer, Kemal Oflazer, Amna AlZeyara

Abstract: This paper presents a neural network classifier approach to detecting both within- and cross-document event coreference effectively using only event mention based features. Our approach does not (yet) rely on any event argument features such as semantic roles or spatiotemporal arguments. Experimental results on the ECB+ dataset show that our approach produces F1 scores that significantly outperform the state-of-the-art methods for both within-document and cross-document event coreference resolution when we use B3 and CEAFe evaluation measures, but gets a worse F1 score with the MUC measure. However, when we use the CoNLL measure, which is the average of these three scores, our approach has slightly better F1 for within-document event coreference resolution but is significantly better for cross-document event coreference resolution.

Submitted 9 October, 2018; originally announced October 2018.

arXiv:1808.08392 [pdf, other]

MADARi: A Web Interface for Joint Arabic Morphological Annotation and Spelling Correction

Authors: Ossama Obeid, Salam Khalifa, Nizar Habash, Houda Bouamor, Wajdi Zaghouani, Kemal Oflazer

Abstract: In this paper, we introduce MADARi, a joint morphological annotation and spelling correction system for texts in Standard and Dialectal Arabic. The MADARi framework provides intuitive interfaces for annotating text and managing the annotation process of a large number of sizable documents. Morphological annotation includes indicating, for a word, in context, its baseword, clitics, part-of-speech, lemma, gloss, and dialect identification. MADARi has a suite of utilities to help with annotator productivity. For example, annotators are provided with pre-computed analyses to assist them in their task and reduce the amount of work needed to complete it. MADARi also allows annotators to query a morphological analyzer for a list of possible analyses in multiple dialects or look up previously submitted analyses. The MADARi management interface enables a lead annotator to easily manage and organize the whole annotation process remotely and concurrently. We describe the motivation, design and implementation of this interface; and we present details from a user study working with this system.

Submitted 25 August, 2018; originally announced August 2018.

Comments: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

arXiv:cmp-lg/9704011 [pdf, ps, other]

Morphological Disambiguation by Voting Constraints

Authors: Kemal Oflazer, Gokhan Tur

Abstract: We present a constraint-based morphological disambiguation system in which individual constraints vote on matching morphological parses, and disambiguation of all the tokens in a sentence is performed at the end by selecting parses that receive the highest votes. This constraint application paradigm makes the outcome of the disambiguation independent of the rule sequence, and hence relieves the rule developer from worrying about potentially conflicting rule sequencing. Our results for disambiguating Turkish indicate that using about 500 constraint rules and some additional simple statistics, we can attain a recall of 95-96% and a precision of 94-95% with about 1.01 parses per token. Our system is implemented in Prolog and we are currently investigating an efficient implementation based on finite state transducers.

Submitted 25 April, 1997; originally announced April 1997.

Comments: 8 pages, LaTeX source. To appear in Proceedings of ACL/EACL'97. Compressed postscript also available as ftp://ftp.cs.bilkent.edu.tr/pub/ko/acl97.ps.z
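
The voting paradigm described in the abstract above can be sketched roughly as follows. This is a toy illustration only: the parse strings, the two constraints, and their vote weights are hypothetical, and the paper's system is implemented in Prolog with about 500 rules. The point it shows is that all votes are tallied against the unmodified sentence before any parse is selected, so the order in which constraints are listed does not affect the outcome.

```python
from typing import Callable, List, Sequence

Parse = str
Sentence = Sequence[Sequence[Parse]]  # one list of candidate parses per token
# A constraint inspects (sentence, token index, candidate parse) and returns a vote
Constraint = Callable[[Sentence, int, Parse], int]


def disambiguate(sentence: Sentence, constraints: Sequence[Constraint]) -> List[List[Parse]]:
    """Tally votes for every parse, then keep the top-voted parse(s) per token."""
    result: List[List[Parse]] = []
    for i, parses in enumerate(sentence):
        if not parses:
            result.append([])
            continue
        # Votes are computed on the original sentence, never on partial results
        votes = {p: sum(c(sentence, i, p) for c in constraints) for p in parses}
        best = max(votes.values())
        result.append([p for p in parses if votes[p] == best])  # ties are kept
    return result


# Hypothetical constraints for illustration only
def noun_after_determiner(sentence: Sentence, i: int, parse: Parse) -> int:
    return 1 if i > 0 and "Det" in sentence[i - 1] and parse.startswith("Noun") else 0


def verb_at_sentence_end(sentence: Sentence, i: int, parse: Parse) -> int:
    return 2 if i == len(sentence) - 1 and parse.startswith("Verb") else 0


sentence = [["Det"], ["Noun+Nom", "Verb+Imp"], ["Noun+Nom", "Verb+Past"]]
print(disambiguate(sentence, [noun_after_determiner, verb_at_sentence_end]))
```

Keeping tied parses rather than forcing a single winner is an assumption of this sketch; it is consistent with the abstract reporting slightly more than one parse per token.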

arXiv:cmp-lg/9605008 [pdf, ps]

Tactical Generation in a Free Constituent Order Language

Authors: Dilek Zeynep Hakkani, Kemal Oflazer, Ilyas Cicekli

Abstract: This paper describes tactical generation in Turkish, a free constituent order language, in which the order of the constituents may change according to the information structure of the sentences to be generated. In the absence of any information regarding the information structure of a sentence (i.e., topic, focus, background, etc.), the constituents of the sentence obey a default order, but the order is almost freely changeable, depending on the constraints of the text flow or discourse. We have used a recursively structured finite state machine for handling the changes in constituent order, implemented as a right-linear grammar backbone. Our implementation environment is the GenKit system, developed at Carnegie Mellon University--Center for Machine Translation. Morphological realization has been implemented using an external morphological analysis/generation component which performs concrete morpheme selection and handles morphographemic processes.

Submitted 5 May, 1996; originally announced May 1996.

Comments: gzipped, uuencoded postscript file

Journal ref: Proceedings of 1996 International Workshop on Natural Language Generation

arXiv:cmp-lg/9604003 [pdf, ps]

Error-tolerant Tree Matching

Authors: Kemal Oflazer

Abstract: This paper presents an efficient algorithm for retrieving from a database of trees, all trees that match a given query tree approximately, that is, within a certain error tolerance. It has natural language processing applications in searching for matches in example-based translation systems, and retrieval from lexical databases containing entries of complex feature structures. The algorithm has been implemented on SparcStations, and for large randomly generated synthetic tree databases (some having tens of thousands of trees) it can associatively search for trees with a small error, in a matter of tenths of a second to a few seconds.

Submitted 17 April, 1996; v1 submitted 11 April, 1996; originally announced April 1996.

Comments: gzipped and uuencoded postscript, 5 pages. Minor fix in one of the figures. Also available as ftp://ftp.cs.bilkent.edu.tr/pub/ko/coling96-ettm.ps.z

arXiv:cmp-lg/9604002 [pdf, ps]

A Constraint-based Case Frame Lexicon

Authors: Kemal Oflazer, Okan Yilmaz

Abstract: We present a constraint-based case frame lexicon architecture for bi-directional mapping between a syntactic case frame and a semantic frame. The lexicon uses a semantic sense as the basic unit and employs a multi-tiered constraint structure for the resolution of syntactic information into the appropriate senses and/or idiomatic usage. Valency changing transformations such as morphologically marked passivized or causativized forms are handled via lexical rules that manipulate case frame templates. The system has been implemented in a typed-feature system and applied to Turkish.

Submitted 11 April, 1996; originally announced April 1996.

Comments: gzipped, uuencoded postscript, 6 pages. Also available as ftp://ftp.cs.bilkent.edu.tr/pub/ko/coling96-ccl.ps.z ; To appear in Proceedings of COLING 96, Copenhagen, Denmark, August 1996

arXiv:cmp-lg/9604001 [pdf, ps]

Combining Hand-crafted Rules and Unsupervised Learning in Constraint-based Morphological Disambiguation

Authors: Kemal Oflazer, Gokhan Tur

Abstract: This paper presents a constraint-based morphological disambiguation approach that is applicable to languages with complex morphology--specifically agglutinative languages with productive inflectional and derivational morphological phenomena. In certain respects, our approach has been motivated by Brill's recent work, but with the observation that his transformational approach is not directly applicable to languages like Turkish. Our system combines corpus independent hand-crafted constraint rules, constraint rules that are learned via unsupervised learning from a training corpus, and additional statistical information from the corpus to be morphologically disambiguated. The hand-crafted rules are linguistically motivated and tuned to improve precision without sacrificing recall. The unsupervised learning process produces two sets of rules: (i) choose rules, which choose morphological parses of a lexical item satisfying a constraint, effectively discarding other parses, and (ii) delete rules, which delete parses satisfying a constraint. Our approach also uses a novel approach to unknown word processing by employing a secondary morphological processor which recovers any relevant inflectional and derivational information from a lexical item whose root is unknown. With this approach, well below 1 percent of the tokens remain unknown in the texts we have experimented with.
Our results indicate that by combining these hand-crafted, statistical and learned information sources, we can attain a recall of 96 to 97 percent with a corresponding precision of 93 to 94 percent, and ambiguity of 1.02 to 1.03 parses per token.

Submitted 12 April, 1996; v1 submitted 11 April, 1996; originally announced April 1996.

Comments: gzipped and uuencoded postscript, 13 pages. Also available as ftp://ftp.cs.bilkent.edu.tr/pub/ko/emnlp.ps.z

arXiv:cmp-lg/9507008 [pdf, ps]

A Constraint-based Case Frame Lexicon Architecture

Authors: Kemal Oflazer, Okan Yilmaz

Abstract: In Turkish (and possibly in many other languages), verbs often convey several meanings (some totally unrelated) when they are used with subjects, objects, oblique objects, adverbial adjuncts, with certain lexical, morphological, and semantic features, and co-occurrence restrictions. In addition to the usual sense variations due to selectional restrictions on verbal arguments, in most cases, the meaning conveyed by a case frame is idiomatic and not compositional, with subtle constraints. In this paper, we present an approach to building a constraint-based case frame lexicon for use in natural language processing in Turkish, whose prototype we have implemented under the TFS system developed at Univ. of Stuttgart. A number of observations that we have made on Turkish have indicated that we need something beyond the traditional transitive and intransitive distinction, and utilize a framework where verb valence is considered as the obligatory co-existence of an arbitrary subset of possible arguments along with the obligatory exclusion of certain others, relative to a verb sense. Additional morphological, lexical, and semantic constraints on the syntactic constituents, organized as a 5-tier constraint hierarchy, are utilized to map a given syntactic structure case frame to a specific verb sense.

Submitted 21 July, 1995; originally announced July 1995.

Comments: gzipped, uuencoded postscript file, 11 pages. To be presented at the ESSLLI Workshop -- The Computational Lexicon. Also available as ftp://ftp.cs.bilkent.edu.tr/pub/tech-reports/1995/BU-CEIS-9511.ps.z

arXiv:cmp-lg/9504031 [pdf, ps]

Error-tolerant Finite State Recognition with Applications to Morphological Analysis and Spelling Correction

Authors: Kemal Oflazer

Abstract: Error-tolerant recognition enables the recognition of strings that deviate mildly from any string in the regular set recognized by the underlying finite state recognizer. Such recognition has applications in error-tolerant morphological processing, spelling correction, and approximate string matching in information retrieval. After a description of the concepts and algorithms involved, we give examples from two applications: In the context of morphological analysis, error-tolerant recognition allows misspelled input word forms to be corrected, and morphologically analyzed concurrently. We present an application of this to error-tolerant analysis of agglutinative morphology of Turkish words. The algorithm can be applied to morphological analysis of any language whose morphology is fully captured by a single (and possibly very large) finite state transducer, regardless of the word formation processes and morphographemic phenomena involved. In the context of spelling correction, error-tolerant recognition can be used to enumerate correct candidate forms from a given misspelled string within a certain edit distance. Again, it can be applied to any language with a word list comprising all inflected forms, or whose morphology is fully described by a finite state transducer. We present experimental results for spelling correction for a number of languages.
These results indicate that such recognition works very efficiently for candidate generation in spelling correction for many European languages such as English, Dutch, French, German, Italian (and others) with very large word lists of root and inflected forms (some containing well over 200,000 forms), generating all candidate solutions within 10 to 45 milliseconds (with edit distance 1) on a SparcStation 10/41. For spelling correction in Turkish, error-tolerant…

Submitted 21 July, 1995; v1 submitted 28 April, 1995; originally announced April 1995.

Comments: Replaces 9504031. gzipped, uuencoded postscript file. To appear in Computational Linguistics Volume 22 No. 1, 1996. Also available as ftp://ftp.cs.bilkent.edu.tr/pub/ko/clpaper9512.ps.z
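
As a rough illustration of candidate generation within an edit-distance threshold, as described in the abstract above, the sketch below searches a word list stored as a trie (a simple acyclic finite state recognizer) and abandons any path whose best achievable distance already exceeds the threshold. It is a simplified stand-in and not the paper's algorithm: it uses plain Levenshtein distance (insertions, deletions, substitutions), which is an assumption since the abstract does not list the edit operations, and all names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False


def build_trie(words: Iterable[str]) -> TrieNode:
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def candidates(root: TrieNode, query: str, max_dist: int) -> List[Tuple[str, int]]:
    """Return (word, distance) for every word within max_dist edits of query."""
    results: List[Tuple[str, int]] = []

    def walk(node: TrieNode, prefix: str, prev_row: List[int]) -> None:
        # prev_row[j] = edit distance between prefix and query[:j]
        if node.is_word and prev_row[-1] <= max_dist:
            results.append((prefix, prev_row[-1]))
        for ch, child in node.children.items():
            row = [prev_row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                row.append(min(row[j - 1] + 1,          # insertion
                               prev_row[j] + 1,         # deletion
                               prev_row[j - 1] + cost))  # match / substitution
            # Cut-off: extend the path only if some alignment can still succeed
            if min(row) <= max_dist:
                walk(child, prefix + ch, row)

    walk(root, "", list(range(len(query) + 1)))
    return sorted(results)


lexicon = build_trie(["cat", "cart", "car", "dog", "coat"])
print(candidates(lexicon, "cat", 1))
```

Because words sharing a prefix share the corresponding rows of the distance table, the search avoids comparing the query against each lexicon entry separately.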

arXiv:cmp-lg/9503001 [pdf, ps]

Using a Corpus for Teaching Turkish Morphology

Authors: H. Altay Guvenir, Kemal Oflazer

Abstract: This paper reports on the preliminary phase of our ongoing research towards developing an intelligent tutoring environment for Turkish grammar. One of the components of this environment is a corpus search tool which, among other aspects of the language, will be used to present the learner sample sentences along with their morphological analyses. Following a brief introduction to the Turkish language and its morphology, the paper describes the morphological analysis and ambiguity resolution used to construct the corpus used in the search tool. Finally, implementation issues and details involving the user interface of the tool are discussed.

Submitted 1 March, 1995; originally announced March 1995.

Comments: uuencoded gzip'ed postscript file. Appeared in Proceedings of TWLT-7, University of Twente, The Netherlands, June 1994. Software described is available at ftp://ftp.cs.bilkent.edu.tr/pub/Turklang/corpus-search/

Report number: Bilkent University CS Dept Tech Report BU-CEIS-9423

arXiv:cmp-lg/9410004 [pdf, ps]

Spelling Correction in Agglutinative Languages

Authors: Kemal Oflazer

Abstract: This paper presents an approach to spelling correction in agglutinative languages that is based on two-level morphology and a dynamic programming based search algorithm. Spelling correction in agglutinative languages is significantly different than in languages like English. The concept of a word in such languages is much wider than the entries found in a dictionary, owing to productive word formation by derivational and inflectional affixations. After an overview of certain issues and relevant mathematical preliminaries, we formally present the problem and our solution. We then present results from our experiments with spelling correction in Turkish, a Ural--Altaic agglutinative language. Our results indicate that we can find the intended correct word in 95% of the cases and offer it as the first candidate in 74% of the cases, when the edit distance is 1.

Submitted 7 October, 1994; v1 submitted 6 October, 1994; originally announced October 1994.

Comments: uuencoded postscript file, poster version to appear in ANLP proceedings. (Abstract now fixed)

arXiv:cmp-lg/9407026 [pdf, ps]

Tagging and Morphological Disambiguation of Turkish Text

Authors: Kemal Oflazer, Ilker Kuruoz

Abstract: Automatic text tagging is an important component in higher level analysis of text corpora, and its output can be used in many natural language processing applications. In languages like Turkish or Finnish, with agglutinative morphology, morphological disambiguation is a very crucial process in tagging, as the structures of many lexical forms are morphologically ambiguous. This paper describes a POS tagger for Turkish text based on a full-scale two-level specification of Turkish morphology that is based on a lexicon of about 24,000 root words. This is augmented with a multi-word and idiomatic construct recognizer, and most importantly, a morphological disambiguator based on local neighborhood constraints, heuristics and a limited amount of statistical information. The tagger also has functionality for statistics compilation and fine tuning of the morphological analyzer, such as logging erroneous morphological parses, commonly used roots, etc. Preliminary results indicate that the tagger can tag about 98-99% of the texts accurately with very minimal user intervention. Furthermore, for sentences morphologically disambiguated with the tagger, an LFG parser developed for Turkish generates, on the average, 50% less ambiguous parses and parses almost 2.5 times faster. The tagging functionality is not specific to Turkish, and can be applied to any language with a proper morphological analysis interface.

Submitted 29 July, 1994; originally announced July 1994.

Comments: To appear in Proceedings of 4th ACL-ANLP Conf. uuencoded gzip'ed postscript file, 6 pages

Report number: Bilkent University CS Dept. Tech Report NO: BU-CEIS-9416

arXiv:cmp-lg/9406008 [pdf, ps]

Parsing Turkish with the Lexical Functional Grammar Formalism

Authors: Zelal Gungordu, Kemal Oflazer

Abstract: This paper describes our work on parsing Turkish using the lexical-functional grammar formalism. This work represents the first significant effort for parsing Turkish. Our implementation is based on Tomita's parser developed at Carnegie-Mellon University Center for Machine Translation. The grammar covers a substantial subset of Turkish including simple and complex sentences, and deals with a reasonable amount of word order freeness. The complex agglutinative morphology of Turkish lexical structures is handled using a separate two-level morphological analyzer. After a discussion of key relevant issues regarding Turkish grammar, we discuss aspects of our system and present results from our implementation. Our initial results suggest that our system can parse about 82% of the sentences directly and almost all the remaining with very minor pre-editing.

Submitted 2 June, 1994; originally announced June 1994.

Comments: 7 pages, Postscript (compressed (gzip) and uuencoded)

Report number: (BU-CEIS-9402 Bilkent University CS Dept Tech Report)

Journal ref: Proceedings of COLING'94

Showing 1–15 of 15 results for author: Oflazer, K