Search | arXiv e-print repository

Nested Named Entity Recognition as Single-Pass Sequence Labeling

Authors: Alberto Muñoz-Ortiz, David Vilares, Caio COrro, Carlos Gómez-Rodríguez

Abstract: We cast nested named entity recognition (NNER) as a sequence labeling task by leveraging prior work that linearizes constituency structures, effectively reducing the complexity of this structured prediction problem to straightforward token classification. By combining these constituency linearizations with pretrained encoders, our method captures nested entities while performing exactly $n$ taggin… ▽ More We cast nested named entity recognition (NNER) as a sequence labeling task by leveraging prior work that linearizes constituency structures, effectively reducing the complexity of this structured prediction problem to straightforward token classification. By combining these constituency linearizations with pretrained encoders, our method captures nested entities while performing exactly $n$ tagging actions. Our approach achieves competitive performance compared to less efficient systems, and it can be trained using any off-the-shelf sequence labeling library. △ Less

Submitted 22 May, 2025; originally announced May 2025.

Comments: Submitted to EMNLP 2025

MSC Class: 68T50 ACM Class: I.2.7

arXiv:2505.11693 [pdf, ps, other]

Hierarchical Bracketing Encodings for Dependency Parsing as Tagging

Authors: Ana Ezquerro, David Vilares, Anssi Yli-Jyrä, Carlos Gómez-Rodríguez

Abstract: We present a family of encodings for sequence labeling dependency parsing, based on the concept of hierarchical bracketing. We prove that the existing 4-bit projective encoding belongs to this family, but it is suboptimal in the number of labels used to encode a tree. We derive an optimal hierarchical bracketing, which minimizes the number of symbols used and encodes projective trees using only 12… ▽ More We present a family of encodings for sequence labeling dependency parsing, based on the concept of hierarchical bracketing. We prove that the existing 4-bit projective encoding belongs to this family, but it is suboptimal in the number of labels used to encode a tree. We derive an optimal hierarchical bracketing, which minimizes the number of symbols used and encodes projective trees using only 12 distinct labels (vs. 16 for the 4-bit encoding). We also extend optimal hierarchical bracketing to support arbitrary non-projectivity in a more compact way than previous encodings. Our new encodings yield competitive accuracy on a diverse set of treebanks. △ Less

Submitted 16 May, 2025; originally announced May 2025.

Comments: Accepted to ACL 2025. Original submission; camera-ready coming soon

arXiv:2502.20866 [pdf, other]

Better Benchmarking LLMs for Zero-Shot Dependency Parsing

Authors: Ana Ezquerro, Carlos Gómez-Rodríguez, David Vilares

Abstract: While LLMs excel in zero-shot tasks, their performance in linguistic challenges like syntactic parsing has been less scrutinized. This paper studies state-of-the-art open-weight LLMs on the task by comparing them to baselines that do not have access to the input sentence, including baselines that have not been used in this context such as random projective trees or optimal linear arrangements. The… ▽ More While LLMs excel in zero-shot tasks, their performance in linguistic challenges like syntactic parsing has been less scrutinized. This paper studies state-of-the-art open-weight LLMs on the task by comparing them to baselines that do not have access to the input sentence, including baselines that have not been used in this context such as random projective trees or optimal linear arrangements. The results show that most of the tested LLMs cannot outperform the best uninformed baselines, with only the newest and largest versions of LLaMA doing so for most languages, and still achieving rather low performance. Thus, accurate zero-shot syntactic parsing is not forthcoming with open LLMs. △ Less

Submitted 28 February, 2025; originally announced February 2025.

Comments: Accepted at NoDaLiDa/Baltic-HLT 2025

arXiv:2410.17972 [pdf, other]

Dependency Graph Parsing as Sequence Labeling

Authors: Ana Ezquerro, David Vilares, Carlos Gómez-Rodríguez

Abstract: Various linearizations have been proposed to cast syntactic dependency parsing as sequence labeling. However, these approaches do not support more complex graph-based representations, such as semantic dependencies or enhanced universal dependencies, as they cannot handle reentrancy or cycles. By extending them, we define a range of unbounded and bounded linearizations that can be used to cast grap… ▽ More Various linearizations have been proposed to cast syntactic dependency parsing as sequence labeling. However, these approaches do not support more complex graph-based representations, such as semantic dependencies or enhanced universal dependencies, as they cannot handle reentrancy or cycles. By extending them, we define a range of unbounded and bounded linearizations that can be used to cast graph parsing as a tagging task, enlarging the toolbox of problems that can be solved under this paradigm. Experimental results on semantic dependency and enhanced UD parsing show that with a good choice of encoding, sequence-labeling dependency graph parsers combine high efficiency with accuracies close to the state of the art, in spite of their simplicity. △ Less

Submitted 23 October, 2024; originally announced October 2024.

Comments: Accepted at EMNLP-2024

arXiv:2406.16071 [pdf, other]

Dancing in the syntax forest: fast, accurate and explainable sentiment analysis with SALSA

Authors: Carlos Gómez-Rodríguez, Muhammad Imran, David Vilares, Elena Solera, Olga Kellert

Abstract: Sentiment analysis is a key technology for companies and institutions to gauge public opinion on products, services or events. However, for large-scale sentiment analysis to be accessible to entities with modest computational resources, it needs to be performed in a resource-efficient way. While some efficient sentiment analysis systems exist, they tend to apply shallow heuristics, which do not ta… ▽ More Sentiment analysis is a key technology for companies and institutions to gauge public opinion on products, services or events. However, for large-scale sentiment analysis to be accessible to entities with modest computational resources, it needs to be performed in a resource-efficient way. While some efficient sentiment analysis systems exist, they tend to apply shallow heuristics, which do not take into account syntactic phenomena that can radically change sentiment. Conversely, alternatives that take syntax into account are computationally expensive. The SALSA project, funded by the European Research Council under a Proof-of-Concept Grant, aims to leverage recently-developed fast syntactic parsing techniques to build sentiment analysis systems that are lightweight and efficient, while still providing accuracy and explainability through the explicit use of syntax. We intend our approaches to be the backbone of a working product of interest for SMEs to use in production. △ Less

Submitted 23 June, 2024; originally announced June 2024.

Comments: Accepted for publication at SEPLN-CEDI2024: Seminar of the Spanish Society for Natural Language Processing at the 7th Spanish Conference on Informatics

MSC Class: 68T50 ACM Class: I.2.7

arXiv:2405.06483 [pdf, other]

LyS at SemEval-2024 Task 3: An Early Prototype for End-to-End Multimodal Emotion Linking as Graph-Based Parsing

Authors: Ana Ezquerro, David Vilares

Abstract: This paper describes our participation in SemEval 2024 Task 3, which focused on Multimodal Emotion Cause Analysis in Conversations. We developed an early prototype for an end-to-end system that uses graph-based methods from dependency parsing to identify causal emotion relations in multi-party conversations. Our model comprises a neural transformer-based encoder for contextualizing multimodal conv… ▽ More This paper describes our participation in SemEval 2024 Task 3, which focused on Multimodal Emotion Cause Analysis in Conversations. We developed an early prototype for an end-to-end system that uses graph-based methods from dependency parsing to identify causal emotion relations in multi-party conversations. Our model comprises a neural transformer-based encoder for contextualizing multimodal conversation data and a graph-based decoder for generating the adjacency matrix scores of the causal graph. We ranked 7th out of 15 valid and official submissions for Subtask 1, using textual inputs only. We also discuss our participation in Subtask 2 during post-evaluation using multi-modal inputs. △ Less

Submitted 10 May, 2024; originally announced May 2024.

Comments: Accepted at SemEval 2024

arXiv:2402.02782 [pdf, other]

From Partial to Strictly Incremental Constituent Parsing

Authors: Ana Ezquerro, Carlos Gómez-Rodríguez, David Vilares

Abstract: We study incremental constituent parsers to assess their capacity to output trees based on prefix representations alone. Guided by strictly left-to-right generative language models and tree-decoding modules, we build parsers that adhere to a strong definition of incrementality across languages. This builds upon work that asserted incrementality, but that mostly only enforced it on either the encod… ▽ More We study incremental constituent parsers to assess their capacity to output trees based on prefix representations alone. Guided by strictly left-to-right generative language models and tree-decoding modules, we build parsers that adhere to a strong definition of incrementality across languages. This builds upon work that asserted incrementality, but that mostly only enforced it on either the encoder or the decoder. Finally, we conduct an analysis against non-incremental and partially incremental models. △ Less

Submitted 5 February, 2024; originally announced February 2024.

Comments: Accepted at EACL 2024

arXiv:2310.14319 [pdf, other]

4 and 7-bit Labeling for Projective and Non-Projective Dependency Trees

Authors: Carlos Gómez-Rodríguez, Diego Roca, David Vilares

Abstract: We introduce an encoding for parsing as sequence labeling that can represent any projective dependency tree as a sequence of 4-bit labels, one per word. The bits in each word's label represent (1) whether it is a right or left dependent, (2) whether it is the outermost (left/right) dependent of its parent, (3) whether it has any left children and (4) whether it has any right children. We show that… ▽ More We introduce an encoding for parsing as sequence labeling that can represent any projective dependency tree as a sequence of 4-bit labels, one per word. The bits in each word's label represent (1) whether it is a right or left dependent, (2) whether it is the outermost (left/right) dependent of its parent, (3) whether it has any left children and (4) whether it has any right children. We show that this provides an injective mapping from trees to labels that can be encoded and decoded in linear time. We then define a 7-bit extension that represents an extra plane of arcs, extending the coverage to almost full non-projectivity (over 99.9% empirical arc coverage). Results on a set of diverse treebanks show that our 7-bit encoding obtains substantial accuracy gains over the previously best-performing sequence labeling encodings. △ Less

Submitted 22 October, 2023; originally announced October 2023.

Comments: Accepted for publication at EMNLP 2023

MSC Class: 68T50 ACM Class: I.2.7

arXiv:2309.16254 [pdf, other]

On the Challenges of Fully Incremental Neural Dependency Parsing

Authors: Ana Ezquerro, Carlos Gómez-Rodríguez, David Vilares

Abstract: Since the popularization of BiLSTMs and Transformer-based bidirectional encoders, state-of-the-art syntactic parsers have lacked incrementality, requiring access to the whole sentence and deviating from human language processing. This paper explores whether fully incremental dependency parsing with modern architectures can be competitive. We build parsers combining strictly left-to-right neural en… ▽ More Since the popularization of BiLSTMs and Transformer-based bidirectional encoders, state-of-the-art syntactic parsers have lacked incrementality, requiring access to the whole sentence and deviating from human language processing. This paper explores whether fully incremental dependency parsing with modern architectures can be competitive. We build parsers combining strictly left-to-right neural encoders with fully incremental sequence-labeling and transition-based decoders. The results show that fully incremental parsing with modern architectures considerably lags behind bidirectional parsing, noting the challenges of psycholinguistically plausible parsing. △ Less

Submitted 28 September, 2023; originally announced September 2023.

Comments: Accepted at IJCNLP-AACL 2023

arXiv:2309.11165 [pdf, other]

Assessment of Pre-Trained Models Across Languages and Grammars

Authors: Alberto Muñoz-Ortiz, David Vilares, Carlos Gómez-Rodríguez

Abstract: We present an approach for assessing how multilingual large language models (LLMs) learn syntax in terms of multi-formalism syntactic structures. We aim to recover constituent and dependency structures by casting parsing as sequence labeling. To do so, we select a few LLMs and study them on 13 diverse UD treebanks for dependency parsing and 10 treebanks for constituent parsing. Our results show th… ▽ More We present an approach for assessing how multilingual large language models (LLMs) learn syntax in terms of multi-formalism syntactic structures. We aim to recover constituent and dependency structures by casting parsing as sequence labeling. To do so, we select a few LLMs and study them on 13 diverse UD treebanks for dependency parsing and 10 treebanks for constituent parsing. Our results show that: (i) the framework is consistent across encodings, (ii) pre-trained word vectors do not favor constituency representations of syntax over dependencies, (iii) sub-word tokenization is needed to represent syntax, in contrast to character-based models, and (iv) occurrence of a language in the pretraining data is more important than the amount of task data when recovering syntax from the word vectors. △ Less

Submitted 20 September, 2023; originally announced September 2023.

Comments: Accepted at IJCNLP-AACL 2023

arXiv:2308.09067 [pdf]

doi 10.1007/s10462-024-10903-2

Contrasting Linguistic Patterns in Human and LLM-Generated News Text

Authors: Alberto Muñoz-Ortiz, Carlos Gómez-Rodríguez, David Vilares

Abstract: We conduct a quantitative analysis contrasting human-written English news text with comparable large language model (LLM) output from six different LLMs that cover three different families and four sizes in total. Our analysis spans several measurable linguistic dimensions, including morphological, syntactic, psychometric, and sociolinguistic aspects. The results reveal various measurable differen… ▽ More We conduct a quantitative analysis contrasting human-written English news text with comparable large language model (LLM) output from six different LLMs that cover three different families and four sizes in total. Our analysis spans several measurable linguistic dimensions, including morphological, syntactic, psychometric, and sociolinguistic aspects. The results reveal various measurable differences between human and AI-generated texts. Human texts exhibit more scattered sentence length distributions, more variety of vocabulary, a distinct use of dependency and constituent types, shorter constituents, and more optimized dependency distances. Humans tend to exhibit stronger negative emotions (such as fear and disgust) and less joy compared to text generated by LLMs, with the toxicity of these models increasing as their size grows. LLM outputs use more numbers, symbols and auxiliaries (suggesting objective language) than human texts, as well as more pronouns. The sexist bias prevalent in human text is also expressed by LLMs, and even magnified in all of them but one. Differences between LLMs and humans are larger than between LLMs. △ Less

Submitted 2 September, 2024; v1 submitted 17 August, 2023; originally announced August 2023.

Comments: Published at Artificial Intelligence Review vol. 57, 265

MSC Class: 68T50 ACM Class: I.2.7

Journal ref: Artificial Intelligence Review 57, 265 (2024)

arXiv:2305.15119 [pdf, other]

Another Dead End for Morphological Tags? Perturbed Inputs and Parsing

Authors: Alberto Muñoz-Ortiz, David Vilares

Abstract: The usefulness of part-of-speech tags for parsing has been heavily questioned due to the success of word-contextualized parsers. Yet, most studies are limited to coarse-grained tags and high quality written content; while we know little about their influence when it comes to models in production that face lexical errors. We expand these setups and design an adversarial attack to verify if the use… ▽ More The usefulness of part-of-speech tags for parsing has been heavily questioned due to the success of word-contextualized parsers. Yet, most studies are limited to coarse-grained tags and high quality written content; while we know little about their influence when it comes to models in production that face lexical errors. We expand these setups and design an adversarial attack to verify if the use of morphological information by parsers: (i) contributes to error propagation or (ii) if on the other hand it can play a role to correct mistakes that word-only neural parsers make. The results on 14 diverse UD treebanks show that under such attacks, for transition- and graph-based models their use contributes to degrade the performance even faster, while for the (lower-performing) sequence labeling parsers they are helpful. We also show that if morphological tags were utopically robust against lexical perturbations, they would be able to correct parsing mistakes. △ Less

Submitted 24 May, 2023; originally announced May 2023.

Comments: Accepted at Findings of ACL 2023

arXiv:2210.15219 [pdf, other]

Parsing linearizations appreciate PoS tags - but some are fussy about errors

Authors: Alberto Muñoz-Ortiz, Mark Anderson, David Vilares, Carlos Gómez-Rodríguez

Abstract: PoS tags, once taken for granted as a useful resource for syntactic parsing, have become more situational with the popularization of deep learning. Recent work on the impact of PoS tags on graph- and transition-based parsers suggests that they are only useful when tagging accuracy is prohibitively high, or in low-resource scenarios. However, such an analysis is lacking for the emerging sequence la… ▽ More PoS tags, once taken for granted as a useful resource for syntactic parsing, have become more situational with the popularization of deep learning. Recent work on the impact of PoS tags on graph- and transition-based parsers suggests that they are only useful when tagging accuracy is prohibitively high, or in low-resource scenarios. However, such an analysis is lacking for the emerging sequence labeling parsing paradigm, where it is especially relevant as some models explicitly use PoS tags for encoding and decoding. We undertake a study and uncover some trends. Among them, PoS tags are generally more useful for sequence labeling parsers than for other paradigms, but the impact of their accuracy is highly encoding-dependent, with the PoS-based head-selection encoding being best only when both tagging accuracy and resource availability are high. △ Less

Submitted 27 October, 2022; originally announced October 2022.

Comments: Accepted at AACL 2022

arXiv:2209.06699 [pdf, other]

The Fragility of Multi-Treebank Parsing Evaluation

Authors: Iago Alonso-Alonso, David Vilares, Carlos Gómez-Rodríguez

Abstract: Treebank selection for parsing evaluation and the spurious effects that might arise from a biased choice have not been explored in detail. This paper studies how evaluating on a single subset of treebanks can lead to weak conclusions. First, we take a few contrasting parsers, and run them on subsets of treebanks proposed in previous work, whose use was justified (or not) on criteria such as typolo… ▽ More Treebank selection for parsing evaluation and the spurious effects that might arise from a biased choice have not been explored in detail. This paper studies how evaluating on a single subset of treebanks can lead to weak conclusions. First, we take a few contrasting parsers, and run them on subsets of treebanks proposed in previous work, whose use was justified (or not) on criteria such as typology or data scarcity. Second, we run a large-scale version of this experiment, create vast amounts of random subsets of treebanks, and compare on them many parsers whose scores are available. The results show substantial variability across subsets and that although establishing guidelines for good treebank selection is hard, it is possible to detect potentially harmful strategies. △ Less

Submitted 14 September, 2022; originally announced September 2022.

Comments: Accepted at COLING 2022

arXiv:2205.09350 [pdf, other]

Cross-lingual Inflection as a Data Augmentation Method for Parsing

Authors: Alberto Muñoz-Ortiz, Carlos Gómez-Rodríguez, David Vilares

Abstract: We propose a morphology-based method for low-resource (LR) dependency parsing. We train a morphological inflector for target LR languages, and apply it to related rich-resource (RR) treebanks to create cross-lingual (x-inflected) treebanks that resemble the target LR language. We use such inflected treebanks to train parsers in zero- (training on x-inflected treebanks) and few-shot (training on x-… ▽ More We propose a morphology-based method for low-resource (LR) dependency parsing. We train a morphological inflector for target LR languages, and apply it to related rich-resource (RR) treebanks to create cross-lingual (x-inflected) treebanks that resemble the target LR language. We use such inflected treebanks to train parsers in zero- (training on x-inflected treebanks) and few-shot (training on x-inflected and target language treebanks) setups. The results show that the method sometimes improves the baselines, but not consistently. △ Less

Submitted 20 May, 2022; v1 submitted 19 May, 2022; originally announced May 2022.

Comments: 10 pages, 7 tables, 5 figures. Workshop on Insights from Negative Results in NLP 2022 (co-located with ACL)

arXiv:2204.12820 [pdf, other]

LyS_ACoruña at SemEval-2022 Task 10: Repurposing Off-the-Shelf Tools for Sentiment Analysis as Semantic Dependency Parsing

Authors: Iago Alonso-Alonso, David Vilares, Carlos Gómez-Rodríguez

Abstract: This paper addressed the problem of structured sentiment analysis using a bi-affine semantic dependency parser, large pre-trained language models, and publicly available translation models. For the monolingual setup, we considered: (i) training on a single treebank, and (ii) relaxing the setup by training on treebanks coming from different languages that can be adequately processed by cross-lingua… ▽ More This paper addressed the problem of structured sentiment analysis using a bi-affine semantic dependency parser, large pre-trained language models, and publicly available translation models. For the monolingual setup, we considered: (i) training on a single treebank, and (ii) relaxing the setup by training on treebanks coming from different languages that can be adequately processed by cross-lingual language models. For the zero-shot setup and a given target treebank, we relied on: (i) a word-level translation of available treebanks in other languages to get noisy, unlikely-grammatical, but annotated data (we release as much of it as licenses allow), and (ii) merging those translated treebanks to obtain training data. In the post-evaluation phase, we also trained cross-lingual models that simply merged all the English treebanks and did not use word-level translations, and yet obtained better results. According to the official results, we ranked 8th and 9th in the monolingual and cross-lingual setups. △ Less

Submitted 27 April, 2022; originally announced April 2022.

Comments: To be published at SemEval-2022

arXiv:2108.07556 [pdf, other]

Not All Linearizations Are Equally Data-Hungry in Sequence Labeling Parsing

Authors: Alberto Muñoz-Ortiz, Michalina Strzyz, David Vilares

Abstract: Different linearizations have been proposed to cast dependency parsing as sequence labeling and solve the task as: (i) a head selection problem, (ii) finding a representation of the token arcs as bracket strings, or (iii) associating partial transition sequences of a transition-based parser to words. Yet, there is little understanding about how these linearizations behave in low-resource setups. H… ▽ More Different linearizations have been proposed to cast dependency parsing as sequence labeling and solve the task as: (i) a head selection problem, (ii) finding a representation of the token arcs as bracket strings, or (iii) associating partial transition sequences of a transition-based parser to words. Yet, there is little understanding about how these linearizations behave in low-resource setups. Here, we first study their data efficiency, simulating data-restricted setups from a diverse set of rich-resource treebanks. Second, we test whether such differences manifest in truly low-resource setups. The results show that head selection encodings are more data-efficient and perform better in an ideal (gold) framework, but that such advantage greatly vanishes in favour of bracketing formats when the running setup resembles a real-world low-resource configuration. △ Less

Submitted 17 August, 2021; originally announced August 2021.

Comments: Accepted at RANLP 2021 (https://ranlp.org/ranlp2021)

arXiv:2105.02947 [pdf, other]

On the logistical difficulties and findings of Jopara Sentiment Analysis

Authors: Marvin M. Agüero-Torales, David Vilares, Antonio G. López-Herrera

Abstract: This paper addresses the problem of sentiment analysis for Jopara, a code-switching language between Guarani and Spanish. We first collect a corpus of Guarani-dominant tweets and discuss on the difficulties of finding quality data for even relatively easy-to-annotate tasks, such as sentiment analysis. Then, we train a set of neural models, including pre-trained language models, and explore whether… ▽ More This paper addresses the problem of sentiment analysis for Jopara, a code-switching language between Guarani and Spanish. We first collect a corpus of Guarani-dominant tweets and discuss on the difficulties of finding quality data for even relatively easy-to-annotate tasks, such as sentiment analysis. Then, we train a set of neural models, including pre-trained language models, and explore whether they perform better than traditional machine learning ones in this low-resource setup. Transformer architectures obtain the best results, despite not considering Guarani during pre-training, but traditional machine learning models perform close due to the low-resource nature of the problem. △ Less

Submitted 11 May, 2021; v1 submitted 6 May, 2021; originally announced May 2021.

Comments: Accepted in the CALCS 2021 (co-located with NAACL 2021) - Fifth Workshop on Computational Approaches to Linguistic Code Switching, to appear (June 2021)

MSC Class: 68-02 68T50 68T07 91D30

Journal ref: Proceedings on CALCS 2021 (co-located with NAACL 2021) - Fifth Workshop on Computational Approaches to Linguistic Code Switching

arXiv:2103.13799 [pdf, other]

doi 10.26342/2021-66-1

Bertinho: Galician BERT Representations

Authors: David Vilares, Marcos Garcia, Carlos Gómez-Rodríguez

Abstract: This paper presents a monolingual BERT model for Galician. We follow the recent trend that shows that it is feasible to build robust monolingual BERT models even for relatively low-resource languages, while performing better than the well-known official multilingual BERT (mBERT). More particularly, we release two monolingual Galician BERT models, built using 6 and 12 transformer layers, respective… ▽ More This paper presents a monolingual BERT model for Galician. We follow the recent trend that shows that it is feasible to build robust monolingual BERT models even for relatively low-resource languages, while performing better than the well-known official multilingual BERT (mBERT). More particularly, we release two monolingual Galician BERT models, built using 6 and 12 transformer layers, respectively; trained with limited resources (~45 million tokens on a single GPU of 24GB). We then provide an exhaustive evaluation on a number of tasks such as POS-tagging, dependency parsing and named entity recognition. For this purpose, all these tasks are cast in a pure sequence labeling setup in order to run BERT without the need to include any additional layers on top of it (we only use an output classification layer to map the contextualized representations into the predicted label). The experiments show that our models, especially the 12-layer one, outperform the results of mBERT in most tasks. △ Less

Submitted 25 March, 2021; originally announced March 2021.

Comments: Accepted in the journal Procesamiento del Lenguaje Natural

Journal ref: Procesamiento del Lenguaje Natural. 66 (2021) 13-26

arXiv:2011.00596 [pdf, ps, other]

Bracketing Encodings for 2-Planar Dependency Parsing

Authors: Michalina Strzyz, David Vilares, Carlos Gómez-Rodríguez

Abstract: We present a bracketing-based encoding that can be used to represent any 2-planar dependency tree over a sentence of length n as a sequence of n labels, hence providing almost total coverage of crossing arcs in sequence labeling parsing. First, we show that existing bracketing encodings for parsing as labeling can only handle a very mild extension of projective trees. Second, we overcome this limi… ▽ More We present a bracketing-based encoding that can be used to represent any 2-planar dependency tree over a sentence of length n as a sequence of n labels, hence providing almost total coverage of crossing arcs in sequence labeling parsing. First, we show that existing bracketing encodings for parsing as labeling can only handle a very mild extension of projective trees. Second, we overcome this limitation by taking into account the well-known property of 2-planarity, which is present in the vast majority of dependency syntactic structures in treebanks, i.e., the arcs of a dependency tree can be split into two planes such that arcs in a given plane do not cross. We take advantage of this property to design a method that balances the brackets and that encodes the arcs belonging to each of those planes, allowing for almost unrestricted non-projectivity (round 99.9% coverage) in sequence labeling parsing. The experiments show that our linearizations improve over the accuracy of the original bracketing encoding in highly non-projective treebanks (on average by 0.4 LAS), while achieving a similar speed. Also, they are especially suitable when PoS tags are not used as input parameters to the models. △ Less

Submitted 22 March, 2021; v1 submitted 1 November, 2020; originally announced November 2020.

Comments: COLING2020 (long papers), 13 pages (incl. appendix) with corrected parsing speeds for Danish and Gothic

MSC Class: 68T50 ACM Class: I.2.7

arXiv:2011.00584 [pdf, ps, other]

A Unifying Theory of Transition-based and Sequence Labeling Parsing

Authors: Carlos Gómez-Rodríguez, Michalina Strzyz, David Vilares

Abstract: We define a mapping from transition-based parsing algorithms that read sentences from left to right to sequence labeling encodings of syntactic trees. This not only establishes a theoretical relation between transition-based parsing and sequence-labeling parsing, but also provides a method to obtain new encodings for fast and simple sequence labeling parsing from the many existing transition-based… ▽ More We define a mapping from transition-based parsing algorithms that read sentences from left to right to sequence labeling encodings of syntactic trees. This not only establishes a theoretical relation between transition-based parsing and sequence-labeling parsing, but also provides a method to obtain new encodings for fast and simple sequence labeling parsing from the many existing transition-based parsers for different formalisms. Applying it to dependency parsing, we implement sequence labeling versions of four algorithms, showing that they are learnable and obtain comparable performance to existing encodings. △ Less

Submitted 1 November, 2020; originally announced November 2020.

Comments: Camera-ready version (final peer-reviewed manuscript) to appear at proceedings of COLING 2020. 18 pages (incl. appendices)

MSC Class: 68T50; 68Q45 ACM Class: F.4.3; I.2.7

arXiv:2010.00633 [pdf, other]

Discontinuous Constituent Parsing as Sequence Labeling

Authors: David Vilares, Carlos Gómez-Rodríguez

Abstract: This paper reduces discontinuous parsing to sequence labeling. It first shows that existing reductions for constituent parsing as labeling do not support discontinuities. Second, it fills this gap and proposes to encode tree discontinuities as nearly ordered permutations of the input sequence. Third, it studies whether such discontinuous representations are learnable. The experiments show that des… ▽ More This paper reduces discontinuous parsing to sequence labeling. It first shows that existing reductions for constituent parsing as labeling do not support discontinuities. Second, it fills this gap and proposes to encode tree discontinuities as nearly ordered permutations of the input sequence. Third, it studies whether such discontinuous representations are learnable. The experiments show that despite the architectural simplicity, under the right representation, the models are fast and accurate. △ Less

Submitted 1 October, 2020; originally announced October 2020.

Comments: To appear in EMNLP 2020

arXiv:2002.01685 [pdf, other]

Parsing as Pretraining

Authors: David Vilares, Michalina Strzyz, Anders Søgaard, Carlos Gómez-Rodríguez

Abstract: Recent analyses suggest that encoders pretrained for language modeling capture certain morpho-syntactic structure. However, probing frameworks for word vectors still do not report results on standard setups such as constituent and dependency parsing. This paper addresses this problem and does full parsing (on English) relying only on pretraining architectures -- and no decoding. We first cast cons… ▽ More Recent analyses suggest that encoders pretrained for language modeling capture certain morpho-syntactic structure. However, probing frameworks for word vectors still do not report results on standard setups such as constituent and dependency parsing. This paper addresses this problem and does full parsing (on English) relying only on pretraining architectures -- and no decoding. We first cast constituent and dependency parsing as sequence tagging. We then use a single feed-forward layer to directly map word vectors to labels that encode a linearized tree. This is used to: (i) see how far we can reach on syntax modelling with just pretrained encoders, and (ii) shed some light about the syntax-sensitivity of different word vectors (by freezing the weights of the pretraining network during training). For evaluation, we use bracketing F1-score and LAS, and analyze in-depth differences across representations for span lengths and dependency displacements. The overall results surpass existing sequence tagging parsers on the PTB (93.5%) and end-to-end EN-EWT UD (78.8%). △ Less

Submitted 5 February, 2020; originally announced February 2020.

Comments: AAAI 2020 - The Thirty-Fourth AAAI Conference on Artificial Intelligence

arXiv:1909.01053 [pdf, other]

Towards Making a Dependency Parser See

Authors: Michalina Strzyz, David Vilares, Carlos Gómez-Rodríguez

Abstract: We explore whether it is possible to leverage eye-tracking data in an RNN dependency parser (for English) when such information is only available during training, i.e., no aggregated or token-level gaze features are used at inference time. To do so, we train a multitask learning model that parses sentences as sequence labeling and leverages gaze features as auxiliary tasks. Our method also learns… ▽ More We explore whether it is possible to leverage eye-tracking data in an RNN dependency parser (for English) when such information is only available during training, i.e., no aggregated or token-level gaze features are used at inference time. To do so, we train a multitask learning model that parses sentences as sequence labeling and leverages gaze features as auxiliary tasks. Our method also learns to train from disjoint datasets, i.e. it can be used to test whether already collected gaze features are useful to improve the performance on new non-gazed annotated treebanks. Accuracy gains are modest but positive, showing the feasibility of the approach. It can serve as a first step towards architectures that can better leverage eye-tracking data or other complementary information available only for training sentences, possibly leading to improvements in syntactic parsing. △ Less

Submitted 3 September, 2019; originally announced September 2019.

Comments: Camera-ready version to appear at EMNLP 2019 (final peer-reviewed manuscript). 8 pages (incl. appendix)

MSC Class: 68T50 ACM Class: I.2.7

arXiv:1908.03480 [pdf, other]

Artificially Evolved Chunks for Morphosyntactic Analysis

Authors: Mark Anderson, David Vilares, Carlos Gómez-Rodríguez

Abstract: We introduce a language-agnostic evolutionary technique for automatically extracting chunks from dependency treebanks. We evaluate these chunks on a number of morphosyntactic tasks, namely POS tagging, morphological feature tagging, and dependency parsing. We test the utility of these chunks in a host of different ways. We first learn chunking as one task in a shared multi-task framework together… ▽ More We introduce a language-agnostic evolutionary technique for automatically extracting chunks from dependency treebanks. We evaluate these chunks on a number of morphosyntactic tasks, namely POS tagging, morphological feature tagging, and dependency parsing. We test the utility of these chunks in a host of different ways. We first learn chunking as one task in a shared multi-task framework together with POS and morphological feature tagging. The predictions from this network are then used as input to augment sequence-labelling dependency parsing. Finally, we investigate the impact chunks have on dependency parsing in a multi-task framework. Our results from these analyses show that these chunks improve performance at different levels of syntactic abstraction on English UD treebanks and a small, diverse subset of non-English UD treebanks. △ Less

Submitted 21 August, 2019; v1 submitted 9 August, 2019; originally announced August 2019.

Comments: To be published in proceedings of the 18th International Workshop on Treebanks and Linguistic Theories

arXiv:1907.01339 [pdf, other]

doi 10.18653/v1/P19-1531

Sequence Labeling Parsing by Learning Across Representations

Authors: Michalina Strzyz, David Vilares, Carlos Gómez-Rodríguez

Abstract: We use parsing as sequence labeling as a common framework to learn across constituency and dependency syntactic abstractions. To do so, we cast the problem as multitask learning (MTL). First, we show that adding a parsing paradigm as an auxiliary loss consistently improves the performance on the other paradigm. Secondly, we explore an MTL sequence labeling model that parses both representations, a… ▽ More We use parsing as sequence labeling as a common framework to learn across constituency and dependency syntactic abstractions. To do so, we cast the problem as multitask learning (MTL). First, we show that adding a parsing paradigm as an auxiliary loss consistently improves the performance on the other paradigm. Secondly, we explore an MTL sequence labeling model that parses both representations, at almost no cost in terms of performance and speed. The results across the board show that on average MTL models with auxiliary losses for constituency parsing outperform single-task ones by 1.14 F1 points, and for dependency parsing by 0.62 UAS points. △ Less

Submitted 7 January, 2020; v1 submitted 2 July, 2019; originally announced July 2019.

Comments: Proc. of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019). Revised version after fixing evaluation bug

arXiv:1906.04701 [pdf, other]

HEAD-QA: A Healthcare Dataset for Complex Reasoning

Authors: David Vilares, Carlos Gómez-Rodríguez

Abstract: We present HEAD-QA, a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans. We then consider monolingual (Spanish) and cross-lingual (to English) experiments with information retrieval and neural techniques. We show t… ▽ More We present HEAD-QA, a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans. We then consider monolingual (Spanish) and cross-lingual (to English) experiments with information retrieval and neural techniques. We show that: (i) HEAD-QA challenges current methods, and (ii) the results lag well behind human performance, demonstrating its usefulness as a benchmark for future work. △ Less

Submitted 11 June, 2019; originally announced June 2019.

Comments: ACL 2019 (short papers)

arXiv:1905.11037 [pdf, other]

Harry Potter and the Action Prediction Challenge from Natural Language

Authors: David Vilares, Carlos Gómez-Rodríguez

Abstract: We explore the challenge of action prediction from textual descriptions of scenes, a testbed to approximate whether text inference can be used to predict upcoming actions. As a case of study, we consider the world of the Harry Potter fantasy novels and inferring what spell will be cast next given a fragment of a story. Spells act as keywords that abstract actions (e.g. 'Alohomora' to open a door)… ▽ More We explore the challenge of action prediction from textual descriptions of scenes, a testbed to approximate whether text inference can be used to predict upcoming actions. As a case of study, we consider the world of the Harry Potter fantasy novels and inferring what spell will be cast next given a fragment of a story. Spells act as keywords that abstract actions (e.g. 'Alohomora' to open a door) and denote a response to the environment. This idea is used to automatically build HPAC, a corpus containing 82,836 samples and 85 actions. We then evaluate different baselines. Among the tested models, an LSTM-based approach obtains the best performance for frequent actions and large scene descriptions, but approaches such as logistic regression behave well on infrequent actions. △ Less

Submitted 27 May, 2019; originally announced May 2019.

Comments: NAACL 2019 (short papers)

arXiv:1902.10985 [pdf, other]

Better, Faster, Stronger Sequence Tagging Constituent Parsers

Authors: David Vilares, Mostafa Abdou, Anders Søgaard

Abstract: Sequence tagging models for constituent parsing are faster, but less accurate than other types of parsers. In this work, we address the following weaknesses of such constituent parsers: (a) high error rates around closing brackets of long constituents, (b) large label sets, leading to sparsity, and (c) error propagation arising from greedy decoding. To effectively close brackets, we train a model… ▽ More Sequence tagging models for constituent parsing are faster, but less accurate than other types of parsers. In this work, we address the following weaknesses of such constituent parsers: (a) high error rates around closing brackets of long constituents, (b) large label sets, leading to sparsity, and (c) error propagation arising from greedy decoding. To effectively close brackets, we train a model that learns to switch between tagging schemes. To reduce sparsity, we decompose the label set and use multi-task learning to jointly learn to predict sublabels. Finally, we mitigate issues from greedy decoding through auxiliary losses and sentence-level fine-tuning with policy gradient. Combining these techniques, we clearly surpass the performance of sequence tagging constituent parsers on the English and Chinese Penn Treebanks, and reduce their parsing time even further. On the SPMRL datasets, we observe even greater improvements across the board, including a new state of the art on Basque, Hebrew, Polish and Swedish. △ Less

Submitted 14 October, 2019; v1 submitted 28 February, 2019; originally announced February 2019.

Comments: NAACL 2019 (long papers). Contains corrigendum

arXiv:1902.10505 [pdf, other]

Viable Dependency Parsing as Sequence Labeling

Authors: Michalina Strzyz, David Vilares, Carlos Gómez-Rodríguez

Abstract: We recast dependency parsing as a sequence labeling problem, exploring several encodings of dependency trees as labels. While dependency parsing by means of sequence labeling had been attempted in existing work, results suggested that the technique was impractical. We show instead that with a conventional BiLSTM-based model it is possible to obtain fast and accurate parsers. These parsers are conc… ▽ More We recast dependency parsing as a sequence labeling problem, exploring several encodings of dependency trees as labels. While dependency parsing by means of sequence labeling had been attempted in existing work, results suggested that the technique was impractical. We show instead that with a conventional BiLSTM-based model it is possible to obtain fast and accurate parsers. These parsers are conceptually simple, not needing traditional parsing algorithms or auxiliary structures. However, experiments on the PTB and a sample of UD treebanks show that they provide a good speed-accuracy tradeoff, with results competitive with more complex approaches. △ Less

Submitted 29 March, 2019; v1 submitted 27 February, 2019; originally announced February 2019.

Comments: Camera-ready version to appear at NAACL 2019 (final peer-reviewed manuscript). 8 pages (incl. appendix)

MSC Class: 68T50 ACM Class: I.2.7

arXiv:1810.08997 [pdf, ps, other]

Transition-based Parsing with Lighter Feed-Forward Networks

Authors: David Vilares, Carlos Gómez-Rodríguez

Abstract: We explore whether it is possible to build lighter parsers, that are statistically equivalent to their corresponding standard version, for a wide set of languages showing different structures and morphologies. As testbed, we use the Universal Dependencies and transition-based dependency parsers trained on feed-forward networks. For these, most existing research assumes de facto standard embedded f… ▽ More We explore whether it is possible to build lighter parsers, that are statistically equivalent to their corresponding standard version, for a wide set of languages showing different structures and morphologies. As testbed, we use the Universal Dependencies and transition-based dependency parsers trained on feed-forward networks. For these, most existing research assumes de facto standard embedded features and relies on pre-computation tricks to obtain speed-ups. We explore how these features and their size can be reduced and whether this translates into speed-ups with a negligible impact on accuracy. The experiments show that grand-daughter features can be removed for the majority of treebanks without a significant (negative or positive) LAS difference. They also show how the size of the embeddings can be notably reduced. △ Less

Submitted 21 October, 2018; originally announced October 2018.

Comments: UD Workshop (co-located with EMNLP 2018)

arXiv:1810.08994 [pdf, other]

Constituent Parsing as Sequence Labeling

Authors: Carlos Gómez-Rodríguez, David Vilares

Abstract: We introduce a method to reduce constituent parsing to sequence labeling. For each word w_t, it generates a label that encodes: (1) the number of ancestors in the tree that the words w_t and w_{t+1} have in common, and (2) the nonterminal symbol at the lowest common ancestor. We first prove that the proposed encoding function is injective for any tree without unary branches. In practice, the appro… ▽ More We introduce a method to reduce constituent parsing to sequence labeling. For each word w_t, it generates a label that encodes: (1) the number of ancestors in the tree that the words w_t and w_{t+1} have in common, and (2) the nonterminal symbol at the lowest common ancestor. We first prove that the proposed encoding function is injective for any tree without unary branches. In practice, the approach is made extensible to all constituency trees by collapsing unary branches. We then use the PTB and CTB treebanks as testbeds and propose a set of fast baselines. We achieve 90.7% F-score on the PTB test set, outperforming the Vinyals et al. (2015) sequence-to-sequence parser. In addition, sacrificing some accuracy, our approach achieves the fastest constituent parsing speeds reported to date on PTB by a wide margin. △ Less

Submitted 17 September, 2019; v1 submitted 21 October, 2018; originally announced October 2018.

Comments: EMNLP 2018 (Long Papers). Revised version with improved results after fixing evaluation bug

arXiv:1805.09055 [pdf, ps, other]

Grounding the Semantics of Part-of-Day Nouns Worldwide using Twitter

Authors: David Vilares, Carlos Gómez-Rodríguez

Abstract: The usage of part-of-day nouns, such as 'night', and their time-specific greetings ('good night'), varies across languages and cultures. We show the possibilities that Twitter offers for studying the semantics of these terms and its variability between countries. We mine a worldwide sample of multilingual tweets with temporal greetings, and study how their frequencies vary in relation with local t… ▽ More The usage of part-of-day nouns, such as 'night', and their time-specific greetings ('good night'), varies across languages and cultures. We show the possibilities that Twitter offers for studying the semantics of these terms and its variability between countries. We mine a worldwide sample of multilingual tweets with temporal greetings, and study how their frequencies vary in relation with local time. The results provide insights into the semantics of these temporal expressions and the cultural and sociological factors influencing their usage. △ Less

Submitted 23 May, 2018; originally announced May 2018.

Comments: In PEOPLES2018 short papers (NAACL workshop), 6 pages, 5 figures, 1 table

arXiv:1805.09007 [pdf, other]

A Transition-based Algorithm for Unrestricted AMR Parsing

Authors: David Vilares, Carlos Gómez-Rodríguez

Abstract: Non-projective parsing can be useful to handle cycles and reentrancy in AMR graphs. We explore this idea and introduce a greedy left-to-right non-projective transition-based parser. At each parsing configuration, an oracle decides whether to create a concept or whether to connect a pair of existing concepts. The algorithm handles reentrancy and arbitrary cycles natively, i.e. within the transition… ▽ More Non-projective parsing can be useful to handle cycles and reentrancy in AMR graphs. We explore this idea and introduce a greedy left-to-right non-projective transition-based parser. At each parsing configuration, an oracle decides whether to create a concept or whether to connect a pair of existing concepts. The algorithm handles reentrancy and arbitrary cycles natively, i.e. within the transition system itself. The model is evaluated on the LDC2015E86 corpus, obtaining results close to the state of the art, including a Smatch of 64%, and showing good behavior on reentrant edges. △ Less

Submitted 23 May, 2018; originally announced May 2018.

Comments: In NAACL 2018 (short papers): 8 pages, 1 Figure

arXiv:1708.05269 [pdf, other]

Towards Syntactic Iberian Polarity Classification

Authors: David Vilares, Marcos Garcia, Miguel A. Alonso, Carlos Gómez-Rodríguez

Abstract: Lexicon-based methods using syntactic rules for polarity classification rely on parsers that are dependent on the language and on treebank guidelines. Thus, rules are also dependent and require adaptation, especially in multilingual scenarios. We tackle this challenge in the context of the Iberian Peninsula, releasing the first symbolic syntax-based Iberian system with rules shared across five off… ▽ More Lexicon-based methods using syntactic rules for polarity classification rely on parsers that are dependent on the language and on treebank guidelines. Thus, rules are also dependent and require adaptation, especially in multilingual scenarios. We tackle this challenge in the context of the Iberian Peninsula, releasing the first symbolic syntax-based Iberian system with rules shared across five official languages: Basque, Catalan, Galician, Portuguese and Spanish. The model is made available. △ Less

Submitted 17 August, 2017; originally announced August 2017.

Comments: 7 pages, 5 tables. Contribution to the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA-2017) at EMNLP 2017

arXiv:1707.03228 [pdf, other]

A non-projective greedy dependency parser with bidirectional LSTMs

Authors: David Vilares, Carlos Gómez-Rodríguez

Abstract: The LyS-FASTPARSE team presents BIST-COVINGTON, a neural implementation of the Covington (2001) algorithm for non-projective dependency parsing. The bidirectional LSTM approach by Kipperwasser and Goldberg (2016) is used to train a greedy parser with a dynamic oracle to mitigate error propagation. The model participated in the CoNLL 2017 UD Shared Task. In spite of not using any ensemble methods a… ▽ More The LyS-FASTPARSE team presents BIST-COVINGTON, a neural implementation of the Covington (2001) algorithm for non-projective dependency parsing. The bidirectional LSTM approach by Kipperwasser and Goldberg (2016) is used to train a greedy parser with a dynamic oracle to mitigate error propagation. The model participated in the CoNLL 2017 UD Shared Task. In spite of not using any ensemble methods and using the baseline segmentation and PoS tagging, the parser obtained good results on both macro-average LAS and UAS in the big treebanks category (55 languages), ranking 7th out of 33 teams. In the all treebanks category (LAS and UAS) we ranked 16th and 12th. The gap between the all and big categories is mainly due to the poor performance on four parallel PUD treebanks, suggesting that some `suffixed' treebanks (e.g. Spanish-AnCora) perform poorly on cross-treebank settings, which does not occur with the corresponding `unsuffixed' treebank (e.g. Spanish). By changing that, we obtain the 11th best LAS among all runs (official and unofficial). The code is made available at https://github.com/CoNLL-UD-2017/LyS-FASTPARSE △ Less

Submitted 11 July, 2017; originally announced July 2017.

Comments: 12 pages, 2 figures, 5 tables

Journal ref: In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 152-162, Vancouver, Canada, 2017

arXiv:1706.02141 [pdf, other]

doi 10.1007/s10462-017-9584-0

How Important is Syntactic Parsing Accuracy? An Empirical Evaluation on Rule-Based Sentiment Analysis

Authors: Carlos Gómez-Rodríguez, Iago Alonso-Alonso, David Vilares

Abstract: Syntactic parsing, the process of obtaining the internal structure of sentences in natural languages, is a crucial task for artificial intelligence applications that need to extract meaning from natural language text or speech. Sentiment analysis is one example of application for which parsing has recently proven useful. In recent years, there have been significant advances in the accuracy of pa… ▽ More Syntactic parsing, the process of obtaining the internal structure of sentences in natural languages, is a crucial task for artificial intelligence applications that need to extract meaning from natural language text or speech. Sentiment analysis is one example of application for which parsing has recently proven useful. In recent years, there have been significant advances in the accuracy of parsing algorithms. In this article, we perform an empirical, task-oriented evaluation to determine how parsing accuracy influences the performance of a state-of-the-art rule-based sentiment analysis system that determines the polarity of sentences from their parse trees. In particular, we evaluate the system using four well-known dependency parsers, including both current models with state-of-the-art accuracy and more innacurate models which, however, require less computational resources. The experiments show that all of the parsers produce similarly good results in the sentiment analysis task, without their accuracy having any relevant influence on the results. Since parsing is currently a task with a relatively high computational cost that varies strongly between algorithms, this suggests that sentiment analysis researchers and users should prioritize speed over accuracy when choosing a parser; and parsing researchers should investigate models that improve speed further, even at some cost to accuracy. △ Less

Submitted 24 October, 2017; v1 submitted 7 June, 2017; originally announced June 2017.

Comments: 19 pages. Accepted for publication in Artificial Intelligence Review. This update only adds the DOI link to comply with journal's terms

MSC Class: 68T50; 97R40 ACM Class: I.2.7

arXiv:1606.05545 [pdf, ps, other]

doi 10.1016/j.knosys.2016.11.014

Universal, Unsupervised (Rule-Based), Uncovered Sentiment Analysis

Authors: David Vilares, Carlos Gómez-Rodríguez, Miguel A. Alonso

Abstract: We present a novel unsupervised approach for multilingual sentiment analysis driven by compositional syntax-based rules. On the one hand, we exploit some of the main advantages of unsupervised algorithms: (1) the interpretability of their output, in contrast with most supervised models, which behave as a black box and (2) their robustness across different corpora and domains. On the other hand, by… ▽ More We present a novel unsupervised approach for multilingual sentiment analysis driven by compositional syntax-based rules. On the one hand, we exploit some of the main advantages of unsupervised algorithms: (1) the interpretability of their output, in contrast with most supervised models, which behave as a black box and (2) their robustness across different corpora and domains. On the other hand, by introducing the concept of compositional operations and exploiting syntactic information in the form of universal dependencies, we tackle one of their main drawbacks: their rigidity on data that are structured differently depending on the language concerned. Experiments show an improvement both over existing unsupervised methods, and over state-of-the-art supervised models when evaluating outside their corpus of origin. Experiments also show how the same compositional operations can be shared across languages. The system is available at http://www.grupolys.org/software/UUUSA/ △ Less

Submitted 5 January, 2017; v1 submitted 17 June, 2016; originally announced June 2016.

Comments: 19 pages, 5 Tables, 6 Figures. This is the authors version of a work that was accepted for publication in Knowledge-Based Systems

Journal ref: Knowledge-Based Systems, 118:45-55, 2017

arXiv:1507.08449 [pdf, ps, other]

One model, two languages: training bilingual parsers with harmonized treebanks

Authors: David Vilares, Carlos Gómez-Rodríguez, Miguel A. Alonso

Abstract: We introduce an approach to train lexicalized parsers using bilingual corpora obtained by merging harmonized treebanks of different languages, producing parsers that can analyze sentences in either of the learned languages, or even sentences that mix both. We test the approach on the Universal Dependency Treebanks, training with MaltParser and MaltOptimizer. The results show that these bilingual p… ▽ More We introduce an approach to train lexicalized parsers using bilingual corpora obtained by merging harmonized treebanks of different languages, producing parsers that can analyze sentences in either of the learned languages, or even sentences that mix both. We test the approach on the Universal Dependency Treebanks, training with MaltParser and MaltOptimizer. The results show that these bilingual parsers are more than competitive, as most combinations not only preserve accuracy, but some even achieve significant improvements over the corresponding monolingual parsers. Preliminary experiments also show the approach to be promising on texts with code-switching and when more languages are added. △ Less

Submitted 19 May, 2016; v1 submitted 30 July, 2015; originally announced July 2015.

Comments: 7 pages, 4 tables, 1 figure

Showing 1–39 of 39 results for author: Vilares, D