Showing 1–1 of 1 results for author: Dailidėnaitė, M

Search v0.5.6 released 2020-02-24

arXiv:1911.10038

cs.CL

Multilingual Culture-Independent Word Analogy Datasets

Authors: Matej Ulčar, Kristiina Vaik, Jessica Lindström, Milda Dailidėnaitė, Marko Robnik-Šikonja

Abstract: In text processing, deep neural networks mostly use word embeddings as an input. Embeddings have to ensure that relations between words are reflected through distances in a high-dimensional numeric space. To compare the quality of different text embeddings, typically, we use benchmark datasets. We present a collection of such datasets for the word analogy task in nine languages: Croatian, English, Estonian, Finnish, Latvian, Lithuanian, Russian, Slovenian, and Swedish. We redesigned the original monolingual analogy task to be much more culturally independent and also constructed cross-lingual analogy datasets for the involved languages. We present basic statistics of the created datasets and their initial evaluation using fastText embeddings.

Submitted 27 March, 2020; v1 submitted 22 November, 2019; originally announced November 2019.

Comments: 7 pages, LREC2020 conference

ACM Class: J.5

Journal ref: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4074-4080
