Co-occurrences using Fasttext embeddings for word similarity tasks in Urdu

Khalid, Usama; Hussain, Aizaz; Arshad, Muhammad Umair; Shahzad, Waseem; Beg, Mirza Omer

Computer Science > Computation and Language

arXiv:2102.10957 (cs)

[Submitted on 22 Feb 2021]

Abstract: Urdu is a widely spoken language in South Asia. Although a considerable body of literature exists in Urdu, the available data is still not enough to process the language with NLP techniques. Very efficient language models exist for English, a high-resource language, but Urdu and other under-resourced languages have long been neglected. Creating efficient language models for these languages requires good word embedding models. For Urdu, the only word embeddings we could find were trained and developed using the skip-gram model. In this paper, we build a corpus for Urdu by scraping and integrating data from various sources, and we compile a vocabulary for the language. We also modify fastText embeddings and N-gram models so that they can be trained on this corpus. We use the trained embeddings for a word similarity task and compare the results with existing techniques.
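As an illustration of the two ingredients the abstract names, the following is a minimal Python sketch of fastText-style character n-gram extraction and of the cosine similarity commonly used to score word pairs in similarity tasks. It is not the authors' implementation. The function names `char_ngrams` and `cosine_similarity` are hypothetical. The n-gram range of 3 to 6 follows the fastText defaults and is not a setting taken from the paper. The hashing of n-grams into buckets and the training step are both omitted.

```python
import math


def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers.

    fastText wraps each word in '<' and '>' before extracting n-grams so
    that prefixes and suffixes stay distinguishable. The 3..6 range is the
    fastText default, assumed here for illustration only.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Trigrams of the Urdu word for "book"; Python strings are sequences of
# Unicode code points, so Urdu letters are handled like any other character.
print(char_ngrams("کتاب", 3, 3))
```

Because the n-grams are built from code points, the same function works for Urdu without any script-specific handling. In a full pipeline, each word vector would be the sum of its learned n-gram vectors, and `cosine_similarity` would be applied to pairs of such vectors.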

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2102.10957 [cs.CL]
	(or arXiv:2102.10957v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2102.10957

Submission history

From: Usama Khalid
[v1] Mon, 22 Feb 2021 12:56:26 UTC (324 KB)
