Overlap-based Vocabulary Generation Improves Cross-lingual Transfer Among Related Languages

Patil, Vaidehi; Talukdar, Partha; Sarawagi, Sunita

arXiv:2203.01976 (cs)

[Submitted on 3 Mar 2022 (v1), last revised 23 Mar 2022 (this version, v2)]

Abstract: Pre-trained multilingual language models such as mBERT and XLM-R have demonstrated great potential for zero-shot cross-lingual transfer to low web-resource languages (LRLs). However, because model capacity is limited, the large gap in the sizes of available monolingual corpora between high web-resource languages (HRLs) and LRLs leaves too little scope for co-embedding an LRL with an HRL, which hurts downstream task performance on LRLs. In this paper, we argue that relatedness among languages in a language family, along the dimension of lexical overlap, can be leveraged to overcome some of the corpus limitations of LRLs. We propose Overlap BPE (OBPE), a simple yet effective modification to the BPE vocabulary generation algorithm that enhances overlap across related languages. Through extensive experiments on multiple NLP tasks and datasets, we observe that OBPE generates a vocabulary that increases the representation of LRLs via tokens shared with HRLs. This results in improved zero-shot transfer from related HRLs to LRLs without reducing HRL representation and accuracy. Unlike previous studies that dismissed the importance of token overlap, we show that in the low-resource related-language setting, token overlap matters: synthetically reducing the overlap to zero can cause as much as a four-fold drop in zero-shot transfer accuracy.

Comments:	Accepted to appear at the ACL 2022 Main conference
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2203.01976 [cs.CL]
	(or arXiv:2203.01976v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2203.01976

Submission history

From: Vaidehi Patil
[v1] Thu, 3 Mar 2022 19:35:24 UTC (6,330 KB)
[v2] Wed, 23 Mar 2022 10:55:13 UTC (6,333 KB)
