Mind Your Inflections! Improving NLP for Non-Standard Englishes with Base-Inflection Encoding

Tan, Samson; Joty, Shafiq; Varshney, Lav R.; Kan, Min-Yen

arXiv:2004.14870 (cs)

[Submitted on 30 Apr 2020 (v1), last revised 18 Nov 2020 (this version, v4)]

Abstract: Inflectional variation is a common feature of World Englishes such as Colloquial Singapore English and African American Vernacular English. Although comprehension by human readers is usually unimpaired by non-standard inflections, current NLP systems are not yet robust. We propose Base-Inflection Encoding (BITE), a method to tokenize English text by reducing inflected words to their base forms before reinjecting the grammatical information as special symbols. Fine-tuning pretrained NLP models for downstream tasks using our encoding defends against inflectional adversaries while maintaining performance on clean data. Models using BITE generalize better to dialects with non-standard inflections without explicit training, and translation models converge faster when trained with BITE. Finally, we show that our encoding improves the vocabulary efficiency of popular data-driven subword tokenizers. Since there has been no prior work on quantitatively evaluating vocabulary efficiency, we propose metrics to do so.
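
The encoding idea described in the abstract (reduce each inflected word to its base form, then reinject the grammatical information as a special symbol) can be illustrated with a minimal sketch. This is not the authors' implementation: the lookup table `TOY_INFLECTIONS`, the function names, and the bracketed symbol format such as `[VBZ]` are assumptions made for illustration only, and a real system would presumably rely on a part-of-speech tagger and a lemmatizer rather than a hand-written table.

```python
# Hypothetical toy table: inflected form -> (base form, Penn Treebank tag).
# A real system would derive this with a POS tagger and a lemmatizer.
TOY_INFLECTIONS = {
    "watched": ("watch", "VBD"),
    "watches": ("watch", "VBZ"),
    "watching": ("watch", "VBG"),
    "went": ("go", "VBD"),
    "goes": ("go", "VBZ"),
    "movies": ("movie", "NNS"),
}


def bite_like_encode(tokens):
    """Replace each known inflected token with its base form followed by
    a special inflection symbol; leave all other tokens unchanged."""
    encoded = []
    for token in tokens:
        entry = TOY_INFLECTIONS.get(token.lower())
        if entry is None:
            encoded.append(token)
        else:
            base, tag = entry
            # The bracketed symbol format is an assumption of this sketch.
            encoded.extend([base, f"[{tag}]"])
    return encoded


def strip_inflection_symbols(encoded):
    """Drop the special symbols, keeping only the base-form tokens."""
    return [t for t in encoded if not (t.startswith("[") and t.endswith("]"))]


standard = bite_like_encode("She watches two movies".split())
nonstandard = bite_like_encode("She watch two movie".split())

print(standard)     # ['She', 'watch', '[VBZ]', 'two', 'movie', '[NNS]']
print(nonstandard)  # ['She', 'watch', 'two', 'movie']
```

In this toy example the standard and non-standard sentences share the same base-form tokens and differ only in the inflection symbols, which is one way to picture why such an encoding might make a downstream model less sensitive to inflectional variation.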

Comments:	Published in the Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Cite as:	arXiv:2004.14870 [cs.CL]
	(or arXiv:2004.14870v4 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2004.14870
Journal reference:	2020.emnlp-main.455

Submission history

From: Samson Tan
[v1] Thu, 30 Apr 2020 15:15:40 UTC (64 KB)
[v2] Sun, 11 Oct 2020 18:54:40 UTC (7,229 KB)
[v3] Fri, 16 Oct 2020 05:20:28 UTC (7,230 KB)
[v4] Wed, 18 Nov 2020 06:16:31 UTC (7,229 KB)
