Impact of Tokenization on Language Models: An Analysis for Turkish

Toraman, Cagri; Yilmaz, Eyup Halit; Şahinuç, Furkan; Ozcelik, Oguzhan

doi:10.1145/3578707

arXiv:2204.08832 (cs)

[Submitted on 19 Apr 2022]

Abstract: Tokenization is an important text preprocessing step that prepares input tokens for deep language models. WordPiece and BPE are the de facto methods employed by prominent models such as BERT and GPT. However, the impact of tokenization can differ for morphologically rich languages, such as the Turkic languages, where many words can be generated by adding prefixes and suffixes. We compare five tokenizers at different granularity levels, i.e., their outputs range from the smallest pieces (characters) to the surface forms of words, and include a Morphological-level tokenizer. We train these tokenizers and pretrain medium-sized language models using the RoBERTa pretraining procedure on the Turkish split of the OSCAR corpus. We then fine-tune our models on six downstream tasks. Our experiments, supported by statistical tests, reveal that the Morphological-level tokenizer performs competitively with the de facto tokenizers. Furthermore, we find that increasing the vocabulary size improves the performance of the Morphological- and Word-level tokenizers more than that of the de facto tokenizers. The ratio of the number of vocabulary parameters to the total number of model parameters can be empirically chosen as 20% for de facto tokenizers and 40% for other tokenizers to obtain a reasonable trade-off between model size and performance.
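
The vocabulary-to-total parameter ratio mentioned in the abstract can be illustrated with a little arithmetic. The sketch below is not taken from the paper: it assumes that "vocabulary parameters" means the token-embedding matrix (vocabulary size times hidden size), and the function names, the hidden size of 512, and the 40 million non-embedding parameters are all hypothetical values chosen only for illustration.

```python
def vocab_param_ratio(vocab_size: int, hidden_size: int, other_params: int) -> float:
    """Fraction of all parameters that sit in the token-embedding matrix.

    Assumption (not from the paper): vocabulary parameters are
    vocab_size * hidden_size, and other_params counts everything else.
    """
    vocab_params = vocab_size * hidden_size
    return vocab_params / (vocab_params + other_params)


def vocab_size_for_ratio(target_ratio: float, hidden_size: int, other_params: int) -> int:
    """Largest vocabulary size whose ratio does not exceed target_ratio.

    Derived by solving ratio = V*h / (V*h + P) for V:
        V = ratio * P / (h * (1 - ratio))
    """
    if not 0.0 < target_ratio < 1.0:
        raise ValueError("target_ratio must lie strictly between 0 and 1")
    return int(target_ratio * other_params / (hidden_size * (1.0 - target_ratio)))


if __name__ == "__main__":
    hidden_size = 512          # hypothetical hidden dimension
    other_params = 40_000_000  # hypothetical non-embedding parameter count

    # The 20% and 40% targets are the ratios quoted in the abstract.
    for target in (0.20, 0.40):
        v = vocab_size_for_ratio(target, hidden_size, other_params)
        r = vocab_param_ratio(v, hidden_size, other_params)
        print(f"target ratio {target:.0%}: vocab size {v}, achieved ratio {r:.4f}")
```

Under these assumptions, moving from a 20% to a 40% ratio at a fixed non-embedding size requires a considerably larger vocabulary, which is the model-size side of the trade-off the abstract refers to.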

Comments:	submitted to ACM TALLIP
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2204.08832 [cs.CL]
	(or arXiv:2204.08832v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2204.08832
Journal reference:	ACM Transactions on Asian and Low-Resource Language Information Processing (2023) Volume 22 Issue 4 pp 1-21
Related DOI:	https://doi.org/10.1145/3578707

Submission history

From: Cagri Toraman
[v1] Tue, 19 Apr 2022 12:01:46 UTC (254 KB)
