A partition cover approach to tokenization

Lim, Jia Peng; Tan, Shawn; Choo, Davin; Lauw, Hady W.

arXiv:2501.06246 (cs)

[Submitted on 8 Jan 2025 (v1), last revised 25 May 2025 (this version, v2)]

Abstract: Tokenization is the process of encoding strings into tokens of a fixed vocabulary size, and is widely utilized in Natural Language Processing applications. The leading tokenization algorithm today is Byte-Pair Encoding (BPE), which formulates the tokenization problem as a compression problem and tackles it by performing sequences of merges. In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm GreedTok. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem which has a simple $(1 - 1/e)$-approximation algorithm GreedWMC. Through empirical evaluations on real-world corpora, we show that GreedTok outperforms BPE and Unigram on compression and achieves a covering score comparable to GreedWMC. Finally, our extensive pre-training for two transformer-based language models with 1 billion parameters, comparing the choices of BPE and GreedTok as the tokenizer, shows that GreedTok achieves a lower bit per byte even when we control for either the total dataset proportion or total training tokens.
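
For readers unfamiliar with the weighted maximum coverage problem mentioned in the abstract, the sketch below shows the generic textbook greedy rule that achieves the $(1 - 1/e)$ approximation: repeatedly pick the set that adds the most uncovered weight. It is not the authors' GreedWMC or GreedTok code; the function name `greedy_weighted_max_coverage`, the set names, and the weights are all made up for illustration.

```python
from typing import Dict, Hashable, List, Set, Tuple


def greedy_weighted_max_coverage(
    sets: Dict[str, Set[Hashable]],
    weights: Dict[Hashable, float],
    k: int,
) -> Tuple[List[str], float]:
    """Choose at most k sets, each round taking the set whose
    still-uncovered elements have the largest total weight."""
    covered: Set[Hashable] = set()
    chosen: List[str] = []
    total = 0.0
    for _ in range(k):
        best_name, best_gain = None, 0.0
        for name, elems in sets.items():
            if name in chosen:
                continue
            # Marginal gain: weight of elements this set would newly cover.
            gain = sum(weights[e] for e in elems - covered)
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:  # no remaining set adds any weight
            break
        chosen.append(best_name)
        covered |= sets[best_name]
        total += best_gain
    return chosen, total


# Hypothetical toy instance (not data from the paper).
toy_sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}}
toy_weights = {1: 1.0, 2: 1.0, 3: 1.0, 4: 5.0, 5: 1.0, 6: 1.0}
chosen, total = greedy_weighted_max_coverage(toy_sets, toy_weights, k=2)
print(chosen, total)
```

On this toy instance the first round picks "c" (gain 7) and the second picks "a" (gain 3), for a covered weight of 10. How the paper maps candidate tokens and corpus text onto sets and weights is defined in the paper itself and is not reproduced here.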

Comments:	under review
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Cite as:	arXiv:2501.06246 [cs.CL]
	(or arXiv:2501.06246v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2501.06246

Submission history

From: Jia Peng Lim
[v1] Wed, 8 Jan 2025 17:07:07 UTC (1,025 KB)
[v2] Sun, 25 May 2025 15:39:07 UTC (2,007 KB)
