Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning

Zimmerman, Julia Witte; Hudon, Denis; Cramer, Kathryn; Ruiz, Alejandro J.; Beauregard, Calla; Fehr, Ashley; Fudolig, Mikaela Irene; Demarest, Bradford; Bird, Yoshi Meke; Trujillo, Milo Z.; Danforth, Christopher M.; Dodds, Peter Sheridan

arXiv:2412.10924 (cs)

[Submitted on 14 Dec 2024 (v1), last revised 13 Apr 2025 (this version, v4)]

Abstract: Tokenization is a necessary component within the current architecture of many language models, including the transformer-based large language models (LLMs) of Generative AI, yet its impact on the model's cognition is often overlooked. We argue that LLMs demonstrate that the Distributional Hypothesis (DH) is sufficient for reasonably human-like language performance, and that the emergence of human-meaningful linguistic units among tokens and current structural constraints motivate changes to existing, linguistically-agnostic tokenization techniques, particularly with respect to their roles as (1) semantic primitives and as (2) vehicles for conveying salient distributional patterns from human language to the model. We explore tokenizations from a BPE tokenizer; extant model vocabularies obtained from Hugging Face and tiktoken; and the information in exemplar token vectors as they move through the layers of a RoBERTa (large) model. Tokens and pretraining not only create sub-optimal semantic building blocks and obscure the model's access to the necessary distributional patterns; we describe how they can also act as a backdoor for bias and other unwanted content, which current alignment practices may not remediate. Additionally, we relay evidence that the tokenization algorithm's objective function impacts the LLM's cognition, despite being arguably meaningfully insulated from the main system intelligence. [First uploaded to arXiv in December, 2024.]

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2412.10924 [cs.CL]
	(or arXiv:2412.10924v4 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2412.10924

Submission history

From: Julia Zimmerman
[v1] Sat, 14 Dec 2024 18:18:52 UTC (31,944 KB)
[v2] Wed, 18 Dec 2024 16:16:04 UTC (31,947 KB)
[v3] Tue, 24 Dec 2024 17:56:50 UTC (31,947 KB)
[v4] Sun, 13 Apr 2025 16:17:45 UTC (31,948 KB)
