An Information Theory of Compute-Optimal Size Scaling, Emergence, and Plateaus in Language Models

Nayak, Anuj K.; Varshney, Lav R.

Computer Science > Information Theory

arXiv:2410.01243 (cs)

[Submitted on 2 Oct 2024 (v1), last revised 15 Oct 2024 (this version, v2)]

Abstract: Recent empirical studies show three phenomena with increasing size of language models: compute-optimal size scaling, emergent capabilities, and performance plateauing. We present a simple unified mathematical framework to explain all of these language model scaling phenomena, building on recent skill-text bipartite graph frameworks for semantic learning. Modeling the learning of concepts from texts as an iterative process yields an analogy to iterative decoding of low-density parity check (LDPC) codes in information theory. Thence, drawing on finite-size scaling characterizations of LDPC decoding, we derive the compute-optimal size scaling (Chinchilla rule) for language models. Further, using tools from random network theory, we provide a simple explanation for both emergence of complex skills and plateauing of performance as the size of language models scales. We see multiple plateaus.

Comments:	14 pages, 5 figures
Subjects:	Information Theory (cs.IT)
Cite as:	arXiv:2410.01243 [cs.IT]
	(or arXiv:2410.01243v2 [cs.IT] for this version)
	https://doi.org/10.48550/arXiv.2410.01243

Submission history

From: Anuj Keshava Nayak
[v1] Wed, 2 Oct 2024 05:12:13 UTC (673 KB)
[v2] Tue, 15 Oct 2024 22:09:24 UTC (714 KB)
