Scaling Laws for Autoregressive Generative Modeling

Henighan, Tom; Kaplan, Jared; Katz, Mor; Chen, Mark; Hesse, Christopher; Jackson, Jacob; Jun, Heewoo; Brown, Tom B.; Dhariwal, Prafulla; Gray, Scott; Hallacy, Chris; Mann, Benjamin; Radford, Alec; Ramesh, Aditya; Ryder, Nick; Ziegler, Daniel M.; Schulman, John; Amodei, Dario; McCandlish, Sam

arXiv:2010.14701 (cs)

[Submitted on 28 Oct 2020 (v1), last revised 6 Nov 2020 (this version, v2)]

Abstract: We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image$\leftrightarrow$text models, and mathematical problem solving. In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law. The optimal model size also depends on the compute budget through a power-law, with exponents that are nearly universal across all data domains.
The cross-entropy loss has an information theoretic interpretation as $S(\mathrm{True}) + D_{\mathrm{KL}}(\mathrm{True}\,\|\,\mathrm{Model})$, and the empirical scaling laws suggest a prediction for both the true data distribution's entropy and the KL divergence between the true and model distributions. With this interpretation, billion-parameter Transformers are nearly perfect models of the YFCC100M image distribution downsampled to an $8\times 8$ resolution, and we can forecast the model size needed to achieve any given reducible loss (i.e., $D_{\mathrm{KL}}$) in nats/image for other resolutions.
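For reference, the decomposition mentioned above is the standard identity for cross-entropy. Writing $P$ for the true distribution and $Q$ for the model distribution (symbols chosen here for illustration, not taken from the abstract), it can be sketched as:

$$
\begin{aligned}
H(P, Q) &= -\sum_x P(x)\,\log Q(x) \\
        &= -\sum_x P(x)\,\log P(x) + \sum_x P(x)\,\log\frac{P(x)}{Q(x)} \\
        &= S(P) + D_{\mathrm{KL}}(P\,\|\,Q).
\end{aligned}
$$

Since $D_{\mathrm{KL}} \ge 0$, the entropy $S(P)$ is a floor on the loss that no model can go below. Under this reading, the constant in a "power-law plus constant" fit would estimate $S(P)$ and the power-law term would estimate $D_{\mathrm{KL}}$.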
We find a number of additional scaling laws in specific domains: (a) we identify a scaling relation for the mutual information between captions and images in multimodal models, and show how to answer the question "Is a picture worth a thousand words?"; (b) in the case of mathematical problem solving, we identify scaling laws for model performance when extrapolating beyond the training distribution; (c) we finetune generative image models for ImageNet classification and find smooth scaling of the classification loss and error rate, even as the generative loss levels off. Taken together, these results strengthen the case that scaling laws have important implications for neural network performance, including on downstream tasks.
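As a rough illustration of the "power-law plus constant" form the abstract describes, the sketch below assumes a loss of the shape L(x) = L_inf + (x0 / x)^alpha, where x is a scale variable such as model size. The function names and all numeric values are hypothetical and chosen only to make the arithmetic easy to follow; they are not fitted values from the paper. The last function inverts the reducible term to estimate the scale needed for a target reducible loss, which is the kind of forecast the abstract mentions.

```python
def scaling_loss(x, l_inf, x0, alpha):
    """Assumed power-law plus constant form: irreducible loss + reducible term."""
    return l_inf + (x0 / x) ** alpha


def reducible_loss(x, x0, alpha):
    """Reducible part only, read in the abstract as the KL divergence."""
    return (x0 / x) ** alpha


def size_for_reducible_loss(target, x0, alpha):
    """Invert (x0 / x)**alpha = target to get the scale x that reaches it."""
    if target <= 0:
        raise ValueError("target reducible loss must be positive")
    return x0 * target ** (-1.0 / alpha)


# Hypothetical illustrative constants, NOT results from the paper.
L_INF = 2.0    # assumed irreducible loss (entropy estimate)
X0 = 1.0e4     # assumed scale constant
ALPHA = 0.25   # assumed power-law exponent

needed = size_for_reducible_loss(0.1, X0, ALPHA)
print(f"scale needed for reducible loss 0.1: {needed:.3e}")
print(f"total loss at that scale: {scaling_loss(needed, L_INF, X0, ALPHA):.4f}")
```

With a small exponent such as the one assumed here, each fixed reduction in reducible loss requires a multiplicative increase in scale, which is why forecasts of this kind span many orders of magnitude.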

Comments:	20+17 pages, 33 figures; added appendix with additional language results
Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2010.14701 [cs.LG]
	(or arXiv:2010.14701v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2010.14701

Submission history

From: Samuel McCandlish
[v1] Wed, 28 Oct 2020 02:17:24 UTC (2,445 KB)
[v2] Fri, 6 Nov 2020 04:16:36 UTC (2,886 KB)
