Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale

arXiv:2105.12676 (cs)

[Submitted on 26 May 2021]

Authors: Zhaoxia (Summer) Deng, Jongsoo Park, Ping Tak Peter Tang, Haixin Liu, Jie (Amy) Yang, Hector Yuen, Jianyu Huang, Daya Khudia, Xiaohan Wei, Ellie Wen, Dhruv Choudhary, Raghuraman Krishnamoorthi, Carole-Jean Wu, Satish Nadathur, Changkyu Kim, Maxim Naumov, Sam Naghshineh, Mikhail Smelyanskiy

Abstract: The tremendous success of machine learning (ML) and the unabated growth in ML model complexity have motivated many ML-specific designs in both CPU and accelerator architectures to speed up model inference. While these architectures are diverse, highly optimized low-precision arithmetic is a component shared by most. Impressive compute throughputs are indeed often exhibited by these architectures on benchmark ML models. Nevertheless, production models such as the recommendation systems important to Facebook's personalization services are demanding and complex: these systems must serve billions of users per month responsively with low latency while maintaining high prediction accuracy, notwithstanding computations with many tens of billions of parameters per inference. Do these low-precision architectures work well with our production recommendation systems? They do. But not without significant effort. We share in this paper our search strategies to adapt reference recommendation models to low-precision hardware, our optimization of low-precision compute kernels, and the design and development of a tool chain to maintain our models' accuracy throughout their lifespan, during which topic trends and users' interests inevitably evolve. Practicing these low-precision technologies helped us save datacenter capacity while deploying models with up to 5X complexity that would otherwise not be deployed on traditional general-purpose CPUs. We believe these lessons from the trenches promote better co-design between hardware architecture and software engineering and advance the state of the art of ML in industry.

Subjects:	Machine Learning (cs.LG); Hardware Architecture (cs.AR); Information Retrieval (cs.IR); Performance (cs.PF); Numerical Analysis (math.NA)
Cite as:	arXiv:2105.12676 [cs.LG]
	(or arXiv:2105.12676v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2105.12676

Submission history

From: Zhaoxia Deng
[v1] Wed, 26 May 2021 16:42:33 UTC (1,570 KB)
