An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks

Dehghankar, Mohsen; Erfanian, Mahdi; Asudeh, Abolfazl

arXiv:2411.06360 (cs)

[Submitted on 10 Nov 2024 (v1), last revised 2 May 2025 (this version, v3)]

Abstract: Despite their tremendous success and versatility, Deep Neural Networks (DNNs) such as Large Language Models (LLMs) suffer from inference inefficiency and rely on advanced computational infrastructure. To address these challenges and make these models more accessible and cost-effective, we propose algorithms to improve the inference time and memory efficiency of DNNs with binary and ternary weight matrices. Focusing on matrix multiplication as the bottleneck operation of inference, we observe that, once trained, the weight matrices of a model no longer change. This allows us to preprocess these matrices and create indices that help reduce the storage requirements by a logarithmic factor while enabling our efficient inference algorithms. Specifically, for an $n\times n$ weight matrix, our efficient algorithm guarantees a time complexity of $O(\frac{n^2}{\log n})$, a logarithmic factor improvement over the standard vector-matrix multiplication. Besides theoretical analysis, we conduct extensive experiments to evaluate the practical efficiency of our algorithms. Our results confirm the superiority of our approach with respect to both time and memory: we observed a reduction in multiplication time of up to 29x and in memory usage of up to 6x. When applied to LLMs, our experiments show up to a 5.24x speedup in inference time.
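
To make the preprocessing idea concrete, the following is a minimal Python sketch of one well-known way to trade a one-time index over a fixed binary matrix for faster vector-matrix products. It is an illustration under stated assumptions, not the authors' algorithm: the function names (`choose_block_width`, `preprocess`, `multiply`, `naive_multiply`), the list-of-lists matrix layout, and the block-width heuristic are all hypothetical. The matrix columns are split into blocks of width k; for each block, every row is reduced to a k-bit pattern once. At inference time, entries of the input vector are first summed per pattern, and each column then collects the sums of the patterns that contain it. With k close to log n minus log log n, this costs roughly n additions per block across about n / k blocks.

```python
import math


def choose_block_width(n):
    """Heuristic block width of about log2(n) - log2(log2(n)).

    Assumption for this sketch only; not a setting taken from the paper.
    """
    if n < 4:
        return 1
    log_n = math.log2(n)
    return max(1, int(log_n - math.log2(log_n)))


def preprocess(W, k):
    """Index a binary matrix W (list of rows of 0/1) by column blocks of width k.

    For each block, every row is encoded as the integer whose bits are that
    row's entries inside the block. This runs once, since W is fixed.
    """
    n_cols = len(W[0])
    index = []
    for start in range(0, n_cols, k):
        width = min(k, n_cols - start)
        codes = []
        for row in W:
            code = 0
            for j in range(width):
                code |= row[start + j] << j
            codes.append(code)
        index.append((start, width, codes))
    return index


def multiply(v, index, n_cols):
    """Compute v @ W using only the index built by preprocess."""
    out = [0] * n_cols
    for start, width, codes in index:
        # Step 1: sum the entries of v that share the same row pattern.
        buckets = [0] * (1 << width)
        for value, code in zip(v, codes):
            buckets[code] += value
        # Step 2: column j receives every bucket whose pattern has bit j set.
        for code, total in enumerate(buckets):
            for j in range(width):
                if (code >> j) & 1:
                    out[start + j] += total
    return out


def naive_multiply(v, W):
    """Reference O(n^2) vector-matrix product, used to check the sketch."""
    return [sum(v[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]


W = [[1, 0],
     [1, 1]]
v = [2, 3]
print(multiply(v, preprocess(W, choose_block_width(2)), 2))  # [5, 3]
```

A ternary matrix with entries in {-1, 0, 1} could be handled in the same sketch by writing it as the difference of two binary matrices (one marking the +1 entries, one marking the -1 entries) and subtracting the two products.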

Comments:	Accepted at ICML 2025
Subjects:	Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Cite as:	arXiv:2411.06360 [cs.LG]
	(or arXiv:2411.06360v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2411.06360

Submission history

From: Mohsen Dehghankar
[v1] Sun, 10 Nov 2024 04:56:14 UTC (1,457 KB)
[v2] Sun, 29 Dec 2024 18:43:04 UTC (1,464 KB)
[v3] Fri, 2 May 2025 16:58:10 UTC (1,085 KB)
