Empowering Vector Architectures for ML: The CAMP Architecture for Matrix Multiplication
Authors:
Mohammadreza Esmali Nojehdeh,
Hossein Mokhtarnia,
Julian Pavon Rivera,
Narcis Rodas Quiroga,
Roger Figueras Bagué,
Enrico Reggiani,
Miquel Moreto,
Osman Unsal,
Adrian Cristal,
Eduard Ayguade
Abstract:
This study presents the Cartesian Accumulative Matrix Pipeline (CAMP) architecture, a novel approach designed to enhance matrix multiplication in Vector Architectures (VAs) and Single Instruction Multiple Data (SIMD) units. CAMP improves the processing efficiency of Quantized Neural Networks (QNNs). Matrix multiplication is a cornerstone of machine learning applications, and its quantized versions…
▽ More
This study presents the Cartesian Accumulative Matrix Pipeline (CAMP) architecture, a novel approach designed to enhance matrix multiplication in Vector Architectures (VAs) and Single Instruction Multiple Data (SIMD) units. CAMP improves the processing efficiency of Quantized Neural Networks (QNNs). Matrix multiplication is a cornerstone of machine learning applications, and its quantized versions are increasingly popular for more efficient operations. Unfortunately, existing VAs and SIMD-support units struggle to efficiently handle these quantized formats. In this work, we propose CAMP, a simple yet effective architecture that leverages a hybrid multiplier. The CAMP architecture significantly advances the performance of vector architectures in handling quantized data, enabling more efficient execution of matrix multiplication across various platforms, specifically targeting the ARMv8 Scalable Vector Extension (SVE) and edge RISC-V SIMD-based architectures. In addition to increasing throughput, CAMP's architectural design also contributes to energy efficiency, making it an effective solution for low-power applications. Evaluations on a range of Large Language Models (LLMs) and Convolutional Neural Networks (CNNs) demonstrate that matrix multiplication operations using the proposed micro-architecture achieve up to 17$\times$ and 23$\times$ performance improvements compared to their respective baselines, the ARM A64FX core and a RISC-V-based edge System-on-Chip (SoC). Furthermore, synthesis and place-and-route (PnR) of the CAMP micro-architecture using Synopsys tools -- targeting ARM TSMC 7nm for A64FX and GlobalFoundries 22nm for the RISC-V SoC -- add only 1\% and 4\% area overhead, respectively, compared to the baseline designs.
△ Less
Submitted 10 April, 2025;
originally announced April 2025.
Efficient Hardware Realizations of Feedforward Artificial Neural Networks
Authors:
Mohammadreza Esmali Nojehdeh,
Sajjad Parvin,
Mustafa Altun
Abstract:
This article presents design techniques proposed for efficient hardware implementation of feedforward artificial neural networks (ANNs) under parallel and time-multiplexed architectures. To reduce their design complexity, after the weights of ANN are determined in a training phase, we introduce a technique to find the minimum quantization value used to convert the floating-point weight values to i…
▽ More
This article presents design techniques proposed for efficient hardware implementation of feedforward artificial neural networks (ANNs) under parallel and time-multiplexed architectures. To reduce their design complexity, after the weights of ANN are determined in a training phase, we introduce a technique to find the minimum quantization value used to convert the floating-point weight values to integers. For each design architecture, we also propose an algorithm that tunes the integer weights to reduce the hardware complexity avoiding a loss in the hardware accuracy. Furthermore, the multiplications of constant weights by input variables are implemented under the shift-adds architecture using the fewest number of addition/subtraction operations found by prominent previously proposed algorithms. Finally, we introduce a computer-aided design (CAD) tool, called SIMURG, that can describe an ANN design in hardware automatically based on the ANN structure and the solutions of proposed design techniques and algorithms. Experimental results indicate that the tuning techniques can significantly reduce the ANN hardware complexity under a design architecture and the multiplierless design of ANN can lead to a significant reduction in area and energy consumption, increasing the latency slightly.
△ Less
Submitted 4 August, 2021;
originally announced August 2021.