-
Empowering Vector Architectures for ML: The CAMP Architecture for Matrix Multiplication
Authors:
Mohammadreza Esmali Nojehdeh,
Hossein Mokhtarnia,
Julian Pavon Rivera,
Narcis Rodas Quiroga,
Roger Figueras Bagué,
Enrico Reggiani,
Miquel Moreto,
Osman Unsal,
Adrian Cristal,
Eduard Ayguade
Abstract:
This study presents the Cartesian Accumulative Matrix Pipeline (CAMP) architecture, a novel approach designed to enhance matrix multiplication in Vector Architectures (VAs) and Single Instruction Multiple Data (SIMD) units. CAMP improves the processing efficiency of Quantized Neural Networks (QNNs). Matrix multiplication is a cornerstone of machine learning applications, and its quantized versions…
▽ More
This study presents the Cartesian Accumulative Matrix Pipeline (CAMP) architecture, a novel approach designed to enhance matrix multiplication in Vector Architectures (VAs) and Single Instruction Multiple Data (SIMD) units. CAMP improves the processing efficiency of Quantized Neural Networks (QNNs). Matrix multiplication is a cornerstone of machine learning applications, and its quantized versions are increasingly popular for more efficient operations. Unfortunately, existing VAs and SIMD-support units struggle to efficiently handle these quantized formats. In this work, we propose CAMP, a simple yet effective architecture that leverages a hybrid multiplier. The CAMP architecture significantly advances the performance of vector architectures in handling quantized data, enabling more efficient execution of matrix multiplication across various platforms, specifically targeting the ARMv8 Scalable Vector Extension (SVE) and edge RISC-V SIMD-based architectures. In addition to increasing throughput, CAMP's architectural design also contributes to energy efficiency, making it an effective solution for low-power applications. Evaluations on a range of Large Language Models (LLMs) and Convolutional Neural Networks (CNNs) demonstrate that matrix multiplication operations using the proposed micro-architecture achieve up to 17$\times$ and 23$\times$ performance improvements compared to their respective baselines, the ARM A64FX core and a RISC-V-based edge System-on-Chip (SoC). Furthermore, synthesis and place-and-route (PnR) of the CAMP micro-architecture using Synopsys tools -- targeting ARM TSMC 7nm for A64FX and GlobalFoundries 22nm for the RISC-V SoC -- add only 1\% and 4\% area overhead, respectively, compared to the baseline designs.
△ Less
Submitted 10 April, 2025;
originally announced April 2025.
-
Flex-SFU: Accelerating DNN Activation Functions by Non-Uniform Piecewise Approximation
Authors:
Enrico Reggiani,
Renzo Andri,
Lukas Cavigelli
Abstract:
Modern DNN workloads increasingly rely on activation functions consisting of computationally complex operations. This poses a challenge to current accelerators optimized for convolutions and matrix-matrix multiplications. This work presents Flex-SFU, a lightweight hardware accelerator for activation functions implementing non-uniform piecewise interpolation supporting multiple data formats. Non-Un…
▽ More
Modern DNN workloads increasingly rely on activation functions consisting of computationally complex operations. This poses a challenge to current accelerators optimized for convolutions and matrix-matrix multiplications. This work presents Flex-SFU, a lightweight hardware accelerator for activation functions implementing non-uniform piecewise interpolation supporting multiple data formats. Non-Uniform segments and floating-point numbers are enabled by implementing a binary-tree comparison within the address decoding unit. An SGD-based optimization algorithm with heuristics is proposed to find the interpolation function reducing the mean squared error. Thanks to non-uniform interpolation and floating-point support, Flex-SFU achieves on average 22.3x better mean squared error compared to previous piecewise linear interpolation approaches. The evaluation with more than 700 computer vision and natural language processing models shows that Flex-SFU can, on average, improve the end-to-end performance of state-of-the-art AI hardware accelerators by 35.7%, achieving up to 3.3x speedup with negligible impact in the models' accuracy when using 32 segments, and only introducing an area and power overhead of 5.9% and 0.8% relative to the baseline vector processing unit.
△ Less
Submitted 8 May, 2023;
originally announced May 2023.
-
Adaptable Register File Organization for Vector Processors
Authors:
Cristóbal Ramírez Lazo,
Enrico Reggiani,
Carlos Rojas Morales,
Roger Figueras Bagué,
Luis Alfonso Villa Vargas,
Marco Antonio Ramírez Salinas,
Mateo Valero Cortés,
Osman Sabri Unsal,
Adrián Cristal
Abstract:
Modern scientific applications are getting more diverse, and the vector lengths in those applications vary widely. Contemporary Vector Processors (VPs) are designed either for short vector lengths, e.g., Fujitsu A64FX with 512-bit ARM SVE vector support, or long vectors, e.g., NEC Aurora Tsubasa with 16Kbits Maximum Vector Length (MVL). Unfortunately, both approaches have drawbacks. On the one han…
▽ More
Modern scientific applications are getting more diverse, and the vector lengths in those applications vary widely. Contemporary Vector Processors (VPs) are designed either for short vector lengths, e.g., Fujitsu A64FX with 512-bit ARM SVE vector support, or long vectors, e.g., NEC Aurora Tsubasa with 16Kbits Maximum Vector Length (MVL). Unfortunately, both approaches have drawbacks. On the one hand, short vector length VP designs struggle to provide high efficiency for applications featuring long vectors with high Data Level Parallelism (DLP). On the other hand, long vector VP designs waste resources and underutilize the Vector Register File (VRF) when executing low DLP applications with short vector lengths. Therefore, those long vector VP implementations are limited to a specialized subset of applications, where relatively high DLP must be present to achieve excellent performance with high efficiency. To overcome these limitations, we propose an Adaptable Vector Architecture (AVA) that leads to having the best of both worlds. AVA is designed for short vectors (MVL=16 elements) and is thus area and energy-efficient. However, AVA has the functionality to reconfigure the MVL, thereby allowing to exploit the benefits of having a longer vector (up to 128 elements) microarchitecture when abundant DLP is present. We model AVA on the gem5 simulator and evaluate the performance with six applications taken from the RiVEC Benchmark Suite. To obtain area and power consumption metrics, we model AVA on McPAT for 22nm technology. Our results show that by reconfiguring our small VRF (8KB) plus our novel issue queue scheme, AVA yields a 2X speedup over the default configuration for short vectors. Additionally, AVA shows competitive performance when compared to a long vector VP, while saving 50% of area.
△ Less
Submitted 29 May, 2022; v1 submitted 9 November, 2021;
originally announced November 2021.