Search | arXiv e-print repository

Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs

Authors: Shivam Aggarwal, Hans Jakob Damsgaard, Alessandro Pappalardo, Giuseppe Franco, Thomas B. Preußer, Michaela Blott, Tulika Mitra

Abstract: Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point formats(FP8) in the context of PTQ for model inference. However, floating-point formats smaller than 8 bits and their relative comparison in terms of accuracy-hardware c… ▽ More Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point formats(FP8) in the context of PTQ for model inference. However, floating-point formats smaller than 8 bits and their relative comparison in terms of accuracy-hardware cost with integers remains unexplored on FPGAs. In this work, we present minifloats, which are reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model while approaching full-precision model accuracy. We implement a custom FPGA-based multiply-accumulate operator library and explore the vast design space, comparing minifloat and integer representations across 3 to 8 bits for both weights and activations. We also examine the applicability of various integerbased quantization techniques to minifloats. Our experiments show that minifloats offer a promising alternative for emerging workloads such as vision transformers. △ Less

Submitted 5 July, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

Comments: Accepted in FPL (International Conference on Field-Programmable Logic and Applications) 2024 conference. Revised with updated results

arXiv:2102.03420 [pdf]

doi 10.1109/MC.2020.3029975

Understanding and Fixing Complex Faults in Embedded Cyberphysical Systems

Authors: Alexander Weiss, Smitha Gautham, Athira Varma Jayakumar, Carl Elks, D. Richard Kuhn, Raghu N. Kacker, Thomas B. Preusser

Abstract: Understanding fault types can lead to novel approaches to debugging and runtime verification. Dealing with complex faults, particularly in the challenging area of embedded systems, craves for more powerful tools, which are now becoming available to engineers. Understanding fault types can lead to novel approaches to debugging and runtime verification. Dealing with complex faults, particularly in the challenging area of embedded systems, craves for more powerful tools, which are now becoming available to engineers. △ Less

Submitted 5 February, 2021; originally announced February 2021.

arXiv:2010.05894 [pdf, other]

MicroRec: Efficient Recommendation Inference by Hardware and Data Structure Solutions

Authors: Wenqi Jiang, Zhenhao He, Shuai Zhang, Thomas B. Preußer, Kai Zeng, Liang Feng, Jiansong Zhang, Tongxuan Liu, Yong Li, Jingren Zhou, Ce Zhang, Gustavo Alonso

Abstract: Deep neural networks are widely used in personalized recommendation systems. Unlike regular DNN inference workloads, recommendation inference is memory-bound due to the many random memory accesses needed to lookup the embedding tables. The inference is also heavily constrained in terms of latency because producing a recommendation for a user must be done in about tens of milliseconds. In this pape… ▽ More Deep neural networks are widely used in personalized recommendation systems. Unlike regular DNN inference workloads, recommendation inference is memory-bound due to the many random memory accesses needed to lookup the embedding tables. The inference is also heavily constrained in terms of latency because producing a recommendation for a user must be done in about tens of milliseconds. In this paper, we propose MicroRec, a high-performance inference engine for recommendation systems. MicroRec accelerates recommendation inference by (1) redesigning the data structures involved in the embeddings to reduce the number of lookups needed and (2) taking advantage of the availability of High-Bandwidth Memory (HBM) in FPGA accelerators to tackle the latency by enabling parallel lookups. We have implemented the resulting design on an FPGA board including the embedding lookup step as well as the complete inference process. Compared to the optimized CPU baseline (16 vCPU, AVX2-enabled), MicroRec achieves 13.8~14.7x speedup on embedding lookup alone and 2.5$~5.4x speedup for the entire recommendation inference in terms of throughput. As for latency, CPU-based engines needs milliseconds for inferring a recommendation while MicroRec only takes microseconds, a significant advantage in real-time recommendation systems. △ Less

Submitted 19 February, 2021; v1 submitted 12 October, 2020; originally announced October 2020.

Comments: Accepted by MLSys'21 (the 4th Conference on Machine Learning and Systems)

arXiv:2005.13332 [pdf, other]

HyperLogLog Sketch Acceleration on FPGA

Authors: Amit Kulkarni, Monica Chiosa, Thomas B. Preußer, Kaan Kara, David Sidler, Gustavo Alonso

Abstract: Data sketches are a set of widely used approximated data summarizing techniques. Their fundamental property is sub-linear memory complexity on the input cardinality, an important aspect when processing streams or data sets with a vast base domain (URLs, IP addresses, user IDs, etc.). Among the many data sketches available, HyperLogLog has become the reference for cardinality counting (how many dis… ▽ More Data sketches are a set of widely used approximated data summarizing techniques. Their fundamental property is sub-linear memory complexity on the input cardinality, an important aspect when processing streams or data sets with a vast base domain (URLs, IP addresses, user IDs, etc.). Among the many data sketches available, HyperLogLog has become the reference for cardinality counting (how many distinct data items there are in a data set). Although it does not count every data item (to reduce memory consumption), it provides probabilistic guarantees on the result, and it is, thus, often used to analyze data streams. In this paper, we explore how to implement HyperLogLog on an FPGA to benefit from the parallelism available and the ability to process data streams coming from high-speed networks. Our multi-pipelined high-cardinality HyperLogLog implementation delivers 1.8x higher throughput than an optimized HyperLogLog running on a dual-socket Intel Xeon E5-2630 v3 system with a total of 16 cores and 32 hyper-threads. △ Less

Submitted 20 October, 2020; v1 submitted 24 May, 2020; originally announced May 2020.

Comments: This paper was accepted as a full paper to FPL 2020. The latest/full version of this paper is available: https://ieeexplore.ieee.org/document/9221525

arXiv:2004.11080 [pdf, ps, other]

Using DSP Slices as Content-Addressable Update Queues

Authors: Thomas B. Preußer, Monica Chiosa, Alexander Weiss, Gustavo Alonso

Abstract: Content-Addressable Memory (CAM) is a powerful abstraction for building memory caches, routing tables and hazard detection logic. Without a native CAM structure available on FPGA devices, their functionality must be emulated using the structural primitives at hand. Such an emulation causes significant overhead in the consumption of the underlying resources, typically general-purpose fabric and on-… ▽ More Content-Addressable Memory (CAM) is a powerful abstraction for building memory caches, routing tables and hazard detection logic. Without a native CAM structure available on FPGA devices, their functionality must be emulated using the structural primitives at hand. Such an emulation causes significant overhead in the consumption of the underlying resources, typically general-purpose fabric and on-chip block RAM (BRAM). This often motivates mitigating trade-offs, such as the reduction of the associativity of memory caches. This paper describes a technique to implement the hazard resolution in a memory update queue that hides the off-chip memory readout latency of read-modify-write cycles while guaranteeing the delivery of the full memory bandwidth. The innovative use of DSP slices allows them to assume and combine the functions of (a) the tag and data storage, (b) the tag matching, and (c) the data update in this key-value storage scenario. The proposed approach provides designers with extra flexibility by adding this resource type as another option to implement CAM. △ Less

Submitted 23 April, 2020; originally announced April 2020.

Comments: Submitted to FPL 2020

arXiv:1901.00370 [pdf, other]

doi 10.1145/3337929

Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing

Authors: Yaman Umuroglu, Davide Conficconi, Lahiru Rasnayake, Thomas B. Preusser, Magnus Sjalander

Abstract: Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lends itself well to high-performance implementations. Many matrix multiplication-dependent applications can use reduced-precision integer or fixed-point representations to increase their performance and energy efficiency while still offerin… ▽ More Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lends itself well to high-performance implementations. Many matrix multiplication-dependent applications can use reduced-precision integer or fixed-point representations to increase their performance and energy efficiency while still offering adequate quality of results. However, precision requirements may vary between different application phases or depend on input data, rendering constant-precision solutions ineffective. BISMO, a vectorized bit-serial matrix multiplication overlay for reconfigurable computing, previously utilized the excellent binary-operation performance of FPGAs to offer a matrix multiplication performance that scales with required precision and parallelism. We show how BISMO can be scaled up on Xilinx FPGAs using an arithmetic architecture that better utilizes 6-LUTs. The improved BISMO achieves a peak performance of 15.4 binary TOPS on the Ultra96 board with a Xilinx UltraScale+ MPSoC. △ Less

Submitted 11 June, 2019; v1 submitted 2 January, 2019; originally announced January 2019.

Comments: Invited paper at ACM TRETS as extension of FPL'18 paper arXiv:1806.08862

arXiv:1807.03123 [pdf, other]

Scaling Neural Network Performance through Customized Hardware Architectures on Reconfigurable Logic

Authors: Michaela Blott, Thomas B. Preusser, Nicholas Fraser, Giulio Gambardella, Kenneth OBrien, Yaman Umuroglu, Miriam Leeser

Abstract: Convolutional Neural Networks have dramatically improved in recent years, surpassing human accuracy on certain problems and performance exceeding that of traditional computer vision algorithms. While the compute pattern in itself is relatively simple, significant compute and memory challenges remain as CNNs may contain millions of floating-point parameters and require billions of floating-point op… ▽ More Convolutional Neural Networks have dramatically improved in recent years, surpassing human accuracy on certain problems and performance exceeding that of traditional computer vision algorithms. While the compute pattern in itself is relatively simple, significant compute and memory challenges remain as CNNs may contain millions of floating-point parameters and require billions of floating-point operations to process a single image. These computational requirements, combined with storage footprints that exceed typical cache sizes, pose a significant performance and power challenge for modern compute architectures. One of the promising opportunities to scale performance and power efficiency is leveraging reduced precision representations for all activations and weights as this allows to scale compute capabilities, reduce weight and feature map buffering requirements as well as energy consumption. While a small reduction in accuracy is encountered, these Quantized Neural Networks have been shown to achieve state-of-the-art accuracy on standard benchmark datasets, such as MNIST, CIFAR-10, SVHN and even ImageNet, and thus provide highly attractive design trade-offs. Current research has focused mainly on the implementation of extreme variants with full binarization of weights and or activations, as well typically smaller input images. Within this paper, we investigate the scalability of dataflow architectures with respect to supporting various precisions for both weights and activations, larger image dimensions, and increasing numbers of feature map channels. Key contributions are a formalized approach to understanding the scalability of the existing hardware architecture with cost models and a performance prediction as a function of the target device size. We provide validating experimental results for an ImageNet classification on a server-class platform, namely the AWS F1 node. △ Less

Submitted 26 June, 2018; originally announced July 2018.

arXiv:1806.08095 [pdf, ps, other]

doi 10.23919/FPL.2017.8056834

Generic and Universal Parallel Matrix Summation with a Flexible Compression Goal for Xilinx FPGAs

Authors: Thomas B. Preußer

Abstract: Bit matrix compression is a highly relevant operation in computer arithmetic. Essentially being a multi-operand addition, it is the key operation behind fast multiplication and many higher-level operations such as multiply-accumulate, the computation of the dot product or the implementation of FIR filters. Compressor implementations have been constantly evolving for greater efficiency both in gene… ▽ More Bit matrix compression is a highly relevant operation in computer arithmetic. Essentially being a multi-operand addition, it is the key operation behind fast multiplication and many higher-level operations such as multiply-accumulate, the computation of the dot product or the implementation of FIR filters. Compressor implementations have been constantly evolving for greater efficiency both in general and in the context of concrete applications or specific implementation technologies. This paper is building on this history and describes a generic implementation of a bit matrix compressor for Xilinx FPGAs, which does not require a generator tool. It contributes FPGA-oriented metrics for the evaluation of elementary parallel bit counters, a systematic analysis and partial decomposition of previously proposed counters and a fully implemented construction heuristic with a flexible compression target matching the device capabilities. The generic implementation is agnostic of the aspect ratio of the input matrix and can be used for multiplication the same way as it can be for single-column population count operations. △ Less

Submitted 21 June, 2018; originally announced June 2018.

arXiv:1806.08085 [pdf, other]

doi 10.23919/DATE.2018.8342121

Inference of Quantized Neural Networks on Heterogeneous All-Programmable Devices

Authors: Thomas B. Preußer, Giulio Gambardella, Nicholas Fraser, Michaela Blott

Abstract: Neural networks have established as a generic and powerful means to approach challenging problems such as image classification, object detection or decision making. Their successful employment foots on an enormous demand of compute. The quantization of network parameters and the processed data has proven a valuable measure to reduce the challenges of network inference so effectively that the feasi… ▽ More Neural networks have established as a generic and powerful means to approach challenging problems such as image classification, object detection or decision making. Their successful employment foots on an enormous demand of compute. The quantization of network parameters and the processed data has proven a valuable measure to reduce the challenges of network inference so effectively that the feasible scope of applications is expanded even into the embedded domain. This paper describes the making of a real-time object detection in a live video stream processed on an embedded all-programmable device. The presented case illustrates how the required processing is tamed and parallelized across both the CPU cores and the programmable logic and how the most suitable resources and powerful extensions, such as NEON vectorization, are leveraged for the individual processing steps. The crafted result is an extended Darknet framework implementing a fully integrated, end-to-end solution from video capture over object annotation to video output applying neural network inference at different quantization levels running at 16~frames per second on an embedded Zynq UltraScale+ (XCZU3EG) platform. △ Less

Submitted 21 June, 2018; originally announced June 2018.

arXiv:1801.02075 [pdf, ps, other]

QBM - Mapping User-Specified Functions to Programmable Logic through a QBF Satisfiability Problem

Authors: Thomas B. Preußer

Abstract: This is a brief overview on the background behind the test set formulas generated by the QBM tool. After establishing its application context, its formal approach to the generation of QBF formulas and the concrete test set formulas are described. Finally, some related work will be credited and the source to obtain the open-source tool will be identified. This is a brief overview on the background behind the test set formulas generated by the QBM tool. After establishing its application context, its formal approach to the generation of QBF formulas and the concrete test set formulas are described. Finally, some related work will be credited and the source to obtain the open-source tool will be identified. △ Less

Submitted 6 January, 2018; originally announced January 2018.

Comments: Instance in Prenex CNF Track of QBFEVAL'17 competition: http://www.qbflib.org/family_detail.php?idFamily=775

Showing 1–10 of 10 results for author: Preußer, T B