Search | arXiv e-print repository

ITERA-LLM: Boosting Sub-8-Bit Large Language Model Inference via Iterative Tensor Decomposition

Authors: Keran Zheng, Yinting Huang, Zhewen Yu, Christos-Savvas Bouganis

Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated impressive capabilities as their scale expands to billions of parameters. Deploying these large-scale models on resource-constrained platforms presents significant challenges, with post-training fixed-point quantization often used as a model compression technique. However, quantization-only methods typically lead to significant… ▽ More Recent advancements in Large Language Models (LLMs) have demonstrated impressive capabilities as their scale expands to billions of parameters. Deploying these large-scale models on resource-constrained platforms presents significant challenges, with post-training fixed-point quantization often used as a model compression technique. However, quantization-only methods typically lead to significant accuracy degradation in LLMs when precision falls below 8 bits. This paper addresses this challenge through a software-hardware co-design framework, ITERA-LLM, which integrates sub-8-bit quantization with SVD-based iterative low-rank tensor decomposition for error compensation, leading to higher compression ratios and reduced computational complexity. The proposed approach is complemented by a hardware-aware Design Space Exploration (DSE) process that optimizes accuracy, latency, and resource utilization, tailoring the configuration to the specific requirements of the targeted LLM. Our results show that ITERA-LLM achieves linear layer latency reduction of up to 41.1%, compared to quantization-only baseline approach while maintaining similar model accuracy. △ Less

Submitted 13 May, 2025; originally announced May 2025.

arXiv:2502.04923 [pdf, other]

Cached Multi-Lora Composition for Multi-Concept Image Generation

Authors: Xiandong Zou, Mingzhu Shen, Christos-Savvas Bouganis, Yiren Zhao

Abstract: Low-Rank Adaptation (LoRA) has emerged as a widely adopted technique in text-to-image models, enabling precise rendering of multiple distinct elements, such as characters and styles, in multi-concept image generation. However, current approaches face significant challenges when composing these LoRAs for multi-concept image generation, resulting in diminished generated image quality. In this paper,… ▽ More Low-Rank Adaptation (LoRA) has emerged as a widely adopted technique in text-to-image models, enabling precise rendering of multiple distinct elements, such as characters and styles, in multi-concept image generation. However, current approaches face significant challenges when composing these LoRAs for multi-concept image generation, resulting in diminished generated image quality. In this paper, we initially investigate the role of LoRAs in the denoising process through the lens of the Fourier frequency domain. Based on the hypothesis that applying multiple LoRAs could lead to "semantic conflicts", we find that certain LoRAs amplify high-frequency features such as edges and textures, whereas others mainly focus on low-frequency elements, including the overall structure and smooth color gradients. Building on these insights, we devise a frequency domain based sequencing strategy to determine the optimal order in which LoRAs should be integrated during inference. This strategy offers a methodical and generalizable solution compared to the naive integration commonly found in existing LoRA fusion techniques. To fully leverage our proposed LoRA order sequence determination method in multi-LoRA composition tasks, we introduce a novel, training-free framework, Cached Multi-LoRA (CMLoRA), designed to efficiently integrate multiple LoRAs while maintaining cohesive image generation. With its flexible backbone for multi-LoRA fusion and a non-uniform caching strategy tailored to individual LoRAs, CMLoRA has the potential to reduce semantic conflicts in LoRA composition and improve computational efficiency. Our experimental evaluations demonstrate that CMLoRA outperforms state-of-the-art training-free LoRA fusion methods by a significant margin -- it achieves an average improvement of $2.19\%$ in CLIPScore, and $11.25\%$ in MLLM win rate compared to LoraHub, LoRA Composite, and LoRA Switch. △ Less

Submitted 7 February, 2025; originally announced February 2025.

Comments: The Thirteenth International Conference on Learning Representations (ICLR 2025)

arXiv:2406.03088 [pdf, other]

HASS: Hardware-Aware Sparsity Search for Dataflow DNN Accelerator

Authors: Zhewen Yu, Sudarshan Sreeram, Krish Agrawal, Junyi Wu, Alexander Montgomerie-Corcoran, Cheng Zhang, Jianyi Cheng, Christos-Savvas Bouganis, Yiren Zhao

Abstract: Deep Neural Networks (DNNs) excel in learning hierarchical representations from raw data, such as images, audio, and text. To compute these DNN models with high performance and energy efficiency, these models are usually deployed onto customized hardware accelerators. Among various accelerator designs, dataflow architecture has shown promising performance due to its layer-pipelined structure and i… ▽ More Deep Neural Networks (DNNs) excel in learning hierarchical representations from raw data, such as images, audio, and text. To compute these DNN models with high performance and energy efficiency, these models are usually deployed onto customized hardware accelerators. Among various accelerator designs, dataflow architecture has shown promising performance due to its layer-pipelined structure and its scalability in data parallelism. Exploiting weights and activations sparsity can further enhance memory storage and computation efficiency. However, existing approaches focus on exploiting sparsity in non-dataflow accelerators, which cannot be applied onto dataflow accelerators because of the large hardware design space introduced. As such, this could miss opportunities to find an optimal combination of sparsity features and hardware designs. In this paper, we propose a novel approach to exploit unstructured weights and activations sparsity for dataflow accelerators, using software and hardware co-optimization. We propose a Hardware-Aware Sparsity Search (HASS) to systematically determine an efficient sparsity solution for dataflow accelerators. Over a set of models, we achieve an efficiency improvement ranging from 1.3$\times$ to 4.2$\times$ compared to existing sparse designs, which are either non-dataflow or non-hardware-aware. Particularly, the throughput of MobileNetV3 can be optimized to 4895 images per second. HASS is open-source: \url{https://github.com/Yu-Zhewen/HASS} △ Less

Submitted 5 June, 2024; originally announced June 2024.

Comments: accepted to FPL2024

arXiv:2406.01125 [pdf, other]

$Δ$-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers

Authors: Pengtao Chen, Mingzhu Shen, Peng Ye, Jianjian Cao, Chongjun Tu, Christos-Savvas Bouganis, Yiren Zhao, Tao Chen

Abstract: Diffusion models are widely recognized for generating high-quality and diverse images, but their poor real-time performance has led to numerous acceleration works, primarily focusing on UNet-based structures. With the more successful results achieved by diffusion transformers (DiT), there is still a lack of exploration regarding the impact of DiT structure on generation, as well as the absence of… ▽ More Diffusion models are widely recognized for generating high-quality and diverse images, but their poor real-time performance has led to numerous acceleration works, primarily focusing on UNet-based structures. With the more successful results achieved by diffusion transformers (DiT), there is still a lack of exploration regarding the impact of DiT structure on generation, as well as the absence of an acceleration framework tailored to the DiT architecture. To tackle these challenges, we conduct an investigation into the correlation between DiT blocks and image generation. Our findings reveal that the front blocks of DiT are associated with the outline of the generated images, while the rear blocks are linked to the details. Based on this insight, we propose an overall training-free inference acceleration framework $Δ$-DiT: using a designed cache mechanism to accelerate the rear DiT blocks in the early sampling stages and the front DiT blocks in the later stages. Specifically, a DiT-specific cache mechanism called $Δ$-Cache is proposed, which considers the inputs of the previous sampling image and reduces the bias in the inference. Extensive experiments on PIXART-$α$ and DiT-XL demonstrate that the $Δ$-DiT can achieve a $1.6\times$ speedup on the 20-step generation and even improves performance in most cases. In the scenario of 4-step consistent model generation and the more challenging $1.12\times$ acceleration, our method significantly outperforms existing methods. Our code will be publicly available. △ Less

Submitted 3 June, 2024; originally announced June 2024.

Comments: 12 pages, 6 figures, 6 tables

arXiv:2403.18921 [pdf, other]

SMOF: Streaming Modern CNNs on FPGAs with Smart Off-Chip Eviction

Authors: Petros Toupas, Zhewen Yu, Christos-Savvas Bouganis, Dimitrios Tzovaras

Abstract: Convolutional Neural Networks (CNNs) have demonstrated their effectiveness in numerous vision tasks. However, their high processing requirements necessitate efficient hardware acceleration to meet the application's performance targets. In the space of FPGAs, streaming-based dataflow architectures are often adopted by users, as significant performance gains can be achieved through layer-wise pipeli… ▽ More Convolutional Neural Networks (CNNs) have demonstrated their effectiveness in numerous vision tasks. However, their high processing requirements necessitate efficient hardware acceleration to meet the application's performance targets. In the space of FPGAs, streaming-based dataflow architectures are often adopted by users, as significant performance gains can be achieved through layer-wise pipelining and reduced off-chip memory access by retaining data on-chip. However, modern topologies, such as the UNet, YOLO, and X3D models, utilise long skip connections, requiring significant on-chip storage and thus limiting the performance achieved by such system architectures. The paper addresses the above limitation by introducing weight and activation eviction mechanisms to off-chip memory along the computational pipeline, taking into account the available compute and memory resources. The proposed mechanism is incorporated into an existing toolflow, expanding the design space by utilising off-chip memory as a buffer. This enables the mapping of such modern CNNs to devices with limited on-chip memory, under the streaming architecture design approach. SMOF has demonstrated the capacity to deliver competitive and, in some cases, state-of-the-art performance across a spectrum of computer vision tasks, achieving up to 10.65 X throughput improvement compared to previous works. △ Less

Submitted 27 March, 2024; originally announced March 2024.

Comments: 12 pages, 8 figures, 5 tables

arXiv:2403.14715 [pdf, other]

Towards Understanding Why Label Smoothing Degrades Selective Classification and How to Fix It

Authors: Guoxuan Xia, Olivier Laurent, Gianni Franchi, Christos-Savvas Bouganis

Abstract: Label smoothing (LS) is a popular regularisation method for training neural networks as it is effective in improving test accuracy and is simple to implement. ``Hard'' one-hot labels are ``smoothed'' by uniformly distributing probability mass to other classes, reducing overfitting. Prior work has suggested that in some cases LS can degrade selective classification (SC) -- where the aim is to rejec… ▽ More Label smoothing (LS) is a popular regularisation method for training neural networks as it is effective in improving test accuracy and is simple to implement. ``Hard'' one-hot labels are ``smoothed'' by uniformly distributing probability mass to other classes, reducing overfitting. Prior work has suggested that in some cases LS can degrade selective classification (SC) -- where the aim is to reject misclassifications using a model's uncertainty. In this work, we first demonstrate empirically across an extended range of large-scale tasks and architectures that LS consistently degrades SC. We then address a gap in existing knowledge, providing an explanation for this behaviour by analysing logit-level gradients: LS degrades the uncertainty rank ordering of correct vs incorrect predictions by suppressing the max logit more when a prediction is likely to be correct, and less when it is likely to be wrong. This elucidates previously reported experimental results where strong classifiers underperform in SC. We then demonstrate the empirical effectiveness of post-hoc logit normalisation for recovering lost SC performance caused by LS. Furthermore, linking back to our gradient analysis, we again provide an explanation for why such normalisation is effective. △ Less

Submitted 20 February, 2025; v1 submitted 19 March, 2024; originally announced March 2024.

Comments: Published as a conference paper at ICLR 2025

arXiv:2311.04764 [pdf, other]

AutoWS: Automate Weights Streaming in Layer-wise Pipelined DNN Accelerators

Authors: Zhewen Yu, Christos-Savvas Bouganis

Abstract: With the great success of Deep Neural Networks (DNN), the design of efficient hardware accelerators has triggered wide interest in the research community. Existing research explores two architectural strategies: sequential layer execution and layer-wise pipelining. While the former supports a wider range of models, the latter is favoured for its enhanced customization and efficiency. A challenge f… ▽ More With the great success of Deep Neural Networks (DNN), the design of efficient hardware accelerators has triggered wide interest in the research community. Existing research explores two architectural strategies: sequential layer execution and layer-wise pipelining. While the former supports a wider range of models, the latter is favoured for its enhanced customization and efficiency. A challenge for the layer-wise pipelining architecture is its substantial demand for the on-chip memory for weights storage, impeding the deployment of large-scale networks on resource-constrained devices. This paper introduces AutoWS, a pioneering memory management methodology that exploits both on-chip and off-chip memory to optimize weight storage within a layer-wise pipelining architecture, taking advantage of its static schedule. Through a comprehensive investigation on both the hardware design and the Design Space Exploration, our methodology is fully automated and enables the deployment of large-scale DNN models on resource-constrained devices, which was not possible in existing works that target layer-wise pipelining architectures. AutoWS is open-source: https://github.com/Yu-Zhewen/AutoWS △ Less

Submitted 8 November, 2023; originally announced November 2023.

Comments: accepted by DATE2024

arXiv:2309.01587 [pdf, other]

SATAY: A Streaming Architecture Toolflow for Accelerating YOLO Models on FPGA Devices

Authors: Alexander Montgomerie-Corcoran, Petros Toupas, Zhewen Yu, Christos-Savvas Bouganis

Abstract: AI has led to significant advancements in computer vision and image processing tasks, enabling a wide range of applications in real-life scenarios, from autonomous vehicles to medical imaging. Many of those applications require efficient object detection algorithms and complementary real-time, low latency hardware to perform inference of these algorithms. The YOLO family of models is considered th… ▽ More AI has led to significant advancements in computer vision and image processing tasks, enabling a wide range of applications in real-life scenarios, from autonomous vehicles to medical imaging. Many of those applications require efficient object detection algorithms and complementary real-time, low latency hardware to perform inference of these algorithms. The YOLO family of models is considered the most efficient for object detection, having only a single model pass. Despite this, the complexity and size of YOLO models can be too computationally demanding for current edge-based platforms. To address this, we present SATAY: a Streaming Architecture Toolflow for Accelerating YOLO. This work tackles the challenges of deploying stateof-the-art object detection models onto FPGA devices for ultralow latency applications, enabling real-time, edge-based object detection. We employ a streaming architecture design for our YOLO accelerators, implementing the complete model on-chip in a deeply pipelined fashion. These accelerators are generated using an automated toolflow, and can target a range of suitable FPGA devices. We introduce novel hardware components to support the operations of YOLO models in a dataflow manner, and off-chip memory buffering to address the limited on-chip memory resources. Our toolflow is able to generate accelerator designs which demonstrate competitive performance and energy characteristics to GPU devices, and which outperform current state-of-the-art FPGA accelerators. △ Less

Submitted 4 September, 2023; originally announced September 2023.

arXiv:2307.15517 [pdf, other]

A Dataflow Compiler for Efficient LLM Inference using Custom Microscaling Formats

Authors: Jianyi Cheng, Cheng Zhang, Zhewen Yu, Christos-Savvas Bouganis, George A. Constantinides, Yiren Zhao

Abstract: Model quantization represents both parameters (weights) and intermediate values (activations) in a more compact format, thereby directly reducing both computational and memory cost in hardware. The quantization of recent large language models (LLMs) faces challenges to achieve competitive memory density compared to other models such as convolutional neural networks, since values in LLMs require la… ▽ More Model quantization represents both parameters (weights) and intermediate values (activations) in a more compact format, thereby directly reducing both computational and memory cost in hardware. The quantization of recent large language models (LLMs) faces challenges to achieve competitive memory density compared to other models such as convolutional neural networks, since values in LLMs require larger dynamic ranges. Current hardware can expedite computation for LLMs using compact numerical formats such as low-bitwidth integers or floating-point numbers. Each has advantages: integer operations simplify circuit design, whereas floating-point calculations can enhance accuracy when a wider dynamic range is required. In this work, we seek an efficient data format that combines the best of both worlds: Microscaling (MX) formats. MX formats are efficient data formats that achieve both large dynamic ranges and high memory density. In this paper, we propose a compiler named MASE for exploring mixed-precision MX formats on dataflow hardware accelerators for LLM inference. Our main contributions are twofold. First, we propose a novel orchestration abstraction to explore both software and hardware optimizations with new data formats. Second, MASE achieves LLM inference at an average precision of 4-bits, with minimal to no accuracy degradation. To our knowledge, MASE represents the first effort to harness fine-grain multi-precision MX formats in the design of LLM hardware accelerators. Over a range of LLMs and datasets, MASE achieves an average improvement of 24% in $Δ$ accuracy with an overhead of only 3% in energy efficiency compared to designs using 8-bit fixed-point numbers. △ Less

Submitted 19 April, 2024; v1 submitted 28 July, 2023; originally announced July 2023.

arXiv:2307.07821 [pdf, other]

PASS: Exploiting Post-Activation Sparsity in Streaming Architectures for CNN Acceleration

Authors: Alexander Montgomerie-Corcoran, Zhewen Yu, Jianyi Cheng, Christos-Savvas Bouganis

Abstract: With the ever-growing popularity of Artificial Intelligence, there is an increasing demand for more performant and efficient underlying hardware. Convolutional Neural Networks (CNN) are a workload of particular importance, which achieve high accuracy in computer vision applications. Inside CNNs, a significant number of the post-activation values are zero, resulting in many redundant computations.… ▽ More With the ever-growing popularity of Artificial Intelligence, there is an increasing demand for more performant and efficient underlying hardware. Convolutional Neural Networks (CNN) are a workload of particular importance, which achieve high accuracy in computer vision applications. Inside CNNs, a significant number of the post-activation values are zero, resulting in many redundant computations. Recent works have explored this post-activation sparsity on instruction-based CNN accelerators but not on streaming CNN accelerators, despite the fact that streaming architectures are considered the leading design methodology in terms of performance. In this paper, we highlight the challenges associated with exploiting post-activation sparsity for performance gains in streaming CNN accelerators, and demonstrate our approach to address them. Using a set of modern CNN benchmarks, our streaming sparse accelerators achieve 1.41x to 1.93x efficiency (GOP/s/DSP) compared to state-of-the-art instruction-based sparse accelerators. △ Less

Submitted 15 July, 2023; originally announced July 2023.

arXiv:2306.05021 [pdf, other]

Mixed-TD: Efficient Neural Network Accelerator with Layer-Specific Tensor Decomposition

Authors: Zhewen Yu, Christos-Savvas Bouganis

Abstract: Neural Network designs are quite diverse, from VGG-style to ResNet-style, and from Convolutional Neural Networks to Transformers. Towards the design of efficient accelerators, many works have adopted a dataflow-based, inter-layer pipelined architecture, with a customised hardware towards each layer, achieving ultra high throughput and low latency. The deployment of neural networks to such dataflow… ▽ More Neural Network designs are quite diverse, from VGG-style to ResNet-style, and from Convolutional Neural Networks to Transformers. Towards the design of efficient accelerators, many works have adopted a dataflow-based, inter-layer pipelined architecture, with a customised hardware towards each layer, achieving ultra high throughput and low latency. The deployment of neural networks to such dataflow architecture accelerators is usually hindered by the available on-chip memory as it is desirable to preload the weights of neural networks on-chip to maximise the system performance. To address this, networks are usually compressed before the deployment through methods such as pruning, quantization and tensor decomposition. In this paper, a framework for mapping CNNs onto FPGAs based on a novel tensor decomposition method called Mixed-TD is proposed. The proposed method applies layer-specific Singular Value Decomposition (SVD) and Canonical Polyadic Decomposition (CPD) in a mixed manner, achieving 1.73x to 10.29x throughput per DSP to state-of-the-art CNNs. Our work is open-sourced: https://github.com/Yu-Zhewen/Mixed-TD △ Less

Submitted 22 June, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

Comments: accepted by FPL2023

arXiv:2305.19896 [pdf, other]

doi 10.1109/FPL60245.2023.00020

fpgaHART: A toolflow for throughput-oriented acceleration of 3D CNNs for HAR onto FPGAs

Authors: Petros Toupas, Christos-Savvas Bouganis, Dimitrios Tzovaras

Abstract: Surveillance systems, autonomous vehicles, human monitoring systems, and video retrieval are just few of the many applications in which 3D Convolutional Neural Networks are exploited. However, their extensive use is restricted by their high computational and memory requirements, especially when integrated into systems with limited resources. This study proposes a toolflow that optimises the mappin… ▽ More Surveillance systems, autonomous vehicles, human monitoring systems, and video retrieval are just few of the many applications in which 3D Convolutional Neural Networks are exploited. However, their extensive use is restricted by their high computational and memory requirements, especially when integrated into systems with limited resources. This study proposes a toolflow that optimises the mapping of 3D CNN models for Human Action Recognition onto FPGA devices, taking into account FPGA resources and off-chip memory characteristics. The proposed system employs Synchronous Dataflow (SDF) graphs to model the designs and introduces transformations to expand and explore the design space, resulting in high-throughput designs. A variety of 3D CNN models were evaluated using the proposed toolflow on multiple FPGA devices, demonstrating its potential to deliver competitive performance compared to earlier hand-tuned and model-specific designs. △ Less

Submitted 31 May, 2023; originally announced May 2023.

Comments: 7 pages, 3 figures, 4 tables. arXiv admin note: substantial text overlap with arXiv:2305.18479

arXiv:2305.18479 [pdf, other]

doi 10.1109/ASAP57973.2023.00030

FMM-X3D: FPGA-based modeling and mapping of X3D for Human Action Recognition

Authors: Petros Toupas, Christos-Savvas Bouganis, Dimitrios Tzovaras

Abstract: 3D Convolutional Neural Networks are gaining increasing attention from researchers and practitioners and have found applications in many domains, such as surveillance systems, autonomous vehicles, human monitoring systems, and video retrieval. However, their widespread adoption is hindered by their high computational and memory requirements, especially when resource-constrained systems are targete… ▽ More 3D Convolutional Neural Networks are gaining increasing attention from researchers and practitioners and have found applications in many domains, such as surveillance systems, autonomous vehicles, human monitoring systems, and video retrieval. However, their widespread adoption is hindered by their high computational and memory requirements, especially when resource-constrained systems are targeted. This paper addresses the problem of mapping X3D, a state-of-the-art model in Human Action Recognition that achieves accuracy of 95.5\% in the UCF101 benchmark, onto any FPGA device. The proposed toolflow generates an optimised stream-based hardware system, taking into account the available resources and off-chip memory characteristics of the FPGA device. The generated designs push further the current performance-accuracy pareto front, and enable for the first time the targeting of such complex model architectures for the Human Action Recognition task. △ Less

Submitted 29 May, 2023; originally announced May 2023.

Comments: 8 pages, 6 figures, 2 tables

arXiv:2304.08400 [pdf, other]

ATHEENA: A Toolflow for Hardware Early-Exit Network Automation

Authors: Benjamin Biggs, Christos-Savvas Bouganis, George A. Constantinides

Abstract: The continued need for improvements in accuracy, throughput, and efficiency of Deep Neural Networks has resulted in a multitude of methods that make the most of custom architectures on FPGAs. These include the creation of hand-crafted networks and the use of quantization and pruning to reduce extraneous network parameters. However, with the potential of static solutions already well exploited, we… ▽ More The continued need for improvements in accuracy, throughput, and efficiency of Deep Neural Networks has resulted in a multitude of methods that make the most of custom architectures on FPGAs. These include the creation of hand-crafted networks and the use of quantization and pruning to reduce extraneous network parameters. However, with the potential of static solutions already well exploited, we propose to shift the focus to using the varying difficulty of individual data samples to further improve efficiency and reduce average compute for classification. Input-dependent computation allows for the network to make runtime decisions to finish a task early if the result meets a confidence threshold. Early-Exit network architectures have become an increasingly popular way to implement such behaviour in software. We create: A Toolflow for Hardware Early-Exit Network Automation (ATHEENA), an automated FPGA toolflow that leverages the probability of samples exiting early from such networks to scale the resources allocated to different sections of the network. The toolflow uses the data-flow model of fpgaConvNet, extended to support Early-Exit networks as well as Design Space Exploration to optimize the generated streaming architecture hardware with the goal of increasing throughput/reducing area while maintaining accuracy. Experimental results on three different networks demonstrate a throughput increase of $2.00\times$ to $2.78\times$ compared to an optimized baseline network implementation with no early exits. Additionally, the toolflow can achieve a throughput matching the same baseline with as low as $46\%$ of the resources the baseline requires. △ Less

Submitted 14 April, 2025; v1 submitted 17 April, 2023; originally announced April 2023.

Journal ref: Proceedings of IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2023, 121-132

arXiv:2303.17218 [pdf, other]

doi 10.1109/FCCM57271.2023.00024

HARFLOW3D: A Latency-Oriented 3D-CNN Accelerator Toolflow for HAR on FPGA Devices

Authors: Petros Toupas, Alexander Montgomerie-Corcoran, Christos-Savvas Bouganis, Dimitrios Tzovaras

Abstract: For Human Action Recognition tasks (HAR), 3D Convolutional Neural Networks have proven to be highly effective, achieving state-of-the-art results. This study introduces a novel streaming architecture based toolflow for mapping such models onto FPGAs considering the model's inherent characteristics and the features of the targeted FPGA device. The HARFLOW3D toolflow takes as input a 3D CNN in ONNX… ▽ More For Human Action Recognition tasks (HAR), 3D Convolutional Neural Networks have proven to be highly effective, achieving state-of-the-art results. This study introduces a novel streaming architecture based toolflow for mapping such models onto FPGAs considering the model's inherent characteristics and the features of the targeted FPGA device. The HARFLOW3D toolflow takes as input a 3D CNN in ONNX format and a description of the FPGA characteristics, generating a design that minimizes the latency of the computation. The toolflow is comprised of a number of parts, including i) a 3D CNN parser, ii) a performance and resource model, iii) a scheduling algorithm for executing 3D models on the generated hardware, iv) a resource-aware optimization engine tailored for 3D models, v) an automated mapping to synthesizable code for FPGAs. The ability of the toolflow to support a broad range of models and devices is shown through a number of experiments on various 3D CNN and FPGA system pairs. Furthermore, the toolflow has produced high-performing results for 3D CNN models that have not been mapped to FPGAs before, demonstrating the potential of FPGA-based systems in this space. Overall, HARFLOW3D has demonstrated its ability to deliver competitive latency compared to a range of state-of-the-art hand-tuned approaches being able to achieve up to 5$\times$ better performance compared to some of the existing works. △ Less

Submitted 29 May, 2023; v1 submitted 30 March, 2023; originally announced March 2023.

Comments: 11 pages, 8 figures, 6 tables

arXiv:2303.08010 [pdf, other]

Window-Based Early-Exit Cascades for Uncertainty Estimation: When Deep Ensembles are More Efficient than Single Models

Authors: Guoxuan Xia, Christos-Savvas Bouganis

Abstract: Deep Ensembles are a simple, reliable, and effective method of improving both the predictive performance and uncertainty estimates of deep learning approaches. However, they are widely criticised as being computationally expensive, due to the need to deploy multiple independent models. Recent work has challenged this view, showing that for predictive accuracy, ensembles can be more computationally… ▽ More Deep Ensembles are a simple, reliable, and effective method of improving both the predictive performance and uncertainty estimates of deep learning approaches. However, they are widely criticised as being computationally expensive, due to the need to deploy multiple independent models. Recent work has challenged this view, showing that for predictive accuracy, ensembles can be more computationally efficient (at inference) than scaling single models within an architecture family. This is achieved by cascading ensemble members via an early-exit approach. In this work, we investigate extending these efficiency gains to tasks related to uncertainty estimation. As many such tasks, e.g. selective classification, are binary classification, our key novel insight is to only pass samples within a window close to the binary decision boundary to later cascade stages. Experiments on ImageNet-scale data across a number of network architectures and uncertainty tasks show that the proposed window-based early-exit approach is able to achieve a superior uncertainty-computation trade-off compared to scaling single models. For example, a cascaded EfficientNet-B2 ensemble is able to achieve similar coverage at 5% risk as a single EfficientNet-B4 with <30% the number of MACs. We also find that cascades/ensembles give more reliable improvements on OOD data vs scaling models up. Code for this work is available at: https://github.com/Guoxoug/window-early-exit. △ Less

Submitted 9 October, 2023; v1 submitted 14 March, 2023; originally announced March 2023.

Comments: Accepted to ICCV 2023 (camera-ready version, 9 pages)

arXiv:2208.10404 [pdf, other]

SVD-NAS: Coupling Low-Rank Approximation and Neural Architecture Search

Authors: Zhewen Yu, Christos-Savvas Bouganis

Abstract: The task of compressing pre-trained Deep Neural Networks has attracted wide interest of the research community due to its great benefits in freeing practitioners from data access requirements. In this domain, low-rank approximation is a promising method, but existing solutions considered a restricted number of design choices and failed to efficiently explore the design space, which lead to severe… ▽ More The task of compressing pre-trained Deep Neural Networks has attracted wide interest of the research community due to its great benefits in freeing practitioners from data access requirements. In this domain, low-rank approximation is a promising method, but existing solutions considered a restricted number of design choices and failed to efficiently explore the design space, which lead to severe accuracy degradation and limited compression ratio achieved. To address the above limitations, this work proposes the SVD-NAS framework that couples the domains of low-rank approximation and neural architecture search. SVD-NAS generalises and expands the design choices of previous works by introducing the Low-Rank architecture space, LR-space, which is a more fine-grained design space of low-rank approximation. Afterwards, this work proposes a gradient-descent-based search for efficiently traversing the LR-space. This finer and more thorough exploration of the possible design choices results in improved accuracy as well as reduction in parameters, FLOPS, and latency of a CNN model. Results demonstrate that the SVD-NAS achieves 2.06-12.85pp higher accuracy on ImageNet than state-of-the-art methods under the data-limited problem setting. SVD-NAS is open-sourced at https://github.com/Yu-Zhewen/SVD-NAS. △ Less

Submitted 22 August, 2022; originally announced August 2022.

Comments: Accepted at WACV 2023

arXiv:2207.07517 [pdf, other]

On the Usefulness of Deep Ensemble Diversity for Out-of-Distribution Detection

Authors: Guoxuan Xia, Christos-Savvas Bouganis

Abstract: The ability to detect Out-of-Distribution (OOD) data is important in safety-critical applications of deep learning. The aim is to separate In-Distribution (ID) data drawn from the training distribution from OOD data using a measure of uncertainty extracted from a deep neural network. Deep Ensembles are a well-established method of improving the quality of uncertainty estimates produced by deep neu… ▽ More The ability to detect Out-of-Distribution (OOD) data is important in safety-critical applications of deep learning. The aim is to separate In-Distribution (ID) data drawn from the training distribution from OOD data using a measure of uncertainty extracted from a deep neural network. Deep Ensembles are a well-established method of improving the quality of uncertainty estimates produced by deep neural networks, and have been shown to have superior OOD detection performance compared to single models. An existing intuition in the literature is that the diversity of Deep Ensemble predictions indicates distributional shift, and so measures of diversity such as Mutual Information (MI) should be used for OOD detection. We show experimentally that this intuition is not valid on ImageNet-scale OOD detection -- using MI leads to 30-40% worse %FPR@95 compared to single-model entropy on some OOD datasets. We suggest an alternative explanation for Deep Ensembles' better OOD detection performance -- OOD detection is binary classification and we are ensembling diverse classifiers. As such we show that practically, even better OOD detection performance can be achieved for Deep Ensembles by averaging task-specific detection scores such as Energy over the ensemble. △ Less

Submitted 20 September, 2022; v1 submitted 15 July, 2022; originally announced July 2022.

Comments: Workshop on Uncertainty Quantification for Computer Vision, ECCV 2022

arXiv:2207.07506 [pdf, other]

Augmenting Softmax Information for Selective Classification with Out-of-Distribution Data

Authors: Guoxuan Xia, Christos-Savvas Bouganis

Abstract: Detecting out-of-distribution (OOD) data is a task that is receiving an increasing amount of research attention in the domain of deep learning for computer vision. However, the performance of detection methods is generally evaluated on the task in isolation, rather than also considering potential downstream tasks in tandem. In this work, we examine selective classification in the presence of OOD d… ▽ More Detecting out-of-distribution (OOD) data is a task that is receiving an increasing amount of research attention in the domain of deep learning for computer vision. However, the performance of detection methods is generally evaluated on the task in isolation, rather than also considering potential downstream tasks in tandem. In this work, we examine selective classification in the presence of OOD data (SCOD). That is to say, the motivation for detecting OOD samples is to reject them so their impact on the quality of predictions is reduced. We show under this task specification, that existing post-hoc methods perform quite differently compared to when evaluated only on OOD detection. This is because it is no longer an issue to conflate in-distribution (ID) data with OOD data if the ID data is going to be misclassified. However, the conflation within ID data of correct and incorrect predictions becomes undesirable. We also propose a novel method for SCOD, Softmax Information Retaining Combination (SIRC), that augments softmax-based confidence scores with feature-agnostic information such that their ability to identify OOD samples is improved without sacrificing separation between correct and incorrect ID predictions. Experiments on a wide variety of ImageNet-scale datasets and convolutional neural network architectures show that SIRC is able to consistently match or outperform the baseline for SCOD, whilst existing OOD detection methods fail to do so. △ Less

Submitted 14 March, 2023; v1 submitted 15 July, 2022; originally announced July 2022.

Comments: ACCV 2022 (Best Paper Award) https://openaccess.thecvf.com/content/ACCV2022/html/Xia_Augmenting_Softmax_Information_for_Selective_Classification_with_Out-of-Distribution_Data_ACCV_2022_paper.html

arXiv:2205.09376 [pdf, other]

Multi-DNN Accelerators for Next-Generation AI Systems

Authors: Stylianos I. Venieris, Christos-Savvas Bouganis, Nicholas D. Lane

Abstract: As the use of AI-powered applications widens across multiple domains, so do increase the computational demands. Primary driver of AI technology are the deep neural networks (DNNs). When focusing either on cloud-based systems that serve multiple AI queries from different users each with their own DNN model, or on mobile robots and smartphones employing pipelines of various models or parallel DNNs f… ▽ More As the use of AI-powered applications widens across multiple domains, so do increase the computational demands. Primary driver of AI technology are the deep neural networks (DNNs). When focusing either on cloud-based systems that serve multiple AI queries from different users each with their own DNN model, or on mobile robots and smartphones employing pipelines of various models or parallel DNNs for the concurrent processing of multi-modal data, the next generation of AI systems will have multi-DNN workloads at their core. Large-scale deployment of AI services and integration across mobile and embedded systems require additional breakthroughs in the computer architecture front, with processors that can maintain high performance as the number of DNNs increases while meeting the quality-of-service requirements, giving rise to the topic of multi-DNN accelerator design. △ Less

Submitted 19 May, 2022; originally announced May 2022.

Comments: Accepted for publication at the IEEE Computer journal, 2022

arXiv:2203.00772 [pdf, ps, other]

Low-Cost On-device Partial Domain Adaptation (LoCO-PDA): Enabling efficient CNN retraining on edge devices

Authors: Aditya Rajagopal, Christos-Savvas Bouganis

Abstract: With the increased deployment of Convolutional Neural Networks (CNNs) on edge devices, the uncertainty of the observed data distribution upon deployment has led researchers to to utilise large and extensive datasets such as ILSVRC'12 to train CNNs. Consequently, it is likely that the observed data distribution upon deployment is a subset of the training data distribution. In such cases, not adapti… ▽ More With the increased deployment of Convolutional Neural Networks (CNNs) on edge devices, the uncertainty of the observed data distribution upon deployment has led researchers to to utilise large and extensive datasets such as ILSVRC'12 to train CNNs. Consequently, it is likely that the observed data distribution upon deployment is a subset of the training data distribution. In such cases, not adapting a network to the observed data distribution can cause performance degradation due to negative transfer and alleviating this is the focus of Partial Domain Adaptation (PDA). Current works targeting PDA do not focus on performing the domain adaptation on an edge device, adapting to a changing target distribution or reducing the cost of deploying the adapted network. This work proposes a novel PDA methodology that targets all of these directions and opens avenues for on-device PDA. LoCO-PDA adapts a deployed network to the observed data distribution by enabling it to be retrained on an edge device. Across subsets of the ILSVRC12 dataset, LoCO-PDA improves classification accuracy by 3.04pp on average while achieving up to 15.1x reduction in retraining memory consumption and 2.07x improvement in inference latency on the NVIDIA Jetson TX2. The work is open-sourced at \emph{link removed for anonymity}. △ Less

Submitted 1 March, 2022; originally announced March 2022.

arXiv:2112.00170 [pdf, other]

SAMO: Optimised Mapping of Convolutional Neural Networks to Streaming Architectures

Authors: Alexander Montgomerie-Corcoran, Zhewen Yu, Christos-Savvas Bouganis

Abstract: Significant effort has been placed on the development of toolflows that map Convolutional Neural Network (CNN) models to Field Programmable Gate Arrays (FPGAs) with the aim of automating the production of high performing designs for a diverse set of applications. However, within these toolflows, the problem of finding an optimal mapping is often overlooked, with the expectation that the end user w… ▽ More Significant effort has been placed on the development of toolflows that map Convolutional Neural Network (CNN) models to Field Programmable Gate Arrays (FPGAs) with the aim of automating the production of high performing designs for a diverse set of applications. However, within these toolflows, the problem of finding an optimal mapping is often overlooked, with the expectation that the end user will tune their generated hardware for their desired platform. This is particularly prominent within Streaming Architecture toolflows, where there is a large design space to explore . In this work, we establish the framework SAMO: a Streaming Architecture Mapping Optimiser. SAMO exploits the structure of CNN models and the common features that exist in Streaming Architectures, and casts the mapping optimisation problem under a unified methodology. Furthermore, SAMO explicitly explores the reconfigurability property of FPGAs, allowing the methodology to overcome mapping limitations imposed by certain toolflows under resource-constrained scenarios, as well as improve on the achievable throughput. Three optimisation methods - Brute-Force, Simulated Annealing and Rule-Based - have been developed in order to generate valid, high performance designs for a range of target platforms and CNN models. Results show that SAMO-optimised designs can achieve 4-20x better performance compared to existing hand-tuned designs. The SAMO framework is open-source: https://github.com/AlexMontgomerie/samo. △ Less

Submitted 9 August, 2022; v1 submitted 30 November, 2021; originally announced December 2021.

arXiv:2108.05580 [pdf, other]

perf4sight: A toolflow to model CNN training performance on Edge GPUs

Authors: Aditya Rajagopal, Christos-Savvas Bouganis

Abstract: The increased memory and processing capabilities of today's edge devices create opportunities for greater edge intelligence. In the domain of vision, the ability to adapt a Convolutional Neural Network's (CNN) structure and parameters to the input data distribution leads to systems with lower memory footprint, latency and power consumption. However, due to the limited compute resources and memory… ▽ More The increased memory and processing capabilities of today's edge devices create opportunities for greater edge intelligence. In the domain of vision, the ability to adapt a Convolutional Neural Network's (CNN) structure and parameters to the input data distribution leads to systems with lower memory footprint, latency and power consumption. However, due to the limited compute resources and memory budget on edge devices, it is necessary for the system to be able to predict the latency and memory footprint of the training process in order to identify favourable training configurations of the network topology and device combination for efficient network adaptation. This work proposes perf4sight, an automated methodology for developing accurate models that predict CNN training memory footprint and latency given a target device and network. This enables rapid identification of network topologies that can be retrained on the edge device with low resource consumption. With PyTorch as the framework and NVIDIA Jetson TX2 as the target device, the developed models predict training memory footprint and latency with 95% and 91% accuracy respectively for a wide range of networks, opening the path towards efficient network adaptation on edge GPUs. △ Less

Submitted 12 August, 2021; originally announced August 2021.

Comments: Accepted into the Workshop on Embedded and Real-World Computer Vision in Autonomous Driving (ERCVAD), ICCV 2021

arXiv:2107.10047 [pdf, other]

Performance landscape of resource-constrained platforms targeting DNNs

Authors: Panagiotis Miliadis, Christos-Savvas Bouganis, Dionisios Pnevmatikatos

Abstract: Over the recent years, a significant number of complex, deep neural networks have been developed for a variety of applications including speech and face recognition, computer vision in the areas of health-care, automatic translation, image classification, etc. Moreover, there is an increasing demand in deploying these networks in resource-constrained edge devices. As the computational demands of t… ▽ More Over the recent years, a significant number of complex, deep neural networks have been developed for a variety of applications including speech and face recognition, computer vision in the areas of health-care, automatic translation, image classification, etc. Moreover, there is an increasing demand in deploying these networks in resource-constrained edge devices. As the computational demands of these models keep increasing, pushing to their limits the targeted devices, the constant development of new hardware systems tailored to those workloads has been observed. Since programmability of these diverse and complex platforms -- compounded by the rapid development of new DNN models -- is a major challenge, platform vendors have developed Machine Learning tailored SDKs to maximize the platform's performance. This work investigates the performance achieved on a number of modern commodity embedded platforms coupled with the vendors' provided software support when state-of-the-art DNN models from image classification, object detection and image segmentation are targeted. The work quantifies the relative latency gains of the particular embedded platforms and provides insights on the relationship between the required minimum batch size for achieving maximum throughput, concluding that modern embedded systems reach their maximum performance even for modest batch sizes when a modern state of the art DNN model is targeted. Overall, the presented results provide a guide for the expected performance for a number of state-of-the-art DNNs on popular embedded platforms across the image classification, detection and segmentation domains. △ Less

Submitted 3 November, 2021; v1 submitted 21 July, 2021; originally announced July 2021.

arXiv:2006.13829 [pdf, other]

Caffe Barista: Brewing Caffe with FPGAs in the Training Loop

Authors: Diederik Adriaan Vink, Aditya Rajagopal, Stylianos I. Venieris, Christos-Savvas Bouganis

Abstract: As the complexity of deep learning (DL) models increases, their compute requirements increase accordingly. Deploying a Convolutional Neural Network (CNN) involves two phases: training and inference. With the inference task typically taking place on resource-constrained devices, a lot of research has explored the field of low-power inference on custom hardware accelerators. On the other hand, train… ▽ More As the complexity of deep learning (DL) models increases, their compute requirements increase accordingly. Deploying a Convolutional Neural Network (CNN) involves two phases: training and inference. With the inference task typically taking place on resource-constrained devices, a lot of research has explored the field of low-power inference on custom hardware accelerators. On the other hand, training is both more compute- and memory-intensive and is primarily performed on power-hungry GPUs in large-scale data centres. CNN training on FPGAs is a nascent field of research. This is primarily due to the lack of tools to easily prototype and deploy various hardware and/or algorithmic techniques for power-efficient CNN training. This work presents Barista, an automated toolflow that provides seamless integration of FPGAs into the training of CNNs within the popular deep learning framework Caffe. To the best of our knowledge, this is the only tool that allows for such versatile and rapid deployment of hardware and algorithms for the FPGA-based training of CNNs, providing the necessary infrastructure for further research and development. △ Less

Submitted 18 June, 2020; originally announced June 2020.

Comments: Published as short paper at FPL2020

arXiv:2006.09049 [pdf, other]

Multi-Precision Policy Enforced Training (MuPPET): A precision-switching strategy for quantised fixed-point training of CNNs

Authors: Aditya Rajagopal, Diederik Adriaan Vink, Stylianos I. Venieris, Christos-Savvas Bouganis

Abstract: Large-scale convolutional neural networks (CNNs) suffer from very long training times, spanning from hours to weeks, limiting the productivity and experimentation of deep learning practitioners. As networks grow in size and complexity, training time can be reduced through low-precision data representations and computations. However, in doing so the final accuracy suffers due to the problem of vani… ▽ More Large-scale convolutional neural networks (CNNs) suffer from very long training times, spanning from hours to weeks, limiting the productivity and experimentation of deep learning practitioners. As networks grow in size and complexity, training time can be reduced through low-precision data representations and computations. However, in doing so the final accuracy suffers due to the problem of vanishing gradients. Existing state-of-the-art methods combat this issue by means of a mixed-precision approach utilising two different precision levels, FP32 (32-bit floating-point) and FP16/FP8 (16-/8-bit floating-point), leveraging the hardware support of recent GPU architectures for FP16 operations to obtain performance gains. This work pushes the boundary of quantised training by employing a multilevel optimisation approach that utilises multiple precisions including low-precision fixed-point representations. The novel training strategy, MuPPET, combines the use of multiple number representation regimes together with a precision-switching mechanism that decides at run time the transition point between precision regimes. Overall, the proposed strategy tailors the training process to the hardware-level capabilities of the target hardware architecture and yields improvements in training time and energy efficiency compared to state-of-the-art approaches. Applying MuPPET on the training of AlexNet, ResNet18 and GoogLeNet on ImageNet (ILSVRC12) and targeting an NVIDIA Turing GPU, MuPPET achieves the same accuracy as standard full-precision training with training-time speedup of up to 1.84$\times$ and an average speedup of 1.58$\times$ across the networks. △ Less

Submitted 16 June, 2020; originally announced June 2020.

Comments: Accepted at the 37th International Conference on Machine Learning (ICML), 2020

arXiv:2006.08554 [pdf, other]

Now that I can see, I can improve: Enabling data-driven finetuning of CNNs on the edge

Authors: Aditya Rajagopal, Christos-Savvas Bouganis

Abstract: In today's world, a vast amount of data is being generated by edge devices that can be used as valuable training data to improve the performance of machine learning algorithms in terms of the achieved accuracy or to reduce the compute requirements of the model. However, due to user data privacy concerns as well as storage and communication bandwidth limitations, this data cannot be moved from the… ▽ More In today's world, a vast amount of data is being generated by edge devices that can be used as valuable training data to improve the performance of machine learning algorithms in terms of the achieved accuracy or to reduce the compute requirements of the model. However, due to user data privacy concerns as well as storage and communication bandwidth limitations, this data cannot be moved from the device to the data centre for further improvement of the model and subsequent deployment. As such there is a need for increased edge intelligence, where the deployed models can be fine-tuned on the edge, leading to improved accuracy and/or reducing the model's workload as well as its memory and power footprint. In the case of Convolutional Neural Networks (CNNs), both the weights of the network as well as its topology can be tuned to adapt to the data that it processes. This paper provides a first step towards enabling CNN finetuning on an edge device based on structured pruning. It explores the performance gains and costs of doing so and presents an extensible open-source framework that allows the deployment of such approaches on a wide range of network architectures and devices. The results show that on average, data-aware pruning with retraining can provide 10.2pp increased accuracy over a wide range of subsets, networks and pruning levels with a maximum improvement of 42.0pp over pruning and retraining in a manner agnostic to the data being processed by the network. △ Less

Submitted 15 June, 2020; originally announced June 2020.

Comments: Accepted for publication at CVPR2020 workshop - Efficient Deep Learning for Computer Vision

arXiv:1905.00689 [pdf, other]

Approximate LSTMs for Time-Constrained Inference: Enabling Fast Reaction in Self-Driving Cars

Authors: Alexandros Kouris, Stylianos I. Venieris, Michail Rizakis, Christos-Savvas Bouganis

Abstract: The need to recognise long-term dependencies in sequential data such as video streams has made Long Short-Term Memory (LSTM) networks a prominent Artificial Intelligence model for many emerging applications. However, the high computational and memory demands of LSTMs introduce challenges in their deployment on latency-critical systems such as self-driving cars which are equipped with limited compu… ▽ More The need to recognise long-term dependencies in sequential data such as video streams has made Long Short-Term Memory (LSTM) networks a prominent Artificial Intelligence model for many emerging applications. However, the high computational and memory demands of LSTMs introduce challenges in their deployment on latency-critical systems such as self-driving cars which are equipped with limited computational resources on-board. In this paper, we introduce a progressive inference computing scheme that combines model pruning and computation restructuring leading to the best possible approximation of the result given the available latency budget of the target application. The proposed methodology enables mission-critical systems to make informed decisions even in early stages of the computation, based on approximate LSTM inference, meeting their specifications on safety and robustness. Our experiments on a state-of-the-art driving model for autonomous vehicle navigation demonstrate that the proposed approach can yield outputs with similar quality of result compared to a faithful LSTM baseline, up to 415x faster (198x on average, 76x geo. mean). △ Less

Submitted 30 October, 2019; v1 submitted 2 May, 2019; originally announced May 2019.

Comments: PREPRINT: Accepted for publication at the IEEE Consumer Electronics Magazine (CEM). [Acceptance Date: 28-Oct-2019]

arXiv:1902.04907 [pdf, other]

A Scalable FPGA-based Architecture for Depth Estimation in SLAM

Authors: Konstantinos Boikos, Christos-Savvas Bouganis

Abstract: The current state of the art of Simultaneous Localisation and Mapping, or SLAM, on low power embedded systems is about sparse localisation and mapping with low resolution results in the name of efficiency. Meanwhile, research in this field has provided many advances for information rich processing and semantic understanding, combined with high computational requirements for real-time processing. T… ▽ More The current state of the art of Simultaneous Localisation and Mapping, or SLAM, on low power embedded systems is about sparse localisation and mapping with low resolution results in the name of efficiency. Meanwhile, research in this field has provided many advances for information rich processing and semantic understanding, combined with high computational requirements for real-time processing. This work provides a solution to bridging this gap, in the form of a scalable SLAM-specific architecture for depth estimation for direct semi-dense SLAM. Targeting an off-the-shelf FPGA-SoC this accelerator architecture achieves a rate of more than 60 mapped frames/sec at a resolution of 640x480 achieving performance on par to a highly-optimised parallel implementation on a high-end desktop CPU with an order of magnitude improved power consumption. Furthermore, the developed architecture is combined with our previous work for the task of tracking, to form the first complete accelerator for semi-dense SLAM on FPGAs, establishing the state of the art in the area of embedded low-power systems. △ Less

Submitted 13 February, 2019; originally announced February 2019.

Comments: Accepted for publication in the 15th International Symposium on Applied Reconfigurable Computing (ARC 2019). This is a pre-print. The final version will be published by Springer-Verlag in the Lecture Notes in Computer Science series

arXiv:1807.06789 [pdf, other]

doi 10.23919/DATE.2018.8342149

DroNet: Efficient convolutional neural network detector for real-time UAV applications

Authors: Christos Kyrkou, George Plastiras, Stylianos Venieris, Theocharis Theocharides, Christos-Savvas Bouganis

Abstract: Unmanned Aerial Vehicles (drones) are emerging as a promising technology for both environmental and infrastructure monitoring, with broad use in a plethora of applications. Many such applications require the use of computer vision algorithms in order to analyse the information captured from an on-board camera. Such applications include detecting vehicles for emergency response and traffic monitori… ▽ More Unmanned Aerial Vehicles (drones) are emerging as a promising technology for both environmental and infrastructure monitoring, with broad use in a plethora of applications. Many such applications require the use of computer vision algorithms in order to analyse the information captured from an on-board camera. Such applications include detecting vehicles for emergency response and traffic monitoring. This paper therefore, explores the trade-offs involved in the development of a single-shot object detector based on deep convolutional neural networks (CNNs) that can enable UAVs to perform vehicle detection under a resource constrained environment such as in a UAV. The paper presents a holistic approach for designing such systems; the data collection and training stages, the CNN architecture, and the optimizations necessary to efficiently map such a CNN on a lightweight embedded processing platform suitable for deployment on UAVs. Through the analysis we propose a CNN architecture that is capable of detecting vehicles from aerial UAV images and can operate between 5-18 frames-per-second for a variety of platforms with an overall accuracy of ~95%. Overall, the proposed architecture is suitable for UAV applications, utilizing low-power embedded processors that can be deployed on commercial UAVs. △ Less

Submitted 18 July, 2018; originally announced July 2018.

Comments: C. Kyrkou, G. Plastiras, T. Theocharides, S. I. Venieris and C. S. Bouganis, "DroNet: Efficient convolutional neural network detector for real-time UAV applications," 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), Dresden, 2018, pp. 967-972. Keywords: Convolutional neural networks, Machine learning, autonomous aerial vehicles, computer vision, embedded systems

Journal ref: 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE)

arXiv:1807.05053 [pdf, other]

CascadeCNN: Pushing the Performance Limits of Quantisation in Convolutional Neural Networks

Authors: Alexandros Kouris, Stylianos I. Venieris, Christos-Savvas Bouganis

Abstract: This work presents CascadeCNN, an automated toolflow that pushes the quantisation limits of any given CNN model, aiming to perform high-throughput inference. A two-stage architecture tailored for any given CNN-FPGA pair is generated, consisting of a low- and high-precision unit in a cascade. A confidence evaluation unit is employed to identify misclassified cases from the excessively low-precision… ▽ More This work presents CascadeCNN, an automated toolflow that pushes the quantisation limits of any given CNN model, aiming to perform high-throughput inference. A two-stage architecture tailored for any given CNN-FPGA pair is generated, consisting of a low- and high-precision unit in a cascade. A confidence evaluation unit is employed to identify misclassified cases from the excessively low-precision unit and forward them to the high-precision unit for re-processing. Experiments demonstrate that the proposed toolflow can achieve a performance boost up to 55% for VGG-16 and 48% for AlexNet over the baseline design for the same resource budget and accuracy, without the need of retraining the model or accessing the training data. △ Less

Submitted 13 July, 2018; originally announced July 2018.

Comments: Accepted for publication at the 28th International Conference on Field Programmable Logic & Applications (FPL), 2018

arXiv:1806.08616 [pdf, other]

Deploying Deep Neural Networks in the Embedded Space

Authors: Stylianos I. Venieris, Alexandros Kouris, Christos-Savvas Bouganis

Abstract: Recently, Deep Neural Networks (DNNs) have emerged as the dominant model across various AI applications. In the era of IoT and mobile systems, the efficient deployment of DNNs on embedded platforms is vital to enable the development of intelligent applications. This paper summarises our recent work on the optimised mapping of DNNs on embedded settings. By covering such diverse topics as DNN-to-acc… ▽ More Recently, Deep Neural Networks (DNNs) have emerged as the dominant model across various AI applications. In the era of IoT and mobile systems, the efficient deployment of DNNs on embedded platforms is vital to enable the development of intelligent applications. This paper summarises our recent work on the optimised mapping of DNNs on embedded settings. By covering such diverse topics as DNN-to-accelerator toolflows, high-throughput cascaded classifiers and domain-specific model design, the presented set of works aim to enable the deployment of sophisticated deep learning models on cutting-edge mobile and embedded systems. △ Less

Submitted 22 June, 2018; originally announced June 2018.

Comments: Accepted at MobiSys18: 2nd International Workshop on Embedded and Mobile Deep Learning (EMDL) 2018

arXiv:1805.10174 [pdf, other]

doi 10.1109/FPL.2018.00072

f-CNN$^{\text{x}}$: A Toolflow for Mapping Multi-CNN Applications on FPGAs

Authors: Stylianos I. Venieris, Christos-Savvas Bouganis

Abstract: The predictive power of Convolutional Neural Networks (CNNs) has been an integral factor for emerging latency-sensitive applications, such as autonomous drones and vehicles. Such systems employ multiple CNNs, each one trained for a particular task. The efficient mapping of multiple CNNs on a single FPGA device is a challenging task as the allocation of compute resources and external memory bandwid… ▽ More The predictive power of Convolutional Neural Networks (CNNs) has been an integral factor for emerging latency-sensitive applications, such as autonomous drones and vehicles. Such systems employ multiple CNNs, each one trained for a particular task. The efficient mapping of multiple CNNs on a single FPGA device is a challenging task as the allocation of compute resources and external memory bandwidth needs to be optimised at design time. This paper proposes f-CNN$^{\text{x}}$, an automated toolflow for the optimised mapping of multiple CNNs on FPGAs, comprising a novel multi-CNN hardware architecture together with an automated design space exploration method that considers the user-specified performance requirements for each model to allocate compute resources and generate a synthesisable accelerator. Moreover, f-CNN$^{\text{x}}$ employs a novel scheduling algorithm that alleviates the limitations of the memory bandwidth contention between CNNs and sustains the high utilisation of the architecture. Experimental evaluation shows that f-CNN$^{\text{x}}$'s designs outperform contention-unaware FPGA mappings by up to 50% and deliver up to 6.8x higher performance-per-Watt over highly optimised GPU designs for multi-CNN systems. △ Less

Submitted 7 June, 2021; v1 submitted 25 May, 2018; originally announced May 2018.

Comments: Accepted at the 28th International Conference on Field Programmable Logic & Applications (FPL) 2018

arXiv:1805.08743 [pdf, other]

CascadeCNN: Pushing the performance limits of quantisation

Authors: Alexandros Kouris, Stylianos I. Venieris, Christos-Savvas Bouganis

Abstract: This work presents CascadeCNN, an automated toolflow that pushes the quantisation limits of any given CNN model, to perform high-throughput inference by exploiting the computation time-accuracy trade-off. Without the need for retraining, a two-stage architecture tailored for any given FPGA device is generated, consisting of a low- and a high-precision unit. A confidence evaluation unit is employed… ▽ More This work presents CascadeCNN, an automated toolflow that pushes the quantisation limits of any given CNN model, to perform high-throughput inference by exploiting the computation time-accuracy trade-off. Without the need for retraining, a two-stage architecture tailored for any given FPGA device is generated, consisting of a low- and a high-precision unit. A confidence evaluation unit is employed between them to identify misclassified cases at run time and forward them to the high-precision unit or terminate computation. Experiments demonstrate that CascadeCNN achieves a performance boost of up to 55% for VGG-16 and 48% for AlexNet over the baseline design for the same resource budget and accuracy. △ Less

Submitted 22 May, 2018; originally announced May 2018.

Comments: Accepted at SysML Conference 2018

arXiv:1803.05900 [pdf, other]

Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

Authors: Stylianos I. Venieris, Alexandros Kouris, Christos-Savvas Bouganis

Abstract: In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative p… ▽ More In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated in the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows. △ Less

Submitted 15 March, 2018; originally announced March 2018.

Comments: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 2018

arXiv:1801.02190 [pdf, other]

Approximate FPGA-based LSTMs under Computation Time Constraints

Authors: Michalis Rizakis, Stylianos I. Venieris, Alexandros Kouris, Christos-Savvas Bouganis

Abstract: Recurrent Neural Networks and in particular Long Short-Term Memory (LSTM) networks have demonstrated state-of-the-art accuracy in several emerging Artificial Intelligence tasks. However, the models are becoming increasingly demanding in terms of computational and memory load. Emerging latency-sensitive applications including mobile robots and autonomous vehicles often operate under stringent compu… ▽ More Recurrent Neural Networks and in particular Long Short-Term Memory (LSTM) networks have demonstrated state-of-the-art accuracy in several emerging Artificial Intelligence tasks. However, the models are becoming increasingly demanding in terms of computational and memory load. Emerging latency-sensitive applications including mobile robots and autonomous vehicles often operate under stringent computation time constraints. In this paper, we address the challenge of deploying computationally demanding LSTMs at a constrained time budget by introducing an approximate computing scheme that combines iterative low-rank compression and pruning, along with a novel FPGA-based LSTM architecture. Combined in an end-to-end framework, the approximation method's parameters are optimised and the architecture is configured to address the problem of high-performance LSTM execution in time-constrained applications. Quantitative evaluation on a real-life image captioning application indicates that the proposed methods required up to 6.5x less time to achieve the same application-level accuracy compared to a baseline method, while achieving an average of 25x higher accuracy under the same computation time constraints. △ Less

Submitted 7 January, 2018; originally announced January 2018.

Comments: Accepted at the 14th International Symposium in Applied Reconfigurable Computing (ARC) 2018

arXiv:1711.08740 [pdf, other]

fpgaConvNet: A Toolflow for Mapping Diverse Convolutional Neural Networks on Embedded FPGAs

Authors: Stylianos I. Venieris, Christos-Savvas Bouganis

Abstract: In recent years, Convolutional Neural Networks (ConvNets) have become an enabling technology for a wide range of novel embedded Artificial Intelligence systems. Across the range of applications, the performance needs vary significantly, from high-throughput video surveillance to the very low-latency requirements of autonomous cars. In this context, FPGAs can provide a potential platform that can b… ▽ More In recent years, Convolutional Neural Networks (ConvNets) have become an enabling technology for a wide range of novel embedded Artificial Intelligence systems. Across the range of applications, the performance needs vary significantly, from high-throughput video surveillance to the very low-latency requirements of autonomous cars. In this context, FPGAs can provide a potential platform that can be optimally configured based on the different performance needs. However, the complexity of ConvNet models keeps increasing making their mapping to an FPGA device a challenging task. This work presents fpgaConvNet, an end-to-end framework for mapping ConvNets on FPGAs. The proposed framework employs an automated design methodology based on the Synchronous Dataflow (SDF) paradigm and defines a set of SDF transformations in order to efficiently explore the architectural design space. By selectively optimising for throughput, latency or multiobjective criteria, the presented tool is able to efficiently explore the design space and generate hardware designs from high-level ConvNet specifications, explicitly optimised for the performance metric of interest. Overall, our framework yields designs that improve the performance by up to 6.65x over highly optimised embedded GPU designs for the same power constraints in embedded environments. △ Less

Submitted 23 November, 2017; originally announced November 2017.

Comments: Accepted at NIPS 2017 Workshop on Machine Learning on the Phone and other Consumer Devices

Showing 1–37 of 37 results for author: Bouganis, C