
doi 10.1109/LASCAS53948.2022.9789055

MAx-DNN: Multi-Level Arithmetic Approximation for Energy-Efficient DNN Hardware Accelerators

Authors: Vasileios Leon, Georgios Makris, Sotirios Xydis, Kiamal Pekmestzi, Dimitrios Soudris

Abstract: Nowadays, the rapid growth of Deep Neural Network (DNN) architectures has established them as the de facto approach for providing advanced Machine Learning tasks with excellent accuracy. Targeting low-power DNN computing, this paper examines the interplay of fine-grained error resilience of DNN workloads in collaboration with hardware approximation techniques, to achieve higher levels of energy efficiency. Utilizing the state-of-the-art ROUP approximate multipliers, we systematically explore their fine-grained distribution across the network according to our layer-, filter-, and kernel-level approaches, and examine their impact on accuracy and energy. We use the ResNet-8 model on the CIFAR-10 dataset to evaluate our approximations. The proposed solution delivers up to 54% energy gains in exchange for up to 4% accuracy loss, compared to the baseline quantized model, while it provides 2x energy gains with better accuracy versus the state-of-the-art DNN approximations.

Submitted 26 June, 2025; originally announced June 2025.

Comments: Presented at the 13th IEEE LASCAS Conference

Journal ref: 13th IEEE Latin America Symposium on Circuits and Systems (LASCAS), 2022

arXiv:2506.21073 [pdf, ps, other]

Post-Quantum and Blockchain-Based Attestation for Trusted FPGAs in B5G Networks

Authors: Ilias Papalamprou, Nikolaos Fotos, Nikolaos Chatzivasileiadis, Anna Angelogianni, Dimosthenis Masouros, Dimitrios Soudris

Abstract: The advent of 5G and beyond has brought increased performance networks, facilitating the deployment of services closer to the user. To meet performance requirements, such services require specialized hardware, such as Field Programmable Gate Arrays (FPGAs). However, FPGAs are often deployed in unprotected environments, leaving the user's applications vulnerable to multiple attacks. With the rise of quantum computing, which threatens the integrity of widely-used cryptographic algorithms, the need for a robust security infrastructure is even more crucial. In this paper we introduce a hybrid hardware-software solution utilizing remote attestation to securely configure FPGAs, while integrating Post-Quantum Cryptographic (PQC) algorithms for enhanced security. Additionally, to enable trustworthiness across the whole edge computing continuum, our solution integrates a blockchain infrastructure, ensuring the secure storage of any security evidence. We evaluate the proposed secure configuration process under different PQC algorithms in two FPGA families, showcasing only 2% overhead compared to the non-PQC approach.

Submitted 26 June, 2025; originally announced June 2025.

arXiv:2506.12971 [pdf, ps, other]

doi 10.1109/VLSI-SoC54400.2022.9939621

Combining Fault Tolerance Techniques and COTS SoC Accelerators for Payload Processing in Space

Authors: Vasileios Leon, Elissaios Alexios Papatheofanous, George Lentaris, Charalampos Bezaitis, Nikolaos Mastorakis, Georgios Bampilis, Dionysios Reisis, Dimitrios Soudris

Abstract: The ever-increasing demand for computational power and I/O throughput in space applications is transforming the landscape of on-board computing. A variety of Commercial-Off-The-Shelf (COTS) accelerators emerges as an attractive solution for payload processing to outperform the traditional radiation-hardened devices. Towards increasing the reliability of such COTS accelerators, the current paper explores and evaluates fault-tolerance techniques for the Zynq FPGA and the Myriad VPU, which are two device families being integrated in industrial space avionics architectures/boards, such as Ubotica's CogniSat, Xiphos' Q7S, and Cobham Gaisler's GR-VPX-XCKU060. On the FPGA side, we combine techniques such as memory scrubbing, partial reconfiguration, triple modular redundancy, and watchdogs. On the VPU side, we detect and correct errors in the instruction and data memories, and we apply redundancy at the processor level (SHAVE cores). When considering FPGA with VPU co-processing, we also develop a fault-tolerant interface between the two devices based on the CIF/LCD protocols and our custom CRC error-detecting code.

Submitted 15 June, 2025; originally announced June 2025.

Comments: Presented at the 30th IFIP/IEEE VLSI-SoC Conference

Journal ref: 30th IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC), 2022

arXiv:2506.12970 [pdf, ps, other]

doi 10.1109/VLSI-SoC54400.2022.9939566

Towards Employing FPGA and ASIP Acceleration to Enable Onboard AI/ML in Space Applications

Authors: Vasileios Leon, George Lentaris, Dimitrios Soudris, Simon Vellas, Mathieu Bernou

Abstract: The success of AI/ML in terrestrial applications and the commercialization of space are now paving the way for the advent of AI/ML in satellites. However, the limited processing power of classical onboard processors drives the community towards extending the use of FPGAs in space with both rad-hard and Commercial-Off-The-Shelf devices. The increased performance of FPGAs can be complemented with VPU or TPU ASIP co-processors to further facilitate high-level AI development and in-flight reconfiguration. Thus, selecting the most suitable devices and designing the most efficient avionics architecture becomes crucial for the success of novel space missions. The current work presents industrial trends, comparative studies with in-house benchmarking, as well as architectural designs utilizing FPGAs and AI accelerators towards enabling AI/ML in future space missions.

Submitted 15 June, 2025; originally announced June 2025.

Comments: Presented at the 30th IFIP/IEEE VLSI-SoC Conference

Journal ref: 30th IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC), 2022

arXiv:2506.12968 [pdf, ps, other]

doi 10.1109/ICECS53924.2021.9665462

FPGA & VPU Co-Processing in Space Applications: Development and Testing with DSP/AI Benchmarks

Authors: Vasileios Leon, Charalampos Bezaitis, George Lentaris, Dimitrios Soudris, Dionysios Reisis, Elissaios-Alexios Papatheofanous, Angelos Kyriakos, Aubrey Dunne, Arne Samuelsson, David Steenari

Abstract: The advent of computationally demanding algorithms and high data rate instruments in new space applications pushes the space industry to explore disruptive solutions for on-board data processing. We examine heterogeneous computing architectures involving high-performance and low-power commercial SoCs. The current paper implements an FPGA with VPU co-processing architecture utilizing the CIF & LCD interfaces for I/O data transfers. A Kintex FPGA serves as our framing processor and heritage accelerator, while we offload novel DSP/AI functions to a Myriad2 VPU. We prototype our architecture in the lab to evaluate the interfaces, the FPGA resource utilization, the VPU computational throughput, as well as the entire data handling system's performance, via custom benchmarking.

Submitted 15 June, 2025; originally announced June 2025.

Comments: Presented at the 28th IEEE ICECS Conference

Journal ref: 28th IEEE International Conference on Electronics, Circuits and Systems (ICECS), 2021

arXiv:2505.23553 [pdf, ps, other]

A Unified Framework for Mapping and Synthesis of Approximate R-Blocks CGRAs

Authors: Georgios Alexandris, Panagiotis Chaidos, Alexis Maras, Barry de Bruin, Manil Dev Gomony, Henk Corporaal, Dimitrios Soudris, Sotirios Xydis

Abstract: The ever-increasing complexity and operational diversity of modern Neural Networks (NNs) have caused the need for low-power and, at the same time, high-performance edge devices for AI applications. Coarse Grained Reconfigurable Architectures (CGRAs) form a promising design paradigm to address these challenges, delivering a close-to-ASIC performance while allowing for hardware programmability. In this paper, we introduce a novel end-to-end exploration and synthesis framework for approximate CGRA processors that enables transparent and optimized integration and mapping of state-of-the-art approximate multiplication components into CGRAs. Our methodology introduces a per-channel exploration strategy that maps specific output features onto approximate components based on accuracy degradation constraints. This enables the optimization of the system's energy consumption while retaining the accuracy above a certain threshold. At the circuit level, the integration of approximate components enables the creation of voltage islands that operate at reduced voltage levels, which is attributed to their inherently shorter critical paths. This key enabler allows us to effectively reduce the overall power consumption by an average of 30% across our analyzed architectures, compared to their baseline counterparts, while incurring only a minimal 2% area overhead. The proposed methodology was evaluated on a widely used NN model, MobileNetV2, on the ImageNet dataset, demonstrating that the generated architectures can deliver up to 440 GOPS/W with relatively small output error during inference, outperforming several State-of-the-Art CGRA architectures in terms of throughput and energy efficiency.

Submitted 29 May, 2025; originally announced May 2025.

arXiv:2504.04874 [pdf, other]

Futureproof Static Memory Planning

Authors: Christos Lamprakos, Panagiotis Xanthopoulos, Manolis Katsaragakis, Sotirios Xydis, Dimitrios Soudris, Francky Catthoor

Abstract: The NP-complete combinatorial optimization task of assigning offsets to a set of buffers with known sizes and lifetimes so as to minimize total memory usage is called dynamic storage allocation (DSA). Existing DSA implementations bypass the theoretical state-of-the-art algorithms in favor of either fast but wasteful heuristics, or memory-efficient approaches that do not scale beyond one thousand buffers. The "AI memory wall", combined with deep neural networks' static architecture, has reignited interest in DSA. We present idealloc, a low-fragmentation, high-performance DSA implementation designed for million-buffer instances. Evaluated on a novel suite of particularly hard benchmarks from several domains, idealloc ranks first against four production implementations in terms of a joint effectiveness/robustness criterion.

Submitted 7 April, 2025; originally announced April 2025.

Comments: Submitted to ACM TOPLAS

arXiv:2503.21671 [pdf, other]

A Bespoke Design Approach to Low-Power Printed Microprocessors for Machine Learning Applications

Authors: Panagiotis Chaidos, Giorgos Armeniakos, Sotirios Xydis, Dimitrios Soudris

Abstract: Printed electronics have gained significant traction in recent years, presenting a viable path to integrating computing into everyday items, from disposable products to low-cost healthcare. However, the adoption of computing in these domains is hindered by strict area and power constraints, limiting the effectiveness of general-purpose microprocessors. This paper proposes a bespoke microprocessor design approach to address these challenges, by tailoring the design to specific applications and eliminating unnecessary logic. Targeting machine learning applications, we further optimize core operations by integrating a SIMD MAC unit supporting 4 precision configurations that boost the efficiency of microprocessors. Our evaluation across 6 ML models and the large-scale Zero-Riscy core shows that our methodology can achieve improvements of 22.2%, 23.6%, and 33.79% in area, power, and speed, respectively, without compromising accuracy. Against state-of-the-art printed processors, our approach can still offer significant speedups, but along with some accuracy degradation. This work explores how such trade-offs can enable low-power printed microprocessors for diverse ML applications.

Submitted 27 March, 2025; originally announced March 2025.

Comments: Accepted for publication at the IEEE International Symposium on Circuits and Systems (ISCAS '25), May 25-28, London, United Kingdom

arXiv:2409.16815 [pdf, other]

Accelerating TinyML Inference on Microcontrollers through Approximate Kernels

Authors: Giorgos Armeniakos, Georgios Mentzos, Dimitrios Soudris

Abstract: The rapid growth of microcontroller-based IoT devices has opened up numerous applications, from smart manufacturing to personalized healthcare. Despite the widespread adoption of energy-efficient microcontroller units (MCUs) in the Tiny Machine Learning (TinyML) domain, they still face significant limitations in terms of performance and memory (RAM, Flash). In this work, we combine approximate computing and software kernel design to accelerate the inference of approximate CNN models on MCUs. Our kernel-based approximation framework firstly unpacks the operands of each convolution layer and then conducts an offline calculation to determine the significance of each operand. Subsequently, through a design space exploration, it employs a computation skipping approximation strategy based on the calculated significance. Our evaluation on an STM32-Nucleo board and 2 popular CNNs trained on the CIFAR-10 dataset shows that, compared to state-of-the-art exact inference, our Pareto optimal solutions can feature on average 21% latency reduction with no degradation in Top-1 classification accuracy, while for lower accuracy requirements, the corresponding reduction becomes even more pronounced.

Submitted 25 September, 2024; originally announced September 2024.

arXiv:2409.12939 [pdf, other]

doi 10.1016/j.micpro.2023.104947

Accelerating AI and Computer Vision for Satellite Pose Estimation on the Intel Myriad X Embedded SoC

Authors: Vasileios Leon, Panagiotis Minaidis, George Lentaris, Dimitrios Soudris

Abstract: The challenging deployment of Artificial Intelligence (AI) and Computer Vision (CV) algorithms at the edge pushes the community of embedded computing to examine heterogeneous System-on-Chips (SoCs). Such novel computing platforms provide increased diversity in interfaces, processors and storage; however, the efficient partitioning and mapping of AI/CV workloads remains an open issue. In this context, the current paper develops a hybrid AI/CV system on Intel's Movidius Myriad X, which is a heterogeneous Vision Processing Unit (VPU), for initializing and tracking the satellite's pose in space missions. The space industry is among the communities examining alternative computing platforms to comply with the tight constraints of on-board data processing, while it is also striving to adopt functionalities from the AI domain. At algorithmic level, we rely on the ResNet-50-based UrsoNet network along with a custom classical CV pipeline. For efficient acceleration, we exploit the SoC's neural compute engine and 16 vector processors by combining multiple parallelization and low-level optimization techniques. The proposed single-chip, robust-estimation, and real-time solution delivers a throughput of up to 5 FPS for 1-MegaPixel RGB images within a limited power envelope of 2W.

Submitted 19 September, 2024; originally announced September 2024.

Comments: Accepted for publication at Elsevier Microprocessors and Microsystems

Journal ref: Elsevier Microprocessors and Microsystems, Vol. 103, Nov. 2023

arXiv:2409.12258 [pdf, other]

doi 10.1109/ICECS61496.2024.10848988

MPAI: A Co-Processing Architecture with MPSoC & AI Accelerators for Vision Applications in Space

Authors: Vasileios Leon, Panagiotis Minaidis, Dimitrios Soudris, George Lentaris

Abstract: The emerging need for fast and power-efficient AI/ML deployment on-board spacecraft has forced the space industry to examine specialized accelerators, which have been successfully used in terrestrial applications. Towards this direction, the current work introduces a very heterogeneous co-processing architecture that is built around UltraScale+ MPSoC and its programmable DPU, as well as commercial AI/ML accelerators such as MyriadX VPU and Edge TPU. The proposed architecture, called MPAI, handles networks of different size/complexity and accommodates speed-accuracy-energy trade-offs by exploiting the diversity of accelerators in precision and computational power. This brief provides technical background and reports preliminary experimental results and outcomes.

Submitted 18 September, 2024; originally announced September 2024.

Comments: Accepted for publication at the 31st IEEE ICECS Conference, 18-20 Nov, 2024, Nancy, France

Journal ref: 31st IEEE International Conference on Electronics, Circuits and Systems (ICECS), 2024

arXiv:2409.12253 [pdf, other]

doi 10.1109/ICECS61496.2024.10849017

Development of High-Performance DSP Algorithms on the European Rad-Hard NG-ULTRA SoC FPGA

Authors: Vasileios Leon, Anastasios Xynos, Dimitrios Soudris, George Lentaris, Ruben Domingo, Arturo Perez, David Gonzalez-Arjona, Isabelle Conway, David Merodio Codinachs

Abstract: The emergence of demanding space applications has modified the traditional landscape of computing systems in space. When reliability is a first-class concern, in addition to enhanced performance-per-Watt, radiation-hardened FPGAs are favored. In this context, the current paper evaluates the first European radiation-hardened SoC FPGA, i.e., NanoXplore's NG-ULTRA, for accelerating high-performance DSP algorithms from space applications. The proposed development & testing methodologies provide efficient implementations, while they also aim to test the new NG-ULTRA hardware and its associated software tools. The results show that NG-ULTRA achieves competitive resource utilization and performance, making it a very promising device for space missions, especially for Europe.

Submitted 18 September, 2024; originally announced September 2024.

Comments: Accepted for publication at the 31st IEEE ICECS Conference, 18-20 Nov, 2024, Nancy, France

Journal ref: 31st IEEE International Conference on Electronics, Circuits and Systems (ICECS), 2024

arXiv:2408.05235 [pdf, other]

SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving

Authors: Andreas Kosmas Kakolyris, Dimosthenis Masouros, Petros Vavaroutsos, Sotirios Xydis, Dimitrios Soudris

Abstract: As Large Language Models (LLMs) gain traction, their reliance on power-hungry GPUs places ever-increasing energy demands, raising environmental and monetary concerns. Inference dominates LLM workloads, presenting a critical challenge for providers: minimizing energy costs under Service-Level Objectives (SLOs) that ensure optimal user experience. In this paper, we present throttLL'eM, a framework that reduces energy consumption while meeting SLOs through the use of instance and GPU frequency scaling. throttLL'eM features mechanisms that project future KV cache usage and batch size. Leveraging a Machine-Learning (ML) model that receives these projections as inputs, throttLL'eM manages performance at the iteration level to satisfy SLOs with reduced frequencies and instance sizes. We show that the proposed ML model achieves R^2 scores greater than 0.97 and mispredicts performance by less than 1 iteration per second on average. Experimental results on LLM inference traces show that throttLL'eM achieves up to 43.8% lower energy consumption and an energy efficiency improvement of at least 1.71x under SLOs, when compared to NVIDIA's Triton server.

Submitted 5 August, 2024; originally announced August 2024.

arXiv:2407.18386 [pdf, other]

Leveraging Core and Uncore Frequency Scaling for Power-Efficient Serverless Workflows

Authors: Achilleas Tzenetopoulos, Dimosthenis Masouros, Sotirios Xydis, Dimitrios Soudris

Abstract: Serverless workflows have emerged in Function-as-a-Service (FaaS) platforms to represent the operational structure of traditional applications. With latency propagation effects becoming increasingly prominent, step-wise resource tuning is required to address Service-Level-Objectives (SLOs). Modern processors' allowance for fine-grained Dynamic Voltage and Frequency Scaling (DVFS), coupled with serverless workflows' intermittent nature, presents a unique opportunity to reduce power while meeting SLOs. We introduce Ωkypous, an SLO-driven DVFS framework for serverless workflows. Ωkypous employs a grey-box model that predicts functions' execution latency and power under different Core and Uncore frequency combinations. Based on these predictions and the timing slacks between workflow functions, Ωkypous uses a closed-loop control mechanism to dynamically adjust Core and Uncore frequencies, thus minimizing power consumption without compromising predefined end-to-end latency constraints. Our evaluation on real-world traces from Azure, against state-of-the-art power management frameworks, demonstrates an average power consumption reduction of 16%, while consistently maintaining low SLO violation rates (1.8%), when operating under power caps.

Submitted 21 April, 2025; v1 submitted 25 July, 2024; originally announced July 2024.

arXiv:2407.14274 [pdf, other]

doi 10.1145/3676536.3676840

Mixed-precision Neural Networks on RISC-V Cores: ISA extensions for Multi-Pumped Soft SIMD Operations

Authors: Giorgos Armeniakos, Alexis Maras, Sotirios Xydis, Dimitrios Soudris

Abstract: Recent advancements in quantization and mixed-precision approaches offer substantial opportunities to improve the speed and energy efficiency of Neural Networks (NN). Research has shown that individual parameters with varying low precision can attain accuracies comparable to full-precision counterparts. However, modern embedded microprocessors provide very limited support for mixed-precision NNs regarding both Instruction Set Architecture (ISA) extensions and their hardware design for efficient execution of mixed-precision operations, i.e., introducing several performance bottlenecks due to numerous instructions for data packing and unpacking, arithmetic unit under-utilizations etc. In this work, we bring together, for the first time, ISA extensions tailored to mixed-precision hardware optimizations, targeting energy-efficient DNN inference on leading RISC-V CPU architectures. To this end, we introduce a hardware-software co-design framework that enables cooperative hardware design, mixed-precision quantization, ISA extensions and inference in cycle-accurate emulations. At hardware level, we first expand the ALU within our proof-of-concept micro-architecture to support configurable fine-grained mixed-precision arithmetic operations. Subsequently, we implement multi-pumping to minimize execution latency, with an additional soft SIMD optimization applied for 2-bit operations. At the ISA level, three distinct MAC instructions are encoded extending the RISC-V ISA, and exposed up to the compiler level, each corresponding to a different mixed-precision operational mode. Our extensive experimental evaluation over widely used DNNs and datasets, such as CIFAR10 and ImageNet, demonstrates that our framework can achieve, on average, 15x energy reduction for less than 1% accuracy loss and outperforms the ISA-agnostic state-of-the-art RISC-V cores.

Submitted 13 August, 2024; v1 submitted 19 July, 2024; originally announced July 2024.

Comments: Accepted for publication at the 43rd International Conference on Computer-Aided Design (ICCAD '24), Oct 27-31 2024, New Jersey, USA

arXiv:2407.03711 [pdf, other]

Decoupled Access-Execute enabled DVFS for tinyML deployments on STM32 microcontrollers

Authors: Elisavet Lydia Alvanaki, Manolis Katsaragakis, Dimosthenis Masouros, Sotirios Xydis, Dimitrios Soudris

Abstract: Over the last years, the number of Machine Learning (ML) inference applications deployed on the Edge has been rapidly increasing. Recent Internet of Things (IoT) devices and microcontrollers (MCUs) are becoming more and more mainstream in everyday activities. In this work we focus on the family of STM32 MCUs. We propose a novel methodology for CNN deployment on the STM32 family, focusing on power optimization through effective clocking exploration and configuration and decoupled access-execute convolution kernel execution. Our approach is enhanced with optimization of the power consumption through Dynamic Voltage and Frequency Scaling (DVFS) under various latency constraints, composing an NP-complete optimization problem. We compare our approach against the state-of-the-art TinyEngine inference engine, as well as TinyEngine coupled with power-saving modes of the STM32 MCUs, indicating that we can achieve up to 25.2% less energy consumption for varying QoS levels.

Submitted 4 July, 2024; originally announced July 2024.

Comments: 6 pages, 6 figures, 1 listing, presented in IEEE DATE 2024

Journal ref: 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE) (pp. 1-6). IEEE

arXiv:2405.16953 [pdf, other]

Evaluation of Resource-Efficient Crater Detectors on Embedded Systems

Authors: Simon Vellas, Bill Psomas, Kalliopi Karadima, Dimitrios Danopoulos, Alexandros Paterakis, George Lentaris, Dimitrios Soudris, Konstantinos Karantzalos

Abstract: Real-time analysis of Martian craters is crucial for mission-critical operations, including safe landings and geological exploration. This work leverages the latest breakthroughs for on-the-edge crater detection aboard spacecraft. We rigorously benchmark several YOLO networks using a Mars craters dataset, analyzing their performance on embedded systems with a focus on optimization for low-power devices. We optimize this process for a new wave of cost-effective, commercial-off-the-shelf-based smaller satellites. Implementations on diverse platforms, including Google Coral Edge TPU, AMD Versal SoC VCK190, Nvidia Jetson Nano and Jetson AGX Orin, undergo a detailed trade-off analysis. Our findings identify optimal network-device pairings, enhancing the feasibility of crater detection on resource-constrained hardware and setting a new precedent for efficient and resilient extraterrestrial imaging. Code at: https://github.com/billpsomas/mars_crater_detection.

Submitted 27 May, 2024; originally announced May 2024.

Comments: Accepted at 2024 IEEE International Geoscience and Remote Sensing Symposium

arXiv:2404.13715 [pdf, other]

doi 10.1109/EuCNC/6GSummit60053.2024.10597008

TF2AIF: Facilitating development and deployment of accelerated AI models on the cloud-edge continuum

Authors: Aimilios Leftheriotis, Achilleas Tzenetopoulos, George Lentaris, Dimitrios Soudris, Georgios Theodoridis

Abstract: The B5G/6G evolution relies on connect-compute technologies and highly heterogeneous clusters with HW accelerators, which require specialized coding to be efficiently utilized. The current paper proposes a custom tool for generating multiple SW versions of a certain AI function input in high-level language, e.g., Python TensorFlow, while targeting multiple diverse HW+SW platforms. TF2AIF builds upon disparate tool-flows to create a plethora of relative containers and enable the system orchestrator to deploy the requested function on any peculiar node in the cloud-edge continuum, i.e., to leverage the performance/energy benefits of the underlying HW upon any circumstances. TF2AIF fills an identified gap in today's ecosystem and facilitates research on resource management or automated operations, by demanding minimal time or expertise from users.

Submitted 21 April, 2024; originally announced April 2024.

Comments: to be published in EUCNC & 6G Summit 2024

arXiv:2402.07545 [pdf, other]

doi 10.1109/TCASAI.2025.3565685

TransAxx: Efficient Transformers with Approximate Computing

Authors: Dimitrios Danopoulos, Georgios Zervakis, Dimitrios Soudris, Jörg Henkel

Abstract: Vision Transformer (ViT) models, which were recently introduced by the transformer architecture, have been shown to be very competitive and have often become a popular alternative to Convolutional Neural Networks (CNNs). However, the high computational requirements of these models limit their practical applicability, especially on low-power devices. Current state-of-the-art employs approximate multipliers to address the highly increased compute demands of DNN accelerators, but no prior research has explored their use on ViT models. In this work we propose TransAxx, a framework based on the popular PyTorch library that enables fast inherent support for approximate arithmetic to seamlessly evaluate the impact of approximate computing on DNNs such as ViT models. Using TransAxx we analyze the sensitivity of transformer models on the ImageNet dataset to approximate multiplications and perform approximate-aware finetuning to regain accuracy. Furthermore, we propose a methodology to generate approximate accelerators for ViT models. Our approach uses a Monte Carlo Tree Search (MCTS) algorithm to efficiently search the space of possible configurations using a hardware-driven hand-crafted policy. Our evaluation demonstrates the efficacy of our methodology in achieving significant trade-offs between accuracy and power, resulting in substantial gains without compromising on performance.

Submitted 7 May, 2025; v1 submitted 12 February, 2024; originally announced February 2024.

arXiv:2312.01172 [pdf, other]

doi 10.23919/DATE58400.2024.10546585

On-sensor Printed Machine Learning Classification via Bespoke ADC and Decision Tree Co-Design

Authors: Giorgos Armeniakos, Paula L. Duarte, Priyanjana Pal, Georgios Zervakis, Mehdi B. Tahoori, Dimitrios Soudris

Abstract: Printed electronics (PE) technology provides cost-effective hardware with unmet customization, due to their low non-recurring engineering and fabrication costs. PE exhibit features such as flexibility, stretchability, porosity, and conformality, which make them a prominent candidate for enabling ubiquitous computing. Still, the large feature sizes in PE limit the realization of complex printed circuits, such as machine learning classifiers, especially when processing sensor inputs is necessary, mainly due to the costly analog-to-digital converters (ADCs). To this end, we propose the design of fully customized ADCs and present, for the first time, a co-design framework for generating bespoke Decision Tree classifiers. Our comprehensive evaluation shows that our co-design enables self-powered operation of on-sensor printed classifiers in all benchmark cases.

Submitted 2 December, 2023; originally announced December 2023.

Comments: Accepted for publication at the 27th Design, Automation and Test in Europe Conference (DATE'24), Mar 25-27 2024, Valencia, Spain

arXiv:2307.11128 [pdf, other]

doi 10.1145/3711683

Approximate Computing Survey, Part II: Application-Specific & Architectural Approximation Techniques and Applications

Authors: Vasileios Leon, Muhammad Abdullah Hanif, Giorgos Armeniakos, Xun Jiao, Muhammad Shafique, Kiamal Pekmestzi, Dimitrios Soudris

Abstract: The challenging deployment of compute-intensive applications from domains such as Artificial Intelligence (AI) and Digital Signal Processing (DSP) forces the community of computing systems to explore new design approaches. Approximate Computing appears as an emerging solution, allowing to tune the quality of results in the design of a system in order to improve the energy efficiency and/or performance. This radical paradigm shift has attracted interest from both academia and industry, resulting in significant research on approximation techniques and methodologies at different design layers (from system down to integrated circuits). Motivated by the wide appeal of Approximate Computing over the last 10 years, we conduct a two-part survey to cover key aspects (e.g., terminology and applications) and review the state-of-the-art approximation techniques from all layers of the traditional computing stack. Part II of the survey classifies and presents the technical details of application-specific and architectural approximation techniques, which both target the design of resource-efficient processors/accelerators and systems. Moreover, it reports a quantitative analysis of the techniques and a detailed analysis of the application spectrum of Approximate Computing, and finally, it discusses open challenges and future directions.

Submitted 19 March, 2025; v1 submitted 20 July, 2023; originally announced July 2023.

Comments: Published in ACM Computing Surveys (Volume 57, Issue 7, 2025)

Journal ref: ACM Computing Surveys, Volume 57, Issue 7, Article 177, 2025

arXiv:2307.11124 [pdf, other]

doi 10.1145/3716845

Approximate Computing Survey, Part I: Terminology and Software & Hardware Approximation Techniques

Authors: Vasileios Leon, Muhammad Abdullah Hanif, Giorgos Armeniakos, Xun Jiao, Muhammad Shafique, Kiamal Pekmestzi, Dimitrios Soudris

Abstract: The rapid growth of demanding applications in domains applying multimedia processing and machine learning has marked a new era for edge and cloud computing. These applications involve massive data and compute-intensive tasks, and thus, typical computing paradigms in embedded systems and data centers are stressed to meet the worldwide demand for high performance. Concurrently, over the last 15 years, the semiconductor industry has established power efficiency as a first-class design concern. As a result, the community of computing systems is forced to find alternative design approaches to facilitate high-performance and power-efficient computing. Among the examined solutions, Approximate Computing has attracted an ever-increasing interest, which has resulted in novel approximation techniques for all the layers of the traditional computing stack. More specifically, during the last decade, a plethora of approximation techniques in software (programs, frameworks, compilers, runtimes, languages), hardware (circuits, accelerators), and architectures (processors, memories) have been proposed in the literature. The current article is Part I of a comprehensive survey on Approximate Computing. It reviews its motivation, terminology and principles, and it classifies the state-of-the-art software & hardware approximation techniques, presents their technical details, and reports a comparative quantitative analysis.

Submitted 19 March, 2025; v1 submitted 20 July, 2023; originally announced July 2023.

Comments: Published in ACM Computing Surveys (Volume 57, Issue 7, 2025)

Journal ref: ACM Computing Surveys, Volume 57, Issue 7, Article 185, 2025

arXiv:2305.01497 [pdf, other]

doi 10.1145/3591195.3595279

The Unexpected Efficiency of Bin Packing Algorithms for Dynamic Storage Allocation in the Wild: An Intellectual Abstract

Authors: Christos P. Lamprakos, Sotirios Xydis, Francky Catthoor, Dimitrios Soudris

Abstract: Recent work has shown that viewing allocators as black-box 2DBP solvers bears meaning. For instance, there exists a 2DBP-based fragmentation metric which often correlates monotonically with maximum resident set size (RSS). Given the field's indeterminacy with respect to fragmentation definitions, as well as the immense value of physical memory savings, we are motivated to set allocator-generated placements against their 2DBP-devised, makespan-optimizing counterparts. Of course, allocators must operate online while 2DBP algorithms work on complete request traces; but since both sides optimize criteria related to minimizing memory wastage, the idea of studying their relationship preserves its intellectual--and practical--interest. Unfortunately no implementations of 2DBP algorithms for DSA are available. This paper presents a first, though partial, implementation of the state-of-the-art. We validate its functionality by comparing its outputs' makespan to the theoretical upper bound provided by the original authors. Along the way, we identify and document key details to assist analogous future efforts. Our experiments comprise 4 modern allocators and 8 real application workloads. We make several notable observations on our empirical evidence: in terms of makespan, allocators outperform Robson's worst-case lower bound 93.75% of the time. In 87.5% of cases, GNU's malloc implementation demonstrates equivalent or superior performance to the 2DBP state-of-the-art, despite the second operating offline. Most surprisingly, the 2DBP algorithm proves competent in terms of fragmentation, producing up to 2.46x better solutions. Future research can leverage such insights towards memory-targeting optimizations.

Submitted 2 May, 2023; originally announced May 2023.

Comments: 13 pages, 10 figures, 3 tables. To appear in ISMM '23

arXiv:2304.10862 [pdf, other]

Viewing Allocators as Bin Packing Solvers Demystifies Fragmentation

Authors: Christos P. Lamprakos, Sotirios Xydis, Francky Catthoor, Dimitrios Soudris

Abstract: This paper presents a trace-based simulation methodology for constructing representations of workload-allocator interaction. We use two-dimensional rectangular bin packing (2DBP) as our foundation. Classical 2DBP algorithms minimize their products' makespan, but virtual memory systems employing demand paging deem such a criterion inappropriate. We view an allocator's placement decisions as a solution to a 2DBP instance, optimizing some unknown criterion particular to that allocator's policy. Our end product is a compact data structure that fits e.g. the simulation of 80 million requests in a 350 MiB file. By design, it is concerned with events residing entirely in virtual memory; no information on memory accesses, indexing costs or any other factor is kept. We bootstrap our contribution's significance by exploring its relationship to maximum resident set size (RSS). Our baseline is the assumption that less fragmentation amounts to smaller peak RSS. We thus define a fragmentation metric in the 2DBP substrate and compute it for 28 workloads linked to 4 modern allocators. We also measure peak RSS for the 112 resulting pairs. Our metric exhibits a strong monotonic relationship (Spearman coefficient ρ > 0.65) in half of those cases: allocators achieving better 2DBP placements yield 9%-30% smaller peak RSS, with the trends remaining consistent across two different machines. Considering our representation's minimalism, the presented empirical evidence is a robust indicator of its potency. If workload-allocator interplay in the virtual address space suffices to evaluate a novel fragmentation definition, numerous other useful applications of our tool can be studied. Both augmenting 2DBP and exploring alternative computations on it provide ample fertile ground for future research.

Submitted 24 April, 2023; v1 submitted 21 April, 2023; originally announced April 2023.

Comments: 13 pages, 10 figures, 5 tables Edit: removed "regular submission" subtitle, cleaned page headers

arXiv:2304.00953 [pdf, other]

Energy Consumption Evaluation of Optane DC Persistent Memory for Indexing Data Structures

Authors: Manolis Katsaragakis, Christos Baloukas, Lazaros Papadopoulos, Verena Kantere, Francky Catthoor, Dimitrios Soudris

Abstract: The Intel Optane DC Persistent Memory (DCPM) is an attractive novel technology for building storage systems for data intensive HPC applications, as it provides lower cost per byte, low standby power and larger capacities than DRAM, with comparable latency. This work provides an in-depth evaluation of the energy consumption of the Optane DCPM, using well-established indexes specifically designed to address the challenges and constraints of the persistent memories. We study the energy efficiency of the Optane DCPM for several indexing data structures and for the LevelDB key-value store, under different types of YCSB workloads. By integrating an Optane DCPM in a memory system, the energy drops by 71.2% and the throughput increases by 37.3% for the LevelDB experiments, compared to a typical SSD storage solution.

Submitted 3 April, 2023; originally announced April 2023.

Comments: 10 pages. Accepted and presented at the IEEE International Conference on High Performance Computing (HiPC) 2022, Bengaluru, India

arXiv:2303.08255 [pdf, other]

doi 10.1109/TCAD.2023.3258668

Model-to-Circuit Cross-Approximation For Printed Machine Learning Classifiers

Authors: Giorgos Armeniakos, Georgios Zervakis, Dimitrios Soudris, Mehdi B. Tahoori, Jörg Henkel

Abstract: Printed electronics (PE) promises on-demand fabrication, low non-recurring engineering costs, and sub-cent fabrication costs. It also allows for high customization that would be infeasible in silicon, and bespoke architectures prevail to improve the efficiency of emerging PE machine learning (ML) applications. Nevertheless, large feature sizes in PE prohibit the realization of complex ML models in PE, even with bespoke architectures. In this work, we present an automated, cross-layer approximation framework tailored to bespoke architectures that enable complex ML models, such as Multi-Layer Perceptrons (MLPs) and Support Vector Machines (SVMs), in PE. Our framework adopts cooperatively a hardware-driven coefficient approximation of the ML model at algorithmic level, a netlist pruning at logic level, and a voltage over-scaling at the circuit level. Extensive experimental evaluation on 12 MLPs and 12 SVMs and more than 6000 approximate and exact designs demonstrates that our model-to-circuit cross-approximation delivers power and area optimal designs that, compared to the state-of-the-art exact designs, feature on average 51% and 66% area and power reduction, respectively, for less than 5% accuracy loss. Finally, we demonstrate that our framework enables 80% of the examined classifiers to be battery-powered with almost identical accuracy with the exact designs, paving thus the way towards smart complex printed applications.

Submitted 14 March, 2023; originally announced March 2023.

Comments: Accepted for publication by IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, March 2023. arXiv admin note: text overlap with arXiv:2203.05915

arXiv:2302.14576 [pdf, other]

doi 10.1109/TC.2023.3251863

Co-Design of Approximate Multilayer Perceptron for Ultra-Resource Constrained Printed Circuits

Authors: Giorgos Armeniakos, Georgios Zervakis, Dimitrios Soudris, Mehdi B. Tahoori, Jörg Henkel

Abstract: Printed Electronics (PE) exhibits on-demand, extremely low-cost hardware due to its additive manufacturing process, enabling machine learning (ML) applications for domains that feature ultra-low cost, conformity, and non-toxicity requirements that silicon-based systems cannot deliver. Nevertheless, large feature sizes in PE prohibit the realization of complex printed ML circuits. In this work, we present, for the first time, an automated printed-aware software/hardware co-design framework that exploits approximate computing principles to enable ultra-resource constrained printed multilayer perceptrons (MLPs). Our evaluation demonstrates that, compared to the state-of-the-art baseline, our circuits feature on average 6x (5.7x) lower area (power) and less than 1% accuracy loss.

Submitted 28 February, 2023; originally announced February 2023.

Comments: Accepted for publication by IEEE Transactions on Computers, February 2023

arXiv:2212.00873 [pdf, other]

CONVOLVE: Smart and seamless design of smart edge processors

Authors: M. Gomony, F. Putter, A. Gebregiorgis, G. Paulin, L. Mei, V. Jain, S. Hamdioui, V. Sanchez, T. Grosser, M. Geilen, M. Verhelst, F. Zenke, F. Gurkaynak, B. Bruin, S. Stuijk, S. Davidson, S. De, M. Ghogho, A. Jimborean, S. Eissa, L. Benini, D. Soudris, R. Bishnoi, S. Ainsworth, F. Corradi, et al. (3 additional authors not shown)

Abstract: With the rise of Deep Learning (DL), our world braces for AI in every edge device, creating an urgent need for edge-AI SoCs. This SoC hardware needs to support high throughput, reliable and secure AI processing at Ultra Low Power (ULP), with a very short time to market. With its strong legacy in edge solutions and open processing platforms, the EU is well-positioned to become a leader in this SoC market. However, this requires AI edge processing to become at least 100 times more energy-efficient, while offering sufficient flexibility and scalability to deal with AI as a fast-moving target. Since the design space of these complex SoCs is huge, advanced tooling is needed to make their design tractable. The CONVOLVE project (currently in its initial stage) addresses these roadblocks. It takes a holistic approach with innovations at all levels of the design hierarchy. Starting with an overview of SOTA DL processing support and our project methodology, this paper presents 8 important design choices largely impacting the energy efficiency and flexibility of DL hardware. Finding good solutions is key to making smart-edge computing a reality.

Submitted 2 May, 2023; v1 submitted 1 December, 2022; originally announced December 2022.

arXiv:2208.14124 [pdf, other]

doi 10.1109/COINS54846.2022.9855002

Towards making the most of NLP-based device mapping optimization for OpenCL kernels

Authors: Petros Vavaroutsos, Ioannis Oroutzoglou, Dimosthenis Masouros, Dimitrios Soudris

Abstract: Nowadays, we are living in an era of extreme device heterogeneity. Despite the high variety of conventional CPU architectures, accelerator devices, such as GPUs and FPGAs, also appear in the foreground, exploding the pool of available solutions to execute applications. However, choosing the appropriate device per application needs is an extremely challenging task due to the abstract relationship between hardware and software. Automatic optimization algorithms that are accurate are required to cope with the complexity and variety of current hardware and software. Optimal execution has always relied on time-consuming trial and error approaches. Machine learning (ML) and Natural Language Processing (NLP) have flourished over the last decade with research focusing on deep architectures. In this context, the application of natural language processing techniques to source code in order to conduct autotuning tasks is an emerging field of study. In this paper, we extend the work of Cummins et al., namely Deeptune, that tackles the problem of optimal device selection (CPU or GPU) for accelerated OpenCL kernels. We identify three major limitations of Deeptune and, based on these, we propose four different DNN models that provide enhanced contextual information of source codes. Experimental results show that our proposed methodology surpasses that of Cummins et al., providing up to 4% improvement in prediction accuracy.

Submitted 30 August, 2022; originally announced August 2022.

Comments: Accepted at IEEE COINS 2022

Journal ref: 2022 IEEE International Conference on Omni-layer Intelligent Systems (COINS), 2022, pp. 1-6

arXiv:2203.08737 [pdf, other]

doi 10.1145/3527156

Hardware Approximate Techniques for Deep Neural Network Accelerators: A Survey

Authors: Giorgos Armeniakos, Georgios Zervakis, Dimitrios Soudris, Jörg Henkel

Abstract: Deep Neural Networks (DNNs) are very popular because of their high performance in various cognitive tasks in Machine Learning (ML). Recent advancements in DNNs have achieved beyond-human accuracy in many tasks, but at the cost of high computational complexity. To enable efficient execution of DNN inference, more and more research works, therefore, exploit the inherent error resilience of DNNs and employ Approximate Computing (AC) principles to address the elevated energy demands of DNN accelerators. This article provides a comprehensive survey and analysis of hardware approximation techniques for DNN accelerators. First, we analyze the state of the art and by identifying approximation families, we cluster the respective works with respect to the approximation type. Next, we analyze the complexity of the performed evaluations (with respect to the dataset and DNN size) to assess the efficiency, the potential, and limitations of approximate DNN accelerators. Moreover, a broad discussion is provided, regarding error metrics that are more suitable for designing approximate units for DNN accelerators as well as accuracy recovery approaches that are tailored to DNN inference. Finally, we present how Approximate Computing for DNN accelerators can go beyond energy efficiency and address reliability and security issues, as well.

Submitted 16 March, 2022; originally announced March 2022.

Comments: This paper has been accepted by ACM Computing Surveys (CSUR), 2022

Journal ref: ACM Computing Surveys 2022

arXiv:2203.05915 [pdf, other]

doi 10.23919/DATE54114.2022.9774689

Cross-Layer Approximation For Printed Machine Learning Circuits

Authors: Giorgos Armeniakos, Georgios Zervakis, Dimitrios Soudris, Mehdi B. Tahoori, Jörg Henkel

Abstract: Printed electronics (PE) feature low non-recurring engineering costs and low per unit-area fabrication costs, enabling thus extremely low-cost and on-demand hardware. Such low-cost fabrication allows for high customization that would be infeasible in silicon, and bespoke architectures prevail to improve the efficiency of emerging PE machine learning (ML) applications. However, even with bespoke architectures, the large feature sizes in PE constrain the complexity of the ML models that can be implemented. In this work, we bring together, for the first time, approximate computing and PE design targeting to enable complex ML models, such as Multi-Layer Perceptrons (MLPs) and Support Vector Machines (SVMs), in PE. To this end, we propose and implement a cross-layer approximation, tailored for bespoke ML architectures. At the algorithmic level we apply a hardware-driven coefficient approximation of the ML model and at the circuit level we apply a netlist pruning through a full search exploration. In our extensive experimental evaluation we consider 14 MLPs and SVMs and evaluate more than 4300 approximate and exact designs. Our results demonstrate that our cross approximation delivers Pareto optimal designs that, compared to the state-of-the-art exact designs, feature 47% and 44% average area and power reduction, respectively, and less than 1% accuracy loss.

Submitted 11 March, 2022; originally announced March 2022.

Comments: Accepted for publication at the 25th Design, Automation and Test in Europe Conference (DATE'22), Mar 14-23 2022, Antwerp, Belgium

arXiv:2203.04071 [pdf, other]

doi 10.1109/TCAD.2022.3212645

AdaPT: Fast Emulation of Approximate DNN Accelerators in PyTorch

Authors: Dimitrios Danopoulos, Georgios Zervakis, Kostas Siozios, Dimitrios Soudris, Jörg Henkel

Abstract: Current state-of-the-art employs approximate multipliers to address the highly increased power demands of DNN accelerators. However, evaluating the accuracy of approximate DNNs is cumbersome due to the lack of adequate support for approximate arithmetic in DNN frameworks. We address this inefficiency by presenting AdaPT, a fast emulation framework that extends PyTorch to support approximate inference as well as approximation-aware retraining. AdaPT can be seamlessly deployed and is compatible with most DNNs. We evaluate the framework on several DNN models and application fields including CNNs, LSTMs, and GANs for a number of approximate multipliers with distinct bitwidth values. The results show substantial error recovery from approximate re-training and reduced inference time up to 53.9x with respect to the baseline approximate implementation.

Submitted 11 October, 2022; v1 submitted 8 March, 2022; originally announced March 2022.

Comments: Accepted for publication in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

arXiv:2106.06752 [pdf, other]

doi 10.3389/fninf.2022.724336

EDEN: A high-performance, general-purpose, NeuroML-based neural simulator

Authors: Sotirios Panagiotou, Harry Sidiropoulos, Mario Negrello, Dimitrios Soudris, Christos Strydis

Abstract: Modern neuroscience employs in silico experimentation on ever-increasing and more detailed neural networks. The high modelling detail goes hand in hand with the need for high model reproducibility, reusability and transparency. Besides, the size of the models and the long timescales under study mandate the use of a simulation system with high computational performance, so as to provide an acceptable time to result. In this work, we present EDEN (Extensible Dynamics Engine for Networks), a new general-purpose, NeuroML-based neural simulator that achieves both high model flexibility and high computational performance, through an innovative model-analysis and code-generation technique. The simulator runs NeuroML v2 models directly, eliminating the need for users to learn yet another simulator-specific, model-specification language. EDEN's functional correctness and computational performance were assessed through NeuroML models available on the NeuroML-DB and Open Source Brain model repositories. In qualitative experiments, the results produced by EDEN were verified against the established NEURON simulator, for a wide range of models. At the same time, computational-performance benchmarks reveal that EDEN runs up to 2 orders-of-magnitude faster than NEURON on a typical desktop computer, and does so without additional effort from the user. Finally, and without added user effort, EDEN has been built from scratch to scale seamlessly over multiple CPUs and across computer clusters, when available.

Submitted 12 June, 2021; originally announced June 2021.

Comments: 29 pages, 9 figures

Journal ref: Front. Neuroinform. 16 (2022)

arXiv:2004.13873 [pdf, other]

Automated Physics-Derived Code Generation for Sensor Fusion and State Estimation

Authors: Orestis Kaparounakis, Vasileios Tsoutsouras, Dimitrios Soudris, Phillip Stanley-Marbell

Abstract: We present a new method for automatically generating the implementation of state-estimation algorithms from a machine-readable specification of the physics of a sensing system and physics of its signals and signal constraints. We implement the new state-estimator code generation method as a backend for a physics specification language and we apply the backend to generate complete C code implementations of state estimators for both linear systems (Kalman filters) and non-linear systems (extended Kalman filters). The state estimator code generation from physics specification is completely automated and requires no manual intervention. The generated filters can incorporate an Automatic Differentiation technique which combines function evaluation and differentiation in a single process. Using descriptions of physical systems of a range of complexities, we generate extended Kalman filters, which we evaluate in terms of prediction accuracy using simulation traces. The results show that our automatically-generated sensor fusion and state estimation implementations provide state estimation within the same error bound as the human-written hand-optimized counterparts. We additionally quantify the code size and dynamic instruction count requirements of the generated state estimator implementations on the RISC-V architecture. The results show that our synthesized state estimation implementation employing Automatic Differentiation leads to an average improvement in the dynamic instruction count of the generated Kalman filter of 7%-16% compared to the standard differentiation technique. This improvement comes at the limited cost of an average 4.5% increase in the code size of the generated filters.

Submitted 28 April, 2020; originally announced April 2020.

Comments: 11 pages, 7 figures

arXiv:1612.01501 [pdf, other]

BrainFrame: A node-level heterogeneous accelerator platform for neuron simulations

Authors: Georgios Smaragdos, Georgios Chatzikonstantis, Rahul Kukreja, Harry Sidiropoulos, Dimitrios Rodopoulos, Ioannis Sourdis, Zaid Al-Ars, Christoforos Kachris, Dimitrios Soudris, Chris I. De Zeeuw, Christos Strydis

Abstract: Objective: The advent of High-Performance Computing (HPC) in recent years has led to its increasing use in brain study through computational models. The scale and complexity of such models are constantly increasing, leading to challenging computational requirements. Even though modern HPC platforms can often deal with such challenges, the vast diversity of the modeling field does not permit a single acceleration (or homogeneous) platform to effectively address the complete array of modeling requirements. Approach: In this paper we propose and build BrainFrame, a heterogeneous acceleration platform, incorporating three distinct acceleration technologies, a Dataflow Engine, a Xeon Phi and a GP-GPU. The PyNN framework is also integrated into the platform. As a challenging proof of concept, we analyze the performance of BrainFrame on different instances of a state-of-the-art neuron model, modeling the Inferior-Olivary Nucleus using a biophysically-meaningful, extended Hodgkin-Huxley representation. The model instances take into account not only the neuronal-network dimensions but also different network-connectivity circumstances that can drastically change application workload characteristics. Main results: The synthetic approach of three HPC technologies demonstrated that BrainFrame is better able to cope with the modeling diversity encountered. Our performance analysis shows clearly that the model directly affects performance and all three technologies are required to cope with all the model use cases.

Submitted 15 August, 2017; v1 submitted 5 December, 2016; originally announced December 2016.

Comments: 16 pages, 18 figures, 5 tables

arXiv:1406.0309 [pdf]

Network Function Virtualization based on FPGAs: A Framework for all-Programmable network devices

Authors: Christoforos Kachris, Georgios Sirakoulis, Dimitrios Soudris

Abstract: Network Function Virtualization (NFV) refers to the use of commodity hardware resources as the basic platform to perform specialized network functions as opposed to specialized hardware devices. Currently, NFV is mainly implemented based on general purpose processors, or general purpose network processors. In this paper we propose the use of FPGAs as an ideal platform for NFV that can be used to provide both the flexibility of virtualization and the high performance of the specialized hardware. We present the early attempts of using FPGA dynamic reconfiguration in network processing applications to provide flexible network functions and we present the opportunities for an FPGA-based NFV platform.

Submitted 2 June, 2014; originally announced June 2014.

Comments: Network function virtualizations, FPGA, dynamic reconfiguration

arXiv:0710.4844 [pdf]

A Partitioning Methodology for Accelerating Applications in Hybrid Reconfigurable Platforms

Authors: M. D. Galanis, A. Milidonis, G. Theodoridis, D. Soudris, C. E. Goutis

Abstract: In this paper, we propose a methodology for partitioning and mapping computationally intensive applications in reconfigurable hardware blocks of different granularity. A generic hybrid reconfigurable architecture is considered so that the methodology is applicable to a large number of heterogeneous reconfigurable platforms. The methodology mainly consists of two stages, the analysis and the mapping of the application onto fine and coarse-grain hardware resources. A prototype framework consisting of analysis, partitioning and mapping tools has also been developed. For the coarse-grain reconfigurable hardware, we use our previously developed high-performance coarse-grain data-path. In this work, the methodology is validated using two real-world applications, an OFDM transmitter and a JPEG encoder. In the case of the OFDM transmitter, a maximum clock cycles decrease of 82% relative to the ones in an all fine-grain mapping solution is achieved. The corresponding performance improvement for the JPEG is 43%.

Submitted 25 October, 2007; originally announced October 2007.

Comments: Submitted on behalf of EDAA (http://www.edaa.com/)

Journal ref: In Design, Automation and Test in Europe | Designers' Forum - DATE'05, Munich, Germany (2005)

arXiv:0710.4656 [pdf]

A Memory Hierarchical Layer Assigning and Prefetching Technique to Overcome the Memory Performance/Energy Bottleneck

Authors: Minas Dasygenis, Erik Brockmeyer, Bart Durinck, Francky Catthoor, Dimitrios Soudris, Antonios Thanailakis

Abstract: The memory subsystem has always been a bottleneck in performance as well as a significant power contributor in memory intensive applications. Many researchers have presented multi-layered memory hierarchies as a means to design energy and performance efficient systems. However, most of the previous work does not explore trade-offs systematically. We fill this gap by proposing a formalized technique that takes into consideration data reuse, limited lifetime of the arrays of an application and application specific prefetching opportunities, and performs a thorough trade-off exploration for different memory layer sizes. This technique has been implemented on a prototype tool, which was tested successfully using nine real-life applications of industrial relevance. Following this approach we have been able to reduce execution time up to 60%, and energy consumption up to 70%.

Submitted 25 October, 2007; originally announced October 2007.

Comments: Submitted on behalf of EDAA (http://www.edaa.com/)

Journal ref: In Design, Automation and Test in Europe - DATE'05, Munich, Germany (2005)

Showing 1–38 of 38 results for author: Soudris, D