-
COIN: Communication-Aware In-Memory Acceleration for Graph Convolutional Networks
Authors:
Sumit K. Mandal,
Gokul Krishnan,
A. Alper Goksoy,
Gopikrishnan Ravindran Nair,
Yu Cao,
Umit Y. Ogras
Abstract:
Graph convolutional networks (GCNs) have shown remarkable learning capabilities when processing graph-structured data found inherently in many application areas. GCNs distribute the outputs of neural networks embedded in each vertex over multiple iterations to take advantage of the relations captured by the underlying graphs. Consequently, they incur a significant amount of computation and irregular communication overheads, which call for GCN-specific hardware accelerators. To this end, this paper presents a communication-aware in-memory computing architecture (COIN) for GCN hardware acceleration. Besides accelerating the computation using custom compute elements (CE) and in-memory computing, COIN aims at minimizing the intra- and inter-CE communication in GCN operations to optimize the performance and energy efficiency. Experimental evaluations with widely used datasets show up to 105x improvement in energy consumption compared to a state-of-the-art GCN accelerator.
Submitted 15 May, 2022;
originally announced May 2022.
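To make the computation and communication pattern described in the COIN abstract concrete, here is a minimal sketch of a single GCN layer in plain Python. It is not COIN's hardware mapping: the function names, the mean-style row normalization with self-loops, and the ReLU activation are assumptions chosen for illustration. The per-vertex transform stands in for the computation step, and the adjacency-weighted aggregation stands in for the irregular communication step.

```python
def matmul(a, b):
    """Dense matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, feats, weights):
    """One GCN layer: ReLU(A_norm @ feats @ weights).

    Assumption: A_norm is the adjacency matrix with self-loops added and
    each row divided by its sum (a simple mean aggregator).
    """
    n = len(adj)
    # Add self-loops so every vertex keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    a_norm = [[v / sum(row) for v in row] for row in a_hat]
    # Computation: apply the shared weights to each vertex's features.
    transformed = matmul(feats, weights)
    # Communication: each vertex gathers the transformed features of its neighbors.
    aggregated = matmul(a_norm, transformed)
    return [[max(0.0, v) for v in row] for row in aggregated]


# Two connected vertices with scalar features (illustrative values only).
example_output = gcn_layer([[0, 1], [1, 0]], [[1.0], [3.0]], [[1.0]])
```

In this toy graph each vertex averages its own value with its neighbor's, so both rows of `example_output` hold the same value. The aggregation step is where graph irregularity shows up, since the nonzero pattern of the adjacency matrix determines which vertices must exchange data.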
-
DAS: Dynamic Adaptive Scheduling for Energy-Efficient Heterogeneous SoCs
Authors:
A. Alper Goksoy,
Anish Krishnakumar,
Md Sahil Hassan,
Allen J. Farcas,
Ali Akoglu,
Radu Marculescu,
Umit Y. Ogras
Abstract:
Domain-specific systems-on-chip (DSSoCs) aim at bridging the gap between application-specific integrated circuits (ASICs) and general-purpose processors. Traditional operating system (OS) schedulers can undermine the potential of DSSoCs since their execution times can be orders of magnitude larger than the execution time of the task itself. To address this problem, we propose a dynamic adaptive scheduling (DAS) framework that combines the benefits of a fast (low-overhead) scheduler and a slow (sophisticated, high-performance but high-overhead) scheduler. Experiments with five real-world streaming applications show that DAS consistently outperforms both the fast and slow schedulers. For 40 different workloads, DAS achieves on average 1.29x speedup and 45% lower EDP compared to the sophisticated scheduler at low data rates and 1.28x speedup and 37% lower EDP than the fast scheduler when the workload complexity increases.
Submitted 22 September, 2021;
originally announced September 2021.
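The DAS abstract reports results as energy-delay product (EDP) reductions and describes switching between a fast and a slow scheduler. The sketch below shows how an EDP reduction is computed and what a selection step might look like. The threshold rule, the `data_rate_mbps` parameter, and all numeric values are hypothetical; the abstract does not state how DAS chooses between the two schedulers.

```python
def energy_delay_product(energy_j, delay_s):
    """EDP is the product of consumed energy and execution time."""
    return energy_j * delay_s


def relative_reduction(baseline, improved):
    """Fractional reduction of `improved` relative to `baseline`."""
    return 1.0 - improved / baseline


def select_scheduler(data_rate_mbps, threshold_mbps=500.0):
    """Hypothetical selection rule (an assumption, not the paper's policy).

    At low data rates the low-overhead scheduler is chosen; once the
    workload grows past the threshold, the sophisticated one is used.
    """
    return "fast" if data_rate_mbps < threshold_mbps else "slow"


# Illustrative numbers only, not measurements from the paper.
baseline_edp = energy_delay_product(energy_j=2.0, delay_s=0.010)
das_edp = energy_delay_product(energy_j=1.1, delay_s=0.010)
reduction = relative_reduction(baseline_edp, das_edp)
```

With these made-up inputs `reduction` comes out at about 0.45, which is how a statement such as "45% lower EDP" would be read: the new EDP is roughly 55% of the baseline.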
-
Runtime Task Scheduling using Imitation Learning for Heterogeneous Many-Core Systems
Authors:
Anish Krishnakumar,
Samet E. Arda,
A. Alper Goksoy,
Sumit K. Mandal,
Umit Y. Ogras,
Anderson L. Sartor,
Radu Marculescu
Abstract:
Domain-specific systems-on-chip, a class of heterogeneous many-core systems, are recognized as a key approach to narrow down the performance and energy-efficiency gap between custom hardware accelerators and programmable processors. Reaching the full potential of these architectures depends critically on optimally scheduling the applications to available resources at runtime. Existing optimization-based techniques cannot achieve this objective at runtime due to the combinatorial nature of the task scheduling problem. As the main theoretical contribution, this paper poses scheduling as a classification problem and proposes a hierarchical imitation learning (IL)-based scheduler that learns from an Oracle to maximize the performance of multiple domain-specific applications. Extensive evaluations with six streaming applications from wireless communications and radar domains show that the proposed IL-based scheduler approximates an offline Oracle policy with more than 99% accuracy for performance- and energy-based optimization objectives. Furthermore, it achieves almost identical performance to the Oracle with a low runtime overhead and successfully adapts to new applications, many-core system configurations, and runtime variations in application characteristics.
Submitted 6 August, 2020; v1 submitted 18 July, 2020;
originally announced July 2020.
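The idea of posing scheduling as a classification problem can be illustrated with a small behavior-cloning sketch: an oracle labels each scheduling state with a processing element (PE), and a classifier learns to reproduce those labels. Everything here is an assumption made for illustration. The earliest-finish-time oracle, the state layout, and the flat 1-nearest-neighbor classifier are stand-ins; the paper proposes a hierarchical IL policy whose details are not given in the abstract.

```python
import random


def oracle_action(state):
    """Stand-in oracle: pick the PE with the earliest finish time."""
    exec_times, ready_times = state
    finish = [r + e for r, e in zip(ready_times, exec_times)]
    return finish.index(min(finish))


def make_state(rng, num_pes=3):
    """Random state: per-PE execution time and per-PE ready time."""
    exec_times = tuple(rng.uniform(1.0, 10.0) for _ in range(num_pes))
    ready_times = tuple(rng.uniform(0.0, 5.0) for _ in range(num_pes))
    return exec_times, ready_times


def features(state):
    """Flatten a state into a single feature vector."""
    exec_times, ready_times = state
    return exec_times + ready_times


class NearestNeighborPolicy:
    """Classifier policy: return the oracle label of the closest seen state."""

    def fit(self, states, actions):
        self.examples = [(features(s), a) for s, a in zip(states, actions)]
        return self

    def predict(self, state):
        x = features(state)

        def squared_distance(example):
            return sum((p - q) ** 2 for p, q in zip(x, example[0]))

        return min(self.examples, key=squared_distance)[1]


# Collect oracle demonstrations and clone them.
rng = random.Random(0)
train_states = [make_state(rng) for _ in range(200)]
train_actions = [oracle_action(s) for s in train_states]
policy = NearestNeighborPolicy().fit(train_states, train_actions)

# Agreement with the oracle on the demonstration states themselves.
agreement = sum(
    policy.predict(s) == a for s, a in zip(train_states, train_actions)
) / len(train_states)
```

Agreement on the demonstration set is trivially perfect for a nearest-neighbor policy, so it says nothing about generalization. A meaningful comparison with the oracle would use held-out states and workloads.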
-
DS3: A System-Level Domain-Specific System-on-Chip Simulation Framework
Authors:
Samet E. Arda,
Anish NK,
A. Alper Goksoy,
Nirmal Kumbhare,
Joshua Mack,
Anderson L. Sartor,
Ali Akoglu,
Radu Marculescu,
Umit Y. Ogras
Abstract:
Heterogeneous systems-on-chip (SoCs) are highly favorable computing platforms due to their superior performance and energy efficiency potential compared to homogeneous architectures. They can be further tailored to a specific domain of applications by incorporating processing elements (PEs) that accelerate frequently used kernels in these applications. However, this potential is contingent upon optimizing the SoC for the target domain and utilizing its resources effectively at runtime. To this end, system-level design - including scheduling, power-thermal management algorithms and design space exploration studies - plays a crucial role. This paper presents a system-level domain-specific SoC simulation (DS3) framework to address this need. DS3 enables both design space exploration and dynamic resource management for power-performance optimization of domain applications. We showcase DS3 using six real-world applications from the wireless communications and radar processing domains. DS3, as well as the reference applications, is shared as open-source software to stimulate research in this area.
Submitted 19 March, 2020;
originally announced March 2020.
-
Work-in-Progress: A Simulation Framework for Domain-Specific System-on-Chips
Authors:
Samet E. Arda,
Anish NK,
A. Alper Goksoy,
Joshua Mack,
Nirmal Kumbhare,
Anderson L. Sartor,
Ali Akoglu,
Radu Marculescu,
Umit Y. Ogras
Abstract:
Heterogeneous system-on-chips (SoCs) have become the standard embedded computing platforms due to their potential to deliver superior performance and energy efficiency compared to homogeneous architectures. They can be particularly suited to target a specific domain of applications. However, this potential is contingent upon optimizing the SoC for the target domain and utilizing its resources effectively at run-time. Cycle-accurate instruction set simulators are not suitable for this optimization, since meaningful temperature and power consumption evaluations require simulating seconds, if not minutes, of workloads.
This paper presents a system-level domain-specific SoC simulation (DS3) framework to address this need. DS3 enables both design space exploration and dynamic resource management for power-performance optimization of domain applications, with a $\sim 600\times$ speedup compared to the commonly used gem5 simulator. We showcase DS3 using five applications from the wireless communications and radar processing domains. DS3, as well as the reference applications, will be shared as open-source software to stimulate research in this area.
Submitted 9 August, 2019;
originally announced August 2019.