Showing 1–12 of 12 results for author: Schuiki, F

  1. Sparse Stream Semantic Registers: A Lightweight ISA Extension Accelerating General Sparse Linear Algebra

    Authors: Paul Scheffler, Florian Zaruba, Fabian Schuiki, Torsten Hoefler, Luca Benini

    Abstract: Sparse linear algebra is crucial in many application domains, but challenging to handle efficiently in both software and hardware, with one- and two-sided operand sparsity handled with distinct approaches. In this work, we enhance an existing memory-streaming RISC-V ISA extension to accelerate both one- and two-sided operand sparsity on widespread sparse tensor formats like compressed sparse row (…

    Submitted 2 October, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

    Comments: 15 pages, 8 figures. Accepted for publication in IEEE TPDS

  2. arXiv:2104.08009  [pdf, other]

    cs.DC cs.AR cs.CV cs.LG

    Implementing CNN Layers on the Manticore Cluster-Based Many-Core Architecture

    Authors: Andreas Kurth, Fabian Schuiki, Luca Benini

    Abstract: This document presents implementations of fundamental convolutional neural network (CNN) layers on the Manticore cluster-based many-core architecture and discusses their characteristics and trade-offs.

    Submitted 16 April, 2021; originally announced April 2021.

    Comments: Technical report. 18 pages, 4 figures, 5 algorithms

    ACM Class: C.4; C.1.4; F.2.1; I.2

  3. arXiv:2011.08070  [pdf, other]

    cs.AR

    Indirection Stream Semantic Register Architecture for Efficient Sparse-Dense Linear Algebra

    Authors: Paul Scheffler, Florian Zaruba, Fabian Schuiki, Torsten Hoefler, Luca Benini

    Abstract: Sparse-dense linear algebra is crucial in many domains, but challenging to handle efficiently on CPUs, GPUs, and accelerators alike; multiplications with sparse formats like CSR and CSF require indirect memory lookups. In this work, we enhance a memory-streaming RISC-V ISA extension to accelerate sparse-dense products through streaming indirection. We present efficient dot, matrix-vector, and matr…

    Submitted 14 December, 2020; v1 submitted 16 November, 2020; originally announced November 2020.

    Comments: 6 pages, 4 figures. Submitted to DATE 2021. Camera-ready version

  4. An Open-Source Platform for High-Performance Non-Coherent On-Chip Communication

    Authors: Andreas Kurth, Wolfgang Rönninger, Thomas Benz, Matheus Cavalcante, Fabian Schuiki, Florian Zaruba, Luca Benini

    Abstract: On-chip communication infrastructure is a central component of modern systems-on-chip (SoCs), and it continues to gain importance as the number of cores, the heterogeneity of components, and the on-chip and off-chip bandwidth continue to grow. Decades of research on on-chip networks enabled cache-coherent shared-memory multiprocessors. However, communication fabrics that meet the needs of heteroge…

    Submitted 11 November, 2021; v1 submitted 11 September, 2020; originally announced September 2020.

    Comments: 14 pages, 24 figures, 4 tables

    ACM Class: B.4.3; C.1.2; C.5.4

  5. arXiv:2008.06502  [pdf, other]

    cs.AR

    Manticore: A 4096-core RISC-V Chiplet Architecture for Ultra-efficient Floating-point Computing

    Authors: Florian Zaruba, Fabian Schuiki, Luca Benini

    Abstract: Data-parallel problems demand ever growing floating-point (FP) operations per second under tight area- and energy-efficiency constraints. In this work, we present Manticore, a general-purpose, ultra-efficient chiplet-based architecture for data-parallel FP workloads. We have manufactured a prototype of the chiplet's computational core in Globalfoundries 22FDX process and demonstrate more than 5x i…

    Submitted 20 November, 2020; v1 submitted 14 August, 2020; originally announced August 2020.

  6. arXiv:2007.01530  [pdf, other]

    cs.AR

    FPnew: An Open-Source Multi-Format Floating-Point Unit Architecture for Energy-Proportional Transprecision Computing

    Authors: Stefan Mach, Fabian Schuiki, Florian Zaruba, Luca Benini

    Abstract: The slowdown of Moore's law and the power wall necessitates a shift towards finely tunable precision (a.k.a. transprecision) computing to reduce energy footprint. Hence, we need circuits capable of performing floating-point operations on a wide range of precisions with high energy-proportionality. We present FPnew, a highly configurable open-source transprecision floating-point unit (TP-FPU) capab…

    Submitted 3 July, 2020; originally announced July 2020.

  7. arXiv:2004.03494  [pdf, other]

    cs.PL

    LLHD: A Multi-level Intermediate Representation for Hardware Description Languages

    Authors: Fabian Schuiki, Andreas Kurth, Tobias Grosser, Luca Benini

    Abstract: Modern Hardware Description Languages (HDLs) such as SystemVerilog or VHDL are, due to their sheer complexity, insufficient to transport designs through modern circuit design flows. Instead, each design automation tool lowers HDLs to its own Intermediate Representation (IR). These tools are monolithic and mostly proprietary, disagree in their implementation of HDLs, and while many redundant IRs ex…

    Submitted 7 April, 2020; originally announced April 2020.

  8. Snitch: A tiny Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of Floating-Point Intensive Workloads

    Authors: Florian Zaruba, Fabian Schuiki, Torsten Hoefler, Luca Benini

    Abstract: Data-parallel applications, such as data analytics, machine learning, and scientific computing, are placing an ever-growing demand on floating-point operations per second on emerging systems. With increasing integration density, the quest for energy efficiency becomes the number one design concern. While dedicated accelerators provide high energy efficiency, they are over-specialized and hard to a…

    Submitted 8 October, 2020; v1 submitted 24 February, 2020; originally announced February 2020.

  9. arXiv:1911.08356  [pdf, other]

    cs.AR cs.DC

    Stream Semantic Registers: A Lightweight RISC-V ISA Extension Achieving Full Compute Utilization in Single-Issue Cores

    Authors: Fabian Schuiki, Florian Zaruba, Torsten Hoefler, Luca Benini

    Abstract: Single-issue processor cores are very energy efficient but suffer from the von Neumann bottleneck, in that they must explicitly fetch and issue the loads/stores necessary to feed their ALU/FPU. Each instruction spent on moving data is a cycle not spent on computation, limiting ALU/FPU utilization to 33% on reductions. We propose "Stream Semantic Registers" to boost utilization and increase energy…

    Submitted 1 April, 2020; v1 submitted 19 November, 2019; originally announced November 2019.

  10. Ara: A 1 GHz+ Scalable and Energy-Efficient RISC-V Vector Processor with Multi-Precision Floating Point Support in 22 nm FD-SOI

    Authors: Matheus Cavalcante, Fabian Schuiki, Florian Zaruba, Michael Schaffner, Luca Benini

    Abstract: In this paper, we present Ara, a 64-bit vector processor based on the version 0.5 draft of RISC-V's vector extension, implemented in GlobalFoundries 22FDX FD-SOI technology. Ara's microarchitecture is scalable, as it is composed of a set of identical lanes, each containing part of the processor's vector register file and functional units. It achieves up to 97% FPU utilization when running a 256 x…

    Submitted 27 October, 2019; v1 submitted 2 June, 2019; originally announced June 2019.

    Comments: 13 pages. Accepted for publication in IEEE Transactions on Very Large Scale Integration Systems

  11. arXiv:1812.00182  [pdf, other]

    cs.DC cs.AR

    NTX: An Energy-efficient Streaming Accelerator for Floating-point Generalized Reduction Workloads in 22nm FD-SOI

    Authors: Fabian Schuiki, Michael Schaffner, Luca Benini

    Abstract: Specialized coprocessors for Multiply-Accumulate (MAC) intensive workloads such as Deep Learning are becoming widespread in SoC platforms, from GPUs to mobile SoCs. In this paper we revisit NTX (an efficient accelerator developed for training Deep Neural Networks at scale) as a generalized MAC and reduction streaming engine. The architecture consists of a set of 32 bit floating-point streaming co-…

    Submitted 1 December, 2018; originally announced December 2018.

    Comments: 6 pages, invited paper at DATE 2019

  12. arXiv:1803.04783  [pdf, other]

    cs.DC cs.AR

    A Scalable Near-Memory Architecture for Training Deep Neural Networks on Large In-Memory Datasets

    Authors: Fabian Schuiki, Michael Schaffner, Frank K. Gürkaynak, Luca Benini

    Abstract: Most investigations into near-memory hardware accelerators for deep neural networks have primarily focused on inference, while the potential of accelerating training has received relatively little attention so far. Based on an in-depth analysis of the key computational patterns in state-of-the-art gradient-based training methods, we propose an efficient near-memory acceleration engine called NTX t…

    Submitted 17 October, 2018; v1 submitted 19 February, 2018; originally announced March 2018.

    Comments: 14 pages, submitted to IEEE Transactions on Computers journal