Search | arXiv e-print repository

doi 10.1145/3719276.3725186

Ramping Up Open-Source RISC-V Cores: Assessing the Energy Efficiency of Superscalar, Out-of-Order Execution

Authors: Zexin Fu, Riccardo Tedeschi, Gianmarco Ottavi, Nils Wistoff, César Fuguet, Davide Rossi, Luca Benini

Abstract: Open-source RISC-V cores are increasingly demanded in domains like automotive and space, where achieving high instructions per cycle (IPC) through superscalar and out-of-order (OoO) execution is crucial. However, high-performance open-source RISC-V cores face adoption challenges: some (e.g. BOOM, Xiangshan) are developed in Chisel with limited support from industrial electronic design automation (… ▽ More Open-source RISC-V cores are increasingly demanded in domains like automotive and space, where achieving high instructions per cycle (IPC) through superscalar and out-of-order (OoO) execution is crucial. However, high-performance open-source RISC-V cores face adoption challenges: some (e.g. BOOM, Xiangshan) are developed in Chisel with limited support from industrial electronic design automation (EDA) tools. Others, like the XuanTie C910 core, use proprietary interfaces and protocols, including non-standard AXI protocol extensions, interrupts, and debug support. In this work, we present a modified version of the OoO C910 core to achieve full RISC-V standard compliance in its debug, interrupt, and memory interfaces. We also introduce CVA6S+, an enhanced version of the dual-issue, industry-supported open-source CVA6 core. CVA6S+ achieves 34.4% performance improvement over CVA6 core. We conduct a detailed performance, area, power, and energy analysis on the superscalar out-of-order C910, superscalar in-order CVA6S+ and vanilla, single-issue in-order CVA6, all implemented in a 22nm technology and integrated into Cheshire, an open-source modular SoC. We examine the performance and efficiency of different microarchitectures using the same ISA, SoC, and implementation with identical technology, tools, and methodologies. The area and performance rankings of CVA6, CVA6S+, and C910 follow expected trends: compared to the scalar CVA6, CVA6S+ shows an area increase of 6% and an IPC improvement of 34.4%, while C910 exhibits a 75% increase in area and a 119.5% improvement in IPC. However, efficiency analysis reveals that CVA6S+ leads in area efficiency (GOPS/mm2), while the C910 is highly competitive in energy efficiency (GOPS/W). This challenges the common belief that high performance in superscalar and out-of-order cores inherently comes at a significant cost in area and energy efficiency. △ Less

Submitted 30 May, 2025; originally announced May 2025.

arXiv:2505.03762 [pdf, other]

CVA6S+: A Superscalar RISC-V Core with High-Throughput Memory Architecture

Authors: Riccardo Tedeschi, Gianmarco Ottavi, Côme Allart, Nils Wistoff, Zexin Fu, Filippo Grillotti, Fabio De Ambroggi, Elio Guidetti, Jean-Baptiste Rigaud, Olivier Potin, Jean Roch Coulon, César Fuguet, Luca Benini, Davide Rossi

Abstract: Open-source RISC-V cores are increasingly adopted in high-end embedded domains such as automotive, where maximizing instructions per cycle (IPC) is becoming critical. Building on the industry-supported open-source CVA6 core and its superscalar variant, CVA6S, we introduce CVA6S+, an enhanced version incorporating improved branch prediction, register renaming and enhanced operand forwarding. These… ▽ More Open-source RISC-V cores are increasingly adopted in high-end embedded domains such as automotive, where maximizing instructions per cycle (IPC) is becoming critical. Building on the industry-supported open-source CVA6 core and its superscalar variant, CVA6S, we introduce CVA6S+, an enhanced version incorporating improved branch prediction, register renaming and enhanced operand forwarding. These optimizations enable CVA6S+ to achieve a 43.5% performance improvement over the scalar configuration and 10.9% over CVA6S, with an area overhead of just 9.30% over the scalar core (CVA6). Furthermore, we integrate CVA6S+ with the OpenHW Core-V High-Performance L1 Dcache (HPDCache) and report a 74.1% bandwidth improvement over the legacy CVA6 cache subsystem. △ Less

Submitted 8 May, 2025; v1 submitted 20 April, 2025; originally announced May 2025.

Comments: 3 pages, 1 figure

arXiv:2504.10345 [pdf, other]

doi 10.1145/3706594.3726974

AraOS: Analyzing the Impact of Virtual Memory Management on Vector Unit Performance

Authors: Matteo Perotti, Vincenzo Maisto, Moritz Imfeld, Nils Wistoff, Alessandro Cilardo, Luca Benini

Abstract: Vector processor architectures offer an efficient solution for accelerating data-parallel workloads (e.g., ML, AI), reducing instruction count, and enhancing processing efficiency. This is evidenced by the increasing adoption of vector ISAs, such as Arm's SVE/SVE2 and RISC-V's RVV, not only in high-performance computers but also in embedded systems. The open-source nature of RVV has particularly e… ▽ More Vector processor architectures offer an efficient solution for accelerating data-parallel workloads (e.g., ML, AI), reducing instruction count, and enhancing processing efficiency. This is evidenced by the increasing adoption of vector ISAs, such as Arm's SVE/SVE2 and RISC-V's RVV, not only in high-performance computers but also in embedded systems. The open-source nature of RVV has particularly encouraged the development of numerous vector processor designs across industry and academia. However, despite the growing number of open-source RVV processors, there is a lack of published data on their performance in a complex application environment hosted by a full-fledged operating system (Linux). In this work, we add OS support to the open-source bare-metal Ara2 vector processor (AraOS) by sharing the MMU of CVA6, the scalar core used for instruction dispatch to Ara2, and integrate AraOS into the open-source Cheshire SoC platform. We evaluate the performance overhead of virtual-to-physical address translation by benchmarking matrix multiplication kernels across several problem sizes and translation lookaside buffer (TLB) configurations in CVA6's shared MMU, providing insights into vector performance in a full-system environment with virtual memory. With at least 16 TLB entries, the virtual memory overhead remains below 3.5%. Finally, we benchmark a 2-lane AraOS instance with the open-source RiVEC benchmark suite for RVV architectures, with peak average speedups of 3.2x against scalar-only execution. △ Less

Submitted 14 April, 2025; originally announced April 2025.

Comments: Submitted to CF25-OSHW: Workshop on Open-Source Hardware (3rd Edition), co-located with Computing Frontiers 2025

arXiv:2502.18953 [pdf, other]

A Reliable, Time-Predictable Heterogeneous SoC for AI-Enhanced Mixed-Criticality Edge Applications

Authors: Angelo Garofalo, Alessandro Ottaviano, Matteo Perotti, Thomas Benz, Yvan Tortorella, Robert Balas, Michael Rogenmoser, Chi Zhang, Luca Bertaccini, Nils Wistoff, Maicol Ciani, Cyril Koenig, Mattia Sinigaglia, Luca Valente, Paul Scheffler, Manuel Eggimann, Matheus Cavalcante, Francesco Restuccia, Alessandro Biondi, Francesco Conti, Frank K. Gurkaynak, Davide Rossi, Luca Benini

Abstract: Next-generation mixed-criticality Systems-on-chip (SoCs) for robotics, automotive, and space must execute mixed-criticality AI-enhanced sensor processing and control workloads, ensuring reliable and time-predictable execution of critical tasks sharing resources with non-critical tasks, while also fitting within a sub-2W power envelope. To tackle these multi-dimensional challenges, in this brief, w… ▽ More Next-generation mixed-criticality Systems-on-chip (SoCs) for robotics, automotive, and space must execute mixed-criticality AI-enhanced sensor processing and control workloads, ensuring reliable and time-predictable execution of critical tasks sharing resources with non-critical tasks, while also fitting within a sub-2W power envelope. To tackle these multi-dimensional challenges, in this brief, we present a 16nm, reliable, time-predictable heterogeneous SoC with multiple programmable accelerators. Within a 1.2W power envelope, the SoC integrates software-configurable hardware IPs to ensure predictable access to shared resources, such as the on-chip interconnect and memory system, leading to tight upper bounds on execution times of critical applications. To accelerate mixed-precision mission-critical AI, the SoC integrates a reliable multi-core accelerator achieving 304.9 GOPS peak performance at 1.6 TOPS/W energy efficiency. Non-critical, compute-intensive, floating-point workloads are accelerated by a dual-core vector cluster, achieving 121.8 GFLOPS at 1.1 TFLOPS/W and 106.8 GFLOPS/mm2. △ Less

Submitted 26 February, 2025; originally announced February 2025.

arXiv:2502.02626 [pdf, other]

ArtistIC: An Open-Source Toolchain for Top-Metal IC Art and Ultra-High-Fidelity GDSII Renders

Authors: Thomas Benz, Paul Scheffler, Nils Wistoff, Philippe Sauter, Beat Muheim, Luca Benini

Abstract: Open-source projects require outreach material to grow their community, secure funds, and strengthen their influence. Numbers, specifications, and facts alone are intangible to uninvolved people; using a clear brand and appealing visual material is thus ample to reach a broad audience. This is especially true for application-specific integrated circuits (ASICs) during the early stages of the devel… ▽ More Open-source projects require outreach material to grow their community, secure funds, and strengthen their influence. Numbers, specifications, and facts alone are intangible to uninvolved people; using a clear brand and appealing visual material is thus ample to reach a broad audience. This is especially true for application-specific integrated circuits (ASICs) during the early stages of the development cycle without running prototype systems. This work presents ArtistIC, an open-source framework to brand ASICs with top-metal art and to render GDSII layouts with ultra-high fidelity reaching render densities below 25 nm/px and gigapixels-scale resolutions. △ Less

Submitted 4 February, 2025; originally announced February 2025.

Comments: 2 pages, 3 figures

arXiv:2501.07330 [pdf, other]

doi 10.1109/JSSC.2025.3529249

Occamy: A 432-Core Dual-Chiplet Dual-HBM2E 768-DP-GFLOP/s RISC-V System for 8-to-64-bit Dense and Sparse Computing in 12nm FinFET

Authors: Paul Scheffler, Thomas Benz, Viviane Potocnik, Tim Fischer, Luca Colagrande, Nils Wistoff, Yichao Zhang, Luca Bertaccini, Gianmarco Ottavi, Manuel Eggimann, Matheus Cavalcante, Gianna Paulin, Frank K. Gürkaynak, Davide Rossi, Luca Benini

Abstract: ML and HPC applications increasingly combine dense and sparse memory access computations to maximize storage efficiency. However, existing CPUs and GPUs struggle to flexibly handle these heterogeneous workloads with consistently high compute efficiency. We present Occamy, a 432-Core, 768-DP-GFLOP/s, dual-HBM2E, dual-chiplet RISC-V system with a latency-tolerant hierarchical interconnect and in-cor… ▽ More ML and HPC applications increasingly combine dense and sparse memory access computations to maximize storage efficiency. However, existing CPUs and GPUs struggle to flexibly handle these heterogeneous workloads with consistently high compute efficiency. We present Occamy, a 432-Core, 768-DP-GFLOP/s, dual-HBM2E, dual-chiplet RISC-V system with a latency-tolerant hierarchical interconnect and in-core streaming units (SUs) designed to accelerate dense and sparse FP8-to-FP64 ML and HPC workloads. We implement Occamy's compute chiplets in 12 nm FinFET, and its passive interposer, Hedwig, in a 65 nm node. On dense linear algebra (LA), Occamy achieves a competitive FPU utilization of 89%. On stencil codes, Occamy reaches an FPU utilization of 83% and a technology-node-normalized compute density of 11.1 DP-GFLOP/s/mm2,leading state-of-the-art (SoA) processors by 1.7x and 1.2x, respectively. On sparse-dense linear algebra (LA), it achieves 42% FPU utilization and a normalized compute density of 5.95 DP-GFLOP/s/mm2, surpassing the SoA by 5.2x and 11x, respectively. On, sparse-sparse LA, Occamy reaches a throughput of up to 187 GCOMP/s at 17.4 GCOMP/s/W and a compute density of 3.63 GCOMP/s/mm2. Finally, we reach up to 75% and 54% FPU utilization on and dense (LLM) and graph-sparse (GCN) ML inference workloads. Occamy's RTL is freely available under a permissive open-source license. △ Less

Submitted 13 January, 2025; originally announced January 2025.

Comments: 16 pages, 13 figures, 1 table. Accepted for publication in IEEE JSSC

arXiv:2410.07798 [pdf, other]

vCLIC: Towards Fast Interrupt Handling in Virtualized RISC-V Mixed-criticality Systems

Authors: Enrico Zelioli, Alessandro Ottaviano, Robert Balas, Nils Wistoff, Angelo Garofalo, Luca Benini

Abstract: The widespread diffusion of compute-intensive edge-AI workloads and the stringent demands of modern autonomous systems require advanced heterogeneous embedded architectures. Such architectures must support high-performance and reliable execution of parallel tasks with different levels of criticality. Hardware-assisted virtualization is crucial for isolating applications concurrently executing thes… ▽ More The widespread diffusion of compute-intensive edge-AI workloads and the stringent demands of modern autonomous systems require advanced heterogeneous embedded architectures. Such architectures must support high-performance and reliable execution of parallel tasks with different levels of criticality. Hardware-assisted virtualization is crucial for isolating applications concurrently executing these tasks under real-time constraints, but interrupt virtualization poses challenges in ensuring transparency to virtual guests while maintaining real-time system features, such as interrupt vectoring, nesting, and tail-chaining. Despite its rapid advancement to address virtualization needs for mixed-criticality systems, the RISC-V ecosystem still lacks interrupt controllers with integrated virtualization and real-time features, currently relying on non-deterministic, bus-mediated message-signaled interrupts (MSIs) for virtualization. To overcome this limitation, we present the design, implementation, and in-system assessment of vCLIC, a virtualization extension to the RISC-V CLIC fast interrupt controller. Our approach achieves 20x interrupt latency speed-up over the software emulation required for handling non-virtualization-aware systems, reduces response latency by 15% compared to existing MSI-based approaches, and is free from interference from the system bus, at an area cost of just 8kGE when synthesized in an advanced 16nm FinFet technology. △ Less

Submitted 10 October, 2024; originally announced October 2024.

Comments: 4 pages, 4 figures, accepted for presentation at the 42nd IEEE International Conference on Computer Design (ICCD 2024)

arXiv:2409.07576 [pdf, other]

fence.t.s: Closing Timing Channels in High-Performance Out-of-Order Cores through ISA-Supported Temporal Partitioning

Authors: Nils Wistoff, Gernot Heiser, Luca Benini

Abstract: Microarchitectural timing channels exploit information leakage between security domains that should be isolated, bypassing the operating system's security boundaries. These channels result from contention for shared microarchitectural state. In the RISC-V instruction set, the temporal fence instruction (fence.t) was proposed to close timing channels by providing an operating system with the means… ▽ More Microarchitectural timing channels exploit information leakage between security domains that should be isolated, bypassing the operating system's security boundaries. These channels result from contention for shared microarchitectural state. In the RISC-V instruction set, the temporal fence instruction (fence.t) was proposed to close timing channels by providing an operating system with the means to temporally partition microarchitectural state inexpensively in simple in-order cores. This work explores challenges with fence.t in superscalar out-of-order cores featuring large and pervasive microarchitectural state. To overcome these challenges, we propose a novel SW-supported temporal fence (fence.t.s), which reuses existing mechanisms and supports advanced microarchitectural features, enabling full timing channel protection of an exemplary out-of-order core (OpenC910) at negligible hardware costs and a minimal performance impact of 1.0 %. △ Less

Submitted 11 September, 2024; originally announced September 2024.

Comments: 8 pages, 3 figures, 1 algorithm, 1 listing. Accepted at the 2024 International Conference on Applications in Electronics Pervading Industry, Environment and Society (APPLEPIES 2024)

arXiv:2407.19895 [pdf]

Culsans: An Efficient Snoop-based Coherency Unit for the CVA6 Open Source RISC-V application processor

Authors: Riccardo Tedeschi, Luca Valente, Gianmarco Ottavi, Enrico Zelioli, Nils Wistoff, Massimiliano Giacometti, Abdul Basit Sajjad, Luca Benini, Davide Rossi

Abstract: Symmetric Multi-Processing (SMP) based on cache coherency is crucial for high-end embedded systems like automotive applications. RISC-V is gaining traction, and open-source hardware (OSH) platforms offer solutions to issues such as IP costs and vendor dependency. Existing multi-core cache-coherent RISC-V platforms are complex and not efficient for small embedded core clusters. We propose an open-s… ▽ More Symmetric Multi-Processing (SMP) based on cache coherency is crucial for high-end embedded systems like automotive applications. RISC-V is gaining traction, and open-source hardware (OSH) platforms offer solutions to issues such as IP costs and vendor dependency. Existing multi-core cache-coherent RISC-V platforms are complex and not efficient for small embedded core clusters. We propose an open-source SystemVerilog implementation of a lightweight snoop-based cache-coherent cluster of Linux-capable CVA6 cores. Our design uses the MOESI protocol via the Arm's AMBA ACE protocol. Evaluated with Splash-3 benchmarks, our solution shows up to 32.87% faster performance in a dual-core setup and an average improvement of 15.8% over OpenPiton. Synthesized using GF 22nm FDSOI technology, the Cache Coherency Unit occupies only 1.6% of the system area. △ Less

Submitted 12 August, 2024; v1 submitted 29 July, 2024; originally announced July 2024.

Comments: 4 pages, 4 figures, DSD2024 and SEAA2024 Works in Progress Session AUG 2024; Updated the acknowledgments

arXiv:2406.15068 [pdf, other]

Occamy: A 432-Core 28.1 DP-GFLOP/s/W 83% FPU Utilization Dual-Chiplet, Dual-HBM2E RISC-V-based Accelerator for Stencil and Sparse Linear Algebra Computations with 8-to-64-bit Floating-Point Support in 12nm FinFET

Authors: Gianna Paulin, Paul Scheffler, Thomas Benz, Matheus Cavalcante, Tim Fischer, Manuel Eggimann, Yichao Zhang, Nils Wistoff, Luca Bertaccini, Luca Colagrande, Gianmarco Ottavi, Frank K. Gürkaynak, Davide Rossi, Luca Benini

Abstract: We present Occamy, a 432-core RISC-V dual-chiplet 2.5D system for efficient sparse linear algebra and stencil computations on FP64 and narrow (32-, 16-, 8-bit) SIMD FP data. Occamy features 48 clusters of RISC-V cores with custom extensions, two 64-bit host cores, and a latency-tolerant multi-chiplet interconnect and memory system with 32 GiB of HBM2E. It achieves leading-edge utilization on stenc… ▽ More We present Occamy, a 432-core RISC-V dual-chiplet 2.5D system for efficient sparse linear algebra and stencil computations on FP64 and narrow (32-, 16-, 8-bit) SIMD FP data. Occamy features 48 clusters of RISC-V cores with custom extensions, two 64-bit host cores, and a latency-tolerant multi-chiplet interconnect and memory system with 32 GiB of HBM2E. It achieves leading-edge utilization on stencils (83 %), sparse-dense (42 %), and sparse-sparse (49 %) matrix multiply. △ Less

Submitted 21 June, 2024; originally announced June 2024.

Comments: 2 pages, 7 figures. Accepted at the 2024 IEEE Symposium on VLSI Technology & Circuits

arXiv:2401.03531 [pdf, other]

doi 10.1109/TCSI.2024.3359044

A Heterogeneous RISC-V based SoC for Secure Nano-UAV Navigation

Authors: Luca Valente, Alessandro Nadalini, Asif Veeran, Mattia Sinigaglia, Bruno Sa, Nils Wistoff, Yvan Tortorella, Simone Benatti, Rafail Psiakis, Ari Kulmala, Baker Mohammad, Sandro Pinto, Daniele Palossi, Luca Benini, Davide Rossi

Abstract: The rapid advancement of energy-efficient parallel ultra-low-power (ULP) ucontrollers units (MCUs) is enabling the development of autonomous nano-sized unmanned aerial vehicles (nano-UAVs). These sub-10cm drones represent the next generation of unobtrusive robotic helpers and ubiquitous smart sensors. However, nano-UAVs face significant power and payload constraints while requiring advanced comput… ▽ More The rapid advancement of energy-efficient parallel ultra-low-power (ULP) ucontrollers units (MCUs) is enabling the development of autonomous nano-sized unmanned aerial vehicles (nano-UAVs). These sub-10cm drones represent the next generation of unobtrusive robotic helpers and ubiquitous smart sensors. However, nano-UAVs face significant power and payload constraints while requiring advanced computing capabilities akin to standard drones, including real-time Machine Learning (ML) performance and the safe co-existence of general-purpose and real-time OSs. Although some advanced parallel ULP MCUs offer the necessary ML computing capabilities within the prescribed power limits, they rely on small main memories (<1MB) and ucontroller-class CPUs with no virtualization or security features, and hence only support simple bare-metal runtimes. In this work, we present Shaheen, a 9mm2 200mW SoC implemented in 22nm FDX technology. Differently from state-of-the-art MCUs, Shaheen integrates a Linux-capable RV64 core, compliant with the v1.0 ratified Hypervisor extension and equipped with timing channel protection, along with a low-cost and low-power memory controller exposing up to 512MB of off-chip low-cost low-power HyperRAM directly to the CPU. At the same time, it integrates a fully programmable energy- and area-efficient multi-core cluster of RV32 cores optimized for general-purpose DSP as well as reduced- and mixed-precision ML. To the best of the authors' knowledge, it is the first silicon prototype of a ULP SoC coupling the RV64 and RV32 cores in a heterogeneous host+accelerator architecture fully based on the RISC-V ISA. We demonstrate the capabilities of the proposed SoC on a wide range of benchmarks relevant to nano-UAV applications. The cluster can deliver up to 90GOp/s and up to 1.8TOp/s/W on 2-bit integer kernels and up to 7.9GFLOp/s and up to 150GFLOp/s/W on 16-bit FP kernels. △ Less

Submitted 7 January, 2024; originally announced January 2024.

arXiv:2310.17046 [pdf, other]

Proving the Absence of Microarchitectural Timing Channels

Authors: Scott Buckley, Robert Sison, Nils Wistoff, Curtis Millar, Toby Murray, Gerwin Klein, Gernot Heiser

Abstract: Microarchitectural timing channels are a major threat to computer security. A set of OS mechanisms called time protection was recently proposed as a principled way of preventing information leakage through such channels and prototyped in the seL4 microkernel. We formalise time protection and the underlying hardware mechanisms in a way that allows linking them to the information-flow proofs that sh… ▽ More Microarchitectural timing channels are a major threat to computer security. A set of OS mechanisms called time protection was recently proposed as a principled way of preventing information leakage through such channels and prototyped in the seL4 microkernel. We formalise time protection and the underlying hardware mechanisms in a way that allows linking them to the information-flow proofs that showed the absence of storage channels in seL4. △ Less

Submitted 25 October, 2023; originally announced October 2023.

Comments: Scott Buckley and Robert Sison were joint lead authors

ACM Class: D.4.6; D.2.4; F.3.1

arXiv:2307.04148 [pdf, ps, other]

doi 10.1109/MECO58584.2023.10154913

Towards a RISC-V Open Platform for Next-generation Automotive ECUs

Authors: Luca Cuomo, Claudio Scordino, Alessandro Ottaviano, Nils Wistoff, Robert Balas, Luca Benini, Errico Guidieri, Ida Maria Savino

Abstract: The complexity of automotive systems is increasing quickly due to the integration of novel functionalities such as assisted or autonomous driving. However, increasing complexity poses considerable challenges to the automotive supply chain since the continuous addition of new hardware and network cabling is not considered tenable. The availability of modern heterogeneous multi-processor chips repre… ▽ More The complexity of automotive systems is increasing quickly due to the integration of novel functionalities such as assisted or autonomous driving. However, increasing complexity poses considerable challenges to the automotive supply chain since the continuous addition of new hardware and network cabling is not considered tenable. The availability of modern heterogeneous multi-processor chips represents a unique opportunity to reduce vehicle costs by integrating multiple functionalities into fewer Electronic Control Units (ECUs). In addition, the recent improvements in open-hardware technology allow to further reduce costs by avoiding lock-in solutions. This paper presents a mixed-criticality multi-OS architecture for automotive ECUs based on open hardware and open-source technologies. Safety-critical functionalities are executed by an AUTOSAR OS running on a RISC-V processor, while the Linux OS executes more advanced functionalities on a multi-core ARM CPU. Besides presenting the implemented stack and the communication infrastructure, this paper provides a quantitative gap analysis between an HW/SW optimized version of the RISC-V processor and a COTS Arm Cortex-R in terms of real-time features, confirming that RISC-V is a valuable candidate for running AUTOSAR Classic stacks of next-generation automotive MCUs. △ Less

Submitted 9 July, 2023; originally announced July 2023.

Comments: 8 pages, 2023 12th Mediterranean Conference on Embedded Computing (MECO)

Journal ref: 2023 12th Mediterranean Conference on Embedded Computing (MECO), Budva, Montenegro, 2023, pp. 1-8

arXiv:2210.08882 [pdf, other]

doi 10.1109/ASAP54787.2022.00017

A "New Ara" for Vector Computing: An Open Source Highly Efficient RISC-V V 1.0 Vector Processor Design

Authors: Matteo Perotti, Matheus Cavalcante, Nils Wistoff, Renzo Andri, Lukas Cavigelli, Luca Benini

Abstract: Vector architectures are gaining traction for highly efficient processing of data-parallel workloads, driven by all major ISAs (RISC-V, Arm, Intel), and boosted by landmark chips, like the Arm SVE-based Fujitsu A64FX, powering the TOP500 leader Fugaku. The RISC-V V extension has recently reached 1.0-Frozen status. Here, we present its first open-source implementation, discuss the new specification… ▽ More Vector architectures are gaining traction for highly efficient processing of data-parallel workloads, driven by all major ISAs (RISC-V, Arm, Intel), and boosted by landmark chips, like the Arm SVE-based Fujitsu A64FX, powering the TOP500 leader Fugaku. The RISC-V V extension has recently reached 1.0-Frozen status. Here, we present its first open-source implementation, discuss the new specification's impact on the micro-architecture of a lane-based design, and provide insights on performance-oriented design of coupled scalar-vector processors. Our system achieves comparable/better PPA than state-of-the-art vector engines that implement older RVV versions: 15% better area, 6% improved throughput, and FPU utilization >98.5% on crucial kernels. △ Less

Submitted 9 January, 2025; v1 submitted 17 October, 2022; originally announced October 2022.

Comments: Accepted version of the article published in "2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP)"

arXiv:2205.12580 [pdf, other]

doi 10.1109/ISVLSI54635.2022.00089

On-Demand Redundancy Grouping: Selectable Soft-Error Tolerance for a Multicore Cluster

Authors: Michael Rogenmoser, Nils Wistoff, Pirmin Vogel, Frank Gürkaynak, Luca Benini

Abstract: With the shrinking of technology nodes and the use of parallel processor clusters in hostile and critical environments, such as space, run-time faults caused by radiation are a serious cross-cutting concern, also impacting architectural design. This paper introduces an architectural approach to run-time configurable soft-error tolerance at the core level, augmenting a six-core open-source RISC-V c… ▽ More With the shrinking of technology nodes and the use of parallel processor clusters in hostile and critical environments, such as space, run-time faults caused by radiation are a serious cross-cutting concern, also impacting architectural design. This paper introduces an architectural approach to run-time configurable soft-error tolerance at the core level, augmenting a six-core open-source RISC-V cluster with a novel On-Demand Redundancy Grouping (ODRG) scheme. ODRG allows the cluster to operate either as two fault-tolerant cores, or six individual cores for high-performance, with limited overhead to switch between these modes during run-time. The ODRG unit adds less than 11% of a core's area for a three-core group, or a total of 1% of the cluster area, and shows negligible timing increase, which compares favorably to a commercial state-of-the-art implementation, and is 2.5$\times$ faster in fault recovery re-synchronization. Furthermore, when redundancy is not necessary, the ODRG approach allows the redundant cores to be used for independent computation, allowing up to 2.96$\times$ increase in performance for selected applications. △ Less

Submitted 3 October, 2023; v1 submitted 25 May, 2022; originally announced May 2022.

Journal ref: ISVLSI (2022) 398-401

arXiv:2202.12029 [pdf, other]

Systematic Prevention of On-Core Timing Channels by Full Temporal Partitioning

Authors: Nils Wistoff, Moritz Schneider, Frank K. Gürkaynak, Gernot Heiser, Luca Benini

Abstract: Microarchitectural timing channels enable unwanted information flow across security boundaries, violating fundamental security assumptions. They leverage timing variations of several state-holding microarchitectural components and have been demonstrated across instruction set architectures and hardware implementations. Analogously to memory protection, Ge et al. have proposed time protection for p… ▽ More Microarchitectural timing channels enable unwanted information flow across security boundaries, violating fundamental security assumptions. They leverage timing variations of several state-holding microarchitectural components and have been demonstrated across instruction set architectures and hardware implementations. Analogously to memory protection, Ge et al. have proposed time protection for preventing information leakage via timing channels. They also showed that time protection calls for hardware support. This work leverages the open and extensible RISC-V instruction set architecture (ISA) to introduce the temporal fence instruction fence.t, which provides the required mechanisms by clearing vulnerable microarchitectural state and guaranteeing a history-independent context-switch latency. We propose and discuss three different implementations of fence.t and implement them on an experimental version of the seL4 microkernel and CVA6, an open-source, in-order, application class, 64-bit RISC-V core. We find that a complete, systematic, ISA-supported erasure of all non-architectural core components is the most effective implementation while featuring a low implementation effort, a minimal performance overhead of approximately 2%, and negligible hardware costs. △ Less

Submitted 24 February, 2022; originally announced February 2022.

Comments: This work has been submitted to the IEEE for possible publication. arXiv admin note: text overlap with arXiv:2005.02193

arXiv:2005.02193 [pdf, other]

Prevention of Microarchitectural Covert Channels on an Open-Source 64-bit RISC-V Core

Authors: Nils Wistoff, Moritz Schneider, Frank K. Gürkaynak, Luca Benini, Gernot Heiser

Abstract: Covert channels enable information leakage across security boundaries of the operating system. Microarchitectural covert channels exploit changes in execution timing resulting from competing access to limited hardware resources. We use the recent experimental support for time protection, aimed at preventing covert channels, in the seL4 microkernel and evaluate the efficacy of the mechanisms agains… ▽ More Covert channels enable information leakage across security boundaries of the operating system. Microarchitectural covert channels exploit changes in execution timing resulting from competing access to limited hardware resources. We use the recent experimental support for time protection, aimed at preventing covert channels, in the seL4 microkernel and evaluate the efficacy of the mechanisms against five known channels on Ariane, an open-source 64-bit application-class RISC-V core. We confirm that without hardware support, these defences are expensive and incomplete. We show that the addition of a single-instruction extension to the RISC-V ISA, that flushes microarchitectural state, can enable the OS to close all five evaluated covert channels with low increase in context switch costs and negligible hardware overhead. We conclude that such a mechanism is essential for security. △ Less

Submitted 1 May, 2020; originally announced May 2020.

Comments: 6 pages, 7 figures, submitted to CARRV '20, additional appendix

Showing 1–17 of 17 results for author: Wistoff, N