-
ShadowBinding: Realizing Effective Microarchitectures for In-Core Secure Speculation Schemes
Authors:
Amund Bergland Kvalsvik,
Magnus Själander
Abstract:
Secure speculation schemes have shown great promise in the war against speculative side-channel attacks, and will be a key building block for developing secure, high-performance architectures moving forward. As the field matures, the need for rigorous microarchitectures, and corresponding performance and cost analysis, become critical for evaluating secure schemes and for enabling their future ado…
▽ More
Secure speculation schemes have shown great promise in the war against speculative side-channel attacks, and will be a key building block for developing secure, high-performance architectures moving forward. As the field matures, the need for rigorous microarchitectures, and corresponding performance and cost analysis, become critical for evaluating secure schemes and for enabling their future adoption.
In ShadowBinding, we present effective microarchitectures for two state-of-the-art secure schemes, uncovering and mitigating fundamental microarchitectural limitations within the analyzed schemes, and provide important design characteristics. We uncover that Speculative Taint Tracking's (STT's) rename-based taint computation must be completed in a single cycle, creating an expensive dependency chain which greatly limits performance for wider processor cores. We also introduce a novel michroarchitectural approach for STT, named STT-Issue, which, by delaying the taint computation to the issue stage, eliminates the dependency chain, achieving better instructions per cycle (IPC), timing, area, and performance results.
Through a comprehensive evaluation of our STT and Non-Speculative Data Access (NDA) microarchitectural designs on the RISC-V Berkeley Out-of-Order Machine, we find that the IPC impact of in-core secure schemes is higher than previously estimated, close to 20% for the highest performance core. With insights into timing from our RTL evaluation, the performance loss, created by the combined impact of IPC and timing, becomes even greater, at 35%, 27%, and 22% for STT-Rename, STT-Issue, and NDA, respectively. If these trends were to hold for leading processor core designs, the performance impact would be well over 30%, even for the best-performing scheme.
△ Less
Submitted 9 April, 2025;
originally announced April 2025.
-
Optimizing Energy Efficiency in Subthreshold RISC-V Cores
Authors:
Asbjørn Djupdal,
Magnus Själander,
Magnus Jahre,
Snorre Aunet,
Trond Ytterdal
Abstract:
Our goal in this paper is to understand how to maximize energy efficiency when designing standard-ISA processor cores for subthreshold operation. We hence develop a custom subthreshold library and use it to synthesize the open-source RISC-V cores SERV, QERV, PicoRV32, Ibex, Rocket, and two variants of Vex, targeting a supply voltage of 300 mV in a commercial 130 nm process. SERV, QERV, and PicoRV3…
▽ More
Our goal in this paper is to understand how to maximize energy efficiency when designing standard-ISA processor cores for subthreshold operation. We hence develop a custom subthreshold library and use it to synthesize the open-source RISC-V cores SERV, QERV, PicoRV32, Ibex, Rocket, and two variants of Vex, targeting a supply voltage of 300 mV in a commercial 130 nm process. SERV, QERV, and PicoRV32 are multi-cycle architectures, while Ibex, Vex, and Rocket are pipelined architectures.
We find that SERV, QERV, PicoRV32, and Vex are Pareto optimal in one or more of performance, power, and area. The 2-stage Vex (Vex-2) is the most energy efficient core overall, mainly because it uses fewer cycles per instruction than multi-cycle SERV, QERV, and PicoRV32 while retaining similar power consumption. Pipelining increases core area, and we observe that for subthreshold operation, the longer wires of pipelined designs require adding buffers to maintain a cycle time that is low enough to achieve high energy efficiency. These buffers limit the performance gains achievable by deeper pipelining because they result in cycle time no longer scaling proportionally with pipeline stages. The added buffers and the additional area required for pipelining logic however increase power consumption, and Vex-2 therefore provides similar performance and lower power consumption than the 5-stage cores Vex-5 and Rocket. A key contribution of this paper is therefore to demonstrate that limited-depth pipelined RISC-V designs hit the sweet spot in balancing performance and power consumption when optimizing for energy efficiency in subthreshold operation.
△ Less
Submitted 10 February, 2025;
originally announced February 2025.
-
R-HLS: An IR for Dynamic High-Level Synthesis and Memory Disambiguation based on Regions and State Edges
Authors:
David Metz,
Nico Reissmann,
Magnus Själander
Abstract:
Dynamically scheduled hardware enables high-level synthesis (HLS) for applications with irregular control flow and latencies, which perform poorly with conventional statically scheduled approaches. Since dynamically scheduled hardware is inherently data flow based, it is beneficial to have an intermediate representation (IR) that captures the global data flow to enable easier transformations. Stat…
▽ More
Dynamically scheduled hardware enables high-level synthesis (HLS) for applications with irregular control flow and latencies, which perform poorly with conventional statically scheduled approaches. Since dynamically scheduled hardware is inherently data flow based, it is beneficial to have an intermediate representation (IR) that captures the global data flow to enable easier transformations. State-of-the-art dynamic HLS utilize control flow based IRs, which model data flow only at the basic block level, requiring the rediscovery of inter-block parallelism. The Regionalized Value State Dependence Graph (RVSDG) is an IR that models (1) control flow as part of the global data flow utilizing regions and (2) memory dependencies using state edges.
We propose R-HLS, a new RVSDG dialect targeted for dynamic high-level synthesis. R-HLS explicitly models control flow decisions, routing, and memory, which are only abstractly represented in the RVSDG. Expressing the control flow as part of the data flow reduces the need for complex optimizations to extract performance and enables easy conversion to parallel circuits. Furthermore, we present a distributed memory disambiguation optimization that leverages memory state edges to decouple address generation from data accesses, resulting in resource efficient out-of-program-order execution of memory operations. Our results show that R-HLS effectively exposes parallelism, resulting in fewer executed cycles and a 10% speedup on average, compared to the state-of-the-art in dynamic HLS with optimized memory disambiguation. These results are achieved with a significant reduction in resource utilization, such as a 79% reduction in lookup-tables and 22% reduction in flip-flops, on average.
△ Less
Submitted 16 August, 2024;
originally announced August 2024.
-
CATBench: A Compiler Autotuning Benchmarking Suite for Black-box Optimization
Authors:
Jacob O. Tørring,
Carl Hvarfner,
Luigi Nardi,
Magnus Själander
Abstract:
Bayesian optimization is a powerful method for automating tuning of compilers. The complex landscape of autotuning provides a myriad of rarely considered structural challenges for black-box optimizers, and the lack of standardized benchmarks has limited the study of Bayesian optimization within the domain. To address this, we present CATBench, a comprehensive benchmarking suite that captures the c…
▽ More
Bayesian optimization is a powerful method for automating tuning of compilers. The complex landscape of autotuning provides a myriad of rarely considered structural challenges for black-box optimizers, and the lack of standardized benchmarks has limited the study of Bayesian optimization within the domain. To address this, we present CATBench, a comprehensive benchmarking suite that captures the complexities of compiler autotuning, ranging from discrete, conditional, and permutation parameter types to known and unknown binary constraints, as well as both multi-fidelity and multi-objective evaluations. The benchmarks in CATBench span a range of machine learning-oriented computations, from tensor algebra to image processing and clustering, and uses state-of-the-art compilers, such as TACO and RISE/ELEVATE. CATBench offers a unified interface for evaluating Bayesian optimization algorithms, promoting reproducibility and innovation through an easy-to-use, fully containerized setup of both surrogate and real-world compiler optimization tasks. We validate CATBench on several state-of-the-art algorithms, revealing their strengths and weaknesses and demonstrating the suite's potential for advancing both Bayesian optimization and compiler autotuning research.
△ Less
Submitted 8 April, 2025; v1 submitted 24 June, 2024;
originally announced June 2024.
-
"It's a Trap!"-How Speculation Invariance Can Be Abused with Forward Speculative Interference
Authors:
Pavlos Aimoniotis,
Christos Sakalis,
Magnus Själander,
Stefanos Kaxiras
Abstract:
Speculative side-channel attacks access sensitive data and use transmitters to leak the data during wrong-path execution. Various defenses have been proposed to prevent such information leakage. However, not all speculatively executed instructions are unsafe: Recent work demonstrates that speculation invariant instructions are independent of speculative control-flow paths and are guaranteed to eve…
▽ More
Speculative side-channel attacks access sensitive data and use transmitters to leak the data during wrong-path execution. Various defenses have been proposed to prevent such information leakage. However, not all speculatively executed instructions are unsafe: Recent work demonstrates that speculation invariant instructions are independent of speculative control-flow paths and are guaranteed to eventually commit, regardless of the speculation outcome. Compile-time information coupled with run-time mechanisms can then selectively lift defenses for speculation invariant instructions, reclaiming some of the lost performance.
Unfortunately, speculation invariant instructions can easily be manipulated by a form of speculative interference to leak information via a new side-channel that we introduce in this paper. We show that forward speculative interference whereolder speculative instructions interfere with younger speculation invariant instructions effectively turns them into transmitters for secret data accessed during speculation. We demonstrate forward speculative interference on actual hardware, by selectively filling the reorder buffer (ROB) with instructions, pushing speculative invariant instructions in-or-out of the ROB on demand, based on a speculatively accessed secret. This reveals the speculatively accessed secret, as the occupancy of the ROB itself becomes a new speculative side-channel.
△ Less
Submitted 2 December, 2022; v1 submitted 22 September, 2021;
originally announced September 2021.
-
Selectively Delaying Instructions to Prevent Microarchitectural Replay Attacks
Authors:
Christos Sakalis,
Stefanos Kaxiras,
Magnus Själander
Abstract:
MicroScope, and microarchitectural replay attacks in general, take advantage of the characteristics of speculative execution to trap the execution of the victim application in an infinite loop, enabling the attacker to amplify a side-channel attack by executing it indefinitely. Due to the nature of the replay, it can be used to effectively attack security critical trusted execution environments (s…
▽ More
MicroScope, and microarchitectural replay attacks in general, take advantage of the characteristics of speculative execution to trap the execution of the victim application in an infinite loop, enabling the attacker to amplify a side-channel attack by executing it indefinitely. Due to the nature of the replay, it can be used to effectively attack security critical trusted execution environments (secure enclaves), even under conditions where a side-channel attack would not be possible. At the same time, unlike speculative side-channel attacks, MicroScope can be used to amplify the correct path of execution, rendering many existing speculative side-channel defences ineffective.
In this work, we generalize microarchitectural replay attacks beyond MicroScope and present an efficient defence against them. We make the observation that such attacks rely on repeated squashes of so-called "replay handles" and that the instructions causing the side-channel must reside in the same reorder buffer window as the handles. We propose Delay-on-Squash, a technique for tracking squashed instructions and preventing them from being replayed by speculative replay handles. Our evaluation shows that it is possible to achieve full security against microarchitectural replay attacks with very modest hardware requirements, while still maintaining 97% of the insecure baseline performance.
△ Less
Submitted 19 March, 2021;
originally announced March 2021.
-
On Value Recomputation to Accelerate Invisible Speculation
Authors:
Christos Sakalis,
Zamshed I. Chowdhury,
Shayne Wadle,
Ismail Akturk,
Alberto Ros,
Magnus Själander,
Stefanos Kaxiras,
Ulya R. Karpuzcu
Abstract:
Recent architectural approaches that address speculative side-channel attacks aim to prevent software from exposing the microarchitectural state changes of transient execution. The Delay-on-Miss technique is one such approach, which simply delays loads that miss in the L1 cache until they become non-speculative, resulting in no transient changes in the memory hierarchy. However, this costs perform…
▽ More
Recent architectural approaches that address speculative side-channel attacks aim to prevent software from exposing the microarchitectural state changes of transient execution. The Delay-on-Miss technique is one such approach, which simply delays loads that miss in the L1 cache until they become non-speculative, resulting in no transient changes in the memory hierarchy. However, this costs performance, prompting the use of value prediction (VP) to regain some of the delay. However, the problem cannot be solved by simply introducing a new kind of speculation (value prediction). Value-predicted loads have to be validated, which cannot be commenced until the load becomes non-speculative. Thus, value-predicted loads occupy the same amount of precious core resources (e.g., reorder buffer entries) as Delay-on-Miss. The end result is that VP only yields marginal benefits over Delay-on-Miss. In this paper, our insight is that we can achieve the same goal as VP (increasing performance by providing the value of loads that miss) without incurring its negative side-effect (delaying the release of precious resources), if we can safely, non-speculatively, recompute a value in isolation (without being seen from the outside), so that we do not expose any information by transferring such a value via the memory hierarchy. Value Recomputation, which trades computation for data transfer was previously proposed in an entirely different context: to reduce energy-expensive data transfers in the memory hierarchy. In this paper, we demonstrate the potential of value recomputation in relation to the Delay-on-Miss approach of hiding speculation, discuss the trade-offs, and show that we can achieve the same level of security, reaching 93% of the unsecured baseline performance (5% higher than Delay-on-miss), and exceeding (by 3%) what even an oracular (100% accuracy and coverage) value predictor could do.
△ Less
Submitted 22 February, 2021;
originally announced February 2021.
-
EPIC: An Energy-Efficient, High-Performance GPGPU Computing Research Infrastructure
Authors:
Magnus Själander,
Magnus Jahre,
Gunnar Tufte,
Nico Reissmann
Abstract:
The pursuit of many research questions requires massive computational resources. State-of-the-art research in physical processes using simulations, the training of neural networks for deep learning, or the analysis of big data are all dependent on the availability of sufficient and performant computational resources. For such research, access to a high-performance computing infrastructure is indis…
▽ More
The pursuit of many research questions requires massive computational resources. State-of-the-art research in physical processes using simulations, the training of neural networks for deep learning, or the analysis of big data are all dependent on the availability of sufficient and performant computational resources. For such research, access to a high-performance computing infrastructure is indispensable. Many scientific workloads from such research domains are inherently parallel and can benefit from the data-parallel architecture of general purpose graphics processing units (GPGPUs). However, GPGPU resources are scarce at Norway's national infrastructure. EPIC is a GPGPU enabled computing research infrastructure at NTNU. It enables NTNU's researchers to perform experiments that otherwise would be impossible, as time-to-solution would simply take too long.
△ Less
Submitted 5 July, 2024; v1 submitted 12 December, 2019;
originally announced December 2019.
-
RVSDG: An Intermediate Representation for Optimizing Compilers
Authors:
Nico Reissmann,
Jan Christian Meyer,
Helge Bahmann,
Magnus Själander
Abstract:
Intermediate Representations (IRs) are central to optimizing compilers as the way the program is represented may enhance or limit analyses and transformations. Suitable IRs focus on exposing the most relevant information and establish invariants that different compiler passes can rely on. While control-flow centric IRs appear to be a natural fit for imperative programming languages, analyses requi…
▽ More
Intermediate Representations (IRs) are central to optimizing compilers as the way the program is represented may enhance or limit analyses and transformations. Suitable IRs focus on exposing the most relevant information and establish invariants that different compiler passes can rely on. While control-flow centric IRs appear to be a natural fit for imperative programming languages, analyses required by compilers have increasingly shifted to understand data dependencies and work at multiple abstraction layers at the same time. This is partially evidenced in recent developments such as the MLIR proposed by Google. However, rigorous use of data flow centric IRs in general purpose compilers has not been evaluated for feasibility and usability as previous works provide no practical implementations. We present the Regionalized Value State Dependence Graph (RVSDG) IR for optimizing compilers. The RVSDG is a data flow centric IR where nodes represent computations, edges represent computational dependencies, and regions capture the hierarchical structure of programs. It represents programs in demand-dependence form, implicitly supports structured control flow, and models entire programs within a single IR. We provide a complete specification of the RVSDG, construction and destruction methods, as well as exemplify its utility by presenting Dead Node and Common Node Elimination optimizations. We implemented a prototype compiler and evaluate it in terms of performance, code size, compilation time, and representational overhead. Our results indicate that the RVSDG can serve as a competitive IR in optimizing compilers while reducing complexity.
△ Less
Submitted 17 March, 2020; v1 submitted 10 December, 2019;
originally announced December 2019.
-
Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing
Authors:
Yaman Umuroglu,
Davide Conficconi,
Lahiru Rasnayake,
Thomas B. Preusser,
Magnus Sjalander
Abstract:
Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lends itself well to high-performance implementations. Many matrix multiplication-dependent applications can use reduced-precision integer or fixed-point representations to increase their performance and energy efficiency while still offerin…
▽ More
Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lends itself well to high-performance implementations. Many matrix multiplication-dependent applications can use reduced-precision integer or fixed-point representations to increase their performance and energy efficiency while still offering adequate quality of results. However, precision requirements may vary between different application phases or depend on input data, rendering constant-precision solutions ineffective. BISMO, a vectorized bit-serial matrix multiplication overlay for reconfigurable computing, previously utilized the excellent binary-operation performance of FPGAs to offer a matrix multiplication performance that scales with required precision and parallelism. We show how BISMO can be scaled up on Xilinx FPGAs using an arithmetic architecture that better utilizes 6-LUTs. The improved BISMO achieves a peak performance of 15.4 binary TOPS on the Ultra96 board with a Xilinx UltraScale+ MPSoC.
△ Less
Submitted 11 June, 2019; v1 submitted 2 January, 2019;
originally announced January 2019.
-
BISMO: A Scalable Bit-Serial Matrix Multiplication Overlay for Reconfigurable Computing
Authors:
Yaman Umuroglu,
Lahiru Rasnayake,
Magnus Sjalander
Abstract:
Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lends itself well to high-performance implementations. Many matrix multiplication-dependent applications can use reduced-precision integer or fixed-point representations to increase their performance and energy efficiency while still offerin…
▽ More
Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lends itself well to high-performance implementations. Many matrix multiplication-dependent applications can use reduced-precision integer or fixed-point representations to increase their performance and energy efficiency while still offering adequate quality of results. However, precision requirements may vary between different application phases or depend on input data, rendering constant-precision solutions ineffective. We present BISMO, a vectorized bit-serial matrix multiplication overlay for reconfigurable computing. BISMO utilizes the excellent binary-operation performance of FPGAs to offer a matrix multiplication performance that scales with required precision and parallelism. We characterize the resource usage and performance of BISMO across a range of parameters to build a hardware cost model, and demonstrate a peak performance of 6.5 TOPS on the Xilinx PYNQ-Z1 board.
△ Less
Submitted 22 June, 2018;
originally announced June 2018.