-
Targeted Wearout Attacks in Microprocessor Cores
Authors:
Joshua Mashburn,
Johann Knechtel,
Florian Klemme,
Hussam Amrouch,
Ozgur Sinanoglu,
Paul V. Gratz
Abstract:
Negative-Bias Temperature Instability is a dominant aging mechanism in nanoscale CMOS circuits such as microprocessors. With this aging mechanism, the rate of device aging is dependent not only on overall operating conditions, such as heat, but also on user controllable inputs to the transistors. This dependence on input implies a possible timing fault-injection attack wherein a targeted path of l…
▽ More
Negative-Bias Temperature Instability is a dominant aging mechanism in nanoscale CMOS circuits such as microprocessors. With this aging mechanism, the rate of device aging is dependent not only on overall operating conditions, such as heat, but also on user controllable inputs to the transistors. This dependence on input implies a possible timing fault-injection attack wherein a targeted path of logic is intentionally degraded through the purposeful, software-driven actions of an attacker, rendering a targeted bit effectively stuck.
In this work, we describe such an attack mechanism, which we dub a "$\textbf{Targeted Wearout Attack}$", wherein an attacker with sufficient knowledge of the processor core, executing a carefully crafted software program with only user privilege, is able to degrade a functional unit within the processor with the aim of eliciting a particular desired incorrect calculation in a victim application. Here we give a general methodology for the attack. We then demonstrate a case study where a targeted path within the fused multiply-add pipeline in a RISC-V CPU sees a $>7x$ increase in wear over time than would be experienced under typical workloads. We show that an attacker could leverage such an attack, leading to targeted and silent data corruption in a co-running victim application using the same unit.
△ Less
Submitted 22 August, 2025;
originally announced August 2025.
-
Exposing Shadow Branches
Authors:
Chrysanthos Pepi,
Bhargav Reddy Godala,
Krishnam Tibrewala,
Gino Chacon,
Paul V. Gratz,
Daniel A. Jiménez,
Gilles A. Pokam,
David I. August
Abstract:
Modern processors implement a decoupled front-end in the form of Fetch Directed Instruction Prefetching (FDIP) to avoid front-end stalls. FDIP is driven by the Branch Prediction Unit (BPU), relying on the BPU's accuracy and branch target tracking structures to speculatively fetch instructions into the Instruction Cache (L1I). As data center applications become more complex, their code footprints a…
▽ More
Modern processors implement a decoupled front-end in the form of Fetch Directed Instruction Prefetching (FDIP) to avoid front-end stalls. FDIP is driven by the Branch Prediction Unit (BPU), relying on the BPU's accuracy and branch target tracking structures to speculatively fetch instructions into the Instruction Cache (L1I). As data center applications become more complex, their code footprints also grow, resulting in an increase in Branch Target Buffer (BTB) misses. FDIP can alleviate L1I cache misses, but when it encounters a BTB miss, the BPU may not identify the current instruction as a branch to FDIP. This can prevent FDIP from prefetching or cause it to speculate down the wrong path, further polluting the L1I cache. We observe that the vast majority, 75%, of BTB-missing, unidentified branches are actually present in instruction cache lines that FDIP has previously fetched but, these missing branches have not yet been decoded and inserted into the BTB. This is because the instruction line is decoded from an entry point (which is the target of the previous taken branch) till an exit point (the taken branch). Branch instructions present in the ignored portion of the cache line we call them "Shadow Branches". Here we present Skeia, a novel shadow branch decoding technique that identifies and decodes unused bytes in cache lines fetched by FDIP, inserting them into a Shadow Branch Buffer (SBB). The SBB is accessed in parallel with the BTB, allowing FDIP to speculate despite a BTB miss. With a minimal storage state of 12.25KB, Skeia delivers a geomean speedup of ~5.7% over an 8K-entry BTB (78KB) and ~2% versus adding an equal amount of state to the BTB across 16 front-end bound applications. Since many branches stored in the SBB are unique compared to those in a similarly sized BTB, we consistently observe greater performance gains with Skeia across all examined sizes until saturation.
△ Less
Submitted 19 December, 2024; v1 submitted 22 August, 2024;
originally announced August 2024.
-
Correct Wrong Path
Authors:
Bhargav Reddy Godala,
Sankara Prasad Ramesh,
Krishnam Tibrewala,
Chrysanthos Pepi,
Gino Chacon,
Svilen Kanev,
Gilles A. Pokam,
Daniel A. Jiménez,
Paul V. Gratz,
David I. August
Abstract:
Modern OOO CPUs have very deep pipelines with large branch misprediction recovery penalties. Speculatively executed instructions on the wrong path can significantly change cache state, depending on speculation levels. Architects often employ trace-driven simulation models in the design exploration stage, which sacrifice precision for speed. Trace-driven simulators are orders of magnitude faster th…
▽ More
Modern OOO CPUs have very deep pipelines with large branch misprediction recovery penalties. Speculatively executed instructions on the wrong path can significantly change cache state, depending on speculation levels. Architects often employ trace-driven simulation models in the design exploration stage, which sacrifice precision for speed. Trace-driven simulators are orders of magnitude faster than execution-driven models, reducing the often hundreds of thousands of simulation hours needed to explore new micro-architectural ideas. Despite this strong benefit of trace-driven simulation, these often fail to adequately model the consequences of wrong path because obtaining them is nontrivial. Prior works consider either a positive or negative impact of wrong path but not both. Here, we examine wrong path execution in simulation results and design a set of infrastructure for enabling wrong-path execution in a trace driven simulator. Our analysis shows the wrong path affects structures on both the instruction and data sides extensively, resulting in performance variations ranging from $-3.05$\% to $20.9$\% when ignoring wrong path. To benefit the research community and enhance the accuracy of simulators, we opened our traces and tracing utility in the hopes that industry can provide wrong-path traces generated by their internal simulators, enabling academic simulation without exposing industry IP.
△ Less
Submitted 11 August, 2024;
originally announced August 2024.
-
Flow Correlator: A Flow Table Cache Management Strategy
Authors:
Luke McHale,
Paul V Gratz,
Alex Sprintson
Abstract:
Switching, routing, and security functions are the backbone of packet processing networks. Fast and efficient processing of packets requires maintaining the state of a large number of transient network connections. In particular, modern stateful firewalls, security monitoring devices, and software-defined networking (SDN) programmable dataplanes require maintaining stateful flow tables. These flow…
▽ More
Switching, routing, and security functions are the backbone of packet processing networks. Fast and efficient processing of packets requires maintaining the state of a large number of transient network connections. In particular, modern stateful firewalls, security monitoring devices, and software-defined networking (SDN) programmable dataplanes require maintaining stateful flow tables. These flow tables often grow much larger than can be expected to fit within on-chip memory, requiring a managed caching layer to maintain performance. This paper focuses on improving the efficiency of caching, an important architectural component of the packet processing data planes. We present a novel predictive approach to network flow table cache management. Our approach leverages a Hashed Perceptron binary classifier as well as an iterative approach to feature selection and ranking to improve the reliability and performance of the data plane caches. We validate the efficiency of the proposed techniques through extensive experimentation using real-world data sets. Our numerical results demonstrate that our techniques improve the reliability and performance of flow-centric packet processing architectures.
△ Less
Submitted 4 May, 2023;
originally announced May 2023.
-
The Championship Simulator: Architectural Simulation for Education and Competition
Authors:
Nathan Gober,
Gino Chacon,
Lei Wang,
Paul V. Gratz,
Daniel A. Jimenez,
Elvira Teran,
Seth Pugsley,
Jinchun Kim
Abstract:
Recent years have seen a dramatic increase in the microarchitectural complexity of processors. This increase in complexity presents a twofold challenge for the field of computer architecture. First, no individual architect can fully comprehend the complexity of the entire microarchitecture of the core. This leads to increasingly specialized architects, who treat parts of the core outside their par…
▽ More
Recent years have seen a dramatic increase in the microarchitectural complexity of processors. This increase in complexity presents a twofold challenge for the field of computer architecture. First, no individual architect can fully comprehend the complexity of the entire microarchitecture of the core. This leads to increasingly specialized architects, who treat parts of the core outside their particular expertise as black boxes. Second, with increasing complexity, the field becomes decreasingly accessible to new students of the field. When learning core microarchitecture, new students must first learn the big picture of how the system works in order to understand how the pieces all fit together. The tools used to study microarchitecture experience a similar struggle. As with the microarchitectures they simulate, an increase in complexity reduces accessibility to new users.
In this work, we present ChampSim. ChampSim uses a modular design and configurable structure to achieve a low barrier to entry into the field of microarchitecural simulation. ChampSim has shown itself to be useful in multiple areas of research, competition, and education. In this way, we seek to promote access and inclusion despite the increasing complexity of the field of computer architecture.
△ Less
Submitted 25 October, 2022;
originally announced October 2022.
-
Hardware Trojan Threats to Cache Coherence in Modern 2.5D Chiplet Systems
Authors:
Gino A. Chacon,
Charles Williams,
Johann Knechtel,
Ozgur Sinanoglu,
Paul V. Gratz
Abstract:
As industry moves toward chiplet-based designs, the insertion of hardware Trojans poses a significant threat to the security of these systems. These systems rely heavily on cache coherence for coherent data communication, making coherence an attractive target. Critically, unlike prior work, which focuses only on malicious packet modifications, a Trojan attack that exploits coherence can modify dat…
▽ More
As industry moves toward chiplet-based designs, the insertion of hardware Trojans poses a significant threat to the security of these systems. These systems rely heavily on cache coherence for coherent data communication, making coherence an attractive target. Critically, unlike prior work, which focuses only on malicious packet modifications, a Trojan attack that exploits coherence can modify data in memory that was never touched and is not owned by the chiplet which contains the Trojan. Further, the Trojan need not even be physically between the victim and the memory controller to attack the victim's memory transactions. Here, we explore the fundamental attack vectors possible in chiplet-based systems and provide an example Trojan implementation capable of directly modifying victim data in memory. This work aims to highlight the need for developing mechanisms that can protect and secure the coherence scheme from these forms of attacks.
△ Less
Submitted 30 September, 2022;
originally announced October 2022.
-
FPGA-based Hyrbid Memory Emulation System
Authors:
Fei Wen,
Mian Qin,
Paul V. Gratz,
A. L. Narasimha Reddy
Abstract:
Hybrid memory systems, comprised of emerging non-volatile memory (NVM) and DRAM, have been proposed to address the growing memory demand of applications. Emerging NVM technologies, such as phase-change memories (PCM), memristor, and 3D XPoint, have higher capacity density, minimal static power consumption and lower cost per GB. However, NVM has longer access latency and limited write endurance as…
▽ More
Hybrid memory systems, comprised of emerging non-volatile memory (NVM) and DRAM, have been proposed to address the growing memory demand of applications. Emerging NVM technologies, such as phase-change memories (PCM), memristor, and 3D XPoint, have higher capacity density, minimal static power consumption and lower cost per GB. However, NVM has longer access latency and limited write endurance as opposed to DRAM. The different characteristics of two memory classes point towards the design of hybrid memory systems containing multiple classes of main memory.
In the iterative and incremental development of new architectures, the timeliness of simulation completion is critical to project progression. Hence, a highly efficient simulation method is needed to evaluate the performance of different hybrid memory system designs. Design exploration for hybrid memory systems is challenging, because it requires emulation of the full system stack, including the OS, memory controller, and interconnect. Moreover, benchmark applications for memory performance test typically have much larger working sets, thus taking even longer simulation warm-up period.
In this paper, we propose a FPGA-based hybrid memory system emulation platform. We target at the mobile computing system, which is sensitive to energy consumption and is likely to adopt NVM for its power efficiency. Here, because the focus of our platform is on the design of the hybrid memory system, we leverage the on-board hard IP ARM processors to both improve simulation performance while improving accuracy of the results. Thus, users can implement their data placement/migration policies with the FPGA logic elements and evaluate new designs quickly and effectively. Results show that our emulation platform provides a speedup of 9280x in simulation time compared to the software counterpart Gem5.
△ Less
Submitted 9 November, 2020;
originally announced November 2020.