-
BayesPerf: Minimizing Performance Monitoring Errors Using Bayesian Statistics
Authors:
Subho S. Banerjee,
Saurabh Jha,
Zbigniew T. Kalbarczyk,
Ravishankar K. Iyer
Abstract:
Hardware performance counters (HPCs) that measure low-level architectural and microarchitectural events provide dynamic contextual information about the state of the system. However, HPC measurements are error-prone due to non determinism (e.g., undercounting due to event multiplexing, or OS interrupt-handling behaviors). In this paper, we present BayesPerf, a system for quantifying uncertainty in…
▽ More
Hardware performance counters (HPCs) that measure low-level architectural and microarchitectural events provide dynamic contextual information about the state of the system. However, HPC measurements are error-prone due to non determinism (e.g., undercounting due to event multiplexing, or OS interrupt-handling behaviors). In this paper, we present BayesPerf, a system for quantifying uncertainty in HPC measurements by using a domain-driven Bayesian model that captures microarchitectural relationships between HPCs to jointly infer their values as probability distributions. We provide the design and implementation of an accelerator that allows for low-latency and low-power inference of the BayesPerf model for x86 and ppc64 CPUs. BayesPerf reduces the average error in HPC measurements from 40.1% to 7.6% when events are being multiplexed. The value of BayesPerf in real-time decision-making is illustrated with a simple example of scheduling of PCIe transfers.
△ Less
Submitted 22 February, 2021;
originally announced February 2021.
-
FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices
Authors:
Haoran Qiu,
Subho S. Banerjee,
Saurabh Jha,
Zbigniew T. Kalbarczyk,
Ravishankar K. Iyer
Abstract:
Modern user-facing latency-sensitive web services include numerous distributed, intercommunicating microservices that promise to simplify software development and operation. However, multiplexing of compute resources across microservices is still challenging in production because contention for shared resources can cause latency spikes that violate the service-level objectives (SLOs) of user reque…
▽ More
Modern user-facing latency-sensitive web services include numerous distributed, intercommunicating microservices that promise to simplify software development and operation. However, multiplexing of compute resources across microservices is still challenging in production because contention for shared resources can cause latency spikes that violate the service-level objectives (SLOs) of user requests. This paper presents FIRM, an intelligent fine-grained resource management framework for predictable sharing of resources across microservices to drive up overall utilization. FIRM leverages online telemetry data and machine-learning methods to adaptively (a) detect/localize microservices that cause SLO violations, (b) identify low-level resources in contention, and (c) take actions to mitigate SLO violations via dynamic reprovisioning. Experiments across four microservice benchmarks demonstrate that FIRM reduces SLO violations by up to 16x while reducing the overall requested CPU limit by up to 62%. Moreover, FIRM improves performance predictability by reducing tail latencies by up to 11x.
△ Less
Submitted 19 October, 2020; v1 submitted 19 August, 2020;
originally announced August 2020.
-
ML-driven Malware that Targets AV Safety
Authors:
Saurabh Jha,
Shengkun Cui,
Subho S. Banerjee,
Timothy Tsai,
Zbigniew Kalbarczyk,
Ravi Iyer
Abstract:
Ensuring the safety of autonomous vehicles (AVs) is critical for their mass deployment and public adoption. However, security attacks that violate safety constraints and cause accidents are a significant deterrent to achieving public trust in AVs, and that hinders a vendor's ability to deploy AVs. Creating a security hazard that results in a severe safety compromise (for example, an accident) is c…
▽ More
Ensuring the safety of autonomous vehicles (AVs) is critical for their mass deployment and public adoption. However, security attacks that violate safety constraints and cause accidents are a significant deterrent to achieving public trust in AVs, and that hinders a vendor's ability to deploy AVs. Creating a security hazard that results in a severe safety compromise (for example, an accident) is compelling from an attacker's perspective. In this paper, we introduce an attack model, a method to deploy the attack in the form of smart malware, and an experimental evaluation of its impact on production-grade autonomous driving software. We find that determining the time interval during which to launch the attack is{ critically} important for causing safety hazards (such as collisions) with a high degree of success. For example, the smart malware caused 33X more forced emergency braking than random attacks did, and accidents in 52.6% of the driving simulations.
△ Less
Submitted 12 June, 2020; v1 submitted 24 April, 2020;
originally announced April 2020.
-
Inductive-bias-driven Reinforcement Learning For Efficient Schedules in Heterogeneous Clusters
Authors:
Subho S Banerjee,
Saurabh Jha,
Zbigniew T. Kalbarczyk,
Ravishankar K. Iyer
Abstract:
The problem of scheduling of workloads onto heterogeneous processors (e.g., CPUs, GPUs, FPGAs) is of fundamental importance in modern data centers. Current system schedulers rely on application/system-specific heuristics that have to be built on a case-by-case basis. Recent work has demonstrated ML techniques for automating the heuristic search by using black-box approaches which require significa…
▽ More
The problem of scheduling of workloads onto heterogeneous processors (e.g., CPUs, GPUs, FPGAs) is of fundamental importance in modern data centers. Current system schedulers rely on application/system-specific heuristics that have to be built on a case-by-case basis. Recent work has demonstrated ML techniques for automating the heuristic search by using black-box approaches which require significant training data and time, which make them challenging to use in practice. This paper presents Symphony, a scheduling framework that addresses the challenge in two ways: (i) a domain-driven Bayesian reinforcement learning (RL) model for scheduling, which inherently models the resource dependencies identified from the system architecture; and (ii) a sampling-based technique to compute the gradients of a Bayesian model without performing full probabilistic inference. Together, these techniques reduce both the amount of training data and the time required to produce scheduling policies that significantly outperform black-box approaches by up to 2.2x.
△ Less
Submitted 30 June, 2020; v1 submitted 4 September, 2019;
originally announced September 2019.
-
ML-based Fault Injection for Autonomous Vehicles: A Case for Bayesian Fault Injection
Authors:
Saurabh Jha,
Subho S. Banerjee,
Timothy Tsai,
Siva K. S. Hari,
Michael B. Sullivan,
Zbigniew T. Kalbarczyk,
Stephen W. Keckler,
Ravishankar K. Iyer
Abstract:
The safety and resilience of fully autonomous vehicles (AVs) are of significant concern, as exemplified by several headline-making accidents. While AV development today involves verification, validation, and testing, end-to-end assessment of AV systems under accidental faults in realistic driving scenarios has been largely unexplored. This paper presents DriveFI, a machine learning-based fault inj…
▽ More
The safety and resilience of fully autonomous vehicles (AVs) are of significant concern, as exemplified by several headline-making accidents. While AV development today involves verification, validation, and testing, end-to-end assessment of AV systems under accidental faults in realistic driving scenarios has been largely unexplored. This paper presents DriveFI, a machine learning-based fault injection engine, which can mine situations and faults that maximally impact AV safety, as demonstrated on two industry-grade AV technology stacks (from NVIDIA and Baidu). For example, DriveFI found 561 safety-critical faults in less than 4 hours. In comparison, random injection experiments executed over several weeks could not find any safety-critical faults
△ Less
Submitted 1 July, 2019;
originally announced July 2019.
-
ASAP: Accelerated Short-Read Alignment on Programmable Hardware
Authors:
Subho S. Banerjee,
Mohamed El-Hadedy,
Jong Bin Lim,
Zbigniew T. Kalbarczyk,
Deming Chen,
Steve Lumetta,
Ravishankar K. Iyer
Abstract:
The proliferation of high-throughput sequencing machines ensures rapid generation of up to billions of short nucleotide fragments in a short period of time. This massive amount of sequence data can quickly overwhelm today's storage and compute infrastructure. This paper explores the use of hardware acceleration to significantly improve the runtime of short-read alignment, a crucial step in preproc…
▽ More
The proliferation of high-throughput sequencing machines ensures rapid generation of up to billions of short nucleotide fragments in a short period of time. This massive amount of sequence data can quickly overwhelm today's storage and compute infrastructure. This paper explores the use of hardware acceleration to significantly improve the runtime of short-read alignment, a crucial step in preprocessing sequenced genomes. We focus on the Levenshtein distance (edit-distance) computation kernel and propose the ASAP accelerator, which utilizes the intrinsic delay of circuits for edit-distance computation elements as a proxy for computation. Our design is implemented on an Xilinx Virtex 7 FPGA in an IBM POWER8 system that uses the CAPI interface for cache coherence across the CPU and FPGA. Our design is $200\times$ faster than the equivalent C implementation of the kernel running on the host processor and $2.2\times$ faster for an end-to-end alignment tool for 120-150 base-pair short-read sequences. Further the design represents a $3760\times$ improvement over the CPU in performance/Watt terms.
△ Less
Submitted 23 May, 2018; v1 submitted 6 March, 2018;
originally announced March 2018.