-
DMSConfig: Automated Configuration Tuning for Distributed IoT Message Systems Using Deep Reinforcement Learning
Authors:
Zhuangwei Kang,
Yogesh D. Barve,
Shunxing Bao,
Abhishek Dubey,
Aniruddha Gokhale
Abstract:
The Distributed Messaging Systems (DMSs) used in IoT systems require timely and reliable data dissemination, which can be achieved through configurable parameters. However, the high-dimensional configuration space makes it difficult for users to find the best options that maximize application throughput while meeting specific latency constraints. Existing approaches to automatic software profiling have limitations, such as only optimizing throughput, not guaranteeing explicit latency limitations, and resulting in local optima due to discretizing parameter ranges. To overcome these challenges, a novel configuration tuning system called DMSConfig is proposed that uses machine learning and deep reinforcement learning. DMSConfig interacts with a data-driven environment prediction model, avoiding the cost of online interactions with the production environment. DMSConfig employs the deep deterministic policy gradient (DDPG) method and a custom reward mechanism to make configuration decisions based on predicted DMS states and performance. Experiments show that DMSConfig performs significantly better than the default configuration, is highly adaptive to serve tuning requests with different latency boundaries, and has similar throughput to prevalent parameter tuning tools with fewer latency violations.
Submitted 17 February, 2023;
originally announced February 2023.
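The abstract above mentions a custom reward mechanism that favors throughput while respecting a latency bound, but it does not give the formula. The following is only a minimal sketch of one way such a reward could be shaped, not the DMSConfig reward itself. The function name `constrained_reward`, its parameters, and the linear penalty are all assumptions made for illustration.

```python
def constrained_reward(throughput, latency, latency_bound,
                       max_throughput, penalty_weight=1.0):
    """Hypothetical reward: normalized throughput when the latency bound
    is met, otherwise a negative penalty proportional to the violation.

    This is an illustrative assumption, not the reward used by DMSConfig.
    """
    if latency_bound <= 0 or max_throughput <= 0:
        raise ValueError("latency_bound and max_throughput must be positive")

    if latency <= latency_bound:
        # Constraint satisfied: reward grows with throughput, capped at 1.0.
        return min(throughput / max_throughput, 1.0)

    # Constraint violated: penalize by the relative size of the overshoot.
    violation = (latency - latency_bound) / latency_bound
    return -penalty_weight * violation


if __name__ == "__main__":
    # Units are arbitrary here (e.g. messages/s and milliseconds).
    print(constrained_reward(500, 80, 100, 1000))
    print(constrained_reward(900, 150, 100, 1000))
```

Under this shaping, a configuration with high throughput that breaks the latency bound always scores lower than any configuration that meets it, which is one simple way to discourage latency violations during tuning.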
-
Risk-Aware Scene Sampling for Dynamic Assurance of Autonomous Systems
Authors:
Shreyas Ramakrishna,
Baiting Luo,
Yogesh Barve,
Gabor Karsai,
Abhishek Dubey
Abstract:
Autonomous Cyber-Physical Systems must often operate under uncertainties like sensor degradation and shifts in the operating conditions, which increase their operational risk. Dynamic Assurance of these systems requires designing runtime safety components like Out-of-Distribution detectors and risk estimators, which require labeled data from different operating modes of the system that belong to scenes with adverse operating conditions and sensor and actuator faults. Collecting real-world data of these scenes can be expensive and is sometimes not feasible. Scenario description languages with samplers like random and grid search are therefore available to generate synthetic data from simulators, replicating these real-world scenes. However, we point out three limitations in using these conventional samplers. First, they are passive samplers, which do not use the feedback of previous results in the sampling process. Second, the variables to be sampled may have constraints that are often not included. Third, they do not balance the tradeoff between exploration and exploitation, which we hypothesize is necessary for better search space coverage. We present a scene generation approach with two samplers called Random Neighborhood Search (RNS) and Guided Bayesian Optimization (GBO), which extend the conventional random search and Bayesian Optimization search to address these limitations. Also, to facilitate the samplers, we use a risk-based metric that evaluates how risky the scene was for the system. We demonstrate our approach using an Autonomous Vehicle example in CARLA simulation. To evaluate our samplers, we compared them against the baselines of random search, grid search, and Halton sequence search. Our RNS and GBO samplers sampled a higher percentage of high-risk scenes, 83% and 92% respectively, compared to 56%, 66% and 71% for the grid, random and Halton samplers, respectively.
Submitted 27 February, 2022;
originally announced February 2022.
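One of the baselines named in the abstract above is Halton sequence search, a passive low-discrepancy sampler. As a rough sketch of how such a baseline can draw scene parameters, the code below uses the standard radical-inverse construction. The scene variable names (`rain`, `cloud`) and their ranges are hypothetical and not taken from the paper.

```python
def radical_inverse(index, base):
    """Return the index-th element (1-based) of the van der Corput
    sequence in the given base, a value in [0, 1)."""
    result = 0.0
    fraction = 1.0 / base
    while index > 0:
        result += (index % base) * fraction
        index //= base
        fraction /= base
    return result


def halton_scenes(n, ranges, bases=(2, 3, 5, 7, 11, 13)):
    """Generate n scenes, using one prime base per scene variable.

    `ranges` maps a variable name to its (low, high) interval.
    """
    names = list(ranges)
    if len(names) > len(bases):
        raise ValueError("not enough prime bases for the number of variables")

    scenes = []
    for i in range(1, n + 1):
        scene = {}
        for name, base in zip(names, bases):
            low, high = ranges[name]
            # Scale the unit-interval Halton value into the variable's range.
            scene[name] = low + radical_inverse(i, base) * (high - low)
        scenes.append(scene)
    return scenes


if __name__ == "__main__":
    # Hypothetical weather variables, each on a 0-100 scale.
    for scene in halton_scenes(4, {"rain": (0, 100), "cloud": (0, 100)}):
        print(scene)
```

Because each point depends only on its index, this sampler ignores the outcome of earlier scenes, which is the "passive" behavior the abstract identifies as a limitation.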
-
FECBench: A Holistic Interference-aware Approach for Application Performance Modeling
Authors:
Yogesh D. Barve,
Shashank Shekhar,
Ajay Dev Chhokra,
Shweta Khare,
Anirban Bhattacharjee,
Zhuangwei Kang,
Hongyang Sun,
Aniruddha Gokhale
Abstract:
Services hosted in multi-tenant cloud platforms often encounter performance interference due to contention for non-partitionable resources, which in turn causes unpredictable behavior and degradation in application performance. To grapple with these problems and to define effective resource management solutions for their services, providers often must expend significant efforts and incur prohibitive costs in developing performance models of their services under a variety of interference scenarios on different hardware. This is a hard problem due to the wide range of possible co-located services and their workloads, and the growing heterogeneity in the runtime platforms including the use of fog and edge-based resources, not to mention the accidental complexity in performing application profiling under a variety of scenarios. To address these challenges, we present FECBench, a framework to guide providers in building performance interference prediction models for their services without incurring undue costs and efforts. The contributions of the paper are as follows. First, we developed a technique to build resource stressors that can stress multiple system resources all at once in a controlled manner to gain insights about the interference on an application's performance. Second, to overcome the need for exhaustive application profiling, FECBench intelligently uses the design of experiments (DoE) approach to enable users to build surrogate performance models of their services. Third, FECBench maintains an extensible knowledge base of application combinations that create resource stresses across the multi-dimensional resource design space. Empirical results using real-world scenarios to validate the efficacy of FECBench show that the predicted application performance has a median error of only 7.6% across all test cases, with 5.4% in the best case and 13.5% in the worst case.
Submitted 12 April, 2019; v1 submitted 11 April, 2019;
originally announced April 2019.
-
CloudCAMP: Automating Cloud Services Deployment and Management
Authors:
Anirban Bhattacharjee,
Yogesh Barve,
Aniruddha Gokhale,
Takayuki Kuroda
Abstract:
Users of cloud platforms often must expend significant manual effort in deploying and orchestrating their services, primarily because they have to deal with the high variability in the configuration options for virtualized environment setup and must meet the software dependencies of each service. Despite the emergence of many DevOps cloud automation and orchestration tools, users must still rely on specifying low-level scripting details for service deployment and management using Infrastructure-as-Code (IAC). Using these tools requires domain expertise and involves a steep learning curve. To address these challenges in a tool-and-technology agnostic manner, which helps promote interoperability and portability of services hosted across cloud platforms, we present initial ideas on a GUI-based cloud automation and orchestration framework called CloudCAMP. It incorporates domain-specific modeling, following the Model-Driven Engineering (MDE) paradigm, so that the specifications and dependencies imposed by the cloud platform and application architecture can be specified at an intuitive, higher level of abstraction without the need for domain expertise. CloudCAMP transforms the partial specifications into deployable Infrastructure-as-Code (IAC) using the Transformational-Generative paradigm and by leveraging an extensible and reusable knowledge base. The auto-generated IAC can be handled by existing tools to provision the service components automatically. We validate our approach quantitatively by showing a comparative study of savings in manual and scripting efforts versus using CloudCAMP.
Submitted 8 April, 2019; v1 submitted 3 April, 2019;
originally announced April 2019.
-
Stratum: A Serverless Framework for Lifecycle Management of Machine Learning based Data Analytics Tasks
Authors:
Anirban Bhattacharjee,
Yogesh Barve,
Shweta Khare,
Shunxing Bao,
Aniruddha Gokhale,
Thomas Damiano
Abstract:
With the proliferation of machine learning (ML) libraries and frameworks, and the programming languages that they use, along with operations of data loading, transformation, preparation and mining, ML model development is becoming a daunting task. Furthermore, with a plethora of cloud-based ML model development platforms, heterogeneity in hardware, increased focus on exploiting edge computing resources for low-latency prediction serving, and often a lack of a complete understanding of the resources required to execute ML workflows efficiently, ML model deployment demands expertise for managing the lifecycle of ML workflows efficiently and with minimal cost. To address these challenges, we propose Stratum, an end-to-end serverless data analytics platform. Stratum can deploy, schedule and dynamically manage data ingestion tools, live streaming apps, batch analytics tools, ML-as-a-service (for inference jobs), and visualization tools across the cloud-fog-edge spectrum. This paper describes the Stratum architecture, highlighting the problems it resolves.
Submitted 2 April, 2019;
originally announced April 2019.