Towards Building Autonomous Data Services on Azure
Authors:
Yiwen Zhu,
Yuanyuan Tian,
Joyce Cahoon,
Subru Krishnan,
Ankita Agarwal,
Rana Alotaibi,
Jesús Camacho-Rodríguez,
Bibin Chundatt,
Andrew Chung,
Niharika Dutta,
Andrew Fogarty,
Anja Gruenheid,
Brandon Haynes,
Matteo Interlandi,
Minu Iyer,
Nick Jurgens,
Sumeet Khushalani,
Brian Kroth,
Manoj Kumar,
Jyoti Leeka,
Sergiy Matusevych,
Minni Mittal,
Andreas Mueller,
Kartheek Muthyala,
Harsha Nagulapalli,
et al. (13 additional authors not shown)
Abstract:
The modern cloud has turned data services into easily accessible commodities. With just a few clicks, users can now access a catalog of data processing systems for a wide range of tasks. However, the cloud brings both complexity and opportunity. While cloud users can quickly start an application by using various data services, it can be difficult to configure and optimize these services to gain the most value from them. For cloud providers, managing every aspect of an ever-increasing set of data services, while meeting customer SLAs and minimizing operational cost, is becoming more challenging. Cloud technology enables the collection of significant amounts of workload traces and system telemetry. With the progress in data science (DS) and machine learning (ML), it is feasible and desirable to use a data-driven, ML-based approach to automate various aspects of data services, resulting in the creation of autonomous data services. This paper presents our perspectives and insights on creating autonomous data services on Azure. It also covers the future work we plan to undertake and the unresolved issues that still need attention.
Submitted 2 May, 2024;
originally announced May 2024.
KEA: Tuning an Exabyte-Scale Data Infrastructure
Authors:
Yiwen Zhu,
Subru Krishnan,
Konstantinos Karanasos,
Isha Tarte,
Conor Power,
Abhishek Modi,
Manoj Kumar,
Deli Zhang,
Kartheek Muthyala,
Nick Jurgens,
Sarvesh Sakalanaga,
Sudhir Darbha,
Minu Iyer,
Ankita Agarwal,
Carlo Curino
Abstract:
Microsoft's internal big-data infrastructure is one of the largest in the world -- with over 300k machines running billions of tasks from over 0.6M daily jobs. Operating this infrastructure is a costly and complex endeavor, and efficiency is paramount. In fact, for over 15 years, a dedicated engineering team has tuned almost every aspect of this infrastructure, achieving state-of-the-art efficiency (>60% average CPU utilization across all clusters). Despite rich telemetry and strong expertise, faced with evolving hardware, software, and workloads, this manual tuning approach had reached its limit -- we had plateaued.
In this paper, we present KEA, a multi-year effort to automate our tuning processes to be fully data/model-driven. KEA leverages a mix of domain knowledge and principled data science to capture the essence of our cluster dynamic behavior in a set of machine learning (ML) models based on collected system data. These models power automated optimization procedures for parameter tuning, and inform our leadership in critical decisions around engineering and capacity management (such as hardware and data center design, software investments, etc.). We combine "observational" tuning (i.e., using models to predict system behavior without direct experimentation) with judicious use of "flighting" (i.e., conservative testing in production). This allows us to support a broad range of applications that we discuss in this paper.
KEA continuously tunes our cluster configurations and is on track to save Microsoft tens of millions of dollars per year. To the best of our knowledge, this paper is the first to discuss research challenges and practical learnings that emerge when tuning an exabyte-scale data infrastructure.
Submitted 21 June, 2021;
originally announced June 2021.