-
Towards Query Optimizer as a Service (QOaaS) in a Unified LakeHouse Ecosystem: Can One QO Rule Them All?
Authors:
Rana Alotaibi,
Yuanyuan Tian,
Stefan Grafberger,
Jesús Camacho-Rodríguez,
Nicolas Bruno,
Brian Kroth,
Sergiy Matusevych,
Ashvin Agrawal,
Mahesh Behera,
Ashit Gosalia,
Cesar Galindo-Legaria,
Milind Joshi,
Milan Potocnik,
Beysim Sezgin,
Xiaoyu Li,
Carlo Curino
Abstract:
Customer demand, regulatory pressure, and engineering efficiency are the driving forces behind the industry-wide trend of moving from siloed engines and services that are optimized in isolation to highly integrated solutions. This is confirmed by the wide adoption of open formats, shared component libraries, and the meteoric success of integrated data lake experiences such as Microsoft Fabric.
In this paper, we study the implications of this trend for the Query Optimizer (QO) and discuss our experience of building Calcite and extending Cascades into QO components of Microsoft SQL Server, Fabric Data Warehouse (DW), and SCOPE. We weigh the pros and cons of a drastic change in direction: moving from bespoke QOs or library-sharing (à la Calcite) to rewriting the QO stack and fully embracing Query Optimizer as a Service (QOaaS). We report on some early successes and stumbles as we explore these ideas with prototypes compatible with Fabric DW and Spark. The benefits include centralized workload-level optimizations, multi-engine federation, and accelerated feature creation, but the challenges are equally daunting. We plan to engage the CIDR audience in a debate on this exciting topic.
Submitted 20 November, 2024;
originally announced November 2024.
-
Towards Building Autonomous Data Services on Azure
Authors:
Yiwen Zhu,
Yuanyuan Tian,
Joyce Cahoon,
Subru Krishnan,
Ankita Agarwal,
Rana Alotaibi,
Jesús Camacho-Rodríguez,
Bibin Chundatt,
Andrew Chung,
Niharika Dutta,
Andrew Fogarty,
Anja Gruenheid,
Brandon Haynes,
Matteo Interlandi,
Minu Iyer,
Nick Jurgens,
Sumeet Khushalani,
Brian Kroth,
Manoj Kumar,
Jyoti Leeka,
Sergiy Matusevych,
Minni Mittal,
Andreas Mueller,
Kartheek Muthyala,
Harsha Nagulapalli
, et al. (13 additional authors not shown)
Abstract:
Modern cloud has turned data services into easily accessible commodities. With just a few clicks, users are now able to access a catalog of data processing systems for a wide range of tasks. However, the cloud brings in both complexity and opportunity. While cloud users can quickly start an application by using various data services, it can be difficult to configure and optimize these services to gain the most value from them. For cloud providers, managing every aspect of an ever-increasing set of data services, while meeting customer SLAs and minimizing operational cost is becoming more challenging. Cloud technology enables the collection of significant amounts of workload traces and system telemetry. With the progress in data science (DS) and machine learning (ML), it is feasible and desirable to utilize a data-driven, ML-based approach to automate various aspects of data services, resulting in the creation of autonomous data services. This paper presents our perspectives and insights on creating autonomous data services on Azure. It also covers the future endeavors we plan to undertake and unresolved issues that still need attention.
Submitted 2 May, 2024;
originally announced May 2024.
-
Sibyl: Forecasting Time-Evolving Query Workloads
Authors:
Hanxian Huang,
Tarique Siddiqui,
Rana Alotaibi,
Carlo Curino,
Jyoti Leeka,
Alekh Jindal,
Jishen Zhao,
Jesus Camacho-Rodriguez,
Yuanyuan Tian
Abstract:
Database systems often rely on historical query traces to perform workload-based performance tuning. However, real production workloads are time-evolving, making historical queries ineffective for optimizing future workloads. To address this challenge, we propose SIBYL, an end-to-end machine learning-based framework that accurately forecasts a sequence of future queries, with the entire query statements, in various prediction windows. Drawing insights from real workloads, we propose template-based featurization techniques and develop a stacked-LSTM with an encoder-decoder architecture for accurate forecasting of query workloads. We also develop techniques to improve forecasting accuracy over large prediction windows and achieve high scalability over large workloads with high variability in arrival rates of queries. Finally, we propose techniques to handle workload drifts. Our evaluation on four real workloads demonstrates that SIBYL can forecast workloads with an $87.3\%$ median F1 score, and can result in $1.7\times$ and $1.3\times$ performance improvement when applied to materialized view selection and index selection applications, respectively.
Submitted 8 January, 2024;
originally announced January 2024.
-
GEqO: ML-Accelerated Semantic Equivalence Detection
Authors:
Brandon Haynes,
Rana Alotaibi,
Anna Pavlenko,
Jyoti Leeka,
Alekh Jindal,
Yuanyuan Tian
Abstract:
Large scale analytics engines have become a core dependency for modern data-driven enterprises to derive business insights and drive actions. These engines support a large number of analytic jobs processing huge volumes of data on a daily basis, and workloads are often inundated with overlapping computations across multiple jobs. Reusing common computation is crucial for efficient cluster resource utilization and reducing job execution time. Detecting common computation is the first and key step for reducing this computational redundancy. However, detecting equivalence on large-scale analytics engines requires efficient and scalable solutions that are fully automated. In addition, to maximize computation reuse, equivalence needs to be detected at the semantic level instead of just the syntactic level (i.e., the ability to detect semantic equivalence of seemingly different-looking queries). Unfortunately, existing solutions fall short of satisfying these requirements.
In this paper, we take a major step towards filling this gap by proposing GEqO, a portable and lightweight machine-learning-based framework for efficiently identifying semantically equivalent computations at scale. GEqO introduces two machine-learning-based filters that quickly prune out nonequivalent subexpressions and employs a semi-supervised learning feedback loop to iteratively improve its model with an intelligent sampling mechanism. Further, with its novel database-agnostic featurization method, GEqO can transfer the learning from one workload and database to another. Our extensive empirical evaluation shows that, on TPC-DS-like queries, GEqO yields significant performance gains (up to 200x faster than automated verifiers) and finds up to 2x more equivalences than optimizer and signature-based equivalence detection approaches.
Submitted 2 January, 2024;
originally announced January 2024.
-
Hybrid Classifiers for Spatio-temporal Real-time Abnormal Behaviors Detection, Tracking, and Recognition in Massive Hajj Crowds
Authors:
Tarik Alafif,
Anas Hadi,
Manal Allahyani,
Bander Alzahrani,
Areej Alhothali,
Reem Alotaibi,
Ahmed Barnawi
Abstract:
Individual abnormal behaviors vary depending on crowd sizes, contexts, and scenes. Challenges such as partial occlusions, blurring, large numbers of abnormal behaviors, and camera viewing occur in large-scale crowds when detecting, tracking, and recognizing individuals with abnormal behaviors. In this paper, our contribution is twofold. First, we introduce an annotated and labeled large-scale crowd abnormal behaviors Hajj dataset (HAJJv2). Second, we propose two methods of hybrid Convolutional Neural Networks (CNNs) and Random Forests (RFs) to detect and recognize spatio-temporal abnormal behaviors in small- and large-scale crowd videos. In small-scale crowd videos, a ResNet-50 pre-trained CNN model is fine-tuned to verify whether every frame is normal or abnormal in the spatial domain. If anomalous behaviors are observed, a motion-based individual detection method based on the magnitudes and orientations of Horn-Schunck optical flow is used to locate and track individuals with abnormal behaviors. A Kalman filter is employed in large-scale crowd videos to predict and track the detected individuals in the subsequent frames. Then, mean, variance, and standard deviation statistical features are computed and fed to the RF to classify individuals with abnormal behaviors in the temporal domain. In large-scale crowds, we fine-tune the ResNet-50 model using the YOLOv2 object detection technique to detect individuals with abnormal behaviors in the spatial domain.
Submitted 25 July, 2022;
originally announced July 2022.
-
HADAD: A Lightweight Approach for Optimizing Hybrid Complex Analytics Queries (Extended Version)
Authors:
Rana Alotaibi,
Bogdan Cautis,
Alin Deutsch,
Ioana Manolescu
Abstract:
Hybrid complex analytics workloads typically include (i) data management tasks (joins, selections, etc.), easily expressed using relational algebra (RA)-based languages, and (ii) complex analytics tasks (regressions, matrix decompositions, etc.), mostly expressed in linear algebra (LA) expressions. Such workloads are common in many application areas, including scientific computing, web analytics, and business recommendation. Existing solutions for evaluating hybrid analytical tasks (ranging from LA-oriented systems, to relational systems extended to handle LA operations, to hybrid systems) either optimize data management and complex tasks separately, exploit RA properties only while leaving LA-specific optimization opportunities unexploited, or focus heavily on physical optimization, leaving semantic query optimization opportunities unexplored. Additionally, they are not able to exploit precomputed (materialized) results to avoid recomputing (part of) a given mixed (RA and/or LA) computation. In this paper, we take a major step towards filling this gap by proposing HADAD, an extensible lightweight approach for optimizing hybrid complex analytics queries, based on a common abstraction that facilitates unified reasoning: a relational model endowed with integrity constraints. Our solution can be naturally and portably applied on top of pure LA and hybrid RA-LA platforms without modifying their internals. An extensive empirical evaluation shows that HADAD yields significant performance gains on diverse workloads, ranging from LA-centered to hybrid.
Submitted 23 March, 2021;
originally announced March 2021.
-
Property Graph Schema Optimization for Domain-Specific Knowledge Graphs
Authors:
Chuan Lei,
Rana Alotaibi,
Abdul Quamar,
Vasilis Efthymiou,
Fatma Özcan
Abstract:
Enterprises are creating domain-specific knowledge graphs by curating and integrating their business data from multiple sources. The data in these knowledge graphs can be described using ontologies, which provide a semantic abstraction to define the content in terms of the entities and the relationships of the domain. The rich semantic relationships in an ontology contain a variety of opportunities to reduce edge traversals and consequently improve the graph query performance. Although there has been a lot of effort to build systems that enable efficient querying over knowledge graphs, the problem of schema optimization for query performance has been largely ignored in the graph setting. In this work, we show that graph schema design has significant impact on query performance, and then propose optimization algorithms that exploit the opportunities from the domain ontology to generate efficient property graph schemas. To the best of our knowledge, we are the first to present an ontology-driven approach for property graph schema optimization. We conduct empirical evaluations with two real-world knowledge graphs from medical and financial domains. The results show that the schemas produced by the optimization algorithms achieve up to 2 orders of magnitude speed-up compared to the baseline approach.
Submitted 3 October, 2020; v1 submitted 25 March, 2020;
originally announced March 2020.
-
Arabic Text Watermarking: A Review
Authors:
Reem Ahmed Alotaibi,
Lamiaa A. Elrefaei
Abstract:
The use of the internet, with its technologies and applications, has increased rapidly. Protecting text from illegal use is therefore much needed, and text watermarking is used for this purpose. Arabic text has many characteristics, such as the existence of diacritics, kashida (extension character), and points above or under its letters. Each Arabic letter can take different shapes with different Unicode code points. These characteristics are utilized in the watermarking process. In this paper, several methods in the area of Arabic text watermarking are discussed along with their advantages and disadvantages. The methods are compared in terms of capacity, robustness, and imperceptibility.
Submitted 6 August, 2015;
originally announced August 2015.