-
In-memory Incremental Maintenance of Provenance Sketches [extended version]
Authors:
Pengyuan Li,
Boris Glavic,
Dieter Gawlick,
Vasudha Krishnaswamy,
Zhen Hua Liu,
Danica Porobic,
Xing Niu
Abstract:
Provenance-based data skipping compactly over-approximates the provenance of a query using so-called provenance sketches and utilizes such sketches to speed up the execution of subsequent queries by skipping irrelevant data. However, a sketch captured at some time in the past may become stale if the data has been updated subsequently. Thus, there is a need to maintain provenance sketches. In this work, we introduce In-Memory incremental Maintenance of Provenance sketches (IMP), a framework for maintaining sketches incrementally under updates. At the core of IMP is an incremental query engine for data annotated with sketches that exploits the coarse-grained nature of sketches to enable novel optimizations. We experimentally demonstrate that IMP significantly reduces the cost of sketch maintenance, thereby enabling the use of provenance sketches for a broad range of workloads that involve updates.
Submitted 26 May, 2025;
originally announced May 2025.
-
Beyond the Classroom: Bridging the Gap Between Academia and Industry with a Hands-on Learning Approach
Authors:
Mingyang Xu,
Ryan Zheng He Liu,
Mark Stoodley,
Ladan Tahvildari
Abstract:
Modern software systems require various capabilities to meet architectural and operational demands, such as the ability to scale automatically and recover from sudden failures. Self-adaptive software systems have emerged as a critical focus in software design and operation due to their capacity to autonomously adapt to changing environments. However, education on this topic is scarce in academia, and a survey among practitioners identified that the lack of knowledgeable individuals has hindered its adoption in the industry. In this paper, we present our experience teaching a course on self-adaptive software systems that integrates theoretical knowledge and hands-on learning with industry-relevant technologies. To close the gap between academic education and industry practices, we incorporated guest lectures from experts and showcases featuring industry professionals as judges, improving our students' technical and communication skills. Feedback based on surveys from 21 students indicates significant improvements in their understanding of self-adaptive systems. The empirical analysis of the developed course demonstrates the effectiveness of the proposed course syllabus and teaching methodology. In addition, we provide a summary of the educational challenges of running this unique course, including balancing theory and practice, addressing the diverse backgrounds and motivations of students, and integrating industry-relevant technologies. We believe these insights can provide valuable guidance for educating students in other emerging topics within software engineering.
Submitted 14 April, 2025;
originally announced April 2025.
-
FlaKat: A Machine Learning-Based Categorization Framework for Flaky Tests
Authors:
Shizhe Lin,
Ryan Zheng He Liu,
Ladan Tahvildari
Abstract:
Flaky tests can pass or fail non-deterministically, without alterations to a software system. Such tests are frequently encountered by developers and hinder the credibility of test suites. State-of-the-art research incorporates machine learning solutions into flaky test detection and achieves reasonably good accuracy. Moreover, the majority of automated flaky test repair solutions are designed for specific types of flaky tests. This research work proposes a novel categorization framework, called FlaKat, which uses machine-learning classifiers for fast and accurate prediction of the category of a given flaky test that reflects its root cause. Sampling techniques are applied to address the imbalance between flaky test categories in the International Dataset of Flaky Test (IDoFT). A new evaluation metric, called Flakiness Detection Capacity (FDC), is proposed for measuring the accuracy of classifiers from the perspective of information theory, and a proof of its effectiveness is provided. The final FDC results are also in agreement with F1 score regarding which classifier yields the best flakiness classification.
Submitted 1 March, 2024;
originally announced March 2024.
-
MultiCategory: Multi-model Query Processing Meets Category Theory and Functional Programming
Authors:
Valter Uotila,
Jiaheng Lu,
Dieter Gawlick,
Zhen Hua Liu,
Souripriya Das,
Gregory Pogossiants
Abstract:
The variety of data is one of the important issues in the era of Big Data. The data are naturally organized in different formats and models, including structured data, semi-structured data, and unstructured data. Prior research has envisioned an approach to abstract multi-model data with a schema category and an instance category by using category theory. In this paper, we demonstrate a system, called MultiCategory, which processes multi-model queries based on category theory and functional programming. This demo is centered around four main scenarios to show a tangible system. First, we show how to build a schema category and an instance category by loading different models of data, including relational, XML, key-value, and graph data. Second, we show a few examples of query processing by using the functional programming language Haskell. Third, we demo the flexible outputs with different models of data for the same input query. Fourth, to better understand the category theoretical structure behind the queries, we offer a variety of graphical hooks to explore and visualize queries as graphs with respect to the schema category, as well as the query processing procedure with Haskell.
Submitted 30 August, 2021;
originally announced September 2021.
-
Heuristic and Cost-based Optimization for Diverse Provenance Tasks
Authors:
Xing Niu,
Raghav Kapoor,
Boris Glavic,
Dieter Gawlick,
Zhen Hua Liu,
Vasudha Krishnaswamy,
Venkatesh Radhakrishnan
Abstract:
A well-established technique for capturing database provenance as annotations on data is to instrument queries to propagate such annotations. However, even sophisticated query optimizers often fail to produce efficient execution plans for instrumented queries. We develop provenance-aware optimization techniques to address this problem. Specifically, we study algebraic equivalences targeted at instrumented queries and alternative ways of instrumenting queries for provenance capture. Furthermore, we present an extensible heuristic and cost-based optimization framework utilizing these optimizations. Our experiments confirm that these optimizations are highly effective, improving performance by several orders of magnitude for diverse provenance tasks.
Submitted 17 April, 2018;
originally announced April 2018.
-
Debugging Transactions and Tracking their Provenance with Reenactment
Authors:
Xing Niu,
Bahareh Sadat Arab,
Seokki Lee,
Su Feng,
Xun Zou,
Dieter Gawlick,
Vasudha Krishnaswamy,
Zhen Hua Liu,
Boris Glavic
Abstract:
Debugging transactions and understanding their execution are of immense importance for developing OLAP applications, to trace causes of errors in production systems, and to audit the operations of a database. However, debugging transactions is hard for several reasons: 1) after the execution of a transaction, its input is no longer available for debugging, 2) internal states of a transaction are typically not accessible, and 3) the execution of a transaction may be affected by concurrently running transactions. We present a debugger for transactions that enables non-invasive, post-mortem debugging of transactions with provenance tracking and supports what-if scenarios (changes to transaction code or data). Using reenactment, a declarative replay technique we have developed, a transaction is replayed over the state of the DB seen by its original execution including all its interactions with concurrently executed transactions from the history. Importantly, our approach uses the temporal database and audit logging capabilities available in many DBMS and does not require any modifications to the underlying database system or the transactional workload.
Submitted 31 July, 2017;
originally announced July 2017.
-
UDBMS: Road to Unification for Multi-model Data Management
Authors:
Jiaheng Lu,
Zhen Hua Liu,
Pengfei Xu,
Chao Zhang
Abstract:
A traditional database system is organized around a single data model that determines how data can be organized, stored and manipulated. But the vision of this paper is to develop new principles and techniques to manage multiple data models against a single, integrated backend. For example, semi-structured, graph and relational models are data models that may be supported by a new system. Having a single data platform for managing both well-structured data and NoSQL data is beneficial to users; this approach significantly reduces integration, migration, development, maintenance and operational issues. The problem is challenging: the existing database principles mainly work for a single model and the research on multi-model data management is still at an early stage. In this paper, we envision a UDBMS (Unified Database Management System) for multi-model data management in one platform. UDBMS will provide several new features such as unified data model and flexible schema, unified query processing, unified index structure and cross-model transaction guarantees. We discuss our vision as well as present multiple research challenges that we need to address.
Submitted 23 December, 2016;
originally announced December 2016.
-
Mimir: Bringing CTables into Practice
Authors:
Arindam Nandi,
Ying Yang,
Oliver Kennedy,
Boris Glavic,
Ronny Fehling,
Zhen Hua Liu,
Dieter Gawlick
Abstract:
The present state of the art in analytics requires high upfront investment of human effort and computational resources to curate datasets, even before the first query is posed. So-called pay-as-you-go data curation techniques allow these high costs to be spread out, first by enabling queries over uncertain and incomplete data, and then by assessing the quality of the query results. We describe the design of a system, called Mimir, around a recently introduced class of probabilistic pay-as-you-go data cleaning operators called Lenses. Mimir wraps around any deterministic database engine using JDBC, extending it with support for probabilistic query processing. Queries processed through Mimir produce uncertainty-annotated result cursors that allow client applications to quickly assess result quality and provenance. We also present a GUI that provides analysts with an interactive tool for exploring the uncertainty exposed by the system. Finally, we present optimizations that make Lenses scalable, and validate this claim through experimental evidence.
Submitted 1 January, 2016;
originally announced January 2016.