Search | arXiv e-print repository

arXiv:2505.20683 [pdf, ps, other]

In-memory Incremental Maintenance of Provenance Sketches [extended version]

Authors: Pengyuan Li, Boris Glavic, Dieter Gawlick, Vasudha Krishnaswamy, Zhen Hua Liu, Danica Porobic, Xing Niu

Abstract: Provenance-based data skipping compactly over-approximates the provenance of a query using so-called provenance sketches and utilizes such sketches to speed up the execution of subsequent queries by skipping irrelevant data. However, a sketch captured at some time in the past may become stale if the data has been updated subsequently. Thus, there is a need to maintain provenance sketches. In this work, we introduce In-Memory incremental Maintenance of Provenance sketches (IMP), a framework for maintaining sketches incrementally under updates. At the core of IMP is an incremental query engine for data annotated with sketches that exploits the coarse-grained nature of sketches to enable novel optimizations. We experimentally demonstrate that IMP significantly reduces the cost of sketch maintenance, thereby enabling the use of provenance sketches for a broad range of workloads that involve updates.

Submitted 26 May, 2025; originally announced May 2025.

arXiv:2504.10726 [pdf, other]

Beyond the Classroom: Bridging the Gap Between Academia and Industry with a Hands-on Learning Approach

Authors: Mingyang Xu, Ryan Zheng He Liu, Mark Stoodley, Ladan Tahvildari

Abstract: Modern software systems require various capabilities to meet architectural and operational demands, such as the ability to scale automatically and recover from sudden failures. Self-adaptive software systems have emerged as a critical focus in software design and operation due to their capacity to autonomously adapt to changing environments. However, education on this topic is scarce in academia, and a survey among practitioners identified that the lack of knowledgeable individuals has hindered its adoption in the industry. In this paper, we present our experience teaching a course on self-adaptive software systems that integrates theoretical knowledge and hands-on learning with industry-relevant technologies. To close the gap between academic education and industry practices, we incorporated guest lectures from experts and showcases featuring industry professionals as judges, improving technical and communication skills for our students. Feedback based on surveys from 21 students indicates significant improvements in their understanding of self-adaptive systems. The empirical analysis of the developed course demonstrates the effectiveness of the proposed course syllabus and teaching methodology. In addition, we provide a summary of the educational challenges of running this unique course, including balancing theory and practice, addressing the diverse backgrounds and motivations of students, and integrating the industry-relevant technologies. We believe these insights can provide valuable guidance for educating students in other emerging topics within software engineering.

Submitted 14 April, 2025; originally announced April 2025.

Comments: Accepted by the 2025 IEEE/ACM 37th International Conference on Software Engineering Education and Training (CSEE&T)

arXiv:2403.01003 [pdf, other]

FlaKat: A Machine Learning-Based Categorization Framework for Flaky Tests

Authors: Shizhe Lin, Ryan Zheng He Liu, Ladan Tahvildari

Abstract: Flaky tests can pass or fail non-deterministically, without alterations to a software system. Such tests are frequently encountered by developers and hinder the credibility of test suites. State-of-the-art research incorporates machine learning solutions into flaky test detection and achieves reasonably good accuracy. Moreover, the majority of automated flaky test repair solutions are designed for specific types of flaky tests. This research work proposes a novel categorization framework, called FlaKat, which uses machine-learning classifiers for fast and accurate prediction of the category of a given flaky test that reflects its root cause. Sampling techniques are applied to address the imbalance between flaky test categories in the International Dataset of Flaky Tests (IDoFT). A new evaluation metric, called Flakiness Detection Capacity (FDC), is proposed for measuring the accuracy of classifiers from the perspective of information theory, and proof of its effectiveness is provided. The final FDC results are also in agreement with F1 score regarding which classifier yields the best flakiness classification.

Submitted 1 March, 2024; originally announced March 2024.

arXiv:2109.00929 [pdf, other]

doi 10.14778/3476311.3476314

MultiCategory: Multi-model Query Processing Meets Category Theory and Functional Programming

Authors: Valter Uotila, Jiaheng Lu, Dieter Gawlick, Zhen Hua Liu, Souripriya Das, Gregory Pogossiants

Abstract: The variety of data is one of the important issues in the era of Big Data. The data are naturally organized in different formats and models, including structured data, semi-structured data, and unstructured data. Prior research has envisioned an approach to abstract multi-model data with a schema category and an instance category by using category theory. In this paper, we demonstrate a system, called MultiCategory, which processes multi-model queries based on category theory and functional programming. This demo is centered around four main scenarios to show a tangible system. First, we show how to build a schema category and an instance category by loading different models of data, including relational, XML, key-value, and graph data. Second, we show a few examples of query processing by using the functional programming language Haskell. Third, we demo the flexible outputs with different models of data for the same input query. Fourth, to better understand the category theoretical structure behind the queries, we offer a variety of graphical hooks to explore and visualize queries as graphs with respect to the schema category, as well as the query processing procedure with Haskell.

Submitted 30 August, 2021; originally announced September 2021.

Comments: VLDB'21 Demonstration paper, 4 pages, 6 figures

Journal ref: Proceedings of the VLDB Endowment, Vol. 14, No. 12: 2663-2666, 2021

arXiv:1804.07156 [pdf, other]

Heuristic and Cost-based Optimization for Diverse Provenance Tasks

Authors: Xing Niu, Raghav Kapoor, Boris Glavic, Dieter Gawlick, Zhen Hua Liu, Vasudha Krishnaswamy, Venkatesh Radhakrishnan

Abstract: A well-established technique for capturing database provenance as annotations on data is to instrument queries to propagate such annotations. However, even sophisticated query optimizers often fail to produce efficient execution plans for instrumented queries. We develop provenance-aware optimization techniques to address this problem. Specifically, we study algebraic equivalences targeted at instrumented queries and alternative ways of instrumenting queries for provenance capture. Furthermore, we present an extensible heuristic and cost-based optimization framework utilizing these optimizations. Our experiments confirm that these optimizations are highly effective, improving performance by several orders of magnitude for diverse provenance tasks.

Submitted 17 April, 2018; originally announced April 2018.

Comments: IEEE Transactions on Knowledge and Data Engineering (TKDE), 2018, long version, 31 pages. arXiv admin note: substantial text overlap with arXiv:1701.05513

Journal ref: IEEE Transactions on Knowledge and Data Engineering (TKDE), 2018

arXiv:1707.09930 [pdf, other]

Debugging Transactions and Tracking their Provenance with Reenactment

Authors: Xing Niu, Bahareh Sadat Arab, Seokki Lee, Su Feng, Xun Zou, Dieter Gawlick, Vasudha Krishnaswamy, Zhen Hua Liu, Boris Glavic

Abstract: Debugging transactions and understanding their execution are of immense importance for developing OLAP applications, to trace causes of errors in production systems, and to audit the operations of a database. However, debugging transactions is hard for several reasons: 1) after the execution of a transaction, its input is no longer available for debugging, 2) internal states of a transaction are typically not accessible, and 3) the execution of a transaction may be affected by concurrently running transactions. We present a debugger for transactions that enables non-invasive, post-mortem debugging of transactions with provenance tracking and supports what-if scenarios (changes to transaction code or data). Using reenactment, a declarative replay technique we have developed, a transaction is replayed over the state of the DB seen by its original execution including all its interactions with concurrently executed transactions from the history. Importantly, our approach uses the temporal database and audit logging capabilities available in many DBMS and does not require any modifications to the underlying database system or to the transactional workload.

Submitted 31 July, 2017; originally announced July 2017.

Comments: to appear as "Debugging Transactions and Tracking their Provenance with Reenactment" in PVLDB 2017, vol. 10, no. 12

ACM Class: H.2

arXiv:1612.08050 [pdf, other]

UDBMS: Road to Unification for Multi-model Data Management

Authors: Jiaheng Lu, Zhen Hua Liu, Pengfei Xu, Chao Zhang

Abstract: A traditional database system is organized around a single data model that determines how data can be organized, stored and manipulated. But the vision of this paper is to develop new principles and techniques to manage multiple data models against a single, integrated backend. For example, semi-structured, graph and relational models are examples of data models that may be supported by a new system. Having a single data platform for managing both well-structured data and NoSQL data is beneficial to users; this approach significantly reduces integration, migration, development, maintenance and operational issues. The problem is challenging: the existing database principles mainly work for a single model and the research on multi-model data management is still at an early stage. In this paper, we envision a UDBMS (Unified Database Management System) for multi-model data management in one platform. UDBMS will provide several new features such as unified data model and flexible schema, unified query processing, unified index structure and cross-model transaction guarantees. We discuss our vision as well as present multiple research challenges that we need to address.

Submitted 23 December, 2016; originally announced December 2016.

arXiv:1601.00073 [pdf, other]

Mimir: Bringing CTables into Practice

Authors: Arindam Nandi, Ying Yang, Oliver Kennedy, Boris Glavic, Ronny Fehling, Zhen Hua Liu, Dieter Gawlick

Abstract: The present state of the art in analytics requires high upfront investment of human effort and computational resources to curate datasets, even before the first query is posed. So-called pay-as-you-go data curation techniques allow these high costs to be spread out, first by enabling queries over uncertain and incomplete data, and then by assessing the quality of the query results. We describe the design of a system, called Mimir, around a recently introduced class of probabilistic pay-as-you-go data cleaning operators called Lenses. Mimir wraps around any deterministic database engine using JDBC, extending it with support for probabilistic query processing. Queries processed through Mimir produce uncertainty-annotated result cursors that allow client applications to quickly assess result quality and provenance. We also present a GUI that provides analysts with an interactive tool for exploring the uncertainty exposed by the system. Finally, we present optimizations that make Lenses scalable, and validate this claim through experimental evidence.

Submitted 1 January, 2016; originally announced January 2016.

Comments: Under submission; the first two authors should be considered joint first authors

Showing 1–8 of 8 results for author: Liu, Z H