-
Efficient Approximation of Certain and Possible Answers for Ranking and Window Queries over Uncertain Data (Extended version)
Authors:
Su Feng,
Boris Glavic,
Oliver Kennedy
Abstract:
Uncertainty arises naturally inmany application domains due to, e.g., data entry errors and ambiguity in data cleaning. Prior work in incomplete and probabilistic databases has investigated the semantics and efficient evaluation of ranking and top-k queries over uncertain data. However, most approaches deal with top-k and ranking in isolation and do represent uncertain input data and query results…
▽ More
Uncertainty arises naturally inmany application domains due to, e.g., data entry errors and ambiguity in data cleaning. Prior work in incomplete and probabilistic databases has investigated the semantics and efficient evaluation of ranking and top-k queries over uncertain data. However, most approaches deal with top-k and ranking in isolation and do represent uncertain input data and query results using separate, incompatible datamodels. We present an efficient approach for under- and over-approximating results of ranking, top-k, and window queries over uncertain data. Our approach integrates well with existing techniques for querying uncertain data, is efficient, and is to the best of our knowledge the first to support windowed aggregation. We design algorithms for physical operators for uncertain sorting and windowed aggregation, and implement them in PostgreSQL.We evaluated our approach on synthetic and real world datasets, demonstrating that it outperforms all competitors, and often produces more accurate results.
△ Less
Submitted 3 May, 2023; v1 submitted 16 February, 2023;
originally announced February 2023.
-
Computing expected multiplicities for bag-TIDBs with bounded multiplicities
Authors:
Su Feng,
Boris Glavic,
Aaron Huber,
Oliver Kennedy,
Atri Rudra
Abstract:
In this work, we study the problem of computing a tuple's expected multiplicity over probabilistic databases with bag semantics (where each tuple is associated with a multiplicity) exactly and approximately. We consider bag-TIDBs where we have a bound $c$ on the maximum multiplicity of each tuple and tuples are independent probabilistic events (we refer to such databases as c-TIDBs. We are specifi…
▽ More
In this work, we study the problem of computing a tuple's expected multiplicity over probabilistic databases with bag semantics (where each tuple is associated with a multiplicity) exactly and approximately. We consider bag-TIDBs where we have a bound $c$ on the maximum multiplicity of each tuple and tuples are independent probabilistic events (we refer to such databases as c-TIDBs. We are specifically interested in the fine-grained complexity of computing expected multiplicities and how it compares to the complexity of deterministic query evaluation algorithms -- if these complexities are comparable, it opens the door to practical deployment of probabilistic databases. Unfortunately, our results imply that computing expected multiplicities for c-TIDBs based on the results produced by such query evaluation algorithms introduces super-linear overhead (under parameterized complexity hardness assumptions/conjectures). We proceed to study approximation of expected result tuple multiplicities for positive relational algebra queries ($RA^+$) over c-TIDBs and for a non-trivial subclass of block-independent databases (BIDBs). We develop a sampling algorithm that computes a 1$\pmε$ approximation of the expected multiplicity of an output tuple in time linear in the runtime of the corresponding deterministic query for any $RA^+$ query.
△ Less
Submitted 1 July, 2022; v1 submitted 6 April, 2022;
originally announced April 2022.
-
TreeToaster: Towards an IVM-Optimized Compiler
Authors:
Darshana Balakrishnan,
Carl Nuessle,
Oliver Kennedy,
Lukasz Ziarek
Abstract:
A compiler's optimizer operates over abstract syntax trees (ASTs), continuously applying rewrite rules to replace subtrees of the AST with more efficient ones. Especially on large source repositories, even simply finding opportunities for a rewrite can be expensive, as optimizer traverses the AST naively. In this paper, we leverage the need to repeatedly find rewrites, and explore options for maki…
▽ More
A compiler's optimizer operates over abstract syntax trees (ASTs), continuously applying rewrite rules to replace subtrees of the AST with more efficient ones. Especially on large source repositories, even simply finding opportunities for a rewrite can be expensive, as optimizer traverses the AST naively. In this paper, we leverage the need to repeatedly find rewrites, and explore options for making the search faster through indexing and incremental view maintenance (IVM). Concretely, we consider bolt-on approaches that make use of embedded IVM systems like DBToaster, as well as two new approaches: Label-indexing and TreeToaster, an AST-specialized form of IVM. We integrate these approaches into an existing just-in-time data structure compiler and show experimentally that TreeToaster can significantly improve performance with minimal memory overheads.
△ Less
Submitted 2 April, 2021;
originally announced April 2021.
-
Efficient Uncertainty Tracking for Complex Queries with Attribute-level Bounds (extended version)
Authors:
Su Feng,
Aaron Huber,
Boris Glavic,
Oliver Kennedy
Abstract:
Certain answers are a principled method for coping with the uncertainty that arises in many practical data management tasks. Unfortunately, this method is expensive and may exclude useful (if uncertain) answers. Prior work introduced Uncertainty Annotated Databases (UA-DBs), which combine an under- and over-approximation of certain answers. UA-DBs combine the reliability of certain answers based o…
▽ More
Certain answers are a principled method for coping with the uncertainty that arises in many practical data management tasks. Unfortunately, this method is expensive and may exclude useful (if uncertain) answers. Prior work introduced Uncertainty Annotated Databases (UA-DBs), which combine an under- and over-approximation of certain answers. UA-DBs combine the reliability of certain answers based on incomplete K-relations with the performance of classical deterministic database systems. However, UA-DBs only support a limited class of queries and do not support attribute-level uncertainty which can lead to inaccurate under-approximations of certain answers. In this paper, we introduce attribute-annotated uncertain databases (AU-DBs) which extend the UA-DB model with attribute-level annotations that record bounds on the values of an attribute across all possible worlds. This enables more precise approximations of incomplete databases. Furthermore, we extend UA-DBs to encode an compact over-approximation of possible answers which is necessary to support non-monotone queries including aggregation and set difference. We prove that query processing over AU-DBs preserves the bounds of certain and possible answers and investigate algorithms for compacting intermediate results to retain efficiency. Through an compact encoding of possible answers, our approach also provides a solid foundation for handling missing data. Using optimizations that trade accuracy for performance, our approach scales to complex queries and large datasets, and produces accurate results. Furthermore, it significantly outperforms alternative methods for uncertain data management.
△ Less
Submitted 23 February, 2021;
originally announced February 2021.
-
Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers (extended version)
Authors:
Su Feng,
Aaron Huber,
Boris Glavic,
Oliver Kennedy
Abstract:
Certain answers are a principled method for coping with uncertainty that arises in many practical data management tasks. Unfortunately, this method is expensive and may exclude useful (if uncertain) answers. Thus, users frequently resort to less principled approaches to resolve the uncertainty. In this paper, we propose Uncertainty Annotated Databases (UA-DBs), which combine an under- and over-app…
▽ More
Certain answers are a principled method for coping with uncertainty that arises in many practical data management tasks. Unfortunately, this method is expensive and may exclude useful (if uncertain) answers. Thus, users frequently resort to less principled approaches to resolve the uncertainty. In this paper, we propose Uncertainty Annotated Databases (UA-DBs), which combine an under- and over-approximation of certain answers to achieve the reliability of certain answers, with the performance of a classical database system. Furthermore, in contrast to prior work on certain answers, UA-DBs achieve a higher utility by including some (explicitly marked) answers that are not certain. UA-DBs are based on incomplete K-relations, which we introduce to generalize the classical set-based notions of incomplete databases and certain answers to a much larger class of data models. Using an implementation of our approach, we demonstrate experimentally that it efficiently produces tight approximations of certain answers that are of high utility.
△ Less
Submitted 30 March, 2019;
originally announced April 2019.
-
Just-in-Time Index Compilation
Authors:
Darshana Balakrishnan,
Lukasz Ziarek,
Oliver Kennedy
Abstract:
Creating or modifying a primary index is a time-consuming process, as the index typically needs to be rebuilt from scratch. In this paper, we explore a more graceful "just-in-time" approach to index reorganization, where small changes are dynamically applied in the background. To enable this type of reorganization, we formalize a composable organizational grammar, expressive enough to capture inst…
▽ More
Creating or modifying a primary index is a time-consuming process, as the index typically needs to be rebuilt from scratch. In this paper, we explore a more graceful "just-in-time" approach to index reorganization, where small changes are dynamically applied in the background. To enable this type of reorganization, we formalize a composable organizational grammar, expressive enough to capture instances of not only existing index structures, but arbitrary hybrids as well. We introduce an algebra of rewrite rules for such structures, and a framework for defining and optimizing policies for just-in-time rewriting. Our experimental analysis shows that the resulting index structure is flexible enough to adapt to a variety of performance goals, while also remaining competitive with existing structures like the C++ standard template library map.
△ Less
Submitted 22 January, 2019;
originally announced January 2019.
-
Query Log Compression for Workload Analytics
Authors:
Ting Xie,
Oliver Kennedy,
Varun Chandola
Abstract:
Analyzing database access logs is a key part of performance tuning, intrusion detection, benchmark development, and many other database administration tasks. Unfortunately, it is common for production databases to deal with millions or even more queries each day, so these logs must be summarized before they can be used. Designing an appropriate summary encoding requires trading off between concise…
▽ More
Analyzing database access logs is a key part of performance tuning, intrusion detection, benchmark development, and many other database administration tasks. Unfortunately, it is common for production databases to deal with millions or even more queries each day, so these logs must be summarized before they can be used. Designing an appropriate summary encoding requires trading off between conciseness and information content. For example: simple workload sampling may miss rare, but high impact queries. In this paper, we present LogR, a lossy log compression scheme suitable use for many automated log analytics tools, as well as for human inspection. We formalize and analyze the space/fidelity trade-off in the context of a broader family of "pattern" and "pattern mixture" log encodings to which LogR belongs. We show through a series of experiments that LogR compressed encodings can be created efficiently, come with provable information-theoretic bounds on their accuracy, and outperform state-of-art log summarization strategies.
△ Less
Submitted 29 September, 2018; v1 submitted 2 September, 2018;
originally announced September 2018.
-
Summarizing Large Query Logs in Ettu
Authors:
Gokhan Kul,
Duc Luong,
Ting Xie,
Patrick Coonan,
Varun Chandola,
Oliver Kennedy,
Shambhu Upadhyaya
Abstract:
Database access logs are large, unwieldy, and hard for humans to inspect and summarize. In spite of this, they remain the canonical go-to resource for tasks ranging from performance tuning to security auditing. In this paper, we address the challenge of compactly encoding large sequences of SQL queries for presentation to a human user. Our approach is based on the Weisfeiler-Lehman (WL) approximat…
▽ More
Database access logs are large, unwieldy, and hard for humans to inspect and summarize. In spite of this, they remain the canonical go-to resource for tasks ranging from performance tuning to security auditing. In this paper, we address the challenge of compactly encoding large sequences of SQL queries for presentation to a human user. Our approach is based on the Weisfeiler-Lehman (WL) approximate graph isomorphism algorithm, which identifies salient features of a graph or in our case of an abstract syntax tree. Our generalization of WL allows us to define a distance metric for SQL queries, which in turn permits automated clustering of queries. We also present two techniques for visualizing query clusters, and an algorithm that allows these visualizations to be constructed at interactive speeds. Finally, we evaluate our algorithms in the context of a motivating example: insider threat detection at a large US bank. We show experimentally on real world query logs that (a) our distance metric captures a meaningful notion of similarity, and (b) the log summarization process is scalable and performant.
△ Less
Submitted 2 August, 2016;
originally announced August 2016.
-
Communicating Data Quality in On-Demand Curation
Authors:
Poonam Kumari,
Said Achmiz,
Oliver Kennedy
Abstract:
On-demand curation (ODC) tools like Paygo, KATARA, and Mimir allow users to defer expensive curation effort until it is necessary. In contrast to classical databases that do not respond to queries over potentially erroneous data, ODC systems instead answer with guesses or approximations. The quality and scope of these guesses may vary and it is critical that an ODC system be able to communicate th…
▽ More
On-demand curation (ODC) tools like Paygo, KATARA, and Mimir allow users to defer expensive curation effort until it is necessary. In contrast to classical databases that do not respond to queries over potentially erroneous data, ODC systems instead answer with guesses or approximations. The quality and scope of these guesses may vary and it is critical that an ODC system be able to communicate this information to an end-user. The central contribution of this paper is a preliminary user study evaluating the cognitive burden and expressiveness of four representations of "attribute-level" uncertainty. The study shows (1) insignificant differences in time taken for users to interpret the four types of uncertainty tested, and (2) that different presentations of uncertainty change the way people interpret and react to data. Ultimately, we show that a set of UI design guidelines and best practices for conveying uncertainty will be necessary for ODC tools to be effective. This paper represents the first step towards establishing such guidelines.
△ Less
Submitted 7 June, 2016;
originally announced June 2016.
-
The Exception that Improves the Rule
Authors:
Juliana Freire,
Boris Glavic,
Oliver Kennedy,
Heiko Mueller
Abstract:
The database community has developed numerous tools and techniques for data curation and exploration, from declarative languages, to specialized techniques for data repair, and more. Yet, there is currently no consensus on how to best expose these powerful tools to an analyst in a simple, intuitive, and above all, flexible way. Thus, analysts continue to rely on tools such as spreadsheets, imperat…
▽ More
The database community has developed numerous tools and techniques for data curation and exploration, from declarative languages, to specialized techniques for data repair, and more. Yet, there is currently no consensus on how to best expose these powerful tools to an analyst in a simple, intuitive, and above all, flexible way. Thus, analysts continue to rely on tools such as spreadsheets, imperative languages, and notebook style programming environments like Jupyter for data curation. In this work, we explore the integration of spreadsheets, notebooks, and relational databases. We focus on a key advantage that both spreadsheets and imperative notebook environments have over classical relational databases: ease of exception. By relying on set-at-a-time operations, relational databases sacrifice the ability to easily define singleton operations, exceptions to a normal data processing workflow that affect query processing for a fixed set of explicitly targeted records. In comparison, a spreadsheet user can easily change the formula for just one cell, while a notebook user can add an imperative operation to her notebook that alters an output 'view'. We believe that enabling such idiosyncratic manual transformations in a classical relational database is critical for curation, as curation operations that are easy to declare for individual values can often be extremely challenging to generalize. We explore the challenges of enabling singletons in relational databases, propose a hybrid spreadsheet/relational notebook environment for data curation, and present our vision of Vizier, a system that exposes data curation through such an interface.
△ Less
Submitted 31 May, 2016;
originally announced June 2016.
-
Mimir: Bringing CTables into Practice
Authors:
Arindam Nandi,
Ying Yang,
Oliver Kennedy,
Boris Glavic,
Ronny Fehling,
Zhen Hua Liu,
Dieter Gawlick
Abstract:
The present state of the art in analytics requires high upfront investment of human effort and computational resources to curate datasets, even before the first query is posed. So-called pay-as-you-go data curation techniques allow these high costs to be spread out, first by enabling queries over uncertain and incomplete data, and then by assessing the quality of the query results. We describe the…
▽ More
The present state of the art in analytics requires high upfront investment of human effort and computational resources to curate datasets, even before the first query is posed. So-called pay-as-you-go data curation techniques allow these high costs to be spread out, first by enabling queries over uncertain and incomplete data, and then by assessing the quality of the query results. We describe the design of a system, called Mimir, around a recently introduced class of probabilistic pay-as-you-go data cleaning operators called Lenses. Mimir wraps around any deterministic database engine using JDBC, extending it with support for probabilistic query processing. Queries processed through Mimir produce uncertainty-annotated result cursors that allow client applications to quickly assess result quality and provenance. We also present a GUI that provides analysts with an interactive tool for exploring the uncertainty exposed by the system. Finally, we present optimizations that make Lenses scalable, and validate this claim through experimental evidence.
△ Less
Submitted 1 January, 2016;
originally announced January 2016.
-
BarQL: Collaborating Through Change
Authors:
Oliver Kennedy,
Lukasz Ziarek
Abstract:
Applications such as Google Docs, Office 365, and Dropbox show a growing trend towards incorporating multi-user live collaboration functionality into web applications. These collaborative applications share a need to efficiently express shared state, and a common strategy for doing so is a shared log abstraction. Extensive research efforts on log abstractions by the database, programming languages…
▽ More
Applications such as Google Docs, Office 365, and Dropbox show a growing trend towards incorporating multi-user live collaboration functionality into web applications. These collaborative applications share a need to efficiently express shared state, and a common strategy for doing so is a shared log abstraction. Extensive research efforts on log abstractions by the database, programming languages, and distributed systems communities have identified a variety of optimization techniques based on the algebraic properties of updates (i.e., pairwise commutativity, subsumption, and idempotence). Although these techniques have been applied to specific applications and use-cases, to the best of our knowledge, no attempt has been made to create a general framework for such optimizations in the context of a non-trivial update language. In this paper, we introduce mutation languages, a low-level framework for reasoning about the algebraic properties of state updates, or mutations. We define BarQL, a general purpose state-update language, and show how mutation languages allow us to reason about the algebraic properties of updates expressed in BarQ L .
△ Less
Submitted 18 March, 2013;
originally announced March 2013.
-
DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views
Authors:
Yanif Ahmad,
Oliver Kennedy,
Christoph Koch,
Milos Nikolic
Abstract:
Applications ranging from algorithmic trading to scientific data analysis require realtime analytics based on views over databases that change at very high rates. Such views have to be kept fresh at low maintenance cost and latencies. At the same time, these views have to support classical SQL, rather than window semantics, to enable applications that combine current with aged or historical data.…
▽ More
Applications ranging from algorithmic trading to scientific data analysis require realtime analytics based on views over databases that change at very high rates. Such views have to be kept fresh at low maintenance cost and latencies. At the same time, these views have to support classical SQL, rather than window semantics, to enable applications that combine current with aged or historical data. In this paper, we present viewlet transforms, a recursive finite differencing technique applied to queries. The viewlet transform materializes a query and a set of its higher-order deltas as views. These views support each other's incremental maintenance, leading to a reduced overall view maintenance cost. The viewlet transform of a query admits efficient evaluation, the elimination of certain expensive query operations, and aggressive parallelization. We develop viewlet transforms into a workable query execution technique, present a heuristic and cost-based optimization framework, and report on experiments with a prototype dynamic data management system that combines viewlet transforms with an optimizing compilation technique. The system supports tens of thousands of complete view refreshes a second for a wide range of queries.
△ Less
Submitted 30 June, 2012;
originally announced July 2012.
-
Inventory Allocation for Online Graphical Display Advertising
Authors:
Jian Yang,
Erik Vee,
Sergei Vassilvitskii,
John Tomlin,
Jayavel Shanmugasundaram,
Tasos Anastasakos,
Oliver Kennedy
Abstract:
We discuss a multi-objective/goal programming model for the allocation of inventory of graphical advertisements. The model considers two types of campaigns: guaranteed delivery (GD), which are sold months in advance, and non-guaranteed delivery (NGD), which are sold using real-time auctions. We investigate various advertiser and publisher objectives such as (a) revenue from the sale of impressions…
▽ More
We discuss a multi-objective/goal programming model for the allocation of inventory of graphical advertisements. The model considers two types of campaigns: guaranteed delivery (GD), which are sold months in advance, and non-guaranteed delivery (NGD), which are sold using real-time auctions. We investigate various advertiser and publisher objectives such as (a) revenue from the sale of impressions, clicks and conversions, (b) future revenue from the sale of NGD inventory, and (c) "fairness" of allocation. While the first two objectives are monetary, the third is not. This combination of demand types and objectives leads to potentially many variations of our model, which we delineate and evaluate. Our experimental results, which are based on optimization runs using real data sets, demonstrate the effectiveness and flexibility of the proposed model.
△ Less
Submitted 20 August, 2010;
originally announced August 2010.
-
Dynamic Approaches to In-Network Aggregation
Authors:
Oliver Kennedy,
Christoph Koch,
Al Demers
Abstract:
Collaboration between small-scale wireless devices hinges on their ability to infer properties shared across multiple nearby nodes. Wireless-enabled mobile devices in particular create a highly dynamic environment not conducive to distributed reasoning about such global properties. This paper addresses a specific instance of this problem: distributed aggregation. We present extensions to existin…
▽ More
Collaboration between small-scale wireless devices hinges on their ability to infer properties shared across multiple nearby nodes. Wireless-enabled mobile devices in particular create a highly dynamic environment not conducive to distributed reasoning about such global properties. This paper addresses a specific instance of this problem: distributed aggregation. We present extensions to existing unstructured aggregation protocols that enable estimation of count, sum, and average aggregates in highly dynamic environments. With the modified protocols, devices with only limited connectivity can maintain estimates of the aggregate, despite \textit{unexpected} peer departures and arrivals. Our analysis of these aggregate maintenance extensions demonstrates their effectiveness in unstructured environments despite high levels of node mobility.
△ Less
Submitted 17 October, 2008;
originally announced October 2008.