-
Semantic Integrity Constraints: Declarative Guardrails for AI-Augmented Data Processing Systems
Authors:
Alexander W. Lee,
Justin Chan,
Michael Fu,
Nicolas Kim,
Akshay Mehta,
Deepti Raghavan,
Ugur Cetintemel
Abstract:
AI-augmented data processing systems (DPSs) integrate large language models (LLMs) into query pipelines, allowing powerful semantic operations on structured and unstructured data. However, the reliability (a.k.a. trust) of these systems is fundamentally challenged by the potential for LLMs to produce errors, limiting their adoption in critical domains. To help address this reliability bottleneck,…
▽ More
AI-augmented data processing systems (DPSs) integrate large language models (LLMs) into query pipelines, allowing powerful semantic operations on structured and unstructured data. However, the reliability (a.k.a. trust) of these systems is fundamentally challenged by the potential for LLMs to produce errors, limiting their adoption in critical domains. To help address this reliability bottleneck, we introduce semantic integrity constraints (SICs) -- a declarative abstraction for specifying and enforcing correctness conditions over LLM outputs in semantic queries. SICs generalize traditional database integrity constraints to semantic settings, supporting common types of constraints, such as grounding, soundness, and exclusion, with both proactive and reactive enforcement strategies.
We argue that SICs provide a foundation for building reliable and auditable AI-augmented data systems. Specifically, we present a system design for integrating SICs into query planning and runtime execution and discuss its realization in AI-augmented DPSs. To guide and evaluate the vision, we outline several design goals -- covering criteria around expressiveness, runtime semantics, integration, performance, and enterprise-scale applicability -- and discuss how our framework addresses each, along with open research challenges.
△ Less
Submitted 1 June, 2025; v1 submitted 1 March, 2025;
originally announced March 2025.
-
DBPal: Weak Supervision for Learning a Natural Language Interface to Databases
Authors:
Nathaniel Weir,
Andrew Crotty,
Alex Galakatos,
Amir Ilkhechi,
Shekar Ramaswamy,
Rohin Bhushan,
Ugur Cetintemel,
Prasetya Utama,
Nadja Geisler,
Benjamin Hättasch,
Steffen Eger,
Carsten Binnig
Abstract:
This paper describes DBPal, a new system to translate natural language utterances into SQL statements using a neural machine translation model. While other recent approaches use neural machine translation to implement a Natural Language Interface to Databases (NLIDB), existing techniques rely on supervised learning with manually curated training data, which results in substantial overhead for supp…
▽ More
This paper describes DBPal, a new system to translate natural language utterances into SQL statements using a neural machine translation model. While other recent approaches use neural machine translation to implement a Natural Language Interface to Databases (NLIDB), existing techniques rely on supervised learning with manually curated training data, which results in substantial overhead for supporting each new database schema. In order to avoid this issue, DBPal implements a novel training pipeline based on weak supervision that synthesizes all training data from a given database schema. In our evaluation, we show that DBPal can outperform existing rule-based NLIDBs while achieving comparable performance to other NLIDBs that leverage deep neural network models without relying on manually curated training data for every new database schema.
△ Less
Submitted 11 September, 2019;
originally announced September 2019.
-
An End-to-end Neural Natural Language Interface for Databases
Authors:
Prasetya Utama,
Nathaniel Weir,
Fuat Basik,
Carsten Binnig,
Ugur Cetintemel,
Benjamin Hättasch,
Amir Ilkhechi,
Shekar Ramaswamy,
Arif Usta
Abstract:
The ability to extract insights from new data sets is critical for decision making. Visual interactive tools play an important role in data exploration since they provide non-technical users with an effective way to visually compose queries and comprehend the results. Natural language has recently gained traction as an alternative query interface to databases with the potential to enable non-exper…
▽ More
The ability to extract insights from new data sets is critical for decision making. Visual interactive tools play an important role in data exploration since they provide non-technical users with an effective way to visually compose queries and comprehend the results. Natural language has recently gained traction as an alternative query interface to databases with the potential to enable non-expert users to formulate complex questions and information needs efficiently and effectively. However, understanding natural language questions and translating them accurately to SQL is a challenging task, and thus Natural Language Interfaces for Databases (NLIDBs) have not yet made their way into practical tools and commercial products.
In this paper, we present DBPal, a novel data exploration tool with a natural language interface. DBPal leverages recent advances in deep models to make query understanding more robust in the following ways: First, DBPal uses a deep model to translate natural language statements to SQL, making the translation process more robust to paraphrasing and other linguistic variations. Second, to support the users in phrasing questions without knowing the database schema and the query features, DBPal provides a learned auto-completion model that suggests partial query extensions to users during query formulation and thus helps to write complex queries.
△ Less
Submitted 2 April, 2018;
originally announced April 2018.
-
Revisiting Reuse in Main Memory Database Systems
Authors:
Kayhan Dursun,
Carsten Binnig,
Ugur Cetintemel,
Tim Kraska
Abstract:
Reusing intermediates in databases to speed-up analytical query processing has been studied in the past. Existing solutions typically require intermediate results of individual operators to be materialized into temporary tables to be considered for reuse in subsequent queries. However, these approaches are fundamentally ill-suited for use in modern main memory databases. The reason is that modern…
▽ More
Reusing intermediates in databases to speed-up analytical query processing has been studied in the past. Existing solutions typically require intermediate results of individual operators to be materialized into temporary tables to be considered for reuse in subsequent queries. However, these approaches are fundamentally ill-suited for use in modern main memory databases. The reason is that modern main memory DBMSs are typically limited by the bandwidth of the memory bus, thus query execution is heavily optimized to keep tuples in the CPU caches and registers. To that end, adding additional materialization operations into a query plan not only add additional traffic to the memory bus but more importantly prevent the important cache- and register-locality opportunities resulting in high performance penalties.
In this paper we study a novel reuse model for intermediates, which caches internal physical data structures materialized during query processing (due to pipeline breakers) and externalizes them so that they become reusable for upcoming operations. We focus on hash tables, the most commonly used internal data structure in main memory databases to perform join and aggregation operations. As queries arrive, our reuse-aware optimizer reasons about the reuse opportunities for hash tables, employing cost models that take into account hash table statistics together with the CPU and data movement costs within the cache hierarchy. Experimental results, based on our HashStash prototype demonstrate performance gains of $2\times$ for typical analytical workloads with no additional overhead for materializing intermediates.
△ Less
Submitted 19 August, 2016;
originally announced August 2016.
-
S-Store: Streaming Meets Transaction Processing
Authors:
John Meehan,
Nesime Tatbul,
Stan Zdonik,
Cansu Aslantas,
Ugur Cetintemel,
Jiang Du,
Tim Kraska,
Samuel Madden,
David Maier,
Andrew Pavlo,
Michael Stonebraker,
Kristin Tufte,
Hao Wang
Abstract:
Stream processing addresses the needs of real-time applications. Transaction processing addresses the coordination and safety of short atomic computations. Heretofore, these two modes of operation existed in separate, stove-piped systems. In this work, we attempt to fuse the two computational paradigms in a single system called S-Store. In this way, S-Store can simultaneously accommodate OLTP and…
▽ More
Stream processing addresses the needs of real-time applications. Transaction processing addresses the coordination and safety of short atomic computations. Heretofore, these two modes of operation existed in separate, stove-piped systems. In this work, we attempt to fuse the two computational paradigms in a single system called S-Store. In this way, S-Store can simultaneously accommodate OLTP and streaming applications. We present a simple transaction model for streams that integrates seamlessly with a traditional OLTP system. We chose to build S-Store as an extension of H-Store, an open-source, in-memory, distributed OLTP database system. By implementing S-Store in this way, we can make use of the transaction processing facilities that H-Store already supports, and we can concentrate on the additional implementation features that are needed to support streaming. Similar implementations could be done using other main-memory OLTP platforms. We show that we can actually achieve higher throughput for streaming workloads in S-Store than an equivalent deployment in H-Store alone. We also show how this can be achieved within H-Store with the addition of a modest amount of new functionality. Furthermore, we compare S-Store to two state-of-the-art streaming systems, Spark Streaming and Storm, and show how S-Store matches and sometimes exceeds their performance while providing stronger transactional guarantees.
△ Less
Submitted 10 March, 2015; v1 submitted 3 March, 2015;
originally announced March 2015.
-
Tupleware: Redefining Modern Analytics
Authors:
Andrew Crotty,
Alex Galakatos,
Kayhan Dursun,
Tim Kraska,
Ugur Cetintemel,
Stan Zdonik
Abstract:
There is a fundamental discrepancy between the targeted and actual users of current analytics frameworks. Most systems are designed for the data and infrastructure of the Googles and Facebooks of the world---petabytes of data distributed across large cloud deployments consisting of thousands of cheap commodity machines. Yet, the vast majority of users operate clusters ranging from a few to a few d…
▽ More
There is a fundamental discrepancy between the targeted and actual users of current analytics frameworks. Most systems are designed for the data and infrastructure of the Googles and Facebooks of the world---petabytes of data distributed across large cloud deployments consisting of thousands of cheap commodity machines. Yet, the vast majority of users operate clusters ranging from a few to a few dozen nodes, analyze relatively small datasets of up to a few terabytes, and perform primarily compute-intensive operations. Targeting these users fundamentally changes the way we should build analytics systems.
This paper describes the design of Tupleware, a new system specifically aimed at the challenges faced by the typical user. Tupleware's architecture brings together ideas from the database, compiler, and programming languages communities to create a powerful end-to-end solution for data analysis. We propose novel techniques that consider the data, computations, and hardware together to achieve maximum performance on a case-by-case basis. Our experimental evaluation quantifies the impact of our novel techniques and shows orders of magnitude performance improvement over alternative systems.
△ Less
Submitted 30 July, 2014; v1 submitted 25 June, 2014;
originally announced June 2014.
-
The VC-Dimension of Queries and Selectivity Estimation Through Sampling
Authors:
Matteo Riondato,
Mert Akdere,
Ugur Cetintemel,
Stanley B. Zdonik,
Eli Upfal
Abstract:
We develop a novel method, based on the statistical concept of the Vapnik-Chervonenkis dimension, to evaluate the selectivity (output cardinality) of SQL queries - a crucial step in optimizing the execution of large scale database and data-mining operations. The major theoretical contribution of this work, which is of independent interest, is an explicit bound to the VC-dimension of a range space…
▽ More
We develop a novel method, based on the statistical concept of the Vapnik-Chervonenkis dimension, to evaluate the selectivity (output cardinality) of SQL queries - a crucial step in optimizing the execution of large scale database and data-mining operations. The major theoretical contribution of this work, which is of independent interest, is an explicit bound to the VC-dimension of a range space defined by all possible outcomes of a collection (class) of queries. We prove that the VC-dimension is a function of the maximum number of Boolean operations in the selection predicate and of the maximum number of select and join operations in any individual query in the collection, but it is neither a function of the number of queries in the collection nor of the size (number of tuples) of the database. We leverage on this result and develop a method that, given a class of queries, builds a concise random sample of a database, such that with high probability the execution of any query in the class on the sample provides an accurate estimate for the selectivity of the query on the original large database. The error probability holds simultaneously for the selectivity estimates of all queries in the collection, thus the same sample can be used to evaluate the selectivity of multiple queries, and the sample needs to be refreshed only following major changes in the database. The sample representation computed by our method is typically sufficiently small to be stored in main memory. We present extensive experimental results, validating our theoretical analysis and demonstrating the advantage of our technique when compared to complex selectivity estimation techniques used in PostgreSQL and the Microsoft SQL Server.
△ Less
Submitted 11 August, 2011; v1 submitted 30 January, 2011;
originally announced January 2011.