Search | arXiv e-print repository

Unveiling Challenges for LLMs in Enterprise Data Engineering

Authors: Jan-Micha Bodensohn, Ulf Brackmann, Liane Vogel, Anupam Sanghi, Carsten Binnig

Abstract: Large Language Models (LLMs) have demonstrated significant potential for automating data engineering tasks on tabular data, giving enterprises a valuable opportunity to reduce the high costs associated with manual data handling. However, the enterprise domain introduces unique challenges that existing LLM-based approaches for data engineering often overlook, such as large table sizes, more complex… ▽ More Large Language Models (LLMs) have demonstrated significant potential for automating data engineering tasks on tabular data, giving enterprises a valuable opportunity to reduce the high costs associated with manual data handling. However, the enterprise domain introduces unique challenges that existing LLM-based approaches for data engineering often overlook, such as large table sizes, more complex tasks, and the need for internal knowledge. To bridge these gaps, we identify key enterprise-specific challenges related to data, tasks, and background knowledge and conduct a comprehensive study of their impact on recent LLMs for data engineering. Our analysis reveals that LLMs face substantial limitations in real-world enterprise scenarios, resulting in significant accuracy drops. Our findings contribute to a systematic understanding of LLMs for enterprise data engineering to support their adoption in industry. △ Less

Submitted 15 April, 2025; originally announced April 2025.

arXiv:2504.10704 [pdf, other]

PDSP-Bench: A Benchmarking System for Parallel and Distributed Stream Processing

Authors: Pratyush Agnihotri, Boris Koldehofe, Roman Heinrich, Carsten Binnig, Manisha Luthra

Abstract: The paper introduces PDSP-Bench, a novel benchmarking system designed for a systematic understanding of performance of parallel stream processing in a distributed environment. Such an understanding is essential for determining how Stream Processing Systems (SPS) use operator parallelism and the available resources to process massive workloads of modern applications. Existing benchmarking systems f… ▽ More The paper introduces PDSP-Bench, a novel benchmarking system designed for a systematic understanding of performance of parallel stream processing in a distributed environment. Such an understanding is essential for determining how Stream Processing Systems (SPS) use operator parallelism and the available resources to process massive workloads of modern applications. Existing benchmarking systems focus on analyzing SPS using queries with sequential operator pipelines within a homogeneous centralized environment. Quite differently, PDSP-Bench emphasizes the aspects of parallel stream processing in a distributed heterogeneous environment and simultaneously allows the integration of machine learning models for SPS workloads. In our results, we benchmark a well-known SPS, Apache Flink, using parallel query structures derived from real-world applications and synthetic queries to show the capabilities of PDSP-Bench towards parallel stream processing. Moreover, we compare different learned cost models using generated SPS workloads on PDSP-Bench by showcasing their evaluations on model and training efficiency. We present key observations from our experiments using PDSP-Bench that highlight interesting trends given different query workloads, such as non-linearity and paradoxical effects of parallelism on the performance. △ Less

Submitted 14 April, 2025; originally announced April 2025.

Comments: 22

arXiv:2503.23863 [pdf, other]

GRACEFUL: A Learned Cost Estimator For UDFs

Authors: Johannes Wehrstein, Tiemo Bang, Roman Heinrich, Carsten Binnig

Abstract: User-Defined-Functions (UDFs) are a pivotal feature in modern DBMS, enabling the extension of native DBMS functionality with custom logic. However, the integration of UDFs into query optimization processes poses significant challenges, primarily due to the difficulty of estimating UDF execution costs. Consequently, existing cost models in DBMS optimizers largely ignore UDFs or rely on static assum… ▽ More User-Defined-Functions (UDFs) are a pivotal feature in modern DBMS, enabling the extension of native DBMS functionality with custom logic. However, the integration of UDFs into query optimization processes poses significant challenges, primarily due to the difficulty of estimating UDF execution costs. Consequently, existing cost models in DBMS optimizers largely ignore UDFs or rely on static assumptions, resulting in suboptimal performance for queries involving UDFs. In this paper, we introduce GRACEFUL, a novel learned cost model to make accurate cost predictions of query plans with UDFs enabling optimization decisions for UDFs in DBMS. For example, as we show in our evaluation, using our cost model, we can achieve 50x speedups through informed pull-up/push-down filter decisions of the UDF compared to the standard case where always a filter push-down is applied. Additionally, we release a synthetic dataset of over 90,000 UDF queries to promote further research in this area. △ Less

Submitted 31 March, 2025; originally announced March 2025.

Comments: The paper has been accepted by ICDE 2025

arXiv:2503.20932 [pdf, other]

Reflex: Speeding Up SMPC Query Execution through Efficient and Flexible Intermediate Result Size Trimming

Authors: Long Gu, Shaza Zeitouni, Carsten Binnig, Zsolt István

Abstract: There is growing interest in Secure Analytics, but fully oblivious query execution in Secure Multi-Party Computation (MPC) settings is often prohibitively expensive. Recent related works propose different approaches to trimming the size of intermediate results between query operators, resulting in significant speedups at the cost of some information leakage. In this work, we generalize these ideas… ▽ More There is growing interest in Secure Analytics, but fully oblivious query execution in Secure Multi-Party Computation (MPC) settings is often prohibitively expensive. Recent related works propose different approaches to trimming the size of intermediate results between query operators, resulting in significant speedups at the cost of some information leakage. In this work, we generalize these ideas into a method of flexible and efficient trimming of operator outputs that can be added to MPC operators easily. This allows for precisely controlling the security/performance trade-off on a per-operator and per-query basis. We demonstrate that our work is practical by porting a state-of-the-art trimming approach to it, resulting in a faster runtime and increased security. Our work lays down the foundation for a future MPC query planner that can pick different performance and security targets when composing physical query plans. △ Less

Submitted 26 March, 2025; originally announced March 2025.

arXiv:2502.01229 [pdf, other]

How Good are Learned Cost Models, Really? Insights from Query Optimization Tasks

Authors: Roman Heinrich, Manisha Luthra, Johannes Wehrstein, Harald Kornmayer, Carsten Binnig

Abstract: Traditionally, query optimizers rely on cost models to choose the best execution plan from several candidates, making precise cost estimates critical for efficient query execution. In recent years, cost models based on machine learning have been proposed to overcome the weaknesses of traditional cost models. While these models have been shown to provide better prediction accuracy, only limited eff… ▽ More Traditionally, query optimizers rely on cost models to choose the best execution plan from several candidates, making precise cost estimates critical for efficient query execution. In recent years, cost models based on machine learning have been proposed to overcome the weaknesses of traditional cost models. While these models have been shown to provide better prediction accuracy, only limited efforts have been made to investigate how well Learned Cost Models (LCMs) actually perform in query optimization and how they affect overall query performance. In this paper, we address this by a systematic study evaluating LCMs on three of the core query optimization tasks: join ordering, access path selection, and physical operator selection. In our study, we compare seven state-of-the-art LCMs to a traditional cost model and, surprisingly, find that the traditional model often still outperforms LCMs in these tasks. We conclude by highlighting major takeaways and recommendations to guide future research toward making LCMs more effective for query optimization. △ Less

Submitted 3 February, 2025; originally announced February 2025.

Comments: The paper has been accepted by SIGMOD 2025

arXiv:2410.22522 [pdf, other]

Efficient Learned Query Execution over Text and Tables [Technical Report]

Authors: Matthias Urban, Carsten Binnig

Abstract: In this paper, we present ELEET, a novel execution engine that allows one to seamlessly query and process text as a first-class citizen along with tables. To enable such a seamless integration of text and tables, ELEET leverages learned multi-modal operators (MMOps) such as joins and unions that seamlessly combine structured with unstructured textual data. While large language models (LLM) such as… ▽ More In this paper, we present ELEET, a novel execution engine that allows one to seamlessly query and process text as a first-class citizen along with tables. To enable such a seamless integration of text and tables, ELEET leverages learned multi-modal operators (MMOps) such as joins and unions that seamlessly combine structured with unstructured textual data. While large language models (LLM) such as GPT-4 are interesting candidates to enable such learned multimodal operations, we deliberately do not follow this trend to enable MMOps, since it would result in high overhead at query runtime. Instead, to enable MMOps, ELEET comes with a more efficient small language model (SLM) that is targeted to extract structured data from text. Thanks to our novel architecture and pre-training procedure, the ELEET-model enables high-accuracy extraction with low overheads. In our evaluation, we compare query execution based on ELEET to baselines leveraging LLMs such as GPT-4 and show that ELEET can speed up multi-modal queries over tables and text by up to 575x without sacrificing accuracy. △ Less

Submitted 29 October, 2024; originally announced October 2024.

arXiv:2408.16170 [pdf, other]

CardBench: A Benchmark for Learned Cardinality Estimation in Relational Databases

Authors: Yannis Chronis, Yawen Wang, Yu Gan, Sami Abu-El-Haija, Chelsea Lin, Carsten Binnig, Fatma Özcan

Abstract: Cardinality estimation is crucial for enabling high query performance in relational databases. Recently learned cardinality estimation models have been proposed to improve accuracy but there is no systematic benchmark or datasets which allows researchers to evaluate the progress made by new learned approaches and even systematically develop new learned approaches. In this paper, we are releasing a… ▽ More Cardinality estimation is crucial for enabling high query performance in relational databases. Recently learned cardinality estimation models have been proposed to improve accuracy but there is no systematic benchmark or datasets which allows researchers to evaluate the progress made by new learned approaches and even systematically develop new learned approaches. In this paper, we are releasing a benchmark, containing thousands of queries over 20 distinct real-world databases for learned cardinality estimation. In contrast to other initial benchmarks, our benchmark is much more diverse and can be used for training and testing learned models systematically. Using this benchmark, we explored whether learned cardinality estimation can be transferred to an unseen dataset in a zero-shot manner. We trained GNN-based and transformer-based models to study the problem in three setups: 1-) instance-based, 2-) zero-shot, and 3-) fine-tuned. Our results show that while we get promising results for zero-shot cardinality estimation on simple single table queries; as soon as we add joins, the accuracy drops. However, we show that with fine-tuning, we can still utilize pre-trained models for cardinality estimation, significantly reducing training overheads compared to instance specific models. We are open sourcing our scripts to collect statistics, generate queries and training datasets to foster more extensive research, also from the ML community on the important problem of cardinality estimation and in particular improve on recent directions such as pre-trained cardinality estimation. △ Less

Submitted 28 August, 2024; originally announced August 2024.

arXiv:2403.11874 [pdf, other]

doi 10.48786/EDBT.2025.41

Benchmarking Analytical Query Processing in Intel SGXv2

Authors: Adrian Lutsch, Muhammad El-Hindi, Matthias Heinrich, Daniel Ritter, Zsolt István, Carsten Binnig

Abstract: Trusted Execution Environments (TEEs), such as Intel's Software Guard Extensions (SGX), are increasingly being adopted to address trust and compliance issues in the public cloud. Intel SGX's second generation (SGXv2) addresses many limitations of its predecessor (SGXv1), offering the potential for secure and efficient analytical cloud DBMSs. We assess this potential and conduct the first in-depth… ▽ More Trusted Execution Environments (TEEs), such as Intel's Software Guard Extensions (SGX), are increasingly being adopted to address trust and compliance issues in the public cloud. Intel SGX's second generation (SGXv2) addresses many limitations of its predecessor (SGXv1), offering the potential for secure and efficient analytical cloud DBMSs. We assess this potential and conduct the first in-depth evaluation study of analytical query processing algorithms inside SGXv2. Our study reveals that, unlike SGXv1, state-of-the-art algorithms like radix joins and SIMD-based scans are a good starting point for achieving high-performance query processing inside SGXv2. However, subtle hardware and software differences still influence code execution inside SGX enclaves and cause substantial overheads. We investigate these differences and propose new optimizations to bring the performance inside enclaves on par with native code execution outside enclaves. △ Less

Submitted 14 October, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

Comments: 13 pages, 17 figures; To be published in EDBT 2025; Code available at: https://github.com/DataManagementLab/sgxv2-analytical-query-processing-benchmarks; major changes: overhauled figures, improved section 4.2, added new section 7, improved experiments, conference template

ACM Class: H.2; B.8

Journal ref: Proceedings 28th International Conference on Extending Database Technology, EDBT 2025, Barcelona, Spain, March 25-28, 2025. Page 516-528

arXiv:2403.08444 [pdf, other]

doi 10.1109/ICDE60146.2024.00015

COSTREAM: Learned Cost Models for Operator Placement in Edge-Cloud Environments

Authors: Roman Heinrich, Carsten Binnig, Harald Kornmayer, Manisha Luthra

Abstract: In this work, we present COSTREAM, a novel learned cost model for Distributed Stream Processing Systems that provides accurate predictions of the execution costs of a streaming query in an edge-cloud environment. The cost model can be used to find an initial placement of operators across heterogeneous hardware, which is particularly important in these environments. In our evaluation, we demonstrat… ▽ More In this work, we present COSTREAM, a novel learned cost model for Distributed Stream Processing Systems that provides accurate predictions of the execution costs of a streaming query in an edge-cloud environment. The cost model can be used to find an initial placement of operators across heterogeneous hardware, which is particularly important in these environments. In our evaluation, we demonstrate that COSTREAM can produce highly accurate cost estimates for the initial operator placement and even generalize to unseen placements, queries, and hardware. When using COSTREAM to optimize the placements of streaming operators, a median speed-up of around 21x can be achieved compared to baselines. △ Less

Submitted 13 March, 2024; originally announced March 2024.

Comments: This paper has been accepted by IEEE ICDE 2024

arXiv:2310.13581 [pdf, other]

SPARE: A Single-Pass Neural Model for Relational Databases

Authors: Benjamin Hilprecht, Kristian Kersting, Carsten Binnig

Abstract: While there has been extensive work on deep neural networks for images and text, deep learning for relational databases (RDBs) is still a rather unexplored field. One direction that recently gained traction is to apply Graph Neural Networks (GNNs) to RBDs. However, training GNNs on large relational databases (i.e., data stored in multiple database tables) is rather inefficient due to multiple ro… ▽ More While there has been extensive work on deep neural networks for images and text, deep learning for relational databases (RDBs) is still a rather unexplored field. One direction that recently gained traction is to apply Graph Neural Networks (GNNs) to RBDs. However, training GNNs on large relational databases (i.e., data stored in multiple database tables) is rather inefficient due to multiple rounds of training and potentially large and inefficient representations. Hence, in this paper we propose SPARE (Single-Pass Relational models), a new class of neural models that can be trained efficiently on RDBs while providing similar accuracies as GNNs. For enabling efficient training, different from GNNs, SPARE makes use of the fact that data in RDBs has a regular structure, which allows one to train these models in a single pass while exploiting symmetries at the same time. Our extensive empirical evaluation demonstrates that SPARE can significantly speedup both training and inference while offering competitive predictive performance over numerous baselines. △ Less

Submitted 20 October, 2023; originally announced October 2023.

arXiv:2308.03424 [pdf, other]

CAESURA: Language Models as Multi-Modal Query Planners

Authors: Matthias Urban, Carsten Binnig

Abstract: Traditional query planners translate SQL queries into query plans to be executed over relational data. However, it is impossible to query other data modalities, such as images, text, or video stored in modern data systems such as data lakes using these query planners. In this paper, we propose Language-Model-Driven Query Planning, a new paradigm of query planning that uses Language Models to trans… ▽ More Traditional query planners translate SQL queries into query plans to be executed over relational data. However, it is impossible to query other data modalities, such as images, text, or video stored in modern data systems such as data lakes using these query planners. In this paper, we propose Language-Model-Driven Query Planning, a new paradigm of query planning that uses Language Models to translate natural language queries into executable query plans. Different from relational query planners, the resulting query plans can contain complex operators that are able to process arbitrary modalities. As part of this paper, we present a first GPT-4 based prototype called CEASURA and show the general feasibility of this idea on two datasets. Finally, we discuss several ideas to improve the query planning capabilities of today's Language Models. △ Less

Submitted 7 August, 2023; originally announced August 2023.

Comments: 6 pages, 4 figures

arXiv:2305.15321 [pdf, other]

Towards Foundation Models for Relational Databases [Vision Paper]

Authors: Liane Vogel, Benjamin Hilprecht, Carsten Binnig

Abstract: Tabular representation learning has recently gained a lot of attention. However, existing approaches only learn a representation from a single table, and thus ignore the potential to learn from the full structure of relational databases, including neighboring tables that can contain important information for a contextualized representation. Moreover, current models are significantly limited in sca… ▽ More Tabular representation learning has recently gained a lot of attention. However, existing approaches only learn a representation from a single table, and thus ignore the potential to learn from the full structure of relational databases, including neighboring tables that can contain important information for a contextualized representation. Moreover, current models are significantly limited in scale, which prevents that they learn from large databases. In this paper, we thus introduce our vision of relational representation learning, that can not only learn from the full relational structure, but also can scale to larger database sizes that are commonly found in real-world. Moreover, we also discuss opportunities and challenges we see along the way to enable this vision and present initial very promising results. Overall, we argue that this direction can lead to foundation models for relational databases that are today only available for text and images. △ Less

Submitted 24 May, 2023; originally announced May 2023.

Comments: Accepted at the Tabular Representation Learning Workshop at NeurIPS 2022 (TRL@NeurIPS2022)

arXiv:2304.13559 [pdf, other]

Towards Multi-Modal DBMSs for Seamless Querying of Texts and Tables

Authors: Matthias Urban, Carsten Binnig

Abstract: In this paper, we propose Multi-Modal Databases (MMDBs), which is a new class of database systems that can seamlessly query text and tables using SQL. To enable seamless querying of textual data using SQL in an MMDB, we propose to extend relational databases with so-called multi-modal operators (MMOps) which are based on the advances of recent large language models such as GPT-3. The main idea of… ▽ More In this paper, we propose Multi-Modal Databases (MMDBs), which is a new class of database systems that can seamlessly query text and tables using SQL. To enable seamless querying of textual data using SQL in an MMDB, we propose to extend relational databases with so-called multi-modal operators (MMOps) which are based on the advances of recent large language models such as GPT-3. The main idea of MMOps is that they allow text collections to be treated as tables without the need to manually transform the data. As we show in our evaluation, our MMDB prototype can not only outperform state-of-the-art approaches such as text-to-table in terms of accuracy and performance but it also requires significantly less training data to fine-tune the model for an unseen text collection. △ Less

Submitted 28 April, 2023; v1 submitted 26 April, 2023; originally announced April 2023.

arXiv:2207.03823 [pdf, other]

doi 10.1145/3524860.3539639

Zero-Shot Cost Models for Distributed Stream Processing

Authors: Roman Heinrich, Manisha Luthra, Harald Kornmayer, Carsten Binnig

Abstract: This paper proposes a learned cost estimation model for Distributed Stream Processing Systems (DSPS) with an aim to provide accurate cost predictions of executing queries. A major premise of this work is that the proposed learned model can generalize to the dynamics of streaming workloads out-of-the-box. This means a model once trained can accurately predict performance metrics such as latency and… ▽ More This paper proposes a learned cost estimation model for Distributed Stream Processing Systems (DSPS) with an aim to provide accurate cost predictions of executing queries. A major premise of this work is that the proposed learned model can generalize to the dynamics of streaming workloads out-of-the-box. This means a model once trained can accurately predict performance metrics such as latency and throughput even if the characteristics of the data and workload or the deployment of operators to hardware changes at runtime. That way, the model can be used to solve tasks such as optimizing the placement of operators to minimize the end-to-end latency of a streaming query or maximize its throughput even under varying conditions. Our evaluation on a well-known DSPS, Apache Storm, shows that the model can predict accurately for unseen workloads and queries while generalizing across real-world benchmarks. △ Less

Submitted 8 July, 2022; originally announced July 2022.

Comments: To appear in the Proceedings of The 16th ACM International Conference on Distributed and Event-based Systems (DEBS `22), June 27-30, 2022, Copenhagen, Denmark

arXiv:2207.01269 [pdf, other]

DiffML: End-to-end Differentiable ML Pipelines

Authors: Benjamin Hilprecht, Christian Hammacher, Eduardo Reis, Mohamed Abdelaal, Carsten Binnig

Abstract: In this paper, we present our vision of differentiable ML pipelines called DiffML to automate the construction of ML pipelines in an end-to-end fashion. The idea is that DiffML allows to jointly train not just the ML model itself but also the entire pipeline including data preprocessing steps, e.g., data cleaning, feature selection, etc. Our core idea is to formulate all pipeline steps in a differ… ▽ More In this paper, we present our vision of differentiable ML pipelines called DiffML to automate the construction of ML pipelines in an end-to-end fashion. The idea is that DiffML allows to jointly train not just the ML model itself but also the entire pipeline including data preprocessing steps, e.g., data cleaning, feature selection, etc. Our core idea is to formulate all pipeline steps in a differentiable way such that the entire pipeline can be trained using backpropagation. However, this is a non-trivial problem and opens up many new research questions. To show the feasibility of this direction, we demonstrate initial ideas and a general principle of how typical preprocessing steps such as data cleaning, feature selection and dataset selection can be formulated as differentiable programs and jointly learned with the ML model. Moreover, we discuss a research roadmap and core challenges that have to be systematically tackled to enable fully differentiable ML pipelines. △ Less

Submitted 5 July, 2022; v1 submitted 4 July, 2022; originally announced July 2022.

arXiv:2206.00623 [pdf, other]

P4DB -- The Case for In-Network OLTP (Extended Technical Report)

Authors: Matthias Jasny, Lasse Thostrup, Tobias Ziegler, Carsten Binnig

Abstract: In this paper we present a new approach for distributed DBMSs called P4DB, that uses a programmable switch to accelerate OLTP workloads. The main idea of P4DB is that it implements a transaction processing engine on top of a P4-programmable switch. The switch can thus act as an accelerator in the network, especially when it is used to store and process hot (contended) tuples on the switch. In our… ▽ More In this paper we present a new approach for distributed DBMSs called P4DB, that uses a programmable switch to accelerate OLTP workloads. The main idea of P4DB is that it implements a transaction processing engine on top of a P4-programmable switch. The switch can thus act as an accelerator in the network, especially when it is used to store and process hot (contended) tuples on the switch. In our experiments, we show that P4DB hence provides significant benefits compared to traditional DBMS architectures and can achieve a speedup of up to 8x. △ Less

Submitted 1 June, 2022; originally announced June 2022.

Comments: Extended Technical Report for: P4DB - The Case for In-Network OLTP

arXiv:2203.14144 [pdf, other]

Demonstrating CAT: Synthesizing Data-Aware Conversational Agents for Transactional Databases

Authors: Marius Gassen, Benjamin Hättasch, Benjamin Hilprecht, Nadja Geisler, Alexander Fraser, Carsten Binnig

Abstract: Databases for OLTP are often the backbone for applications such as hotel room or cinema ticket booking applications. However, developing a conversational agent (i.e., a chatbot-like interface) to allow end-users to interact with an application using natural language requires both immense amounts of training data and NLP expertise. This motivates CAT, which can be used to easily create conversation… ▽ More Databases for OLTP are often the backbone for applications such as hotel room or cinema ticket booking applications. However, developing a conversational agent (i.e., a chatbot-like interface) to allow end-users to interact with an application using natural language requires both immense amounts of training data and NLP expertise. This motivates CAT, which can be used to easily create conversational agents for transactional databases. The main idea is that, for a given OLTP database, CAT uses weak supervision to synthesize the required training data to train a state-of-the-art conversational agent, allowing users to interact with the OLTP database. Furthermore, CAT provides an out-of-the-box integration of the resulting agent with the database. As a major difference to existing conversational agents, agents synthesized by CAT are data-aware. This means that the agent decides which information should be requested from the user based on the current data distributions in the database, which typically results in markedly more efficient dialogues compared with non-data-aware agents. We publish the code for CAT as open source. △ Less

Submitted 26 March, 2022; originally announced March 2022.

Comments: Submitted as demonstration proposal to VLDB 2022

arXiv:2203.04663 [pdf, other]

ASET: Ad-hoc Structured Exploration of Text Collections [Extended Abstract]

Authors: Benjamin Hättasch, Jan-Micha Bodensohn, Carsten Binnig

Abstract: In this paper, we propose a new system called ASET that allows users to perform structured explorations of text collections in an ad-hoc manner. The main idea of ASET is to use a new two-phase approach that first extracts a superset of information nuggets from the texts using existing extractors such as named entity recognizers and then matches the extractions to a structured table definition as r… ▽ More In this paper, we propose a new system called ASET that allows users to perform structured explorations of text collections in an ad-hoc manner. The main idea of ASET is to use a new two-phase approach that first extracts a superset of information nuggets from the texts using existing extractors such as named entity recognizers and then matches the extractions to a structured table definition as requested by the user based on embeddings. In our evaluation, we show that ASET is thus able to extract structured data from real-world text collections in high quality without the need to design extraction pipelines upfront. △ Less

Submitted 9 March, 2022; originally announced March 2022.

Comments: Accepted at the 3rd International Workshop on Applied AI for Database Systems and Applications (AIDB'21), August 20, 2021, Copenhagen, Denmark

arXiv:2203.04366 [pdf, other]

It's AI Match: A Two-Step Approach for Schema Matching Using Embeddings

Authors: Benjamin Hättasch, Michael Truong-Ngoc, Andreas Schmidt, Carsten Binnig

Abstract: Since data is often stored in different sources, it needs to be integrated to gather a global view that is required in order to create value and derive knowledge from it. A critical step in data integration is schema matching which aims to find semantic correspondences between elements of two schemata. In order to reduce the manual effort involved in schema matching, many solutions for the automat… ▽ More Since data is often stored in different sources, it needs to be integrated to gather a global view that is required in order to create value and derive knowledge from it. A critical step in data integration is schema matching which aims to find semantic correspondences between elements of two schemata. In order to reduce the manual effort involved in schema matching, many solutions for the automatic determination of schema correspondences have already been developed. In this paper, we propose a novel end-to-end approach for schema matching based on neural embeddings. The main idea is to use a two-step approach consisting of a table matching step followed by an attribute matching step. In both steps we use embeddings on different levels either representing the whole table or single attributes. Our results show that our approach is able to determine correspondences in a robust and reliable way and compared to traditional schema matching approaches can find non-trivial correspondences. △ Less

Submitted 8 March, 2022; originally announced March 2022.

Comments: Accepted to the 2nd International Workshop on Applied AI for Database Systems and Applications (AIDB'20), August 31, 2020, Tokyo, Japan

arXiv:2201.00561 [pdf, other]

Zero-Shot Cost Models for Out-of-the-box Learned Cost Prediction

Authors: Benjamin Hilprecht, Carsten Binnig

Abstract: In this paper, we introduce zero-shot cost models which enable learned cost estimation that generalizes to unseen databases. In contrast to state-of-the-art workload-driven approaches which require to execute a large set of training queries on every new database, zero-shot cost models thus allow to instantiate a learned cost model out-of-the-box without expensive training data collection. To enabl… ▽ More In this paper, we introduce zero-shot cost models which enable learned cost estimation that generalizes to unseen databases. In contrast to state-of-the-art workload-driven approaches which require to execute a large set of training queries on every new database, zero-shot cost models thus allow to instantiate a learned cost model out-of-the-box without expensive training data collection. To enable such zero-shot cost models, we suggest a new learning paradigm based on pre-trained cost models. As core contributions to support the transfer of such a pre-trained cost model to unseen databases, we introduce a new model architecture and representation technique for encoding query workloads as input to those models. As we will show in our evaluation, zero-shot cost estimation can provide more accurate cost estimates than state-of-the-art models for a wide range of (real-world) databases without requiring any query executions on unseen databases. Furthermore, we show that zero-shot cost models can be used in a few-shot mode that further improves their quality by retraining them just with a small number of additional training queries on the unseen database. △ Less

Submitted 3 January, 2022; originally announced January 2022.

arXiv:2105.12457 [pdf, other]

ReStore -- Neural Data Completion for Relational Databases

Authors: Benjamin Hilprecht, Carsten Binnig

Abstract: Classical approaches for OLAP assume that the data of all tables is complete. However, in case of incomplete tables with missing tuples, classical approaches fail since the result of a SQL aggregate query might significantly differ from the results computed on the full dataset. Today, the only way to deal with missing data is to manually complete the dataset which causes not only high efforts but… ▽ More Classical approaches for OLAP assume that the data of all tables is complete. However, in case of incomplete tables with missing tuples, classical approaches fail since the result of a SQL aggregate query might significantly differ from the results computed on the full dataset. Today, the only way to deal with missing data is to manually complete the dataset which causes not only high efforts but also requires good statistical skills to determine when a dataset is actually complete. In this paper, we propose an automated approach for relational data completion called ReStore using a new class of (neural) schema-structured completion models that are able to synthesize data which resembles the missing tuples. As we show in our evaluation, this efficiently helps to reduce the relative error of aggregate queries by up to 390% on real-world data compared to using the incomplete data directly for query answering. △ Less

Submitted 26 May, 2021; originally announced May 2021.

arXiv:2105.00642 [pdf, other]

One Model to Rule them All: Towards Zero-Shot Learning for Databases

Authors: Benjamin Hilprecht, Carsten Binnig

Abstract: In this paper, we present our vision of so called zero-shot learning for databases which is a new learning approach for database components. Zero-shot learning for databases is inspired by recent advances in transfer learning of models such as GPT-3 and can support a new database out-of-the box without the need to train a new model. Furthermore, it can easily be extended to few-shot learning by fu… ▽ More In this paper, we present our vision of so called zero-shot learning for databases which is a new learning approach for database components. Zero-shot learning for databases is inspired by recent advances in transfer learning of models such as GPT-3 and can support a new database out-of-the box without the need to train a new model. Furthermore, it can easily be extended to few-shot learning by further retraining the model on the unseen database. As a first concrete contribution in this paper, we show the feasibility of zero-shot learning for the task of physical cost estimation and present very promising initial results. Moreover, as a second contribution we discuss the core challenges related to zero-shot learning for databases and present a roadmap to extend zero-shot learning towards many other tasks beyond cost estimation or even beyond classical database systems and workloads. △ Less

Submitted 3 January, 2022; v1 submitted 3 May, 2021; originally announced May 2021.

arXiv:2009.09433 [pdf, other]

On the Throughput Optimization in Large-Scale Batch-Processing Systems

Authors: Sounak Kar, Robin Rehrmann, Arpan Mukhopadhyay, Bastian Alt, Florin Ciucu, Heinz Koeppl, Carsten Binnig, Amr Rizk

Abstract: We analyze a data-processing system with $n$ clients producing jobs which are processed in \textit{batches} by $m$ parallel servers; the system throughput critically depends on the batch size and a corresponding sub-additive speedup function. In practice, throughput optimization relies on numerical searches for the optimal batch size, a process that can take up to multiple days in existing commerc… ▽ More We analyze a data-processing system with $n$ clients producing jobs which are processed in \textit{batches} by $m$ parallel servers; the system throughput critically depends on the batch size and a corresponding sub-additive speedup function. In practice, throughput optimization relies on numerical searches for the optimal batch size, a process that can take up to multiple days in existing commercial systems. In this paper, we model the system in terms of a closed queueing network; a standard Markovian analysis yields the optimal throughput in $ω\left(n^4\right)$ time. Our main contribution is a mean-field model of the system for the regime where the system size is large. We show that the mean-field model has a unique, globally attractive stationary point which can be found in closed form and which characterizes the asymptotic throughput of the system as a function of the batch size. Using this expression we find the \textit{asymptotically} optimal throughput in $O(1)$ time. Numerical settings from a large commercial system reveal that this asymptotic optimum is accurate in practical finite regimes. △ Less

Submitted 20 September, 2020; originally announced September 2020.

Comments: 15 pages

arXiv:2009.02258 [pdf, other]

AnyDB: An Architecture-less DBMS for Any Workload

Authors: Tiemo Bang, Norman May, Ilia Petrov, Carsten Binnig

Abstract: In this paper, we propose a radical new approach for scale-out distributed DBMSs. Instead of hard-baking an architectural model, such as a shared-nothing architecture, into the distributed DBMS design, we aim for a new class of so-called architecture-less DBMSs. The main idea is that an architecture-less DBMS can mimic any architecture on a per-query basis on-the-fly without any additional overhea… ▽ More In this paper, we propose a radical new approach for scale-out distributed DBMSs. Instead of hard-baking an architectural model, such as a shared-nothing architecture, into the distributed DBMS design, we aim for a new class of so-called architecture-less DBMSs. The main idea is that an architecture-less DBMS can mimic any architecture on a per-query basis on-the-fly without any additional overhead for reconfiguration. Our initial results show that our architecture-less DBMS AnyDB can provide significant speed-ups across varying workloads compared to a traditional DBMS implementing a static architecture. △ Less

Submitted 4 September, 2020; originally announced September 2020.

Comments: Submitted to 11th Annual Conference on Innovative Data Systems Research (CIDR 21)

arXiv:1909.06182 [pdf, other]

DBPal: Weak Supervision for Learning a Natural Language Interface to Databases

Authors: Nathaniel Weir, Andrew Crotty, Alex Galakatos, Amir Ilkhechi, Shekar Ramaswamy, Rohin Bhushan, Ugur Cetintemel, Prasetya Utama, Nadja Geisler, Benjamin Hättasch, Steffen Eger, Carsten Binnig

Abstract: This paper describes DBPal, a new system to translate natural language utterances into SQL statements using a neural machine translation model. While other recent approaches use neural machine translation to implement a Natural Language Interface to Databases (NLIDB), existing techniques rely on supervised learning with manually curated training data, which results in substantial overhead for supp… ▽ More This paper describes DBPal, a new system to translate natural language utterances into SQL statements using a neural machine translation model. While other recent approaches use neural machine translation to implement a Natural Language Interface to Databases (NLIDB), existing techniques rely on supervised learning with manually curated training data, which results in substantial overhead for supporting each new database schema. In order to avoid this issue, DBPal implements a novel training pipeline based on weak supervision that synthesizes all training data from a given database schema. In our evaluation, we show that DBPal can outperform existing rule-based NLIDBs while achieving comparable performance to other NLIDBs that leverage deep neural network models without relying on manually curated training data for every new database schema. △ Less

Submitted 11 September, 2019; originally announced September 2019.

Comments: arXiv admin note: text overlap with arXiv:1804.00401

arXiv:1909.00607 [pdf, other]

DeepDB: Learn from Data, not from Queries!

Authors: Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian Kersting, Carsten Binnig

Abstract: The typical approach for learned DBMS components is to capture the behavior by running a representative set of queries and use the observations to train a machine learning model. This workload-driven approach, however, has two major downsides. First, collecting the training data can be very expensive, since all queries need to be executed on potentially large databases. Second, training data has t… ▽ More The typical approach for learned DBMS components is to capture the behavior by running a representative set of queries and use the observations to train a machine learning model. This workload-driven approach, however, has two major downsides. First, collecting the training data can be very expensive, since all queries need to be executed on potentially large databases. Second, training data has to be recollected when the workload and the data changes. To overcome these limitations, we take a different route: we propose to learn a pure data-driven model that can be used for different tasks such as query answering or cardinality estimation. This data-driven model also supports ad-hoc queries and updates of the data without the need of full retraining when the workload or data changes. Indeed, one may now expect that this comes at a price of lower accuracy since workload-driven models can make use of more information. However, this is not the case. The results of our empirical evaluation demonstrate that our data-driven approach not only provides better accuracy than state-of-the-art learned components but also generalizes better to unseen queries. △ Less

Submitted 2 September, 2019; originally announced September 2019.

arXiv:1904.01279 [pdf, other]

Learning a Partitioning Advisor with Deep Reinforcement Learning

Authors: Benjamin Hilprecht, Carsten Binnig, Uwe Roehm

Abstract: Commercial data analytics products such as Microsoft Azure SQL Data Warehouse or Amazon Redshift provide ready-to-use scale-out database solutions for OLAP-style workloads in the cloud. While the provisioning of a database cluster is usually fully automated by cloud providers, customers typically still have to make important design decisions which were traditionally made by the database administra… ▽ More Commercial data analytics products such as Microsoft Azure SQL Data Warehouse or Amazon Redshift provide ready-to-use scale-out database solutions for OLAP-style workloads in the cloud. While the provisioning of a database cluster is usually fully automated by cloud providers, customers typically still have to make important design decisions which were traditionally made by the database administrator such as selecting the partitioning schemes. In this paper we introduce a learned partitioning advisor for analytical OLAP-style workloads based on Deep Reinforcement Learning (DRL). The main idea is that a DRL agent learns its decisions based on experience by monitoring the rewards for different workloads and partitioning schemes. We evaluate our learned partitioning advisor in an experimental evaluation with different databases schemata and workloads of varying complexity. In the evaluation, we show that our advisor is not only able to find partitionings that outperform existing approaches for automated partitioning design but that it also can easily adjust to different deployments. This is especially important in cloud setups where customers can easily migrate their cluster to a new set of (virtual) machines. △ Less

Submitted 2 April, 2019; originally announced April 2019.

arXiv:1812.08032 [pdf, other]

Progressive Data Science: Potential and Challenges

Authors: Cagatay Turkay, Nicola Pezzotti, Carsten Binnig, Hendrik Strobelt, Barbara Hammer, Daniel A. Keim, Jean-Daniel Fekete, Themis Palpanas, Yunhai Wang, Florin Rusu

Abstract: Data science requires time-consuming iterative manual activities. In particular, activities such as data selection, preprocessing, transformation, and mining, highly depend on iterative trial-and-error processes that could be sped-up significantly by providing quick feedback on the impact of changes. The idea of progressive data science is to compute the results of changes in a progressive manner,… ▽ More Data science requires time-consuming iterative manual activities. In particular, activities such as data selection, preprocessing, transformation, and mining, highly depend on iterative trial-and-error processes that could be sped-up significantly by providing quick feedback on the impact of changes. The idea of progressive data science is to compute the results of changes in a progressive manner, returning a first approximation of results quickly and allow iterative refinements until converging to a final result. Enabling the user to interact with the intermediate results allows an early detection of erroneous or suboptimal choices, the guided definition of modifications to the pipeline and their quick assessment. In this paper, we discuss the progressiveness challenges arising in different steps of the data science pipeline. We describe how changes in each step of the pipeline impact the subsequent steps and outline why progressive data science will help to make the process more effective. Computing progressive approximations of outcomes resulting from changes creates numerous research challenges, especially if the changes are made in the early steps of the pipeline. We discuss these challenges and outline first steps towards progressiveness, which, we argue, will ultimately help to significantly speed-up the overall data science process. △ Less

Submitted 12 September, 2019; v1 submitted 19 December, 2018; originally announced December 2018.

ACM Class: H.5.2; H.3.m; I.2.m; I.3.m

arXiv:1811.12204 [pdf, other]

doi 10.1145/3318464.3389724

Chiller: Contention-centric Transaction Execution and Data Partitioning for Modern Networks

Authors: Erfan Zamanian, Julian Shun, Carsten Binnig, Tim Kraska

Abstract: Distributed transactions on high-overhead TCP/IP-based networks were conventionally considered to be prohibitively expensive and thus were avoided at all costs. To that end, the primary goal of almost any existing partitioning scheme is to minimize the number of cross-partition transactions. However, with the new generation of fast RDMA-enabled networks, this assumption is no longer valid. In fact… ▽ More Distributed transactions on high-overhead TCP/IP-based networks were conventionally considered to be prohibitively expensive and thus were avoided at all costs. To that end, the primary goal of almost any existing partitioning scheme is to minimize the number of cross-partition transactions. However, with the new generation of fast RDMA-enabled networks, this assumption is no longer valid. In fact, recent work has shown that distributed databases can scale even when the majority of transactions are cross-partition. In this paper, we first make the case that the new bottleneck which hinders truly scalable transaction processing in modern RDMA-enabled databases is data contention, and that optimizing for data contention leads to different partitioning layouts than optimizing for the number of distributed transactions. We then present Chiller, a new approach to data partitioning and transaction execution, which aims to minimize data contention for both local and distributed transactions. Finally, we evaluate Chiller using various workloads, and show that our partitioning and execution strategy outperforms traditional partitioning techniques which try to avoid distributed transactions, by up to a factor of 2. △ Less

Submitted 16 April, 2020; v1 submitted 29 November, 2018; originally announced November 2018.

arXiv:1811.06224 [pdf, other]

Model-based Approximate Query Processing

Authors: Moritz Kulessa, Alejandro Molina, Carsten Binnig, Benjamin Hilprecht, Kristian Kersting

Abstract: Interactive visualizations are arguably the most important tool to explore, understand and convey facts about data. In the past years, the database community has been working on different techniques for Approximate Query Processing (AQP) that aim to deliver an approximate query result given a fixed time bound to support interactive visualizations better. However, classical AQP approaches suffer fr… ▽ More Interactive visualizations are arguably the most important tool to explore, understand and convey facts about data. In the past years, the database community has been working on different techniques for Approximate Query Processing (AQP) that aim to deliver an approximate query result given a fixed time bound to support interactive visualizations better. However, classical AQP approaches suffer from various problems that limit the applicability to support the ad-hoc exploration of a new data set: (1) Classical AQP approaches that perform online sampling can support ad-hoc exploration queries but yield low quality if executed over rare subpopulations. (2) Classical AQP approaches that rely on offline sampling can use some form of biased sampling to mitigate these problems but require a priori knowledge of the workload, which is often not realistic if users want to explore a new database. In this paper, we present a new approach to AQP called Model-based AQP that leverages generative models learned over the complete database to answer SQL queries at interactive speeds. Different from classical AQP approaches, generative models allow us to compute responses to ad-hoc queries and deliver high-quality estimates also over rare subpopulations at the same time. In our experiments with real and synthetic data sets, we show that Model-based AQP can in many scenarios return more accurate results in a shorter runtime. Furthermore, we think that our techniques of using generative models presented in this paper can not only be used for AQP in databases but also has applications for other database problems including Query Optimization as well as Data Cleaning. △ Less

Submitted 15 November, 2018; originally announced November 2018.

arXiv:1804.02593 [pdf, other]

IDEBench: A Benchmark for Interactive Data Exploration

Authors: Philipp Eichmann, Carsten Binnig, Tim Kraska, Emanuel Zgraggen

Abstract: Existing benchmarks for analytical database systems such as TPC-DS and TPC-H are designed for static reporting scenarios. The main metric of these benchmarks is the performance of running individual SQL queries over a synthetic database. In this paper, we argue that such benchmarks are not suitable for evaluating database workloads originating from interactive data exploration (IDE) systems where… ▽ More Existing benchmarks for analytical database systems such as TPC-DS and TPC-H are designed for static reporting scenarios. The main metric of these benchmarks is the performance of running individual SQL queries over a synthetic database. In this paper, we argue that such benchmarks are not suitable for evaluating database workloads originating from interactive data exploration (IDE) systems where most queries are ad-hoc, not based on predefined reports, and built incrementally. As a main contribution, we present a novel benchmark called IDEBench that can be used to evaluate the performance of database systems for IDE workloads. As opposed to traditional benchmarks for analytical database systems, our goal is to provide more meaningful workloads and datasets that can be used to benchmark IDE query engines, with a particular focus on metrics that capture the trade-off between query performance and quality of the result. As a second contribution, this paper evaluates and discusses the performance results of selected IDE query engines using our benchmark. The study includes two commercial systems, as well as two research prototypes (IDEA, approXimateDB/XDB), and one traditional analytical database system (MonetDB). △ Less

Submitted 7 April, 2018; originally announced April 2018.

arXiv:1804.00401 [pdf, other]

An End-to-end Neural Natural Language Interface for Databases

Authors: Prasetya Utama, Nathaniel Weir, Fuat Basik, Carsten Binnig, Ugur Cetintemel, Benjamin Hättasch, Amir Ilkhechi, Shekar Ramaswamy, Arif Usta

Abstract: The ability to extract insights from new data sets is critical for decision making. Visual interactive tools play an important role in data exploration since they provide non-technical users with an effective way to visually compose queries and comprehend the results. Natural language has recently gained traction as an alternative query interface to databases with the potential to enable non-exper… ▽ More The ability to extract insights from new data sets is critical for decision making. Visual interactive tools play an important role in data exploration since they provide non-technical users with an effective way to visually compose queries and comprehend the results. Natural language has recently gained traction as an alternative query interface to databases with the potential to enable non-expert users to formulate complex questions and information needs efficiently and effectively. However, understanding natural language questions and translating them accurately to SQL is a challenging task, and thus Natural Language Interfaces for Databases (NLIDBs) have not yet made their way into practical tools and commercial products. In this paper, we present DBPal, a novel data exploration tool with a natural language interface. DBPal leverages recent advances in deep models to make query understanding more robust in the following ways: First, DBPal uses a deep model to translate natural language statements to SQL, making the translation process more robust to paraphrasing and other linguistic variations. Second, to support the users in phrasing questions without knowing the database schema and the query features, DBPal provides a learned auto-completion model that suggests partial query extensions to users during query formulation and thus helps to write complex queries. △ Less

Submitted 2 April, 2018; originally announced April 2018.

arXiv:1801.10207 [pdf, other]

doi 10.1145/3299869.3319860

FITing-Tree: A Data-aware Index Structure

Authors: Alex Galakatos, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, Tim Kraska

Abstract: Index structures are one of the most important tools that DBAs leverage to improve the performance of analytics and transactional workloads. However, building several indexes over large datasets can often become prohibitive and consume valuable system resources. In fact, a recent study showed that indexes created as part of the TPC-C benchmark can account for 55% of the total memory available in a… ▽ More Index structures are one of the most important tools that DBAs leverage to improve the performance of analytics and transactional workloads. However, building several indexes over large datasets can often become prohibitive and consume valuable system resources. In fact, a recent study showed that indexes created as part of the TPC-C benchmark can account for 55% of the total memory available in a modern DBMS. This overhead consumes valuable and expensive main memory, and limits the amount of space available to store new data or process existing data. In this paper, we present FITing-Tree, a novel form of a learned index which uses piece-wise linear functions with a bounded error specified at construction time. This error knob provides a tunable parameter that allows a DBA to FIT an index to a dataset and workload by being able to balance lookup performance and space consumption. To navigate this tradeoff, we provide a cost model that helps determine an appropriate error parameter given either (1) a lookup latency requirement (e.g., 500ns) or (2) a storage budget (e.g., 100MB). Using a variety of real-world datasets, we show that our index is able to provide performance that is comparable to full index structures while reducing the storage footprint by orders of magnitude. △ Less

Submitted 25 March, 2020; v1 submitted 30 January, 2018; originally announced January 2018.

Comments: 18 pages

Journal ref: SIGMOD (2019) 1189-1206

arXiv:1612.01040 [pdf, other]

Controlling False Discoveries During Interactive Data Exploration

Authors: Zheguang Zhao, Lorenzo De Stefani, Emanuel Zgraggen, Carsten Binnig, Eli Upfal, Tim Kraska

Abstract: Recent tools for interactive data exploration significantly increase the chance that users make false discoveries. The crux is that these tools implicitly allow the user to test a large body of different hypotheses with just a few clicks thus incurring in the issue commonly known in statistics as the multiple hypothesis testing error. In this paper, we propose solutions to integrate multiple hypot… ▽ More Recent tools for interactive data exploration significantly increase the chance that users make false discoveries. The crux is that these tools implicitly allow the user to test a large body of different hypotheses with just a few clicks thus incurring in the issue commonly known in statistics as the multiple hypothesis testing error. In this paper, we propose solutions to integrate multiple hypothesis testing control into interactive data exploration tools. A key insight is that existing methods for controlling the false discovery rate (such as FDR) are not directly applicable for interactive data exploration. We therefore discuss a set of new control procedures that are better suited and integrated them in our system called Aware. By means of extensive experiments using both real-world and synthetic data sets we demonstrate how Aware can help experts and novice users alike to efficiently control false discoveries. △ Less

Submitted 3 December, 2016; originally announced December 2016.

arXiv:1608.05678 [pdf, ps, other]

Revisiting Reuse in Main Memory Database Systems

Authors: Kayhan Dursun, Carsten Binnig, Ugur Cetintemel, Tim Kraska

Abstract: Reusing intermediates in databases to speed-up analytical query processing has been studied in the past. Existing solutions typically require intermediate results of individual operators to be materialized into temporary tables to be considered for reuse in subsequent queries. However, these approaches are fundamentally ill-suited for use in modern main memory databases. The reason is that modern… ▽ More Reusing intermediates in databases to speed-up analytical query processing has been studied in the past. Existing solutions typically require intermediate results of individual operators to be materialized into temporary tables to be considered for reuse in subsequent queries. However, these approaches are fundamentally ill-suited for use in modern main memory databases. The reason is that modern main memory DBMSs are typically limited by the bandwidth of the memory bus, thus query execution is heavily optimized to keep tuples in the CPU caches and registers. To that end, adding additional materialization operations into a query plan not only add additional traffic to the memory bus but more importantly prevent the important cache- and register-locality opportunities resulting in high performance penalties. In this paper we study a novel reuse model for intermediates, which caches internal physical data structures materialized during query processing (due to pipeline breakers) and externalizes them so that they become reusable for upcoming operations. We focus on hash tables, the most commonly used internal data structure in main memory databases to perform join and aggregation operations. As queries arrive, our reuse-aware optimizer reasons about the reuse opportunities for hash tables, employing cost models that take into account hash table statistics together with the CPU and data movement costs within the cache hierarchy. Experimental results, based on our HashStash prototype demonstrate performance gains of $2\times$ for typical analytical workloads with no additional overhead for materializing intermediates. △ Less

Submitted 19 August, 2016; originally announced August 2016.

Comments: 13 Pages, 11 Figures

arXiv:1607.00655 [pdf, other]

The End of a Myth: Distributed Transactions Can Scale

Authors: Erfan Zamanian, Carsten Binnig, Tim Kraska, Tim Harris

Abstract: The common wisdom is that distributed transactions do not scale. But what if distributed transactions could be made scalable using the next generation of networks and a redesign of distributed databases? There would be no need for developers anymore to worry about co-partitioning schemes to achieve decent performance. Application development would become easier as data placement would no longer de… ▽ More The common wisdom is that distributed transactions do not scale. But what if distributed transactions could be made scalable using the next generation of networks and a redesign of distributed databases? There would be no need for developers anymore to worry about co-partitioning schemes to achieve decent performance. Application development would become easier as data placement would no longer determine how scalable an application is. Hardware provisioning would be simplified as the system administrator can expect a linear scale-out when adding more machines rather than some complex sub-linear function, which is highly application specific. In this paper, we present the design of our novel scalable database system NAM-DB and show that distributed transactions with the very common Snapshot Isolation guarantee can indeed scale using the next generation of RDMA-enabled network technology without any inherent bottlenecks. Our experiments with the TPC-C benchmark show that our system scales linearly to over 6.5 million new-order (14.5 million total) distributed transactions per second on 56 machines. △ Less

Submitted 21 November, 2016; v1 submitted 3 July, 2016; originally announced July 2016.

Comments: 12 pages

arXiv:1507.05591 [pdf, other]

Estimating the Impact of Unknown Unknowns on Aggregate Query Results

Authors: Yeounoh Chung, Michael Lind Mortensen, Carsten Binnig, Tim Kraska

Abstract: It is common practice for data scientists to acquire and integrate disparate data sources to achieve higher quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) is the integrated data set complete and (2) what is the impact of any unknown (i.e., unobserved) data on query results? In this work, we develop and analyze techniques to estimate… ▽ More It is common practice for data scientists to acquire and integrate disparate data sources to achieve higher quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) is the integrated data set complete and (2) what is the impact of any unknown (i.e., unobserved) data on query results? In this work, we develop and analyze techniques to estimate the impact of the unknown data (a.k.a., unknown unknowns) on simple aggregate queries. The key idea is that the overlap between different data sources enables us to estimate the number and values of the missing data items. Our main techniques are parameter-free and do not assume prior knowledge about the distribution. Through a series of experiments, we show that estimating the impact of unknown unknowns is invaluable to better assess the results of aggregate queries over integrated data sources. △ Less

Submitted 26 December, 2015; v1 submitted 20 July, 2015; originally announced July 2015.

arXiv:1504.01048 [pdf, other]

The End of Slow Networks: It's Time for a Redesign

Authors: Carsten Binnig, Andrew Crotty, Alex Galakatos, Tim Kraska, Erfan Zamanian

Abstract: Next generation high-performance RDMA-capable networks will require a fundamental rethinking of the design and architecture of modern distributed DBMSs. These systems are commonly designed and optimized under the assumption that the network is the bottleneck: the network is slow and "thin", and thus needs to be avoided as much as possible. Yet this assumption no longer holds true. With InfiniBand… ▽ More Next generation high-performance RDMA-capable networks will require a fundamental rethinking of the design and architecture of modern distributed DBMSs. These systems are commonly designed and optimized under the assumption that the network is the bottleneck: the network is slow and "thin", and thus needs to be avoided as much as possible. Yet this assumption no longer holds true. With InfiniBand FDR 4x, the bandwidth available to transfer data across network is in the same ballpark as the bandwidth of one memory channel, and it increases even further with the most recent EDR standard. Moreover, with the increasing advances of RDMA, the latency improves similarly fast. In this paper, we first argue that the "old" distributed database design is not capable of taking full advantage of the network. Second, we propose architectural redesigns for OLTP, OLAP and advanced analytical frameworks to take better advantage of the improved bandwidth, latency and RDMA capabilities. Finally, for each of the workload categories, we show that remarkable performance improvements can be achieved. △ Less

Submitted 19 December, 2015; v1 submitted 4 April, 2015; originally announced April 2015.

Showing 1–38 of 38 results for author: Binnig, C