Showing 1–38 of 38 results for author: Binnig, C

  1. arXiv:2504.10950  [pdf, other]

    cs.DB

    Unveiling Challenges for LLMs in Enterprise Data Engineering

    Authors: Jan-Micha Bodensohn, Ulf Brackmann, Liane Vogel, Anupam Sanghi, Carsten Binnig

    Abstract: Large Language Models (LLMs) have demonstrated significant potential for automating data engineering tasks on tabular data, giving enterprises a valuable opportunity to reduce the high costs associated with manual data handling. However, the enterprise domain introduces unique challenges that existing LLM-based approaches for data engineering often overlook, such as large table sizes, more complex…

    Submitted 15 April, 2025; originally announced April 2025.

  2. arXiv:2504.10704  [pdf, other]

    cs.DC cs.DB

    PDSP-Bench: A Benchmarking System for Parallel and Distributed Stream Processing

    Authors: Pratyush Agnihotri, Boris Koldehofe, Roman Heinrich, Carsten Binnig, Manisha Luthra

    Abstract: The paper introduces PDSP-Bench, a novel benchmarking system designed for a systematic understanding of performance of parallel stream processing in a distributed environment. Such an understanding is essential for determining how Stream Processing Systems (SPS) use operator parallelism and the available resources to process massive workloads of modern applications. Existing benchmarking systems f…

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: 22

  3. arXiv:2503.23863  [pdf, other]

    cs.DB

    GRACEFUL: A Learned Cost Estimator For UDFs

    Authors: Johannes Wehrstein, Tiemo Bang, Roman Heinrich, Carsten Binnig

    Abstract: User-Defined-Functions (UDFs) are a pivotal feature in modern DBMS, enabling the extension of native DBMS functionality with custom logic. However, the integration of UDFs into query optimization processes poses significant challenges, primarily due to the difficulty of estimating UDF execution costs. Consequently, existing cost models in DBMS optimizers largely ignore UDFs or rely on static assum…

    Submitted 31 March, 2025; originally announced March 2025.

    Comments: The paper has been accepted by ICDE 2025

  4. arXiv:2503.20932  [pdf, other]

    cs.DB cs.CR

    Reflex: Speeding Up SMPC Query Execution through Efficient and Flexible Intermediate Result Size Trimming

    Authors: Long Gu, Shaza Zeitouni, Carsten Binnig, Zsolt István

    Abstract: There is growing interest in Secure Analytics, but fully oblivious query execution in Secure Multi-Party Computation (MPC) settings is often prohibitively expensive. Recent related works propose different approaches to trimming the size of intermediate results between query operators, resulting in significant speedups at the cost of some information leakage. In this work, we generalize these ideas…

    Submitted 26 March, 2025; originally announced March 2025.

  5. arXiv:2502.01229  [pdf, other]

    cs.DB

    How Good are Learned Cost Models, Really? Insights from Query Optimization Tasks

    Authors: Roman Heinrich, Manisha Luthra, Johannes Wehrstein, Harald Kornmayer, Carsten Binnig

    Abstract: Traditionally, query optimizers rely on cost models to choose the best execution plan from several candidates, making precise cost estimates critical for efficient query execution. In recent years, cost models based on machine learning have been proposed to overcome the weaknesses of traditional cost models. While these models have been shown to provide better prediction accuracy, only limited eff…

    Submitted 3 February, 2025; originally announced February 2025.

    Comments: The paper has been accepted by SIGMOD 2025

  6. arXiv:2410.22522  [pdf, other]

    cs.DB

    Efficient Learned Query Execution over Text and Tables [Technical Report]

    Authors: Matthias Urban, Carsten Binnig

    Abstract: In this paper, we present ELEET, a novel execution engine that allows one to seamlessly query and process text as a first-class citizen along with tables. To enable such a seamless integration of text and tables, ELEET leverages learned multi-modal operators (MMOps) such as joins and unions that seamlessly combine structured with unstructured textual data. While large language models (LLM) such as…

    Submitted 29 October, 2024; originally announced October 2024.

  7. arXiv:2408.16170  [pdf, other]

    cs.DB cs.LG

    CardBench: A Benchmark for Learned Cardinality Estimation in Relational Databases

    Authors: Yannis Chronis, Yawen Wang, Yu Gan, Sami Abu-El-Haija, Chelsea Lin, Carsten Binnig, Fatma Özcan

    Abstract: Cardinality estimation is crucial for enabling high query performance in relational databases. Recently, learned cardinality estimation models have been proposed to improve accuracy, but there is no systematic benchmark or dataset that allows researchers to evaluate the progress made by new learned approaches and even systematically develop new learned approaches. In this paper, we are releasing a…

    Submitted 28 August, 2024; originally announced August 2024.

  8. Benchmarking Analytical Query Processing in Intel SGXv2

    Authors: Adrian Lutsch, Muhammad El-Hindi, Matthias Heinrich, Daniel Ritter, Zsolt István, Carsten Binnig

    Abstract: Trusted Execution Environments (TEEs), such as Intel's Software Guard Extensions (SGX), are increasingly being adopted to address trust and compliance issues in the public cloud. Intel SGX's second generation (SGXv2) addresses many limitations of its predecessor (SGXv1), offering the potential for secure and efficient analytical cloud DBMSs. We assess this potential and conduct the first in-depth…

    Submitted 14 October, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

    Comments: 13 pages, 17 figures; To be published in EDBT 2025; Code available at: https://github.com/DataManagementLab/sgxv2-analytical-query-processing-benchmarks; major changes: overhauled figures, improved section 4.2, added new section 7, improved experiments, conference template

    ACM Class: H.2; B.8

    Journal ref: Proceedings of the 28th International Conference on Extending Database Technology, EDBT 2025, Barcelona, Spain, March 25-28, 2025. Pages 516-528

  9. COSTREAM: Learned Cost Models for Operator Placement in Edge-Cloud Environments

    Authors: Roman Heinrich, Carsten Binnig, Harald Kornmayer, Manisha Luthra

    Abstract: In this work, we present COSTREAM, a novel learned cost model for Distributed Stream Processing Systems that provides accurate predictions of the execution costs of a streaming query in an edge-cloud environment. The cost model can be used to find an initial placement of operators across heterogeneous hardware, which is particularly important in these environments. In our evaluation, we demonstrat…

    Submitted 13 March, 2024; originally announced March 2024.

    Comments: This paper has been accepted by IEEE ICDE 2024

  10. arXiv:2310.13581  [pdf, other]

    cs.DB cs.AI

    SPARE: A Single-Pass Neural Model for Relational Databases

    Authors: Benjamin Hilprecht, Kristian Kersting, Carsten Binnig

    Abstract: While there has been extensive work on deep neural networks for images and text, deep learning for relational databases (RDBs) is still a rather unexplored field. One direction that recently gained traction is to apply Graph Neural Networks (GNNs) to RDBs. However, training GNNs on large relational databases (i.e., data stored in multiple database tables) is rather inefficient due to multiple ro…

    Submitted 20 October, 2023; originally announced October 2023.

  11. arXiv:2308.03424  [pdf, other]

    cs.DB

    CAESURA: Language Models as Multi-Modal Query Planners

    Authors: Matthias Urban, Carsten Binnig

    Abstract: Traditional query planners translate SQL queries into query plans to be executed over relational data. However, it is impossible to query other data modalities, such as images, text, or video stored in modern data systems such as data lakes using these query planners. In this paper, we propose Language-Model-Driven Query Planning, a new paradigm of query planning that uses Language Models to trans…

    Submitted 7 August, 2023; originally announced August 2023.

    Comments: 6 pages, 4 figures

  12. arXiv:2305.15321  [pdf, other]

    cs.DB cs.CL

    Towards Foundation Models for Relational Databases [Vision Paper]

    Authors: Liane Vogel, Benjamin Hilprecht, Carsten Binnig

    Abstract: Tabular representation learning has recently gained a lot of attention. However, existing approaches only learn a representation from a single table, and thus ignore the potential to learn from the full structure of relational databases, including neighboring tables that can contain important information for a contextualized representation. Moreover, current models are significantly limited in sca…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: Accepted at the Tabular Representation Learning Workshop at NeurIPS 2022 (TRL@NeurIPS2022)

  13. arXiv:2304.13559  [pdf, other]

    cs.DB cs.CL

    Towards Multi-Modal DBMSs for Seamless Querying of Texts and Tables

    Authors: Matthias Urban, Carsten Binnig

    Abstract: In this paper, we propose Multi-Modal Databases (MMDBs), which is a new class of database systems that can seamlessly query text and tables using SQL. To enable seamless querying of textual data using SQL in an MMDB, we propose to extend relational databases with so-called multi-modal operators (MMOps) which are based on the advances of recent large language models such as GPT-3. The main idea of…

    Submitted 28 April, 2023; v1 submitted 26 April, 2023; originally announced April 2023.

  14. Zero-Shot Cost Models for Distributed Stream Processing

    Authors: Roman Heinrich, Manisha Luthra, Harald Kornmayer, Carsten Binnig

    Abstract: This paper proposes a learned cost estimation model for Distributed Stream Processing Systems (DSPS) with an aim to provide accurate cost predictions of executing queries. A major premise of this work is that the proposed learned model can generalize to the dynamics of streaming workloads out-of-the-box. This means a model once trained can accurately predict performance metrics such as latency and…

    Submitted 8 July, 2022; originally announced July 2022.

    Comments: To appear in the Proceedings of The 16th ACM International Conference on Distributed and Event-based Systems (DEBS '22), June 27-30, 2022, Copenhagen, Denmark

  15. arXiv:2207.01269  [pdf, other]

    cs.DB cs.LG

    DiffML: End-to-end Differentiable ML Pipelines

    Authors: Benjamin Hilprecht, Christian Hammacher, Eduardo Reis, Mohamed Abdelaal, Carsten Binnig

    Abstract: In this paper, we present our vision of differentiable ML pipelines called DiffML to automate the construction of ML pipelines in an end-to-end fashion. The idea is that DiffML allows jointly training not just the ML model itself but also the entire pipeline including data preprocessing steps, e.g., data cleaning, feature selection, etc. Our core idea is to formulate all pipeline steps in a differ…

    Submitted 5 July, 2022; v1 submitted 4 July, 2022; originally announced July 2022.

  16. arXiv:2206.00623  [pdf, other]

    cs.DB

    P4DB -- The Case for In-Network OLTP (Extended Technical Report)

    Authors: Matthias Jasny, Lasse Thostrup, Tobias Ziegler, Carsten Binnig

    Abstract: In this paper, we present a new approach for distributed DBMSs called P4DB that uses a programmable switch to accelerate OLTP workloads. The main idea of P4DB is that it implements a transaction processing engine on top of a P4-programmable switch. The switch can thus act as an accelerator in the network, especially when it is used to store and process hot (contended) tuples on the switch. In our…

    Submitted 1 June, 2022; originally announced June 2022.

    Comments: Extended Technical Report for: P4DB - The Case for In-Network OLTP

  17. arXiv:2203.14144  [pdf, other]

    cs.DB cs.CL

    Demonstrating CAT: Synthesizing Data-Aware Conversational Agents for Transactional Databases

    Authors: Marius Gassen, Benjamin Hättasch, Benjamin Hilprecht, Nadja Geisler, Alexander Fraser, Carsten Binnig

    Abstract: Databases for OLTP are often the backbone for applications such as hotel room or cinema ticket booking applications. However, developing a conversational agent (i.e., a chatbot-like interface) to allow end-users to interact with an application using natural language requires both immense amounts of training data and NLP expertise. This motivates CAT, which can be used to easily create conversation…

    Submitted 26 March, 2022; originally announced March 2022.

    Comments: Submitted as demonstration proposal to VLDB 2022

  18. arXiv:2203.04663  [pdf, other]

    cs.CL cs.DB

    ASET: Ad-hoc Structured Exploration of Text Collections [Extended Abstract]

    Authors: Benjamin Hättasch, Jan-Micha Bodensohn, Carsten Binnig

    Abstract: In this paper, we propose a new system called ASET that allows users to perform structured explorations of text collections in an ad-hoc manner. The main idea of ASET is to use a new two-phase approach that first extracts a superset of information nuggets from the texts using existing extractors such as named entity recognizers and then matches the extractions to a structured table definition as r…

    Submitted 9 March, 2022; originally announced March 2022.

    Comments: Accepted at the 3rd International Workshop on Applied AI for Database Systems and Applications (AIDB'21), August 20, 2021, Copenhagen, Denmark

  19. arXiv:2203.04366  [pdf, other]

    cs.DB cs.CL

    It's AI Match: A Two-Step Approach for Schema Matching Using Embeddings

    Authors: Benjamin Hättasch, Michael Truong-Ngoc, Andreas Schmidt, Carsten Binnig

    Abstract: Since data is often stored in different sources, it needs to be integrated to gather a global view that is required in order to create value and derive knowledge from it. A critical step in data integration is schema matching which aims to find semantic correspondences between elements of two schemata. In order to reduce the manual effort involved in schema matching, many solutions for the automat…

    Submitted 8 March, 2022; originally announced March 2022.

    Comments: Accepted to the 2nd International Workshop on Applied AI for Database Systems and Applications (AIDB'20), August 31, 2020, Tokyo, Japan

  20. arXiv:2201.00561  [pdf, other]

    cs.DB cs.AI

    Zero-Shot Cost Models for Out-of-the-box Learned Cost Prediction

    Authors: Benjamin Hilprecht, Carsten Binnig

    Abstract: In this paper, we introduce zero-shot cost models which enable learned cost estimation that generalizes to unseen databases. In contrast to state-of-the-art workload-driven approaches which require executing a large set of training queries on every new database, zero-shot cost models thus allow instantiating a learned cost model out-of-the-box without expensive training data collection. To enabl…

    Submitted 3 January, 2022; originally announced January 2022.

  21. arXiv:2105.12457  [pdf, other]

    cs.DB

    ReStore -- Neural Data Completion for Relational Databases

    Authors: Benjamin Hilprecht, Carsten Binnig

    Abstract: Classical approaches for OLAP assume that the data of all tables is complete. However, in case of incomplete tables with missing tuples, classical approaches fail since the result of a SQL aggregate query might significantly differ from the results computed on the full dataset. Today, the only way to deal with missing data is to manually complete the dataset which causes not only high efforts but…

    Submitted 26 May, 2021; originally announced May 2021.

  22. arXiv:2105.00642  [pdf, other]

    cs.DB cs.AI

    One Model to Rule them All: Towards Zero-Shot Learning for Databases

    Authors: Benjamin Hilprecht, Carsten Binnig

    Abstract: In this paper, we present our vision of so-called zero-shot learning for databases which is a new learning approach for database components. Zero-shot learning for databases is inspired by recent advances in transfer learning of models such as GPT-3 and can support a new database out-of-the-box without the need to train a new model. Furthermore, it can easily be extended to few-shot learning by fu…

    Submitted 3 January, 2022; v1 submitted 3 May, 2021; originally announced May 2021.

  23. arXiv:2009.09433  [pdf, other]

    cs.PF

    On the Throughput Optimization in Large-Scale Batch-Processing Systems

    Authors: Sounak Kar, Robin Rehrmann, Arpan Mukhopadhyay, Bastian Alt, Florin Ciucu, Heinz Koeppl, Carsten Binnig, Amr Rizk

    Abstract: We analyze a data-processing system with $n$ clients producing jobs which are processed in \textit{batches} by $m$ parallel servers; the system throughput critically depends on the batch size and a corresponding sub-additive speedup function. In practice, throughput optimization relies on numerical searches for the optimal batch size, a process that can take up to multiple days in existing commerc…

    Submitted 20 September, 2020; originally announced September 2020.

    Comments: 15 pages

  24. arXiv:2009.02258  [pdf, other]

    cs.DB cs.LG

    AnyDB: An Architecture-less DBMS for Any Workload

    Authors: Tiemo Bang, Norman May, Ilia Petrov, Carsten Binnig

    Abstract: In this paper, we propose a radical new approach for scale-out distributed DBMSs. Instead of hard-baking an architectural model, such as a shared-nothing architecture, into the distributed DBMS design, we aim for a new class of so-called architecture-less DBMSs. The main idea is that an architecture-less DBMS can mimic any architecture on a per-query basis on-the-fly without any additional overhea…

    Submitted 4 September, 2020; originally announced September 2020.

    Comments: Submitted to 11th Annual Conference on Innovative Data Systems Research (CIDR 21)

  25. arXiv:1909.06182  [pdf, other]

    cs.DB

    DBPal: Weak Supervision for Learning a Natural Language Interface to Databases

    Authors: Nathaniel Weir, Andrew Crotty, Alex Galakatos, Amir Ilkhechi, Shekar Ramaswamy, Rohin Bhushan, Ugur Cetintemel, Prasetya Utama, Nadja Geisler, Benjamin Hättasch, Steffen Eger, Carsten Binnig

    Abstract: This paper describes DBPal, a new system to translate natural language utterances into SQL statements using a neural machine translation model. While other recent approaches use neural machine translation to implement a Natural Language Interface to Databases (NLIDB), existing techniques rely on supervised learning with manually curated training data, which results in substantial overhead for supp…

    Submitted 11 September, 2019; originally announced September 2019.

    Comments: arXiv admin note: text overlap with arXiv:1804.00401

  26. arXiv:1909.00607  [pdf, other]

    cs.DB

    DeepDB: Learn from Data, not from Queries!

    Authors: Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian Kersting, Carsten Binnig

    Abstract: The typical approach for learned DBMS components is to capture the behavior by running a representative set of queries and use the observations to train a machine learning model. This workload-driven approach, however, has two major downsides. First, collecting the training data can be very expensive, since all queries need to be executed on potentially large databases. Second, training data has t…

    Submitted 2 September, 2019; originally announced September 2019.

  27. arXiv:1904.01279  [pdf, other]

    cs.DB

    Learning a Partitioning Advisor with Deep Reinforcement Learning

    Authors: Benjamin Hilprecht, Carsten Binnig, Uwe Roehm

    Abstract: Commercial data analytics products such as Microsoft Azure SQL Data Warehouse or Amazon Redshift provide ready-to-use scale-out database solutions for OLAP-style workloads in the cloud. While the provisioning of a database cluster is usually fully automated by cloud providers, customers typically still have to make important design decisions which were traditionally made by the database administra…

    Submitted 2 April, 2019; originally announced April 2019.

  28. arXiv:1812.08032  [pdf, other]

    cs.HC cs.DB cs.LG

    Progressive Data Science: Potential and Challenges

    Authors: Cagatay Turkay, Nicola Pezzotti, Carsten Binnig, Hendrik Strobelt, Barbara Hammer, Daniel A. Keim, Jean-Daniel Fekete, Themis Palpanas, Yunhai Wang, Florin Rusu

    Abstract: Data science requires time-consuming iterative manual activities. In particular, activities such as data selection, preprocessing, transformation, and mining, highly depend on iterative trial-and-error processes that could be sped up significantly by providing quick feedback on the impact of changes. The idea of progressive data science is to compute the results of changes in a progressive manner,…

    Submitted 12 September, 2019; v1 submitted 19 December, 2018; originally announced December 2018.

    ACM Class: H.5.2; H.3.m; I.2.m; I.3.m

  29. Chiller: Contention-centric Transaction Execution and Data Partitioning for Modern Networks

    Authors: Erfan Zamanian, Julian Shun, Carsten Binnig, Tim Kraska

    Abstract: Distributed transactions on high-overhead TCP/IP-based networks were conventionally considered to be prohibitively expensive and thus were avoided at all costs. To that end, the primary goal of almost any existing partitioning scheme is to minimize the number of cross-partition transactions. However, with the new generation of fast RDMA-enabled networks, this assumption is no longer valid. In fact…

    Submitted 16 April, 2020; v1 submitted 29 November, 2018; originally announced November 2018.

  30. arXiv:1811.06224  [pdf, other]

    cs.DB cs.LG

    Model-based Approximate Query Processing

    Authors: Moritz Kulessa, Alejandro Molina, Carsten Binnig, Benjamin Hilprecht, Kristian Kersting

    Abstract: Interactive visualizations are arguably the most important tool to explore, understand and convey facts about data. In the past years, the database community has been working on different techniques for Approximate Query Processing (AQP) that aim to deliver an approximate query result given a fixed time bound to support interactive visualizations better. However, classical AQP approaches suffer fr…

    Submitted 15 November, 2018; originally announced November 2018.

  31. arXiv:1804.02593  [pdf, other]

    cs.DB

    IDEBench: A Benchmark for Interactive Data Exploration

    Authors: Philipp Eichmann, Carsten Binnig, Tim Kraska, Emanuel Zgraggen

    Abstract: Existing benchmarks for analytical database systems such as TPC-DS and TPC-H are designed for static reporting scenarios. The main metric of these benchmarks is the performance of running individual SQL queries over a synthetic database. In this paper, we argue that such benchmarks are not suitable for evaluating database workloads originating from interactive data exploration (IDE) systems where…

    Submitted 7 April, 2018; originally announced April 2018.

  32. arXiv:1804.00401  [pdf, other]

    cs.DB cs.CL cs.HC

    An End-to-end Neural Natural Language Interface for Databases

    Authors: Prasetya Utama, Nathaniel Weir, Fuat Basik, Carsten Binnig, Ugur Cetintemel, Benjamin Hättasch, Amir Ilkhechi, Shekar Ramaswamy, Arif Usta

    Abstract: The ability to extract insights from new data sets is critical for decision making. Visual interactive tools play an important role in data exploration since they provide non-technical users with an effective way to visually compose queries and comprehend the results. Natural language has recently gained traction as an alternative query interface to databases with the potential to enable non-exper…

    Submitted 2 April, 2018; originally announced April 2018.

  33. FITing-Tree: A Data-aware Index Structure

    Authors: Alex Galakatos, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, Tim Kraska

    Abstract: Index structures are one of the most important tools that DBAs leverage to improve the performance of analytics and transactional workloads. However, building several indexes over large datasets can often become prohibitive and consume valuable system resources. In fact, a recent study showed that indexes created as part of the TPC-C benchmark can account for 55% of the total memory available in a…

    Submitted 25 March, 2020; v1 submitted 30 January, 2018; originally announced January 2018.

    Comments: 18 pages

    Journal ref: SIGMOD (2019) 1189-1206

  34. arXiv:1612.01040  [pdf, other]

    cs.DB stat.ME

    Controlling False Discoveries During Interactive Data Exploration

    Authors: Zheguang Zhao, Lorenzo De Stefani, Emanuel Zgraggen, Carsten Binnig, Eli Upfal, Tim Kraska

    Abstract: Recent tools for interactive data exploration significantly increase the chance that users make false discoveries. The crux is that these tools implicitly allow the user to test a large body of different hypotheses with just a few clicks, thus incurring the issue commonly known in statistics as the multiple hypothesis testing error. In this paper, we propose solutions to integrate multiple hypot…

    Submitted 3 December, 2016; originally announced December 2016.

  35. arXiv:1608.05678  [pdf, ps, other]

    cs.DB

    Revisiting Reuse in Main Memory Database Systems

    Authors: Kayhan Dursun, Carsten Binnig, Ugur Cetintemel, Tim Kraska

    Abstract: Reusing intermediates in databases to speed up analytical query processing has been studied in the past. Existing solutions typically require intermediate results of individual operators to be materialized into temporary tables to be considered for reuse in subsequent queries. However, these approaches are fundamentally ill-suited for use in modern main memory databases. The reason is that modern…

    Submitted 19 August, 2016; originally announced August 2016.

    Comments: 13 Pages, 11 Figures

  36. arXiv:1607.00655  [pdf, other]

    cs.DB

    The End of a Myth: Distributed Transactions Can Scale

    Authors: Erfan Zamanian, Carsten Binnig, Tim Kraska, Tim Harris

    Abstract: The common wisdom is that distributed transactions do not scale. But what if distributed transactions could be made scalable using the next generation of networks and a redesign of distributed databases? There would be no need for developers anymore to worry about co-partitioning schemes to achieve decent performance. Application development would become easier as data placement would no longer de…

    Submitted 21 November, 2016; v1 submitted 3 July, 2016; originally announced July 2016.

    Comments: 12 pages

  37. arXiv:1507.05591  [pdf, other]

    cs.DB

    Estimating the Impact of Unknown Unknowns on Aggregate Query Results

    Authors: Yeounoh Chung, Michael Lind Mortensen, Carsten Binnig, Tim Kraska

    Abstract: It is common practice for data scientists to acquire and integrate disparate data sources to achieve higher quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) is the integrated data set complete and (2) what is the impact of any unknown (i.e., unobserved) data on query results? In this work, we develop and analyze techniques to estimate…

    Submitted 26 December, 2015; v1 submitted 20 July, 2015; originally announced July 2015.

  38. arXiv:1504.01048  [pdf, other]

    cs.DB

    The End of Slow Networks: It's Time for a Redesign

    Authors: Carsten Binnig, Andrew Crotty, Alex Galakatos, Tim Kraska, Erfan Zamanian

    Abstract: Next generation high-performance RDMA-capable networks will require a fundamental rethinking of the design and architecture of modern distributed DBMSs. These systems are commonly designed and optimized under the assumption that the network is the bottleneck: the network is slow and "thin", and thus needs to be avoided as much as possible. Yet this assumption no longer holds true. With InfiniBand…

    Submitted 19 December, 2015; v1 submitted 4 April, 2015; originally announced April 2015.