doi 10.1145/3722212.3724430

AutoComp: Automated Data Compaction for Log-Structured Tables in Data Lakes

Authors: Anja Gruenheid, Jesús Camacho-Rodríguez, Carlo Curino, Raghu Ramakrishnan, Stanislav Pak, Sumedh Sakdeo, Lenisha Gandhi, Sandeep K. Singhal, Pooja Nilangekar, Daniel J. Abadi

Abstract: The proliferation of small files in data lakes poses significant challenges, including degraded query performance, increased storage costs, and scalability bottlenecks in distributed storage systems. Log-structured table formats (LSTs) such as Delta Lake, Apache Iceberg, and Apache Hudi exacerbate this issue due to their append-only write patterns and metadata-intensive operations. While compaction--the process of consolidating small files into fewer, larger files--is a common solution, existing automation mechanisms often lack the flexibility and scalability to adapt to diverse workloads and system requirements while balancing the trade-offs between compaction benefits and costs. In this paper, we present AutoComp, a scalable framework for automatic data compaction tailored to the needs of modern data lakes. Drawing on deployment experience at LinkedIn, we analyze the operational impact of small file proliferation, establish key requirements for effective automatic compaction, and demonstrate how AutoComp addresses these challenges. Our evaluation, conducted using synthetic benchmarks and production environments via integration with OpenHouse--a control plane for catalog management, schema governance, and data services--shows significant improvements in file count reduction and query performance. We believe AutoComp's built-in extensibility provides a robust foundation for evolving compaction systems, facilitating future integration of refined multi-objective optimization approaches, workload-aware compaction strategies, and expanded support for broader data layout optimizations.

Submitted 5 April, 2025; originally announced April 2025.

Journal ref: ACM SIGMOD 2025

arXiv:2405.01813

doi 10.1145/3555041.3589674

Towards Building Autonomous Data Services on Azure

Authors: Yiwen Zhu, Yuanyuan Tian, Joyce Cahoon, Subru Krishnan, Ankita Agarwal, Rana Alotaibi, Jesús Camacho-Rodríguez, Bibin Chundatt, Andrew Chung, Niharika Dutta, Andrew Fogarty, Anja Gruenheid, Brandon Haynes, Matteo Interlandi, Minu Iyer, Nick Jurgens, Sumeet Khushalani, Brian Kroth, Manoj Kumar, Jyoti Leeka, Sergiy Matusevych, Minni Mittal, Andreas Mueller, Kartheek Muthyala, Harsha Nagulapalli, et al. (13 additional authors not shown)

Abstract: Modern cloud has turned data services into easily accessible commodities. With just a few clicks, users are now able to access a catalog of data processing systems for a wide range of tasks. However, the cloud brings in both complexity and opportunity. While cloud users can quickly start an application by using various data services, it can be difficult to configure and optimize these services to gain the most value from them. For cloud providers, managing every aspect of an ever-increasing set of data services, while meeting customer SLAs and minimizing operational cost is becoming more challenging. Cloud technology enables the collection of significant amounts of workload traces and system telemetry. With the progress in data science (DS) and machine learning (ML), it is feasible and desirable to utilize a data-driven, ML-based approach to automate various aspects of data services, resulting in the creation of autonomous data services. This paper presents our perspectives and insights on creating autonomous data services on Azure. It also covers the future endeavors we plan to undertake and unresolved issues that still need attention.

Submitted 2 May, 2024; originally announced May 2024.

Comments: SIGMOD Companion of the 2023 International Conference on Management of Data. 2023

arXiv:2305.01120

doi 10.1145/3639314

LST-Bench: Benchmarking Log-Structured Tables in the Cloud

Authors: Jesús Camacho-Rodríguez, Ashvin Agrawal, Anja Gruenheid, Ashit Gosalia, Cristian Petculescu, Josep Aguilar-Saborit, Avrilia Floratou, Carlo Curino, Raghu Ramakrishnan

Abstract: Data processing engines increasingly leverage distributed file systems for scalable, cost-effective storage. While the Apache Parquet columnar format has become a popular choice for data storage and retrieval, the immutability of Parquet files renders it impractical to meet the demands of frequent updates in contemporary analytical workloads. Log-Structured Tables (LSTs), such as Delta Lake, Apache Iceberg, and Apache Hudi, offer an alternative for scenarios requiring data mutability, providing a balance between efficient updates and the benefits of columnar storage. They provide features like transactions, time-travel, and schema evolution, enhancing usability and enabling access from multiple engines. Moreover, engines like Apache Spark and Trino can be configured to leverage the optimizations and controls offered by LSTs to meet specific business needs. Conventional benchmarks and tools are inadequate for evaluating the transformative changes in the storage layer resulting from these advancements, as they do not allow us to measure the impact of design and optimization choices in this new setting. In this paper, we propose a novel benchmarking approach and metrics that build upon existing benchmarks, aiming to systematically assess LSTs. We develop a framework, LST-Bench, which facilitates effective exploration and evaluation of the collaborative functioning of LSTs and data processing engines through tailored benchmark packages. A package is a mix of use patterns reflecting a target workload; LST-Bench makes it easy to define a wide range of use patterns and combine them into a package, and we include a baseline package for completeness. Our assessment demonstrates the effectiveness of our framework and benchmark packages in extracting valuable insights across diverse environments. The code for LST-Bench is open-sourced and is available at https://github.com/microsoft/lst-bench/.

Submitted 19 January, 2024; v1 submitted 1 May, 2023; originally announced May 2023.

Journal ref: Proceedings of the ACM on Management of Data (2024) Volume 2 Issue 1

arXiv:2011.05549

Comprehensive and Efficient Workload Compression

Authors: Shaleen Deep, Anja Gruenheid, Paraschos Koutris, Jeffrey Naughton, Stratis Viglas

Abstract: This work studies the problem of constructing a representative workload from a given input analytical query workload where the former serves as an approximation with guarantees of the latter. We discuss our work in the context of workload analysis and monitoring. As an example, evolving system usage patterns in a database system can cause load imbalance and performance regressions which can be controlled by monitoring system usage patterns, i.e., a representative workload, over time. To construct such a workload in a principled manner, we formalize the notions of workload representativity and coverage. These metrics capture the intuition that the distribution of features in a compressed workload should match a target distribution, increasing representativity, and include common queries as well as outliers, increasing coverage. We show that solving this problem optimally is NP-hard and present a novel greedy algorithm that provides approximation guarantees. We compare our techniques to established algorithms in this problem space such as sampling and clustering, and demonstrate advantages and key trade-offs of our techniques.

Submitted 3 February, 2021; v1 submitted 11 November, 2020; originally announced November 2020.

arXiv:1802.07398

doi 10.1145/3269206.3272020

Investigating Rumor News Using Agreement-Aware Search

Authors: Jingbo Shang, Tianhang Sun, Jiaming Shen, Xingbang Liu, Anja Gruenheid, Flip Korn, Adam Lelkes, Cong Yu, Jiawei Han

Abstract: Recent years have witnessed a widespread increase of rumor news generated by humans and machines. Therefore, tools for investigating rumor news have become an urgent necessity. One useful function of such tools is to see ways a specific topic or event is represented by presenting different points of view from multiple sources. In this paper, we propose Maester, a novel agreement-aware search framework for investigating rumor news. Given an investigative question, Maester will retrieve related articles to that question, assign and display top articles from agree, disagree, and discuss categories to users. Splitting the results into these three categories provides the user a holistic view towards the investigative question. We build Maester based on the following two key observations: (1) relatedness can commonly be determined by keywords and entities occurring in both questions and articles, and (2) the level of agreement between the investigative question and the related news article can often be decided by a few key sentences. Accordingly, we use gradient boosting tree models with keyword/entity matching features for relatedness detection, and leverage a recurrent neural network to infer the level of agreement. Our experiments on the Fake News Challenge (FNC) dataset demonstrate up to an order of magnitude improvement of Maester over the original FNC winning solution, for agreement-aware search.

Submitted 16 September, 2018; v1 submitted 20 February, 2018; originally announced February 2018.

arXiv:1610.07732

Online Event Integration with StoryPivot

Authors: Anja Gruenheid, Donald Kossmann, Divesh Srivastava

Abstract: Modern data integration systems need to process large amounts of data from a variety of data sources and with real-time integration constraints. They are not only employed in enterprises for managing internal data but are also used for a variety of web services that use techniques such as entity resolution or data cleaning in live systems. In this work, we discuss a new generation of data integration systems that operate on (un-)structured data in an online setting, i.e., systems which process continuously modified datasets upon which the integration task is based. We use as an example of such a system an online event integration system called StoryPivot. It observes events extracted from news articles in data sources such as the 'Guardian' or the 'Washington Post' which are integrated to show users the evolution of real-world stories over time. The design decisions for StoryPivot are influenced by the trade-off between maintaining high quality integration results while at the same time building a system that processes and integrates events in near real-time. We evaluate our design decisions with experiments on two real-world datasets and generalize our findings to other data integration tasks that have a similar system setup.

Submitted 25 October, 2016; originally announced October 2016.

arXiv:1512.00537

Fault-Tolerant Entity Resolution with the Crowd

Authors: Anja Gruenheid, Besmira Nushi, Tim Kraska, Wolfgang Gatterbauer, Donald Kossmann

Abstract: In recent years, crowdsourcing is increasingly applied as a means to enhance data quality. Although the crowd generates insightful information especially for complex problems such as entity resolution (ER), the output quality of crowd workers is often noisy. That is, workers may unintentionally generate false or contradicting data even for simple tasks. The challenge that we address in this paper is how to minimize the cost for task requesters while maximizing ER result quality under the assumption of unreliable input from the crowd. For that purpose, we first establish how to deduce a consistent ER solution from noisy worker answers as part of the data interpretation problem. We then focus on the next-crowdsource problem which is to find the next task that maximizes the information gain of the ER result for the minimal additional cost. We compare our robust data interpretation strategies to alternative state-of-the-art approaches that do not incorporate the notion of fault-tolerance, i.e., the robustness to noise. In our experimental evaluation we show that our approaches yield a quality improvement of at least 20% for two real-world datasets. Furthermore, we examine task-to-worker assignment strategies as well as task parallelization techniques in terms of their cost and quality trade-offs in this paper. Based on both synthetic and crowdsourced datasets, we then draw conclusions on how to minimize cost while maintaining high quality ER results.

Submitted 1 December, 2015; originally announced December 2015.

arXiv:1508.01951

Crowd Access Path Optimization: Diversity Matters

Authors: Besmira Nushi, Adish Singla, Anja Gruenheid, Erfan Zamanian, Andreas Krause, Donald Kossmann

Abstract: Quality assurance is one of the most important challenges in crowdsourcing. Assigning tasks to several workers to increase quality through redundant answers can be expensive if asking homogeneous sources. This limitation has been overlooked by current crowdsourcing platforms resulting therefore in costly solutions. In order to achieve desirable cost-quality tradeoffs it is essential to apply efficient crowd access optimization techniques. Our work argues that optimization needs to be aware of diversity and correlation of information within groups of individuals so that crowdsourcing redundancy can be adequately planned beforehand. Based on this intuitive idea, we introduce the Access Path Model (APM), a novel crowd model that leverages the notion of access paths as an alternative way of retrieving information. APM aggregates answers ensuring high quality and meaningful confidence. Moreover, we devise a greedy optimization algorithm for this model that finds a provably good approximate plan to access the crowd. We evaluate our approach on three crowdsourced datasets that illustrate various aspects of the problem. Our results show that the Access Path Model combined with greedy optimization is cost-efficient and practical to overcome common difficulties in large-scale crowdsourcing like data sparsity and anonymity.

Submitted 11 August, 2015; v1 submitted 8 August, 2015; originally announced August 2015.

Comments: 10 pages, 3rd AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2015)

ACM Class: H.1.2; I.2.6; H.2.5

Showing 1–8 of 8 results for author: Gruenheid, A