Showing 1–12 of 12 results for author: Camacho-Rodríguez, J

Searching in archive cs.
  1. AutoComp: Automated Data Compaction for Log-Structured Tables in Data Lakes

    Authors: Anja Gruenheid, Jesús Camacho-Rodríguez, Carlo Curino, Raghu Ramakrishnan, Stanislav Pak, Sumedh Sakdeo, Lenisha Gandhi, Sandeep K. Singhal, Pooja Nilangekar, Daniel J. Abadi

    Abstract: The proliferation of small files in data lakes poses significant challenges, including degraded query performance, increased storage costs, and scalability bottlenecks in distributed storage systems. Log-structured table formats (LSTs) such as Delta Lake, Apache Iceberg, and Apache Hudi exacerbate this issue due to their append-only write patterns and metadata-intensive operations. While compactio…

    Submitted 5 April, 2025; originally announced April 2025.

    Journal ref: ACM SIGMOD 2025

  2. arXiv:2411.13704

    cs.DB

    Towards Query Optimizer as a Service (QOaaS) in a Unified LakeHouse Ecosystem: Can One QO Rule Them All?

    Authors: Rana Alotaibi, Yuanyuan Tian, Stefan Grafberger, Jesús Camacho-Rodríguez, Nicolas Bruno, Brian Kroth, Sergiy Matusevych, Ashvin Agrawal, Mahesh Behera, Ashit Gosalia, Cesar Galindo-Legaria, Milind Joshi, Milan Potocnik, Beysim Sezgin, Xiaoyu Li, Carlo Curino

    Abstract: Customer demand, regulatory pressure, and engineering efficiency are the driving forces behind the industry-wide trend of moving from siloed engines and services that are optimized in isolation to highly integrated solutions. This is confirmed by the wide adoption of open formats, shared component libraries, and the meteoric success of integrated data lake experiences such as Microsoft Fabric. I…

    Submitted 20 November, 2024; originally announced November 2024.

  3. Towards Building Autonomous Data Services on Azure

    Authors: Yiwen Zhu, Yuanyuan Tian, Joyce Cahoon, Subru Krishnan, Ankita Agarwal, Rana Alotaibi, Jesús Camacho-Rodríguez, Bibin Chundatt, Andrew Chung, Niharika Dutta, Andrew Fogarty, Anja Gruenheid, Brandon Haynes, Matteo Interlandi, Minu Iyer, Nick Jurgens, Sumeet Khushalani, Brian Kroth, Manoj Kumar, Jyoti Leeka, Sergiy Matusevych, Minni Mittal, Andreas Mueller, Kartheek Muthyala, Harsha Nagulapalli, et al. (13 additional authors not shown)

    Abstract: Modern cloud has turned data services into easily accessible commodities. With just a few clicks, users are now able to access a catalog of data processing systems for a wide range of tasks. However, the cloud brings in both complexity and opportunity. While cloud users can quickly start an application by using various data services, it can be difficult to configure and optimize these services to…

    Submitted 2 May, 2024; originally announced May 2024.

    Comments: SIGMOD Companion of the 2023 International Conference on Management of Data. 2023

  4. arXiv:2401.09621

    cs.DB

    XTable in Action: Seamless Interoperability in Data Lakes

    Authors: Ashvin Agrawal, Tim Brown, Anoop Johnson, Jesús Camacho-Rodríguez, Kyle Weller, Carlo Curino, Raghu Ramakrishnan

    Abstract: Contemporary approaches to data management are increasingly relying on unified analytics and AI platforms to foster collaboration, interoperability, seamless access to reliable data, and high performance. Data Lakes featuring open standard table formats such as Delta Lake, Apache Hudi, and Apache Iceberg are central components of these data architectures. Choosing the right format for managing a t…

    Submitted 17 January, 2024; originally announced January 2024.

  5. arXiv:2401.03723

    cs.DB cs.LG

    Sibyl: Forecasting Time-Evolving Query Workloads

    Authors: Hanxian Huang, Tarique Siddiqui, Rana Alotaibi, Carlo Curino, Jyoti Leeka, Alekh Jindal, Jishen Zhao, Jesus Camacho-Rodriguez, Yuanyuan Tian

    Abstract: Database systems often rely on historical query traces to perform workload-based performance tuning. However, real production workloads are time-evolving, making historical queries ineffective for optimizing future workloads. To address this challenge, we propose SIBYL, an end-to-end machine learning-based framework that accurately forecasts a sequence of future queries, with the entire query stat…

    Submitted 8 January, 2024; originally announced January 2024.

    Comments: The paper has been accepted by SIGMOD 2024

  6. LST-Bench: Benchmarking Log-Structured Tables in the Cloud

    Authors: Jesús Camacho-Rodríguez, Ashvin Agrawal, Anja Gruenheid, Ashit Gosalia, Cristian Petculescu, Josep Aguilar-Saborit, Avrilia Floratou, Carlo Curino, Raghu Ramakrishnan

    Abstract: Data processing engines increasingly leverage distributed file systems for scalable, cost-effective storage. While the Apache Parquet columnar format has become a popular choice for data storage and retrieval, the immutability of Parquet files renders it impractical to meet the demands of frequent updates in contemporary analytical workloads. Log-Structured Tables (LSTs), such as Delta Lake, Apach…

    Submitted 19 January, 2024; v1 submitted 1 May, 2023; originally announced May 2023.

    Journal ref: Proceedings of the ACM on Management of Data (2024) Volume 2 Issue 1

  7. arXiv:2211.02753

    cs.DB cs.LG

    The Tensor Data Platform: Towards an AI-centric Database System

    Authors: Apurva Gandhi, Yuki Asada, Victor Fu, Advitya Gemawat, Lihao Zhang, Rathijit Sen, Carlo Curino, Jesús Camacho-Rodríguez, Matteo Interlandi

    Abstract: Database engines have historically absorbed many of the innovations in data processing, adding features to process graph data, XML, object oriented, and text among many others. In this paper, we make the case that it is time to do the same for AI -- but with a twist! While existing approaches have tried to achieve this by integrating databases with external ML tools, in this paper we claim that ac…

    Submitted 4 November, 2022; originally announced November 2022.

    Comments: Accepted for publication at The Conference on Innovative Data Systems Research (CIDR) 2023

  8. arXiv:2210.14047

    cs.DB

    OneProvenance: Efficient Extraction of Dynamic Coarse-Grained Provenance from Database Logs [Technical Report]

    Authors: Fotis Psallidas, Ashvin Agrawal, Chandru Sugunan, Khaled Ibrahim, Konstantinos Karanasos, Jesús Camacho-Rodríguez, Avrilia Floratou, Carlo Curino, Raghu Ramakrishnan

    Abstract: Provenance encodes information that connects datasets, their generation workflows, and associated metadata (e.g., who or when executed a query). As such, it is instrumental for a wide range of critical governance applications (e.g., observability and auditing). Unfortunately, in the context of database systems, extracting coarse-grained provenance is a long-standing problem due to the complexity a…

    Submitted 3 March, 2023; v1 submitted 25 October, 2022; originally announced October 2022.

    ACM Class: H.2

  9. arXiv:2203.01877

    cs.DB cs.AI cs.LG

    Query Processing on Tensor Computation Runtimes

    Authors: Dong He, Supun Nakandala, Dalitso Banda, Rathijit Sen, Karla Saur, Kwanghyun Park, Carlo Curino, Jesús Camacho-Rodríguez, Konstantinos Karanasos, Matteo Interlandi

    Abstract: The huge demand for computation in artificial intelligence (AI) is driving unparalleled investments in hardware and software systems for AI. This leads to an explosion in the number of specialized hardware devices, which are now offered by major cloud vendors. By hiding the low-level complexity through a tensor-based interface, tensor computation runtimes (TCRs) such as PyTorch allow data scientis…

    Submitted 9 February, 2023; v1 submitted 3 March, 2022; originally announced March 2022.

    Journal ref: Proceedings of the VLDB Endowment, 15(11): 2811 - 2825, 2022

  10. arXiv:2102.01148

    cs.SI cs.IR

    A comparative study of Bot Detection techniques methods with an application related to Covid-19 discourse on Twitter

    Authors: Marzia Antenore, Jose M. Camacho-Rodriguez, Emanuele Panizzi

    Abstract: Bot Detection is an essential asset in a period where Online Social Networks(OSN) is a part of our lives. This task becomes more relevant in crises, as the Covid-19 pandemic, where there is an incipient risk of proliferation of social bots, producing a possible source of misinformation. In order to address this issue, it has been compared different methods to detect automatically social bots on Tw…

    Submitted 1 February, 2021; originally announced February 2021.

    Comments: 36 pages, 10 figures, 5 tables

    ACM Class: J.4

  11. arXiv:1903.10970

    cs.DB

    Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing

    Authors: Jesús Camacho-Rodríguez, Ashutosh Chauhan, Alan Gates, Eugene Koifman, Owen O'Malley, Vineet Garg, Zoltan Haindrich, Sergey Shelukhin, Prasanth Jayachandran, Siddharth Seth, Deepak Jaiswal, Slim Bouguerra, Nishant Bangarwa, Sankar Hariappan, Anishek Agarwal, Jason Dere, Daniel Dai, Thejas Nair, Nita Dembla, Gopal Vijayaraghavan, Günther Hagleitner

    Abstract: Apache Hive is an open-source relational database system for analytic big-data workloads. In this paper we describe the key innovations on the journey from batch tool to fully fledged enterprise data warehousing system. We present a hybrid architecture that combines traditional MPP techniques with more recent big data and cloud concepts to achieve the scale and performance required by today's anal…

    Submitted 26 March, 2019; originally announced March 2019.

    Comments: SIGMOD'19, 14 pages

  12. arXiv:1008.0557

    cs.DB

    LiquidXML: Adaptive XML Content Redistribution

    Authors: Jesús Camacho-Rodríguez, Asterios Katsifodimos, Ioana Manolescu, Alexandra Roatis

    Abstract: We propose to demonstrate LiquidXML, a platform for managing large corpora of XML documents in large-scale P2P networks. All LiquidXML peers may publish XML documents to be shared with all the network peers. The challenge then is to efficiently (re-)distribute the published content in the network, possibly in overlapping, redundant fragments, to support efficient processing of queries at each peer…

    Submitted 4 August, 2010; v1 submitted 3 August, 2010; originally announced August 2010.

    Journal ref: ACM International Conference on Information and Knowledge Management, Toronto, Canada (2010)