-
PBench: Workload Synthesizer with Real Statistics for Cloud Analytics Benchmarking
Authors:
Yan Zhou,
Chunwei Liu,
Bhuvan Urgaonkar,
Zhengle Wang,
Magnus Mueller,
Chao Zhang,
Songyue Zhang,
Pascal Pfeil,
Dominik Horn,
Zhengchun Liu,
Davide Pagano,
Tim Kraska,
Samuel Madden,
Ju Fan
Abstract:
Cloud service providers commonly use standard benchmarks like TPC-H and TPC-DS to evaluate and optimize cloud data analytics systems. However, these benchmarks rely on fixed query patterns and fail to capture the real execution statistics of production cloud workloads. Although some cloud database vendors have recently released real workload traces, these traces alone do not qualify as benchmarks, as they typically lack essential components like the original SQL queries and their underlying databases. To overcome this limitation, this paper introduces a new problem of workload synthesis with real statistics, which aims to generate synthetic workloads that closely approximate real execution statistics, including key performance metrics and operator distributions, in real cloud workloads. To address this problem, we propose PBench, a novel workload synthesizer that constructs synthetic workloads by judiciously selecting and combining workload components (i.e., queries and databases) from existing benchmarks. This paper studies the key challenges in PBench. First, we address the challenge of balancing performance metrics and operator distributions by introducing a multi-objective optimization-based component selection method. Second, to capture the temporal dynamics of real workloads, we design a timestamp assignment method that progressively refines workload timestamps. Third, to handle the disparity between the original workload and the candidate workload, we propose a component augmentation approach that leverages large language models (LLMs) to generate additional workload components while maintaining statistical fidelity. We evaluate PBench on real cloud workload traces, demonstrating that it reduces approximation error by up to 6x compared to state-of-the-art methods.
Submitted 19 June, 2025;
originally announced June 2025.
-
Ticket Coverage: Putting Test Coverage into Context
Authors:
Jakob Rott,
Rainer Niedermayr,
Elmar Juergens,
Dennis Pagano
Abstract:
There is no metric that determines how well the implementation of a ticket has been tested. As a consequence, code changed within the context of a ticket might unintentionally remain untested and get into production. This is a major problem, because changed code is more fault-prone than unchanged code. In this paper, we introduce the metric ticket coverage, which puts test coverage into the context of tickets. For each ticket, it determines the ratio of changed methods covered by automated or manual tests. We conducted an empirical study on an industrial system consisting of 650k lines of Java code and show that ticket coverage brings transparency into the test state of tickets and reveals relevant test gaps.
Submitted 20 April, 2018;
originally announced April 2018.
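The ratio described in the abstract above (changed methods covered by tests, per ticket) can be illustrated with a minimal sketch. The function name `ticket_coverage`, the ticket IDs, and the method names are hypothetical and chosen only for illustration; how the paper maps changes to tickets or collects coverage data is not stated in the abstract.

```python
from typing import Dict, Set


def ticket_coverage(
    changed_methods: Dict[str, Set[str]],
    covered_methods: Set[str],
) -> Dict[str, float]:
    """Return, per ticket, the fraction of changed methods that were executed by any test.

    changed_methods: hypothetical mapping from ticket ID to the methods changed for it.
    covered_methods: methods executed by automated or manual tests.
    """
    result: Dict[str, float] = {}
    for ticket, methods in changed_methods.items():
        if not methods:
            continue  # ratio is undefined for a ticket without changed methods
        result[ticket] = len(methods & covered_methods) / len(methods)
    return result


# Illustrative data only (not taken from the paper).
changed = {
    "TICKET-1": {"Order.total", "Order.addItem", "Cart.clear"},
    "TICKET-2": {"User.login"},
}
covered = {"Order.total", "Cart.clear", "User.login"}
coverage = ticket_coverage(changed, covered)
```

In this example two of the three methods changed for TICKET-1 are covered, so `Order.addItem` would show up as a test gap for that ticket.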
-
BOAT: a cross-platform software for data analysis and numerical computing with arbitrary-precision
Authors:
Davide Pagano
Abstract:
BOAT is a free cross-platform software for statistical data analysis and numerical computing. Thanks to its multiple-precision floating point engine, it allows arbitrary-precision calculations, whose digits of precision are limited only by the amount of memory of the host machine. At the core of the software is a simple and efficient expression language, whose use is facilitated by assisted typing, an auto-complete engine, and built-in help for the syntax. This paper gives a quick overview of the software. Detailed information, together with its applications to some case studies, is available at the BOAT web page.
Submitted 10 November, 2015;
originally announced November 2015.
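As a rough illustration of what arbitrary-precision calculation means in practice, the sketch below uses Python's standard `decimal` module as a stand-in. It is not BOAT's expression language or engine, which the abstract does not show; the choice of 50 digits is an arbitrary assumption.

```python
from decimal import Decimal, getcontext

# Request 50 significant digits; this can be raised as far as memory allows in practice.
getcontext().prec = 50

# Square root of 2 computed to the requested precision.
root = Decimal(2).sqrt()

# Squaring the result recovers 2 up to the last few digits of the working precision.
residual = abs(root * root - Decimal(2))
```

A standard double-precision float carries only about 16 significant digits, so the same residual check could not be made at this scale with built-in floats.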
-
iPrivacy: a Distributed Approach to Privacy on the Cloud
Authors:
Ernesto Damiani,
Francesco Pagano,
Davide Pagano
Abstract:
The increasing adoption of Cloud storage poses a number of privacy issues. Users wish to preserve full control over their sensitive data and cannot accept that it be accessible to the remote storage provider. Previous research has addressed techniques to protect data stored on untrusted servers; however, we argue that the cloud architecture presents a number of open issues. To handle them, we present an approach where confidential data is stored in a highly distributed database, partly located on the cloud and partly on the clients. Data is shared in a secure manner using a simple grant-and-revoke permission scheme for shared data, and we have developed a test implementation using an in-memory RDBMS with row-level data encryption for fine-grained data access control.
Submitted 27 March, 2015;
originally announced March 2015.
-
Using In-Memory Encrypted Databases on the Cloud
Authors:
Francesco Pagano,
Davide Pagano
Abstract:
Storing data in the cloud poses a number of privacy issues. One way to handle them is to support data replication and distribution on the cloud via local, centrally synchronized storage. In this paper we propose to use an in-memory RDBMS with row-level data encryption for granting and revoking access rights to distributed data. This type of solution is rarely adopted in conventional RDBMSs because it requires several complex steps. In this paper we focus on the implementation and benchmarking of a test system, which shows that our simple yet effective solution overcomes most of the problems.
Submitted 16 September, 2011;
originally announced September 2011.