Search | arXiv e-print repository

From Good to Great: Improving Memory Tiering Performance Through Parameter Tuning

Authors: Konstantinos Kanellis, Sujay Yadalam, Fanchao Chen, Michael Swift, Shivaram Venkataraman

Abstract: Memory tiering systems achieve memory scaling by adding multiple tiers of memory wherein different tiers have different access latencies and bandwidth. For maximum performance, frequently accessed (hot) data must be placed close to the host in faster tiers and infrequently accessed (cold) data can be placed in farther slower memory tiers. Existing tiering solutions employ heuristics and pre-config… ▽ More Memory tiering systems achieve memory scaling by adding multiple tiers of memory wherein different tiers have different access latencies and bandwidth. For maximum performance, frequently accessed (hot) data must be placed close to the host in faster tiers and infrequently accessed (cold) data can be placed in farther slower memory tiers. Existing tiering solutions employ heuristics and pre-configured thresholds to make data placement and migration decisions. Unfortunately, these systems fail to adapt to different workloads and the underlying hardware, so perform sub-optimally. In this paper, we improve performance of memory tiering by using application behavior knowledge to set various parameters (knobs) in existing tiering systems. To do so, we leverage Bayesian Optimization to discover the good performing configurations that capture the application behavior and the underlying hardware characteristics. We find that Bayesian Optimization is able to learn workload behaviors and set the parameter values that result in good performance. We evaluate this approach with existing tiering systems, HeMem and HMSDK. Our evaluation reveals that configuring the parameter values correctly can improve performance by 2x over the same systems with default configurations and 1.56x over state-of-the-art tiering system. △ Less

Submitted 25 April, 2025; originally announced April 2025.

arXiv:2503.01801 [pdf, other]

doi 10.1145/3689031.3717480

TUNA: Tuning Unstable and Noisy Cloud Applications

Authors: Johannes Freischuetz, Konstantinos Kanellis, Brian Kroth, Shivaram Venkataraman

Abstract: Autotuning plays a pivotal role in optimizing the performance of systems, particularly in large-scale cloud deployments. One of the main challenges in performing autotuning in the cloud arises from performance variability. We first investigate the extent to which noise slows autotuning and find that as little as $5\%$ noise can lead to a $2.5$x slowdown in converging to the best-performing configu… ▽ More Autotuning plays a pivotal role in optimizing the performance of systems, particularly in large-scale cloud deployments. One of the main challenges in performing autotuning in the cloud arises from performance variability. We first investigate the extent to which noise slows autotuning and find that as little as $5\%$ noise can lead to a $2.5$x slowdown in converging to the best-performing configuration. We measure the magnitude of noise in cloud computing settings and find that while some components (CPU, disk) have almost no performance variability, there are still sources of significant variability (caches, memory). Furthermore, variability leads to autotuning finding unstable configurations. As many as $63.3\%$ of the configurations selected as "best" during tuning can have their performance degrade by $30\%$ or more when deployed. Using this as motivation, we propose a novel approach to improve the efficiency of autotuning systems by (a) detecting and removing outlier configurations and (b) using ML-based approaches to provide a more stable true signal of de-noised experiment results to the optimizer. The resulting system, TUNA (Tuning Unstable and Noisy Cloud Applications) enables faster convergence and robust configurations. Tuning postgres running mssales, an enterprise production workload, we find that TUNA can lead to $1.88$x lower running time on average with $2.58x$ lower standard deviation compared to traditional sampling methodologies. △ Less

Submitted 5 March, 2025; v1 submitted 3 March, 2025; originally announced March 2025.

Comments: 14 pages, 20 figures, EuroSys'25

arXiv:2305.01516 [pdf, other]

F2: Designing a Key-Value Store for Large Skewed Workloads

Authors: Konstantinos Kanellis, Badrish Chandramouli, Shivaram Venkataraman

Abstract: Many real-world workloads present a challenging set of requirements: point operations requiring high throughput, working sets much larger than main memory, and natural skew in key access patterns for both reads and writes. We find that modern key-value designs are either optimized for memory-efficiency, sacrificing high-performance (LSM-tree designs), or achieve high-performance, saturating modern… ▽ More Many real-world workloads present a challenging set of requirements: point operations requiring high throughput, working sets much larger than main memory, and natural skew in key access patterns for both reads and writes. We find that modern key-value designs are either optimized for memory-efficiency, sacrificing high-performance (LSM-tree designs), or achieve high-performance, saturating modern NVMe SSD bandwidth, at the cost of substantial memory resources or high disk wear (CPU-optimized designs). Unfortunately these designs are not able to handle meet the challenging demands of such larger-than-memory, skewed workloads. To this end, we present F2, a new key-value store that bridges this gap by combining the strengths of both approaches. F2 adopts a tiered, record-oriented architecture inspired by LSM-trees to effectively separate hot from cold records, while incorporating concurrent latch-free mechanisms from CPU-optimized engines to maximize performance on modern NVMe SSDs. To realize this design, we tackle key challenges and introduce several innovations, including new latch-free algorithms for multi-threaded log compaction and user operations (e.g., RMWs), as well as new components: a two-level hash index to reduce indexing overhead for cold records and a read-cache for serving read-hot data. Detailed experimental results show that F2 matches or outperforms existing solutions, achieving on average better throughput on memory-constrained environments compared to state-of-the-art systems like RocksDB (11.75x), SplinterDB (4.52x), KVell (10.56x), LeanStore (2.04x), and FASTER (2.38x). F2 also maintains its high performance across varying workload skewness levels and memory budgets, while achieving low disk write amplification. △ Less

Submitted 4 December, 2024; v1 submitted 2 May, 2023; originally announced May 2023.

arXiv:2203.05128 [pdf, other]

LlamaTune: Sample-Efficient DBMS Configuration Tuning

Authors: Konstantinos Kanellis, Cong Ding, Brian Kroth, Andreas Müller, Carlo Curino, Shivaram Venkataraman

Abstract: Tuning a database system to achieve optimal performance on a given workload is a long-standing problem in the database community. A number of recent works have leveraged ML-based approaches to guide the sampling of large parameter spaces (hundreds of tuning knobs) in search for high performance configurations. Looking at Microsoft production services operating millions of databases, sample efficie… ▽ More Tuning a database system to achieve optimal performance on a given workload is a long-standing problem in the database community. A number of recent works have leveraged ML-based approaches to guide the sampling of large parameter spaces (hundreds of tuning knobs) in search for high performance configurations. Looking at Microsoft production services operating millions of databases, sample efficiency emerged as a crucial requirement to use tuners on diverse workloads. This motivates our investigation in LlamaTune, a tuner design that leverages domain knowledge to improve the sample efficiency of existing optimizers. LlamaTune employs an automated dimensionality reduction technique based on randomized projections, a biased-sampling approach to handle special values for certain knobs, and knob values bucketization, to reduce the size of the search space. LlamaTune compares favorably with the state-of-the-art optimizers across a diverse set of workloads. It identifies the best performing configurations with up to $11\times$ fewer workload runs, and reaching up to $21\%$ higher throughput. We also show that benefits from LlamaTune generalize across both BO-based and RL-based optimizers, as well as different DBMS versions. While the journey to perform database tuning at cloud-scale remains long, LlamaTune goes a long way in making automatic DBMS tuning practical at scale. △ Less

Submitted 23 August, 2022; v1 submitted 9 March, 2022; originally announced March 2022.

Comments: Proceedings of the VLDB Endowment 15 (VLDB'22)

Showing 1–4 of 4 results for author: Kanellis, K