-
The Design and Implementation of a High-Performance Log-Structured RAID System for ZNS SSDs
Authors:
Jinhong Li,
Qiuping Wang,
Shujie Han,
Patrick P. C. Lee
Abstract:
Zoned Namespace (ZNS) defines a new abstraction for host software to flexibly manage storage in flash-based SSDs as append-only zones. It also provides a Zone Append primitive to further boost the write performance of ZNS SSDs by exploiting intra-zone parallelism. However, making Zone Append effective for reliable and scalable storage, in the form of a RAID array of multiple ZNS SSDs, is non-trivial since Zone Append offloads address management to ZNS SSDs and requires hosts to dedicatedly manage RAID stripes across multiple drives. We propose ZapRAID, a high-performance log-structured RAID system for ZNS SSDs by carefully exploiting Zone Append to achieve high write parallelism and lightweight stripe management. ZapRAID adopts a group-based data layout with a coarse-grained ordering across multiple groups of stripes, such that it can use small-size metadata for stripe management on a per-group basis under Zone Append. It further adopts hybrid data management to simultaneously achieve intra-zone and inter-zone parallelism through a careful combination of both Zone Write and Zone Append primitives. We implement ZapRAID as a user-space block device, and evaluate ZapRAID using microbenchmarks, trace-driven experiments, and real-application experiments. Our evaluation results show that ZapRAID achieves high write throughput and maintains high performance in normal reads, degraded reads, crash recovery, and full-drive recovery.
Submitted 6 February, 2025; v1 submitted 27 February, 2024;
originally announced February 2024.
-
Two New Piggybacking Designs with Lower Repair Bandwidth
Authors:
Zhengyi Jiang,
Hanxu Hou,
Yunghsiang S. Han,
Patrick P. C. Lee,
Bo Bai,
Zhongyi Huang
Abstract:
Piggybacking codes are a special class of MDS array codes that can achieve small repair bandwidth with small sub-packetization by first creating some instances of an $(n,k)$ MDS code, such as a Reed-Solomon (RS) code, and then designing the piggyback function. In this paper, we propose a new piggybacking coding design which designs the piggyback function over some instances of both an $(n,k)$ MDS code and an $(n,k')$ MDS code, when $k\geq k'$. We show that our new piggybacking design can significantly reduce the repair bandwidth for single-node failures. When $k=k'$, we design a piggybacking code that is an MDS code, and we show that the designed code has lower repair bandwidth for single-node failures than all existing piggybacking codes when the number of parity nodes $r=n-k\geq8$ and the sub-packetization $α<r$.
Moreover, we propose another piggybacking code by designing $n$ piggyback functions of some instances of an $(n,k)$ MDS code and adding the $n$ piggyback functions into the $n$ newly created empty entries with no data symbols. We show that our code can significantly reduce repair bandwidth for single-node failures at a cost of slightly more storage overhead. In addition, we show that our code can recover any $r+1$ node failures for some parameters. We also show that our code has lower repair bandwidth than locally repairable codes (LRCs) under the same fault-tolerance and redundancy for some parameters.
Submitted 28 May, 2022;
originally announced May 2022.
-
Efficient LSM-Tree Key-Value Data Management on Hybrid SSD/HDD Zoned Storage
Authors:
Jinhong Li,
Qiuping Wang,
Patrick P. C. Lee
Abstract:
Zoned storage devices, such as zoned namespace (ZNS) solid-state drives (SSDs) and host-managed shingled magnetic recording (HM-SMR) hard-disk drives (HDDs), expose interfaces for host-level applications to support fine-grained, high-performance storage management. Combining ZNS SSDs and HM-SMR HDDs into a unified hybrid storage system is a natural direction to scale zoned storage at low cost, yet how to effectively incorporate zoned storage awareness into hybrid storage is a non-trivial issue. We make a case for key-value (KV) stores based on log-structured merge trees (LSM-trees) as host-level applications, and present HHZS, a middleware system that bridges an LSM-tree KV store with hybrid zoned storage devices based on hints. HHZS leverages hints issued by the flushing, compaction, and caching operations of the LSM-tree KV store to manage KV objects in placement, migration, and caching in hybrid ZNS SSD and HM-SMR HDD zoned storage. Experiments show that our HHZS prototype, when running on real ZNS SSD and HM-SMR HDD devices, achieves the highest throughput compared with all baselines under various settings.
Submitted 23 May, 2022;
originally announced May 2022.
-
An In-Depth Comparative Analysis of Cloud Block Storage Workloads: Findings and Implications
Authors:
Jinhong Li,
Qiuping Wang,
Patrick P. C. Lee,
Chao Shi
Abstract:
Cloud block storage systems support diverse types of applications in modern cloud services. Characterizing their I/O activities is critical for guiding better system designs and optimizations. In this paper, we present an in-depth comparative analysis of production cloud block storage workloads through the block-level I/O traces of billions of I/O requests collected from two production systems, Alibaba Cloud and Tencent Cloud Block Storage. We study their characteristics of load intensities, spatial patterns, and temporal patterns. We also compare the cloud block storage workloads with the notable public block-level I/O workloads from the enterprise data centers at Microsoft Research Cambridge, and identify the commonalities and differences of the three sources of traces. To this end, we provide 6 findings through the high-level analysis and 16 findings through the detailed analysis on load intensity, spatial patterns, and temporal patterns. We discuss the implications of our findings on load balancing, cache efficiency, and storage cluster management in cloud block storage systems.
Submitted 19 November, 2022; v1 submitted 21 March, 2022;
originally announced March 2022.
-
A Generalization of Array Codes with Local Properties and Efficient Encoding/Decoding
Authors:
Hanxu Hou,
Yunghsiang S. Han,
Patrick P. C. Lee,
You Wu,
Guojun Han,
Mario Blaum
Abstract:
A maximum distance separable (MDS) array code is composed of $m\times (k+r)$ arrays such that any $k$ out of $k+r$ columns suffice to retrieve all the information symbols. Expanded-Blaum-Roth (EBR) codes and Expanded-Independent-Parity (EIP) codes are two classes of MDS array codes that can repair any one symbol in a column by locally accessing some other symbols within the column, where the number of symbols $m$ in a column is a prime number. By generalizing the constructions of EBR and EIP codes, we propose new MDS array codes, such that any one symbol can be locally recovered and the number of symbols in a column can be not only a prime number but also a power of an odd prime number. Also, we present an efficient encoding/decoding method for the proposed generalized EBR (GEBR) and generalized EIP (GEIP) codes based on the LU factorization of a Vandermonde matrix. We show that the proposed decoding method has less computational complexity than existing methods. Furthermore, we show that the proposed GEBR codes have both a larger minimum symbol distance and a larger recovery ability of erased lines for some parameters when compared to EBR codes. We show that EBR codes can recover any $r$ erased lines of a slope for any parameter $r$, which was an open problem in [2].
Submitted 12 September, 2022; v1 submitted 10 October, 2021;
originally announced October 2021.
-
MVPipe: Enabling Lightweight Updates and Fast Convergence in Hierarchical Heavy Hitter Detection
Authors:
Lu Tang,
Qun Huang,
Patrick P. C. Lee
Abstract:
Finding hierarchical heavy hitters (HHHs) (i.e., hierarchical aggregates with exceptionally huge amounts of traffic) is critical to network management, yet it is often challenged by the requirements of fast packet processing, real-time and accurate detection, as well as resource efficiency. Existing HHH detection schemes either incur expensive packet updates for multiple aggregation levels in the IP address hierarchy, or need to process sufficient packets to converge to the required detection accuracy. We present MVPipe, an invertible sketch that achieves both lightweight updates and fast convergence in HHH detection. MVPipe builds on the skewness property of IP traffic to process packets via a pipeline of majority voting executions, such that most packets can be updated for only one or a few aggregation levels in the IP address hierarchy. We show how MVPipe can be feasibly deployed in P4-based programmable switches subject to limited switch resources. We also theoretically analyze the accuracy and coverage properties of MVPipe. Evaluation with real-world Internet traces shows that MVPipe achieves high accuracy, high throughput, and fast convergence compared to six state-of-the-art HHH detection schemes. It also incurs low resource overhead in the Tofino switch deployment.
Submitted 28 June, 2023; v1 submitted 13 July, 2021;
originally announced July 2021.
-
Separating Data via Block Invalidation Time Inference for Write Amplification Reduction in Log-Structured Storage
Authors:
Qiuping Wang,
Jinhong Li,
Patrick P. C. Lee,
Tao Ouyang,
Chao Shi,
Lilong Huang
Abstract:
Log-structured storage has been widely deployed in various domains of storage systems, yet its garbage collection incurs write amplification (WA) due to the rewrites of live data. We show that there exists an optimal data placement scheme that minimizes WA using the future knowledge of block invalidation time (BIT) of each written block, yet it is infeasible to realize in practice. We propose a novel data placement algorithm for reducing WA, SepBIT, that aims to infer the BITs of written blocks from storage workloads and separately place the blocks into groups with similar estimated BITs. We show via both mathematical and production trace analyses that SepBIT effectively infers the BITs by leveraging the write skewness property in practical storage workloads. Trace analysis and prototype experiments show that SepBIT reduces WA and improves I/O throughput, respectively, compared with state-of-the-art data placement schemes. SepBIT is currently deployed to support the log-structured block storage management at Alibaba Cloud.
Submitted 10 February, 2022; v1 submitted 26 April, 2021;
originally announced April 2021.
-
Robust Data Preprocessing for Machine-Learning-Based Disk Failure Prediction in Cloud Production Environments
Authors:
Shujie Han,
Jun Wu,
Erci Xu,
Cheng He,
Patrick P. C. Lee,
Yi Qiang,
Qixing Zheng,
Tao Huang,
Zixi Huang,
Rui Li
Abstract:
To provide proactive fault tolerance for modern cloud data centers, extensive studies have proposed machine learning (ML) approaches to predict imminent disk failures for early remedy and evaluated their approaches directly on public datasets (e.g., Backblaze SMART logs). However, in real-world production environments, the data quality is imperfect (e.g., inaccurate labeling, missing data samples, and complex failure types), thereby degrading the prediction accuracy. We present RODMAN, a robust data preprocessing pipeline that refines data samples before feeding them into ML models. We start with a large-scale trace-driven study of over three million disks from Alibaba Cloud's data centers, and motivate the practical challenges in ML-based disk failure prediction. We then design RODMAN with three data preprocessing techniques, namely failure-type filtering, spline-based data filling, and automated pre-failure backtracking, that are applicable for general ML models. Evaluation on both the Alibaba and Backblaze datasets shows that RODMAN improves the prediction accuracy compared with no data preprocessing under various settings.
Submitted 20 December, 2019;
originally announced December 2019.
-
A Fast and Compact Invertible Sketch for Network-Wide Heavy Flow Detection
Authors:
Lu Tang,
Qun Huang,
Patrick P. C. Lee
Abstract:
Fast detection of heavy flows (e.g., heavy hitters and heavy changers) in massive network traffic is challenging due to the stringent requirements of fast packet processing and limited resource availability. Invertible sketches are summary data structures that can recover heavy flows with small memory footprints and bounded errors, yet existing invertible sketches incur high memory access overhead that leads to performance degradation. We present MV-Sketch, a fast and compact invertible sketch that supports heavy flow detection with small and static memory allocation. MV-Sketch tracks candidate heavy flows inside the sketch data structure via the idea of majority voting, such that it incurs small memory access overhead in both update and query operations, while achieving high detection accuracy. We present theoretical analysis on the memory usage, performance, and accuracy of MV-Sketch in both local and network-wide scenarios. We further show how MV-Sketch can be implemented and deployed on P4-based programmable switches subject to hardware deployment constraints. We conduct evaluation in both software and hardware environments. Trace-driven evaluation in software shows that MV-Sketch achieves higher accuracy than existing invertible sketches, with up to 3.38x throughput gain. We also show how to boost the performance of MV-Sketch with SIMD instructions. Furthermore, we evaluate MV-Sketch on a Barefoot Tofino switch and show how MV-Sketch achieves line-rate measurement with limited hardware resource overhead.
Submitted 22 July, 2020; v1 submitted 23 October, 2019;
originally announced October 2019.
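The majority-voting idea mentioned in the MV-Sketch abstract above can be illustrated with a small Python sketch. This is only a toy written in the style of Boyer-Moore majority voting, not the authors' implementation: the class name `MajorityVoteSketch`, the three bucket fields, the salted BLAKE2b row hash, and the estimate formula are all assumptions made here for illustration.

```python
import hashlib


class MajorityVoteSketch:
    """Toy invertible sketch: every bucket stores [total, candidate_key, vote_counter]."""

    def __init__(self, rows=4, cols=64):
        self.rows = rows
        self.cols = cols
        self.buckets = [[[0, None, 0] for _ in range(cols)] for _ in range(rows)]

    def _index(self, row, key):
        # Hypothetical hash choice: one salted BLAKE2b digest per row.
        digest = hashlib.blake2b(key.encode(), digest_size=8, salt=bytes([row])).digest()
        return int.from_bytes(digest, "big") % self.cols

    def update(self, key, value=1):
        for row in range(self.rows):
            bucket = self.buckets[row][self._index(row, key)]
            bucket[0] += value              # total traffic hashed to this bucket
            if bucket[1] == key:
                bucket[2] += value          # vote for the current candidate
            else:
                bucket[2] -= value          # vote against the current candidate
                if bucket[2] < 0:           # candidate is outvoted: replace it
                    bucket[1] = key
                    bucket[2] = -bucket[2]

    def estimate(self, key):
        estimates = []
        for row in range(self.rows):
            total, candidate, votes = self.buckets[row][self._index(row, key)]
            if candidate == key:
                estimates.append((total + votes) // 2)
            else:
                estimates.append((total - votes) // 2)
        return min(estimates)

    def heavy_candidates(self, threshold):
        # The stored candidate keys make the sketch invertible: no key list is needed.
        found = set()
        for row in self.buckets:
            for _total, candidate, _votes in row:
                if candidate is not None and self.estimate(candidate) >= threshold:
                    found.add(candidate)
        return found


# A single column forces every flow to collide in the same bucket.
sketch = MajorityVoteSketch(rows=2, cols=1)
for flow, packets in [("a", 100), ("b", 5), ("c", 3)]:
    for _ in range(packets):
        sketch.update(flow)
print(sketch.heavy_candidates(threshold=50))  # prints {'a'}
```

Each update touches one fixed-size bucket per row, which is the kind of small, static memory access pattern the abstract describes; the heavy flow survives as the bucket's candidate even though all three flows collide.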
-
Repair Pipelining for Erasure-Coded Storage: Algorithms and Evaluation
Authors:
Xiaolu Li,
Zuoru Yang,
Jinhong Li,
Runhui Li,
Patrick P. C. Lee,
Qun Huang,
Yuchong Hu
Abstract:
We propose repair pipelining, a technique that speeds up the repair performance in general erasure-coded storage. By carefully scheduling the repair of failed data in small-size units across storage nodes in a pipelined manner, repair pipelining reduces the single-block repair time to approximately the same as the normal read time for a single block in homogeneous environments. We further design different extensions of repair pipelining algorithms for heterogeneous environments and multi-block repair operations. We implement a repair pipelining prototype, called ECPipe, and integrate it as a middleware system into two versions of Hadoop Distributed File System (HDFS) (namely HDFS-RAID and HDFS-3) as well as Quantcast File System (QFS). Experiments on a local testbed and Amazon EC2 show that repair pipelining significantly improves the performance of degraded reads and full-node recovery over existing repair techniques.
Submitted 20 November, 2020; v1 submitted 5 August, 2019;
originally announced August 2019.
-
Multi-Layer Transformed MDS Codes with Optimal Repair Access and Low Sub-Packetization
Authors:
Hanxu Hou,
Patrick P. C. Lee,
Yunghsiang S. Han
Abstract:
An $(n,k)$ maximum distance separable (MDS) code has optimal repair access if the minimum number of symbols accessed from $d$ surviving nodes is achieved, where $k+1\le d\le n-1$. Existing results show that the sub-packetization $α$ of an $(n,k,d)$ high code rate (i.e., $k/n>0.5$) MDS code with optimal repair access is at least $(d-k+1)^{\lceil\frac{n}{d-k+1}\rceil}$. In this paper, we propose a class of multi-layer transformed MDS codes such that the sub-packetization is $(d-k+1)^{\lceil\frac{n}{(d-k+1)η}\rceil}$, where $η=\lfloor\frac{n-k-1}{d-k}\rfloor$, and the repair access is optimal for any single node. We show that the sub-packetization of the proposed multi-layer transformed MDS codes is strictly less than the existing known lower bound when $η=\lfloor\frac{n-k-1}{d-k}\rfloor>1$, which is achieved by restricting the choice of $d$ specific helper nodes in repairing a failed node. We further propose multi-layer transformed EVENODD codes that have optimal repair access for any single node and lower sub-packetization than the existing binary MDS array codes with optimal repair access for any single node. With our multi-layer transformation, we can design new MDS codes that have the properties of low computational complexity, optimal repair access for any single node, and relatively small sub-packetization, all of which are critical for maintaining the reliability of distributed storage systems.
Submitted 22 July, 2019; v1 submitted 21 July, 2019;
originally announced July 2019.
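To make the two sub-packetization expressions in the abstract above concrete, the following Python sketch evaluates them for one parameter set. The function names and the example parameters $(n,k,d)=(14,10,11)$ are chosen here purely for illustration and do not come from the paper; the code only evaluates the formulas as stated in the abstract.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def check_parameters(n, k, d):
    # The abstract requires k + 1 <= d <= n - 1.
    if not (k + 1 <= d <= n - 1):
        raise ValueError("need k + 1 <= d <= n - 1")


def eta(n, k, d):
    """eta = floor((n - k - 1) / (d - k))."""
    check_parameters(n, k, d)
    return (n - k - 1) // (d - k)


def existing_lower_bound(n, k, d):
    """(d - k + 1) ** ceil(n / (d - k + 1))."""
    check_parameters(n, k, d)
    base = d - k + 1
    return base ** ceil_div(n, base)


def multilayer_subpacketization(n, k, d):
    """(d - k + 1) ** ceil(n / ((d - k + 1) * eta))."""
    check_parameters(n, k, d)
    base = d - k + 1
    return base ** ceil_div(n, base * eta(n, k, d))


# Illustrative parameters (an assumption, not taken from the paper).
n, k, d = 14, 10, 11
print(eta(n, k, d))                          # 3
print(existing_lower_bound(n, k, d))         # 2**7 = 128
print(multilayer_subpacketization(n, k, d))  # 2**3 = 8
```

For these parameters $η=3>1$, so the second expression is strictly smaller than the first, which matches the condition stated in the abstract; when $η=1$ (for example $d=n-1$) the two expressions coincide.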
-
Information Leakage in Encrypted Deduplication via Frequency Analysis: Attacks and Defenses
Authors:
Jingwei Li,
Patrick P. C. Lee,
Chufeng Tan,
Chuan Qin,
Xiaosong Zhang
Abstract:
Encrypted deduplication combines encryption and deduplication to simultaneously achieve both data security and storage efficiency. State-of-the-art encrypted deduplication systems mainly build on deterministic encryption to preserve deduplication effectiveness. However, such deterministic encryption reveals the underlying frequency distribution of the original plaintext chunks. This allows an adversary to launch frequency analysis against the ciphertext chunks and infer the content of the original plaintext chunks. In this paper, we study how frequency analysis affects information leakage in encrypted deduplication storage, from both attack and defense perspectives. Specifically, we target backup workloads, and propose a new inference attack that exploits chunk locality to increase the coverage of inferred chunks. We further combine the new inference attack with the knowledge of chunk sizes and show its attack effectiveness against variable-size chunks. We conduct trace-driven evaluation on both real-world and synthetic datasets and show that our proposed attacks infer a significant fraction of plaintext chunks under backup workloads. To defend against frequency analysis, we present two defense approaches, namely MinHash encryption and scrambling. Our trace-driven evaluation shows that our combined MinHash encryption and scrambling scheme effectively mitigates the severity of the inference attacks, while maintaining high storage efficiency and incurring limited metadata access overhead.
Submitted 9 October, 2019; v1 submitted 11 April, 2019;
originally announced April 2019.
-
Enabling Efficient Updates in KV Storage via Hashing: Design and Performance Evaluation
Authors:
Yongkun Li,
Helen H. W. Chan,
Patrick P. C. Lee,
Yinlong Xu
Abstract:
Persistent key-value (KV) stores mostly build on the Log-Structured Merge (LSM) tree for high write performance, yet the LSM-tree suffers from the inherently high I/O amplification. KV separation mitigates I/O amplification by storing only keys in the LSM-tree and values in separate storage. However, the current KV separation design remains inefficient under update-intensive workloads due to its high garbage collection (GC) overhead in value storage. We propose HashKV, which aims for high update performance atop KV separation under update-intensive workloads. HashKV uses hash-based data grouping, which deterministically maps values to storage space so as to make both updates and GC efficient. We further relax the restriction of such deterministic mappings via simple but useful design extensions. We extensively evaluate various design aspects of HashKV. We show that HashKV achieves 4.6x update throughput and 53.4% less write traffic compared to the current KV separation design. In addition, we demonstrate that we can integrate the design of HashKV with state-of-the-art KV stores and improve their respective performance.
Submitted 17 June, 2019; v1 submitted 25 November, 2018;
originally announced November 2018.
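The hash-based data grouping described in the HashKV abstract above can be pictured with a minimal in-memory Python sketch. It is an illustration under stated assumptions, not the HashKV design: the class `HashGroupedValueStore`, its method names, the use of `zlib.crc32` as the hash, and the list-based segments are all hypothetical, and the sketch omits the LSM-tree, persistence, and the design extensions the abstract mentions.

```python
import zlib


class HashGroupedValueStore:
    """Toy value store: a key's hash fixes the segment that holds all its versions."""

    def __init__(self, num_segments=4):
        # Each segment is an append-only log of (key, value) records.
        self.segments = [[] for _ in range(num_segments)]

    def segment_of(self, key):
        # Deterministic mapping from key to segment (crc32 is an assumed choice).
        return zlib.crc32(key.encode()) % len(self.segments)

    def put(self, key, value):
        # An update is a plain append to the key's segment.
        self.segments[self.segment_of(key)].append((key, value))

    def get(self, key):
        # The newest record for the key is the last one appended.
        for stored_key, value in reversed(self.segments[self.segment_of(key)]):
            if stored_key == key:
                return value
        return None

    def collect_garbage(self, segment_id):
        # All versions of a key live in one segment, so the scan of that
        # segment alone decides which records are still live.
        latest = {}
        for stored_key, value in self.segments[segment_id]:
            latest[stored_key] = value
        reclaimed = len(self.segments[segment_id]) - len(latest)
        self.segments[segment_id] = list(latest.items())
        return reclaimed


store = HashGroupedValueStore()
for version in ("v1", "v2", "v3"):
    store.put("k1", version)
store.put("k2", "w1")
reclaimed = store.collect_garbage(store.segment_of("k1"))
print(reclaimed, store.get("k1"))  # prints: 2 v3
```

Because the mapping is deterministic, garbage collection of one segment never needs to consult another segment to decide whether a record is stale, which is the property the sketch is meant to show.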
-
On the Performance and Convergence of Distributed Stream Processing via Approximate Fault Tolerance
Authors:
Zhinan Cheng,
Qun Huang,
Patrick P. C. Lee
Abstract:
Fault tolerance is critical for distributed stream processing systems, yet achieving error-free fault tolerance often incurs substantial performance overhead. We present AF-Stream, a distributed stream processing system that addresses the trade-off between performance and accuracy in fault tolerance. AF-Stream builds on a notion called approximate fault tolerance, whose idea is to mitigate backup overhead by adaptively issuing backups, while ensuring that the errors upon failures are bounded with theoretical guarantees. Specifically, AF-Stream allows users to specify bounds on both the state divergence and the loss of non-backup streaming items. It issues state and item backups only when the bounds are reached. Our AF-Stream design provides an extensible programming model for incorporating general streaming algorithms and exports only a few threshold parameters for configuring approximate fault tolerance. Furthermore, we formally prove that AF-Stream preserves high algorithm-specific accuracy of streaming algorithms, and in particular the convergence guarantees of online learning. Experiments show that AF-Stream maintains high performance (compared to no fault tolerance) and high accuracy after multiple failures (compared to no failures) under various streaming algorithms.
Submitted 12 August, 2019; v1 submitted 12 November, 2018;
originally announced November 2018.
-
Binary MDS Array Codes with Optimal Repair
Authors:
Hanxu Hou,
Patrick P. C. Lee
Abstract:
Consider a binary maximum distance separable (MDS) array code composed of an $m\times (k+r)$ array of bits with $k$ information columns and $r$ parity columns, such that any $k$ out of $k+r$ columns suffice to reconstruct the $k$ information columns. Our goal is to provide optimal repair access for binary MDS array codes, meaning that the bandwidth triggered to repair any single failed information or parity column is minimized. In this paper, we propose a generic transformation framework for binary MDS array codes, using EVENODD codes as a motivating example, to support optimal repair access for $k+1\le d \le k+r-1$, where $d$ denotes the number of non-failed columns that are connected for repair; note that when $d<k+r-1$, some of the chosen $d$ columns in repairing a failed column are specific. In addition, we show how our transformation framework applies to an example of binary MDS array codes with asymptotically optimal repair access of any single information column and enables asymptotically or exactly optimal repair access for any column. Furthermore, we present a new transformation for EVENODD codes with two parity columns such that the existing efficient repair property of any information column is preserved and the repair access of any parity column is optimal.
Submitted 28 August, 2019; v1 submitted 12 September, 2018;
originally announced September 2018.
-
A New Design of Binary MDS Array Codes with Asymptotically Weak-Optimal Repair
Authors:
Hanxu Hou,
Yunghsiang Han,
Patrick P. C. Lee,
Yuchong Hu,
Hui Li
Abstract:
Binary maximum distance separable (MDS) array codes are a special class of erasure codes for distributed storage that not only provide fault tolerance with minimum storage redundancy but also achieve low computational complexity. They are constructed by encoding $k$ information columns into $r$ parity columns, in which each element in a column is a bit, such that any $k$ out of the $k+r$ columns suffice to recover all information bits. In addition to providing fault tolerance, it is critical to improve repair performance in practical applications. Specifically, if a single column fails, our goal is to minimize the repair bandwidth by downloading the least amount of bits from $d$ healthy columns, where $k\leq d\leq k+r-1$. If one column of an MDS code has failed, it is known that we need to download at least a $1/(d-k+1)$ fraction of the data stored in each of the $d$ healthy columns. If this lower bound is achieved for the repair of the failed column by accessing an arbitrary set of $d$ healthy columns, we say that the MDS code has optimal repair. However, if this lower bound is only achieved by $d$ specific healthy columns, then we say the MDS code has weak-optimal repair. In this paper, we propose two explicit constructions of binary MDS array codes with more parity columns (i.e., $r\geq 3$) that achieve asymptotically weak-optimal repair, where $k+1\leq d\leq k+\lfloor(r-1)/2\rfloor$, and "asymptotic" means that the repair bandwidth achieves the minimum value asymptotically in $d$. Codes in the first construction have an odd number of parity columns and asymptotically weak-optimal repair for any one information failure, while codes in the second construction have an even number of parity columns and asymptotically weak-optimal repair for any one column failure.
Submitted 20 June, 2019; v1 submitted 21 February, 2018;
originally announced February 2018.
-
Rack-Aware Regenerating Codes for Data Centers
Authors:
Hanxu Hou,
Patrick P. C. Lee,
Kenneth W. Shum,
Yuchong Hu
Abstract:
Erasure coding is widely used for massive storage in data centers to achieve high fault tolerance and low storage redundancy. Since the cross-rack communication cost is often high, it is critical to design erasure codes that minimize the cross-rack repair bandwidth during failure repair. In this paper, we analyze the optimal trade-off between storage redundancy and cross-rack repair bandwidth specifically for data centers, subject to the condition that the original data can be reconstructed from a sufficient number of any non-failed nodes. We characterize the optimal trade-off curve under functional repair, and propose a general family of erasure codes called rack-aware regenerating codes (RRC), which achieve the optimal trade-off. We further propose exact repair constructions of RRC that have minimum storage redundancy and minimum cross-rack repair bandwidth, respectively. We show that (i) the minimum storage redundancy constructions support a wide range of parameters and have cross-rack repair bandwidth that is strictly less than that of the classical minimum storage regenerating codes in most cases, and (ii) the minimum cross-rack repair bandwidth constructions support all the parameters and have less cross-rack repair bandwidth than that of the minimum bandwidth regenerating codes for almost all of the parameters.
Submitted 25 February, 2019; v1 submitted 12 February, 2018;
originally announced February 2018.
-
Optimal Repair Layering for Erasure-Coded Data Centers: From Theory to Practice
Authors:
Yuchong Hu,
Xiaolu Li,
Mi Zhang,
Patrick P. C. Lee,
Xiaoyang Zhang,
Pan Zhou,
Dan Feng
Abstract:
Repair performance in hierarchical data centers is often bottlenecked by cross-rack network transfer. Recent theoretical results show that the cross-rack repair traffic can be minimized through repair layering, whose idea is to partition a repair operation into inner-rack and cross-rack layers. However, how repair layering should be implemented and deployed in practice remains an open issue. In this paper, we address this issue by proposing a practical repair layering framework called DoubleR. We design two families of practical double regenerating codes (DRC), which not only minimize the cross-rack repair traffic, but also have several practical properties that improve state-of-the-art regenerating codes. We implement and deploy DoubleR atop Hadoop Distributed File System (HDFS), and show that DoubleR maintains the theoretical guarantees of DRC and improves the repair performance of regenerating codes in both node recovery and degraded read operations.
Submitted 15 September, 2017; v1 submitted 12 April, 2017;
originally announced April 2017.
-
Erasure Coding for Small Objects in In-Memory KV Storage
Authors:
Matt M. T. Yiu,
Helen H. W. Chan,
Patrick P. C. Lee
Abstract:
We present MemEC, an erasure-coding-based in-memory key-value (KV) store that achieves high availability and fast recovery while keeping low data redundancy across storage servers. MemEC is specifically designed for workloads dominated by small objects. By encoding objects in entirety, MemEC is shown to incur 60% less storage redundancy for small objects than existing replication- and erasure-coding-based approaches. It also supports graceful transitions between decentralized requests in normal mode (i.e., no failures) and coordinated requests in degraded mode (i.e., with failures). We evaluate our MemEC prototype via testbed experiments under read-heavy and update-heavy YCSB workloads. We show that MemEC achieves high throughput and low latency in both normal and degraded modes, and supports fast transitions between the two modes.
Submitted 21 May, 2017; v1 submitted 27 January, 2017;
originally announced January 2017.
-
The Design and Implementation of a Rekeying-aware Encrypted Deduplication Storage System
Authors:
Chuan Qin,
Jingwei Li,
Patrick P. C. Lee
Abstract:
Rekeying refers to an operation of replacing an existing key with a new key for encryption. It renews security protection, so as to protect against key compromise and enable dynamic access control in cryptographic storage. However, it is non-trivial to realize efficient rekeying in encrypted deduplication storage systems, which use deterministic content-derived encryption keys to allow deduplication on ciphertexts. We design and implement REED, a rekeying-aware encrypted deduplication storage system. REED builds on a deterministic version of all-or-nothing transform (AONT), such that it enables secure and lightweight rekeying, while preserving the deduplication capability. We propose two REED encryption schemes that trade between performance and security, and extend REED for dynamic access control. We implement a REED prototype with various performance optimization techniques and demonstrate how we can exploit similarity to mitigate key generation overhead. Our trace-driven testbed evaluation shows that our REED prototype maintains high performance and storage efficiency.
Submitted 19 December, 2016; v1 submitted 28 July, 2016;
originally announced July 2016.
-
CDStore: Toward Reliable, Secure, and Cost-Efficient Cloud Storage via Convergent Dispersal
Authors:
Mingqiang Li,
Chuan Qin,
Patrick P. C. Lee
Abstract:
We present CDStore, which disperses users' backup data across multiple clouds and provides a unified multi-cloud storage solution with reliability, security, and cost-efficiency guarantees. CDStore builds on an augmented secret sharing scheme called convergent dispersal, which supports deduplication by using deterministic content-derived hashes as inputs to secret sharing. We present the design of CDStore, and in particular, describe how it combines convergent dispersal with two-stage deduplication to achieve both bandwidth and storage savings and be robust against side-channel attacks. We evaluate the performance of our CDStore prototype using real-world workloads on LAN and commercial cloud testbeds. Our cost analysis also demonstrates that CDStore achieves a monetary cost saving of 70% over a baseline cloud storage solution using state-of-the-art secret sharing.
Submitted 29 May, 2015; v1 submitted 17 February, 2015;
originally announced February 2015.
-
STAIR Codes: A General Family of Erasure Codes for Tolerating Device and Sector Failures
Authors:
Mingqiang Li,
Patrick P. C. Lee
Abstract:
Practical storage systems often adopt erasure codes to tolerate device failures and sector failures, both of which are prevalent in the field. However, traditional erasure codes employ device-level redundancy to protect against sector failures, and hence incur significant space overhead. Recent sector-disk (SD) codes are available only for limited configurations. By making a relaxed but practical assumption, we construct a general family of erasure codes called STAIR codes, which efficiently and provably tolerate both device and sector failures without any restriction on the size of a storage array and the numbers of tolerable device failures and sector failures. We propose the upstairs encoding and downstairs encoding methods, which provide complementary performance advantages for different configurations. We conduct extensive experiments on STAIR codes in terms of space saving, encoding/decoding speed, and update cost. We demonstrate that STAIR codes not only improve space efficiency over traditional erasure codes, but also provide better computational efficiency than SD codes based on our special code construction. Finally, we present analytical models that characterize the reliability of STAIR codes, and show that the support of a wider range of configurations by STAIR codes is critical for tolerating sector failure bursts discovered in the field.
Submitted 23 June, 2014; v1 submitted 20 June, 2014;
originally announced June 2014.
-
Efficient Hybrid Inline and Out-of-line Deduplication for Backup Storage
Authors:
Yan Kit Li,
Min Xu,
Chun Ho Ng,
Patrick P. C. Lee
Abstract:
Backup storage systems often remove redundancy across backups via inline deduplication, which works by referring duplicate chunks of the latest backup to those of existing backups. However, inline deduplication degrades restore performance of the latest backup due to fragmentation, and complicates deletion of expired backups due to the sharing of data chunks. While out-of-line deduplication addresses the problems by forward-pointing existing duplicate chunks to those of the latest backup, it introduces additional I/Os of writing and removing duplicate chunks. We design and implement RevDedup, an efficient hybrid inline and out-of-line deduplication system for backup storage. It applies coarse-grained inline deduplication to remove duplicates of the latest backup, and then fine-grained out-of-line reverse deduplication to remove duplicates from older backups. Our reverse deduplication design limits the I/O overhead and prepares for efficient deletion of expired backups. Through extensive testbed experiments using synthetic and real-world datasets, we show that RevDedup can bring high performance to the backup, restore, and deletion operations, while maintaining high storage efficiency comparable to conventional inline deduplication.
Submitted 22 May, 2014;
originally announced May 2014.
-
Stochastic Analysis on RAID Reliability for Solid-State Drives
Authors:
Yongkun Li,
Patrick P. C. Lee,
John C. S. Lui
Abstract:
Solid-state drives (SSDs) have been widely deployed in desktops and data centers. However, SSDs suffer from bit errors, and the bit error rate is time dependent since it increases as an SSD wears down. Traditional storage systems mainly use parity-based RAID to provide reliability guarantees by striping redundancy across multiple devices, but the effectiveness of RAID in SSDs remains debatable as parity updates aggravate the wearing and bit error rates of SSDs. In particular, an open problem is how different parity distributions over multiple devices, such as the even distribution suggested by conventional wisdom, or uneven distributions proposed in recent RAID schemes for SSDs, may influence the reliability of an SSD RAID array. To address this fundamental problem, we propose the first analytical model to quantify the reliability dynamics of an SSD RAID array. Specifically, we develop a "non-homogeneous" continuous time Markov chain model, and derive the transient reliability solution. We validate our model via trace-driven simulations and conduct numerical analysis to provide insights into the reliability dynamics of SSD RAID arrays under different parity distributions and subject to different bit error rates and array configurations. Designers can use our model to decide the appropriate parity distribution based on their reliability requirements.
Submitted 6 April, 2013;
originally announced April 2013.
-
Stochastic Modeling of Large-Scale Solid-State Storage Systems: Analysis, Design Tradeoffs and Optimization
Authors:
Yongkun Li,
Patrick P. C. Lee,
John C. S. Lui
Abstract:
Solid state drives (SSDs) have seen wide deployment in mobiles, desktops, and data centers due to their high I/O performance and low energy consumption. As SSDs write data out-of-place, garbage collection (GC) is required to erase and reclaim space with invalid data. However, GC poses additional writes that hinder the I/O performance, while SSD blocks can only endure a finite number of erasures. Thus, there is a performance-durability tradeoff on the design space of GC. To characterize the optimal tradeoff, this paper formulates an analytical model that explores the full optimal design space of any GC algorithm. We first present a stochastic Markov chain model that captures the I/O dynamics of large-scale SSDs, and adapt the mean-field approach to derive the asymptotic steady-state performance. We further prove the model convergence and generalize the model for all types of workload. Inspired by this model, we propose a randomized greedy algorithm (RGA) that can operate along the optimal tradeoff curve with a tunable parameter. Using trace-driven simulation on DiskSim with SSD add-ons, we demonstrate how RGA can be parameterized to realize the performance-durability tradeoff.
Submitted 20 March, 2013; v1 submitted 19 March, 2013;
originally announced March 2013.
-
CORE: Augmenting Regenerating-Coding-Based Recovery for Single and Concurrent Failures in Distributed Storage Systems
Authors:
Runhui Li,
Jian Lin,
Patrick P. C. Lee
Abstract:
Data availability is critical in distributed storage systems, especially when node failures are prevalent in real life. A key requirement is to minimize the amount of data transferred among nodes when recovering the lost or unavailable data of failed nodes. This paper explores recovery solutions based on regenerating codes, which are shown to provide fault-tolerant storage and minimum recovery bandwidth. Existing optimal regenerating codes are designed for single node failures. We build a system called CORE, which augments existing optimal regenerating codes to support a general number of failures including single and concurrent failures. We theoretically show that CORE achieves the minimum possible recovery bandwidth for most cases. We implement CORE and evaluate our prototype atop a Hadoop HDFS cluster testbed with up to 20 storage nodes. We demonstrate that our CORE prototype conforms to our theoretical findings and achieves recovery bandwidth saving when compared to the conventional recovery approach based on erasure codes.
Submitted 5 June, 2013; v1 submitted 14 February, 2013;
originally announced February 2013.
-
RevDedup: A Reverse Deduplication Storage System Optimized for Reads to Latest Backups
Authors:
Chun-Ho Ng,
Patrick P. C. Lee
Abstract:
Scaling up the backup storage for an ever-increasing volume of virtual machine (VM) images is a critical issue in virtualization environments. While deduplication is known to effectively eliminate duplicates for VM image storage, it also introduces fragmentation that will degrade read performance. We propose RevDedup, a deduplication system that optimizes reads to latest VM image backups using an idea called reverse deduplication. In contrast with conventional deduplication that removes duplicates from new data, RevDedup removes duplicates from old data, thereby shifting fragmentation to old data while keeping the layout of new data as sequential as possible. We evaluate our RevDedup prototype using microbenchmark and real-world workloads. For a 12-week span of real-world VM images from 160 users, RevDedup achieves high deduplication efficiency with around 97% of saving, and high backup and read throughput on the order of 1GB/s. RevDedup also incurs small metadata overhead in backup/read operations.
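The reverse deduplication idea described above can be illustrated with a toy Python sketch. It is not RevDedup's implementation: the chunk representation, the `fingerprint` helper, and the `reverse_dedup` function are all hypothetical names introduced here, and the sketch ignores on-disk layout, segment granularity, and metadata management. It only shows the direction of the references: the new backup stays whole, and duplicate chunks in the old backup become pointers into the new one.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Content fingerprint used to detect duplicate chunks (SHA-256 here)."""
    return hashlib.sha256(chunk).hexdigest()


def reverse_dedup(old_backup: list, new_backup: list) -> list:
    """Rewrite the OLD backup so duplicates point into the NEW backup.

    The new backup is left untouched (fully sequential). Each chunk of
    the old backup becomes either ("ref", position_in_new_backup) if the
    same content exists in the new backup, or ("data", chunk) otherwise.
    """
    # Index the new backup by fingerprint, keeping the first position.
    new_index = {}
    for pos, chunk in enumerate(new_backup):
        new_index.setdefault(fingerprint(chunk), pos)

    rewritten = []
    for chunk in old_backup:
        pos = new_index.get(fingerprint(chunk))
        if pos is None:
            rewritten.append(("data", chunk))  # unique to the old backup
        else:
            rewritten.append(("ref", pos))     # duplicate removed from old data
    return rewritten


old = [b"A", b"B", b"C"]
new = [b"A", b"X", b"C"]
print(reverse_dedup(old, new))
```

In this toy example only chunk `b"B"` remains stored with the old backup, so any fragmentation falls on reads of the old backup rather than the latest one.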
Submitted 27 June, 2013; v1 submitted 4 February, 2013;
originally announced February 2013.
-
Analysis and Construction of Functional Regenerating Codes with Uncoded Repair for Distributed Storage Systems
Authors:
Yuchong Hu,
Patrick P. C. Lee,
Kenneth W. Shum
Abstract:
Modern distributed storage systems apply redundancy coding techniques to stored data. One form of redundancy is based on regenerating codes, which can minimize the repair bandwidth, i.e., the amount of data transferred when repairing a failed storage node. Existing regenerating codes mainly require surviving storage nodes to encode data during repair. In this paper, we study functional minimum storage regenerating (FMSR) codes, which enable uncoded repair without the encoding requirement in surviving nodes, while preserving the minimum repair bandwidth guarantees and also minimizing disk reads. Under double-fault tolerance settings, we formally prove the existence of FMSR codes, and provide a deterministic FMSR code construction that can significantly speed up the repair process. We further implement and evaluate our deterministic FMSR codes to show the benefits. Our work is built atop a practical cloud storage system that implements FMSR codes, and we provide theoretical validation to justify the practicality of FMSR codes.
Submitted 21 January, 2013; v1 submitted 14 August, 2012;
originally announced August 2012.