Search | arXiv e-print repository

Nearest Neighbor Representations of Neural Circuits

Authors: Kordag Mehmet Kilic, Jin Sima, Jehoshua Bruck

Abstract: Neural networks successfully capture the computational power of the human brain for many tasks. Similarly inspired by the brain architecture, Nearest Neighbor (NN) representations is a novel approach of computation. We establish a firmer correspondence between NN representations and neural networks. Although it was known how to represent a single neuron using NN representations, there were no resu… ▽ More Neural networks successfully capture the computational power of the human brain for many tasks. Similarly inspired by the brain architecture, Nearest Neighbor (NN) representations is a novel approach of computation. We establish a firmer correspondence between NN representations and neural networks. Although it was known how to represent a single neuron using NN representations, there were no results even for small depth neural networks. Specifically, for depth-2 threshold circuits, we provide explicit constructions for their NN representation with an explicit bound on the number of bits to represent it. Example functions include NN representations of convex polytopes (AND of threshold gates), IP2, OR of threshold gates, and linear or exact decision lists. △ Less

Submitted 9 May, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

Comments: This paper is accepted to ISIT 2024. 2nd version has revisions for better clarity, more citations, and more explanation in the proofs. No results are changed

arXiv:2402.08748 [pdf, ps, other]

Nearest Neighbor Representations of Neurons

Authors: Kordag Mehmet Kilic, Jin Sima, Jehoshua Bruck

Abstract: The Nearest Neighbor (NN) Representation is an emerging computational model that is inspired by the brain. We study the complexity of representing a neuron (threshold function) using the NN representations. It is known that two anchors (the points to which NN is computed) are sufficient for a NN representation of a threshold function, however, the resolution (the maximum number of bits required fo… ▽ More The Nearest Neighbor (NN) Representation is an emerging computational model that is inspired by the brain. We study the complexity of representing a neuron (threshold function) using the NN representations. It is known that two anchors (the points to which NN is computed) are sufficient for a NN representation of a threshold function, however, the resolution (the maximum number of bits required for the entries of an anchor) is $O(n\log{n})$. In this work, the trade-off between the number of anchors and the resolution of a NN representation of threshold functions is investigated. We prove that the well-known threshold functions EQUALITY, COMPARISON, and ODD-MAX-BIT, which require 2 or 3 anchors and resolution of $O(n)$, can be represented by polynomially large number of anchors in $n$ and $O(\log{n})$ resolution. We conjecture that for all threshold functions, there are NN representations with polynomially large size and logarithmic resolution in $n$. △ Less

Submitted 9 May, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

Comments: This paper is accepted to ISIT 2024. 2nd version had revisions for better clarity, fixing of typos. No results are changed

arXiv:2311.06840 [pdf, ps, other]

Omitted Labels Induce Nontransitive Paradoxes in Causality

Authors: Bijan Mazaheri, Siddharth Jain, Matthew Cook, Jehoshua Bruck

Abstract: We explore "omitted label contexts," in which training data is limited to a subset of the possible labels. This setting is standard among specialized human experts or specific, focused studies. By studying Simpson's paradox, we observe that ``correct'' adjustments sometimes require non-exchangeable treatment and control groups. A generalization of Simpson's paradox leads us to study networks of co… ▽ More We explore "omitted label contexts," in which training data is limited to a subset of the possible labels. This setting is standard among specialized human experts or specific, focused studies. By studying Simpson's paradox, we observe that ``correct'' adjustments sometimes require non-exchangeable treatment and control groups. A generalization of Simpson's paradox leads us to study networks of conclusions drawn from different contexts, within which a paradox of nontransitivity arises. We prove that the space of possible nontransitive structures in these networks exactly corresponds to structures that form from aggregating ranked-choice votes. △ Less

Submitted 30 April, 2025; v1 submitted 12 November, 2023; originally announced November 2023.

Comments: Accepted to appear in CLeaR 2025

arXiv:2310.01729 [pdf, other]

Error Correction for DNA Storage

Authors: Jin Sima, Netanel Raviv, Moshe Schwartz, Jehoshua Bruck

Abstract: DNA-based storage is an emerging storage technology that provides high information density and long duration. Due to the physical constraints in the reading and writing processes, error correction in DNA storage poses several interesting coding theoretic challenges, some of which are new. In this paper, we give a brief introduction to some of the coding challenges for DNA-based storage, including… ▽ More DNA-based storage is an emerging storage technology that provides high information density and long duration. Due to the physical constraints in the reading and writing processes, error correction in DNA storage poses several interesting coding theoretic challenges, some of which are new. In this paper, we give a brief introduction to some of the coding challenges for DNA-based storage, including deletion/insertion correcting codes, codes over sliced channels, and duplication correcting codes. △ Less

Submitted 2 October, 2023; originally announced October 2023.

arXiv:2308.07793 [pdf, ps, other]

Robust Indexing for the Sliced Channel: Almost Optimal Codes for Substitutions and Deletions

Authors: Jin Sima, Netanel Raviv, Jehoshua Bruck

Abstract: Encoding data as a set of unordered strings is receiving great attention as it captures one of the basic features of DNA storage systems. However, the challenge of constructing optimal redundancy codes for this channel remained elusive. In this paper, we address this problem and present an order-wise optimal construction of codes that are capable of correcting multiple substitution, deletion, and… ▽ More Encoding data as a set of unordered strings is receiving great attention as it captures one of the basic features of DNA storage systems. However, the challenge of constructing optimal redundancy codes for this channel remained elusive. In this paper, we address this problem and present an order-wise optimal construction of codes that are capable of correcting multiple substitution, deletion, and insertion errors for this channel model. The key ingredient in the code construction is a technique we call robust indexing: simultaneously assigning indices to unordered strings (hence, creating order) and also embedding information in these indices. The encoded indices are resilient to substitution, deletion, and insertion errors, and therefore, so is the entire code. △ Less

Submitted 15 August, 2023; originally announced August 2023.

arXiv:2305.05808 [pdf, ps, other]

On the Information Capacity of Nearest Neighbor Representations

Authors: Kordag Mehmet Kilic, Jin Sima, Jehoshua Bruck

Abstract: The $\textit{von Neumann Computer Architecture}$ has a distinction between computation and memory. In contrast, the brain has an integrated architecture where computation and memory are indistinguishable. Motivated by the architecture of the brain, we propose a model of $\textit{associative computation}$ where memory is defined by a set of vectors in $\mathbb{R}^n$ (that we call… ▽ More The $\textit{von Neumann Computer Architecture}$ has a distinction between computation and memory. In contrast, the brain has an integrated architecture where computation and memory are indistinguishable. Motivated by the architecture of the brain, we propose a model of $\textit{associative computation}$ where memory is defined by a set of vectors in $\mathbb{R}^n$ (that we call $\textit{anchors}$), computation is performed by convergence from an input vector to a nearest neighbor anchor, and the output is a label associated with an anchor. Specifically, in this paper, we study the representation of Boolean functions in the associative computation model, where the inputs are binary vectors and the corresponding outputs are the labels ($0$ or $1$) of the nearest neighbor anchors. The information capacity of a Boolean function in this model is associated with two quantities: $\textit{(i)}$ the number of anchors (called $\textit{Nearest Neighbor (NN) Complexity}$) and $\textit{(ii)}$ the maximal number of bits representing entries of anchors (called $\textit{Resolution}$). We study symmetric Boolean functions and present constructions that have optimal NN complexity and resolution. △ Less

Submitted 9 May, 2023; originally announced May 2023.

Comments: The conference version is submitted to and accepted by ISIT 2023

arXiv:2207.08372 [pdf, ps, other]

Correcting $k$ Deletions and Insertions in Racetrack Memory

Authors: Jin Sima, Jehoshua Bruck

Abstract: One of the main challenges in developing racetrack memory systems is the limited precision in controlling the track shifts, that in turn affects the reliability of reading and writing the data. A current proposal for combating deletions in racetrack memories is to use redundant heads per-track resulting in multiple copies (potentially erroneous) and recovering the data by solving a specialized ver… ▽ More One of the main challenges in developing racetrack memory systems is the limited precision in controlling the track shifts, that in turn affects the reliability of reading and writing the data. A current proposal for combating deletions in racetrack memories is to use redundant heads per-track resulting in multiple copies (potentially erroneous) and recovering the data by solving a specialized version of a sequence reconstruction problem. Using this approach, $k$-deletion correcting codes of length $n$, with $d \ge 2$ heads per-track, with redundancy $\log \log n + 4$ were constructed. However, the known approach requires that $k \le d$, namely, that the number of heads ($d$) is larger than or equal to the number of correctable deletions ($k$). Here we address the question: What is the best redundancy that can be achieved for a $k$-deletion code ($k$ is a constant) if the number of heads is fixed at $d$ (due to implementation constraints)? One of our key results is an answer to this question, namely, we construct codes that can correct $k$ deletions, for any $k$ beyond the known limit of $d$. The code has $4k \log \log n+o(\log \log n)$ redundancy for $k \le 2d-1$. In addition, when $k \ge 2d$, our codes have $2 \lfloor k/d\rfloor \log n+o(\log n)$ redundancy, that we prove it is order-wise optimal, specifically, we prove that the redundancy required for correcting $k$ deletions is at least $\lfloor k/d\rfloor \log n+o(\log n)$. The encoding/decoding complexity of our codes is $O(n\log^{2k}n)$. Finally, we ask a general question: What is the optimal redundancy for codes correcting a combination of at most $k$ deletions and insertions in a $d$-head racetrack memory? We prove that the redundancy sufficient to correct a combination of $k$ deletion and insertion errors is similar to the case of $k$ deletion errors. △ Less

Submitted 18 July, 2022; originally announced July 2022.

arXiv:2205.08032 [pdf, ps, other]

On Algebraic Constructions of Neural Networks with Small Weights

Authors: Kordag Mehmet Kilic, Jin Sima, Jehoshua Bruck

Abstract: Neural gates compute functions based on weighted sums of the input variables. The expressive power of neural gates (number of distinct functions it can compute) depends on the weight sizes and, in general, large weights (exponential in the number of inputs) are required. Studying the trade-offs among the weight sizes, circuit size and depth is a well-studied topic both in circuit complexity theory… ▽ More Neural gates compute functions based on weighted sums of the input variables. The expressive power of neural gates (number of distinct functions it can compute) depends on the weight sizes and, in general, large weights (exponential in the number of inputs) are required. Studying the trade-offs among the weight sizes, circuit size and depth is a well-studied topic both in circuit complexity theory and the practice of neural computation. We propose a new approach for studying these complexity trade-offs by considering a related algebraic framework. Specifically, given a single linear equation with arbitrary coefficients, we would like to express it using a system of linear equations with smaller (even constant) coefficients. The techniques we developed are based on Siegel's Lemma for the bounds, anti-concentration inequalities for the existential results and extensions of Sylvester-type Hadamard matrices for the constructions. We explicitly construct a constant weight, optimal size matrix to compute the EQUALITY function (checking if two integers expressed in binary are equal). Computing EQUALITY with a single linear equation requires exponentially large weights. In addition, we prove the existence of the best-known weight size (linear) matrices to compute the COMPARISON function (comparing between two integers expressed in binary). In the context of the circuit complexity theory, our results improve the upper bounds on the weight sizes for the best-known circuit sizes for EQUALITY and COMPARISON. △ Less

Submitted 16 May, 2022; originally announced May 2022.

arXiv:2107.07054 [pdf, other]

Expert Graphs: Synthesizing New Expertise via Collaboration

Authors: Bijan Mazaheri, Siddharth Jain, Jehoshua Bruck

Abstract: Consider multiple experts with overlapping expertise working on a classification problem under uncertain input. What constitutes a consistent set of opinions? How can we predict the opinions of experts on missing sub-domains? In this paper, we define a framework of to analyze this problem, termed "expert graphs." In an expert graph, vertices represent classes and edges represent binary opinions on… ▽ More Consider multiple experts with overlapping expertise working on a classification problem under uncertain input. What constitutes a consistent set of opinions? How can we predict the opinions of experts on missing sub-domains? In this paper, we define a framework of to analyze this problem, termed "expert graphs." In an expert graph, vertices represent classes and edges represent binary opinions on the topics of their vertices. We derive necessary conditions for expert graph validity and use them to create "synthetic experts" which describe opinions consistent with the observed opinions of other experts. We show this framework to be equivalent to the well-studied linear ordering polytope. We show our conditions are not sufficient for describing all expert graphs on cliques, but are sufficient for cycles. △ Less

Submitted 14 July, 2021; originally announced July 2021.

Comments: 13 pages, 11 figures

arXiv:2102.05372 [pdf, ps, other]

Trace Reconstruction with Bounded Edit Distance

Authors: Jin Sima, Jehoshua Bruck

Abstract: The trace reconstruction problem studies the number of noisy samples needed to recover an unknown string $\boldsymbol{x}\in\{0,1\}^n$ with high probability, where the samples are independently obtained by passing $\boldsymbol{x}$ through a random deletion channel with deletion probability $q$. The problem is receiving significant attention recently due to its applications in DNA sequencing and DNA… ▽ More The trace reconstruction problem studies the number of noisy samples needed to recover an unknown string $\boldsymbol{x}\in\{0,1\}^n$ with high probability, where the samples are independently obtained by passing $\boldsymbol{x}$ through a random deletion channel with deletion probability $q$. The problem is receiving significant attention recently due to its applications in DNA sequencing and DNA storage. Yet, there is still an exponential gap between upper and lower bounds for the trace reconstruction problem. In this paper we study the trace reconstruction problem when $\boldsymbol{x}$ is confined to an edit distance ball of radius $k$, which is essentially equivalent to distinguishing two strings with edit distance at most $k$. It is shown that $n^{O(k)}$ samples suffice to achieve this task with high probability. △ Less

Submitted 14 April, 2021; v1 submitted 10 February, 2021; originally announced February 2021.

arXiv:2010.12687 [pdf, other]

Robust Correction of Sampling Bias Using Cumulative Distribution Functions

Authors: Bijan Mazaheri, Siddharth Jain, Jehoshua Bruck

Abstract: Varying domains and biased datasets can lead to differences between the training and the target distributions, known as covariate shift. Current approaches for alleviating this often rely on estimating the ratio of training and target probability density functions. These techniques require parameter tuning and can be unstable across different datasets. We present a new method for handling covariat… ▽ More Varying domains and biased datasets can lead to differences between the training and the target distributions, known as covariate shift. Current approaches for alleviating this often rely on estimating the ratio of training and target probability density functions. These techniques require parameter tuning and can be unstable across different datasets. We present a new method for handling covariate shift using the empirical cumulative distribution function estimates of the target distribution by a rigorous generalization of a recent idea proposed by Vapnik and Izmailov. Further, we show experimentally that our method is more robust in its predictions, is not reliant on parameter tuning and shows similar classification performance compared to the current state-of-the-art techniques on synthetic and real datasets. △ Less

Submitted 23 October, 2020; originally announced October 2020.

Comments: Accepted in Neurips 2020

arXiv:2005.03248 [pdf, other]

Coding for Optimized Writing Rate in DNA Storage

Authors: Siddharth Jain, Farzad Farnoud, Moshe Schwartz, Jehoshua Bruck

Abstract: A method for encoding information in DNA sequences is described. The method is based on the precision-resolution framework, and is aimed to work in conjunction with a recently suggested terminator-free template independent DNA synthesis method. The suggested method optimizes the amount of information bits per synthesis time unit, namely, the writing rate. Additionally, the encoding scheme studied… ▽ More A method for encoding information in DNA sequences is described. The method is based on the precision-resolution framework, and is aimed to work in conjunction with a recently suggested terminator-free template independent DNA synthesis method. The suggested method optimizes the amount of information bits per synthesis time unit, namely, the writing rate. Additionally, the encoding scheme studied here takes into account the existence of multiple copies of the DNA sequence, which are independently distorted. Finally, quantizers for various run-length distributions are designed. △ Less

Submitted 13 May, 2020; v1 submitted 7 May, 2020; originally announced May 2020.

Comments: To appear in ISIT 2020

arXiv:2004.10700 [pdf, ps, other]

CodNN -- Robust Neural Networks From Coded Classification

Authors: Netanel Raviv, Siddharth Jain, Pulakesh Upadhyaya, Jehoshua Bruck, Anxiao Jiang

Abstract: Deep Neural Networks (DNNs) are a revolutionary force in the ongoing information revolution, and yet their intrinsic properties remain a mystery. In particular, it is widely known that DNNs are highly sensitive to noise, whether adversarial or random. This poses a fundamental challenge for hardware implementations of DNNs, and for their deployment in critical applications such as autonomous drivin… ▽ More Deep Neural Networks (DNNs) are a revolutionary force in the ongoing information revolution, and yet their intrinsic properties remain a mystery. In particular, it is widely known that DNNs are highly sensitive to noise, whether adversarial or random. This poses a fundamental challenge for hardware implementations of DNNs, and for their deployment in critical applications such as autonomous driving. In this paper we construct robust DNNs via error correcting codes. By our approach, either the data or internal layers of the DNN are coded with error correcting codes, and successful computation under noise is guaranteed. Since DNNs can be seen as a layered concatenation of classification tasks, our research begins with the core task of classifying noisy coded inputs, and progresses towards robust DNNs. We focus on binary data and linear codes. Our main result is that the prevalent parity code can guarantee robustness for a large family of DNNs, which includes the recently popularized binarized neural networks. Further, we show that the coded classification problem has a deep connection to Fourier analysis of Boolean functions. In contrast to existing solutions in the literature, our results do not rely on altering the training process of the DNN, and provide mathematically rigorous guarantees rather than experimental evidence. △ Less

Submitted 29 April, 2020; v1 submitted 22 April, 2020; originally announced April 2020.

Comments: To appear in ISIT '20

arXiv:2001.03464 [pdf, other]

What is the Value of Data? On Mathematical Methods for Data Quality Estimation

Authors: Netanel Raviv, Siddharth Jain, Jehoshua Bruck

Abstract: Data is one of the most important assets of the information age, and its societal impact is undisputed. Yet, rigorous methods of assessing the quality of data are lacking. In this paper, we propose a formal definition for the quality of a given dataset. We assess a dataset's quality by a quantity we call the expected diameter, which measures the expected disagreement between two randomly chosen hy… ▽ More Data is one of the most important assets of the information age, and its societal impact is undisputed. Yet, rigorous methods of assessing the quality of data are lacking. In this paper, we propose a formal definition for the quality of a given dataset. We assess a dataset's quality by a quantity we call the expected diameter, which measures the expected disagreement between two randomly chosen hypotheses that explain it, and has recently found applications in active learning. We focus on Boolean hyperplanes, and utilize a collection of Fourier analytic, algebraic, and probabilistic methods to come up with theoretical guarantees and practical solutions for the computation of the expected diameter. We also study the behaviour of the expected diameter on algebraically structured datasets, conduct experiments that validate this notion of quality, and demonstrate the feasibility of our techniques. △ Less

Submitted 12 May, 2020; v1 submitted 9 January, 2020; originally announced January 2020.

arXiv:1910.12247 [pdf, ps, other]

Optimal $k$-Deletion Correcting Codes

Authors: Jin Sima, Jehoshua Bruck

Abstract: Levenshtein introduced the problem of constructing $k$-deletion correcting codes in 1966, proved that the optimal redundancy of those codes is $O(k\log N)$, and proposed an optimal redundancy single-deletion correcting code (using the so-called VT construction). However, the problem of constructing optimal redundancy $k$-deletion correcting codes remained open. Our key contribution is a solution t… ▽ More Levenshtein introduced the problem of constructing $k$-deletion correcting codes in 1966, proved that the optimal redundancy of those codes is $O(k\log N)$, and proposed an optimal redundancy single-deletion correcting code (using the so-called VT construction). However, the problem of constructing optimal redundancy $k$-deletion correcting codes remained open. Our key contribution is a solution to this longstanding open problem. We present a $k$-deletion correcting code that has redundancy $8k\log n +o(\log n)$ and encoding/decoding algorithms of complexity $O(n^{2k+1})$ for constant $k$. △ Less

Submitted 27 October, 2019; originally announced October 2019.

arXiv:1901.03084 [pdf, other]

Iterative Programming of Noisy Memory Cells

Authors: Michal Horovitz, Eitan Yaakobi, Eyal En Gad, Jehoshua Bruck

Abstract: In this paper, we study a model, which was first presented by Bunte and Lapidoth, that mimics the programming operation of memory cells. Under this paradigm we assume that cells are programmed sequentially and individually. The programming process is modeled as transmission over a channel, while it is possible to read the cell state in order to determine its programming success, and in case of pro… ▽ More In this paper, we study a model, which was first presented by Bunte and Lapidoth, that mimics the programming operation of memory cells. Under this paradigm we assume that cells are programmed sequentially and individually. The programming process is modeled as transmission over a channel, while it is possible to read the cell state in order to determine its programming success, and in case of programming failure, to reprogram the cell again. Reprogramming a cell can reduce the bit error rate, however this comes with the price of increasing the overall programming time and thereby affecting the writing speed of the memory. An iterative programming scheme is an algorithm which specifies the number of attempts to program each cell. Given the programming channel and constraints on the average and maximum number of attempts to program a cell, we study programming schemes which maximize the number of bits that can be reliably stored in the memory. We extend the results by Bunte and Lapidoth and study this problem when the programming channel is either the BSC, BEC, or $Z$ channel. For the BSC and the BEC our analysis is also extended for the case where the error probabilities on consecutive writes are not necessarily the same. Lastly, we also study a related model which is motivated by the synthesis process of DNA molecules. △ Less

Submitted 10 January, 2019; originally announced January 2019.

Comments: 10 pages, 2 figures

arXiv:1812.02250 [pdf, other]

Evolution of $k$-mer Frequencies and Entropy in Duplication and Substitution Mutation Systems

Authors: Hao Lou, Farzad Farnoud, Moshe Schwartz, Jehoshua Bruck

Abstract: Genomic evolution can be viewed as string-editing processes driven by mutations. An understanding of the statistical properties resulting from these mutation processes is of value in a variety of tasks related to biological sequence data, e.g., estimation of model parameters and compression. At the same time, due to the complexity of these processes, designing tractable stochastic models and analy… ▽ More Genomic evolution can be viewed as string-editing processes driven by mutations. An understanding of the statistical properties resulting from these mutation processes is of value in a variety of tasks related to biological sequence data, e.g., estimation of model parameters and compression. At the same time, due to the complexity of these processes, designing tractable stochastic models and analyzing them are challenging. In this paper, we study two kinds of systems, each representing a set of mutations. In the first system, tandem duplications and substitution mutations are allowed and in the other, interspersed duplications. We provide stochastic models and, via stochastic approximation, study the evolution of substring frequencies for these two systems separately. Specifically, we show that $k$-mer frequencies converge almost surely and determine the limit set. Furthermore, we present a method for finding upper bounds on entropy for such systems. △ Less

Submitted 5 December, 2018; originally announced December 2018.

arXiv:1809.02716 [pdf, ps, other]

On Coding over Sliced Information

Authors: Jin Sima, Netanel Raviv, Jehoshua Bruck

Abstract: The interest in channel models in which the data is sent as an unordered set of binary strings has increased lately, due to emerging applications in DNA storage, among others. In this paper we analyze the minimal redundancy of binary codes for this channel under substitution errors, and provide several constructions, some of which are shown to be asymptotically optimal up to constants. The surpris… ▽ More The interest in channel models in which the data is sent as an unordered set of binary strings has increased lately, due to emerging applications in DNA storage, among others. In this paper we analyze the minimal redundancy of binary codes for this channel under substitution errors, and provide several constructions, some of which are shown to be asymptotically optimal up to constants. The surprising result in this paper is that while the information vector is sliced into a set of unordered strings, the amount of redundant bits that are required to correct errors is order-wise equivalent to the amount required in the classical error correcting paradigm. △ Less

Submitted 27 October, 2019; v1 submitted 7 September, 2018; originally announced September 2018.

arXiv:1808.06062 [pdf, ps, other]

The Capacity of Some Pólya String Models

Authors: Ohad Elishco, Farzad Farnoud, Moshe Schwartz, Jehoshua Bruck

Abstract: We study random string-duplication systems, which we call Pólya string models. These are motivated by DNA storage in living organisms, and certain random mutation processes that affect their genome. Unlike previous works that study the combinatorial capacity of string-duplication systems, or various string statistics, this work provides exact capacity or bounds on it, for several probabilistic mod… ▽ More We study random string-duplication systems, which we call Pólya string models. These are motivated by DNA storage in living organisms, and certain random mutation processes that affect their genome. Unlike previous works that study the combinatorial capacity of string-duplication systems, or various string statistics, this work provides exact capacity or bounds on it, for several probabilistic models. In particular, we study the capacity of noisy string-duplication systems, including the tandem-duplication, end-duplication, and interspersed-duplication systems. Interesting connections are drawn between some systems and the signature of random permutations, as well as to the beta distribution common in population genetics. △ Less

Submitted 18 August, 2018; originally announced August 2018.

arXiv:1806.09240 [pdf, ps, other]

Two Deletion Correcting Codes from Indicator Vectors

Authors: Jin Sima, Netanel Raviv, Jehoshua Bruck

Abstract: Construction of capacity achieving deletion correcting codes has been a baffling challenge for decades. A recent breakthrough by Brakensiek $et~al$., alongside novel applications in DNA storage, have reignited the interest in this longstanding open problem. In spite of recent advances, the amount of redundancy in existing codes is still orders of magnitude away from being optimal. In this paper, a… ▽ More Construction of capacity achieving deletion correcting codes has been a baffling challenge for decades. A recent breakthrough by Brakensiek $et~al$., alongside novel applications in DNA storage, have reignited the interest in this longstanding open problem. In spite of recent advances, the amount of redundancy in existing codes is still orders of magnitude away from being optimal. In this paper, a novel approach for constructing binary two-deletion correcting codes is proposed. By this approach, parity symbols are computed from indicator vectors (i.e., vectors that indicate the positions of certain patterns) of the encoded message, rather than from the message itself. Most interestingly, the parity symbols and the proof of correctness are a direct generalization of their counterparts in the Varshamov-Tenengolts construction. Our techniques require $7\log(n)+o(\log(n)$ redundant bits to encode an~$n$-bit message, which is near-optimal. △ Less

Submitted 24 June, 2018; originally announced June 2018.

arXiv:1706.00500 [pdf, other]

Generic Secure Repair for Distributed Storage

Authors: Wentao Huang, Jehoshua Bruck

Abstract: This paper studies the problem of repairing secret sharing schemes, i.e., schemes that encode a message into $n$ shares, assigned to $n$ nodes, so that any $n-r$ nodes can decode the message but any colluding $z$ nodes cannot infer any information about the message. In the event of node failures so that shares held by the failed nodes are lost, the system needs to be repaired by reconstructing and… ▽ More This paper studies the problem of repairing secret sharing schemes, i.e., schemes that encode a message into $n$ shares, assigned to $n$ nodes, so that any $n-r$ nodes can decode the message but any colluding $z$ nodes cannot infer any information about the message. In the event of node failures so that shares held by the failed nodes are lost, the system needs to be repaired by reconstructing and reassigning the lost shares to the failed (or replacement) nodes. This can be achieved trivially by a trustworthy third-party that receives the shares of the available nodes, recompute and reassign the lost shares. The interesting question, studied in the paper, is how to repair without a trustworthy third-party. The main issue that arises is repair security: how to maintain the requirement that any colluding $z$ nodes, including the failed nodes, cannot learn any information about the message, during and after the repair process? We solve this secure repair problem from the perspective of secure multi-party computation. Specifically, we design generic repair schemes that can securely repair any (scalar or vector) linear secret sharing schemes. We prove a lower bound on the repair bandwidth of secure repair schemes and show that the proposed secure repair schemes achieve the optimal repair bandwidth up to a small constant factor when $n$ dominates $z$, or when the secret sharing scheme being repaired has optimal rate. We adopt a formal information-theoretic approach in our analysis and bounds. A main idea in our schemes is to allow a more flexible repair model than the straightforward one-round repair model implicitly assumed by existing secure regenerating codes. Particularly, the proposed secure repair schemes are simple and efficient two-round protocols. △ Less

Submitted 1 June, 2017; originally announced June 2017.

arXiv:1611.05537 [pdf, other]

Duplication Distance to the Root for Binary Sequences

Authors: Noga Alon, Jehoshua Bruck, Farzad Farnoud, Siddharth Jain

Abstract: We study the tandem duplication distance between binary sequences and their roots. In other words, the quantity of interest is the number of tandem duplication operations of the form $\seq x = \seq a \seq b \seq c \to \seq y = \seq a \seq b \seq b \seq c$, where $\seq x$ and $\seq y$ are sequences and $\seq a$, $\seq b$, and $\seq c$ are their substrings, needed to generate a binary sequence of le… ▽ More We study the tandem duplication distance between binary sequences and their roots. In other words, the quantity of interest is the number of tandem duplication operations of the form $\seq x = \seq a \seq b \seq c \to \seq y = \seq a \seq b \seq b \seq c$, where $\seq x$ and $\seq y$ are sequences and $\seq a$, $\seq b$, and $\seq c$ are their substrings, needed to generate a binary sequence of length $n$ starting from a square-free sequence from the set $\{0,1,01,10,010,101\}$. This problem is a restricted case of finding the duplication/deduplication distance between two sequences, defined as the minimum number of duplication and deduplication operations required to transform one sequence to the other. We consider both exact and approximate tandem duplications. For exact duplication, denoting the maximum distance to the root of a sequence of length $n$ by $f(n)$, we prove that $f(n)=Θ(n)$. For the case of approximate duplication, where a $β$-fraction of symbols may be duplicated incorrectly, we show that the maximum distance has a sharp transition from linear in $n$ to logarithmic at $β=1/2$. We also study the duplication distance to the root for sequences with a given root and for special classes of sequences, namely, the de Bruijn sequences, the Thue-Morse sequence, and the Fibbonaci words. The problem is motivated by genomic tandem duplication mutations and the smallest number of tandem duplication events required to generate a given biological sequence. △ Less

Submitted 16 November, 2016; originally announced November 2016.

Comments: submitted to IEEE Transactions on Information Theory

arXiv:1606.00397 [pdf, ps, other]

doi 10.1109/ISIT.2016.7541455

Duplication-Correcting Codes for Data Storage in the DNA of Living Organisms

Authors: Siddharth Jain, Farzad Farnoud, Moshe Schwartz, Jehoshua Bruck

Abstract: The ability to store data in the DNA of a living organism has applications in a variety of areas including synthetic biology and watermarking of patented genetically-modified organisms. Data stored in this medium is subject to errors arising from various mutations, such as point mutations, indels, and tandem duplication, which need to be corrected to maintain data integrity. In this paper, we prov… ▽ More The ability to store data in the DNA of a living organism has applications in a variety of areas including synthetic biology and watermarking of patented genetically-modified organisms. Data stored in this medium is subject to errors arising from various mutations, such as point mutations, indels, and tandem duplication, which need to be corrected to maintain data integrity. In this paper, we provide error-correcting codes for errors caused by tandem duplications, which create a copy of a block of the sequence and insert it in a tandem manner, i.e., next to the original. In particular, we present two families of codes for correcting errors due to tandem-duplications of a fixed length, the first family can correct any number of errors while the second corrects a bounded number of errors. We also study codes for correcting tandem duplications of length up to a given constant $k$, where we are primarily focused on the cases of $k=2,3$. Finally, we provide a full classification of the sets of lengths allowed in tandem duplication that result in a unique root for all sequences. △ Less

Submitted 1 June, 2016; originally announced June 2016.

Comments: Submitted to IEEE Transactions on Information Theory

arXiv:1603.01213 [pdf, ps, other]

Optimal Rebuilding of Multiple Erasures in MDS Codes

Authors: Zhiying Wang, Itzhak Tamo, Jehoshua Bruck

Abstract: MDS array codes are widely used in storage systems due to their computationally efficient encoding and decoding procedures. An MDS code with $r$ redundancy nodes can correct any $r$ node erasures by accessing all the remaining information in the surviving nodes. However, in practice, $e$ erasures is a more likely failure event, for $1\le e<r$. Hence, a natural question is how much information do w… ▽ More MDS array codes are widely used in storage systems due to their computationally efficient encoding and decoding procedures. An MDS code with $r$ redundancy nodes can correct any $r$ node erasures by accessing all the remaining information in the surviving nodes. However, in practice, $e$ erasures is a more likely failure event, for $1\le e<r$. Hence, a natural question is how much information do we need to access in order to rebuild $e$ storage nodes? We define the rebuilding ratio as the fraction of remaining information accessed during the rebuilding of $e$ erasures. In our previous work we constructed MDS codes, called zigzag codes, that achieve the optimal rebuilding ratio of $1/r$ for the rebuilding of any systematic node when $e=1$, however, all the information needs to be accessed for the rebuilding of the parity node erasure. The (normalized) repair bandwidth is defined as the fraction of information transmitted from the remaining nodes during the rebuilding process. For codes that are not necessarily MDS, Dimakis et al. proposed the regenerating codes framework where any $r$ erasures can be corrected by accessing some of the remaining information, and any $e=1$ erasure can be rebuilt from some subsets of surviving nodes with optimal repair bandwidth. In this work, we study 3 questions on rebuilding of codes: (i) We show a fundamental trade-off between the storage size of the node and the repair bandwidth similar to the regenerating codes framework, and show that zigzag codes achieve the optimal rebuilding ratio of $e/r$ for MDS codes, for any $1\le e\le r$. (ii) We construct systematic codes that achieve optimal rebuilding ratio of $1/r$, for any systematic or parity node erasure. (iii) We present error correction algorithms for zigzag codes, and in particular demonstrate how these codes can be corrected beyond their minimum Hamming distances. △ Less

Submitted 3 March, 2016; originally announced March 2016.

Comments: There is an overlap of this work with our two previous submissions: Zigzag Codes: MDS Array Codes with Optimal Rebuilding; On Codes for Optimal Rebuilding Access. arXiv admin note: text overlap with arXiv:1112.0371

arXiv:1509.06029 [pdf, other]

doi 10.1109/ISIT.2015.7282795

Capacity and Expressiveness of Genomic Tandem Duplication

Authors: Siddharth Jain, Farzad Farnoud, Jehoshua Bruck

Abstract: The majority of the human genome consists of repeated sequences. An important type of repeated sequences common in the human genome are tandem repeats, where identical copies appear next to each other. For example, in the sequence $AGTC\underline{TGTG}C$, $TGTG$ is a tandem repeat, that may be generated from $AGTCTGC$ by a tandem duplication of length $2$. In this work, we investigate the possibil… ▽ More The majority of the human genome consists of repeated sequences. An important type of repeated sequences common in the human genome are tandem repeats, where identical copies appear next to each other. For example, in the sequence $AGTC\underline{TGTG}C$, $TGTG$ is a tandem repeat, that may be generated from $AGTCTGC$ by a tandem duplication of length $2$. In this work, we investigate the possibility of generating a large number of sequences from a \textit{seed}, i.e.\ a small initial string, by tandem duplications of bounded length. We study the capacity of such a system, a notion that quantifies the system's generating power. Our results include \textit{exact capacity} values for certain tandem duplication string systems. In addition, motivated by the role of DNA sequences in expressing proteins via RNA and the genetic code, we define the notion of the \textit{expressiveness} of a tandem duplication system as the capability of expressing arbitrary substrings. We then \textit{completely} characterize the expressiveness of tandem duplication systems for general alphabet sizes and duplication lengths. In particular, based on a celebrated result by Axel Thue from 1906, presenting a construction for ternary square-free sequences, we show that for alphabets of size 4 or larger, bounded tandem duplication systems, regardless of the seed and the bound on duplication length, are not fully expressive, i.e. they cannot generate all strings even as substrings of other strings. Note that the alphabet of size 4 is of particular interest as it pertains to the genomic alphabet. Building on this result, we also show that these systems do not have full capacity. In general, our results illustrate that duplication lengths play a more significant role than the seed in generating a large number of sequences for these systems. △ Less

Submitted 20 September, 2015; originally announced September 2015.

Comments: 19 pages, 3 figures, submitted to IEEE Transactions on Information Theory

arXiv:1505.07515 [pdf, other]

Communication Efficient Secret Sharing

Authors: Wentao Huang, Michael Langberg, Joerg Kliewer, Jehoshua Bruck

Abstract: A secret sharing scheme is a method to store information securely and reliably. Particularly, in a threshold secret sharing scheme, a secret is encoded into $n$ shares, such that any set of at least $t_1$ shares suffice to decode the secret, and any set of at most $t_2 < t_1$ shares reveal no information about the secret. Assuming that each party holds a share and a user wishes to decode the secre… ▽ More A secret sharing scheme is a method to store information securely and reliably. Particularly, in a threshold secret sharing scheme, a secret is encoded into $n$ shares, such that any set of at least $t_1$ shares suffice to decode the secret, and any set of at most $t_2 < t_1$ shares reveal no information about the secret. Assuming that each party holds a share and a user wishes to decode the secret by receiving information from a set of parties; the question we study is how to minimize the amount of communication between the user and the parties. We show that the necessary amount of communication, termed "decoding bandwidth", decreases as the number of parties that participate in decoding increases. We prove a tight lower bound on the decoding bandwidth, and construct secret sharing schemes achieving the bound. Particularly, we design a scheme that achieves the optimal decoding bandwidth when $d$ parties participate in decoding, universally for all $t_1 \le d \le n$. The scheme is based on Shamir's secret sharing scheme and preserves its simplicity and efficiency. In addition, we consider secure distributed storage where the proposed communication efficient secret sharing schemes further improve disk access complexity during decoding. △ Less

Submitted 1 April, 2016; v1 submitted 27 May, 2015; originally announced May 2015.

Comments: submitted to the IEEE Transactions on Information Theory. New references and a new construction added

arXiv:1502.00189 [pdf, ps, other]

Rewriting Flash Memories by Message Passing

Authors: Eyal En Gad, Wentao Huang, Yue Li, Jehoshua Bruck

Abstract: This paper constructs WOM codes that combine rewriting and error correction for mitigating the reliability and the endurance problems in flash memory. We consider a rewriting model that is of practical interest to flash applications where only the second write uses WOM codes. Our WOM code construction is based on binary erasure quantization with LDGM codes, where the rewriting uses message passing… ▽ More This paper constructs WOM codes that combine rewriting and error correction for mitigating the reliability and the endurance problems in flash memory. We consider a rewriting model that is of practical interest to flash applications where only the second write uses WOM codes. Our WOM code construction is based on binary erasure quantization with LDGM codes, where the rewriting uses message passing and has potential to share the efficient hardware implementations with LDPC codes in practice. We show that the coding scheme achieves the capacity of the rewriting model. Extensive simulations show that the rewriting performance of our scheme compares favorably with that of polar WOM code in the rate region where high rewriting success probability is desired. We further augment our coding schemes with error correction capability. By drawing a connection to the conjugate code pairs studied in the context of quantum error correction, we develop a general framework for constructing error-correction WOM codes. Under this framework, we give an explicit construction of WOM codes whose codewords are contained in BCH codes. △ Less

Submitted 31 January, 2015; originally announced February 2015.

Comments: Submitted to ISIT 2015

arXiv:1411.6328 [pdf, ps, other]

Explicit MDS Codes for Optimal Repair Bandwidth

Authors: Zhiying Wang, Itzhak Tamo, Jehoshua Bruck

Abstract: MDS codes are erasure-correcting codes that can correct the maximum number of erasures for a given number of redundancy or parity symbols. If an MDS code has $r$ parities and no more than $r$ erasures occur, then by transmitting all the remaining data in the code, the original information can be recovered. However, it was shown that in order to recover a single symbol erasure, only a fraction of… ▽ More MDS codes are erasure-correcting codes that can correct the maximum number of erasures for a given number of redundancy or parity symbols. If an MDS code has $r$ parities and no more than $r$ erasures occur, then by transmitting all the remaining data in the code, the original information can be recovered. However, it was shown that in order to recover a single symbol erasure, only a fraction of $1/r$ of the information needs to be transmitted. This fraction is called the repair bandwidth (fraction). Explicit code constructions were given in previous works. If we view each symbol in the code as a vector or a column over some field, then the code forms a 2D array and such codes are especially widely used in storage systems. In this paper, we address the following question: given the length of the column $l$, number of parities $r$, can we construct high-rate MDS array codes with optimal repair bandwidth of $1/r$, whose code length is as long as possible? In this paper, we give code constructions such that the code length is $(r+1)\log_r l$. △ Less

Submitted 23 November, 2014; originally announced November 2014.

Comments: 17 pages

arXiv:1410.3542 [pdf, ps, other]

Asymmetric Error Correction and Flash-Memory Rewriting using Polar Codes

Authors: Eyal En Gad, Yue Li, Joerg Kliewer, Michael Langberg, Anxiao Jiang, Jehoshua Bruck

Abstract: We propose efficient coding schemes for two communication settings: 1. asymmetric channels, and 2. channels with an informed encoder. These settings are important in non-volatile memories, as well as optical and broadcast communication. The schemes are based on non-linear polar codes, and they build on and improve recent work on these settings. In asymmetric channels, we tackle the exponential sto… ▽ More We propose efficient coding schemes for two communication settings: 1. asymmetric channels, and 2. channels with an informed encoder. These settings are important in non-volatile memories, as well as optical and broadcast communication. The schemes are based on non-linear polar codes, and they build on and improve recent work on these settings. In asymmetric channels, we tackle the exponential storage requirement of previously known schemes, that resulted from the use of large Boolean functions. We propose an improved scheme, that achieves the capacity of asymmetric channels with polynomial computational complexity and storage requirement. The proposed non-linear scheme is then generalized to the setting of channel coding with an informed encoder, using a multicoding technique. We consider specific instances of the scheme for flash memories, that incorporate error-correction capabilities together with rewriting. Since the considered codes are non-linear, they eliminate the requirement of previously known schemes (called polar write-once-memory codes) for shared randomness between the encoder and the decoder. Finally, we mention that the multicoding scheme is also useful for broadcast communication in Marton's region, improving upon previous schemes for this setting. △ Less

Submitted 28 December, 2015; v1 submitted 13 October, 2014; originally announced October 2014.

Comments: Submitted to IEEE Transactions on Information Theory. Partially presented at ISIT 2014

arXiv:1401.4634 [pdf, ps, other]

The Capacity of String-Replication Systems

Authors: Farzad Farnoud, Moshe Schwartz, Jehoshua Bruck

Abstract: It is known that the majority of the human genome consists of repeated sequences. Furthermore, it is believed that a significant part of the rest of the genome also originated from repeated sequences and has mutated to its current form. In this paper, we investigate the possibility of constructing an exponentially large number of sequences from a short initial sequence and simple replication rules… ▽ More It is known that the majority of the human genome consists of repeated sequences. Furthermore, it is believed that a significant part of the rest of the genome also originated from repeated sequences and has mutated to its current form. In this paper, we investigate the possibility of constructing an exponentially large number of sequences from a short initial sequence and simple replication rules, including those resembling genomic replication processes. In other words, our goal is to find out the capacity, or the expressive power, of these string-replication systems. Our results include exact capacities, and bounds on the capacities, of four fundamental string-replication systems. △ Less

Submitted 18 January, 2014; originally announced January 2014.

arXiv:1401.3093 [pdf, ps, other]

Rate-Distortion for Ranking with Incomplete Information

Authors: Farzad Farnoud, Moshe Schwartz, Jehoshua Bruck

Abstract: We study the rate-distortion relationship in the set of permutations endowed with the Kendall Tau metric and the Chebyshev metric. Our study is motivated by the application of permutation rate-distortion to the average-case and worst-case analysis of algorithms for ranking with incomplete information and approximate sorting algorithms. For the Kendall Tau metric we provide bounds for small, medium… ▽ More We study the rate-distortion relationship in the set of permutations endowed with the Kendall Tau metric and the Chebyshev metric. Our study is motivated by the application of permutation rate-distortion to the average-case and worst-case analysis of algorithms for ranking with incomplete information and approximate sorting algorithms. For the Kendall Tau metric we provide bounds for small, medium, and large distortion regimes, while for the Chebyshev metric we present bounds that are valid for all distortions and are especially accurate for small distortions. In addition, for the Chebyshev metric, we provide a construction for covering codes. △ Less

Submitted 14 January, 2014; originally announced January 2014.

arXiv:1312.0972 [pdf, ps, other]

Rank-Modulation Rewrite Coding for Flash Memories

Authors: Eyal En Gad, Eitan Yaakobi, Anxiao, Jiang, Jehoshua Bruck

Abstract: The current flash memory technology focuses on the cost minimization of its static storage capacity. However, the resulting approach supports a relatively small number of program-erase cycles. This technology is effective for consumer devices (e.g., smartphones and cameras) where the number of program-erase cycles is small. However, it is not economical for enterprise storage systems that require… ▽ More The current flash memory technology focuses on the cost minimization of its static storage capacity. However, the resulting approach supports a relatively small number of program-erase cycles. This technology is effective for consumer devices (e.g., smartphones and cameras) where the number of program-erase cycles is small. However, it is not economical for enterprise storage systems that require a large number of lifetime writes. The proposed approach in this paper for alleviating this problem consists of the efficient integration of two key ideas: (i) improving reliability and endurance by representing the information using relative values via the rank modulation scheme and (ii) increasing the overall (lifetime) capacity of the flash device via rewriting codes, namely, performing multiple writes per cell before erasure. This paper presents a new coding scheme that combines rank modulation with rewriting. The key benefits of the new scheme include: (i) the ability to store close to 2 bits per cell on each write with minimal impact on the lifetime of the memory, and (ii) efficient encoding and decoding algorithms that make use of capacity-achieving write-once-memory (WOM) codes that were proposed recently. △ Less

Submitted 30 December, 2014; v1 submitted 3 December, 2013; originally announced December 2013.

Comments: Revised version for IEEE transactions on Information Theory

arXiv:1311.7113 [pdf, ps, other]

Systematic Codes for Rank Modulation

Authors: Sarit Buzaglo, Eitan Yaakobi, Tuvi Etzion, Jehoshua Bruck

Abstract: The goal of this paper is to construct systematic error-correcting codes for permutations and multi-permutations in the Kendall's $τ$-metric. These codes are important in new applications such as rank modulation for flash memories. The construction is based on error-correcting codes for multi-permutations and a partition of the set of permutations into error-correcting codes. For a given large eno… ▽ More The goal of this paper is to construct systematic error-correcting codes for permutations and multi-permutations in the Kendall's $τ$-metric. These codes are important in new applications such as rank modulation for flash memories. The construction is based on error-correcting codes for multi-permutations and a partition of the set of permutations into error-correcting codes. For a given large enough number of information symbols $k$, and for any integer $t$, we present a construction for ${(k+r,k)}$ systematic $t$-error-correcting codes, for permutations from $S_{k+r}$, with less redundancy symbols than the number of redundancy symbols in the codes of the known constructions. In particular, for a given $t$ and for sufficiently large $k$ we can obtain $r=t+1$. The same construction is also applied to obtain related systematic error-correcting codes for multi-permutations. △ Less

Submitted 20 April, 2014; v1 submitted 27 November, 2013; originally announced November 2013.

Comments: to be presented ISIT2014

arXiv:1310.6817 [pdf, ps, other]

Systematic Error-Correcting Codes for Rank Modulation

Authors: Hongchao Zhou, Moshe Schwartz, Anxiao Jiang, Jehoshua Bruck

Abstract: The rank-modulation scheme has been recently proposed for efficiently storing data in nonvolatile memories. Error-correcting codes are essential for rank modulation, however, existing results have been limited. In this work we explore a new approach, \emph{systematic error-correcting codes for rank modulation}. Systematic codes have the benefits of enabling efficient information retrieval and pote… ▽ More The rank-modulation scheme has been recently proposed for efficiently storing data in nonvolatile memories. Error-correcting codes are essential for rank modulation, however, existing results have been limited. In this work we explore a new approach, \emph{systematic error-correcting codes for rank modulation}. Systematic codes have the benefits of enabling efficient information retrieval and potentially supporting more efficient encoding and decoding procedures. We study systematic codes for rank modulation under Kendall's $τ$-metric as well as under the $\ell_\infty$-metric. In Kendall's $τ$-metric we present $[k+2,k,3]$-systematic codes for correcting one error, which have optimal rates, unless systematic perfect codes exist. We also study the design of multi-error-correcting codes, and provide two explicit constructions, one resulting in $[n+1,k+1,2t+2]$ systematic codes with redundancy at most $2t+1$. We use non-constructive arguments to show the existence of $[n,k,n-k]$-systematic codes for general parameters. Furthermore, we prove that for rank modulation, systematic codes achieve the same capacity as general error-correcting codes. Finally, in the $\ell_\infty$-metric we construct two $[n,k,d]$ systematic multi-error-correcting codes, the first for the case of $d=O(1)$, and the second for $d=Θ(n)$. In the latter case, the codes have the same asymptotic rate as the best codes currently known in this metric. △ Less

Submitted 25 October, 2013; originally announced October 2013.

arXiv:1303.3668 [pdf, ps, other]

doi 10.1109/ISIT.2012.6283042

Access vs. Bandwidth in Codes for Storage

Authors: Itzhak Tamo, Zhiying Wang, Jehoshua Bruck

Abstract: Maximum distance separable (MDS) codes are widely used in storage systems to protect against disk (node) failures. A node is said to have capacity $l$ over some field $\mathbb{F}$, if it can store that amount of symbols of the field. An $(n,k,l)$ MDS code uses $n$ nodes of capacity $l$ to store $k$ information nodes. The MDS property guarantees the resiliency to any $n-k$ node failures. An \emph{o… ▽ More Maximum distance separable (MDS) codes are widely used in storage systems to protect against disk (node) failures. A node is said to have capacity $l$ over some field $\mathbb{F}$, if it can store that amount of symbols of the field. An $(n,k,l)$ MDS code uses $n$ nodes of capacity $l$ to store $k$ information nodes. The MDS property guarantees the resiliency to any $n-k$ node failures. An \emph{optimal bandwidth} (resp. \emph{optimal access}) MDS code communicates (resp. accesses) the minimum amount of data during the repair process of a single failed node. It was shown that this amount equals a fraction of $1/(n-k)$ of data stored in each node. In previous optimal bandwidth constructions, $l$ scaled polynomially with $k$ in codes with asymptotic rate $<1$. Moreover, in constructions with a constant number of parities, i.e. rate approaches 1, $l$ is scaled exponentially w.r.t. $k$. In this paper, we focus on the later case of constant number of parities $n-k=r$, and ask the following question: Given the capacity of a node $l$ what is the largest number of information disks $k$ in an optimal bandwidth (resp. access) $(k+r,k,l)$ MDS code. We give an upper bound for the general case, and two tight bounds in the special cases of two important families of codes. Moreover, the bounds show that in some cases optimal-bandwidth code has larger $k$ than optimal-access code, and therefore these two measures are not equivalent. △ Less

Submitted 14 March, 2013; originally announced March 2013.

Comments: This paper was presented in part at the IEEE International Symposium on Information Theory (ISIT 2012). submitted to IEEE transactions on information theory

arXiv:1209.0744 [pdf, ps, other]

Balanced Modulation for Nonvolatile Memories

Authors: Hongchao Zhou, Anxiao, Jiang, Jehoshua Bruck

Abstract: This paper presents a practical writing/reading scheme in nonvolatile memories, called balanced modulation, for minimizing the asymmetric component of errors. The main idea is to encode data using a balanced error-correcting code. When reading information from a block, it adjusts the reading threshold such that the resulting word is also balanced or approximately balanced. Balanced modulation has… ▽ More This paper presents a practical writing/reading scheme in nonvolatile memories, called balanced modulation, for minimizing the asymmetric component of errors. The main idea is to encode data using a balanced error-correcting code. When reading information from a block, it adjusts the reading threshold such that the resulting word is also balanced or approximately balanced. Balanced modulation has suboptimal performance for any cell-level distribution and it can be easily implemented in the current systems of nonvolatile memories. Furthermore, we studied the construction of balanced error-correcting codes, in particular, balanced LDPC codes. It has very efficient encoding and decoding algorithms, and it is more efficient than prior construction of balanced error-correcting codes. △ Less