-
Nearest Neighbor Representations of Neural Circuits
Authors:
Kordag Mehmet Kilic,
Jin Sima,
Jehoshua Bruck
Abstract:
Neural networks successfully capture the computational power of the human brain for many tasks. Similarly inspired by the brain architecture, Nearest Neighbor (NN) representations is a novel approach of computation. We establish a firmer correspondence between NN representations and neural networks. Although it was known how to represent a single neuron using NN representations, there were no resu…
▽ More
Neural networks successfully capture the computational power of the human brain for many tasks. Similarly inspired by the brain architecture, Nearest Neighbor (NN) representations is a novel approach of computation. We establish a firmer correspondence between NN representations and neural networks. Although it was known how to represent a single neuron using NN representations, there were no results even for small depth neural networks. Specifically, for depth-2 threshold circuits, we provide explicit constructions for their NN representation with an explicit bound on the number of bits to represent it. Example functions include NN representations of convex polytopes (AND of threshold gates), IP2, OR of threshold gates, and linear or exact decision lists.
△ Less
Submitted 9 May, 2024; v1 submitted 13 February, 2024;
originally announced February 2024.
-
Nearest Neighbor Representations of Neurons
Authors:
Kordag Mehmet Kilic,
Jin Sima,
Jehoshua Bruck
Abstract:
The Nearest Neighbor (NN) Representation is an emerging computational model that is inspired by the brain. We study the complexity of representing a neuron (threshold function) using the NN representations. It is known that two anchors (the points to which NN is computed) are sufficient for a NN representation of a threshold function, however, the resolution (the maximum number of bits required fo…
▽ More
The Nearest Neighbor (NN) Representation is an emerging computational model that is inspired by the brain. We study the complexity of representing a neuron (threshold function) using the NN representations. It is known that two anchors (the points to which NN is computed) are sufficient for a NN representation of a threshold function, however, the resolution (the maximum number of bits required for the entries of an anchor) is $O(n\log{n})$. In this work, the trade-off between the number of anchors and the resolution of a NN representation of threshold functions is investigated. We prove that the well-known threshold functions EQUALITY, COMPARISON, and ODD-MAX-BIT, which require 2 or 3 anchors and resolution of $O(n)$, can be represented by polynomially large number of anchors in $n$ and $O(\log{n})$ resolution. We conjecture that for all threshold functions, there are NN representations with polynomially large size and logarithmic resolution in $n$.
△ Less
Submitted 9 May, 2024; v1 submitted 13 February, 2024;
originally announced February 2024.
-
Omitted Labels Induce Nontransitive Paradoxes in Causality
Authors:
Bijan Mazaheri,
Siddharth Jain,
Matthew Cook,
Jehoshua Bruck
Abstract:
We explore "omitted label contexts," in which training data is limited to a subset of the possible labels. This setting is standard among specialized human experts or specific, focused studies. By studying Simpson's paradox, we observe that ``correct'' adjustments sometimes require non-exchangeable treatment and control groups. A generalization of Simpson's paradox leads us to study networks of co…
▽ More
We explore "omitted label contexts," in which training data is limited to a subset of the possible labels. This setting is standard among specialized human experts or specific, focused studies. By studying Simpson's paradox, we observe that ``correct'' adjustments sometimes require non-exchangeable treatment and control groups. A generalization of Simpson's paradox leads us to study networks of conclusions drawn from different contexts, within which a paradox of nontransitivity arises. We prove that the space of possible nontransitive structures in these networks exactly corresponds to structures that form from aggregating ranked-choice votes.
△ Less
Submitted 30 April, 2025; v1 submitted 12 November, 2023;
originally announced November 2023.
-
Error Correction for DNA Storage
Authors:
Jin Sima,
Netanel Raviv,
Moshe Schwartz,
Jehoshua Bruck
Abstract:
DNA-based storage is an emerging storage technology that provides high information density and long duration. Due to the physical constraints in the reading and writing processes, error correction in DNA storage poses several interesting coding theoretic challenges, some of which are new. In this paper, we give a brief introduction to some of the coding challenges for DNA-based storage, including…
▽ More
DNA-based storage is an emerging storage technology that provides high information density and long duration. Due to the physical constraints in the reading and writing processes, error correction in DNA storage poses several interesting coding theoretic challenges, some of which are new. In this paper, we give a brief introduction to some of the coding challenges for DNA-based storage, including deletion/insertion correcting codes, codes over sliced channels, and duplication correcting codes.
△ Less
Submitted 2 October, 2023;
originally announced October 2023.
-
Robust Indexing for the Sliced Channel: Almost Optimal Codes for Substitutions and Deletions
Authors:
Jin Sima,
Netanel Raviv,
Jehoshua Bruck
Abstract:
Encoding data as a set of unordered strings is receiving great attention as it captures one of the basic features of DNA storage systems. However, the challenge of constructing optimal redundancy codes for this channel remained elusive. In this paper, we address this problem and present an order-wise optimal construction of codes that are capable of correcting multiple substitution, deletion, and…
▽ More
Encoding data as a set of unordered strings is receiving great attention as it captures one of the basic features of DNA storage systems. However, the challenge of constructing optimal redundancy codes for this channel remained elusive. In this paper, we address this problem and present an order-wise optimal construction of codes that are capable of correcting multiple substitution, deletion, and insertion errors for this channel model. The key ingredient in the code construction is a technique we call robust indexing: simultaneously assigning indices to unordered strings (hence, creating order) and also embedding information in these indices.
The encoded indices are resilient to substitution, deletion, and insertion errors, and therefore, so is the entire code.
△ Less
Submitted 15 August, 2023;
originally announced August 2023.
-
On the Information Capacity of Nearest Neighbor Representations
Authors:
Kordag Mehmet Kilic,
Jin Sima,
Jehoshua Bruck
Abstract:
The $\textit{von Neumann Computer Architecture}$ has a distinction between computation and memory. In contrast, the brain has an integrated architecture where computation and memory are indistinguishable. Motivated by the architecture of the brain, we propose a model of $\textit{associative computation}$ where memory is defined by a set of vectors in $\mathbb{R}^n$ (that we call…
▽ More
The $\textit{von Neumann Computer Architecture}$ has a distinction between computation and memory. In contrast, the brain has an integrated architecture where computation and memory are indistinguishable. Motivated by the architecture of the brain, we propose a model of $\textit{associative computation}$ where memory is defined by a set of vectors in $\mathbb{R}^n$ (that we call $\textit{anchors}$), computation is performed by convergence from an input vector to a nearest neighbor anchor, and the output is a label associated with an anchor. Specifically, in this paper, we study the representation of Boolean functions in the associative computation model, where the inputs are binary vectors and the corresponding outputs are the labels ($0$ or $1$) of the nearest neighbor anchors. The information capacity of a Boolean function in this model is associated with two quantities: $\textit{(i)}$ the number of anchors (called $\textit{Nearest Neighbor (NN) Complexity}$) and $\textit{(ii)}$ the maximal number of bits representing entries of anchors (called $\textit{Resolution}$). We study symmetric Boolean functions and present constructions that have optimal NN complexity and resolution.
△ Less
Submitted 9 May, 2023;
originally announced May 2023.
-
Correcting $k$ Deletions and Insertions in Racetrack Memory
Authors:
Jin Sima,
Jehoshua Bruck
Abstract:
One of the main challenges in developing racetrack memory systems is the limited precision in controlling the track shifts, that in turn affects the reliability of reading and writing the data. A current proposal for combating deletions in racetrack memories is to use redundant heads per-track resulting in multiple copies (potentially erroneous) and recovering the data by solving a specialized ver…
▽ More
One of the main challenges in developing racetrack memory systems is the limited precision in controlling the track shifts, that in turn affects the reliability of reading and writing the data. A current proposal for combating deletions in racetrack memories is to use redundant heads per-track resulting in multiple copies (potentially erroneous) and recovering the data by solving a specialized version of a sequence reconstruction problem. Using this approach, $k$-deletion correcting codes of length $n$, with $d \ge 2$ heads per-track, with redundancy $\log \log n + 4$ were constructed. However, the known approach requires that $k \le d$, namely, that the number of heads ($d$) is larger than or equal to the number of correctable deletions ($k$). Here we address the question: What is the best redundancy that can be achieved for a $k$-deletion code ($k$ is a constant) if the number of heads is fixed at $d$ (due to implementation constraints)? One of our key results is an answer to this question, namely, we construct codes that can correct $k$ deletions, for any $k$ beyond the known limit of $d$. The code has $4k \log \log n+o(\log \log n)$ redundancy for $k \le 2d-1$. In addition, when $k \ge 2d$, our codes have $2 \lfloor k/d\rfloor \log n+o(\log n)$ redundancy, that we prove it is order-wise optimal, specifically, we prove that the redundancy required for correcting $k$ deletions is at least $\lfloor k/d\rfloor \log n+o(\log n)$. The encoding/decoding complexity of our codes is $O(n\log^{2k}n)$. Finally, we ask a general question: What is the optimal redundancy for codes correcting a combination of at most $k$ deletions and insertions in a $d$-head racetrack memory? We prove that the redundancy sufficient to correct a combination of $k$ deletion and insertion errors is similar to the case of $k$ deletion errors.
△ Less
Submitted 18 July, 2022;
originally announced July 2022.
-
On Algebraic Constructions of Neural Networks with Small Weights
Authors:
Kordag Mehmet Kilic,
Jin Sima,
Jehoshua Bruck
Abstract:
Neural gates compute functions based on weighted sums of the input variables. The expressive power of neural gates (number of distinct functions it can compute) depends on the weight sizes and, in general, large weights (exponential in the number of inputs) are required. Studying the trade-offs among the weight sizes, circuit size and depth is a well-studied topic both in circuit complexity theory…
▽ More
Neural gates compute functions based on weighted sums of the input variables. The expressive power of neural gates (number of distinct functions it can compute) depends on the weight sizes and, in general, large weights (exponential in the number of inputs) are required. Studying the trade-offs among the weight sizes, circuit size and depth is a well-studied topic both in circuit complexity theory and the practice of neural computation. We propose a new approach for studying these complexity trade-offs by considering a related algebraic framework. Specifically, given a single linear equation with arbitrary coefficients, we would like to express it using a system of linear equations with smaller (even constant) coefficients. The techniques we developed are based on Siegel's Lemma for the bounds, anti-concentration inequalities for the existential results and extensions of Sylvester-type Hadamard matrices for the constructions.
We explicitly construct a constant weight, optimal size matrix to compute the EQUALITY function (checking if two integers expressed in binary are equal). Computing EQUALITY with a single linear equation requires exponentially large weights. In addition, we prove the existence of the best-known weight size (linear) matrices to compute the COMPARISON function (comparing between two integers expressed in binary). In the context of the circuit complexity theory, our results improve the upper bounds on the weight sizes for the best-known circuit sizes for EQUALITY and COMPARISON.
△ Less
Submitted 16 May, 2022;
originally announced May 2022.
-
Expert Graphs: Synthesizing New Expertise via Collaboration
Authors:
Bijan Mazaheri,
Siddharth Jain,
Jehoshua Bruck
Abstract:
Consider multiple experts with overlapping expertise working on a classification problem under uncertain input. What constitutes a consistent set of opinions? How can we predict the opinions of experts on missing sub-domains? In this paper, we define a framework of to analyze this problem, termed "expert graphs." In an expert graph, vertices represent classes and edges represent binary opinions on…
▽ More
Consider multiple experts with overlapping expertise working on a classification problem under uncertain input. What constitutes a consistent set of opinions? How can we predict the opinions of experts on missing sub-domains? In this paper, we define a framework of to analyze this problem, termed "expert graphs." In an expert graph, vertices represent classes and edges represent binary opinions on the topics of their vertices. We derive necessary conditions for expert graph validity and use them to create "synthetic experts" which describe opinions consistent with the observed opinions of other experts. We show this framework to be equivalent to the well-studied linear ordering polytope. We show our conditions are not sufficient for describing all expert graphs on cliques, but are sufficient for cycles.
△ Less
Submitted 14 July, 2021;
originally announced July 2021.
-
Trace Reconstruction with Bounded Edit Distance
Authors:
Jin Sima,
Jehoshua Bruck
Abstract:
The trace reconstruction problem studies the number of noisy samples needed to recover an unknown string $\boldsymbol{x}\in\{0,1\}^n$ with high probability, where the samples are independently obtained by passing $\boldsymbol{x}$ through a random deletion channel with deletion probability $q$. The problem is receiving significant attention recently due to its applications in DNA sequencing and DNA…
▽ More
The trace reconstruction problem studies the number of noisy samples needed to recover an unknown string $\boldsymbol{x}\in\{0,1\}^n$ with high probability, where the samples are independently obtained by passing $\boldsymbol{x}$ through a random deletion channel with deletion probability $q$. The problem is receiving significant attention recently due to its applications in DNA sequencing and DNA storage. Yet, there is still an exponential gap between upper and lower bounds for the trace reconstruction problem. In this paper we study the trace reconstruction problem when $\boldsymbol{x}$ is confined to an edit distance ball of radius $k$, which is essentially equivalent to distinguishing two strings with edit distance at most $k$. It is shown that $n^{O(k)}$ samples suffice to achieve this task with high probability.
△ Less
Submitted 14 April, 2021; v1 submitted 10 February, 2021;
originally announced February 2021.
-
Robust Correction of Sampling Bias Using Cumulative Distribution Functions
Authors:
Bijan Mazaheri,
Siddharth Jain,
Jehoshua Bruck
Abstract:
Varying domains and biased datasets can lead to differences between the training and the target distributions, known as covariate shift. Current approaches for alleviating this often rely on estimating the ratio of training and target probability density functions. These techniques require parameter tuning and can be unstable across different datasets. We present a new method for handling covariat…
▽ More
Varying domains and biased datasets can lead to differences between the training and the target distributions, known as covariate shift. Current approaches for alleviating this often rely on estimating the ratio of training and target probability density functions. These techniques require parameter tuning and can be unstable across different datasets. We present a new method for handling covariate shift using the empirical cumulative distribution function estimates of the target distribution by a rigorous generalization of a recent idea proposed by Vapnik and Izmailov. Further, we show experimentally that our method is more robust in its predictions, is not reliant on parameter tuning and shows similar classification performance compared to the current state-of-the-art techniques on synthetic and real datasets.
△ Less
Submitted 23 October, 2020;
originally announced October 2020.
-
Coding for Optimized Writing Rate in DNA Storage
Authors:
Siddharth Jain,
Farzad Farnoud,
Moshe Schwartz,
Jehoshua Bruck
Abstract:
A method for encoding information in DNA sequences is described. The method is based on the precision-resolution framework, and is aimed to work in conjunction with a recently suggested terminator-free template independent DNA synthesis method. The suggested method optimizes the amount of information bits per synthesis time unit, namely, the writing rate. Additionally, the encoding scheme studied…
▽ More
A method for encoding information in DNA sequences is described. The method is based on the precision-resolution framework, and is aimed to work in conjunction with a recently suggested terminator-free template independent DNA synthesis method. The suggested method optimizes the amount of information bits per synthesis time unit, namely, the writing rate. Additionally, the encoding scheme studied here takes into account the existence of multiple copies of the DNA sequence, which are independently distorted. Finally, quantizers for various run-length distributions are designed.
△ Less
Submitted 13 May, 2020; v1 submitted 7 May, 2020;
originally announced May 2020.
-
CodNN -- Robust Neural Networks From Coded Classification
Authors:
Netanel Raviv,
Siddharth Jain,
Pulakesh Upadhyaya,
Jehoshua Bruck,
Anxiao Jiang
Abstract:
Deep Neural Networks (DNNs) are a revolutionary force in the ongoing information revolution, and yet their intrinsic properties remain a mystery. In particular, it is widely known that DNNs are highly sensitive to noise, whether adversarial or random. This poses a fundamental challenge for hardware implementations of DNNs, and for their deployment in critical applications such as autonomous drivin…
▽ More
Deep Neural Networks (DNNs) are a revolutionary force in the ongoing information revolution, and yet their intrinsic properties remain a mystery. In particular, it is widely known that DNNs are highly sensitive to noise, whether adversarial or random. This poses a fundamental challenge for hardware implementations of DNNs, and for their deployment in critical applications such as autonomous driving. In this paper we construct robust DNNs via error correcting codes. By our approach, either the data or internal layers of the DNN are coded with error correcting codes, and successful computation under noise is guaranteed. Since DNNs can be seen as a layered concatenation of classification tasks, our research begins with the core task of classifying noisy coded inputs, and progresses towards robust DNNs. We focus on binary data and linear codes. Our main result is that the prevalent parity code can guarantee robustness for a large family of DNNs, which includes the recently popularized binarized neural networks. Further, we show that the coded classification problem has a deep connection to Fourier analysis of Boolean functions. In contrast to existing solutions in the literature, our results do not rely on altering the training process of the DNN, and provide mathematically rigorous guarantees rather than experimental evidence.
△ Less
Submitted 29 April, 2020; v1 submitted 22 April, 2020;
originally announced April 2020.
-
What is the Value of Data? On Mathematical Methods for Data Quality Estimation
Authors:
Netanel Raviv,
Siddharth Jain,
Jehoshua Bruck
Abstract:
Data is one of the most important assets of the information age, and its societal impact is undisputed. Yet, rigorous methods of assessing the quality of data are lacking. In this paper, we propose a formal definition for the quality of a given dataset. We assess a dataset's quality by a quantity we call the expected diameter, which measures the expected disagreement between two randomly chosen hy…
▽ More
Data is one of the most important assets of the information age, and its societal impact is undisputed. Yet, rigorous methods of assessing the quality of data are lacking. In this paper, we propose a formal definition for the quality of a given dataset. We assess a dataset's quality by a quantity we call the expected diameter, which measures the expected disagreement between two randomly chosen hypotheses that explain it, and has recently found applications in active learning. We focus on Boolean hyperplanes, and utilize a collection of Fourier analytic, algebraic, and probabilistic methods to come up with theoretical guarantees and practical solutions for the computation of the expected diameter. We also study the behaviour of the expected diameter on algebraically structured datasets, conduct experiments that validate this notion of quality, and demonstrate the feasibility of our techniques.
△ Less
Submitted 12 May, 2020; v1 submitted 9 January, 2020;
originally announced January 2020.
-
Optimal $k$-Deletion Correcting Codes
Authors:
Jin Sima,
Jehoshua Bruck
Abstract:
Levenshtein introduced the problem of constructing $k$-deletion correcting codes in 1966, proved that the optimal redundancy of those codes is $O(k\log N)$, and proposed an optimal redundancy single-deletion correcting code (using the so-called VT construction). However, the problem of constructing optimal redundancy $k$-deletion correcting codes remained open. Our key contribution is a solution t…
▽ More
Levenshtein introduced the problem of constructing $k$-deletion correcting codes in 1966, proved that the optimal redundancy of those codes is $O(k\log N)$, and proposed an optimal redundancy single-deletion correcting code (using the so-called VT construction). However, the problem of constructing optimal redundancy $k$-deletion correcting codes remained open. Our key contribution is a solution to this longstanding open problem. We present a $k$-deletion correcting code that has redundancy $8k\log n +o(\log n)$ and encoding/decoding algorithms of complexity $O(n^{2k+1})$ for constant $k$.
△ Less
Submitted 27 October, 2019;
originally announced October 2019.
-
Iterative Programming of Noisy Memory Cells
Authors:
Michal Horovitz,
Eitan Yaakobi,
Eyal En Gad,
Jehoshua Bruck
Abstract:
In this paper, we study a model, which was first presented by Bunte and Lapidoth, that mimics the programming operation of memory cells. Under this paradigm we assume that cells are programmed sequentially and individually. The programming process is modeled as transmission over a channel, while it is possible to read the cell state in order to determine its programming success, and in case of pro…
▽ More
In this paper, we study a model, which was first presented by Bunte and Lapidoth, that mimics the programming operation of memory cells. Under this paradigm we assume that cells are programmed sequentially and individually. The programming process is modeled as transmission over a channel, while it is possible to read the cell state in order to determine its programming success, and in case of programming failure, to reprogram the cell again. Reprogramming a cell can reduce the bit error rate, however this comes with the price of increasing the overall programming time and thereby affecting the writing speed of the memory. An iterative programming scheme is an algorithm which specifies the number of attempts to program each cell. Given the programming channel and constraints on the average and maximum number of attempts to program a cell, we study programming schemes which maximize the number of bits that can be reliably stored in the memory. We extend the results by Bunte and Lapidoth and study this problem when the programming channel is either the BSC, BEC, or $Z$ channel. For the BSC and the BEC our analysis is also extended for the case where the error probabilities on consecutive writes are not necessarily the same. Lastly, we also study a related model which is motivated by the synthesis process of DNA molecules.
△ Less
Submitted 10 January, 2019;
originally announced January 2019.
-
Evolution of $k$-mer Frequencies and Entropy in Duplication and Substitution Mutation Systems
Authors:
Hao Lou,
Farzad Farnoud,
Moshe Schwartz,
Jehoshua Bruck
Abstract:
Genomic evolution can be viewed as string-editing processes driven by mutations. An understanding of the statistical properties resulting from these mutation processes is of value in a variety of tasks related to biological sequence data, e.g., estimation of model parameters and compression. At the same time, due to the complexity of these processes, designing tractable stochastic models and analy…
▽ More
Genomic evolution can be viewed as string-editing processes driven by mutations. An understanding of the statistical properties resulting from these mutation processes is of value in a variety of tasks related to biological sequence data, e.g., estimation of model parameters and compression. At the same time, due to the complexity of these processes, designing tractable stochastic models and analyzing them are challenging. In this paper, we study two kinds of systems, each representing a set of mutations. In the first system, tandem duplications and substitution mutations are allowed and in the other, interspersed duplications. We provide stochastic models and, via stochastic approximation, study the evolution of substring frequencies for these two systems separately. Specifically, we show that $k$-mer frequencies converge almost surely and determine the limit set. Furthermore, we present a method for finding upper bounds on entropy for such systems.
△ Less
Submitted 5 December, 2018;
originally announced December 2018.
-
On Coding over Sliced Information
Authors:
Jin Sima,
Netanel Raviv,
Jehoshua Bruck
Abstract:
The interest in channel models in which the data is sent as an unordered set of binary strings has increased lately, due to emerging applications in DNA storage, among others. In this paper we analyze the minimal redundancy of binary codes for this channel under substitution errors, and provide several constructions, some of which are shown to be asymptotically optimal up to constants. The surpris…
▽ More
The interest in channel models in which the data is sent as an unordered set of binary strings has increased lately, due to emerging applications in DNA storage, among others. In this paper we analyze the minimal redundancy of binary codes for this channel under substitution errors, and provide several constructions, some of which are shown to be asymptotically optimal up to constants. The surprising result in this paper is that while the information vector is sliced into a set of unordered strings, the amount of redundant bits that are required to correct errors is order-wise equivalent to the amount required in the classical error correcting paradigm.
△ Less
Submitted 27 October, 2019; v1 submitted 7 September, 2018;
originally announced September 2018.
-
The Capacity of Some Pólya String Models
Authors:
Ohad Elishco,
Farzad Farnoud,
Moshe Schwartz,
Jehoshua Bruck
Abstract:
We study random string-duplication systems, which we call Pólya string models. These are motivated by DNA storage in living organisms, and certain random mutation processes that affect their genome. Unlike previous works that study the combinatorial capacity of string-duplication systems, or various string statistics, this work provides exact capacity or bounds on it, for several probabilistic mod…
▽ More
We study random string-duplication systems, which we call Pólya string models. These are motivated by DNA storage in living organisms, and certain random mutation processes that affect their genome. Unlike previous works that study the combinatorial capacity of string-duplication systems, or various string statistics, this work provides exact capacity or bounds on it, for several probabilistic models. In particular, we study the capacity of noisy string-duplication systems, including the tandem-duplication, end-duplication, and interspersed-duplication systems. Interesting connections are drawn between some systems and the signature of random permutations, as well as to the beta distribution common in population genetics.
△ Less
Submitted 18 August, 2018;
originally announced August 2018.
-
Two Deletion Correcting Codes from Indicator Vectors
Authors:
Jin Sima,
Netanel Raviv,
Jehoshua Bruck
Abstract:
Construction of capacity achieving deletion correcting codes has been a baffling challenge for decades. A recent breakthrough by Brakensiek $et~al$., alongside novel applications in DNA storage, have reignited the interest in this longstanding open problem. In spite of recent advances, the amount of redundancy in existing codes is still orders of magnitude away from being optimal. In this paper, a…
▽ More
Construction of capacity achieving deletion correcting codes has been a baffling challenge for decades. A recent breakthrough by Brakensiek $et~al$., alongside novel applications in DNA storage, have reignited the interest in this longstanding open problem. In spite of recent advances, the amount of redundancy in existing codes is still orders of magnitude away from being optimal. In this paper, a novel approach for constructing binary two-deletion correcting codes is proposed. By this approach, parity symbols are computed from indicator vectors (i.e., vectors that indicate the positions of certain patterns) of the encoded message, rather than from the message itself. Most interestingly, the parity symbols and the proof of correctness are a direct generalization of their counterparts in the Varshamov-Tenengolts construction. Our techniques require $7\log(n)+o(\log(n)$ redundant bits to encode an~$n$-bit message, which is near-optimal.
△ Less
Submitted 24 June, 2018;
originally announced June 2018.
-
Generic Secure Repair for Distributed Storage
Authors:
Wentao Huang,
Jehoshua Bruck
Abstract:
This paper studies the problem of repairing secret sharing schemes, i.e., schemes that encode a message into $n$ shares, assigned to $n$ nodes, so that any $n-r$ nodes can decode the message but any colluding $z$ nodes cannot infer any information about the message. In the event of node failures so that shares held by the failed nodes are lost, the system needs to be repaired by reconstructing and…
▽ More
This paper studies the problem of repairing secret sharing schemes, i.e., schemes that encode a message into $n$ shares, assigned to $n$ nodes, so that any $n-r$ nodes can decode the message but any colluding $z$ nodes cannot infer any information about the message. In the event of node failures so that shares held by the failed nodes are lost, the system needs to be repaired by reconstructing and reassigning the lost shares to the failed (or replacement) nodes. This can be achieved trivially by a trustworthy third-party that receives the shares of the available nodes, recompute and reassign the lost shares. The interesting question, studied in the paper, is how to repair without a trustworthy third-party. The main issue that arises is repair security: how to maintain the requirement that any colluding $z$ nodes, including the failed nodes, cannot learn any information about the message, during and after the repair process? We solve this secure repair problem from the perspective of secure multi-party computation. Specifically, we design generic repair schemes that can securely repair any (scalar or vector) linear secret sharing schemes. We prove a lower bound on the repair bandwidth of secure repair schemes and show that the proposed secure repair schemes achieve the optimal repair bandwidth up to a small constant factor when $n$ dominates $z$, or when the secret sharing scheme being repaired has optimal rate. We adopt a formal information-theoretic approach in our analysis and bounds. A main idea in our schemes is to allow a more flexible repair model than the straightforward one-round repair model implicitly assumed by existing secure regenerating codes. Particularly, the proposed secure repair schemes are simple and efficient two-round protocols.
△ Less
Submitted 1 June, 2017;
originally announced June 2017.
-
Duplication Distance to the Root for Binary Sequences
Authors:
Noga Alon,
Jehoshua Bruck,
Farzad Farnoud,
Siddharth Jain
Abstract:
We study the tandem duplication distance between binary sequences and their roots. In other words, the quantity of interest is the number of tandem duplication operations of the form $\seq x = \seq a \seq b \seq c \to \seq y = \seq a \seq b \seq b \seq c$, where $\seq x$ and $\seq y$ are sequences and $\seq a$, $\seq b$, and $\seq c$ are their substrings, needed to generate a binary sequence of le…
▽ More
We study the tandem duplication distance between binary sequences and their roots. In other words, the quantity of interest is the number of tandem duplication operations of the form $\seq x = \seq a \seq b \seq c \to \seq y = \seq a \seq b \seq b \seq c$, where $\seq x$ and $\seq y$ are sequences and $\seq a$, $\seq b$, and $\seq c$ are their substrings, needed to generate a binary sequence of length $n$ starting from a square-free sequence from the set $\{0,1,01,10,010,101\}$. This problem is a restricted case of finding the duplication/deduplication distance between two sequences, defined as the minimum number of duplication and deduplication operations required to transform one sequence to the other. We consider both exact and approximate tandem duplications. For exact duplication, denoting the maximum distance to the root of a sequence of length $n$ by $f(n)$, we prove that $f(n)=Θ(n)$. For the case of approximate duplication, where a $β$-fraction of symbols may be duplicated incorrectly, we show that the maximum distance has a sharp transition from linear in $n$ to logarithmic at $β=1/2$. We also study the duplication distance to the root for sequences with a given root and for special classes of sequences, namely, the de Bruijn sequences, the Thue-Morse sequence, and the Fibbonaci words. The problem is motivated by genomic tandem duplication mutations and the smallest number of tandem duplication events required to generate a given biological sequence.
△ Less
Submitted 16 November, 2016;
originally announced November 2016.
-
Duplication-Correcting Codes for Data Storage in the DNA of Living Organisms
Authors:
Siddharth Jain,
Farzad Farnoud,
Moshe Schwartz,
Jehoshua Bruck
Abstract:
The ability to store data in the DNA of a living organism has applications in a variety of areas including synthetic biology and watermarking of patented genetically-modified organisms. Data stored in this medium is subject to errors arising from various mutations, such as point mutations, indels, and tandem duplication, which need to be corrected to maintain data integrity. In this paper, we prov…
▽ More
The ability to store data in the DNA of a living organism has applications in a variety of areas including synthetic biology and watermarking of patented genetically-modified organisms. Data stored in this medium is subject to errors arising from various mutations, such as point mutations, indels, and tandem duplication, which need to be corrected to maintain data integrity. In this paper, we provide error-correcting codes for errors caused by tandem duplications, which create a copy of a block of the sequence and insert it in a tandem manner, i.e., next to the original. In particular, we present two families of codes for correcting errors due to tandem-duplications of a fixed length, the first family can correct any number of errors while the second corrects a bounded number of errors. We also study codes for correcting tandem duplications of length up to a given constant $k$, where we are primarily focused on the cases of $k=2,3$. Finally, we provide a full classification of the sets of lengths allowed in tandem duplication that result in a unique root for all sequences.
△ Less
Submitted 1 June, 2016;
originally announced June 2016.
-
Optimal Rebuilding of Multiple Erasures in MDS Codes
Authors:
Zhiying Wang,
Itzhak Tamo,
Jehoshua Bruck
Abstract:
MDS array codes are widely used in storage systems due to their computationally efficient encoding and decoding procedures. An MDS code with $r$ redundancy nodes can correct any $r$ node erasures by accessing all the remaining information in the surviving nodes. However, in practice, $e$ erasures is a more likely failure event, for $1\le e<r$. Hence, a natural question is how much information do w…
▽ More
MDS array codes are widely used in storage systems due to their computationally efficient encoding and decoding procedures. An MDS code with $r$ redundancy nodes can correct any $r$ node erasures by accessing all the remaining information in the surviving nodes. However, in practice, $e$ erasures is a more likely failure event, for $1\le e<r$. Hence, a natural question is how much information do we need to access in order to rebuild $e$ storage nodes? We define the rebuilding ratio as the fraction of remaining information accessed during the rebuilding of $e$ erasures. In our previous work we constructed MDS codes, called zigzag codes, that achieve the optimal rebuilding ratio of $1/r$ for the rebuilding of any systematic node when $e=1$, however, all the information needs to be accessed for the rebuilding of the parity node erasure.
The (normalized) repair bandwidth is defined as the fraction of information transmitted from the remaining nodes during the rebuilding process. For codes that are not necessarily MDS, Dimakis et al. proposed the regenerating codes framework where any $r$ erasures can be corrected by accessing some of the remaining information, and any $e=1$ erasure can be rebuilt from some subsets of surviving nodes with optimal repair bandwidth.
In this work, we study 3 questions on rebuilding of codes: (i) We show a fundamental trade-off between the storage size of the node and the repair bandwidth similar to the regenerating codes framework, and show that zigzag codes achieve the optimal rebuilding ratio of $e/r$ for MDS codes, for any $1\le e\le r$. (ii) We construct systematic codes that achieve optimal rebuilding ratio of $1/r$, for any systematic or parity node erasure. (iii) We present error correction algorithms for zigzag codes, and in particular demonstrate how these codes can be corrected beyond their minimum Hamming distances.
△ Less
Submitted 3 March, 2016;
originally announced March 2016.
-
Capacity and Expressiveness of Genomic Tandem Duplication
Authors:
Siddharth Jain,
Farzad Farnoud,
Jehoshua Bruck
Abstract:
The majority of the human genome consists of repeated sequences. An important type of repeated sequences common in the human genome are tandem repeats, where identical copies appear next to each other. For example, in the sequence $AGTC\underline{TGTG}C$, $TGTG$ is a tandem repeat, that may be generated from $AGTCTGC$ by a tandem duplication of length $2$. In this work, we investigate the possibil…
▽ More
The majority of the human genome consists of repeated sequences. An important type of repeated sequences common in the human genome are tandem repeats, where identical copies appear next to each other. For example, in the sequence $AGTC\underline{TGTG}C$, $TGTG$ is a tandem repeat, that may be generated from $AGTCTGC$ by a tandem duplication of length $2$. In this work, we investigate the possibility of generating a large number of sequences from a \textit{seed}, i.e.\ a small initial string, by tandem duplications of bounded length. We study the capacity of such a system, a notion that quantifies the system's generating power. Our results include \textit{exact capacity} values for certain tandem duplication string systems. In addition, motivated by the role of DNA sequences in expressing proteins via RNA and the genetic code, we define the notion of the \textit{expressiveness} of a tandem duplication system as the capability of expressing arbitrary substrings. We then \textit{completely} characterize the expressiveness of tandem duplication systems for general alphabet sizes and duplication lengths. In particular, based on a celebrated result by Axel Thue from 1906, presenting a construction for ternary square-free sequences, we show that for alphabets of size 4 or larger, bounded tandem duplication systems, regardless of the seed and the bound on duplication length, are not fully expressive, i.e. they cannot generate all strings even as substrings of other strings. Note that the alphabet of size 4 is of particular interest as it pertains to the genomic alphabet. Building on this result, we also show that these systems do not have full capacity. In general, our results illustrate that duplication lengths play a more significant role than the seed in generating a large number of sequences for these systems.
△ Less
Submitted 20 September, 2015;
originally announced September 2015.
-
Communication Efficient Secret Sharing
Authors:
Wentao Huang,
Michael Langberg,
Joerg Kliewer,
Jehoshua Bruck
Abstract:
A secret sharing scheme is a method to store information securely and reliably. Particularly, in a threshold secret sharing scheme, a secret is encoded into $n$ shares, such that any set of at least $t_1$ shares suffice to decode the secret, and any set of at most $t_2 < t_1$ shares reveal no information about the secret. Assuming that each party holds a share and a user wishes to decode the secre…
▽ More
A secret sharing scheme is a method to store information securely and reliably. Particularly, in a threshold secret sharing scheme, a secret is encoded into $n$ shares, such that any set of at least $t_1$ shares suffice to decode the secret, and any set of at most $t_2 < t_1$ shares reveal no information about the secret. Assuming that each party holds a share and a user wishes to decode the secret by receiving information from a set of parties; the question we study is how to minimize the amount of communication between the user and the parties. We show that the necessary amount of communication, termed "decoding bandwidth", decreases as the number of parties that participate in decoding increases. We prove a tight lower bound on the decoding bandwidth, and construct secret sharing schemes achieving the bound. Particularly, we design a scheme that achieves the optimal decoding bandwidth when $d$ parties participate in decoding, universally for all $t_1 \le d \le n$. The scheme is based on Shamir's secret sharing scheme and preserves its simplicity and efficiency. In addition, we consider secure distributed storage where the proposed communication efficient secret sharing schemes further improve disk access complexity during decoding.
△ Less
Submitted 1 April, 2016; v1 submitted 27 May, 2015;
originally announced May 2015.
-
Rewriting Flash Memories by Message Passing
Authors:
Eyal En Gad,
Wentao Huang,
Yue Li,
Jehoshua Bruck
Abstract:
This paper constructs WOM codes that combine rewriting and error correction for mitigating the reliability and the endurance problems in flash memory. We consider a rewriting model that is of practical interest to flash applications where only the second write uses WOM codes. Our WOM code construction is based on binary erasure quantization with LDGM codes, where the rewriting uses message passing…
▽ More
This paper constructs WOM codes that combine rewriting and error correction for mitigating the reliability and the endurance problems in flash memory. We consider a rewriting model that is of practical interest to flash applications where only the second write uses WOM codes. Our WOM code construction is based on binary erasure quantization with LDGM codes, where the rewriting uses message passing and has potential to share the efficient hardware implementations with LDPC codes in practice. We show that the coding scheme achieves the capacity of the rewriting model. Extensive simulations show that the rewriting performance of our scheme compares favorably with that of polar WOM code in the rate region where high rewriting success probability is desired. We further augment our coding schemes with error correction capability. By drawing a connection to the conjugate code pairs studied in the context of quantum error correction, we develop a general framework for constructing error-correction WOM codes. Under this framework, we give an explicit construction of WOM codes whose codewords are contained in BCH codes.
△ Less
Submitted 31 January, 2015;
originally announced February 2015.
-
Explicit MDS Codes for Optimal Repair Bandwidth
Authors:
Zhiying Wang,
Itzhak Tamo,
Jehoshua Bruck
Abstract:
MDS codes are erasure-correcting codes that can correct the maximum number of erasures for a given number of redundancy or parity symbols. If an MDS code has $r$ parities and no more than $r$ erasures occur, then by transmitting all the remaining data in the code, the original information can be recovered. However, it was shown that in order to recover a single symbol erasure, only a fraction of…
▽ More
MDS codes are erasure-correcting codes that can correct the maximum number of erasures for a given number of redundancy or parity symbols. If an MDS code has $r$ parities and no more than $r$ erasures occur, then by transmitting all the remaining data in the code, the original information can be recovered. However, it was shown that in order to recover a single symbol erasure, only a fraction of $1/r$ of the information needs to be transmitted. This fraction is called the repair bandwidth (fraction). Explicit code constructions were given in previous works. If we view each symbol in the code as a vector or a column over some field, then the code forms a 2D array and such codes are especially widely used in storage systems. In this paper, we address the following question: given the length of the column $l$, number of parities $r$, can we construct high-rate MDS array codes with optimal repair bandwidth of $1/r$, whose code length is as long as possible? In this paper, we give code constructions such that the code length is $(r+1)\log_r l$.
△ Less
Submitted 23 November, 2014;
originally announced November 2014.
-
Asymmetric Error Correction and Flash-Memory Rewriting using Polar Codes
Authors:
Eyal En Gad,
Yue Li,
Joerg Kliewer,
Michael Langberg,
Anxiao Jiang,
Jehoshua Bruck
Abstract:
We propose efficient coding schemes for two communication settings: 1. asymmetric channels, and 2. channels with an informed encoder. These settings are important in non-volatile memories, as well as optical and broadcast communication. The schemes are based on non-linear polar codes, and they build on and improve recent work on these settings. In asymmetric channels, we tackle the exponential sto…
▽ More
We propose efficient coding schemes for two communication settings: 1. asymmetric channels, and 2. channels with an informed encoder. These settings are important in non-volatile memories, as well as optical and broadcast communication. The schemes are based on non-linear polar codes, and they build on and improve recent work on these settings. In asymmetric channels, we tackle the exponential storage requirement of previously known schemes, that resulted from the use of large Boolean functions. We propose an improved scheme, that achieves the capacity of asymmetric channels with polynomial computational complexity and storage requirement.
The proposed non-linear scheme is then generalized to the setting of channel coding with an informed encoder, using a multicoding technique. We consider specific instances of the scheme for flash memories, that incorporate error-correction capabilities together with rewriting. Since the considered codes are non-linear, they eliminate the requirement of previously known schemes (called polar write-once-memory codes) for shared randomness between the encoder and the decoder. Finally, we mention that the multicoding scheme is also useful for broadcast communication in Marton's region, improving upon previous schemes for this setting.
△ Less
Submitted 28 December, 2015; v1 submitted 13 October, 2014;
originally announced October 2014.
-
The Capacity of String-Replication Systems
Authors:
Farzad Farnoud,
Moshe Schwartz,
Jehoshua Bruck
Abstract:
It is known that the majority of the human genome consists of repeated sequences. Furthermore, it is believed that a significant part of the rest of the genome also originated from repeated sequences and has mutated to its current form. In this paper, we investigate the possibility of constructing an exponentially large number of sequences from a short initial sequence and simple replication rules…
▽ More
It is known that the majority of the human genome consists of repeated sequences. Furthermore, it is believed that a significant part of the rest of the genome also originated from repeated sequences and has mutated to its current form. In this paper, we investigate the possibility of constructing an exponentially large number of sequences from a short initial sequence and simple replication rules, including those resembling genomic replication processes. In other words, our goal is to find out the capacity, or the expressive power, of these string-replication systems. Our results include exact capacities, and bounds on the capacities, of four fundamental string-replication systems.
△ Less
Submitted 18 January, 2014;
originally announced January 2014.
-
Rate-Distortion for Ranking with Incomplete Information
Authors:
Farzad Farnoud,
Moshe Schwartz,
Jehoshua Bruck
Abstract:
We study the rate-distortion relationship in the set of permutations endowed with the Kendall Tau metric and the Chebyshev metric. Our study is motivated by the application of permutation rate-distortion to the average-case and worst-case analysis of algorithms for ranking with incomplete information and approximate sorting algorithms. For the Kendall Tau metric we provide bounds for small, medium…
▽ More
We study the rate-distortion relationship in the set of permutations endowed with the Kendall Tau metric and the Chebyshev metric. Our study is motivated by the application of permutation rate-distortion to the average-case and worst-case analysis of algorithms for ranking with incomplete information and approximate sorting algorithms. For the Kendall Tau metric we provide bounds for small, medium, and large distortion regimes, while for the Chebyshev metric we present bounds that are valid for all distortions and are especially accurate for small distortions. In addition, for the Chebyshev metric, we provide a construction for covering codes.
△ Less
Submitted 14 January, 2014;
originally announced January 2014.
-
Rank-Modulation Rewrite Coding for Flash Memories
Authors:
Eyal En Gad,
Eitan Yaakobi,
Anxiao,
Jiang,
Jehoshua Bruck
Abstract:
The current flash memory technology focuses on the cost minimization of its static storage capacity. However, the resulting approach supports a relatively small number of program-erase cycles. This technology is effective for consumer devices (e.g., smartphones and cameras) where the number of program-erase cycles is small. However, it is not economical for enterprise storage systems that require…
▽ More
The current flash memory technology focuses on the cost minimization of its static storage capacity. However, the resulting approach supports a relatively small number of program-erase cycles. This technology is effective for consumer devices (e.g., smartphones and cameras) where the number of program-erase cycles is small. However, it is not economical for enterprise storage systems that require a large number of lifetime writes. The proposed approach in this paper for alleviating this problem consists of the efficient integration of two key ideas: (i) improving reliability and endurance by representing the information using relative values via the rank modulation scheme and (ii) increasing the overall (lifetime) capacity of the flash device via rewriting codes, namely, performing multiple writes per cell before erasure. This paper presents a new coding scheme that combines rank modulation with rewriting. The key benefits of the new scheme include: (i) the ability to store close to 2 bits per cell on each write with minimal impact on the lifetime of the memory, and (ii) efficient encoding and decoding algorithms that make use of capacity-achieving write-once-memory (WOM) codes that were proposed recently.
△ Less
Submitted 30 December, 2014; v1 submitted 3 December, 2013;
originally announced December 2013.
-
Systematic Codes for Rank Modulation
Authors:
Sarit Buzaglo,
Eitan Yaakobi,
Tuvi Etzion,
Jehoshua Bruck
Abstract:
The goal of this paper is to construct systematic error-correcting codes for permutations and multi-permutations in the Kendall's $τ$-metric. These codes are important in new applications such as rank modulation for flash memories. The construction is based on error-correcting codes for multi-permutations and a partition of the set of permutations into error-correcting codes. For a given large eno…
▽ More
The goal of this paper is to construct systematic error-correcting codes for permutations and multi-permutations in the Kendall's $τ$-metric. These codes are important in new applications such as rank modulation for flash memories. The construction is based on error-correcting codes for multi-permutations and a partition of the set of permutations into error-correcting codes. For a given large enough number of information symbols $k$, and for any integer $t$, we present a construction for ${(k+r,k)}$ systematic $t$-error-correcting codes, for permutations from $S_{k+r}$, with less redundancy symbols than the number of redundancy symbols in the codes of the known constructions. In particular, for a given $t$ and for sufficiently large $k$ we can obtain $r=t+1$. The same construction is also applied to obtain related systematic error-correcting codes for multi-permutations.
△ Less
Submitted 20 April, 2014; v1 submitted 27 November, 2013;
originally announced November 2013.
-
Systematic Error-Correcting Codes for Rank Modulation
Authors:
Hongchao Zhou,
Moshe Schwartz,
Anxiao Jiang,
Jehoshua Bruck
Abstract:
The rank-modulation scheme has been recently proposed for efficiently storing data in nonvolatile memories. Error-correcting codes are essential for rank modulation, however, existing results have been limited. In this work we explore a new approach, \emph{systematic error-correcting codes for rank modulation}. Systematic codes have the benefits of enabling efficient information retrieval and pote…
▽ More
The rank-modulation scheme has been recently proposed for efficiently storing data in nonvolatile memories. Error-correcting codes are essential for rank modulation, however, existing results have been limited. In this work we explore a new approach, \emph{systematic error-correcting codes for rank modulation}. Systematic codes have the benefits of enabling efficient information retrieval and potentially supporting more efficient encoding and decoding procedures. We study systematic codes for rank modulation under Kendall's $τ$-metric as well as under the $\ell_\infty$-metric.
In Kendall's $τ$-metric we present $[k+2,k,3]$-systematic codes for correcting one error, which have optimal rates, unless systematic perfect codes exist. We also study the design of multi-error-correcting codes, and provide two explicit constructions, one resulting in $[n+1,k+1,2t+2]$ systematic codes with redundancy at most $2t+1$. We use non-constructive arguments to show the existence of $[n,k,n-k]$-systematic codes for general parameters. Furthermore, we prove that for rank modulation, systematic codes achieve the same capacity as general error-correcting codes.
Finally, in the $\ell_\infty$-metric we construct two $[n,k,d]$ systematic multi-error-correcting codes, the first for the case of $d=O(1)$, and the second for $d=Θ(n)$. In the latter case, the codes have the same asymptotic rate as the best codes currently known in this metric.
△ Less
Submitted 25 October, 2013;
originally announced October 2013.
-
Access vs. Bandwidth in Codes for Storage
Authors:
Itzhak Tamo,
Zhiying Wang,
Jehoshua Bruck
Abstract:
Maximum distance separable (MDS) codes are widely used in storage systems to protect against disk (node) failures. A node is said to have capacity $l$ over some field $\mathbb{F}$, if it can store that amount of symbols of the field. An $(n,k,l)$ MDS code uses $n$ nodes of capacity $l$ to store $k$ information nodes. The MDS property guarantees the resiliency to any $n-k$ node failures. An \emph{o…
▽ More
Maximum distance separable (MDS) codes are widely used in storage systems to protect against disk (node) failures. A node is said to have capacity $l$ over some field $\mathbb{F}$, if it can store that amount of symbols of the field. An $(n,k,l)$ MDS code uses $n$ nodes of capacity $l$ to store $k$ information nodes. The MDS property guarantees the resiliency to any $n-k$ node failures. An \emph{optimal bandwidth} (resp. \emph{optimal access}) MDS code communicates (resp. accesses) the minimum amount of data during the repair process of a single failed node. It was shown that this amount equals a fraction of $1/(n-k)$ of data stored in each node. In previous optimal bandwidth constructions, $l$ scaled polynomially with $k$ in codes with asymptotic rate $<1$. Moreover, in constructions with a constant number of parities, i.e. rate approaches 1, $l$ is scaled exponentially w.r.t. $k$. In this paper, we focus on the later case of constant number of parities $n-k=r$, and ask the following question: Given the capacity of a node $l$ what is the largest number of information disks $k$ in an optimal bandwidth (resp. access) $(k+r,k,l)$ MDS code. We give an upper bound for the general case, and two tight bounds in the special cases of two important families of codes. Moreover, the bounds show that in some cases optimal-bandwidth code has larger $k$ than optimal-access code, and therefore these two measures are not equivalent.
△ Less
Submitted 14 March, 2013;
originally announced March 2013.
-
Balanced Modulation for Nonvolatile Memories
Authors:
Hongchao Zhou,
Anxiao,
Jiang,
Jehoshua Bruck
Abstract:
This paper presents a practical writing/reading scheme in nonvolatile memories, called balanced modulation, for minimizing the asymmetric component of errors. The main idea is to encode data using a balanced error-correcting code. When reading information from a block, it adjusts the reading threshold such that the resulting word is also balanced or approximately balanced. Balanced modulation has…
▽ More
This paper presents a practical writing/reading scheme in nonvolatile memories, called balanced modulation, for minimizing the asymmetric component of errors. The main idea is to encode data using a balanced error-correcting code. When reading information from a block, it adjusts the reading threshold such that the resulting word is also balanced or approximately balanced. Balanced modulation has suboptimal performance for any cell-level distribution and it can be easily implemented in the current systems of nonvolatile memories. Furthermore, we studied the construction of balanced error-correcting codes, in particular, balanced LDPC codes. It has very efficient encoding and decoding algorithms, and it is more efficient than prior construction of balanced error-correcting codes.
△ Less
Submitted 4 September, 2012;
originally announced September 2012.
-
Nonuniform Codes for Correcting Asymmetric Errors in Data Storage
Authors:
Hongchao Zhou,
Anxiao,
Jiang,
Jehoshua Bruck
Abstract:
The construction of asymmetric error correcting codes is a topic that was studied extensively, however, the existing approach for code construction assumes that every codeword should tolerate $t$ asymmetric errors. Our main observation is that in contrast to symmetric errors, asymmetric errors are content dependent. For example, in Z-channels, the all-1 codeword is prone to have more errors than t…
▽ More
The construction of asymmetric error correcting codes is a topic that was studied extensively, however, the existing approach for code construction assumes that every codeword should tolerate $t$ asymmetric errors. Our main observation is that in contrast to symmetric errors, asymmetric errors are content dependent. For example, in Z-channels, the all-1 codeword is prone to have more errors than the all-0 codeword. This motivates us to develop nonuniform codes whose codewords can tolerate different numbers of asymmetric errors depending on their Hamming weights. The idea in a nonuniform codes' construction is to augment the redundancy in a content-dependent way and guarantee the worst case reliability while maximizing the code size. In this paper, we first study nonuniform codes for Z-channels, namely, they only suffer one type of errors, say 1 to 0. Specifically, we derive their upper bounds, analyze their asymptotic performances, and introduce two general constructions. Then we extend the concept and results of nonuniform codes to general binary asymmetric channels, where the error probability for each bit from 0 to 1 is smaller than that from 1 to 0.
△ Less
Submitted 4 September, 2012;
originally announced September 2012.
-
Efficiently Extracting Randomness from Imperfect Stochastic Processes
Authors:
Hongchao Zhou,
Jehoshua Bruck
Abstract:
We study the problem of extracting a prescribed number of random bits by reading the smallest possible number of symbols from non-ideal stochastic processes. The related interval algorithm proposed by Han and Hoshi has asymptotically optimal performance; however, it assumes that the distribution of the input stochastic process is known. The motivation for our work is the fact that, in practice, so…
▽ More
We study the problem of extracting a prescribed number of random bits by reading the smallest possible number of symbols from non-ideal stochastic processes. The related interval algorithm proposed by Han and Hoshi has asymptotically optimal performance; however, it assumes that the distribution of the input stochastic process is known. The motivation for our work is the fact that, in practice, sources of randomness have inherent correlations and are affected by measurement's noise. Namely, it is hard to obtain an accurate estimation of the distribution. This challenge was addressed by the concepts of seeded and seedless extractors that can handle general random sources with unknown distributions. However, known seeded and seedless extractors provide extraction efficiencies that are substantially smaller than Shannon's entropy limit. Our main contribution is the design of extractors that have a variable input-length and a fixed output length, are efficient in the consumption of symbols from the source, are capable of generating random bits from general stochastic processes and approach the information theoretic upper bound on efficiency.
△ Less
Submitted 4 September, 2012;
originally announced September 2012.
-
Linear Transformations for Randomness Extraction
Authors:
Hongchao Zhou,
Jehoshua Bruck
Abstract:
Information-efficient approaches for extracting randomness from imperfect sources have been extensively studied, but simpler and faster ones are required in the high-speed applications of random number generation. In this paper, we focus on linear constructions, namely, applying linear transformation for randomness extraction. We show that linear transformations based on sparse random matrices are…
▽ More
Information-efficient approaches for extracting randomness from imperfect sources have been extensively studied, but simpler and faster ones are required in the high-speed applications of random number generation. In this paper, we focus on linear constructions, namely, applying linear transformation for randomness extraction. We show that linear transformations based on sparse random matrices are asymptotically optimal to extract randomness from independent sources and bit-fixing sources, and they are efficient (may not be optimal) to extract randomness from hidden Markov sources. Further study demonstrates the flexibility of such constructions on source models as well as their excellent information-preserving capabilities. Since linear transformations based on sparse random matrices are computationally fast and can be easy to implement using hardware like FPGAs, they are very attractive in the high-speed applications. In addition, we explore explicit constructions of transformation matrices. We show that the generator matrices of primitive BCH codes are good choices, but linear transformations based on such matrices require more computational time due to their high densities.
△ Less
Submitted 4 September, 2012;
originally announced September 2012.
-
Streaming Algorithms for Optimal Generation of Random Bits
Authors:
Hongchao Zhou,
Jehoshua Bruck
Abstract:
Generating random bits from a source of biased coins (the biased is unknown) is a classical question that was originally studied by von Neumann. There are a number of known algorithms that have asymptotically optimal information efficiency, namely, the expected number of generated random bits per input bit is asymptotically close to the entropy of the source. However, only the original von Neumann…
▽ More
Generating random bits from a source of biased coins (the biased is unknown) is a classical question that was originally studied by von Neumann. There are a number of known algorithms that have asymptotically optimal information efficiency, namely, the expected number of generated random bits per input bit is asymptotically close to the entropy of the source. However, only the original von Neumann algorithm has a `streaming property' - it operates on a single input bit at a time and it generates random bits when possible, alas, it does not have an optimal information efficiency.
The main contribution of this paper is an algorithm that generates random bit streams from biased coins, uses bounded space and runs in expected linear time. As the size of the allotted space increases, the algorithm approaches the information-theoretic upper bound on efficiency. In addition, we discuss how to extend this algorithm to generate random bit streams from m-sided dice or correlated sources such as Markov chains.
△ Less
Submitted 4 September, 2012;
originally announced September 2012.
-
A Universal Scheme for Transforming Binary Algorithms to Generate Random Bits from Loaded Dice
Authors:
Hongchao Zhou,
Jehoshua Bruck
Abstract:
In this paper, we present a universal scheme for transforming an arbitrary algorithm for biased 2-face coins to generate random bits from the general source of an m-sided die, hence enabling the application of existing algorithms to general sources. In addition, we study approaches of efficiently generating a prescribed number of random bits from an arbitrary biased coin. This contrasts with most…
▽ More
In this paper, we present a universal scheme for transforming an arbitrary algorithm for biased 2-face coins to generate random bits from the general source of an m-sided die, hence enabling the application of existing algorithms to general sources. In addition, we study approaches of efficiently generating a prescribed number of random bits from an arbitrary biased coin. This contrasts with most existing works, which typically assume that the number of coin tosses is fixed, and they generate a variable number of random bits.
△ Less
Submitted 4 September, 2012;
originally announced September 2012.
-
Synthesis of Stochastic Flow Networks
Authors:
Hongchao Zhou,
Ho-Lin Chen,
Jehoshua Bruck
Abstract:
A stochastic flow network is a directed graph with incoming edges (inputs) and outgoing edges (outputs), tokens enter through the input edges, travel stochastically in the network, and can exit the network through the output edges. Each node in the network is a splitter, namely, a token can enter a node through an incoming edge and exit on one of the output edges according to a predefined probabil…
▽ More
A stochastic flow network is a directed graph with incoming edges (inputs) and outgoing edges (outputs), tokens enter through the input edges, travel stochastically in the network, and can exit the network through the output edges. Each node in the network is a splitter, namely, a token can enter a node through an incoming edge and exit on one of the output edges according to a predefined probability distribution. Stochastic flow networks can be easily implemented by DNA-based chemical reactions, with promising applications in molecular computing and stochastic computing. In this paper, we address a fundamental synthesis question: Given a finite set of possible splitters and an arbitrary rational probability distribution, design a stochastic flow network, such that every token that enters the input edge will exit the outputs with the prescribed probability distribution.
The problem of probability transformation dates back to von Neumann's 1951 work and was followed, among others, by Knuth and Yao in 1976. Most existing works have been focusing on the "simulation" of target distributions. In this paper, we design optimal-sized stochastic flow networks for "synthesizing" target distributions. It shows that when each splitter has two outgoing edges and is unbiased, an arbitrary rational probability \frac{a}{b} with a\leq b\leq 2^n can be realized by a stochastic flow network of size n that is optimal. Compared to the other stochastic systems, feedback (cycles in networks) strongly improves the expressibility of stochastic flow networks.
△ Less
Submitted 4 September, 2012;
originally announced September 2012.
-
The Synthesis and Analysis of Stochastic Switching Circuits
Authors:
Hongchao Zhou,
Po-Ling Loh,
Jehoshua Bruck
Abstract:
Stochastic switching circuits are relay circuits that consist of stochastic switches called pswitches. The study of stochastic switching circuits has widespread applications in many fields of computer science, neuroscience, and biochemistry. In this paper, we discuss several properties of stochastic switching circuits, including robustness, expressibility, and probability approximation.
First, w…
▽ More
Stochastic switching circuits are relay circuits that consist of stochastic switches called pswitches. The study of stochastic switching circuits has widespread applications in many fields of computer science, neuroscience, and biochemistry. In this paper, we discuss several properties of stochastic switching circuits, including robustness, expressibility, and probability approximation.
First, we study the robustness, namely, the effect caused by introducing an error of size εto each pswitch in a stochastic circuit. We analyze two constructions and prove that simple series-parallel circuits are robust to small error perturbations, while general series-parallel circuits are not. Specifically, the total error introduced by perturbations of size less than εis bounded by a constant multiple of εin a simple series-parallel circuit, independent of the size of the circuit.
Next, we study the expressibility of stochastic switching circuits: Given an integer q and a pswitch set S=\{\frac{1}{q},\frac{2}{q},...,\frac{q-1}{q}\}, can we synthesize any rational probability with denominator q^n (for arbitrary n) with a simple series-parallel stochastic switching circuit? We generalize previous results and prove that when q is a multiple of 2 or 3, the answer is yes. We also show that when q is a prime number larger than 3, the answer is no.
Probability approximation is studied for a general case of an arbitrary pswitch set S=\{s_1,s_2,...,s_{|S|}\}. In this case, we propose an algorithm based on local optimization to approximate any desired probability. The analysis reveals that the approximation error of a switching circuit decreases exponentially with an increasing circuit size.
△ Less
Submitted 4 September, 2012;
originally announced September 2012.
-
Zigzag Codes: MDS Array Codes with Optimal Rebuilding
Authors:
Itzhak Tamo,
Zhiying Wang,
Jehoshua Bruck
Abstract:
MDS array codes are widely used in storage systems to protect data against erasures. We address the \emph{rebuilding ratio} problem, namely, in the case of erasures, what is the fraction of the remaining information that needs to be accessed in order to rebuild \emph{exactly} the lost information? It is clear that when the number of erasures equals the maximum number of erasures that an MDS code c…
▽ More
MDS array codes are widely used in storage systems to protect data against erasures. We address the \emph{rebuilding ratio} problem, namely, in the case of erasures, what is the fraction of the remaining information that needs to be accessed in order to rebuild \emph{exactly} the lost information? It is clear that when the number of erasures equals the maximum number of erasures that an MDS code can correct then the rebuilding ratio is 1 (access all the remaining information). However, the interesting and more practical case is when the number of erasures is smaller than the erasure correcting capability of the code. For example, consider an MDS code that can correct two erasures: What is the smallest amount of information that one needs to access in order to correct a single erasure? Previous work showed that the rebuilding ratio is bounded between 1/2 and 3/4, however, the exact value was left as an open problem. In this paper, we solve this open problem and prove that for the case of a single erasure with a 2-erasure correcting code, the rebuilding ratio is 1/2. In general, we construct a new family of $r$-erasure correcting MDS array codes that has optimal rebuilding ratio of $\frac{e}{r}$ in the case of $e$ erasures, $1 \le e \le r$. Our array codes have efficient encoding and decoding algorithms (for the case $r=2$ they use a finite field of size 3) and an optimal update property.
△ Less
Submitted 1 December, 2011;
originally announced December 2011.
-
Compressed Encoding for Rank Modulation
Authors:
Eyal En Gad,
Anxiao,
Jiang,
Jehoshua Bruck
Abstract:
Rank modulation has been recently proposed as a scheme for storing information in flash memories. While rank modulation has advantages in improving write speed and endurance, the current encoding approach is based on the "push to the top" operation that is not efficient in the general case. We propose a new encoding procedure where a cell level is raised to be higher than the minimal necessary sub…
▽ More
Rank modulation has been recently proposed as a scheme for storing information in flash memories. While rank modulation has advantages in improving write speed and endurance, the current encoding approach is based on the "push to the top" operation that is not efficient in the general case. We propose a new encoding procedure where a cell level is raised to be higher than the minimal necessary subset - instead of all - of the other cell levels. This new procedure leads to a significantly more compressed (lower charge levels) encoding. We derive an upper bound for a family of codes that utilize the proposed encoding procedure, and consider code constructions that achieve that bound for several special cases.
△ Less
Submitted 12 August, 2011;
originally announced August 2011.
-
On Codes for Optimal Rebuilding Access
Authors:
Zhiying Wang,
Itzhak Tamo,
Jehoshua Bruck
Abstract:
MDS (maximum distance separable) array codes are widely used in storage systems due to their computationally efficient encoding and decoding procedures. An MDS code with r redundancy nodes can correct any r erasures by accessing (reading) all the remaining information in both the systematic nodes and the parity (redundancy) nodes. However, in practice, a single erasure is the most likely failure e…
▽ More
MDS (maximum distance separable) array codes are widely used in storage systems due to their computationally efficient encoding and decoding procedures. An MDS code with r redundancy nodes can correct any r erasures by accessing (reading) all the remaining information in both the systematic nodes and the parity (redundancy) nodes. However, in practice, a single erasure is the most likely failure event; hence, a natural question is how much information do we need to access in order to rebuild a single storage node? We define the rebuilding ratio as the fraction of remaining information accessed during the rebuilding of a single erasure. In our previous work we showed that the optimal rebuilding ratio of 1/r is achievable (using our newly constructed array codes) for the rebuilding of any systematic node, however, all the information needs to be accessed for the rebuilding of the parity nodes. Namely, constructing array codes with a rebuilding ratio of 1/r was left as an open problem. In this paper, we solve this open problem and present array codes that achieve the lower bound of 1/r for rebuilding any single systematic or parity node.
△ Less
Submitted 8 July, 2011;
originally announced July 2011.
-
MDS Array Codes with Optimal Rebuilding
Authors:
Itzhak Tamo,
Zhiying Wang,
Jehoshua Bruck
Abstract:
MDS array codes are widely used in storage systems to protect data against erasures. We address the \emph{rebuilding ratio} problem, namely, in the case of erasures, what is the the fraction of the remaining information that needs to be accessed in order to rebuild \emph{exactly} the lost information? It is clear that when the number of erasures equals the maximum number of erasures that an MDS co…
▽ More
MDS array codes are widely used in storage systems to protect data against erasures. We address the \emph{rebuilding ratio} problem, namely, in the case of erasures, what is the the fraction of the remaining information that needs to be accessed in order to rebuild \emph{exactly} the lost information? It is clear that when the number of erasures equals the maximum number of erasures that an MDS code can correct then the rebuilding ratio is 1 (access all the remaining information). However, the interesting (and more practical) case is when the number of erasures is smaller than the erasure correcting capability of the code. For example, consider an MDS code that can correct two erasures: What is the smallest amount of information that one needs to access in order to correct a single erasure? Previous work showed that the rebuilding ratio is bounded between 1/2 and 3/4, however, the exact value was left as an open problem. In this paper, we solve this open problem and prove that for the case of a single erasure with a 2-erasure correcting code, the rebuilding ratio is 1/2. In general, we construct a new family of $r$-erasure correcting MDS array codes that has optimal rebuilding ratio of $\frac{1}{r}$ in the case of a single erasure. Our array codes have efficient encoding and decoding algorithms (for the case $r=2$ they use a finite field of size 3) and an optimal update property.
△ Less
Submitted 18 March, 2011;
originally announced March 2011.
-
Generalized Gray Codes for Local Rank Modulation
Authors:
Eyal En Gad,
Michael Langberg,
Moshe Schwartz,
Jehoshua Bruck
Abstract:
We consider the local rank-modulation scheme in which a sliding window going over a sequence of real-valued variables induces a sequence of permutations. Local rank-modulation is a generalization of the rank-modulation scheme, which has been recently suggested as a way of storing information in flash memory. We study Gray codes for the local rank-modulation scheme in order to simulate conventional…
▽ More
We consider the local rank-modulation scheme in which a sliding window going over a sequence of real-valued variables induces a sequence of permutations. Local rank-modulation is a generalization of the rank-modulation scheme, which has been recently suggested as a way of storing information in flash memory. We study Gray codes for the local rank-modulation scheme in order to simulate conventional multi-level flash cells while retaining the benefits of rank modulation. Unlike the limited scope of previous works, we consider code constructions for the entire range of parameters including the code length, sliding window size, and overlap between adjacent windows. We show our constructed codes have asymptotically-optimal rate. We also provide efficient encoding, decoding, and next-state algorithms.
△ Less
Submitted 1 March, 2011;
originally announced March 2011.
-
Generating Probability Distributions using Multivalued Stochastic Relay Circuits
Authors:
David Lee,
Jehoshua Bruck
Abstract:
The problem of random number generation dates back to von Neumann's work in 1951. Since then, many algorithms have been developed for generating unbiased bits from complex correlated sources as well as for generating arbitrary distributions from unbiased bits. An equally interesting, but less studied aspect is the structural component of random number generation as opposed to the algorithmic aspec…
▽ More
The problem of random number generation dates back to von Neumann's work in 1951. Since then, many algorithms have been developed for generating unbiased bits from complex correlated sources as well as for generating arbitrary distributions from unbiased bits. An equally interesting, but less studied aspect is the structural component of random number generation as opposed to the algorithmic aspect. That is, given a network structure imposed by nature or physical devices, how can we build networks that generate arbitrary probability distributions in an optimal way? In this paper, we study the generation of arbitrary probability distributions in multivalued relay circuits, a generalization in which relays can take on any of N states and the logical 'and' and 'or' are replaced with 'min' and 'max' respectively. Previous work was done on two-state relays. We generalize these results, describing a duality property and networks that generate arbitrary rational probability distributions. We prove that these networks are robust to errors and design a universal probability generator which takes input bits and outputs arbitrary binary probability distributions.
△ Less
Submitted 7 February, 2011;
originally announced February 2011.
-
Trajectory Codes for Flash Memory
Authors:
Anxiao,
Jiang,
Michael Langberg,
Moshe Schwartz,
Jehoshua Bruck
Abstract:
Flash memory is well-known for its inherent asymmetry: the flash-cell charge levels are easy to increase but are hard to decrease. In a general rewriting model, the stored data changes its value with certain patterns. The patterns of data updates are determined by the data structure and the application, and are independent of the constraints imposed by the storage medium. Thus, an appropriate codi…
▽ More
Flash memory is well-known for its inherent asymmetry: the flash-cell charge levels are easy to increase but are hard to decrease. In a general rewriting model, the stored data changes its value with certain patterns. The patterns of data updates are determined by the data structure and the application, and are independent of the constraints imposed by the storage medium. Thus, an appropriate coding scheme is needed so that the data changes can be updated and stored efficiently under the storage-medium's constraints.
In this paper, we define the general rewriting problem using a graph model. It extends many known rewriting models such as floating codes, WOM codes, buffer codes, etc. We present a new rewriting scheme for flash memories, called the trajectory code, for rewriting the stored data as many times as possible without block erasures. We prove that the trajectory code is asymptotically optimal in a wide range of scenarios.
We also present randomized rewriting codes optimized for expected performance (given arbitrary rewriting sequences). Our rewriting codes are shown to be asymptotically optimal.
△ Less
Submitted 24 December, 2010;
originally announced December 2010.