-
Making it to First: The Random Access Problem in DNA Storage
Authors:
Avital Boruchovsky,
Ohad Elishco,
Ryan Gabrys,
Anina Gruica,
Itzhak Tamo,
Eitan Yaakobi
Abstract:
We study the Random Access Problem in DNA storage, which addresses the challenge of retrieving a specific information strand from a DNA-based storage system. In this setting, $k$ information strands, representing the data, are encoded into $n$ strands using a code. The goal under this paradigm is to identify and analyze codes that minimize the expected number of reads required to retrieve any of the $k$ information strands, where each read returns one of the $n$ encoded strands uniformly at random. We fully solve the case when $k=2$, showing that the best possible code attains a random access expectation of $0.914 \cdot 2$. Moreover, we generalize a construction from \cite{GMZ24}, specific to $k=3$, to any value of $k$. Our construction uses $B_{k-1}$ sequences over $\mathbb{Z}_{q-1}$, which always exist over large finite fields. For $k=4$, we show that this generalized construction outperforms all previous constructions in terms of reducing the random access expectation.
Submitted 21 January, 2025;
originally announced January 2025.
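As a rough illustration of the random access expectation discussed in the abstract above, the following sketch computes the exact expected number of uniform reads needed before a target information strand can be recovered. It assumes a binary linear code whose encoded strands are GF(2) combinations of the information strands, written as bitmasks; the function names `spans` and `expected_reads` and the two toy codes are hypothetical and are not taken from the paper.

```python
from fractions import Fraction
from functools import lru_cache


def spans(vectors, target):
    """Return True if target is a GF(2) combination of vectors (bitmask ints)."""
    reachable = {0}
    for v in vectors:
        # Adding v doubles the set of reachable combinations (at most).
        reachable |= {r ^ v for r in reachable}
    return target in reachable


def expected_reads(columns, target):
    """Exact expected number of uniform reads until target is recoverable.

    columns: tuple of bitmasks, one per encoded strand (assumed binary code).
    target:  bitmask of the information strand we want to retrieve.
    """
    n = len(columns)
    if not spans(columns, target):
        raise ValueError("target is not recoverable from this code")

    @lru_cache(maxsize=None)
    def remaining(seen):
        # seen is the frozenset of distinct strand indices read so far.
        if spans([columns[j] for j in seen], target):
            return Fraction(0)
        unseen = [j for j in range(n) if j not in seen]
        # Re-reading a seen strand leaves the state unchanged, so
        # E(seen) = 1 + (|seen|/n) E(seen) + (1/n) * sum of E(seen + {j}).
        total = sum(remaining(seen | frozenset([j])) for j in unseen)
        return (n + total) / len(unseen)

    return remaining(frozenset())


# Toy codes for k = 2: bit 0 is strand x1, bit 1 is strand x2.
uncoded = (0b01, 0b10)          # store x1 and x2 only
parity = (0b01, 0b10, 0b11)     # store x1, x2 and x1 + x2
print(expected_reads(uncoded, 0b01), expected_reads(parity, 0b01))
```

Both toy codes give an expectation of exactly 2 for $k=2$, so neither beats the uncoded baseline; the $0.914 \cdot 2$ result in the abstract relies on constructions that this sketch does not attempt to reproduce.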
-
Codes Correcting Two Bursts of Exactly $b$ Deletions
Authors:
Zuo Ye,
Yubo Sun,
Wenjun Yu,
Gennian Ge,
Ohad Elishco
Abstract:
In this paper, we investigate codes designed to correct two bursts of deletions, where each burst has length exactly $b$ and $b>1$. The previous best construction, achieved through the syndrome compression technique, had a redundancy of at most $7\log n+O\left(\log n/\log\log n\right)$ bits. In contrast, our work introduces a novel approach for constructing $q$-ary codes that attain a redundancy of at most $5\log n+O(\log\log n)$ bits for all $b>1$ and $q\ge2$. Additionally, for the case where $b=1$, we present a new construction of $q$-ary two-deletion correcting codes with a redundancy of $5\log n+O(\log\log n)$ bits, for all $q>2$.
Submitted 8 September, 2024; v1 submitted 6 August, 2024;
originally announced August 2024.
-
On the Long-Term behavior of $k$-tuples Frequencies in Mutation Systems
Authors:
Ohad Elishco
Abstract:
In response to the evolving landscape of data storage, researchers have increasingly explored non-traditional platforms, with DNA-based storage emerging as a cutting-edge solution. Our work is motivated by the potential of in-vivo DNA storage, known for its capacity to store vast amounts of information efficiently and confidentially within an organism's native DNA. While promising, in-vivo DNA storage faces challenges, including susceptibility to errors introduced by mutations. To understand the long-term behavior of such mutation systems, we investigate the frequency of $k$-tuples after multiple mutation applications.
Drawing inspiration from related works, we generalize results from the study of mutation systems, particularly focusing on the frequency of $k$-tuples. In this work, we provide a broad analysis through the construction of a specialized matrix and the identification of its eigenvectors. In the context of substitution and duplication systems, we leverage previous results on almost sure convergence, equating the expected frequency to the limiting frequency. Moreover, we demonstrate convergence in probability under certain assumptions.
Submitted 8 January, 2024;
originally announced January 2024.
-
Bounds and Constructions for Generalized Batch Codes
Authors:
Xiangliang Kong,
Ohad Elishco
Abstract:
Private information retrieval (PIR) codes and batch codes are two important types of codes that are designed for coded distributed storage systems and private information retrieval protocols. These codes have been the focus of much attention in recent years, as they enable efficient and secure storage and retrieval of data in distributed systems.
In this paper, we introduce a new class of codes called \emph{$(s,t)$-batch codes}. These codes are a type of storage code that can handle any multi-set of $t$ requests, comprised of $s$ distinct information symbols. Importantly, PIR codes and batch codes are special cases of $(s,t)$-batch codes.
The main goal of this paper is to explore the relationship between the number of redundancy symbols and the $(s,t)$-batch code property. Specifically, we establish a lower bound on the number of redundancy symbols required and present several constructions of $(s,t)$-batch codes. Furthermore, we extend this property to the case where each request is a linear combination of information symbols, which we refer to as \emph{functional $(s,t)$-batch codes}. In particular, we demonstrate that simplex codes are asymptotically optimal functional $(s,t)$-batch codes, in terms of the number of redundancy symbols required, under certain parameter regimes.
Submitted 13 September, 2023;
originally announced September 2023.
-
Storage codes and recoverable systems on lines and grids
Authors:
Alexander Barg,
Ohad Elishco,
Ryan Gabrys,
Geyang Wang,
Eitan Yaakobi
Abstract:
A storage code is an assignment of symbols to the vertices of a connected graph $G(V,E)$ with the property that the value of each vertex is a function of the values of its neighbors, or more generally, of a certain neighborhood of the vertex in $G$. In this work we introduce a new construction method of storage codes, enabling one to construct new codes from known ones via an interleaving procedure driven by resolvable designs. We also study storage codes on $\mathbb Z$ and ${\mathbb Z}^2$ (lines and grids), finding closed-form expressions for the capacity of several one- and two-dimensional systems depending on their recovery set, using connections between storage codes, graphs, anticodes, and difference-avoiding sets.
Submitted 28 August, 2023;
originally announced August 2023.
-
Codes Over Absorption Channels
Authors:
Zuo Ye,
Ohad Elishco
Abstract:
In this paper, we present a novel communication channel, called the absorption channel, inspired by information transmission in neurons. Our motivation comes from in-vivo nano-machines, emerging medical applications, and brain-machine interfaces that communicate over the nervous system. Another motivation comes from viewing our model as a specific deletion channel, which may provide a new perspective and ideas for studying the general deletion channel.
For any given finite alphabet, we give codes that can correct absorption errors. For the binary alphabet, the problem is relatively trivial and we can apply binary (multiple-) deletion correcting codes. For a single absorption error, we prove that the Varshamov-Tenengolts codes can provide a near-optimal code in our setting. When the alphabet size $q$ is at least $3$, we first construct a single-absorption correcting code whose redundancy is at most $3\log_q(n)+O(1)$. Then, based on this code and ideas introduced in \cite{Gabrys2022IT}, we give a second construction of single-absorption correcting codes with redundancy $\log_q(n)+12\log_q\log_q(n)+O(1)$, which is optimal up to an $O\left(\log_q\log_q(n)\right)$ term.
Finally, we apply the syndrome compression technique with pre-coding to obtain a subcode of the single-absorption correcting code. This subcode can combat multiple-absorption errors and has low redundancy. For each setup, efficient encoders and decoders are provided.
Submitted 20 February, 2023;
originally announced February 2023.
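The Varshamov-Tenengolts (VT) codes mentioned in the abstract above are classically defined as the binary words $x$ of length $n$ with $\sum_i i x_i \equiv a \pmod{n+1}$, and are known to correct a single deletion. The sketch below checks that property by brute force for a small length. It covers only the standard deletion setting, not the absorption channel, whose definition is not given here; the helper names are hypothetical.

```python
from itertools import product


def vt_code(n, a=0):
    """All binary words of length n whose weighted sum is a modulo n + 1."""
    return [
        x
        for x in product((0, 1), repeat=n)
        if sum(i * bit for i, bit in enumerate(x, start=1)) % (n + 1) == a
    ]


def deletion_ball(x):
    """All words obtained from x by deleting exactly one symbol."""
    return {x[:i] + x[i + 1:] for i in range(len(x))}


def corrects_single_deletion(code):
    """True if no two distinct codewords share a single-deletion output."""
    owner = {}
    for c in code:
        for y in deletion_ball(c):
            # setdefault records the first codeword that produced y.
            if owner.setdefault(y, c) != c:
                return False
    return True


print(corrects_single_deletion(vt_code(6)))
```

The check is exhaustive, so it is only practical for short lengths; it is meant to make the definition concrete rather than to serve as a decoder.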
-
Binary $t_1$-Deletion-$t_2$-Insertion-Burst Correcting Codes and Codes Correcting a Burst of Deletions
Authors:
Zuo Ye,
Ohad Elishco
Abstract:
We first give a construction of binary $t_1$-deletion-$t_2$-insertion-burst correcting codes with redundancy at most $\log(n)+(t_1-t_2-1)\log\log(n)+O(1)$, where $t_1\ge 2t_2$. Then we give an improved construction of binary codes capable of correcting a burst of $4$ non-consecutive deletions, whose redundancy is reduced from $7\log(n)+2\log\log(n)+O(1)$ to $4\log(n)+6\log\log(n)+O(1)$. Lastly, by connecting non-binary $b$-burst-deletion correcting codes with binary $2b$-deletion-$b$-insertion-burst correcting codes, we give a new construction of non-binary $b$-burst-deletion correcting codes with redundancy at most $\log(n)+(b-1)\log\log(n)+O(1)$. This construction is different from previous results.
Submitted 22 November, 2022; v1 submitted 21 November, 2022;
originally announced November 2022.
-
Reconstruction of a Single String from a Part of its Composition Multiset
Authors:
Zuo Ye,
Ohad Elishco
Abstract:
Motivated by applications in polymer-based data storage, we study the problem of reconstructing a string from part of its composition multiset. We give a full description of the structure of the strings that cannot be uniquely reconstructed (up to reversal) from the multiset of all their prefix-suffix compositions. Leveraging this description, we prove that for all $n\ge 6$, there exists a string of length $n$ that cannot be uniquely reconstructed up to reversal. Moreover, for all $n\ge 6$, we explicitly construct the set consisting of all length-$n$ strings that can be uniquely reconstructed up to reversal. As a byproduct, we obtain that any binary string can be constructed using Dyck strings and Catalan-Bertrand strings.
For any given string $\bm{s}$, we provide a method to explicitly construct the set of all strings with the same prefix-suffix composition multiset as $\bm{s}$, as well as a formula for the size of this set. As an application, we construct a composition code of maximal size. Furthermore, we construct two classes of composition codes which can respectively correct composition missing errors and mass reducing substitution errors.
In addition, we raise two new problems: reconstructing a string from its composition multiset when at most a constant number of substring compositions are lost, and reconstructing a string when given only the compositions of its substrings of length at most $r$. For each of these setups, we give suitable codes under some conditions.
Submitted 15 October, 2022; v1 submitted 31 August, 2022;
originally announced August 2022.
-
Optimal Reference for DNA Synthesis
Authors:
Ohad Elishco,
Wasim Huleihel
Abstract:
In recent years, DNA has emerged as a potentially viable storage technology. DNA synthesis, which refers to the task of writing the data into DNA, is perhaps the most costly part of existing storage systems. Accordingly, the high cost and low throughput limit the practical use of available DNA synthesis technologies. It has been found that the homopolymer run (i.e., the repetition of the same nucleotide) is a major factor affecting the synthesis and sequencing errors. Quite recently, [26] studied the role of batch optimization in reducing the cost of large-scale DNA synthesis, for a given pool $\mathcal{S}$ of random quaternary strings of fixed length. Among other things, it was shown that the asymptotic cost savings of batch optimization are significantly greater when the strings in $\mathcal{S}$ contain repeats of the same character (homopolymer run of length one), as compared to the case where strings are unconstrained.
Following the lead of [26], in this paper, we take a step forward towards the theoretical understanding of DNA synthesis, and study the homopolymer run of length $k\geq1$. Specifically, we are given a set of DNA strands $\mathcal{S}$, randomly drawn from a natural Markovian distribution modeling a general homopolymer run length constraint, that we wish to synthesize. For this problem, we prove that for any $k\geq 1$, the optimal reference strand, minimizing the cost of DNA synthesis, is, perhaps surprisingly, the periodic sequence $\overline{\mathsf{ACGT}}$. It turns out that tackling the homopolymer constraint of length $k\geq2$ is a challenging problem; our main technical contribution is the representation of the DNA synthesis process as a certain constrained system, for which string techniques can be applied.
Submitted 14 April, 2022;
originally announced April 2022.
-
Recoverable Systems
Authors:
Ohad Elishco,
Alexander Barg
Abstract:
Motivated by the established notion of storage codes, we consider sets of infinite sequences over a finite alphabet such that every $k$-tuple of consecutive entries is uniquely recoverable from its $l$-neighborhood in the sequence. We address the problem of finding the maximum growth rate of the set, which we term capacity, as well as constructions of explicit families that approach the optimal rate. The techniques that we employ rely on the connection of this problem with constrained systems. In the second part of the paper we consider a modification of the problem wherein the entries in the sequence are viewed as random variables over a finite alphabet that follow some joint distribution, and the recovery condition requires that the Shannon entropy of the $k$-tuple conditioned on its $l$-neighborhood be bounded above by some $\epsilon>0$. We study properties of measures on infinite sequences that maximize the metric entropy under the recoverability condition. Drawing on tools from ergodic theory, we prove some properties of entropy-maximizing measures. We also suggest a procedure for constructing an $\epsilon$-recoverable measure from a corresponding deterministic system.
Submitted 6 March, 2022; v1 submitted 1 October, 2020;
originally announced October 2020.
-
Repeat-Free Codes
Authors:
Ohad Elishco,
Ryan Gabrys,
Eitan Yaakobi,
Muriel Médard
Abstract:
In this paper we consider the problem of encoding data into \textit{repeat-free} sequences, that is, sequences required to contain any $k$-tuple at most once (for predefined $k$). First, the capacity of the repeat-free constraint is calculated. Then, an efficient algorithm, which uses two bits of redundancy, is presented to encode length-$n$ sequences for $k=2+2\log (n)$. This algorithm is then improved to support any value of $k$ of the form $k=a\log (n)$, for $1<a$, while its redundancy is $o(n)$. We also calculate the capacity of repeat-free sequences when combined with local constraints which are given by a constrained system, and the capacity of multi-dimensional repeat-free codes.
Submitted 21 June, 2021; v1 submitted 12 September, 2019;
originally announced September 2019.
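The repeat-free constraint described in the abstract above (every $k$-tuple of consecutive symbols appears at most once) can be checked directly. The sketch below is only a membership test under that definition; it is not the encoding algorithm from the paper, and the function name `is_repeat_free` is hypothetical.

```python
def is_repeat_free(seq, k):
    """Return True if every length-k window of seq occurs at most once."""
    seen = set()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in seen:
            return False  # this k-tuple already appeared earlier
        seen.add(window)
    return True


# The first string contains each of its 3-tuples once; the second repeats "01".
print(is_repeat_free("0001011100", 3), is_repeat_free("0101", 2))
```

A set of windows keeps the check linear in the number of windows, at the cost of storing each window seen so far.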
-
Capacity of dynamical storage systems
Authors:
Ohad Elishco,
Alexander Barg
Abstract:
We introduce a dynamical model of node repair in distributed storage systems wherein the storage nodes are subjected to failures according to independent Poisson processes. The main parameter that we study is the time-average capacity of the network in the scenario where a fixed subset of the nodes supports a higher repair bandwidth than the other nodes. The sequence of node failures generates random permutations of the nodes in the encoded block, and we model the state of the network as a Markov random walk on permutations of $n$ elements. As our main result we show that the capacity of the network can be increased compared to the static (worst-case) model of the storage system, while maintaining the same (average) repair bandwidth, and we derive estimates of the increase. We also quantify the capacity increase in the case that the repair center has information about the sequence of the recently failed storage nodes.
Submitted 20 September, 2020; v1 submitted 26 August, 2019;
originally announced August 2019.
-
The Capacity of Some Pólya String Models
Authors:
Ohad Elishco,
Farzad Farnoud,
Moshe Schwartz,
Jehoshua Bruck
Abstract:
We study random string-duplication systems, which we call Pólya string models. These are motivated by DNA storage in living organisms, and certain random mutation processes that affect their genome. Unlike previous works that study the combinatorial capacity of string-duplication systems, or various string statistics, this work provides exact capacity or bounds on it, for several probabilistic models. In particular, we study the capacity of noisy string-duplication systems, including the tandem-duplication, end-duplication, and interspersed-duplication systems. Interesting connections are drawn between some systems and the signature of random permutations, as well as to the beta distribution common in population genetics.
Submitted 18 August, 2018;
originally announced August 2018.
-
On Independence and Capacity of Multidimensional Semiconstrained Systems
Authors:
Ohad Elishco,
Tom Meyerovitch,
Moshe Schwartz
Abstract:
We find a new formula for the limit of the capacity of certain sequences of multidimensional semiconstrained systems as the dimension tends to infinity. We do so by generalizing the notion of independence entropy, originally studied in the context of constrained systems, to the study of semiconstrained systems. Using the independence entropy, we obtain new lower bounds on the capacity of multidimensional semiconstrained systems in general, and $d$-dimensional axial-product systems in particular. In the case of the latter, we prove our bound is asymptotically tight, giving the exact limiting capacity in terms of the independence entropy. We show the new bound improves upon the best-known bound in a case study of $(0,k,p)$-RLL.
Submitted 15 September, 2017;
originally announced September 2017.
-
Encoding Semiconstrained Systems
Authors:
Ohad Elishco,
Tom Meyerovitch,
Moshe Schwartz
Abstract:
Semiconstrained systems were recently suggested as a generalization of constrained systems, commonly used in communication and data-storage applications that require certain offending subsequences be avoided. In an attempt to apply techniques from constrained systems, we study sequences of constrained systems that are contained in, or contain, a given semiconstrained system, while approaching its capacity. In the case of contained systems we describe two such sequences resulting in constant-to-constant bit-rate block encoders and sliding-block encoders. Surprisingly, in the case of containing systems we show that a "generic" semiconstrained system is never contained in a proper fully-constrained system.
Submitted 24 October, 2016; v1 submitted 21 January, 2016;
originally announced January 2016.
-
Semi-constrained Systems
Authors:
Ohad Elishco,
Tom Meyerovitch,
Moshe Schwartz
Abstract:
When transmitting information over a noisy channel, two approaches, dating back to Shannon's work, are common: assuming the channel errors are independent of the transmitted content and devising an error-correcting code, or assuming the errors are data dependent and devising a constrained-coding scheme that eliminates all offending data patterns. In this paper we analyze a middle road, which we call a semiconstrained system. In such a system, which is an extension of the channel with cost constraints model, we do not eliminate the error-causing sequences entirely, but rather restrict the frequency with which they appear.
We address several key issues in this study. The first is proving closed-form bounds on the capacity which allow us to bound the asymptotics of the capacity. In particular, we bound the rate at which the capacity of the semiconstrained $(0,k)$-RLL tends to $1$ as $k$ grows. The second key issue is devising efficient encoding and decoding procedures that asymptotically achieve capacity with vanishing error. Finally, we consider delicate issues involving the continuity of the capacity and a relaxation of the definition of semiconstrained systems.
Submitted 21 January, 2015; v1 submitted 12 January, 2014;
originally announced January 2014.
-
Capacity and coding for the Ising Channel with Feedback
Authors:
Ohad Elishco,
Haim Permuter
Abstract:
The Ising channel, which was introduced in 1990, is a channel with memory that models intersymbol interference. In this paper we consider the Ising channel with feedback and find the capacity of the channel together with a capacity-achieving coding scheme. To calculate the channel capacity, an equivalent dynamic programming (DP) problem is formulated and solved. Using the DP solution, we establish that the feedback capacity is $C=\frac{2H_b(a)}{3+a}\approx 0.575522$, where $a$ is a particular root of a fourth-degree polynomial and $H_b(x)$ denotes the binary entropy function. Simultaneously, $a=\arg \max_{0\leq x \leq 1} \frac{2H_b(x)}{3+x}$. Finally, a simple, error-free, capacity-achieving coding scheme is provided, and a strong connection between the DP results and the coding scheme is outlined.
Submitted 21 May, 2012;
originally announced May 2012.
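The capacity expression in the abstract above, $C=\max_{0\leq x\leq 1} \frac{2H_b(x)}{3+x}$, can be sanity-checked numerically. The sketch below maximizes the expression over a uniform grid; the grid search and the function names are assumptions made for illustration and are unrelated to the dynamic programming argument used in the paper.

```python
from math import log2


def binary_entropy(x):
    """Binary entropy H_b(x) in bits, with H_b(0) = H_b(1) = 0."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)


def objective(x):
    """The quantity 2 * H_b(x) / (3 + x) from the capacity formula."""
    return 2 * binary_entropy(x) / (3 + x)


def maximize_on_grid(steps=100_000):
    """Return (argmax, max) of the objective over the grid {i / steps}."""
    best_x = max((i / steps for i in range(steps + 1)), key=objective)
    return best_x, objective(best_x)


a, capacity = maximize_on_grid()
print(f"a ~ {a:.4f}, C ~ {capacity:.6f}")
```

A grid search is used because the objective is cheap to evaluate on $[0,1]$; any one-dimensional optimizer would serve equally well here.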