-
On the Reliability of Information Retrieval From MDS Coded Data in DNA Storage
Authors:
Serge Kas Hanna
Abstract:
This work presents a theoretical analysis of the probability of successfully retrieving data encoded with MDS codes (e.g., Reed-Solomon codes) in DNA storage systems. We study this probability under independent and identically distributed (i.i.d.) substitution errors, focusing on a common code design strategy that combines inner and outer MDS codes. Our analysis demonstrates how this probability depends on factors such as the total number of sequencing reads, their distribution across strands, the rates of the inner and outer codes, and the substitution error probabilities. These results provide actionable insights into optimizing DNA storage systems under reliability constraints, including determining the minimum number of sequencing reads needed for reliable data retrieval and identifying the optimal balance between the rates of inner and outer MDS codes.
Submitted 1 May, 2025; v1 submitted 10 February, 2025;
originally announced February 2025.
-
Evaluating the Impact of Lab Test Results on Large Language Models Generated Differential Diagnoses from Clinical Case Vignettes
Authors:
Balu Bhasuran,
Qiao Jin,
Yuzhang Xie,
Carl Yang,
Karim Hanna,
Jennifer Costa,
Cindy Shavor,
Zhiyong Lu,
Zhe He
Abstract:
Differential diagnosis is crucial for medicine as it helps healthcare providers systematically distinguish between conditions that share similar symptoms. This study assesses the impact of lab test results on differential diagnoses (DDx) made by large language models (LLMs). Clinical vignettes from 50 case reports from PubMed Central were created incorporating patient demographics, symptoms, and lab results. Five LLMs (GPT-4, GPT-3.5, Llama-2-70b, Claude-2, and Mixtral-8x7B) were tested to generate Top 10, Top 5, and Top 1 DDx with and without lab data. A comprehensive evaluation involving GPT-4, a knowledge graph, and clinicians was conducted. GPT-4 performed best, achieving 55% accuracy for Top 1 diagnoses and 60% for Top 10 with lab data, with lenient accuracy up to 80%. Lab results significantly improved accuracy, with GPT-4 and Mixtral excelling, though exact match rates were low. Lab tests, including liver function, metabolic/toxicology panels, and serology/immune tests, were generally interpreted correctly by LLMs for differential diagnosis.
Submitted 31 October, 2024;
originally announced November 2024.
-
A Foundation Model for Chemical Design and Property Prediction
Authors:
Feiyang Cai,
Katelin Hanna,
Tianyu Zhu,
Tzuen-Rong Tzeng,
Yongping Duan,
Ling Liu,
Srikanth Pilla,
Gang Li,
Feng Luo
Abstract:
Artificial intelligence (AI) has significantly advanced computational chemistry research in various tasks. However, traditional AI methods often rely on task-specific model designs and training, which constrain both the scalability of model size and generalization across different tasks. Here, we introduce ChemFM, a large foundation model specifically developed for chemicals. ChemFM comprises 3 billion parameters and is pre-trained on 178 million molecules using self-supervised causal language modeling to extract generalizable molecular representations. This model can be adapted to diverse downstream chemical applications using either full-parameter or parameter-efficient fine-tuning methods. ChemFM consistently outperforms state-of-the-art task-specific AI models across all tested tasks. Notably, it achieves up to 67.48% performance improvement across 34 property prediction benchmarks, up to 33.80% reduction in mean average deviation between conditioned and actual properties of generated molecules in conditional molecular generation tasks, and up to 3.7% top-1 accuracy improvement across 4 reaction prediction datasets. Moreover, ChemFM demonstrates its superior performance in predicting antibiotic activity and cytotoxicity, highlighting its potential to advance the discovery of novel antibiotics. We anticipate that ChemFM will significantly advance chemistry research by providing a foundation model capable of effectively generalizing across a broad range of tasks with minimal additional training.
Submitted 23 January, 2025; v1 submitted 28 October, 2024;
originally announced October 2024.
-
Lab-AI: Using Retrieval Augmentation to Enhance Language Models for Personalized Lab Test Interpretation in Clinical Medicine
Authors:
Xiaoyu Wang,
Haoyong Ouyang,
Balu Bhasuran,
Xiao Luo,
Karim Hanna,
Mia Liza A. Lustria,
Carl Yang,
Zhe He
Abstract:
Accurate interpretation of lab results is crucial in clinical medicine, yet most patient portals use universal normal ranges, ignoring conditional factors like age and gender. This study introduces Lab-AI, an interactive system that offers personalized normal ranges using retrieval-augmented generation (RAG) from credible health sources. Lab-AI has two modules: factor retrieval and normal range retrieval. We tested these on 122 lab tests: 40 with conditional factors and 82 without. For tests with factors, normal ranges depend on patient-specific information. Our results show GPT-4-turbo with RAG achieved a 0.948 F1 score for factor retrieval and 0.995 accuracy for normal range retrieval. GPT-4-turbo with RAG outperformed the best non-RAG system by 33.5% in factor retrieval and showed 132% and 100% improvements in question-level and lab-level performance, respectively, for normal range retrieval. These findings highlight Lab-AI's potential to enhance patient understanding of lab results.
Submitted 23 April, 2025; v1 submitted 16 September, 2024;
originally announced September 2024.
-
Approximate Gradient Coding for Privacy-Flexible Federated Learning with Non-IID Data
Authors:
Okko Makkonen,
Sampo Niemelä,
Camilla Hollanti,
Serge Kas Hanna
Abstract:
This work focuses on the challenges of non-IID data and stragglers/dropouts in federated learning. We introduce and explore a privacy-flexible paradigm that models parts of the clients' local data as non-private, offering a more versatile and business-oriented perspective on privacy. Within this framework, we propose a data-driven strategy for mitigating the effects of label heterogeneity and client straggling on federated learning. Our solution combines both offline data sharing and approximate gradient coding techniques. Through numerical simulations using the MNIST dataset, we demonstrate that our approach enables achieving a deliberate trade-off between privacy and utility, leading to improved model convergence and accuracy while using an adaptable portion of non-private data.
Submitted 4 April, 2024;
originally announced April 2024.
-
Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study
Authors:
Zhe He,
Balu Bhasuran,
Qiao Jin,
Shubo Tian,
Karim Hanna,
Cindy Shavor,
Lisbeth Garcia Arguello,
Patrick Murray,
Zhiyong Lu
Abstract:
Lab results are often confusing and hard to understand. Large language models (LLMs) such as ChatGPT have opened a promising avenue for patients to get their questions answered. We aim to assess the feasibility of using LLMs to generate relevant, accurate, helpful, and unharmful responses to lab test-related questions asked by patients and to identify potential issues that can be mitigated with augmentation approaches. We first collected question-and-answer data related to lab test results from Yahoo! Answers and selected 53 QA pairs for this study. Using the LangChain framework and ChatGPT web portal, we generated responses to the 53 questions from four LLMs including GPT-4, Meta LLaMA 2, MedAlpaca, and ORCA_mini. We assessed the similarity of their answers using standard QA similarity-based evaluation metrics including ROUGE, BLEU, METEOR, and BERTScore. We also utilized an LLM-based evaluator to judge whether a target model has higher quality in terms of relevance, correctness, helpfulness, and safety than the baseline model. Finally, we performed a manual evaluation with medical experts for all the responses to seven selected questions on the same four aspects. The results of Win Rate and medical expert evaluation both showed that GPT-4's responses achieved better scores than all the other LLM responses and human responses on all four aspects (relevance, correctness, helpfulness, and safety). However, LLM responses occasionally also suffer from a lack of interpretation in one's medical context, incorrect statements, and lack of references. We find that compared to the other three LLMs and the human answers from the Q&A website, GPT-4's responses are more accurate, helpful, relevant, and safe. However, there are cases in which GPT-4's responses are inaccurate and not individualized. We identified a number of ways to improve the quality of LLM responses.
Submitted 23 January, 2024;
originally announced February 2024.
-
$\mathsf{GC+}$ Code: a Short Systematic Code for Correcting Random Edit Errors in DNA Storage
Authors:
Serge Kas Hanna
Abstract:
Storing digital data in synthetic DNA faces challenges in ensuring data reliability in the presence of edit errors -- deletions, insertions, and substitutions -- that occur randomly during various phases of the storage process. Current limitations in DNA synthesis technology also require the use of short DNA sequences, highlighting the particular need for short edit-correcting codes. Motivated by these factors, we introduce a systematic code designed to correct random edits while adhering to typical length constraints in DNA storage. We evaluate the performance of the code through simulations and assess its effectiveness within a DNA storage framework, revealing promising results.
Submitted 7 September, 2024; v1 submitted 2 February, 2024;
originally announced February 2024.
-
Can Attention Be Used to Explain EHR-Based Mortality Prediction Tasks: A Case Study on Hemorrhagic Stroke
Authors:
Qizhang Feng,
Jiayi Yuan,
Forhan Bin Emdad,
Karim Hanna,
Xia Hu,
Zhe He
Abstract:
Stroke is a significant cause of mortality and morbidity, necessitating early predictive strategies to minimize risks. Traditional methods for evaluating patients, such as Acute Physiology and Chronic Health Evaluation (APACHE II, IV) and Simplified Acute Physiology Score III (SAPS III), have limited accuracy and interpretability. This paper proposes a novel approach: an interpretable, attention-based transformer model for early stroke mortality prediction. This model seeks to address the limitations of previous predictive models, providing both interpretability (providing clear, understandable explanations of the model) and fidelity (giving a truthful explanation of the model's dynamics from input to output). Furthermore, the study explores and compares fidelity and interpretability scores using Shapley values and attention-based scores to improve model explainability. The research objectives include designing an interpretable attention-based transformer model, evaluating its performance compared to existing models, and providing feature importance derived from the model.
Submitted 4 August, 2023;
originally announced August 2023.
-
Optimal Codes Detecting Deletions in Concatenated Binary Strings Applied to Trace Reconstruction
Authors:
Serge Kas Hanna
Abstract:
Consider two or more strings $\mathbf{x}^1,\mathbf{x}^2,\ldots,$ that are concatenated to form $\mathbf{x}=\langle \mathbf{x}^1,\mathbf{x}^2,\ldots \rangle$. Suppose that up to $δ$ deletions occur in each of the concatenated strings. Since deletions alter the lengths of the strings, a fundamental question to ask is: how much redundancy do we need to introduce in $\mathbf{x}$ in order to recover the boundaries of $\mathbf{x}^1,\mathbf{x}^2,\ldots$? This boundary problem is equivalent to the problem of designing codes that can detect the exact number of deletions in each concatenated string. In this work, we answer the question above by first deriving converse results that give lower bounds on the redundancy of deletion-detecting codes. Then, we present a marker-based code construction whose redundancy is asymptotically optimal in $δ$ among all families of deletion-detecting codes, and exactly optimal among all block-by-block decodable codes. To exemplify the usefulness of such deletion-detecting codes, we apply our code to trace reconstruction and design an efficient coded reconstruction scheme that requires a constant number of traces.
Submitted 19 April, 2023;
originally announced April 2023.
-
Fast and Straggler-Tolerant Distributed SGD with Reduced Computation Load
Authors:
Maximilian Egger,
Serge Kas Hanna,
Rawad Bitar
Abstract:
In distributed machine learning, a central node outsources computationally expensive calculations to external worker nodes. The properties of optimization procedures like stochastic gradient descent (SGD) can be leveraged to mitigate the effect of unresponsive or slow workers called stragglers, that otherwise degrade the benefit of outsourcing the computation. This can be done by only waiting for a subset of the workers to finish their computation at each iteration of the algorithm. Previous works proposed to adapt the number of workers to wait for as the algorithm evolves to optimize the speed of convergence. In contrast, we model the communication and computation times using independent random variables. Considering this model, we construct a novel scheme that adapts both the number of workers and the computation load throughout the run-time of the algorithm. Consequently, we improve the convergence speed of distributed SGD while significantly reducing the computation load, at the expense of a slight increase in communication load.
Submitted 17 April, 2023;
originally announced April 2023.
-
Codes Correcting Burst and Arbitrary Erasures for Reliable and Low-Latency Communication
Authors:
Serge Kas Hanna,
Zhiyuan Tan,
Wen Xu,
Antonia Wachter-Zeh
Abstract:
Motivated by modern network communication applications which require low latency, we study codes that correct erasures with low decoding delay. We provide a simple explicit construction that yields convolutional codes that can correct both burst and arbitrary erasures under a maximum decoding delay constraint $T$. Our proposed code has efficient encoding/decoding algorithms and requires a field size that is linear in $T$. We study the performance of our code over the Gilbert-Elliott channel; our simulation results show significant performance gains over low-delay codes existing in the literature.
Submitted 16 February, 2023;
originally announced February 2023.
-
Adaptive Stochastic Gradient Descent for Fast and Communication-Efficient Distributed Learning
Authors:
Serge Kas Hanna,
Rawad Bitar,
Parimal Parag,
Venkat Dasari,
Salim El Rouayheb
Abstract:
We consider the setting where a master wants to run a distributed stochastic gradient descent (SGD) algorithm on $n$ workers, each having a subset of the data. Distributed SGD may suffer from the effect of stragglers, i.e., slow or unresponsive workers who cause delays. One solution studied in the literature is to wait at each iteration for the responses of the fastest $k<n$ workers before updating the model, where $k$ is a fixed parameter. The choice of the value of $k$ presents a trade-off between the runtime (i.e., convergence rate) of SGD and the error of the model. Towards optimizing the error-runtime trade-off, we investigate distributed SGD with adaptive $k$, i.e., varying $k$ throughout the runtime of the algorithm. We first design an adaptive policy for varying $k$ that optimizes this trade-off based on an upper bound on the error as a function of the wall-clock time that we derive. Then, we propose and implement an algorithm for adaptive distributed SGD that is based on a statistical heuristic. Our results show that the adaptive version of distributed SGD can reach lower error values in less time compared to non-adaptive implementations. Moreover, the results also show that the adaptive version is communication-efficient, where the amount of communication required between the master and the workers is less than that of non-adaptive versions.
Submitted 4 August, 2022;
originally announced August 2022.
-
Coding for Trace Reconstruction over Multiple Channels with Vanishing Deletion Probabilities
Authors:
Serge Kas Hanna
Abstract:
Motivated by DNA-based storage applications, we study the problem of reconstructing a coded sequence from multiple traces. We consider the model where the traces are outputs of independent deletion channels, where each channel deletes each bit of the input codeword \(\mathbf{x} \in \{0,1\}^n\) independently with probability \(p\). We focus on the regime where the deletion probability \(p \to 0\) when \(n\to \infty\). Our main contribution is designing a novel code for trace reconstruction that allows reconstructing a coded sequence efficiently from a constant number of traces. We provide theoretical results on the performance of our code in addition to simulation results where we compare the performance of our code to other reconstruction techniques in terms of the edit distance error.
Submitted 11 July, 2022;
originally announced July 2022.
-
Optimal Codes Correcting Localized Deletions
Authors:
Rawad Bitar,
Serge Kas Hanna,
Nikita Polyanskii,
Ilya Vorobyev
Abstract:
We consider the problem of constructing codes that can correct deletions that are localized within a certain part of the codeword that is unknown a priori. Namely, the model that we study is when at most $k$ deletions occur in a window of size $k$, where the positions of the deletions within this window are not necessarily consecutive. Localized deletions are thus a generalization of burst deletions that occur in consecutive positions. We present novel explicit codes that are efficiently encodable and decodable and can correct up to $k$ localized deletions. Furthermore, these codes have $\log n+\mathcal{O}(k \log^2 (k\log n))$ redundancy, where $n$ is the length of the information message, which is asymptotically optimal in $n$ for $k=o(\log n/(\log \log n)^2)$.
Submitted 5 May, 2021;
originally announced May 2021.
-
Detecting Deletions and Insertions in Concatenated Strings with Optimal Redundancy
Authors:
Serge Kas Hanna,
Rawad Bitar
Abstract:
We study codes that can detect the exact number of deletions and insertions in concatenated binary strings. We construct optimal codes for the case of detecting up to $δ$ deletions. We prove the optimality of these codes by deriving a converse result which shows that the redundancy of our codes is asymptotically optimal in $δ$ among all families of deletion-detecting codes, and particularly optimal among all block-by-block decodable codes. For the case of insertions, we construct codes that can detect up to $2$ insertions in each concatenated binary string.
Submitted 1 May, 2021;
originally announced May 2021.
-
Adaptive Distributed Stochastic Gradient Descent for Minimizing Delay in the Presence of Stragglers
Authors:
Serge Kas Hanna,
Rawad Bitar,
Parimal Parag,
Venkat Dasari,
Salim El Rouayheb
Abstract:
We consider the setting where a master wants to run a distributed stochastic gradient descent (SGD) algorithm on $n$ workers each having a subset of the data. Distributed SGD may suffer from the effect of stragglers, i.e., slow or unresponsive workers who cause delays. One solution studied in the literature is to wait at each iteration for the responses of the fastest $k<n$ workers before updating the model, where $k$ is a fixed parameter. The choice of the value of $k$ presents a trade-off between the runtime (i.e., convergence rate) of SGD and the error of the model. Towards optimizing the error-runtime trade-off, we investigate distributed SGD with adaptive $k$. We first design an adaptive policy for varying $k$ that optimizes this trade-off based on an upper bound on the error as a function of the wall-clock time which we derive. Then, we propose an algorithm for adaptive distributed SGD that is based on a statistical heuristic. We implement our algorithm and provide numerical simulations which confirm our intuition and theoretical analysis.
Submitted 25 February, 2020;
originally announced February 2020.
-
List Decoding of Deletions Using Guess & Check Codes
Authors:
Serge Kas Hanna,
Salim El Rouayheb
Abstract:
Guess & Check (GC) codes are systematic binary codes that can correct multiple deletions, with high probability. GC codes have logarithmic redundancy in the length of the message $k$, and the encoding and decoding algorithms of these codes are deterministic and run in polynomial time for a constant number of deletions $δ$. The unique decoding properties of GC codes were examined in a previous work by the authors. In this paper, we investigate the list decoding performance of these codes. Namely, we study the average size and the maximum size of the list obtained by a GC decoder for a constant number of deletions $δ$. The theoretical results show that: (i) the average size of the list approaches $1$ as $k$ grows; and (ii) there exists an infinite sequence of GC codes indexed by $k$, whose maximum list size is upper bounded by a constant that is independent of $k$. We also provide numerical simulations on the list decoding performance of GC codes for multiple values of $k$ and $δ$.
Submitted 29 April, 2019; v1 submitted 16 October, 2018;
originally announced October 2018.
-
Codes for Correcting Localized Deletions
Authors:
Serge Kas Hanna,
Salim El Rouayheb
Abstract:
We consider the problem of constructing binary codes for correcting deletions that are localized within certain parts of the codeword that are unknown a priori. The model that we study is when $δ\leq w$ deletions are localized in a window of size $w$ bits. These $δ$ deletions do not necessarily occur in consecutive positions, but are restricted to the window of size $w$. The localized deletions model is a generalization of the bursty model, in which all the deleted bits are consecutive. In this paper, we construct new explicit codes for the localized model, based on the family of Guess & Check codes which was previously introduced by the authors. The codes that we construct can correct, with high probability, $δ\leq w$ deletions that are localized in a single window of size $w$, where $w$ grows with the block length. Moreover, these codes are systematic; have low redundancy; and have efficient deterministic encoding and decoding algorithms. We also generalize these codes to deletions that are localized within multiple windows in the codeword.
Submitted 8 January, 2021; v1 submitted 2 November, 2017;
originally announced November 2017.
-
Guess & Check Codes for Deletions, Insertions, and Synchronization
Authors:
Serge Kas Hanna,
Salim El Rouayheb
Abstract:
We consider the problem of constructing codes that can correct $δ$ deletions occurring in an arbitrary binary string of length $n$ bits. Varshamov-Tenengolts (VT) codes, dating back to 1965, are zero-error single deletion $(δ=1)$ correcting codes, and have an asymptotically optimal redundancy. Finding similar codes for $δ\geq 2$ deletions remains an open problem. In this work, we relax the standard zero-error (i.e., worst-case) decoding requirement by assuming that the positions of the $δ$ deletions (or insertions) are independent of the codeword. Our contribution is a new family of explicit codes, that we call Guess & Check (GC) codes, that can correct with high probability up to a constant number of $δ$ deletions (or insertions). GC codes are systematic; and have deterministic polynomial time encoding and decoding algorithms. We also describe the application of GC codes to file synchronization.
Submitted 24 May, 2018; v1 submitted 24 May, 2017;
originally announced May 2017.
-
Guess & Check Codes for Deletions and Synchronization
Authors:
Serge Kas Hanna,
Salim El Rouayheb
Abstract:
We consider the problem of constructing codes that can correct $δ$ deletions occurring in an arbitrary binary string of length $n$ bits. Varshamov-Tenengolts (VT) codes can correct all possible single deletions $(δ=1)$ with an asymptotically optimal redundancy. Finding similar codes for $δ\geq 2$ deletions is an open problem. We propose a new family of codes, that we call Guess & Check (GC) codes, that can correct, with high probability, a constant number of deletions $δ$ occurring at uniformly random positions within an arbitrary string. The GC codes are based on MDS codes and have an asymptotically optimal redundancy that is $Θ(δ\log n)$. We provide deterministic polynomial time encoding and decoding schemes for these codes. We also describe the applications of GC codes to file synchronization.
Submitted 27 April, 2017; v1 submitted 15 February, 2017;
originally announced February 2017.
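
Illustration (not taken from any of the listed papers): several abstracts above state that Varshamov-Tenengolts (VT) codes correct any single deletion. The sketch below checks this by brute force for a small block length, assuming the standard definition of a VT code as the set of binary strings $x$ of length $n$ with $\sum_{i=1}^{n} i\,x_i \equiv a \pmod{n+1}$. The function names `vt_syndrome`, `vt_code`, `single_deletions`, and `corrects_single_deletion` are hypothetical and chosen only for this example.

```python
from itertools import product


def vt_syndrome(bits):
    """Weighted checksum sum(i * x_i) mod (n + 1), with positions indexed from 1."""
    n = len(bits)
    return sum(i * b for i, b in enumerate(bits, start=1)) % (n + 1)


def vt_code(n, a=0):
    """All length-n binary tuples whose VT syndrome equals a (exhaustive search)."""
    return [x for x in product((0, 1), repeat=n) if vt_syndrome(x) == a]


def single_deletions(x):
    """Set of strings obtained by deleting exactly one symbol from x."""
    return {x[:i] + x[i + 1:] for i in range(len(x))}


def corrects_single_deletion(code):
    """True if no two distinct codewords share a string after one deletion each."""
    owner = {}
    for c in code:
        for y in single_deletions(c):
            # Record the first codeword that produces y; a different
            # codeword producing the same y makes decoding ambiguous.
            if owner.setdefault(y, c) != c:
                return False
    return True


if __name__ == "__main__":
    code = vt_code(6)
    print(len(code), corrects_single_deletion(code))
```

The exhaustive search is exponential in $n$ and is meant only to make the single-deletion property concrete for small block lengths; it says nothing about the Guess & Check constructions themselves.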