Search | arXiv e-print repository

Differentially Private Synthetic Data via APIs 3: Using Simulators Instead of Foundation Model

Authors: Zinan Lin, Tadas Baltrusaitis, Wenyu Wang, Sergey Yekhanin

Abstract: Differentially private (DP) synthetic data, which closely resembles the original private data while maintaining strong privacy guarantees, has become a key tool for unlocking the value of private data without compromising privacy. Recently, Private Evolution (PE) has emerged as a promising method for generating DP synthetic data. Unlike other training-based approaches, PE only requires access to i… ▽ More Differentially private (DP) synthetic data, which closely resembles the original private data while maintaining strong privacy guarantees, has become a key tool for unlocking the value of private data without compromising privacy. Recently, Private Evolution (PE) has emerged as a promising method for generating DP synthetic data. Unlike other training-based approaches, PE only requires access to inference APIs from foundation models, enabling it to harness the power of state-of-the-art (SoTA) models. However, a suitable foundation model for a specific private data domain is not always available. In this paper, we discover that the PE framework is sufficiently general to allow APIs beyond foundation models. In particular, we demonstrate that many SoTA data synthesizers that do not rely on neural networks--such as computer graphics-based image generators, which we refer to as simulators--can be effectively integrated into PE. This insight significantly broadens PE's applicability and unlocks the potential of powerful simulators for DP data synthesis. We explore this approach, named Sim-PE, in the context of image synthesis. Across four diverse simulators, Sim-PE performs well, improving the downstream classification accuracy of PE by up to 3x, reducing FID by up to 80%, and offering much greater efficiency. We also show that simulators and foundation models can be easily leveraged together within PE to achieve further improvements. The code is open-sourced in the Private Evolution Python library: https://github.com/microsoft/DPSDA. △ Less

Submitted 20 May, 2025; v1 submitted 8 February, 2025; originally announced February 2025.

Comments: Published in: (1) ICLR 2025 Workshop on Data Problems, (2) ICLR 2025 Workshop on Synthetic Data

arXiv:2404.02241 [pdf, other]

Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better

Authors: Enshu Liu, Junyi Zhu, Zinan Lin, Xuefei Ning, Shuaiqi Wang, Matthew B. Blaschko, Sergey Yekhanin, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang

Abstract: Diffusion Models (DM) and Consistency Models (CM) are two types of popular generative models with good generation quality on various tasks. When training DM and CM, intermediate weight checkpoints are not fully utilized and only the last converged checkpoint is used. In this work, we find that high-quality model weights often lie in a basin which cannot be reached by SGD but can be obtained by pro… ▽ More Diffusion Models (DM) and Consistency Models (CM) are two types of popular generative models with good generation quality on various tasks. When training DM and CM, intermediate weight checkpoints are not fully utilized and only the last converged checkpoint is used. In this work, we find that high-quality model weights often lie in a basin which cannot be reached by SGD but can be obtained by proper checkpoint averaging. Based on these observations, we propose LCSC, a simple but effective and efficient method to enhance the performance of DM and CM, by combining checkpoints along the training trajectory with coefficients deduced from evolutionary search. We demonstrate the value of LCSC through two use cases: $\textbf{(a) Reducing training cost.}$ With LCSC, we only need to train DM/CM with fewer number of iterations and/or lower batch sizes to obtain comparable sample quality with the fully trained model. For example, LCSC achieves considerable training speedups for CM (23$\times$ on CIFAR-10 and 15$\times$ on ImageNet-64). $\textbf{(b) Enhancing pre-trained models.}$ Assuming full training is already done, LCSC can further improve the generation quality or speed of the final converged models. For example, LCSC achieves better performance using 1 number of function evaluation (NFE) than the base model with 2 NFE on consistency distillation, and decreases the NFE of DM from 15 to 9 while maintaining the generation quality on CIFAR-10. Our code is available at https://github.com/imagination-research/LCSC. △ Less

Submitted 26 February, 2025; v1 submitted 2 April, 2024; originally announced April 2024.

arXiv:2403.01749 [pdf, other]

Differentially Private Synthetic Data via Foundation Model APIs 2: Text

Authors: Chulin Xie, Zinan Lin, Arturs Backurs, Sivakanth Gopi, Da Yu, Huseyin A Inan, Harsha Nori, Haotian Jiang, Huishuai Zhang, Yin Tat Lee, Bo Li, Sergey Yekhanin

Abstract: Text data has become extremely valuable due to the emergence of machine learning algorithms that learn from it. A lot of high-quality text data generated in the real world is private and therefore cannot be shared or used freely due to privacy concerns. Generating synthetic replicas of private text data with a formal privacy guarantee, i.e., differential privacy (DP), offers a promising and scalab… ▽ More Text data has become extremely valuable due to the emergence of machine learning algorithms that learn from it. A lot of high-quality text data generated in the real world is private and therefore cannot be shared or used freely due to privacy concerns. Generating synthetic replicas of private text data with a formal privacy guarantee, i.e., differential privacy (DP), offers a promising and scalable solution. However, existing methods necessitate DP finetuning of large language models (LLMs) on private data to generate DP synthetic data. This approach is not viable for proprietary LLMs (e.g., GPT-3.5) and also demands considerable computational resources for open-source LLMs. Lin et al. (2024) recently introduced the Private Evolution (PE) algorithm to generate DP synthetic images with only API access to diffusion models. In this work, we propose an augmented PE algorithm, named Aug-PE, that applies to the complex setting of text. We use API access to an LLM and generate DP synthetic text without any model training. We conduct comprehensive experiments on three benchmark datasets. Our results demonstrate that Aug-PE produces DP synthetic text that yields competitive utility with the SOTA DP finetuning baselines. This underscores the feasibility of relying solely on API access of LLMs to produce high-quality DP synthetic texts, thereby facilitating more accessible routes to privacy-preserving LLM applications. Our code and data are available at https://github.com/AI-secure/aug-pe. △ Less

Submitted 23 July, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

Comments: ICML'24 Spotlight

arXiv:2305.15560 [pdf, other]

Differentially Private Synthetic Data via Foundation Model APIs 1: Images

Authors: Zinan Lin, Sivakanth Gopi, Janardhan Kulkarni, Harsha Nori, Sergey Yekhanin

Abstract: Generating differentially private (DP) synthetic data that closely resembles the original private data is a scalable way to mitigate privacy concerns in the current data-driven world. In contrast to current practices that train customized models for this task, we aim to generate DP Synthetic Data via APIs (DPSDA), where we treat foundation models as blackboxes and only utilize their inference APIs… ▽ More Generating differentially private (DP) synthetic data that closely resembles the original private data is a scalable way to mitigate privacy concerns in the current data-driven world. In contrast to current practices that train customized models for this task, we aim to generate DP Synthetic Data via APIs (DPSDA), where we treat foundation models as blackboxes and only utilize their inference APIs. Such API-based, training-free approaches are easier to deploy as exemplified by the recent surge in the number of API-based apps. These approaches can also leverage the power of large foundation models which are only accessible via their inference APIs. However, this comes with greater challenges due to strictly more restrictive model access and the need to protect privacy from the API provider. In this paper, we present a new framework called Private Evolution (PE) to solve this problem and show its initial promise on synthetic images. Surprisingly, PE can match or even outperform state-of-the-art (SOTA) methods without any model training. For example, on CIFAR10 (with ImageNet as the public data), we achieve FID <= 7.9 with privacy cost ε = 0.67, significantly improving the previous SOTA from ε = 32. We further demonstrate the promise of applying PE on large foundation models such as Stable Diffusion to tackle challenging private datasets with a small number of high-resolution images. The code and data are released at https://github.com/microsoft/DPSDA. △ Less

Submitted 17 May, 2025; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: Published in ICLR 2024

arXiv:2304.06929 [pdf]

Advancing Differential Privacy: Where We Are Now and Future Directions for Real-World Deployment

Authors: Rachel Cummings, Damien Desfontaines, David Evans, Roxana Geambasu, Yangsibo Huang, Matthew Jagielski, Peter Kairouz, Gautam Kamath, Sewoong Oh, Olga Ohrimenko, Nicolas Papernot, Ryan Rogers, Milan Shen, Shuang Song, Weijie Su, Andreas Terzis, Abhradeep Thakurta, Sergei Vassilvitskii, Yu-Xiang Wang, Li Xiong, Sergey Yekhanin, Da Yu, Huanyu Zhang, Wanrong Zhang

Abstract: In this article, we present a detailed review of current practices and state-of-the-art methodologies in the field of differential privacy (DP), with a focus of advancing DP's deployment in real-world applications. Key points and high-level contents of the article were originated from the discussions from "Differential Privacy (DP): Challenges Towards the Next Frontier," a workshop held in July 20… ▽ More In this article, we present a detailed review of current practices and state-of-the-art methodologies in the field of differential privacy (DP), with a focus of advancing DP's deployment in real-world applications. Key points and high-level contents of the article were originated from the discussions from "Differential Privacy (DP): Challenges Towards the Next Frontier," a workshop held in July 2022 with experts from industry, academia, and the public sector seeking answers to broad questions pertaining to privacy and its implications in the design of industry-grade systems. This article aims to provide a reference point for the algorithmic and design decisions within the realm of privacy, highlighting important challenges and potential research directions. Covering a wide spectrum of topics, this article delves into the infrastructure needs for designing private systems, methods for achieving better privacy/utility trade-offs, performing privacy attacks and auditing, as well as communicating privacy with broader audiences and stakeholders. △ Less

Submitted 12 March, 2024; v1 submitted 14 April, 2023; originally announced April 2023.

arXiv:2110.06500 [pdf, other]

Differentially Private Fine-tuning of Language Models

Authors: Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A. Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin, Huishuai Zhang

Abstract: We give simpler, sparser, and faster algorithms for differentially private fine-tuning of large-scale pre-trained language models, which achieve the state-of-the-art privacy versus utility tradeoffs on many standard NLP tasks. We propose a meta-framework for this problem, inspired by the recent success of highly parameter-efficient methods for fine-tuning. Our experiments show that differentially… ▽ More We give simpler, sparser, and faster algorithms for differentially private fine-tuning of large-scale pre-trained language models, which achieve the state-of-the-art privacy versus utility tradeoffs on many standard NLP tasks. We propose a meta-framework for this problem, inspired by the recent success of highly parameter-efficient methods for fine-tuning. Our experiments show that differentially private adaptations of these approaches outperform previous private algorithms in three important dimensions: utility, privacy, and the computational and memory cost of private training. On many commonly studied datasets, the utility of private models approaches that of non-private models. For example, on the MNLI dataset we achieve an accuracy of $87.8\%$ using RoBERTa-Large and $83.5\%$ using RoBERTa-Base with a privacy budget of $ε= 6.7$. In comparison, absent privacy constraints, RoBERTa-Large achieves an accuracy of $90.2\%$. Our findings are similar for natural language generation tasks. Privately fine-tuning with DART, GPT-2-Small, GPT-2-Medium, GPT-2-Large, and GPT-2-XL achieve BLEU scores of 38.5, 42.0, 43.1, and 43.8 respectively (privacy budget of $ε= 6.8,δ=$ 1e-5) whereas the non-private baseline is $48.1$. All our experiments suggest that larger models are better suited for private fine-tuning: while they are well known to achieve superior accuracy non-privately, we find that they also better maintain their accuracy when privacy is introduced. △ Less

Submitted 14 July, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

Comments: ICLR 2022. Code available at https://github.com/huseyinatahaninan/Differentially-Private-Fine-tuning-of-Language-Models

arXiv:2108.02831 [pdf, other]

Differentially Private n-gram Extraction

Authors: Kunho Kim, Sivakanth Gopi, Janardhan Kulkarni, Sergey Yekhanin

Abstract: We revisit the problem of $n$-gram extraction in the differential privacy setting. In this problem, given a corpus of private text data, the goal is to release as many $n$-grams as possible while preserving user level privacy. Extracting $n$-grams is a fundamental subroutine in many NLP applications such as sentence completion, response generation for emails etc. The problem also arises in other a… ▽ More We revisit the problem of $n$-gram extraction in the differential privacy setting. In this problem, given a corpus of private text data, the goal is to release as many $n$-grams as possible while preserving user level privacy. Extracting $n$-grams is a fundamental subroutine in many NLP applications such as sentence completion, response generation for emails etc. The problem also arises in other applications such as sequence mining, and is a generalization of recently studied differentially private set union (DPSU). In this paper, we develop a new differentially private algorithm for this problem which, in our experiments, significantly outperforms the state-of-the-art. Our improvements stem from combining recent advances in DPSU, privacy accounting, and new heuristics for pruning in the tree-based approach initiated by Chen et al. (2012). △ Less

Submitted 5 August, 2021; originally announced August 2021.

arXiv:2107.06440 [pdf, other]

Trellis BMA: Coded Trace Reconstruction on IDS Channels for DNA Storage

Authors: Sundara Rajan Srinivasavaradhan, Sivakanth Gopi, Henry D. Pfister, Sergey Yekhanin

Abstract: Sequencing a DNA strand, as part of the read process in DNA storage, produces multiple noisy copies which can be combined to produce better estimates of the original strand; this is called trace reconstruction. One can reduce the error rate further by introducing redundancy in the write sequence and this is called coded trace reconstruction. In this paper, we model the DNA storage channel as an in… ▽ More Sequencing a DNA strand, as part of the read process in DNA storage, produces multiple noisy copies which can be combined to produce better estimates of the original strand; this is called trace reconstruction. One can reduce the error rate further by introducing redundancy in the write sequence and this is called coded trace reconstruction. In this paper, we model the DNA storage channel as an insertion-deletion-substitution (IDS) channel and design both encoding schemes and low-complexity decoding algorithms for coded trace reconstruction. We introduce Trellis BMA, a new reconstruction algorithm whose complexity is linear in the number of traces, and compare its performance to previous algorithms. Our results show that it reduces the error rate on both simulated and experimental data. The performance comparisons in this paper are based on a new dataset of traces that will be publicly released with the paper. Our hope is that this dataset will enable research progress by allowing objective comparisons between candidate algorithms. △ Less

Submitted 20 August, 2024; v1 submitted 13 July, 2021; originally announced July 2021.

Comments: Extended version of paper presented at ISIT 2021. On 8/20/2024, in Section III, added a note regarding the dataset of traces released with the paper

arXiv:2011.14532 [pdf, other]

Batch Optimization for DNA Synthesis

Authors: Konstantin Makarychev, Miklos Z. Racz, Cyrus Rashtchian, Sergey Yekhanin

Abstract: Large pools of synthetic DNA molecules have been recently used to reliably store significant volumes of digital data. While DNA as a storage medium has enormous potential because of its high storage density, its practical use is currently severely limited because of the high cost and low throughput of available DNA synthesis technologies. We study the role of batch optimization in reducing the cos… ▽ More Large pools of synthetic DNA molecules have been recently used to reliably store significant volumes of digital data. While DNA as a storage medium has enormous potential because of its high storage density, its practical use is currently severely limited because of the high cost and low throughput of available DNA synthesis technologies. We study the role of batch optimization in reducing the cost of large scale DNA synthesis, which translates to the following algorithmic task. Given a large pool $\mathcal{S}$ of random quaternary strings of fixed length, partition $\mathcal{S}$ into batches in a way that minimizes the sum of the lengths of the shortest common supersequences across batches. We introduce two ideas for batch optimization that both improve (in different ways) upon a naive baseline: (1) using both $(ACGT)^{*}$ and its reverse $(TGCA)^{*}$ as reference strands, and batching appropriately, and (2) batching via the quantiles of an appropriate ordering of the strands. We also prove asymptotically matching lower bounds on the cost of DNA synthesis, showing that one cannot improve upon these two ideas. Our results uncover a surprising separation between two cases that naturally arise in the context of DNA data storage: the asymptotic cost savings of batch optimization are significantly greater in the case where strings in $\mathcal{S}$ do not contain repeats of the same character (homopolymers), as compared to the case where strings in $\mathcal{S}$ are unconstrained. △ Less

Submitted 23 February, 2021; v1 submitted 29 November, 2020; originally announced November 2020.

Comments: Improved Theorem 1.2 and its proof

arXiv:2002.09745 [pdf, other]

Differentially Private Set Union

Authors: Sivakanth Gopi, Pankaj Gulhane, Janardhan Kulkarni, Judy Hanwen Shen, Milad Shokouhi, Sergey Yekhanin

Abstract: We study the basic operation of set union in the global model of differential privacy. In this problem, we are given a universe $U$ of items, possibly of infinite size, and a database $D$ of users. Each user $i$ contributes a subset $W_i \subseteq U$ of items. We want an ($ε$,$δ$)-differentially private algorithm which outputs a subset $S \subset \cup_i W_i$ such that the size of $S$ is as large a… ▽ More We study the basic operation of set union in the global model of differential privacy. In this problem, we are given a universe $U$ of items, possibly of infinite size, and a database $D$ of users. Each user $i$ contributes a subset $W_i \subseteq U$ of items. We want an ($ε$,$δ$)-differentially private algorithm which outputs a subset $S \subset \cup_i W_i$ such that the size of $S$ is as large as possible. The problem arises in countless real world applications; it is particularly ubiquitous in natural language processing (NLP) applications as vocabulary extraction. For example, discovering words, sentences, $n$-grams etc., from private text data belonging to users is an instance of the set union problem. Known algorithms for this problem proceed by collecting a subset of items from each user, taking the union of such subsets, and disclosing the items whose noisy counts fall above a certain threshold. Crucially, in the above process, the contribution of each individual user is always independent of the items held by other users, resulting in a wasteful aggregation process, where some item counts happen to be way above the threshold. We deviate from the above paradigm by allowing users to contribute their items in a $\textit{dependent fashion}$, guided by a $\textit{policy}$. In this new setting ensuring privacy is significantly delicate. We prove that any policy which has certain $\textit{contractive}$ properties would result in a differentially private algorithm. We design two new algorithms, one using Laplace noise and other Gaussian noise, as specific instances of policies satisfying the contractive properties. Our experiments show that the new algorithms significantly outperform previously known mechanisms for the problem. △ Less

Submitted 6 April, 2022; v1 submitted 22 February, 2020; originally announced February 2020.

Comments: 23 pages, 7 figures

arXiv:1807.00736 [pdf, other]

An Algorithmic Framework For Differentially Private Data Analysis on Trusted Processors

Authors: Joshua Allen, Bolin Ding, Janardhan Kulkarni, Harsha Nori, Olga Ohrimenko, Sergey Yekhanin

Abstract: Differential privacy has emerged as the main definition for private data analysis and machine learning. The {\em global} model of differential privacy, which assumes that users trust the data collector, provides strong privacy guarantees and introduces small errors in the output. In contrast, applications of differential privacy in commercial systems by Apple, Google, and Microsoft, use the {\em l… ▽ More Differential privacy has emerged as the main definition for private data analysis and machine learning. The {\em global} model of differential privacy, which assumes that users trust the data collector, provides strong privacy guarantees and introduces small errors in the output. In contrast, applications of differential privacy in commercial systems by Apple, Google, and Microsoft, use the {\em local model}. Here, users do not trust the data collector, and hence randomize their data before sending it to the data collector. Unfortunately, local model is too strong for several important applications and hence is limited in its applicability. In this work, we propose a framework based on trusted processors and a new definition of differential privacy called {\em Oblivious Differential Privacy}, which combines the best of both local and global models. The algorithms we design in this framework show interesting interplay of ideas from the streaming algorithms, oblivious algorithms, and differential privacy. △ Less

Submitted 26 October, 2019; v1 submitted 2 July, 2018; originally announced July 2018.

Comments: Accepted at NeurIPS 2019

arXiv:1712.01524 [pdf, other]

Collecting Telemetry Data Privately

Authors: Bolin Ding, Janardhan Kulkarni, Sergey Yekhanin

Abstract: The collection and analysis of telemetry data from users' devices is routinely performed by many software companies. Telemetry collection leads to improved user experience but poses significant risks to users' privacy. Locally differentially private (LDP) algorithms have recently emerged as the main tool that allows data collectors to estimate various population statistics, while preserving privac… ▽ More The collection and analysis of telemetry data from users' devices is routinely performed by many software companies. Telemetry collection leads to improved user experience but poses significant risks to users' privacy. Locally differentially private (LDP) algorithms have recently emerged as the main tool that allows data collectors to estimate various population statistics, while preserving privacy. The guarantees provided by such algorithms are typically very strong for a single round of telemetry collection, but degrade rapidly when telemetry is collected regularly. In particular, existing LDP algorithms are not suitable for repeated collection of counter data such as daily app usage statistics. In this paper, we develop new LDP mechanisms geared towards repeated collection of counter data, with formal privacy guarantees even after being executed for an arbitrarily long period of time. For two basic analytical tasks, mean estimation and histogram estimation, our LDP mechanisms for repeated data collection provide estimates with comparable or even the same accuracy as existing single-round LDP collection mechanisms. We conduct empirical evaluation on real-world counter datasets to verify our theoretical results. Our mechanisms have been deployed by Microsoft to collect telemetry across millions of devices. △ Less

Submitted 5 December, 2017; originally announced December 2017.

Comments: To appear in NIPS 2017

arXiv:1710.10322 [pdf, other]

Maximally Recoverable LRCs: A field size lower bound and constructions for few heavy parities

Authors: Sivakanth Gopi, Venkatesan Guruswami, Sergey Yekhanin

Abstract: The explosion in the volumes of data being stored online has resulted in distributed storage systems transitioning to erasure coding based schemes. Local Reconstruction Codes (LRCs) have emerged as the codes of choice for these applications. These codes can correct a small number of erasures by accessing only a small number of remaining coordinates. An $(n,r,h,a,q)$-LRC is a linear code over… ▽ More The explosion in the volumes of data being stored online has resulted in distributed storage systems transitioning to erasure coding based schemes. Local Reconstruction Codes (LRCs) have emerged as the codes of choice for these applications. These codes can correct a small number of erasures by accessing only a small number of remaining coordinates. An $(n,r,h,a,q)$-LRC is a linear code over $\mathbb{F}_q$ of length $n$, whose codeword symbols are partitioned into $g=n/r$ local groups each of size $r$. It has $h$ global parity checks and each local group has $a$ local parity checks. Such an LRC is Maximally Recoverable (MR), if it corrects all erasure patterns which are information-theoretically correctable under the stipulated structure of local and global parity checks. We show the first non-trivial lower bounds on the field size required for MR LRCs. When $a,h$ are constant and the number of local groups $g \ge h$, while $r$ may grow with $n$, our lower bound simplifies to $q\ge Ω_{a,h}\left(n\cdot r^{\min\{a,h-2\}}\right).$ No superlinear (in $n$) lower bounds were known prior to this work for any setting of parameters. MR LRCs deployed in practice have a small number of global parities, typically $h=2,3$. We complement our lower bounds by giving constructions with small field size for $h\le 3$. For $h=2$, we give a linear field size construction. We also show a surprising application of elliptic curves and arithmetic progression free sets in the construction of MR LRCs. △ Less

Submitted 15 November, 2018; v1 submitted 27 October, 2017; originally announced October 2017.

Comments: Conference version to appear in Symposium on Discrete Algorithms (SODA) 2018

arXiv:1605.05412 [pdf, other]

Maximally Recoverable Codes for Grid-like Topologies

Authors: Parikshit Gopalan, Guangda Hu, Swastik Kopparty, Shubhangi Saraf, Carol Wang, Sergey Yekhanin

Abstract: The explosion in the volumes of data being stored online has resulted in distributed storage systems transitioning to erasure coding based schemes. Yet, the codes being deployed in practice are fairly short. In this work, we address what we view as the main coding theoretic barrier to deploying longer codes in storage: at large lengths, failures are not independent and correlated failures are inev… ▽ More The explosion in the volumes of data being stored online has resulted in distributed storage systems transitioning to erasure coding based schemes. Yet, the codes being deployed in practice are fairly short. In this work, we address what we view as the main coding theoretic barrier to deploying longer codes in storage: at large lengths, failures are not independent and correlated failures are inevitable. This motivates designing codes that allow quick data recovery even after large correlated failures, and which have efficient encoding and decoding. We propose that code design for distributed storage be viewed as a two-step process. The first step is choose a topology of the code, which incorporates knowledge about the correlated failures that need to be handled, and ensures local recovery from such failures. In the second step one specifies a code with the chosen topology by choosing coefficients from a finite field. In this step, one tries to balance reliability (which is better over larger fields) with encoding and decoding efficiency (which is better over smaller fields). This work initiates an in-depth study of this reliability/efficiency tradeoff. We consider the field-size needed for achieving maximal recoverability: the strongest reliability possible with a given topology. We propose a family of topologies called grid-like topologies which unify a number of topologies considered both in theory and practice, and prove a collection of results about maximally recoverable codes in such topologies including the first super-polynomial lower bound on the field size. △ Less

Submitted 20 September, 2016; v1 submitted 17 May, 2016; originally announced May 2016.

arXiv:1605.02290 [pdf, ps, other]

New Constructions of SD and MR Codes over Small Finite Fields

Authors: Guangda Hu, Sergey Yekhanin

Abstract: Data storage applications require erasure-correcting codes with prescribed sets of dependencies between data symbols and redundant symbols. The most common arrangement is to have $k$ data symbols and $h$ redundant symbols (that each depends on all data symbols) be partitioned into a number of disjoint groups, where for each group one allocates an additional (local) redundant symbol storing the par… ▽ More Data storage applications require erasure-correcting codes with prescribed sets of dependencies between data symbols and redundant symbols. The most common arrangement is to have $k$ data symbols and $h$ redundant symbols (that each depends on all data symbols) be partitioned into a number of disjoint groups, where for each group one allocates an additional (local) redundant symbol storing the parity of all symbols in the group. A code as above is maximally recoverable, if it corrects all erasure patterns that are information theoretically correctable given the dependency constraints. A slightly weaker guarantee is provided by SD codes. One key consideration in the design of MR and SD codes is the size of the finite field underlying the code as using small finite fields facilitates encoding and decoding operations. In this paper we present new explicit constructions of SD and MR codes over small finite fields. △ Less

Submitted 8 May, 2016; originally announced May 2016.

arXiv:1307.4150 [pdf, ps, other]

Explicit Maximally Recoverable Codes with Locality

Authors: Parikshit Gopalan, Cheng Huang, Bob Jenkins, Sergey Yekhanin

Abstract: Consider a systematic linear code where some (local) parity symbols depend on few prescribed symbols, while other (heavy) parity symbols may depend on all data symbols. Local parities allow to quickly recover any single symbol when it is erased, while heavy parities provide tolerance to a large number of simultaneous erasures. A code as above is maximally-recoverable if it corrects all erasure pat… ▽ More Consider a systematic linear code where some (local) parity symbols depend on few prescribed symbols, while other (heavy) parity symbols may depend on all data symbols. Local parities allow to quickly recover any single symbol when it is erased, while heavy parities provide tolerance to a large number of simultaneous erasures. A code as above is maximally-recoverable if it corrects all erasure patterns which are information theoretically recoverable given the code topology. In this paper we present explicit families of maximally-recoverable codes with locality. We also initiate the study of the trade-off between maximal recoverability and alphabet size. △ Less

Submitted 19 July, 2013; v1 submitted 15 July, 2013; originally announced July 2013.

MSC Class: 94B05

arXiv:1303.3921 [pdf, ps, other]

On the Locality of Codeword Symbols in Non-Linear Codes

Authors: Michael Forbes, Sergey Yekhanin

Abstract: Consider a possibly non-linear (n,K,d)_q code. Coordinate i has locality r if its value is determined by some r other coordinates. A recent line of work obtained an optimal trade-off between information locality of codes and their redundancy. Further, for linear codes meeting this trade-off, structure theorems were derived. In this work we give a new proof of the locality / redundancy trade-off an… ▽ More Consider a possibly non-linear (n,K,d)_q code. Coordinate i has locality r if its value is determined by some r other coordinates. A recent line of work obtained an optimal trade-off between information locality of codes and their redundancy. Further, for linear codes meeting this trade-off, structure theorems were derived. In this work we give a new proof of the locality / redundancy trade-off and generalize structure theorems to non-linear codes. △ Less

Submitted 15 March, 2013; originally announced March 2013.

arXiv:1106.3625 [pdf, ps, other]

On the Locality of Codeword Symbols

Authors: Parikshit Gopalan, Cheng Huang, Huseyin Simitci, Sergey Yekhanin

Abstract: Consider a linear [n,k,d]_q code C. We say that that i-th coordinate of C has locality r, if the value at this coordinate can be recovered from accessing some other r coordinates of C. Data storage applications require codes with small redundancy, low locality for information coordinates, large distance, and low locality for parity coordinates. In this paper we carry out an in-depth study of the r… ▽ More Consider a linear [n,k,d]_q code C. We say that that i-th coordinate of C has locality r, if the value at this coordinate can be recovered from accessing some other r coordinates of C. Data storage applications require codes with small redundancy, low locality for information coordinates, large distance, and low locality for parity coordinates. In this paper we carry out an in-depth study of the relations between these parameters. We establish a tight bound for the redundancy n-k in terms of the message length, the distance, and the locality of information coordinates. We refer to codes attaining the bound as optimal. We prove some structure theorems about optimal codes, which are particularly strong for small distances. This gives a fairly complete picture of the tradeoffs between codewords length, worst-case distance and locality of information symbols. We then consider the locality of parity check symbols and erasure correction beyond worst case distance for optimal codes. Using our structure theorem, we obtain a tight bound for the locality of parity symbols possible in such codes for a broad class of parameter settings. We prove that there is a tradeoff between having good locality for parity checks and the ability to correct erasures beyond the minimum distance. △ Less

Submitted 18 June, 2011; originally announced June 2011.

arXiv:1004.2294 [pdf, ps, other]

Sets with large additive energy and symmetric sets

Authors: Ilya Shkredov, Sergey Yekhanin

Abstract: We show that for any set A in a finite Abelian group G that has at least c |A|^3 solutions to a_1 + a_2 = a_3 + a_4, where a_i belong A there exist sets A' in A and L in G, |L| \ll c^{-1} log |A| such that A' is contained in Span of L and A' has approximately c |A|^3 solutions to a'_1 + a'_2 = a'_3 + a'_4, where a'_i belong A'. We also study so-called symmetric sets or, in other words, sets of la… ▽ More We show that for any set A in a finite Abelian group G that has at least c |A|^3 solutions to a_1 + a_2 = a_3 + a_4, where a_i belong A there exist sets A' in A and L in G, |L| \ll c^{-1} log |A| such that A' is contained in Span of L and A' has approximately c |A|^3 solutions to a'_1 + a'_2 = a'_3 + a'_4, where a'_i belong A'. We also study so-called symmetric sets or, in other words, sets of large values of convolution. △ Less

Submitted 13 April, 2010; originally announced April 2010.

arXiv:0704.1694 [pdf, ps, other]

Locally Decodable Codes From Nice Subsets of Finite Fields and Prime Factors of Mersenne Numbers

Authors: Kiran S. Kedlaya, Sergey Yekhanin

Abstract: A k-query Locally Decodable Code (LDC) encodes an n-bit message x as an N-bit codeword C(x), such that one can probabilistically recover any bit x_i of the message by querying only k bits of the codeword C(x), even after some constant fraction of codeword bits has been corrupted. The major goal of LDC related research is to establish the optimal trade-off between length and query complexity of s… ▽ More A k-query Locally Decodable Code (LDC) encodes an n-bit message x as an N-bit codeword C(x), such that one can probabilistically recover any bit x_i of the message by querying only k bits of the codeword C(x), even after some constant fraction of codeword bits has been corrupted. The major goal of LDC related research is to establish the optimal trade-off between length and query complexity of such codes. Recently [Y] introduced a novel technique for constructing locally decodable codes and vastly improved the upper bounds for code length. The technique is based on Mersenne primes. In this paper we extend the work of [Y] and argue that further progress via these methods is tied to progress on an old number theory question regarding the size of the largest prime factors of Mersenne numbers. Specifically, we show that every Mersenne number m=2^t-1 that has a prime factor p>m^γyields a family of k(γ)-query locally decodable codes of length Exp(n^{1/t}). Conversely, if for some fixed k and all ε> 0 one can use the technique of [Y] to obtain a family of k-query LDCs of length Exp(n^ε); then infinitely many Mersenne numbers have prime factors arger than known currently. △ Less

Submitted 13 April, 2007; originally announced April 2007.

Comments: 18 pages

arXiv:cs/0408017 [pdf, ps, other]

Improved Upper Bound for the Redundancy of Fix-Free Codes

Authors: Sergey Yekhanin

Abstract: A variable-length code is a fix-free code if no codeword is a prefix or a suffix of any other codeword. In a fix-free code any finite sequence of codewords can be decoded in both directions, which can improve the robustness to channel noise and speed up the decoding process. In this paper we prove a new sufficient condition of the existence of fix-free codes and improve the upper bound on the re… ▽ More A variable-length code is a fix-free code if no codeword is a prefix or a suffix of any other codeword. In a fix-free code any finite sequence of codewords can be decoded in both directions, which can improve the robustness to channel noise and speed up the decoding process. In this paper we prove a new sufficient condition of the existence of fix-free codes and improve the upper bound on the redundancy of optimal fix-free codes. △ Less

Submitted 5 August, 2004; originally announced August 2004.

arXiv:cs/0406039 [pdf, ps, other]

Long Nonbinary Codes Exceeding the Gilbert - Varshamov Bound for any Fixed Distance

Authors: Sergey Yekhanin, Ilya Dumer

Abstract: Let A(q,n,d) denote the maximum size of a q-ary code of length n and distance d. We study the minimum asymptotic redundancy ρ(q,n,d)=n-log_q A(q,n,d) as n grows while q and d are fixed. For any d and q<=d-1, long algebraic codes are designed that improve on the BCH codes and have the lowest asymptotic redundancy ρ(q,n,d) <= ((d-3)+1/(d-2)) log_q n known to date. Prior to this work, codes of fixe… ▽ More Let A(q,n,d) denote the maximum size of a q-ary code of length n and distance d. We study the minimum asymptotic redundancy ρ(q,n,d)=n-log_q A(q,n,d) as n grows while q and d are fixed. For any d and q<=d-1, long algebraic codes are designed that improve on the BCH codes and have the lowest asymptotic redundancy ρ(q,n,d) <= ((d-3)+1/(d-2)) log_q n known to date. Prior to this work, codes of fixed distance that asymptotically surpass BCH codes and the Gilbert-Varshamov bound were designed only for distances 4,5 and 6. △ Less

Submitted 23 June, 2004; v1 submitted 21 June, 2004; originally announced June 2004.

Comments: Submitted to IEEE Trans. on Info. Theory

Showing 1–22 of 22 results for author: Yekhanin, S