Search | arXiv e-print repository

Guaranteed Recovery of Unambiguous Clusters

Abstract: Clustering is often a challenging problem because of the inherent ambiguity in what the "correct" clustering should be. Even when the number of clusters $K$ is known, this ambiguity often still exists, particularly when there is variation in density among different clusters, and clusters have multiple relatively separated regions of high density. In this paper we propose an information-theoretic characterization of when a $K$-clustering is ambiguous, and design an algorithm that recovers the clustering whenever it is unambiguous. This characterization formalizes the situation when two high density regions within a cluster are separable enough that they look more like two distinct clusters than two truly distinct clusters in the $K$-clustering. The algorithm first identifies $K$ partial clusters (or "seeds") using a density-based approach, and then adds unclustered points to the initial $K$ partial clusters in a greedy manner to form a complete clustering. We implement and test a version of the algorithm that is modified to effectively handle overlapping clusters, and observe that it requires little parameter selection and displays improved performance on many datasets compared to widely used algorithms for non-convex cluster recovery.

Submitted 7 May, 2025; v1 submitted 22 January, 2025; originally announced January 2025.

Comments: 12 pages, includes minor changes and some new content compared to previous version

arXiv:2401.14277 [pdf, ps, other]

An Instance-Based Approach to the Trace Reconstruction Problem

Authors: Kayvon Mazooji, Ilan Shomorony

Abstract: In the trace reconstruction problem, one observes the output of passing a binary string $s \in \{0,1\}^n$ through a deletion channel $T$ times and wishes to recover $s$ from the resulting $T$ "traces." Most of the literature has focused on characterizing the hardness of this problem in terms of the number of traces $T$ needed for perfect reconstruction either in the worst case or in the average case (over input sequences $s$). In this paper, we propose an alternative, instance-based approach to the problem. We define the "Levenshtein difficulty" of a problem instance $(s,T)$ as the probability that the resulting traces do not provide enough information for correct recovery with full certainty. One can then try to characterize, for a specific $s$, how $T$ needs to scale in order for the Levenshtein difficulty to go to zero, and seek reconstruction algorithms that match this scaling for each $s$. We derive a lower bound on the Levenshtein difficulty, and prove that $T$ needs to scale exponentially fast in $n$ for the Levenshtein difficulty to approach zero for a very broad class of strings. For a class of binary strings with alternating long runs, we design an algorithm whose probability of reconstruction error approaches zero whenever the Levenshtein difficulty approaches zero. For this class, we also prove that the error probability of this algorithm decays to zero at least as fast as the Levenshtein difficulty.

Submitted 3 November, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

Comments: 7 pages, part of this paper was presented at the 58th Annual Conference on Information Sciences and Systems (CISS 2024), funding information added in updated document, an error in the presentation of the main results in the CISS 2024 version of the paper is fixed in the updated document
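
To make the setting in the abstract above concrete, here is a minimal sketch of generating traces from a deletion channel. It assumes the common i.i.d. model in which each bit is deleted independently with probability `p`; the abstract does not state the channel parameters, and the names `deletion_channel`, `sample_traces`, and `is_subsequence` are hypothetical, not taken from the paper.

```python
import random


def deletion_channel(s: str, p: float, rng: random.Random) -> str:
    """Return one trace of s: each symbol is deleted independently
    with probability p (assumed i.i.d. deletion model)."""
    return "".join(ch for ch in s if rng.random() >= p)


def sample_traces(s: str, T: int, p: float, seed: int = 0) -> list[str]:
    """Pass s through the deletion channel T times, giving T traces."""
    rng = random.Random(seed)
    return [deletion_channel(s, p, rng) for _ in range(T)]


def is_subsequence(t: str, s: str) -> bool:
    """Check that t can be obtained from s by deletions only."""
    it = iter(s)
    return all(ch in it for ch in t)


if __name__ == "__main__":
    source = "0011100011"
    for trace in sample_traces(source, T=5, p=0.2):
        print(trace, is_subsequence(trace, source))
```

Every trace is a subsequence of the source string, which is why several different source strings can be consistent with the same set of traces.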

arXiv:2210.10917 [pdf, other]

Substring Density Estimation from Traces

Authors: Kayvon Mazooji, Ilan Shomorony

Abstract: In the trace reconstruction problem, one seeks to reconstruct a binary string $s$ from a collection of traces, each of which is obtained by passing $s$ through a deletion channel. It is known that $\exp(\tilde O(n^{1/5}))$ traces suffice to reconstruct any length-$n$ string with high probability. We consider a variant of the trace reconstruction problem where the goal is to recover a "density map" that indicates the locations of each length-$k$ substring throughout $s$. We show that $ε^{-2}\cdot \text{poly}(n)$ traces suffice to recover the density map with error at most $ε$. As a result, when restricted to a set of source strings whose minimum "density map distance" is at least $1/\text{poly}(n)$, the trace reconstruction problem can be solved with polynomially many traces.

Submitted 19 October, 2022; originally announced October 2022.

Comments: 22 pages, 3 figures

arXiv:2101.12124 [pdf, other]

Private DNA Sequencing: Hiding Information in Discrete Noise

Authors: Kayvon Mazooji, Roy Dong, Ilan Shomorony

Abstract: When an individual's DNA is sequenced, sensitive medical information becomes available to the sequencing laboratory. A recently proposed way to hide an individual's genetic information is to mix in DNA samples of other individuals. We assume that the genetic content of these samples is known to the individual but unknown to the sequencing laboratory. Thus, these DNA samples act as "noise" to the sequencing laboratory, but still allow the individual to recover their own DNA samples afterward. Motivated by this idea, we study the problem of hiding a binary random variable $X$ (a genetic marker) with the additive noise provided by mixing DNA samples, using mutual information as a privacy metric. This is equivalent to the problem of finding a worst-case noise distribution for recovering $X$ from the noisy observation among a set of feasible discrete distributions. We characterize upper and lower bounds to the solution of this problem, which are empirically shown to be very close. The lower bound is obtained through a convex relaxation of the original discrete optimization problem, and yields a closed-form expression. The upper bound is computed via a greedy algorithm for selecting the mixing proportions.

Submitted 3 November, 2024; v1 submitted 28 January, 2021; originally announced January 2021.

Comments: 22 pages, 7 figures, shorter version appeared in proceedings of the 2020 IEEE Information Theory Workshop (ITW), new results and explanations added in updated arXiv document
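
As an illustration of the privacy metric mentioned in the abstract above, the following sketch computes the mutual information $I(X;Y)$ for $Y = X + Z$, where $X$ is a binary variable and $Z$ is independent discrete noise. The Bernoulli prior, the integer-valued noise pmf, and the function name `mutual_information_additive` are assumptions made for illustration; this is not the paper's optimization procedure, only the quantity it minimizes evaluated for one fixed noise distribution.

```python
from math import log2


def mutual_information_additive(p_x1: float, noise_pmf: dict[int, float]) -> float:
    """I(X; Y) in bits for Y = X + Z, with X ~ Bernoulli(p_x1) and
    Z independent of X with pmf noise_pmf (integer value -> probability)."""
    p_x = {0: 1.0 - p_x1, 1: p_x1}

    # Marginal of Y: convolution of the pmfs of X and Z.
    p_y: dict[int, float] = {}
    for x, px in p_x.items():
        for z, pz in noise_pmf.items():
            p_y[x + z] = p_y.get(x + z, 0.0) + px * pz

    # I(X;Y) = sum over (x, y) of p(x, y) * log2(p(y | x) / p(y)),
    # where p(y | x) = P(Z = y - x) and p(x, y) = p(x) * P(Z = y - x).
    mi = 0.0
    for x, px in p_x.items():
        for z, pz in noise_pmf.items():
            joint = px * pz
            if joint > 0:
                mi += joint * log2(pz / p_y[x + z])
    return mi


if __name__ == "__main__":
    print(mutual_information_additive(0.5, {0: 1.0}))          # no noise
    print(mutual_information_additive(0.5, {0: 0.5, 1: 0.5}))  # uniform noise on {0, 1}
```

With no noise the observation reveals $X$ completely; adding noise lowers the mutual information, which is the sense in which the mixed-in samples hide the marker.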

arXiv:1611.09073 [pdf, other]

On Unique Decoding from Insertions and Deletions

Authors: Kayvon Mazooji

Abstract: In this paper, we study how often unique decoding from $t$ insertions or $t$ deletions occurs for error correcting codes. Insertions and deletions frequently occur in synchronization problems and DNA, a medium which is beginning to be used for long term data storage. We define natural probabilistic channels that make $t$ insertions or $t$ deletions, and study the probability of unique decoding. Our most substantial contribution is the derivation of tight upper bounds on the probability of unique decoding for messages passed through these channels. We also consider other aspects of the problem, and derive improved upper bounds for linear codes and VT-codes.

Submitted 28 September, 2017; v1 submitted 28 November, 2016; originally announced November 2016.

Comments: 12 pages, 4 figures, study of deletion channel added (upper bounds, asymptotics, etc.), tight upper bound on probability of unique decoding for uniform t-insertion channel added (conjecture from previous version proved true), improved upper bounds for VT codes and linear codes added, improved asymptotic analysis

Showing 1–5 of 5 results for author: Mazooji, K