-
Guaranteed Recovery of Unambiguous Clusters
Authors:
Kayvon Mazooji,
Ilan Shomorony
Abstract:
Clustering is often a challenging problem because of the inherent ambiguity in what the "correct" clustering should be. Even when the number of clusters $K$ is known, this ambiguity often still exists, particularly when there is variation in density among different clusters, and clusters have multiple relatively separated regions of high density. In this paper we propose an information-theoretic characterization of when a $K$-clustering is ambiguous, and design an algorithm that recovers the clustering whenever it is unambiguous. This characterization formalizes the situation when two high density regions within a cluster are separable enough that they look more like two distinct clusters than two truly distinct clusters in the $K$-clustering. The algorithm first identifies $K$ partial clusters (or "seeds") using a density-based approach, and then adds unclustered points to the initial $K$ partial clusters in a greedy manner to form a complete clustering. We implement and test a version of the algorithm that is modified to effectively handle overlapping clusters, and observe that it requires little parameter selection and displays improved performance on many datasets compared to widely used algorithms for non-convex cluster recovery.
Submitted 7 May, 2025; v1 submitted 22 January, 2025;
originally announced January 2025.
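As a rough illustration of the second stage described in the abstract above (adding unclustered points to the $K$ seeds greedily), here is a minimal sketch. The attachment rule used (always label the unclustered point closest to an already-labeled point) is an assumption chosen for illustration, not necessarily the rule used in the paper, and the seed-finding stage is taken as given. The name `greedy_expand` is hypothetical.

```python
import math


def greedy_expand(points, seeds):
    """Grow partial clusters into a full clustering.

    points: list of coordinate tuples.
    seeds:  dict mapping point index -> cluster label (the partial clusters).
    Assumed rule: repeatedly take the unlabeled point that is closest to any
    labeled point and give it that point's label.
    """
    if not seeds:
        raise ValueError("at least one seed point is required")
    labels = dict(seeds)
    unassigned = set(range(len(points))) - set(labels)
    while unassigned:
        # Smallest (distance, unlabeled index, labeled index) triple.
        _, u, a = min(
            (math.dist(points[u], points[a]), u, a)
            for u in unassigned
            for a in labels
        )
        labels[u] = labels[a]
        unassigned.remove(u)
    return [labels[i] for i in range(len(points))]


# Two well-separated groups on a line, one seed point in each.
pts = [(0, 0), (0.5, 0), (1, 0), (5, 0), (5.5, 0), (6, 0)]
print(greedy_expand(pts, {0: 0, 5: 1}))
```

This brute-force loop is cubic in the number of points and is meant only to make the greedy step concrete.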
-
An Instance-Based Approach to the Trace Reconstruction Problem
Authors:
Kayvon Mazooji,
Ilan Shomorony
Abstract:
In the trace reconstruction problem, one observes the output of passing a binary string $s \in \{0,1\}^n$ through a deletion channel $T$ times and wishes to recover $s$ from the resulting $T$ "traces." Most of the literature has focused on characterizing the hardness of this problem in terms of the number of traces $T$ needed for perfect reconstruction either in the worst case or in the average case (over input sequences $s$). In this paper, we propose an alternative, instance-based approach to the problem. We define the "Levenshtein difficulty" of a problem instance $(s,T)$ as the probability that the resulting traces do not provide enough information for correct recovery with full certainty. One can then try to characterize, for a specific $s$, how $T$ needs to scale in order for the Levenshtein difficulty to go to zero, and seek reconstruction algorithms that match this scaling for each $s$. We derive a lower bound on the Levenshtein difficulty, and prove that $T$ needs to scale exponentially fast in $n$ for the Levenshtein difficulty to approach zero for a very broad class of strings. For a class of binary strings with alternating long runs, we design an algorithm whose probability of reconstruction error approaches zero whenever the Levenshtein difficulty approaches zero. For this class, we also prove that the error probability of this algorithm decays to zero at least as fast as the Levenshtein difficulty.
Submitted 3 November, 2024; v1 submitted 25 January, 2024;
originally announced January 2024.
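To make the setup in the abstract above concrete, the following sketch simulates passing a binary string $s$ through a deletion channel $T$ times. It assumes the standard i.i.d. model in which each bit is deleted independently with probability `q`; the parameter `q`, the seed, and the function names are illustrative assumptions not taken from the abstract.

```python
import random


def deletion_channel(s, q, rng):
    """Delete each symbol of s independently with probability q (assumed model)."""
    return "".join(bit for bit in s if rng.random() >= q)


def sample_traces(s, T, q, seed=0):
    """Return T independent traces of s."""
    rng = random.Random(seed)
    return [deletion_channel(s, q, rng) for _ in range(T)]


def is_subsequence(t, s):
    """True if t can be obtained from s by deleting symbols."""
    it = iter(s)
    return all(ch in it for ch in t)


s = "0110100111"
traces = sample_traces(s, T=5, q=0.3)
# Every trace is a subsequence of the source string.
print(traces, all(is_subsequence(t, s) for t in traces))
```

A reconstruction algorithm would take `traces` as its only input; none is attempted here.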
-
Substring Density Estimation from Traces
Authors:
Kayvon Mazooji,
Ilan Shomorony
Abstract:
In the trace reconstruction problem, one seeks to reconstruct a binary string $s$ from a collection of traces, each of which is obtained by passing $s$ through a deletion channel. It is known that $\exp(\tilde O(n^{1/5}))$ traces suffice to reconstruct any length-$n$ string with high probability. We consider a variant of the trace reconstruction problem where the goal is to recover a "density map" that indicates the locations of each length-$k$ substring throughout $s$. We show that $ε^{-2}\cdot \text{poly}(n)$ traces suffice to recover the density map with error at most $ε$. As a result, when restricted to a set of source strings whose minimum "density map distance" is at least $1/\text{poly}(n)$, the trace reconstruction problem can be solved with polynomially many traces.
Submitted 19 October, 2022;
originally announced October 2022.
-
Private DNA Sequencing: Hiding Information in Discrete Noise
Authors:
Kayvon Mazooji,
Roy Dong,
Ilan Shomorony
Abstract:
When an individual's DNA is sequenced, sensitive medical information becomes available to the sequencing laboratory. A recently proposed way to hide an individual's genetic information is to mix in DNA samples of other individuals. We assume that the genetic content of these samples is known to the individual but unknown to the sequencing laboratory. Thus, these DNA samples act as "noise" to the sequencing laboratory, but still allow the individual to recover their own DNA samples afterward. Motivated by this idea, we study the problem of hiding a binary random variable $X$ (a genetic marker) with the additive noise provided by mixing DNA samples, using mutual information as a privacy metric. This is equivalent to the problem of finding a worst-case noise distribution for recovering $X$ from the noisy observation among a set of feasible discrete distributions. We characterize upper and lower bounds to the solution of this problem, which are empirically shown to be very close. The lower bound is obtained through a convex relaxation of the original discrete optimization problem, and yields a closed-form expression. The upper bound is computed via a greedy algorithm for selecting the mixing proportions.
Submitted 3 November, 2024; v1 submitted 28 January, 2021;
originally announced January 2021.
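The privacy metric named in the abstract above, the mutual information between a binary marker $X$ and its noisy observation, can be computed directly for small cases. The sketch below assumes $X \sim \text{Bernoulli}(p)$, integer-valued noise $Z$ independent of $X$, and observation $Y = X + Z$; the independence assumption, the example noise distributions, and the function name are illustrative and are not taken from the paper.

```python
import math


def mutual_information_bits(p, noise_pmf):
    """I(X; X + Z) in bits.

    X ~ Bernoulli(p); Z takes value z with probability noise_pmf[z]
    and is assumed independent of X.
    """
    px = [1.0 - p, p]
    m = len(noise_pmf)
    # joint[x][y] = P(X = x, Y = y) with Y = X + Z in {0, ..., m}.
    joint = [[0.0] * (m + 1) for _ in range(2)]
    for x in (0, 1):
        for z, pz in enumerate(noise_pmf):
            joint[x][x + z] += px[x] * pz
    py = [joint[0][y] + joint[1][y] for y in range(m + 1)]
    mi = 0.0
    for x in (0, 1):
        for y in range(m + 1):
            if joint[x][y] > 0:
                mi += joint[x][y] * math.log2(joint[x][y] / (px[x] * py[y]))
    return mi


# No noise: the observation reveals X completely.
print(mutual_information_bits(0.5, [1.0]))
# Noise spread uniformly over 100 values: very little leaks.
print(mutual_information_bits(0.5, [0.01] * 100))
```

Finding the noise distribution that minimizes this quantity over a feasible set is the optimization problem the abstract refers to; the sketch only evaluates the objective.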
-
On Unique Decoding from Insertions and Deletions
Authors:
Kayvon Mazooji
Abstract:
In this paper, we study how often unique decoding from $t$ insertions or $t$ deletions occurs for error-correcting codes. Insertions and deletions frequently occur in synchronization problems and DNA, a medium which is beginning to be used for long-term data storage.
We define natural probabilistic channels that make $t$ insertions or $t$ deletions, and study the probability of unique decoding. Our most substantial contribution is the derivation of tight upper bounds on the probability of unique decoding for messages passed through these channels. We also consider other aspects of the problem, and derive improved upper bounds for linear codes and VT-codes.
Submitted 28 September, 2017; v1 submitted 28 November, 2016;
originally announced November 2016.
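The notion of unique decoding from deletions in the abstract above can be checked by brute force on a tiny code. The sketch below treats a received word as uniquely decodable when exactly one codeword contains it as a subsequence, and uses the binary Varshamov-Tenengolts code with length 4 and parameter 0 as the example. It enumerates deletion patterns exhaustively rather than modeling the probabilistic channels defined in the paper, and all function names are hypothetical.

```python
from itertools import combinations, product


def vt_code(n, a=0):
    """Binary VT code: words x with sum(i * x_i) congruent to a mod (n + 1)."""
    return [
        "".join(map(str, x))
        for x in product((0, 1), repeat=n)
        if sum(i * b for i, b in enumerate(x, start=1)) % (n + 1) == a
    ]


def deletion_ball(c, t):
    """All distinct words obtained from c by deleting exactly t symbols."""
    n = len(c)
    return {"".join(c[i] for i in keep) for keep in combinations(range(n), n - t)}


def is_subsequence(y, c):
    it = iter(c)
    return all(ch in it for ch in y)


def candidates(code, y):
    """Codewords that could have produced y through deletions."""
    return [c for c in code if is_subsequence(y, c)]


code = vt_code(4)
# One deletion: every received word has a single candidate codeword.
print(all(len(candidates(code, y)) == 1 for c in code for y in deletion_ball(c, 1)))
# Two deletions: "00" is consistent with several codewords.
print(candidates(code, "00"))
```

Counting how often `candidates` returns a single codeword, weighted by a channel's output distribution, would give a unique-decoding probability of the kind the abstract studies.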