-
Better space-time-robustness trade-offs for set reconciliation
Authors:
Djamal Belazzougui,
Gregory Kucherov,
Stefan Walzer
Abstract:
We consider the problem of reconstructing the symmetric difference between similar sets from their representations (sketches) of size linear in the number of differences. Exact solutions to this problem are based on error-correcting coding techniques and suffer from a large decoding time. Existing probabilistic solutions based on Invertible Bloom Lookup Tables (IBLTs) are time-efficient but offer insufficient success guarantees for many applications. Here we propose a tunable trade-off between the two approaches, combining the efficiency of IBLTs with exponentially decreasing failure probability. The proof relies on a refined analysis of IBLTs proposed in (Baek Tejs Houen et al., SOSA 2023), which is of independent interest. We also propose a modification of our algorithm that enables telling apart the elements of each set in the symmetric difference.
Submitted 15 April, 2024;
originally announced April 2024.
-
Count-min sketch with variable number of hash functions: an experimental study
Authors:
Éric Fusy,
Gregory Kucherov
Abstract:
Conservative Count-Min, an improved version of Count-Min sketch [Cormode, Muthukrishnan 2005], is an online-maintained hashing-based data structure summarizing element frequency information without storing elements themselves. Although several works attempted to analyze the error that can be made by Count-Min, the behavior of this data structure remains poorly understood. In [Fusy, Kucherov 2022], we demonstrated that under the uniform distribution of input elements, the error of conservative Count-Min follows two distinct regimes depending on its load factor.
In this work, we present a series of experimental results that provide new insights into the behavior of conservative Count-Min. Our contributions are twofold. On the one hand, we provide a detailed experimental analysis of the behavior of Count-Min sketch in different regimes and under several representative probability distributions of input elements. On the other hand, we demonstrate improvements that can be made by assigning a variable number of hash functions to different elements. This includes, in particular, reduced space of the data structure while still supporting a small error.
Submitted 7 September, 2023; v1 submitted 10 February, 2023;
originally announced February 2023.
-
Phase transition in count approximation by Count-Min sketch with conservative updates
Authors:
Éric Fusy,
Gregory Kucherov
Abstract:
Count-Min sketch is a hash-based data structure to represent a dynamically changing associative array of counters. Here we analyse the counting version of Count-Min under a stronger update rule known as \textit{conservative update}, assuming the uniform distribution of input keys. We show that the accuracy of the conservative update strategy undergoes a phase transition, depending on the number of distinct keys in the input as a fraction of the size of the Count-Min array. We prove that below the threshold, the relative error is asymptotically $o(1)$ (as opposed to the regular Count-Min strategy), whereas above the threshold, the relative error is $Θ(1)$. The threshold corresponds to the peelability threshold of random $k$-uniform hypergraphs. We demonstrate that even for a small number of keys, peelability of the underlying hypergraph is a crucial property to ensure the $o(1)$ error. Finally, we provide experimental evidence that the phase transition does not extend to non-uniform distributions, in particular to the popular Zipf's distribution.
Submitted 14 July, 2022; v1 submitted 29 March, 2022;
originally announced March 2022.
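The conservative update rule named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the class name `ConservativeCountMin`, the single shared array of `size` counters, and the salted BLAKE2b hashing are all assumptions chosen for illustration. On each insertion, only the counters that currently hold the minimum among the key's `k` cells are raised.

```python
import hashlib


class ConservativeCountMin:
    """Illustrative Count-Min array with the conservative update rule.

    Assumption: all k hash functions address one shared array of `size`
    counters; the hashing scheme below is a stand-in, not a prescribed one.
    """

    def __init__(self, size: int, k: int):
        self.size = size
        self.k = k
        self.cells = [0] * size

    def _positions(self, key: str) -> list[int]:
        # One salted BLAKE2b digest per hash function (hypothetical choice).
        positions = []
        for i in range(self.k):
            digest = hashlib.blake2b(
                key.encode(), salt=i.to_bytes(8, "little")
            ).digest()
            positions.append(int.from_bytes(digest[:8], "little") % self.size)
        return positions

    def estimate(self, key: str) -> int:
        # The estimate is the minimum over the key's cells.
        return min(self.cells[p] for p in self._positions(key))

    def add(self, key: str) -> None:
        # Conservative update: raise only the cells below (current min + 1),
        # instead of incrementing every cell as regular Count-Min does.
        positions = self._positions(key)
        target = min(self.cells[p] for p in positions) + 1
        for p in positions:
            if self.cells[p] < target:
                self.cells[p] = target


if __name__ == "__main__":
    sketch = ConservativeCountMin(size=16, k=3)
    for key in ["a", "b", "a", "c", "a"]:
        sketch.add(key)
    print(sketch.estimate("a"))  # never below the true count of "a"
```

As with regular Count-Min, the estimate can overshoot the true count but never falls below it; the conservative rule only reduces how much collisions inflate the counters.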
-
Efficient tree-structured categorical retrieval
Authors:
Djamal Belazzougui,
Gregory Kucherov
Abstract:
We study a document retrieval problem in the new framework where $D$ text documents are organized in a {\em category tree} with a pre-defined number $h$ of categories. This situation occurs e.g. with taxonomic trees in biology or subject classification systems for scientific literature. Given a string pattern $p$ and a category (level in the category tree), we wish to efficiently retrieve the $t$ \emph{categorical units} containing this pattern and belonging to the category. We propose several efficient solutions for this problem. One of them uses $n(\log σ(1+o(1))+\log D+O(h)) + O(Δ)$ bits of space and $O(|p|+t)$ query time, where $n$ is the total length of the documents, $σ$ is the size of the alphabet used in the documents and $Δ$ is the total number of nodes in the category tree. Another solution uses $n(\log σ(1+o(1))+O(\log D))+O(Δ)+O(D\log n)$ bits of space and $O(|p|+t\log D)$ query time. We finally propose other solutions which are more space-efficient at the expense of a slight increase in query time.
Submitted 2 June, 2020;
originally announced June 2020.
-
Optimal searching of gapped repeats in a word
Authors:
Maxime Crochemore,
Roman Kolpakov,
Gregory Kucherov
Abstract:
Following (Kolpakov et al., 2013; Gawrychowski and Manea, 2015), we continue the study of {\em $α$-gapped repeats} in strings, defined as factors $uvu$ with $|uv|\leq α|u|$. Our main result is the $O(αn)$ bound on the number of {\em maximal} $α$-gapped repeats in a string of length $n$, previously proved to be $O(α^2 n)$ in (Kolpakov et al., 2013). For a closely related notion of maximal $δ$-subrepetition (maximal factors of exponent between $1+δ$ and $2$), our result implies the $O(n/δ)$ bound on their number, which improves the bound of (Kolpakov et al., 2010) by a $\log n$ factor.
We also prove an algorithmic time bound $O(αn+S)$ (where $S$ is the size of the output) for computing all maximal $α$-gapped repeats. Our solution, inspired by (Gawrychowski and Manea, 2015), is different from the recently published proof by (Tanimura et al., 2015) of the same bound. Together with our bound on $S$, this implies an $O(αn)$-time algorithm for computing all maximal $α$-gapped repeats.
Submitted 2 October, 2015; v1 submitted 3 September, 2015;
originally announced September 2015.
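To make the definition of an $α$-gapped repeat concrete, here is a brute-force Python sketch that lists every factor $uvu$ with $|uv|\leq α|u|$. It is only a reading aid for the definition, not the $O(αn)$ algorithm of the paper: the function name `gapped_repeats` is hypothetical, maximality is not modelled, and allowing an empty gap $v$ is an assumption.

```python
def gapped_repeats(s: str, alpha: float) -> list[tuple[int, int, int]]:
    """Return (start, |u|, |uv|) for every factor uvu of s with |uv| <= alpha*|u|.

    Assumptions: u is non-empty, v may be empty, and all (not only maximal)
    repeats are reported. Runs in cubic time or worse; for illustration only.
    """
    n = len(s)
    found = []
    for start in range(n):
        for u_len in range(1, n):
            # period = |uv| is the distance between the two copies of u.
            for period in range(u_len, n):
                if start + period + u_len > n:
                    break  # the right copy of u would run past the end
                if period > alpha * u_len:
                    break  # the gap is too long for this alpha
                left = s[start:start + u_len]
                right = s[start + period:start + period + u_len]
                if left == right:
                    found.append((start, u_len, period))
    return found


if __name__ == "__main__":
    # "abcab" = u v u with u = "ab", v = "c": |uv| = 3 <= 1.5 * 2.
    print(gapped_repeats("abcab", 1.5))
```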
-
On Maximal Unbordered Factors
Authors:
Gregory Kucherov,
Alexander Loptev,
Tatiana Starikovskaya
Abstract:
Given a string $S$ of length $n$, its maximal unbordered factor is the longest factor which does not have a border. In this work we investigate the relationship between $n$ and the length of the maximal unbordered factor of $S$. We prove that for an alphabet of size $σ\ge 5$ the expected length of the maximal unbordered factor of a string of length $n$ is at least $0.99 n$ (for sufficiently large values of $n$). As an application of this result, we propose a new algorithm for computing the maximal unbordered factor of a string.
Submitted 28 April, 2015;
originally announced April 2015.
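The notion of a maximal unbordered factor can be checked on small inputs with the quadratic-time Python baseline below. It is not the algorithm proposed in the paper; the function names are hypothetical. It relies on the standard fact that a string has a border exactly when the last value of its prefix function is non-zero.

```python
def prefix_function(s: str) -> list[int]:
    """pi[i] = length of the longest border of s[:i+1] (classic KMP table)."""
    pi = [0] * len(s)
    for i in range(1, len(s)):
        k = pi[i - 1]
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi


def maximal_unbordered_factor(s: str) -> str:
    """Return a longest factor of s that has no border (brute-force baseline)."""
    best = ""
    for start in range(len(s)):
        suffix = s[start:]
        if len(suffix) <= len(best):
            break  # no later start can yield a longer factor
        pi = prefix_function(suffix)
        # A prefix of `suffix` of the given length is unbordered
        # exactly when pi[length - 1] == 0; try the longest first.
        for length in range(len(suffix), len(best), -1):
            if pi[length - 1] == 0:
                best = suffix[:length]
                break
    return best


if __name__ == "__main__":
    print(maximal_unbordered_factor("abab"))
    print(maximal_unbordered_factor("aab"))
```

When several factors share the maximal length, this sketch returns the leftmost one.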
-
Spaced seeds improve k-mer-based metagenomic classification
Authors:
Karel Brinda,
Maciej Sykulski,
Gregory Kucherov
Abstract:
Metagenomics is a powerful approach to study the genetic content of environmental samples that has been strongly promoted by NGS technologies. To cope with the massive data involved in modern metagenomic projects, recent tools [4, 39] rely on the analysis of k-mers shared between the read to be classified and sampled reference genomes. Within this general framework, we show in this work that spaced seeds provide a significant improvement of classification accuracy as opposed to traditional contiguous k-mers. We support this thesis through a series of different computational experiments, including simulations of large-scale metagenomic projects. Scripts and programs used in this study, as well as supplementary material, are available from http://github.com/gregorykucherov/spaced-seeds-for-metagenomics.
Submitted 9 July, 2015; v1 submitted 22 February, 2015;
originally announced February 2015.
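The difference between contiguous k-mers and spaced seeds can be sketched in a few lines of Python. This is an illustration only and is unrelated to the scripts in the repository above: the function name `spaced_kmers` and the encoding of a seed as a string over `1` (position kept) and `0` (position ignored) are assumptions.

```python
def spaced_kmers(sequence: str, seed: str) -> list[str]:
    """Extract the spaced k-mers of `sequence` selected by `seed`.

    Assumption: `seed` is a string over {'1', '0'}; characters under '1'
    are kept, characters under '0' are skipped. A seed made only of '1's
    yields ordinary contiguous k-mers.
    """
    kept = [i for i, c in enumerate(seed) if c == "1"]
    span = len(seed)
    return [
        "".join(sequence[start + i] for i in kept)
        for start in range(len(sequence) - span + 1)
    ]


if __name__ == "__main__":
    print(spaced_kmers("ACGTAC", "111"))   # contiguous 3-mers
    print(spaced_kmers("ACGTAC", "1101"))  # spaced seed of weight 3
    # A mismatch under the '0' position does not change the spaced k-mer:
    print(spaced_kmers("CGTAC", "11011") == spaced_kmers("CGAAC", "11011"))
```

The last line shows the basic effect exploited by spaced seeds: a substitution falling on an ignored position leaves the extracted k-mer unchanged, so the two sequences still share it.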
-
Subset seed automaton
Authors:
Gregory Kucherov,
Laurent Noé,
Mikhail Roytberg
Abstract:
We study the pattern matching automaton introduced in (A unifying framework for seed sensitivity and its application to subset seeds) for the purpose of seed-based similarity search. We show that our definition provides a compact automaton, much smaller than the one obtained by applying the Aho-Corasick construction. We study properties of this automaton and present an efficient implementation of the automaton construction. We also present some experimental results and show that this automaton can be successfully applied to more general situations.
Submitted 18 August, 2014;
originally announced August 2014.
-
Approximate String Matching using a Bidirectional Index
Authors:
Gregory Kucherov,
Kamil Salikhov,
Dekel Tsur
Abstract:
We study strategies of approximate pattern matching that exploit bidirectional text indexes, extending and generalizing ideas of Lam et al. We introduce a formalism, called search schemes, to specify search strategies of this type, then develop a probabilistic measure for the efficiency of a search scheme, prove several combinatorial results on efficient search schemes, and finally, provide experimental computations supporting the superiority of our strategies.
Submitted 6 September, 2015; v1 submitted 5 October, 2013;
originally announced October 2013.
-
Using cascading Bloom filters to improve the memory usage for de Brujin graphs
Authors:
Kamil Salikhov,
Gustavo Sacomoto,
Gregory Kucherov
Abstract:
De Bruijn graphs are widely used in bioinformatics for processing next-generation sequencing data. Due to the very large size of NGS datasets, it is essential to represent de Bruijn graphs compactly, and several approaches to this problem have been proposed recently. In this work, we show how to reduce the memory required by the algorithm of [3] that represents de Bruijn graphs using Bloom filters. Our method requires 30% to 40% less memory with respect to the method of [3], with insignificant impact on construction time. At the same time, our experiments showed a better query time compared to [3]. This is, to our knowledge, the best practical representation for de Bruijn graphs.
Submitted 21 May, 2013; v1 submitted 28 February, 2013;
originally announced February 2013.
-
Full-fledged Real-Time Indexing for Constant Size Alphabets
Authors:
Gregory Kucherov,
Yakov Nekrich
Abstract:
In this paper we describe a data structure that supports pattern matching queries on a dynamically arriving text $T$ over an alphabet of constant size. Each new symbol can be prepended to $T$ in $O(1)$ worst-case time. At any moment, we can report all occurrences of a pattern $P$ in the current text in $O(|P|+k)$ time, where $|P|$ is the length of $P$ and $k$ is the number of occurrences. This resolves, under the assumption of a constant-size alphabet, a long-standing open problem of the existence of a real-time indexing method for string matching (see \cite{AmirN08}).
Submitted 6 July, 2013; v1 submitted 16 February, 2013;
originally announced February 2013.
-
On the combinatorics of suffix arrays
Authors:
Gregory Kucherov,
Lilla Tóthmérész,
Stéphane Vialette
Abstract:
We prove several combinatorial properties of suffix arrays, including a characterization of suffix arrays through a bijection with a certain well-defined class of permutations. Our approach is based on the characterization of Burrows-Wheeler arrays given in [1], that we apply by reducing suffix sorting to cyclic shift sorting through the use of an additional sentinel symbol. We show that the characterization of suffix arrays for a special case of binary alphabet given in [2] easily follows from our characterization. Based on our results, we also provide simple proofs for the enumeration results for suffix arrays, obtained in [3]. Our approach to characterizing suffix arrays is the first that exploits their relationship with Burrows-Wheeler permutations.
Submitted 18 June, 2012;
originally announced June 2012.
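The reduction from suffix sorting to cyclic shift sorting mentioned in this abstract can be checked on small strings with the Python sketch below. It is a naive illustration, not the paper's construction: the function names are hypothetical, and the sentinel `$` is assumed to be strictly smaller than every symbol of the input.

```python
def suffix_array(s: str) -> list[int]:
    """Naive suffix array: start positions of the suffixes in sorted order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def suffix_array_via_cyclic_shifts(s: str, sentinel: str = "$") -> list[int]:
    """Recover the suffix array of s by sorting the cyclic shifts of s + sentinel.

    Assumption: `sentinel` is strictly smaller than every symbol of s.
    """
    assert all(sentinel < c for c in s), "sentinel must be the smallest symbol"
    t = s + sentinel
    # Sort the rotations of t; the unique smallest sentinel makes the
    # rotation order coincide with the suffix order.
    order = sorted(range(len(t)), key=lambda i: t[i:] + t[:i])
    # Drop the rotation that starts at the sentinel itself.
    return [i for i in order if i != len(s)]


if __name__ == "__main__":
    print(suffix_array("banana"))
    print(suffix_array_via_cyclic_shifts("banana"))
```

Both calls return the same permutation, because any comparison of two rotations is decided no later than the first position at which one of them reads the sentinel.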
-
Cross-Document Pattern Matching
Authors:
Gregory Kucherov,
Yakov Nekrich,
Tatiana Starikovskaya
Abstract:
We study a new variant of the string matching problem called cross-document string matching, which is the problem of indexing a collection of documents to support an efficient search for a pattern in a selected document, where the pattern itself is a substring of another document. Several variants of this problem are considered, and efficient linear-space solutions are proposed with query time bounds that either do not depend at all on the pattern size or depend on it in a very limited way (doubly logarithmic). As a side result, we propose an improved solution to the weighted level ancestor problem.
Submitted 18 February, 2012;
originally announced February 2012.
-
On-line construction of position heaps
Authors:
Gregory Kucherov
Abstract:
We propose a simple linear-time on-line algorithm for constructing a position heap for a string [Ehrenfeucht et al, 2011]. Our definition of position heap differs slightly from the one proposed in [Ehrenfeucht et al, 2011] in that it considers the suffixes ordered from left to right. Our construction is based on classic suffix pointers and resembles Ukkonen's algorithm for suffix trees [Ukkonen, 1995]. Using suffix pointers, the position heap can be extended into the augmented position heap that allows for a linear-time string matching algorithm [Ehrenfeucht et al, 2011].
Submitted 4 October, 2012; v1 submitted 8 April, 2011;
originally announced April 2011.
-
Linear pattern matching on sparse suffix trees
Authors:
Roman Kolpakov,
Gregory Kucherov,
Tatiana Starikovskaya
Abstract:
Packing several characters into one computer word is a simple and natural way to compress the representation of a string and to speed up its processing. Exploiting this idea, we propose an index for a packed string, based on a {\em sparse suffix tree} \cite{KU-96} with appropriately defined suffix links. Assuming, under the standard unit-cost RAM model, that a word can store up to $\log_σ n$ characters (where $σ$ is the alphabet size), our index takes $O(n/\log_σ n)$ space, i.e. the same space as the packed string itself. The resulting pattern matching algorithm runs in time $O(m+r^2+r\cdot occ)$, where $m$ is the length of the pattern, $r$ is the actual number of characters stored in a word and $occ$ is the number of pattern occurrences.
Submitted 14 March, 2011;
originally announced March 2011.
-
On maximal repetitions of arbitrary exponent
Authors:
Roman Kolpakov,
Gregory Kucherov,
Pascal Ochem
Abstract:
The first two authors have shown [KK99,KK00] that the sum of the exponents (and thus the number) of maximal repetitions of exponent at least 2 (also called runs) is linear in the length of the word. The exponent 2 in the definition of a run may seem arbitrary. In this paper, we consider maximal repetitions of exponent strictly greater than 1.
Submitted 25 June, 2009;
originally announced June 2009.
-
Estimating seed sensitivity on homogeneous alignments
Authors:
Gregory Kucherov,
Laurent Noe,
Yann Ponty
Abstract:
We address the problem of estimating the sensitivity of seed-based similarity search algorithms. In contrast to approaches based on Markov models [18, 6, 3, 4, 10], we study the estimation based on homogeneous alignments. We describe an algorithm for counting and random generation of those alignments and an algorithm for exact computation of the sensitivity for a broad class of seed strategies. We provide experimental results demonstrating a bias introduced by ignoring the homogeneousness condition.
Submitted 27 March, 2006;
originally announced March 2006.
-
A unifying framework for seed sensitivity and its application to subset seeds (Extended abstract)
Authors:
Gregory Kucherov,
Laurent Noe,
Mikhail Roytberg
Abstract:
We propose a general approach to compute the seed sensitivity, that can be applied to different definitions of seeds. It treats separately three components of the seed sensitivity problem - a set of target alignments, an associated probability distribution, and a seed model - that are specified by distinct finite automata. The approach is then applied to a new concept of subset seeds for which we propose an efficient automaton construction. Experimental results confirm that sensitive subset seeds can be efficiently designed using our approach, and can then be used in similarity search producing better results than ordinary spaced seeds.
Submitted 27 March, 2006;
originally announced March 2006.
-
A unifying framework for seed sensitivity and its application to subset seeds
Authors:
Gregory Kucherov,
Laurent Noé,
Mikhail Roytberg
Abstract:
We propose a general approach to compute the seed sensitivity, that can be applied to different definitions of seeds. It treats separately three components of the seed sensitivity problem -- a set of target alignments, an associated probability distribution, and a seed model -- that are specified by distinct finite automata. The approach is then applied to a new concept of subset seeds for which we propose an efficient automaton construction. Experimental results confirm that sensitive subset seeds can be efficiently designed using our approach, and can then be used in similarity search producing better results than ordinary spaced seeds.
Submitted 15 September, 2006; v1 submitted 27 January, 2006;
originally announced January 2006.