Search | arXiv e-print repository

Better space-time-robustness trade-offs for set reconciliation

Authors: Djamal Belazzougui, Gregory Kucherov, Stefan Walzer

Abstract: We consider the problem of reconstructing the symmetric difference between similar sets from their representations (sketches) of size linear in the number of differences. Exact solutions to this problem are based on error-correcting coding techniques and suffer from a large decoding time. Existing probabilistic solutions based on Invertible Bloom Lookup Tables (IBLTs) are time-efficient but offer insufficient success guarantees for many applications. Here we propose a tunable trade-off between the two approaches combining the efficiency of IBLTs with exponentially decreasing failure probability. The proof relies on a refined analysis of IBLTs proposed in (Baek Tejs Houen et al. SOSA 2023) which is of independent interest. We also propose a modification of our algorithm that enables telling apart the elements of each set in the symmetric difference.

Submitted 15 April, 2024; originally announced April 2024.

Comments: 19 pages
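
For readers unfamiliar with the IBLT baseline mentioned in the abstract above, the following is a minimal sketch of classic IBLT-based set reconciliation by peeling. It is not the algorithm proposed in the paper. The class name `IBLT`, the helper `_h`, the choice of BLAKE2b as hash function, and the restriction to non-negative integer keys are all assumptions made for illustration.

```python
import hashlib


def _h(tag, key):
    """64-bit hash of (tag, key); BLAKE2b is an illustrative choice."""
    digest = hashlib.blake2b(f"{tag}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class IBLT:
    """Toy IBLT over non-negative integer keys (hypothetical helper class)."""

    def __init__(self, cells_per_hash, num_hashes=3):
        self.w = cells_per_hash
        self.k = num_hashes
        n = cells_per_hash * num_hashes
        self.count = [0] * n      # signed number of keys in each cell
        self.key_sum = [0] * n    # XOR of the keys in each cell
        self.check_sum = [0] * n  # XOR of the key checksums in each cell

    def _cells(self, key):
        # One cell in each of k disjoint subtables, so the k cells are distinct.
        return [i * self.w + _h(i, key) % self.w for i in range(self.k)]

    def _update(self, key, delta):
        for c in self._cells(key):
            self.count[c] += delta
            self.key_sum[c] ^= key
            self.check_sum[c] ^= _h("check", key)

    def insert(self, key):
        self._update(key, 1)

    def subtract(self, other):
        """Cell-wise difference self - other (both must have the same shape)."""
        out = IBLT(self.w, self.k)
        for c in range(len(self.count)):
            out.count[c] = self.count[c] - other.count[c]
            out.key_sum[c] = self.key_sum[c] ^ other.key_sum[c]
            out.check_sum[c] = self.check_sum[c] ^ other.check_sum[c]
        return out

    def decode(self):
        """Peel pure cells; return (only_in_self, only_in_other) or None.

        Note: decoding consumes the table (cells are modified in place).
        """
        only_self, only_other = set(), set()
        progress = True
        while progress:
            progress = False
            for c in range(len(self.count)):
                sign, key = self.count[c], self.key_sum[c]
                # A cell is "pure" if it holds exactly one key (count +1 or -1)
                # and the checksum confirms that key_sum is that key.
                if sign in (1, -1) and self.check_sum[c] == _h("check", key):
                    (only_self if sign == 1 else only_other).add(key)
                    self._update(key, -sign)  # remove the key from its cells
                    progress = True
        if any(self.count) or any(self.key_sum) or any(self.check_sum):
            return None  # peeling got stuck: decoding failed
        return only_self, only_other


a, b = IBLT(300), IBLT(300)
for x in range(1000):
    a.insert(x)
for x in range(2, 1001):
    b.insert(x)
result = a.subtract(b).decode()
```

The sketch size (900 cells here) depends only on the expected number of differences, not on the set sizes. Decoding can fail when peeling gets stuck, in which case `decode` returns `None`; this is the failure probability the abstract refers to.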

arXiv:2302.05245 [pdf, other]

Count-min sketch with variable number of hash functions: an experimental study

Authors: Éric Fusy, Gregory Kucherov

Abstract: Conservative Count-Min, an improved version of Count-Min sketch [Cormode, Muthukrishnan 2005], is an online-maintained hashing-based data structure summarizing element frequency information without storing elements themselves. Although several works attempted to analyze the error that can be made by Count-Min, the behavior of this data structure remains poorly understood. In [Fusy, Kucherov 2022], we demonstrated that under the uniform distribution of input elements, the error of conservative Count-Min follows two distinct regimes depending on its load factor. In this work, we provide a series of experimental results providing new insights into the behavior of conservative Count-Min. Our contributions can be seen as twofold. On one hand, we provide a detailed experimental analysis of the behavior of Count-Min sketch in different regimes and under several representative probability distributions of input elements. On the other hand, we demonstrate improvements that can be made by assigning a variable number of hash functions to different elements. This includes, in particular, reduced space of the data structure while still supporting a small error.

Submitted 7 September, 2023; v1 submitted 10 February, 2023; originally announced February 2023.

Comments: short version to appear in SPIRE'23

arXiv:2203.15496 [pdf, other]

doi 10.1007/978-3-031-30448-4_17

Phase transition in count approximation by Count-Min sketch with conservative updates

Authors: Éric Fusy, Gregory Kucherov

Abstract: Count-Min sketch is a hash-based data structure to represent a dynamically changing associative array of counters. Here we analyse the counting version of Count-Min under a stronger update rule known as \textit{conservative update}, assuming the uniform distribution of input keys. We show that the accuracy of the conservative update strategy undergoes a phase transition, depending on the number of distinct keys in the input as a fraction of the size of the Count-Min array. We prove that below the threshold, the relative error is asymptotically $o(1)$ (as opposed to the regular Count-Min strategy), whereas above the threshold, the relative error is $Θ(1)$. The threshold corresponds to the peelability threshold of random $k$-uniform hypergraphs. We demonstrate that even for a small number of keys, peelability of the underlying hypergraph is a crucial property to ensure the $o(1)$ error. Finally, we provide experimental evidence that the phase transition does not extend to non-uniform distributions, in particular to the popular Zipf's distribution.

Submitted 14 July, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

Comments: 19 pages, 4 figures

MSC Class: 68W40 ACM Class: F.2.2
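
As a rough illustration of the conservative update rule named in the abstract above, here is a minimal sketch of a counting Count-Min with a single shared counter array and $k$ hash functions. The class name `ConservativeCountMin`, the BLAKE2b-based hashing, and the parameter names are assumptions for illustration and are not taken from the paper.

```python
import hashlib


class ConservativeCountMin:
    """Toy Count-Min with conservative update (hypothetical helper class)."""

    def __init__(self, num_counters, num_hashes):
        self.m = num_counters
        self.k = num_hashes
        self.counters = [0] * num_counters

    def _cells(self, key):
        # k hashed positions in one shared array (illustrative hashing).
        cells = []
        for i in range(self.k):
            digest = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            cells.append(int.from_bytes(digest, "big") % self.m)
        return cells

    def add(self, key):
        cells = self._cells(key)
        low = min(self.counters[c] for c in cells)
        # Conservative update: only counters equal to the current minimum
        # are incremented; larger counters are left untouched.
        for c in set(cells):
            if self.counters[c] == low:
                self.counters[c] += 1

    def estimate(self, key):
        # The estimate is the minimum over the key's counters; it never
        # underestimates the true count.
        return min(self.counters[c] for c in self._cells(key))


cm = ConservativeCountMin(num_counters=64, num_hashes=3)
for _ in range(7):
    cm.add("x")
```

Regular Count-Min would increment all $k$ counters on every update; skipping the counters that are already above the minimum is what reduces overestimation.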

arXiv:2006.01825 [pdf, ps, other]

Efficient tree-structured categorical retrieval

Authors: Djamal Belazzougui, Gregory Kucherov

Abstract: We study a document retrieval problem in the new framework where $D$ text documents are organized in a {\em category tree} with a pre-defined number $h$ of categories. This situation occurs e.g. with taxonomic trees in biology or subject classification systems for scientific literature. Given a string pattern $p$ and a category (level in the category tree), we wish to efficiently retrieve the $t$ \emph{categorical units} containing this pattern and belonging to the category. We propose several efficient solutions for this problem. One of them uses $n(\logσ(1+o(1))+\log D+O(h)) + O(Δ)$ bits of space and $O(|p|+t)$ query time, where $n$ is the total length of the documents, $σ$ the size of the alphabet used in the documents and $Δ$ is the total number of nodes in the category tree. Another solution uses $n(\logσ(1+o(1))+O(\log D))+O(Δ)+O(D\log n)$ bits of space and $O(|p|+t\log D)$ query time. We finally propose other solutions which are more space-efficient at the expense of a slight increase in query time.

Submitted 2 June, 2020; originally announced June 2020.

Comments: Full version of a paper accepted for presentation at the 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)

arXiv:1509.01221 [pdf, ps, other]

Optimal searching of gapped repeats in a word

Authors: Maxime Crochemore, Roman Kolpakov, Gregory Kucherov

Abstract: Following (Kolpakov et al., 2013; Gawrychowski and Manea, 2015), we continue the study of {\em $α$-gapped repeats} in strings, defined as factors $uvu$ with $|uv|\leq α|u|$. Our main result is the $O(αn)$ bound on the number of {\em maximal} $α$-gapped repeats in a string of length $n$, previously proved to be $O(α^2 n)$ in (Kolpakov et al., 2013). For a closely related notion of maximal $δ$-subrepetition (maximal factors of exponent between $1+δ$ and $2$), our result implies the $O(n/δ)$ bound on their number, which improves the bound of (Kolpakov et al., 2010) by a $\log n$ factor. We also prove an algorithmic time bound $O(αn+S)$ ($S$ size of the output) for computing all maximal $α$-gapped repeats. Our solution, inspired by (Gawrychowski and Manea, 2015), is different from the recently published proof by (Tanimura et al., 2015) of the same bound. Together with our bound on $S$, this implies an $O(αn)$-time algorithm for computing all maximal $α$-gapped repeats.

Submitted 2 October, 2015; v1 submitted 3 September, 2015; originally announced September 2015.

Comments: 27 pages. arXiv admin note: text overlap with arXiv:1309.4055

arXiv:1504.07406 [pdf, other]

On Maximal Unbordered Factors

Authors: Gregory Kucherov, Alexander Loptev, Tatiana Starikovskaya

Abstract: Given a string $S$ of length $n$, its maximal unbordered factor is the longest factor which does not have a border. In this work we investigate the relationship between $n$ and the length of the maximal unbordered factor of $S$. We prove that for the alphabet of size $σ\ge 5$ the expected length of the maximal unbordered factor of a string of length~$n$ is at least $0.99 n$ (for sufficiently large values of $n$). As an application of this result, we propose a new algorithm for computing the maximal unbordered factor of a string.

Submitted 28 April, 2015; originally announced April 2015.

Comments: Accepted to the 26th Annual Symposium on Combinatorial Pattern Matching (CPM 2015)
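
To make the definition in the abstract above concrete, the following is a naive quadratic-time sketch that finds a maximal unbordered factor using the classic prefix function (border array). It is a brute-force illustration of the definition, not the algorithm proposed in the paper; the function names are hypothetical.

```python
def prefix_function(s):
    """pi[i] = length of the longest border of s[:i+1] (classic KMP table)."""
    pi = [0] * len(s)
    for i in range(1, len(s)):
        k = pi[i - 1]
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi


def maximal_unbordered_factor(s):
    """Return the leftmost longest factor of s that has no border."""
    best = ""
    for start in range(len(s)):
        suffix = s[start:]
        if len(suffix) <= len(best):
            break  # no later start can yield a strictly longer factor
        pi = prefix_function(suffix)
        # A prefix of the suffix is unbordered iff its longest border is empty.
        for length in range(len(suffix), len(best), -1):
            if pi[length - 1] == 0:
                best = suffix[:length]
                break
    return best
```

For example, in `"abab"` both `"abab"` (border `"ab"`) and `"aba"` (border `"a"`) are bordered, so the longest unbordered factors have length 2.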

arXiv:1502.06256 [pdf, other]

doi 10.1093/bioinformatics/btv419

Spaced seeds improve k-mer-based metagenomic classification

Authors: Karel Brinda, Maciej Sykulski, Gregory Kucherov

Abstract: Metagenomics is a powerful approach to study genetic content of environmental samples that has been strongly promoted by NGS technologies. To cope with massive data involved in modern metagenomic projects, recent tools [4, 39] rely on the analysis of k-mers shared between the read to be classified and sampled reference genomes. Within this general framework, we show in this work that spaced seeds provide a significant improvement of classification accuracy as opposed to traditional contiguous k-mers. We support this thesis through a series of different computational experiments, including simulations of large-scale metagenomic projects. Scripts and programs used in this study, as well as supplementary material, are available from http://github.com/gregorykucherov/spaced-seeds-for-metagenomics.

Submitted 9 July, 2015; v1 submitted 22 February, 2015; originally announced February 2015.

Comments: 23 pages

Journal ref: Bioinformatics (2015) 31 (22): 3584-3592
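
The difference between contiguous k-mers and spaced seeds, as contrasted in the abstract above, can be sketched as follows. The function name `spaced_kmers` and the `"1"`/`"0"` mask notation (`"1"` for a position that is kept, `"0"` for a don't-care position) are assumptions for illustration and are not taken from the paper's scripts.

```python
def spaced_kmers(sequence, seed):
    """Extract spaced k-mers: keep only positions where the seed has '1'."""
    keep = [i for i, c in enumerate(seed) if c == "1"]
    span = len(seed)
    return [
        "".join(sequence[p + i] for i in keep)
        for p in range(len(sequence) - span + 1)
    ]


# A contiguous seed yields ordinary k-mers.
contiguous = spaced_kmers("ACGTAC", "111")
# A spaced seed skips the don't-care position in each window.
spaced = spaced_kmers("ACGTAC", "1101")
```

Because mismatches at don't-care positions are ignored, two sequences that differ only there still share the spaced k-mer: `spaced_kmers("ACGT", "1101")` and `spaced_kmers("ACTT", "1101")` both give `["ACT"]`.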

arXiv:1408.6198 [pdf, ps, other]

doi 10.1007/978-3-540-76336-9_18

Subset seed automaton

Authors: Gregory Kucherov, Laurent Noé, Mikhail Roytberg

Abstract: We study the pattern matching automaton introduced in (A unifying framework for seed sensitivity and its application to subset seeds) for the purpose of seed-based similarity search. We show that our definition provides a compact automaton, much smaller than the one obtained by applying the Aho-Corasick construction. We study properties of this automaton and present an efficient implementation of the automaton construction. We also present some experimental results and show that this automaton can be successfully applied to more general situations.

Submitted 18 August, 2014; originally announced August 2014.

Comments: 12 pages, 2 figures, 2 tables, CIAA 2007, http://hal.inria.fr/inria-00170414/en/

MSC Class: 20M35; 68Q45 ACM Class: F.1.1; F.4.3

Journal ref: LNCS 4783 (2007), pp 180-191

arXiv:1310.1440 [pdf, other]

Approximate String Matching using a Bidirectional Index

Authors: Gregory Kucherov, Kamil Salikhov, Dekel Tsur

Abstract: We study strategies of approximate pattern matching that exploit bidirectional text indexes, extending and generalizing ideas of Lam et al. We introduce a formalism, called search schemes, to specify search strategies of this type, then develop a probabilistic measure for the efficiency of a search scheme, prove several combinatorial results on efficient search schemes, and finally, provide experimental computations supporting the superiority of our strategies.

Submitted 6 September, 2015; v1 submitted 5 October, 2013; originally announced October 2013.

arXiv:1302.7278 [pdf, other]

Using cascading Bloom filters to improve the memory usage for de Brujin graphs

Authors: Kamil Salikhov, Gustavo Sacomoto, Gregory Kucherov

Abstract: De Bruijn graphs are widely used in bioinformatics for processing next-generation sequencing data. Due to the very large size of NGS datasets, it is essential to represent de Bruijn graphs compactly, and several approaches to this problem have been proposed recently. In this work, we show how to reduce the memory required by the algorithm of [3] that represents de Bruijn graphs using Bloom filters. Our method requires 30% to 40% less memory with respect to the method of [3], with insignificant impact on construction time. At the same time, our experiments showed a better query time compared to [3]. This is, to our knowledge, the best practical representation for de Bruijn graphs.

Submitted 21 May, 2013; v1 submitted 28 February, 2013; originally announced February 2013.

Comments: 12 pages, submitted

ACM Class: E.2; J.3

arXiv:1302.4016 [pdf, other]

Full-fledged Real-Time Indexing for Constant Size Alphabets

Authors: Gregory Kucherov, Yakov Nekrich

Abstract: In this paper we describe a data structure that supports pattern matching queries on a dynamically arriving text $T$ over an alphabet of constant size. Each new symbol can be prepended to $T$ in $O(1)$ worst-case time. At any moment, we can report all occurrences of a pattern $P$ in the current text in $O(|P|+k)$ time, where $|P|$ is the length of $P$ and $k$ is the number of occurrences. This resolves, under assumption of constant-size alphabet, a long-standing open problem of existence of a real-time indexing method for string matching (see \cite{AmirN08}).

Submitted 6 July, 2013; v1 submitted 16 February, 2013; originally announced February 2013.

arXiv:1206.3877 [pdf, ps, other]

On the combinatorics of suffix arrays

Authors: Gregory Kucherov, Lilla Tóthmérész, Stéphane Vialette

Abstract: We prove several combinatorial properties of suffix arrays, including a characterization of suffix arrays through a bijection with a certain well-defined class of permutations. Our approach is based on the characterization of Burrows-Wheeler arrays given in [1], that we apply by reducing suffix sorting to cyclic shift sorting through the use of an additional sentinel symbol. We show that the characterization of suffix arrays for a special case of binary alphabet given in [2] easily follows from our characterization. Based on our results, we also provide simple proofs for the enumeration results for suffix arrays, obtained in [3]. Our approach to characterizing suffix arrays is the first that exploits their relationship with Burrows-Wheeler permutations.

Submitted 18 June, 2012; originally announced June 2012.

arXiv:1202.4076 [pdf, ps, other]

doi 10.1007/978-3-642-31265-6_16

Cross-Document Pattern Matching

Authors: Gregory Kucherov, Yakov Nekrich, Tatiana Starikovskaya

Abstract: We study a new variant of the string matching problem called cross-document string matching, which is the problem of indexing a collection of documents to support an efficient search for a pattern in a selected document, where the pattern itself is a substring of another document. Several variants of this problem are considered, and efficient linear-space solutions are proposed with query time bounds that either do not depend at all on the pattern size or depend on it in a very limited way (doubly logarithmic). As a side result, we propose an improved solution to the weighted level ancestor problem.

Submitted 18 February, 2012; originally announced February 2012.

arXiv:1104.1601 [pdf, ps, other]

On-line construction of position heaps

Authors: Gregory Kucherov

Abstract: We propose a simple linear-time on-line algorithm for constructing a position heap for a string [Ehrenfeucht et al, 2011]. Our definition of position heap differs slightly from the one proposed in [Ehrenfeucht et al, 2011] in that it considers the suffixes ordered from left to right. Our construction is based on classic suffix pointers and resembles Ukkonen's algorithm for suffix trees [Ukkonen, 1995]. Using suffix pointers, the position heap can be extended into the augmented position heap that allows for a linear-time string matching algorithm [Ehrenfeucht et al, 2011].

Submitted 4 October, 2012; v1 submitted 8 April, 2011; originally announced April 2011.

Comments: to appear in Journal of Discrete Algorithms

arXiv:1103.2613 [pdf, other]

doi 10.1109/CCP.2011.45

Linear pattern matching on sparse suffix trees

Authors: Roman Kolpakov, Gregory Kucherov, Tatiana Starikovskaya

Abstract: Packing several characters into one computer word is a simple and natural way to compress the representation of a string and to speed up its processing. Exploiting this idea, we propose an index for a packed string, based on a {\em sparse suffix tree} \cite{KU-96} with appropriately defined suffix links. Assuming, under the standard unit-cost RAM model, that a word can store up to $\log_σn$ characters ($σ$ the alphabet size), our index takes $O(n/\log_σn)$ space, i.e. the same space as the packed string itself. The resulting pattern matching algorithm runs in time $O(m+r^2+r\cdot occ)$, where $m$ is the length of the pattern, $r$ is the actual number of characters stored in a word and $occ$ is the number of pattern occurrences.

Submitted 14 March, 2011; originally announced March 2011.

arXiv:0906.4750 [pdf, ps, other]

On maximal repetitions of arbitrary exponent

Authors: Roman Kolpakov, Gregory Kucherov, Pascal Ochem

Abstract: The first two authors have shown [KK99,KK00] that the sum of the exponents (and thus the number) of maximal repetitions of exponent at least 2 (also called runs) is linear in the length of the word. The exponent 2 in the definition of a run may seem arbitrary. In this paper, we consider maximal repetitions of exponent strictly greater than 1.

Submitted 25 June, 2009; originally announced June 2009.

Comments: 8 pages, 1 figure

ACM Class: G.2.1

arXiv:cs/0603106 [pdf, ps, other]

doi 10.1109/BIBE.2004.1317369

Estimating seed sensitivity on homogeneous alignments

Authors: Gregory Kucherov, Laurent Noe, Yann Ponty

Abstract: We address the problem of estimating the sensitivity of seed-based similarity search algorithms. In contrast to approaches based on Markov models [18, 6, 3, 4, 10], we study the estimation based on homogeneous alignments. We describe an algorithm for counting and random generation of those alignments and an algorithm for exact computation of the sensitivity for a broad class of seed strategies. We provide experimental results demonstrating a bias introduced by ignoring the homogeneousness condition.

Submitted 27 March, 2006; originally announced March 2006.

Journal ref: Proceedings of the Fourth IEEE Symposium on Bioinformatics and Bioengineering (BIBE), 387-394, 2004

arXiv:cs/0603105 [pdf, ps, other]

doi 10.1007/11557067_21

A unifying framework for seed sensitivity and its application to subset seeds (Extended abstract)

Authors: Gregory Kucherov, Laurent Noe, Mikhail Roytberg

Abstract: We propose a general approach to compute the seed sensitivity, that can be applied to different definitions of seeds. It treats separately three components of the seed sensitivity problem - a set of target alignments, an associated probability distribution, and a seed model - that are specified by distinct finite automata. The approach is then applied to a new concept of subset seeds for which we propose an efficient automaton construction. Experimental results confirm that sensitive subset seeds can be efficiently designed using our approach, and can then be used in similarity search producing better results than ordinary spaced seeds.

Submitted 27 March, 2006; originally announced March 2006.

Journal ref: Algorithms in Bioinformatics, LNBI 3692 : 251-263, 2005

arXiv:cs/0601116 [pdf, ps, other]

doi 10.1142/S0219720006001977

A unifying framework for seed sensitivity and its application to subset seeds

Authors: Gregory Kucherov, Laurent Noé, Mikhail Roytberg

Abstract: We propose a general approach to compute the seed sensitivity, that can be applied to different definitions of seeds. It treats separately three components of the seed sensitivity problem -- a set of target alignments, an associated probability distribution, and a seed model -- that are specified by distinct finite automata. The approach is then applied to a new concept of subset seeds for which we propose an efficient automaton construction. Experimental results confirm that sensitive subset seeds can be efficiently designed using our approach, and can then be used in similarity search producing better results than ordinary spaced seeds.

Submitted 15 September, 2006; v1 submitted 27 January, 2006; originally announced January 2006.

Journal ref: Journal of Bioinformatics and Computational Biology 4 (2006) 2, pp 553--569

Showing 1–19 of 19 results for author: Kucherov, G