Showing 1–19 of 19 results for author: Kucherov, G

  1. arXiv:2404.09607  [pdf, other]

    cs.DS

    Better space-time-robustness trade-offs for set reconciliation

    Authors: Djamal Belazzougui, Gregory Kucherov, Stefan Walzer

    Abstract: We consider the problem of reconstructing the symmetric difference between similar sets from their representations (sketches) of size linear in the number of differences. Exact solutions to this problem are based on error-correcting coding techniques and suffer from a large decoding time. Existing probabilistic solutions based on Invertible Bloom Lookup Tables (IBLTs) are time-efficient but offer…

    Submitted 15 April, 2024; originally announced April 2024.

    Comments: 19 pages

  2. arXiv:2302.05245  [pdf, other]

    cs.DS

    Count-min sketch with variable number of hash functions: an experimental study

    Authors: Éric Fusy, Gregory Kucherov

    Abstract: Conservative Count-Min, an improved version of Count-Min sketch [Cormode, Muthukrishnan 2005], is an online-maintained hashing-based data structure summarizing element frequency information without storing elements themselves. Although several works attempted to analyze the error that can be made by Count-Min, the behavior of this data structure remains poorly understood. In [Fusy, Kucherov 2022],…

    Submitted 7 September, 2023; v1 submitted 10 February, 2023; originally announced February 2023.

    Comments: short version to appear in SPIRE'23

  3. Phase transition in count approximation by Count-Min sketch with conservative updates

    Authors: Éric Fusy, Gregory Kucherov

    Abstract: Count-Min sketch is a hash-based data structure to represent a dynamically changing associative array of counters. Here we analyse the counting version of Count-Min under a stronger update rule known as conservative update, assuming the uniform distribution of input keys. We show that the accuracy of conservative update strategy undergoes a phase transition, depending on the number of dis…

    Submitted 14 July, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: 19 pages, 4 figures

    MSC Class: 68W40 ACM Class: F.2.2

  4. arXiv:2006.01825  [pdf, ps, other]

    cs.DS

    Efficient tree-structured categorical retrieval

    Authors: Djamal Belazzougui, Gregory Kucherov

    Abstract: We study a document retrieval problem in the new framework where $D$ text documents are organized in a category tree with a pre-defined number $h$ of categories. This situation occurs e.g. with taxonomic trees in biology or subject classification systems for scientific literature. Given a string pattern $p$ and a category (level in the category tree), we wish to efficiently retrieve the $t$…

    Submitted 2 June, 2020; originally announced June 2020.

    Comments: Full version of a paper accepted for presentation at the 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)

  5. arXiv:1509.01221  [pdf, ps, other]

    cs.FL

    Optimal searching of gapped repeats in a word

    Authors: Maxime Crochemore, Roman Kolpakov, Gregory Kucherov

    Abstract: Following (Kolpakov et al., 2013; Gawrychowski and Manea, 2015), we continue the study of $α$-gapped repeats in strings, defined as factors $uvu$ with $|uv|\leq α|u|$. Our main result is the $O(αn)$ bound on the number of maximal $α$-gapped repeats in a string of length $n$, previously proved to be $O(α^2 n)$ in (Kolpakov et al., 2013). For a closely related notion of maximal $δ$-subre…

    Submitted 2 October, 2015; v1 submitted 3 September, 2015; originally announced September 2015.

    Comments: 27 pages. arXiv admin note: text overlap with arXiv:1309.4055

  6. arXiv:1504.07406  [pdf, other]

    cs.DS

    On Maximal Unbordered Factors

    Authors: Gregory Kucherov, Alexander Loptev, Tatiana Starikovskaya

    Abstract: Given a string $S$ of length $n$, its maximal unbordered factor is the longest factor which does not have a border. In this work we investigate the relationship between $n$ and the length of the maximal unbordered factor of $S$. We prove that for an alphabet of size $σ\ge 5$ the expected length of the maximal unbordered factor of a string of length $n$ is at least $0.99 n$ (for sufficiently large…

    Submitted 28 April, 2015; originally announced April 2015.

    Comments: Accepted to the 26th Annual Symposium on Combinatorial Pattern Matching (CPM 2015)

  7. arXiv:1502.06256  [pdf, other]

    q-bio.GN cs.CE cs.LG

    Spaced seeds improve k-mer-based metagenomic classification

    Authors: Karel Brinda, Maciej Sykulski, Gregory Kucherov

    Abstract: Metagenomics is a powerful approach to study genetic content of environmental samples that has been strongly promoted by NGS technologies. To cope with massive data involved in modern metagenomic projects, recent tools [4, 39] rely on the analysis of k-mers shared between the read to be classified and sampled reference genomes. Within this general framework, we show in this work that spaced seeds…

    Submitted 9 July, 2015; v1 submitted 22 February, 2015; originally announced February 2015.

    Comments: 23 pages

    Journal ref: Bioinformatics (2015) 31 (22): 3584-3592

  8. arXiv:1408.6198  [pdf, ps, other]

    cs.FL cs.DS q-bio.QM

    Subset seed automaton

    Authors: Gregory Kucherov, Laurent Noé, Mikhail Roytberg

    Abstract: We study the pattern matching automaton introduced in (A unifying framework for seed sensitivity and its application to subset seeds) for the purpose of seed-based similarity search. We show that our definition provides a compact automaton, much smaller than the one obtained by applying the Aho-Corasick construction. We study properties of this automaton and present an efficient implementation of…

    Submitted 18 August, 2014; originally announced August 2014.

    Comments: 12 pages, 2 figures, 2 tables, CIAA 2007, http://hal.inria.fr/inria-00170414/en/

    MSC Class: 20M35; 68Q45 ACM Class: F.1.1; F.4.3

    Journal ref: LNCS 4783 (2007), pp 180-191

  9. arXiv:1310.1440  [pdf, other]

    cs.DS

    Approximate String Matching using a Bidirectional Index

    Authors: Gregory Kucherov, Kamil Salikhov, Dekel Tsur

    Abstract: We study strategies of approximate pattern matching that exploit bidirectional text indexes, extending and generalizing ideas of Lam et al. We introduce a formalism, called search schemes, to specify search strategies of this type, then develop a probabilistic measure for the efficiency of a search scheme, prove several combinatorial results on efficient search schemes, and finally, provide experi…

    Submitted 6 September, 2015; v1 submitted 5 October, 2013; originally announced October 2013.

  10. arXiv:1302.7278  [pdf, other]

    cs.DS

    Using cascading Bloom filters to improve the memory usage for de Brujin graphs

    Authors: Kamil Salikhov, Gustavo Sacomoto, Gregory Kucherov

    Abstract: De Bruijn graphs are widely used in bioinformatics for processing next-generation sequencing data. Due to the very large size of NGS datasets, it is essential to represent de Bruijn graphs compactly, and several approaches to this problem have been proposed recently. In this work, we show how to reduce the memory required by the algorithm of [3] that represents de Bruijn graphs using Bloom filters.…

    Submitted 21 May, 2013; v1 submitted 28 February, 2013; originally announced February 2013.

    Comments: 12 pages, submitted

    ACM Class: E.2; J.3

  11. arXiv:1302.4016  [pdf, other]

    cs.DS

    Full-fledged Real-Time Indexing for Constant Size Alphabets

    Authors: Gregory Kucherov, Yakov Nekrich

    Abstract: In this paper we describe a data structure that supports pattern matching queries on a dynamically arriving text over an alphabet of constant size. Each new symbol can be prepended to $T$ in $O(1)$ worst-case time. At any moment, we can report all occurrences of a pattern $P$ in the current text in $O(|P|+k)$ time, where $|P|$ is the length of $P$ and $k$ is the number of occurrences. This resolves,…

    Submitted 6 July, 2013; v1 submitted 16 February, 2013; originally announced February 2013.

  12. arXiv:1206.3877  [pdf, ps, other]

    cs.DS math.CO

    On the combinatorics of suffix arrays

    Authors: Gregory Kucherov, Lilla Tóthmérész, Stéphane Vialette

    Abstract: We prove several combinatorial properties of suffix arrays, including a characterization of suffix arrays through a bijection with a certain well-defined class of permutations. Our approach is based on the characterization of Burrows-Wheeler arrays given in [1], that we apply by reducing suffix sorting to cyclic shift sorting through the use of an additional sentinel symbol. We show that the chara…

    Submitted 18 June, 2012; originally announced June 2012.

  13. Cross-Document Pattern Matching

    Authors: Gregory Kucherov, Yakov Nekrich, Tatiana Starikovskaya

    Abstract: We study a new variant of the string matching problem called cross-document string matching, which is the problem of indexing a collection of documents to support an efficient search for a pattern in a selected document, where the pattern itself is a substring of another document. Several variants of this problem are considered, and efficient linear-space solutions are proposed with query time bou…

    Submitted 18 February, 2012; originally announced February 2012.

  14. arXiv:1104.1601  [pdf, ps, other]

    cs.DS

    On-line construction of position heaps

    Authors: Gregory Kucherov

    Abstract: We propose a simple linear-time on-line algorithm for constructing a position heap for a string [Ehrenfeucht et al, 2011]. Our definition of position heap differs slightly from the one proposed in [Ehrenfeucht et al, 2011] in that it considers the suffixes ordered from left to right. Our construction is based on classic suffix pointers and resembles Ukkonen's algorithm for suffix trees [Ukkone…

    Submitted 4 October, 2012; v1 submitted 8 April, 2011; originally announced April 2011.

    Comments: to appear in Journal of Discrete Algorithms

  15. Linear pattern matching on sparse suffix trees

    Authors: Roman Kolpakov, Gregory Kucherov, Tatiana Starikovskaya

    Abstract: Packing several characters into one computer word is a simple and natural way to compress the representation of a string and to speed up its processing. Exploiting this idea, we propose an index for a packed string, based on a sparse suffix tree [KU-96] with appropriately defined suffix links. Assuming, under the standard unit-cost RAM model, that a word can store up to $\log_σ n$ charac…

    Submitted 14 March, 2011; originally announced March 2011.

  16. arXiv:0906.4750  [pdf, ps, other]

    cs.DM

    On maximal repetitions of arbitrary exponent

    Authors: Roman Kolpakov, Gregory Kucherov, Pascal Ochem

    Abstract: The first two authors have shown [KK99,KK00] that the sum of the exponents (and thus the number) of maximal repetitions of exponent at least 2 (also called runs) is linear in the length of the word. The exponent 2 in the definition of a run may seem arbitrary. In this paper, we consider maximal repetitions of exponent strictly greater than 1.

    Submitted 25 June, 2009; originally announced June 2009.

    Comments: 8 pages, 1 figure

    ACM Class: G.2.1

  17. Estimating seed sensitivity on homogeneous alignments

    Authors: Gregory Kucherov, Laurent Noé, Yann Ponty

    Abstract: We address the problem of estimating the sensitivity of seed-based similarity search algorithms. In contrast to approaches based on Markov models [18, 6, 3, 4, 10], we study the estimation based on homogeneous alignments. We describe an algorithm for counting and random generation of those alignments and an algorithm for exact computation of the sensitivity for a broad class of seed strategies.…

    Submitted 27 March, 2006; originally announced March 2006.

    Journal ref: Proceedings of the Fourth IEEE Symposium on Bioinformatics and Bioengineering (BIBE), 387-394, 2004

  18. A unifying framework for seed sensitivity and its application to subset seeds (Extended abstract)

    Authors: Gregory Kucherov, Laurent Noé, Mikhail Roytberg

    Abstract: We propose a general approach to compute the seed sensitivity, that can be applied to different definitions of seeds. It treats separately three components of the seed sensitivity problem - a set of target alignments, an associated probability distribution, and a seed model - that are specified by distinct finite automata. The approach is then applied to a new concept of subset seeds for which w…

    Submitted 27 March, 2006; originally announced March 2006.

    Journal ref: Algorithms in Bioinformatics, LNBI 3692 : 251-263, 2005

  19. A unifying framework for seed sensitivity and its application to subset seeds

    Authors: Gregory Kucherov, Laurent Noé, Mikhail Roytberg

    Abstract: We propose a general approach to compute the seed sensitivity, that can be applied to different definitions of seeds. It treats separately three components of the seed sensitivity problem -- a set of target alignments, an associated probability distribution, and a seed model -- that are specified by distinct finite automata. The approach is then applied to a new concept of subset seeds for which…

    Submitted 15 September, 2006; v1 submitted 27 January, 2006; originally announced January 2006.

    Journal ref: Journal of Bioinformatics and Computational Biology 4 (2006) 2, pp 553--569