-
Almost optimal searching of maximal subrepetitions in a word
Authors:
Roman Kolpakov
Abstract:
For $0<δ<1$, a $δ$-subrepetition in a word is a factor whose exponent is less than $2$ but not less than $1+δ$ (the exponent of a factor is the ratio of its length to its minimal period). A $δ$-subrepetition is maximal if it cannot be extended to the left or to the right by at least one letter while preserving its minimal period. In this paper we propose an algorithm that finds all maximal $δ$-subrepetitions in a word of length $n$ in $O(\frac{n}{δ}\log\frac{1}{δ})$ time (the lower bound for this time is $Ω(\frac{n}{δ})$).
Submitted 8 August, 2022;
originally announced August 2022.
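To make the definitions in this abstract concrete, here is a minimal brute-force sketch in Python. It is not the algorithm of the paper; it simply enumerates all factors and checks the definitions of exponent and maximality directly, so it is only suitable for very short words. The function names `minimal_period`, `exponent` and `maximal_subrepetitions` are illustrative assumptions, not names from the paper.

```python
from fractions import Fraction


def minimal_period(w: str) -> int:
    """Smallest p >= 1 such that w[i] == w[i + p] for every valid i."""
    n = len(w)
    for p in range(1, n + 1):
        if all(w[i] == w[i + p] for i in range(n - p)):
            return p
    return n  # only reached for the empty word


def exponent(w: str) -> Fraction:
    """Ratio of the factor length to its minimal period."""
    return Fraction(len(w), minimal_period(w))


def maximal_subrepetitions(word: str, delta: Fraction):
    """Brute force: list (start, end_exclusive, minimal_period) for every
    maximal delta-subrepetition, i.e. a factor with 1 + delta <= exponent < 2
    that cannot be extended by one letter without changing its minimal period."""
    n = len(word)
    result = []
    for i in range(n):
        for j in range(i + 1, n + 1):
            p = minimal_period(word[i:j])
            e = Fraction(j - i, p)
            if not (1 + delta <= e < 2):
                continue
            # An extension counts only if it keeps the same minimal period.
            left_blocked = i == 0 or minimal_period(word[i - 1:j]) != p
            right_blocked = j == n or minimal_period(word[i:j + 1]) != p
            if left_blocked and right_blocked:
                result.append((i, j, p))
    return result


if __name__ == "__main__":
    print(exponent("abcab"))                                # 5/3
    print(maximal_subrepetitions("abcab", Fraction(1, 2)))  # [(0, 5, 3)]
```

Exact rational arithmetic (`Fraction`) is used so that the boundary case where the exponent equals $1+δ$ is not affected by floating-point rounding.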
-
On the number of gapped repeats with arbitrary gap
Authors:
Roman Kolpakov
Abstract:
For any functions $f(x)$, $g(x)$ from $\mathbb{N}$ to $\mathbb{R}$, we call repeats $uvu$ such that $g(|u|)\le |v|\le f(|u|)$ {\it $f,g$-gapped repeats}. We study the possible number of $f,g$-gapped repeats in words of fixed length $n$. Under quite weak conditions on $f(x)$, $g(x)$ we obtain an upper bound on this number that is linear in $n$.
Submitted 4 January, 2017;
originally announced January 2017.
-
Indexing and querying color sets of images
Authors:
Djamal Belazzougui,
Roman Kolpakov,
Mathieu Raffinot
Abstract:
We study the set of color sets of continuous regions of an image given as a matrix of $m$ rows and $n\geq m$ columns, where each element of the matrix is an integer from $[1,σ]$ called a {\em color}.
The set of distinct colors in a region is called its fingerprint. We aim to compute, index and query the fingerprints of all rectangular regions, called rectangles. The set of all such fingerprints is denoted by ${\cal F}$. A rectangle is {\em maximal} if it is not contained in a greater rectangle with the same fingerprint. The set of all locations of maximal rectangles is denoted by $\mathcal{L}$. We first explain how to determine all the $|\mathcal{L}|$ maximal locations with their fingerprints in expected time $O(nm^2σ)$ using a Monte Carlo algorithm (with polynomially small probability of error) or within deterministic $O(nm^2σ\log(\frac{|\mathcal{L}|}{nm^2}+2))$ time. We then show how to build a data structure which occupies $O(nm\log n+|\mathcal{L}|)$ space such that a query which asks for all the maximal locations with a given fingerprint $f$ can be answered in time $O(|f|+\log\log n+k)$, where $k$ is the number of maximal locations with fingerprint $f$. If the query asks only for the presence of the fingerprint, then the space usage becomes $O(nm\log n+|{\cal F}|)$ while the query time becomes $O(|f|+\log\log n)$. Finally, we consider the special case of square regions (squares).
Submitted 28 August, 2016;
originally announced August 2016.
-
Optimal searching of gapped repeats in a word
Authors:
Maxime Crochemore,
Roman Kolpakov,
Gregory Kucherov
Abstract:
Following (Kolpakov et al., 2013; Gawrychowski and Manea, 2015), we continue the study of {\em $α$-gapped repeats} in strings, defined as factors $uvu$ with $|uv|\leq α|u|$. Our main result is the $O(αn)$ bound on the number of {\em maximal} $α$-gapped repeats in a string of length $n$, previously proved to be $O(α^2 n)$ in (Kolpakov et al., 2013). For a closely related notion of maximal $δ$-subrepetition (maximal factors of exponent between $1+δ$ and $2$), our result implies the $O(n/δ)$ bound on their number, which improves the bound of (Kolpakov et al., 2010) by a $\log n$ factor.
We also prove an algorithmic time bound of $O(αn+S)$ (where $S$ is the size of the output) for computing all maximal $α$-gapped repeats. Our solution, inspired by (Gawrychowski and Manea, 2015), is different from the recently published proof by (Tanimura et al., 2015) of the same bound. Together with our bound on $S$, this implies an $O(αn)$-time algorithm for computing all maximal $α$-gapped repeats.
Submitted 2 October, 2015; v1 submitted 3 September, 2015;
originally announced September 2015.
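The notion of a maximal $α$-gapped repeat used in this abstract can be illustrated with a short quadratic-time sketch. This is a direct check of the definition ($uvu$ with nonempty $v$ and $|uv|\leq α|u|$, not extendable on either side with the same period $|uv|$), not the $O(αn)$ algorithm of the paper. The function name and the output format `(start, |u|, period)` are assumptions made for illustration.

```python
def maximal_gapped_repeats(word: str, alpha: float):
    """Return (start, copy_length, period) for every maximal alpha-gapped
    repeat uvu in `word`, where period = |u| + |v| and |v| >= 1."""
    n = len(word)
    found = []
    for p in range(2, n):  # candidate period |u| + |v|
        i = 0
        while i + p < n:
            if word[i] != word[i + p]:
                i += 1
                continue
            # Scan a maximal block of positions k with word[k] == word[k + p].
            # Its two ends are exactly where extension with period p fails.
            start = i
            while i + p < n and word[i] == word[i + p]:
                i += 1
            length = i - start  # |u| of the maximal repeat with this period
            # length < p keeps the gap v nonempty (otherwise the two copies
            # touch or overlap and the factor has exponent at least 2).
            if length < p and p <= alpha * length:
                found.append((start, length, p))
    return found


if __name__ == "__main__":
    # "abcab" = u v u with u = "ab", v = "c": |uv| = 3 <= 2 * |u| = 4.
    print(maximal_gapped_repeats("abcab", 2))  # [(0, 2, 3)]
```

For each period the word is scanned once, so the sketch takes $O(n^2)$ time overall regardless of $α$.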
-
Upper bound on the number of steps for solving the subset sum problem by the Branch-and-Bound method
Authors:
Roman Kolpakov,
Mikhail Posypkin
Abstract:
We study the computational complexity of a particular case of the knapsack problem: the subset sum problem. For solving this problem we consider one of the basic variants of the Branch-and-Bound method, in which any sub-problem is decomposed along the free variable with the maximal weight. By the complexity of solving a problem by the Branch-and-Bound method we mean the number of steps required for solving the problem by this method. In the paper we obtain upper bounds on the complexity of solving the subset sum problem by the Branch-and-Bound method. These bounds can be easily computed from the input data of the problem, so they can be used for the preliminary estimation of the computational resources required for solving the subset sum problem by the Branch-and-Bound method.
Submitted 20 June, 2015;
originally announced June 2015.
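As a rough illustration of what "number of steps" can mean here, the following Python sketch counts the sub-problems processed by a simple Branch-and-Bound procedure that always branches on the heaviest free variable. The abstract does not spell out the pruning rules, so the two rules used below (discard a sub-problem whose fixed items already exceed the capacity; close a sub-problem when all remaining free items fit) are assumptions, as are the function name and the convention that one processed sub-problem counts as one step.

```python
def branch_and_bound_steps(weights, capacity):
    """Maximize the total weight of a subset not exceeding `capacity`.
    Returns (best_value, number_of_subproblems_processed)."""
    order = sorted(weights, reverse=True)  # heaviest free variable first
    # suffix[k] = total weight of the free variables order[k:]
    suffix = [0] * (len(order) + 1)
    for k in range(len(order) - 1, -1, -1):
        suffix[k] = suffix[k + 1] + order[k]

    best = 0
    steps = 0
    stack = [(0, 0)]  # (index of next free variable, weight fixed to 1)
    while stack:
        k, fixed = stack.pop()
        steps += 1
        if fixed > capacity:
            continue  # assumed rule 1: infeasible sub-problem
        if fixed + suffix[k] <= capacity:
            # assumed rule 2: taking every free item is feasible and optimal
            best = max(best, fixed + suffix[k])
            continue
        # Decompose along the free variable with the maximal weight.
        stack.append((k + 1, fixed))             # variable set to 0
        stack.append((k + 1, fixed + order[k]))  # variable set to 1
    return best, steps


if __name__ == "__main__":
    print(branch_and_bound_steps([3, 5, 7], 10))  # (10, 5)
```

Running such a counter on small instances is one way to compare an analytic upper bound against the observed number of steps, under the stated assumptions about the method.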
-
Searching of gapped repeats and subrepetitions in a word
Authors:
Roman Kolpakov,
Mikhail Podolskiy,
Mikhail Posypkin,
Nickolay Khrapov
Abstract:
A gapped repeat is a factor of the form $uvu$ where $u$ and $v$ are nonempty words. The period of the gapped repeat is defined as $|u|+|v|$. The gapped repeat is maximal if it cannot be extended to the left or to the right by at least one letter while preserving its period. The gapped repeat is called $α$-gapped if its period is not greater than $α|v|$. A $δ$-subrepetition is a factor whose exponent is less than 2 but not less than $1+δ$ (the exponent of a factor is the quotient of its length and its minimal period). The $δ$-subrepetition is maximal if it cannot be extended to the left or to the right by at least one letter while preserving its minimal period. We reveal a close relation between maximal gapped repeats and maximal subrepetitions. Moreover, we show that in a word of length $n$ the number of maximal $α$-gapped repeats is bounded by $O(α^2n)$ and the number of maximal $δ$-subrepetitions is bounded by $O(n/δ^2)$. Using the obtained upper bounds, we propose algorithms for finding all maximal $α$-gapped repeats and all maximal $δ$-subrepetitions in a word of length $n$. The algorithm for finding all maximal $α$-gapped repeats has $O(α^2n)$ time complexity for the case of constant alphabet size and $O(n\log n + α^2n)$ time complexity for the general case. For finding all maximal $δ$-subrepetitions we propose two algorithms. The first algorithm has $O(\frac{n\log\log n}{δ^2})$ time complexity for the case of constant alphabet size and $O(n\log n +\frac{n\log\log n}{δ^2})$ time complexity for the general case. The second algorithm has $O(n\log n+\frac{n}{δ^2}\log \frac{1}{δ})$ expected time complexity.
Submitted 29 September, 2013; v1 submitted 16 September, 2013;
originally announced September 2013.
-
Various improvements to text fingerprinting
Authors:
Djamal Belazzougui,
Roman Kolpakov,
Mathieu Raffinot
Abstract:
Let s = s_1 .. s_n be a text (or sequence) on a finite alphabet Σ of size σ. A fingerprint in s is the set of distinct characters appearing in one of its substrings. The problem considered here is to compute the set {\cal F} of all fingerprints of all substrings of s in order to answer efficiently certain questions on this set. A substring s_i .. s_j is a maximal location for a fingerprint f in {\cal F} (denoted by <i,j>) if the alphabet of s_i .. s_j is f and s_{i-1}, s_{j+1}, if defined, are not in f. The set of maximal locations in s is {\cal L} (it is easy to see that |{\cal L}| \leq n σ). Two maximal locations <i,j> and <k,l> such that s_i .. s_j = s_k .. s_l are named {\em copies}, and the quotient set of {\cal L} according to the copy relation is denoted by {\cal L}_C. We present new exact and approximate efficient algorithms and data structures for the following three problems: (1) to compute {\cal F}; (2) given f as a set of distinct characters in Σ, to answer if f represents a fingerprint in {\cal F}; (3) given f, to find all maximal locations of f in s.
Submitted 15 January, 2013;
originally announced January 2013.
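The definitions of fingerprint and maximal location in this abstract can be checked on small inputs with a brute-force sketch such as the one below. It is quadratic in the text length and is not one of the algorithms or data structures of the paper; the function name and the dictionary-based output are assumptions chosen for illustration. Positions are reported 1-based, as in the notation <i,j>.

```python
def fingerprints_and_maximal_locations(s: str):
    """Map each fingerprint (a frozenset of characters) of `s` to the list
    of its maximal locations <i, j>, given as 1-based inclusive pairs."""
    n = len(s)
    locations = {}
    for i in range(n):
        seen = set()
        for j in range(i, n):
            seen.add(s[j])  # alphabet of the substring s_i .. s_j
            entry = locations.setdefault(frozenset(seen), [])
            # Maximal: the neighbouring characters, if defined, are not in f.
            left_ok = i == 0 or s[i - 1] not in seen
            right_ok = j == n - 1 or s[j + 1] not in seen
            if left_ok and right_ok:
                entry.append((i + 1, j + 1))
    return locations


if __name__ == "__main__":
    locs = fingerprints_and_maximal_locations("aba")
    for f, where in sorted(locs.items(), key=lambda kv: sorted(kv[0])):
        print(sorted(f), where)
    # ['a'] [(1, 1), (3, 3)]
    # ['a', 'b'] [(1, 3)]
    # ['b'] [(2, 2)]
```

In this example the maximal locations <1,1> and <3,3> are copies of each other, and the total number of maximal locations (4) respects the bound nσ = 6.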
-
On the number of Dejean words over alphabets of 5, 6, 7, 8, 9 and 10 letters
Authors:
Roman Kolpakov,
Michael Rao
Abstract:
We give lower bounds on the growth rate of Dejean words, i.e. minimally repetitive words, over a k-letter alphabet, for k=5, 6, 7, 8, 9, 10. Put together with the known upper bounds, we estimate these growth rates with a precision of 0.005. As a consequence, we establish the exponential growth of the number of Dejean words over a k-letter alphabet, for k=5, 6, 7, 8, 9, 10.
Submitted 16 May, 2011;
originally announced May 2011.
-
On primary and secondary repetitions in words
Authors:
Roman Kolpakov
Abstract:
Combinatorial properties of maximal repetitions (runs) in formal words are studied. We classify all maximal repetitions in a word as primary and secondary, where the set of all primary repetitions determines all the other repetitions in the word. Essential combinatorial properties of primary repetitions are established.
Submitted 27 March, 2011;
originally announced March 2011.
-
Linear pattern matching on sparse suffix trees
Authors:
Roman Kolpakov,
Gregory Kucherov,
Tatiana Starikovskaya
Abstract:
Packing several characters into one computer word is a simple and natural way to compress the representation of a string and to speed up its processing. Exploiting this idea, we propose an index for a packed string, based on a {\em sparse suffix tree} \cite{KU-96} with appropriately defined suffix links. Assuming, under the standard unit-cost RAM model, that a word can store up to $\log_σ n$ characters (where $σ$ is the alphabet size), our index takes $O(n/\log_σ n)$ space, i.e. the same space as the packed string itself. The resulting pattern matching algorithm runs in time $O(m+r^2+r\cdot occ)$, where $m$ is the length of the pattern, $r$ is the actual number of characters stored in a word and $occ$ is the number of pattern occurrences.
Submitted 14 March, 2011;
originally announced March 2011.
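The packing idea mentioned at the start of this abstract can be sketched in a few lines of Python. This only shows how $r$ characters are stored in one machine word; it does not implement the sparse suffix tree index or the matching algorithm. The function name `pack`, the fixed `word_bits` parameter and the big-endian layout inside a word are assumptions made for illustration (the abstract instead assumes a word holding up to $\log_σ n$ characters).

```python
def pack(text: str, alphabet: str, word_bits: int = 64):
    """Pack `text` into integers of `word_bits` bits each.
    Returns (words, r, bits): the packed words, the number r of characters
    stored per word, and the number of bits used per character."""
    bits = max(1, (len(alphabet) - 1).bit_length())  # bits per character
    r = word_bits // bits                            # characters per word
    code = {c: k for k, c in enumerate(alphabet)}
    words = []
    for start in range(0, len(text), r):
        w = 0
        for c in text[start:start + r]:
            w = (w << bits) | code[c]  # append the character code
        words.append(w)
    return words, r, bits


if __name__ == "__main__":
    # With 8-bit words and a 4-letter alphabet, 4 characters fit per word.
    print(pack("ACGTAC", "ACGT", word_bits=8))  # ([27, 1], 4, 2)
```

With this layout, two blocks of $r$ characters are equal exactly when their packed words are equal, which is what allows one word comparison to replace $r$ character comparisons. The last word may hold fewer than $r$ characters, so the text length has to be stored separately.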
-
On maximal repetitions of arbitrary exponent
Authors:
Roman Kolpakov,
Gregory Kucherov,
Pascal Ochem
Abstract:
The first two authors have shown [KK99,KK00] that the sum of the exponents (and thus the number) of maximal repetitions of exponent at least 2 (also called runs) is linear in the length of the word. The exponent 2 in the definition of a run may seem arbitrary. In this paper, we consider maximal repetitions of exponent strictly greater than 1.
Submitted 25 June, 2009;
originally announced June 2009.