-
Constructing the Bijective and the Extended Burrows-Wheeler Transform in Linear Time
Authors:
Hideo Bannai,
Juha Kärkkäinen,
Dominik Köppl,
Marcin Piątkowski
Abstract:
The Burrows-Wheeler transform (BWT) is a permutation whose applications are prevalent in data compression and text indexing. The bijective BWT (BBWT) is a bijective variant of it. Although it is known that the BWT can be constructed in linear time for integer alphabets by using a linear time suffix array construction algorithm, it was up to now only conjectured that the BBWT can also be constructed in linear time. We confirm this conjecture by proposing a construction algorithm that is based on SAIS, improving the best known result of $O(n \lg n /\lg \lg n)$ time to linear.
Submitted 22 April, 2021; v1 submitted 16 November, 2019;
originally announced November 2019.
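For readers new to the transform named in the abstract above, here is a minimal sketch of the plain BWT computed by sorting rotations, together with its inversion. It is a quadratic-space textbook illustration, not the SAIS-based linear-time construction the paper proposes, and it does not compute the bijective variant. The function names and the `$` sentinel are assumptions made for this example.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations of text + sentinel, keep the last column."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(last: str, sentinel: str = "$") -> str:
    """Invert the BWT by following the mapping between last and first column."""
    n = len(last)
    # A stable sort of the last column yields the first column;
    # order[k] is the row whose last character is the first character of row k.
    order = sorted(range(n), key=lambda i: last[i])
    row = last.index(sentinel)  # the row that holds the original string
    out = []
    for _ in range(n):
        row = order[row]
        out.append(last[row])
    return "".join(out)[:-1]  # drop the trailing sentinel


print(bwt("banana"))           # annb$aa
print(inverse_bwt("annb$aa"))  # banana
```

The sentinel is assumed to be smaller than every text character, which holds for `$` against ASCII letters.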
-
Run Compressed Rank/Select for Large Alphabets
Authors:
José Fuentes-Sepúlveda,
Juha Kärkkäinen,
Dmitry Kosolobov,
Simon J. Puglisi
Abstract:
Given a string of length $n$ that is composed of $r$ runs of letters from the alphabet $\{0,1,\ldots,σ{-}1\}$ such that $2 \le σ\le r$, we describe a data structure that, provided $r \le n / \log^{ω(1)} n$, stores the string in $r\log\frac{nσ}{r} + o(r\log\frac{nσ}{r})$ bits and supports select and access queries in $O(\log\frac{\log(n/r)}{\log\log n})$ time and rank queries in $O(\log\frac{\log(nσ/r)}{\log\log n})$ time. We show that $r\log\frac{n(σ-1)}{r} - O(\log\frac{n}{r})$ bits are necessary for any such data structure and, thus, our solution is succinct. We also describe a data structure that uses $(1 + ε)r\log\frac{nσ}{r} + O(r)$ bits, where $ε> 0$ is an arbitrary constant, with the same query times but without the restriction $r \le n / \log^{ω(1)} n$. By simple reductions to the colored predecessor problem, we show that the query times are optimal in the important case $r \ge 2^{\log^δn}$, for an arbitrary constant $δ> 0$. We implement our solution and compare it with the state of the art, showing that the closest competitors consume 31-46% more space.
Submitted 26 February, 2018; v1 submitted 8 November, 2017;
originally announced November 2017.
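To make the three queries in the abstract above concrete, the following sketch answers access, rank and select directly on a run-length encoding. It only illustrates the query semantics: rank and select scan all $r$ runs, so it does not achieve the space or time bounds of the paper. The class name, 0-based positions and 1-based occurrence counts are assumptions of this example.

```python
from bisect import bisect_right
from itertools import accumulate, groupby


class RunLengthString:
    """Stores a string as (letter, run length) pairs plus run start positions."""

    def __init__(self, text: str):
        self.runs = [(ch, len(list(group))) for ch, group in groupby(text)]
        # starts[k] is the position of the first letter of run k; starts[-1] == n.
        self.starts = [0] + list(accumulate(length for _, length in self.runs))

    def __len__(self) -> int:
        return self.starts[-1]

    def access(self, i: int) -> str:
        """Letter at position i (0-based)."""
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self.runs[bisect_right(self.starts, i) - 1][0]

    def rank(self, c: str, i: int) -> int:
        """Number of occurrences of c in positions [0, i)."""
        count = 0
        for (ch, length), start in zip(self.runs, self.starts):
            if start >= i:
                break
            if ch == c:
                count += min(length, i - start)
        return count

    def select(self, c: str, j: int) -> int:
        """Position of the j-th occurrence of c (j >= 1), or -1 if there is none."""
        for (ch, length), start in zip(self.runs, self.starts):
            if ch == c:
                if j <= length:
                    return start + j - 1
                j -= length
        return -1


s = RunLengthString("aaabbaac")
print(s.access(3), s.rank("a", 6), s.select("a", 4))  # b 4 5
```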
-
On the Size of Lempel-Ziv and Lyndon Factorizations
Authors:
Juha Kärkkäinen,
Dominik Kempa,
Yuto Nakashima,
Simon J. Puglisi,
Arseny M. Shur
Abstract:
Lyndon factorization and Lempel-Ziv (LZ) factorization are both important tools for analysing the structure and complexity of strings, but their combinatorial structure is very different. In this paper, we establish the first direct connection between the two by showing that while the Lyndon factorization can be bigger than the non-overlapping LZ factorization (which we demonstrate by describing a new, non-trivial family of strings) it is never more than twice the size.
Submitted 27 November, 2016;
originally announced November 2016.
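The Lyndon factorization mentioned in the abstract above can be computed with Duval's classical linear-time algorithm. The sketch below is that standard algorithm, not anything specific to the paper; the function name is an assumption of this example.

```python
def lyndon_factorization(s: str) -> list[str]:
    """Duval's algorithm: split s into a non-increasing sequence of Lyndon words."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        # Extend while s[i:j] is still a prefix of a power of a Lyndon word.
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # Emit every full copy of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


print(lyndon_factorization("banana"))  # ['b', 'an', 'an', 'a']
```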
-
Linear-time string indexing and analysis in small space
Authors:
Djamal Belazzougui,
Fabio Cunial,
Juha Kärkkäinen,
Veli Mäkinen
Abstract:
The field of succinct data structures has flourished over the last 16 years. Starting from the compressed suffix array (CSA) by Grossi and Vitter (STOC 2000) and the FM-index by Ferragina and Manzini (FOCS 2000), a number of generalizations and applications of string indexes based on the Burrows-Wheeler transform (BWT) have been developed, all taking an amount of space that is close to the input size in bits. In many large-scale applications, the construction of the index and its usage need to be considered as one unit of computation. Efficient string indexing and analysis in small space lies also at the core of a number of primitives in the data-intensive field of high-throughput DNA sequencing. We report the following advances in string indexing and analysis. We show that the BWT of a string $T\in \{1,\ldots,σ\}^n$ can be built in deterministic $O(n)$ time using just $O(n\logσ)$ bits of space, where $σ\leq n$. Within the same time and space budget, we can build an index based on the BWT that allows one to enumerate all the internal nodes of the suffix tree of $T$. Many fundamental string analysis problems can be mapped to such enumeration, and can thus be solved in deterministic $O(n)$ time and in $O(n\logσ)$ bits of space from the input string. We also show how to build many of the existing indexes based on the BWT, such as the CSA, the compressed suffix tree (CST), and the bidirectional BWT index, in randomized $O(n)$ time and in $O(n\logσ)$ bits of space. The previously fastest construction algorithms for BWT, CSA and CST, which used $O(n\logσ)$ bits of space, took $O(n\log{\logσ})$ time for the first two structures, and $O(n\log^εn)$ time for the third, where $ε$ is any positive constant. Contrary to the state of the art, our bidirectional BWT index supports every operation in constant time per element in its output.
Submitted 20 September, 2016;
originally announced September 2016.
-
String Inference from the LCP Array
Authors:
Juha Kärkkäinen,
Marcin Piątkowski,
Simon J. Puglisi
Abstract:
The suffix array, perhaps the most important data structure in modern string processing, is often augmented with the longest common prefix (LCP) array which stores the lengths of the LCPs for lexicographically adjacent suffixes of a string. Together the two arrays are roughly equivalent to the suffix tree with the LCP array representing the tree shape.
In order to better understand the combinatorics of LCP arrays, we consider the problem of inferring a string from an LCP array, i.e., determining whether a given array of integers is a valid LCP array, and if it is, reconstructing some string or all strings with that LCP array. There are recent studies of inferring a string from a suffix tree shape but using significantly more information (in the form of suffix links) than is available in the LCP array.
We provide two main results. (1) We describe two algorithms for inferring strings from an LCP array when we allow a generalized form of LCP array defined for a multiset of cyclic strings: a linear time algorithm for binary alphabet and a general algorithm with polynomial time complexity for a constant alphabet size. (2) We prove that determining whether a given integer array is a valid LCP array is NP-complete when we require more restricted forms of LCP array defined for a single cyclic or non-cyclic string or a multiset of non-cyclic strings. The result holds whether or not the alphabet is restricted to be binary. In combination, the two results show that the generalized form of LCP array for a multiset of cyclic strings is fundamentally different from the other more restricted forms.
Submitted 23 February, 2017; v1 submitted 14 June, 2016;
originally announced June 2016.
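As a small illustration of the objects in the abstract above, the sketch below builds the suffix array and LCP array of a single non-cyclic string naively, then performs string inference by brute force over all strings of the right length. It is exponential and meant only to show what the inference problem asks; it is not one of the paper's algorithms. The convention that the first LCP entry is 0, the absence of a sentinel, and all function names are assumptions of this example, and the paper's exact conventions may differ.

```python
from itertools import product


def suffix_array(s: str) -> list[int]:
    """Starting positions of the suffixes of s in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s: str, sa: list[int]) -> list[int]:
    """lcp[i] = longest common prefix length of suffixes sa[i-1] and sa[i]."""
    def lcp(a: int, b: int) -> int:
        length = 0
        while a + length < len(s) and b + length < len(s) and s[a + length] == s[b + length]:
            length += 1
        return length

    if not sa:
        return []
    return [0] + [lcp(sa[i - 1], sa[i]) for i in range(1, len(sa))]


def strings_with_lcp(target: list[int], alphabet: str) -> list[str]:
    """Brute force: every string over alphabet whose LCP array equals target."""
    result = []
    for letters in product(alphabet, repeat=len(target)):
        candidate = "".join(letters)
        if lcp_array(candidate, suffix_array(candidate)) == target:
            result.append(candidate)
    return result


sa = suffix_array("banana")
print(sa)                       # [5, 3, 1, 0, 4, 2]
print(lcp_array("banana", sa))  # [0, 1, 3, 0, 0, 2]
```

An array such as `[0, 5]` has no matching string of length 2, so `strings_with_lcp([0, 5], "ab")` returns an empty list.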
-
Document Retrieval on Repetitive String Collections
Authors:
Travis Gagie,
Aleksi Hartikainen,
Kalle Karhu,
Juha Kärkkäinen,
Gonzalo Navarro,
Simon J. Puglisi,
Jouni Sirén
Abstract:
Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-$k$ document retrieval (find the $k$ documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple tf-idf model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.
Submitted 18 May, 2017; v1 submitted 30 May, 2016;
originally announced May 2016.
-
Lempel-Ziv Decoding in External Memory
Authors:
Djamal Belazzougui,
Juha Kärkkäinen,
Dominik Kempa,
Simon J. Puglisi
Abstract:
Simple and fast decoding is one of the main advantages of LZ77-type text encoding used in many popular file compressors such as gzip and 7zip. With the recent introduction of external memory algorithms for Lempel-Ziv factorization there is a need for external memory LZ77 decoding but the standard algorithm makes random accesses to the text and cannot be trivially modified for external memory computation. We describe the first external memory algorithms for LZ77 decoding, prove that their I/O complexity is optimal, and demonstrate that they are very fast in practice, only about three times slower than in-memory decoding (when reading input and writing output is included in the time).
Submitted 31 January, 2016;
originally announced February 2016.
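The standard in-memory decoder referred to in the abstract above can be sketched as follows. It shows why decoding is simple and also why it is hard to move to external memory: every copied character is a random access into the text decoded so far. The phrase format, `(position, length)` for a copy and `(character, 0)` for a literal, is an assumption of this example, as is the function name.

```python
def lz77_decode(phrases: list[tuple]) -> str:
    """Decode LZ77 phrases: (pos, length) copies or (char, 0) literals."""
    out: list[str] = []
    for first, length in phrases:
        if length == 0:
            out.append(first)  # literal character
            continue
        if not 0 <= first < len(out):
            raise ValueError(f"copy source {first} is not in the decoded prefix")
        # Copy one character at a time so a phrase may overlap its own source.
        for k in range(length):
            out.append(out[first + k])  # random access into the decoded text
    return "".join(out)


print(lz77_decode([("a", 0), ("b", 0), (0, 4)]))  # ababab
```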
-
Diverse Palindromic Factorization is NP-Complete
Authors:
Hideo Bannai,
Travis Gagie,
Shunsuke Inenaga,
Juha Kärkkäinen,
Dominik Kempa,
Marcin Piątkowski,
Simon J. Puglisi,
Shiho Sugimoto
Abstract:
We prove that it is NP-complete to decide whether a given string can be factored into palindromes that are each unique in the factorization.
Submitted 16 February, 2017; v1 submitted 13 March, 2015;
originally announced March 2015.
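To make the decision problem in the abstract above concrete, here is a backtracking sketch that searches for a factorization into pairwise distinct palindromes. It takes exponential time in the worst case, which is consistent with, but of course does not prove, the hardness result. The function name and the choice to return one witness factorization are assumptions of this example.

```python
def diverse_palindromic_factorization(s: str):
    """Return one factorization of s into pairwise distinct palindromes, or None."""
    used: set[str] = set()
    parts: list[str] = []

    def extend(start: int) -> bool:
        if start == len(s):
            return True
        for end in range(start + 1, len(s) + 1):
            piece = s[start:end]
            if piece == piece[::-1] and piece not in used:
                used.add(piece)
                parts.append(piece)
                if extend(end):
                    return True
                used.remove(piece)  # undo and try a longer piece
                parts.pop()
        return False

    return list(parts) if extend(0) else None


print(diverse_palindromic_factorization("abcabc"))  # None
```

The string `abcabc` has no palindromic substring longer than one letter, so every palindromic factorization repeats a letter.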
-
Queries on LZ-Bounded Encodings
Authors:
Djamal Belazzougui,
Travis Gagie,
Paweł Gawrychowski,
Juha Kärkkäinen,
Alberto Ordóñez,
Simon J. Puglisi,
Yasuo Tabei
Abstract:
We describe a data structure that stores a string $S$ in space similar to that of its Lempel-Ziv encoding and efficiently supports access, rank and select queries. These queries are fundamental for implementing succinct and compressed data structures, such as compressed trees and graphs. We show that our data structure can be built in a scalable manner and is both small and fast in practice compared to other data structures supporting such queries.
Submitted 2 December, 2014;
originally announced December 2014.
-
Document Counting in Practice
Authors:
Travis Gagie,
Aleksi Hartikainen,
Juha Kärkkäinen,
Gonzalo Navarro,
Simon J. Puglisi,
Jouni Sirén
Abstract:
We address the problem of counting the number of strings in a collection where a given pattern appears, which has applications in information retrieval and data mining. Existing solutions are in a theoretical stage. We implement these solutions and develop some new variants, comparing them experimentally on various datasets. Our results not only show which are the best options for each situation and help discard practically unappealing solutions, but also uncover some unexpected compressibility properties of the best data structures. By taking advantage of these properties, we can reduce the size of the structures by a factor of 5--400, depending on the dataset.
Submitted 1 October, 2015; v1 submitted 23 September, 2014;
originally announced September 2014.
-
A Subquadratic Algorithm for Minimum Palindromic Factorization
Authors:
Gabriele Fici,
Travis Gagie,
Juha Kärkkäinen,
Dominik Kempa
Abstract:
We give an $\mathcal{O}(n \log n)$-time, $\mathcal{O}(n)$-space algorithm for factoring a string into the minimum number of palindromic substrings. That is, given a string $S [1..n]$, in $\mathcal{O}(n \log n)$ time our algorithm returns the minimum number of palindromes $S_1,\ldots, S_\ell$ such that $S = S_1 \cdots S_\ell$. We also show that the time complexity is $\mathcal{O}(n)$ on average and $Ω(n\log n)$ in the worst case. The last result is based on a characterization of the palindromic structure of Zimin words.
Submitted 7 August, 2014; v1 submitted 10 March, 2014;
originally announced March 2014.
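For comparison with the result in the abstract above, the following sketch solves the same problem with the straightforward quadratic dynamic program. It is the baseline that a subquadratic algorithm improves on, not the paper's $\mathcal{O}(n \log n)$ method, and it uses quadratic space. The function name is an assumption of this example.

```python
def min_palindromic_factorization(s: str) -> int:
    """Minimum number of palindromes whose concatenation is s (O(n^2) DP)."""
    n = len(s)
    # is_pal[i][j] is True when s[i..j] (inclusive) is a palindrome.
    is_pal = [[False] * n for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(i, n):
            if s[i] == s[j] and (j - i < 2 or is_pal[i + 1][j - 1]):
                is_pal[i][j] = True
    # best[i] is the answer for the prefix of length i.
    best = [0] + [n + 1] * n
    for i in range(1, n + 1):
        for j in range(i):
            if is_pal[j][i - 1] and best[j] + 1 < best[i]:
                best[i] = best[j] + 1
    return best[n]


print(min_palindromic_factorization("abaab"))  # 2, e.g. a + baab
```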
-
Lempel-Ziv Parsing in External Memory
Authors:
Juha Kärkkäinen,
Dominik Kempa,
Simon J. Puglisi
Abstract:
For decades, computing the LZ factorization (or LZ77 parsing) of a string has been a requisite and computationally intensive step in many diverse applications, including text indexing and data compression. Many algorithms for LZ77 parsing have been discovered over the years; however, despite the increasing need to apply LZ77 to massive data sets, no algorithm to date scales to inputs that exceed the size of internal memory. In this paper we describe the first algorithm for computing the LZ77 parsing in external memory. Our algorithm is fast in practice and will allow the next generation of text indexes to be realised for massive strings and string collections.
Submitted 4 July, 2013;
originally announced July 2013.
-
Lightweight Lempel-Ziv Parsing
Authors:
Juha Kärkkäinen,
Dominik Kempa,
Simon J. Puglisi
Abstract:
We introduce a new approach to LZ77 factorization that uses O(n/d) words of working space and O(dn) time for any d >= 1 (for polylogarithmic alphabet sizes). We also describe carefully engineered implementations of alternative approaches to lightweight LZ77 factorization. Extensive experiments show that the new algorithm is superior in most cases, particularly at the lowest memory levels and for highly repetitive data. As a part of the algorithm, we describe new methods for computing matching statistics which may be of independent interest.
Submitted 6 February, 2013; v1 submitted 5 February, 2013;
originally announced February 2013.
-
Linear Time Lempel-Ziv Factorization: Simple, Fast, Small
Authors:
Juha Kärkkäinen,
Dominik Kempa,
Simon J. Puglisi
Abstract:
Computing the LZ factorization (or LZ77 parsing) of a string is a computational bottleneck in many diverse applications, including data compression, text indexing, and pattern discovery. We describe new linear time LZ factorization algorithms, some of which require only 2n log n + O(log n) bits of working space to factorize a string of length n. These are the most space efficient linear time algorithms to date, using n log n bits less space than any previous linear time algorithm. The algorithms are also practical, simple to implement, and very fast in practice.
Submitted 12 December, 2012;
originally announced December 2012.
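To show what the LZ factorization in the abstract above computes, here is a naive sketch that tries every earlier starting position for each phrase. Its running time is far from linear, so it illustrates only the output, not the paper's algorithms. The phrase format, `(position, length)` for a copy and `(character, 0)` for a first occurrence of a letter, and the choice to allow a phrase to overlap its source are assumptions of this example.

```python
def lz77_factorize(s: str) -> list[tuple]:
    """Greedy LZ77 parsing by brute force over all earlier start positions."""
    phrases = []
    i, n = 0, len(s)
    while i < n:
        best_pos, best_len = -1, 0
        for p in range(i):
            length = 0
            # The match may run past position i (self-overlapping copy).
            while i + length < n and s[p + length] == s[i + length]:
                length += 1
            if length > best_len:
                best_pos, best_len = p, length
        if best_len == 0:
            phrases.append((s[i], 0))  # letter not seen before
            i += 1
        else:
            phrases.append((best_pos, best_len))
            i += best_len
    return phrases


print(lz77_factorize("ababab"))  # [('a', 0), ('b', 0), (0, 4)]
```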
-
A Compressed Self-Index for Genomic Databases
Authors:
Travis Gagie,
Juha Kärkkäinen,
Yakov Nekrich,
Simon J. Puglisi
Abstract:
Advances in DNA sequencing technology will soon result in databases of thousands of genomes. Within a species, individuals' genomes are almost exact copies of each other; e.g., any two human genomes are 99.9% the same. Relative Lempel-Ziv (RLZ) compression takes advantage of this property: it stores the first genome uncompressed or as an FM-index, then compresses the other genomes with a variant of LZ77 that copies phrases only from the first genome. RLZ achieves good compression and supports fast random access; in this paper we show how to support fast search as well, thus obtaining an efficient compressed self-index.
Submitted 5 November, 2011;
originally announced November 2011.
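The relative Lempel-Ziv idea described in the abstract above, copying phrases only from a reference genome, can be sketched with a greedy parser. This version finds matches with repeated `str.find` calls and is meant only to show the phrase structure and why random access is easy; it is not the self-index of the paper. The phrase format, `(position in reference, length)` or `(character, 0)` for a letter absent from the reference, and the function names are assumptions of this example.

```python
def rlz_parse(reference: str, target: str) -> list[tuple]:
    """Greedily parse target into longest substrings of reference."""
    phrases = []
    i = 0
    while i < len(target):
        pos, length = -1, 0
        # Grow the current phrase one character at a time while it still occurs.
        while i + length < len(target):
            found = reference.find(target[i:i + length + 1])
            if found < 0:
                break
            pos, length = found, length + 1
        if length == 0:
            phrases.append((target[i], 0))  # letter missing from the reference
            i += 1
        else:
            phrases.append((pos, length))
            i += length
    return phrases


def rlz_decode(reference: str, phrases: list[tuple]) -> str:
    """Rebuild the target; each phrase is decoded independently of the others."""
    return "".join(
        first if length == 0 else reference[first:first + length]
        for first, length in phrases
    )


ref = "ACGTACGT"
print(rlz_parse(ref, "CGTAAC"))  # [(1, 4), (0, 2)]
```

Because every phrase points into the fixed reference rather than into earlier output, a phrase can be decoded without decoding anything before it.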
-
A Faster Grammar-Based Self-Index
Authors:
Travis Gagie,
Paweł Gawrychowski,
Juha Kärkkäinen,
Yakov Nekrich,
Simon J. Puglisi
Abstract:
To store and search genomic databases efficiently, researchers have recently started building compressed self-indexes based on grammars. In this paper we show how, given a straight-line program with $r$ rules for a string $S[1..n]$ whose LZ77 parse consists of $z$ phrases, we can store a self-index for $S$ in $O(r + z \log \log n)$ space such that, given a pattern $P[1..m]$, we can list the $\mathrm{occ}$ occurrences of $P$ in $S$ in $O(m^2 + \mathrm{occ} \log \log n)$ time. If the straight-line program is balanced and we accept a small probability of building a faulty index, then we can reduce the $O(m^2)$ term to $O(m \log m)$. All previous self-indexes are larger or slower in the worst case.
Submitted 26 September, 2012; v1 submitted 19 September, 2011;
originally announced September 2011.
-
Fixed Block Compression Boosting in FM-Indexes
Authors:
Juha Kärkkäinen,
Simon J. Puglisi
Abstract:
A compressed full-text self-index occupies space close to that of the compressed text and simultaneously allows fast pattern matching and random access to the underlying text. Among the best compressed self-indexes, in theory and in practice, are several members of the FM-index family. In this paper, we describe new FM-index variants that combine nice theoretical properties, simple implementation and improved practical performance. Our main result is a new technique called fixed block compression boosting, which is a simpler and faster alternative to optimal compression boosting and implicit compression boosting used in previous FM-indexes.
Submitted 19 April, 2011;
originally announced April 2011.
-
Pattern Kits
Authors:
Travis Gagie,
Kalle Karhu,
Juha Kärkkäinen,
Veli Mäkinen,
Leena Salmela
Abstract:
Suppose we have just performed searches in a self-index for two patterns $A$ and $B$ and now we want to search for their concatenation $A B$; how can we best make use of our previous computations? In this paper we consider this problem and, more generally, how we can store a dynamic library of patterns that we can easily manipulate in interesting ways. We give a space- and time-efficient data structure for this problem that is compatible with many of the best self-indexes.
Submitted 2 April, 2011; v1 submitted 15 November, 2010;
originally announced November 2010.
-
Counting Colours in Compressed Strings
Authors:
Travis Gagie,
Juha Kärkkäinen
Abstract:
Suppose we are asked to preprocess a string $s[1..n]$ such that later, given a substring's endpoints, we can quickly count how many distinct characters it contains. In this paper we give a data structure for this problem that takes $n H_0(s) + O(n) + o(n H_0(s))$ bits, where $H_0(s)$ is the 0th-order empirical entropy of $s$, and answers queries in $O(\log^{1 + ε} n)$ time for any constant $ε > 0$. We also show how our data structure can be made partially dynamic.
Submitted 15 November, 2010;
originally announced November 2010.