-
r*-indexing
Authors:
Travis Gagie
Abstract:
Let $T [1..n]$ be a text over an alphabet of size $σ \in \mathrm{polylog} (n)$, let $r^*$ be the sum of the numbers of runs in the Burrows-Wheeler Transforms of $T$ and its reverse, and let $z$ be the number of phrases in the LZ77 parse of $T$. We show how to store $T$ in $O (r^* \log (n / r^*) + z \log n)$ bits such that, given a pattern $P [1..m]$, we can report the locations of the $\mathrm{occ}$ occurrences of $P$ in $T$ in $O (m \log n + \mathrm{occ} \log^ε n)$ time. We can also report the positions of the leftmost and rightmost occurrences of $P$ in $T$ in the same space and $O (m \log^ε n)$ time.
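As a hedged illustration of the quantity $r^*$ defined in this abstract, the following Python sketch counts the BWT runs of a text and of its reverse. It builds the BWT naively by sorting rotations, which is only sensible for tiny inputs. The function names and the `$` end-marker are assumptions made for this example and are not taken from the paper.

```python
from itertools import groupby


def bwt(text, sentinel="$"):
    """Naive BWT: sort all rotations of text + sentinel and read the last column.

    Assumes the sentinel does not occur in text and sorts before every
    character of text (true for '$' and ASCII letters).
    """
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s):
    """Number of maximal runs of equal characters in s."""
    return sum(1 for _ in groupby(s))


def r_star(text):
    """Runs in the BWT of text plus runs in the BWT of its reverse."""
    return count_runs(bwt(text)) + count_runs(bwt(text[::-1]))
```

For example, `bwt("banana")` is `annb$aa` (5 runs) and the BWT of the reversed text is `bnn$aaa` (4 runs), so `r_star("banana")` is 9.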
Submitted 18 August, 2025;
originally announced August 2025.
-
Compressing Suffix Trees by Path Decompositions
Authors:
Ruben Becker,
Davide Cenzato,
Travis Gagie,
Sung-Hwan Kim,
Ragnar Groot Koerkamp,
Giovanni Manzini,
Nicola Prezza
Abstract:
In this paper, we solve the long-standing problem of designing I/O-efficient compressed indexes. Our solution broadly consists of generalizing suffix sorting and revisiting suffix tree path compression. In classic suffix trees, path compression works by replacing unary suffix trie paths with pairs of pointers to $T$, which must be available in the form of some random access oracle at query time. In our approach, instead, we (i) sort the suffix tree's leaves according to a more general priority function $π$ (generalizing suffix sorting), (ii) build a suffix tree path decomposition prioritizing the leftmost paths in such an order, and (iii) path-compress the decomposition's paths as pointers to a small subset of the string's suffixes. At this point, we show that the colexicographically-sorted array of those pointers represents a new elegant, simple, and remarkably I/O-efficient compressed suffix tree. For instance, by taking $π$ to be the lexicographic rank of $T$'s suffixes, we can compress the suffix tree topology in $O(r)$ space on top of an $n\log σ + O(\log n)$-bit text representation while essentially matching the pattern matching I/O complexity of Weiner and McCreight's suffix tree. Another (more practical) solution is obtained by taking $π$ to be the colexicographic rank of $T$'s prefixes and using a fully-compressed random access oracle. The resulting self-index allows us to locate all occurrences of a given query pattern in less space and orders of magnitude faster than the $r$-index.
Submitted 17 July, 2025; v1 submitted 17 June, 2025;
originally announced June 2025.
-
Prefix-free parsing for merging big BWTs
Authors:
Diego Diaz-Dominguez,
Travis Gagie,
Veronica Guerrini,
Ben Langmead,
Zsuzsanna Liptak,
Giovanni Manzini,
Francesco Masillo,
Vikram Shivakumar
Abstract:
When building Burrows-Wheeler Transforms (BWTs) of truly huge datasets, prefix-free parsing (PFP) can use an unreasonable amount of memory. In this paper we show that if a dataset can be broken down into small datasets that are not very similar to each other -- such as collections of many copies of genomes of each of several species, or collections of many copies of each of the human chromosomes -- then we can drastically reduce PFP's memory footprint by building the BWTs of the small datasets and then merging them into the BWT of the whole dataset.
Submitted 6 June, 2025; v1 submitted 3 June, 2025;
originally announced June 2025.
-
The Trie Measure, Revisited
Authors:
Jarno N. Alanko,
Ruben Becker,
Davide Cenzato,
Travis Gagie,
Sung-Hwan Kim,
Bojana Kodric,
Nicola Prezza
Abstract:
In this paper, we study the following problem: given $n$ subsets $S_1, \dots, S_n$ of an integer universe $U = \{0,\dots, u-1\}$, having total cardinality $N = \sum_{i=1}^n |S_i|$, find a prefix-free encoding $enc : U \rightarrow \{0,1\}^+$ minimizing the so-called trie measure, i.e., the total number of edges in the $n$ binary tries $\mathcal T_1, \dots, \mathcal T_n$, where $\mathcal T_i$ is the trie packing the encoded integers $\{enc(x):x\in S_i\}$. We first observe that this problem is equivalent to that of merging $u$ sets with the cheapest sequence of binary unions, a problem which in [Ghosh et al., ICDCS 2015] is shown to be NP-hard. Motivated by the hardness of the general problem, we focus on particular families of prefix-free encodings. We start by studying the fixed-length shifted encoding of [Gupta et al., Theoretical Computer Science 2007]. Given a parameter $0\le a < u$, this encoding sends each $x \in U$ to $(x + a) \mod u$, interpreted as a bit-string of $\log u$ bits. We develop the first efficient algorithms that find the value of $a$ minimizing the trie measure when this encoding is used. Our two algorithms run in $O(u + N\log u)$ and $O(N\log^2 u)$ time, respectively. We proceed by studying ordered encodings (a.k.a. monotone or alphabetic), and describe an algorithm finding the optimal such encoding in $O(N+u^3)$ time. Within the same running time, we show how to compute the best shifted ordered encoding, provably no worse than both the optimal shifted and optimal ordered encodings. We provide implementations of our algorithms and discuss how these encodings perform in practice.
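To make the trie measure and the shifted encoding concrete, here is a brute-force Python sketch. It tries every shift $a$ and counts trie edges directly as distinct non-empty prefixes. This is not one of the paper's algorithms (which run in $O(u + N\log u)$ and $O(N\log^2 u)$ time); it is a slow reference under the assumption that $u$ is a power of two with $u \ge 2$, and all function names are hypothetical.

```python
def trie_edges(codes):
    """Edges of the binary trie storing the given bit-strings.

    Every trie node other than the root corresponds to one distinct
    non-empty prefix, and each such node has exactly one incoming edge.
    """
    prefixes = set()
    for code in codes:
        for length in range(1, len(code) + 1):
            prefixes.add(code[:length])
    return len(prefixes)


def shifted_trie_measure(sets, u, a):
    """Total trie edges when each x is encoded as (x + a) mod u in log u bits."""
    bits = (u - 1).bit_length()  # equals log2(u) when u is a power of two
    total = 0
    for s in sets:
        codes = [format((x + a) % u, f"0{bits}b") for x in s]
        total += trie_edges(codes)
    return total


def best_shift(sets, u):
    """Return (measure, a) for the smallest shift a that minimizes the measure."""
    return min((shifted_trie_measure(sets, u, a), a) for a in range(u))
```

With `u = 8` and the single set `{3, 4}`, the unshifted codes `011` and `100` share no prefix and cost 6 edges, while shifting by `a = 1` gives `100` and `101`, which share two edges and cost 4.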
Submitted 14 April, 2025;
originally announced April 2025.
-
KeBaB: $k$-mer based breaking for finding long MEMs
Authors:
Nathaniel K. Brown,
Lore Depuydt,
Mohsen Zakeri,
Anas Alhadi,
Nour Allam,
Dove Begleiter,
Nithin Bharathi Kabilan Karpagavalli,
Suchith Sridhar Khajjayam,
Hamza Wahed,
Travis Gagie,
Ben Langmead
Abstract:
Long maximal exact matches (MEMs) are used in many genomics applications such as read classification and sequence alignment. Li's ropebwt3 finds long MEMs quickly because it can often ignore much of its input. In this paper we show that a fast and space-efficient $k$-mer filtration step using a Bloom filter speeds up MEM-finders such as ropebwt3 even further by letting them ignore even more. We also show experimentally that our approach can accelerate metagenomic classification without significantly hurting accuracy.
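The filtration idea in this abstract can be sketched roughly as follows, under loud assumptions: the abstract does not spell out the procedure, so this Python example is one plausible reading, in which the reference's $k$-mers go into a Bloom filter and the pattern is broken into maximal fragments whose $k$-mers all pass the filter, keeping only fragments of length at least a threshold. The class, function names, and parameters (`num_bits`, `num_hashes`, `min_len`) are hypothetical and are not KeBaB's actual interface.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter over strings: false positives possible, no false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several hash functions by salting BLAKE2b differently.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(item.encode(), salt=bytes([seed])).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def index_kmers(text, k):
    """Insert every k-mer of the reference text into a fresh Bloom filter."""
    bloom = BloomFilter()
    for i in range(len(text) - k + 1):
        bloom.add(text[i:i + k])
    return bloom


def candidate_fragments(pattern, bloom, k, min_len):
    """Half-open intervals [start, end) of maximal pattern fragments whose
    k-mers all pass the filter and whose length is at least min_len."""
    fragments = []
    start = None  # start of the current run of accepted k-mers, if any
    for i in range(len(pattern) - k + 1):
        if pattern[i:i + k] in bloom:
            if start is None:
                start = i
        elif start is not None:
            end = i - 1 + k  # the last accepted k-mer started at i - 1
            if end - start >= min_len:
                fragments.append((start, end))
            start = None
    if start is not None and len(pattern) - start >= min_len:
        fragments.append((start, len(pattern)))
    return fragments
```

Because a Bloom filter never rejects a $k$-mer that was inserted, any exact match of length at least `min_len` (with `min_len >= k`) lies inside some returned fragment; a downstream MEM-finder would then only need to examine those fragments.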
Submitted 9 June, 2025; v1 submitted 27 February, 2025;
originally announced February 2025.
-
Tag arrays
Authors:
Travis Gagie
Abstract:
The Burrows-Wheeler Transform (BWT) moves characters with similar contexts in a text together, where a character's context consists of the characters immediately following it. We say that a property has contextual locality if characters with similar contexts tend to have the same or similar values ("tags") of that property. We argue that if we consider a repetitive text and such a property and the tags in their characters' BWT order, then the resulting string -- the text and property's tag array -- will be run-length compressible either directly or after some minor manipulation.
Submitted 22 November, 2024;
originally announced November 2024.
-
Fast and Small Subsampled R-indexes
Authors:
Dustin Cobas,
Travis Gagie,
Gonzalo Navarro
Abstract:
The $r$-index represented a breakthrough in compressed indexing of repetitive text collections, outperforming its alternatives by orders of magnitude in query time. Its space usage, $O(r)$ where $r$ is the number of runs in the Burrows--Wheeler Transform of the text, is however higher than Lempel--Ziv (LZ) and grammar-based indexes, and makes it uninteresting in various real-life scenarios of milder repetitiveness. We introduce the $sr$-index, a variant that limits the space to $O(\min(r,n/s))$ for a text of length $n$ and a given parameter $s$, at the expense of multiplying by $s$ the time per occurrence reported. The $sr$-index is obtained by subsampling the text positions indexed by the $r$-index, while still supporting pattern matching with guaranteed performance. Our experiments show that the theoretical analysis falls short in describing the practical advantages of the $sr$-index, because it performs much better on real texts than on synthetic ones: the $sr$-index retains the performance of the $r$-index while using 1.5--4.0 times less space, sharply outperforming virtually every other compressed index on repetitive texts in both time and space. Only a particular LZ-based index uses less space than the $sr$-index, but it is an order of magnitude slower.
Our second contribution is the $r$-csa and $sr$-csa indexes. Just like the $r$-index adapts the well-known FM-Index to repetitive texts, the $r$-csa adapts Sadakane's Compressed Suffix Array (CSA) to this case. We show that the principles used on the $r$-index turn out to fit naturally and efficiently in the CSA framework. The $sr$-csa is the corresponding subsampled version of the $r$-csa. While the CSA performs better than the FM-Index on classic texts with alphabets larger than DNA, we show that the $sr$-csa outperforms the $sr$-index on repetitive texts over those larger alphabets and some DNA texts as well.
Submitted 22 September, 2024;
originally announced September 2024.
-
Faster run-length compressed suffix arrays
Authors:
Nathaniel K. Brown,
Travis Gagie,
Giovanni Manzini,
Gonzalo Navarro,
Marinella Sciortino
Abstract:
We first review how we can store a run-length compressed suffix array (RLCSA) for a text $T$ of length $n$ over an alphabet of size $σ$ whose Burrows-Wheeler Transform (BWT) consists of $r$ runs in $O \left( r \log (n / r) + r \log σ + σ \right)$ bits such that later, given character $a$ and the suffix array interval for $P$, we can find the suffix-array (SA) interval for $a P$ in $O (\log r_a + \log \log n)$ time, where $r_a$ is the number of runs of copies of $a$ in the BWT. We then show how to modify the RLCSA such that we find the SA interval for $a P$ in only $O (\log r_a)$ time, without increasing its asymptotic space bound. Our key idea is applying a result by Nishimoto and Tabei (ICALP 2021) and then replacing rank queries on sparse bitvectors by a constant number of select queries. We also review two-level indexing and discuss how our faster RLCSA may be useful in improving it. Finally, we briefly discuss how two-level indexing may speed up a recent heuristic for finding maximal exact matches of a pattern with respect to an indexed text.
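The operation at the centre of this abstract, going from the SA interval for $P$ to the SA interval for $aP$, can be sketched on a run-length encoded BWT as below. This is a deliberately naive Python illustration: rank is answered by scanning the runs in linear time, not in the $O(\log r_a)$ time the paper achieves, and no bitvectors or move structure are used. All names are assumptions made for the example.

```python
from itertools import groupby


def bwt(text, sentinel="$"):
    """Naive BWT via sorted rotations; the sentinel must sort before the text's characters."""
    s = text + sentinel
    return "".join(rot[-1] for rot in sorted(s[i:] + s[:i] for i in range(len(s))))


def run_length_encode(s):
    """Represent s as a list of (character, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def rank(runs, c, i):
    """Number of occurrences of c among the first i characters of the encoded string."""
    count, pos = 0, 0
    for ch, length in runs:
        if pos >= i:
            break
        if ch == c:
            count += min(length, i - pos)
        pos += length
    return count


def build_c_array(runs):
    """c_array[c] = number of characters in the string that are smaller than c."""
    totals = {}
    for ch, length in runs:
        totals[ch] = totals.get(ch, 0) + length
    c_array, seen = {}, 0
    for ch in sorted(totals):
        c_array[ch] = seen
        seen += totals[ch]
    return c_array


def count_occurrences(text, pattern):
    """Backward search: one interval update per pattern character, right to left."""
    runs = run_length_encode(bwt(text))
    c_array = build_c_array(runs)
    lo, hi = 0, len(text) + 1  # half-open SA interval of the empty pattern
    for c in reversed(pattern):
        if c not in c_array:
            return 0
        lo = c_array[c] + rank(runs, c, lo)  # SA interval for c + (suffix so far)
        hi = c_array[c] + rank(runs, c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

On `banana`, the BWT `annb$aa` has 5 runs, and `count_occurrences("banana", "ana")` narrows the interval from `[0, 7)` to `[1, 4)`, `[5, 7)` and finally `[2, 4)`, giving 2 occurrences.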
Submitted 19 April, 2025; v1 submitted 8 August, 2024;
originally announced August 2024.
-
MIOV: Reordering MOVI for even better locality
Authors:
Peter Perešíni,
Nathaniel K. Brown,
Travis Gagie,
Ben Langmead
Abstract:
We consider how to reorder the rows of Nishimoto and Tabei's move structure such that we more often move from one row to the next in memory.
Submitted 12 July, 2024;
originally announced July 2024.
-
Suffixient Arrays: a New Efficient Suffix Array Compression Technique
Authors:
Davide Cenzato,
Lore Depuydt,
Travis Gagie,
Sung-Hwan Kim,
Giovanni Manzini,
Francisco Olivares,
Nicola Prezza
Abstract:
The Suffix Array is a classic text index enabling on-line pattern matching queries via simple binary search. The main drawback of the Suffix Array is that it takes linear space in the text's length, even if the text itself is extremely compressible. Several works in the literature showed that the Suffix Array can be compressed, but they all rely on complex succinct data structures which in practice tend to exhibit poor cache locality and thus significantly slow down queries. In this paper, we propose a new simple and very efficient solution to this problem by presenting the Suffixient Array: a tiny subset of the Suffix Array sufficient to locate on-line one pattern occurrence (in general, all its Maximal Exact Matches) via binary search, provided that random access to the text is available. We prove that: (i) the Suffixient Array length $χ$ is a strong repetitiveness measure, (ii) unlike most existing repetition-aware indexes such as the $r$-index, our new index is efficient in the I/O model, and (iii) Suffixient Arrays can be computed in linear time and compressed working space. We show experimentally that, when using well-established compressed random access data structures on repetitive collections, the Suffixient Array is simultaneously (i) faster and orders of magnitude smaller than the Suffix Array and (ii) smaller and one to two orders of magnitude faster than the $r$-index. With an average pattern matching query time as low as 3.5 ns per character, our new index gets very close to the ultimate lower bound: the RAM throughput of our workstation (1.18 ns per character).
Submitted 18 March, 2025; v1 submitted 26 July, 2024;
originally announced July 2024.
-
How to Find Long Maximal Exact Matches and Ignore Short Ones
Authors:
Travis Gagie
Abstract:
Finding maximal exact matches (MEMs) between strings is an important task in bioinformatics, but it is becoming increasingly challenging as geneticists switch to pangenomic references. Fortunately, we are usually interested only in the relatively few MEMs that are longer than we would expect by chance. In this paper we show that under reasonable assumptions we can find all MEMs of length at least $L$ between a pattern of length $m$ and a text of length $n$ in $O (m)$ time plus extra $O (\log n)$ time only for each MEM of length at least nearly $L$ using a compact index for the text, suitable for pangenomics.
Submitted 1 July, 2024; v1 submitted 4 March, 2024;
originally announced March 2024.
-
Taxonomic classification with maximal exact matches in KATKA kernels and minimizer digests
Authors:
Dominika Draesslerová,
Omar Ahmed,
Travis Gagie,
Jan Holub,
Ben Langmead,
Giovanni Manzini,
Gonzalo Navarro
Abstract:
For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use $k$-mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can build an augmented FM-index over the genomes in the tree concatenated in left-to-right order; for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM's occurrences in those genomes; find the minimum and maximum values stored in that interval; take the lowest common ancestor (LCA) of the genomes containing the characters at those positions. This solution is practical, however, only when the total size of the genomes in the tree is fairly small. In this paper we consider applying the same solution to three lossily compressed representations of the genomes' concatenation: a KATKA kernel, which discards characters that are not in the first or last occurrence of any $k_{\max}$-tuple, for a parameter $k_{\max}$; a minimizer digest; a KATKA kernel of a minimizer digest. With a test dataset and these three representations of it, simulated reads and various parameter settings, we checked how many reads' longest MEMs occurred only in the sequences from which those reads were generated ("true positive" reads). For some parameter settings we achieved significant compression while only slightly decreasing the true-positive rate.
Submitted 4 April, 2024; v1 submitted 10 February, 2024;
originally announced February 2024.
-
Suffixient Sets
Authors:
Lore Depuydt,
Travis Gagie,
Ben Langmead,
Giovanni Manzini,
Nicola Prezza
Abstract:
We define a suffixient set for a text $T [1..n]$ to be a set $S$ of positions between 1 and $n$ such that, for any edge descending from a node $u$ to a node $v$ in the suffix tree of $T$, there is an element $s \in S$ such that $u$'s path label is a suffix of $T [1..s - 1]$ and $T [s]$ is the first character of $(u, v)$'s edge label. We first show there is a suffixient set of cardinality at most $2 \bar{r}$, where $\bar{r}$ is the number of runs in the Burrows-Wheeler Transform of the reverse of $T$. We then show that, given a straight-line program for $T$ with $g$ rules, we can build an $O (\bar{r} + g)$-space index with which, given a pattern $P [1..m]$, we can find the maximal exact matches (MEMs) of $P$ with respect to $T$ in $O (m \log (σ) / \log n + d \log n)$ time, where $σ$ is the size of the alphabet and $d$ is the number of times we would fully or partially descend edges in the suffix tree of $T$ while finding those MEMs.
Submitted 4 June, 2024; v1 submitted 3 December, 2023;
originally announced December 2023.
-
Faster Maximal Exact Matches with Lazy LCP Evaluation
Authors:
Adrián Goga,
Lore Depuydt,
Nathaniel K. Brown,
Jan Fostier,
Travis Gagie,
Gonzalo Navarro
Abstract:
MONI (Rossi et al., JCB 2022) is a BWT-based compressed index for computing the matching statistics and maximal exact matches (MEMs) of a pattern (usually a DNA read) with respect to a highly repetitive text (usually a database of genomes) using two operations: LF-steps and longest common extension (LCE) queries on a grammar-compressed representation of the text. In practice, most of the operations are constant-time LF-steps but most of the time is spent evaluating LCE queries. In this paper we show how (a variant of) the latter can be evaluated lazily, so as to bound the total time MONI needs to process the pattern in terms of the number of MEMs between the pattern and the text, while maintaining logarithmic latency.
Submitted 8 November, 2023;
originally announced November 2023.
-
Wheeler maps
Authors:
Andrej Baláz,
Travis Gagie,
Adrián Goga,
Simon Heumos,
Gonzalo Navarro,
Alessia Petescia,
Jouni Sirén
Abstract:
Motivated by challenges in pangenomic read alignment, we propose a generalization of Wheeler graphs that we call Wheeler maps. A Wheeler map stores a text $T[1..n]$ and an assignment of tags to the characters of $T$ such that we can preprocess a pattern $P[1..m]$ and then, given $i$ and $j$, quickly return all the distinct tags labeling the first characters of the occurrences of $P[i..j]$ in $T$. For the applications that most interest us, characters with long common contexts are likely to have the same tag, so we consider the number $t$ of runs in the list of tags sorted by their characters' positions in the Burrows-Wheeler Transform (BWT) of $T$. We show how, given a straight-line program with $g$ rules for $T$, we can build an $O(g + r + t)$-space Wheeler map, where $r$ is the number of runs in the BWT of $T$, with which we can preprocess a pattern $P[1..m]$ in $O(m \log n)$ time and then return the $k$ distinct tags for $P[i..j]$ in optimal $O(k)$ time for any given $i$ and $j$. We show various further results related to prioritizing the most frequent tags.
Submitted 18 August, 2023;
originally announced August 2023.
-
Another virtue of wavelet forests?
Authors:
Christina Boucher,
Travis Gagie,
Aaron Hong,
Yansong Li,
Norbert Zeh
Abstract:
A wavelet forest for a text $T [1..n]$ over an alphabet of size $σ$ takes $n H_0 (T) + o (n \log σ)$ bits of space and supports access and rank on $T$ in $O (\log σ)$ time. Kärkkäinen and Puglisi (2011) implicitly introduced wavelet forests and showed that when $T$ is the Burrows-Wheeler Transform (BWT) of a string $S$, then a wavelet forest for $T$ occupies space bounded in terms of higher-order empirical entropies of $S$ even when the forest is implemented with uncompressed bitvectors. In this paper we show experimentally that wavelet forests also have better access locality than wavelet trees and are thus interesting even when higher-order compression is not effective on $S$, or when $T$ is not a BWT at all.
Submitted 15 August, 2023;
originally announced August 2023.
-
Space-time Trade-offs for the LCP Array of Wheeler DFAs
Authors:
Nicola Cotumaccio,
Travis Gagie,
Dominik Köppl,
Nicola Prezza
Abstract:
Recently, Conte et al. generalized the longest common prefix (LCP) array from strings to Wheeler DFAs, and they showed that it can be used to efficiently determine matching statistics on a Wheeler DFA [DCC 2023]. However, storing the LCP array requires $O(n \log n)$ bits, $n$ being the number of states, while the compact representation of Wheeler DFAs often requires much less space. In particular, the BOSS representation of a de Bruijn graph only requires a linear number of bits, if the alphabet size is constant.
In this paper, we propose a sampling technique that allows us to access an entry of the LCP array in logarithmic time by only storing a linear number of bits. We use our technique to provide a space-time trade-off to compute matching statistics on a Wheeler DFA. In addition, we show that by augmenting the BOSS representation of a $k$-th order de Bruijn graph with a linear number of bits we can navigate the underlying variable-order de Bruijn graph in time logarithmic in $k$, thus improving a previous bound by Boucher et al. which was linear in $k$ [DCC 2015].
Submitted 19 August, 2024; v1 submitted 9 June, 2023;
originally announced June 2023.
-
Acceleration of FM-index Queries Through Prefix-free Parsing
Authors:
Aaron Hong,
Marco Oliva,
Dominik Köppl,
Hideo Bannai,
Christina Boucher,
Travis Gagie
Abstract:
FM-indexes are a crucial data structure in DNA alignment, for example, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. Last year, Deng et al. proposed parsing genomic data by induced suffix sorting, and showed the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing -- which takes parameters that let us tune the average length of the phrases -- instead of induced suffix sorting, gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate it is between 3 and 18 times faster than competing methods on queries to GRCh38, and consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, our method clearly accelerates count queries compared with all state-of-the-art methods, with a minor increase in memory. Our source code is available at https://github.com/marco-oliva/afm .
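The tunable parameters mentioned in this abstract can be illustrated with a small Python sketch of prefix-free parsing as it is usually described: a window of length `w` slides over the text, a window whose hash is 0 modulo `p` is a trigger string, and consecutive phrases overlap by exactly `w` characters. The use of `zlib.crc32` as the hash, the `#` and `$` padding symbols, and the function names are assumptions for this example and are not taken from the linked source code.

```python
import zlib


def prefix_free_parse(text, w, p, start="#", end="$"):
    """Split start + text + end * w into phrases that begin and end with
    trigger strings (or the padding) and overlap their neighbours by w characters.

    Assumes the padding symbols do not occur in text. A larger p makes trigger
    strings rarer, so phrases get longer on average.
    """
    padded = start + text + end * w
    phrases = []
    begin = 0
    last_window = len(padded) - w
    for i in range(1, last_window + 1):
        window = padded[i:i + w]
        is_trigger = zlib.crc32(window.encode()) % p == 0
        if is_trigger or i == last_window:
            phrases.append(padded[begin:i + w])  # phrase ends with this window
            begin = i                            # next phrase starts with it
    return phrases


def rebuild(phrases, w):
    """Undo the parse by dropping the w-character overlap of every later phrase."""
    return phrases[0] + "".join(phrase[w:] for phrase in phrases[1:])
```

Calling `rebuild(prefix_free_parse(text, w, p), w)` returns the padded text, which shows that the parse loses no information; a word-based index would then be built over the sequence of phrases.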
Submitted 10 May, 2023;
originally announced May 2023.
-
Sum-of-Local-Effects Data Structures for Separable Graphs
Authors:
Xing Lyu,
Travis Gagie,
Meng He,
Yakov Nekrich,
Norbert Zeh
Abstract:
It is not difficult to think of applications that can be modelled as graph problems in which placing some facility or commodity at a vertex has some positive or negative effect on the values of all the vertices out to some distance, and we want to be able to calculate quickly the cumulative effect on any vertex's value at any time or the list of the most beneficial or most detrimental effects on a vertex. In this paper we show how, given an edge-weighted graph with constant-size separators, we can support the following operations on it in time polylogarithmic in the number of vertices and the number of facilities placed on the vertices, where distances between vertices are measured with respect to the edge weights:
Add (v, f, w, d) places a facility f of weight w and with effect radius d onto vertex v.
Remove (v, f) removes from v a facility f previously placed on v using Add.
Sum (v) or Sum (v, d) returns the total weight of all facilities affecting v or, with a distance parameter d, the total weight of all facilities whose effect region intersects the "circle" with radius d around v.
Top (v, k) or Top (v, k, d) returns the k facilities of greatest weight that affect v or, with a distance parameter d, whose effect region intersects the "circle" with radius d around v.
The weights of the facilities and the operation that Sum uses to "sum" them must form a semigroup. For Top queries, the weights must be drawn from a total order.
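To pin down what the operations listed above compute, here is a naive Python reference that answers each query by running Dijkstra's algorithm from the query vertex. It is a baseline with linear-or-worse query time, not the polylogarithmic structure of the paper, and it covers only the variants without the distance parameter `d`. The class name, the adjacency-list format, and the choice of ordinary addition as the semigroup operation are assumptions made for this sketch.

```python
import heapq


class NaiveLocalEffects:
    """Brute-force Add / Remove / Sum / Top on an undirected weighted graph."""

    def __init__(self, adjacency):
        # adjacency: {vertex: [(neighbour, edge_weight), ...]}, edges listed both ways
        self.adjacency = adjacency
        self.facilities = {}  # (vertex, facility id) -> (weight, effect radius)

    def add(self, v, f, w, d):
        self.facilities[(v, f)] = (w, d)

    def remove(self, v, f):
        del self.facilities[(v, f)]

    def _distances(self, source):
        """Dijkstra's algorithm; returns shortest distances from source."""
        dist = {source: 0}
        heap = [(0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue  # stale heap entry
            for v, weight in self.adjacency.get(u, []):
                candidate = d + weight
                if candidate < dist.get(v, float("inf")):
                    dist[v] = candidate
                    heapq.heappush(heap, (candidate, v))
        return dist

    def _affecting(self, v):
        """(weight, facility id) for every facility whose radius reaches v."""
        dist = self._distances(v)
        return [(w, f) for (u, f), (w, d) in self.facilities.items()
                if dist.get(u, float("inf")) <= d]

    def sum(self, v):
        return sum(w for w, _ in self._affecting(v))

    def top(self, v, k):
        best = heapq.nlargest(k, self._affecting(v), key=lambda item: item[0])
        return [f for _, f in best]
```

On the path `a - b - c` with edge weights 1 and 2, a facility of weight 5 and radius 1 on `a` together with a facility of weight 3 and radius 2 on `c` both reach `b`, so `sum("b")` is 8 while `sum("a")` is 5.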
Submitted 4 May, 2023;
originally announced May 2023.
-
Computing matching statistics on Wheeler DFAs
Authors:
Alessio Conte,
Nicola Cotumaccio,
Travis Gagie,
Giovanni Manzini,
Nicola Prezza,
Marinella Sciortino
Abstract:
Matching statistics were introduced to solve the approximate string matching problem, which is a recurrent subroutine in bioinformatics applications. In 2010, Ohlebusch et al. [SPIRE 2010] proposed a time and space efficient algorithm for computing matching statistics which relies on some components of a compressed suffix tree - notably, the longest common prefix (LCP) array. In this paper, we show how their algorithm can be generalized from strings to Wheeler deterministic finite automata. Most importantly, we introduce a notion of LCP array for Wheeler automata, thus establishing a first clear step towards extending (compressed) suffix tree functionalities to labeled graphs.
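As a point of reference only, matching statistics on plain strings can be sketched by brute force. This is not the Wheeler-DFA algorithm of the paper; it assumes the common definition in which entry i is the length of the longest prefix of P[i..] that occurs somewhere in T, and the function name is illustrative.

```python
def matching_statistics(pattern: str, text: str) -> list:
    """Brute-force matching statistics of pattern with respect to text.

    Entry i is the length of the longest prefix of pattern[i:] that
    occurs as a substring of text (assumed definition).
    """
    ms = []
    for i in range(len(pattern)):
        length = 0
        # Extend the match while the next, longer prefix still occurs in text.
        while i + length < len(pattern) and pattern[i:i + length + 1] in text:
            length += 1
        ms.append(length)
    return ms


if __name__ == "__main__":
    print(matching_statistics("abcab", "xabcy"))
```

Index-based algorithms such as the one the abstract describes compute the same values without rescanning the text for every position.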
Submitted 12 January, 2023;
originally announced January 2023.
-
Space-efficient conversions from SLPs
Authors:
Travis Gagie,
Adrián Goga,
Artur Jeż,
Gonzalo Navarro
Abstract:
We give algorithms that, given a straight-line program (SLP) with $g$ rules that generates (only) a text $T [1..n]$, build within $O(g)$ space the Lempel-Ziv (LZ) parse of $T$ (of $z$ phrases) in time $O(n\log^2 n)$ or in time $O(gz\log^2(n/z))$. We also show how to build a locally consistent grammar (LCG) of optimal size $g_{lc} = O(δ\log\frac{n}δ)$ from the SLP within $O(g+g_{lc})$ space and in $O(n\log g)$ time, where $δ$ is the substring complexity measure of $T$. Finally, we show how to build the LZ parse of $T$ from such an LCG within $O(g_{lc})$ space and in time $O(z\log^2 n \log^2(n/z))$. All our results hold with high probability.
Submitted 10 October, 2023; v1 submitted 5 December, 2022;
originally announced December 2022.
-
A fast and simple $O (z \log n)$-space index for finding approximately longest common substrings
Authors:
Nick Fagan,
Jorge Hermo González,
Travis Gagie
Abstract:
We describe how, given a text $T [1..n]$ and a positive constant $ε$, we can build a simple $O (z \log n)$-space index, where $z$ is the number of phrases in the LZ77 parse of $T$, such that later, given a pattern $P [1..m]$, in $O (m \log \log z + \mathrm{polylog} (m + z))$ time and with high probability we can find a substring of $P$ that occurs in $T$ and whose length is at least a $(1 - ε)$-fraction of the length of a longest common substring of $P$ and $T$.
Submitted 3 December, 2022; v1 submitted 24 November, 2022;
originally announced November 2022.
-
Space-efficient RLZ-to-LZ77 conversion
Authors:
Travis Gagie
Abstract:
Consider a text $T [1..n]$ prefixed by a reference sequence $R = T [1..\ell]$. We show how, given $R$ and the $z'$-phrase relative Lempel-Ziv parse of $T [\ell + 1..n]$ with respect to $R$, we can build the LZ77 parse of $T$ in $n\,\mathrm{polylog} (n)$ time and $O (\ell + z')$ total space.
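To make the input of this conversion concrete, here is a naive sketch of a greedy relative Lempel-Ziv parse. It is not the paper's space-efficient algorithm, and the output format (pairs of a 0-based position in the reference and a length, with literal characters for symbols absent from the reference) is an assumption made for illustration.

```python
def rlz_parse(reference: str, target: str) -> list:
    """Greedy relative Lempel-Ziv parse of target with respect to reference.

    Each phrase is either (position_in_reference, length) or a single
    literal character that does not occur in the reference.
    """
    phrases = []
    i = 0
    while i < len(target):
        pos, length = -1, 0
        # Grow the current phrase one character at a time.
        while i + length < len(target):
            found = reference.find(target[i:i + length + 1])
            if found < 0:
                break
            pos, length = found, length + 1
        if length == 0:
            phrases.append(target[i])  # literal: character not in reference
            i += 1
        else:
            phrases.append((pos, length))
            i += length
    return phrases


if __name__ == "__main__":
    print(rlz_parse("abcd", "cdabx"))
```

The number of phrases returned plays the role of $z'$ in the abstract.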
Submitted 3 December, 2022; v1 submitted 23 November, 2022;
originally announced November 2022.
-
Augmented Thresholds for MONI
Authors:
César Martínez-Guardiola,
Nathaniel K. Brown,
Fernando Silva-Coira,
Dominik Köppl,
Travis Gagie,
Susana Ladra
Abstract:
MONI (Rossi et al., 2022) can store a pangenomic dataset T in small space and later, given a pattern P, quickly find the maximal exact matches (MEMs) of P with respect to T. In this paper we consider its one-pass version (Boucher et al., 2021), whose query times are dominated in our experiments by longest common extension (LCE) queries. We show how a small modification lets us avoid most of these queries and thus significantly speeds up MONI in practice while only slightly increasing its size.
Submitted 14 November, 2022;
originally announced November 2022.
-
Ruler Rolling
Authors:
Xing Lyu,
Travis Gagie,
Meng He
Abstract:
At CCCG '21 O'Rourke proposed a variant of Hopcroft, Joseph and Whitesides' (1985) NP-complete problem {\sc Ruler Folding}, which he called {\sc Ruler Wrapping} and for which all folds must be 180 degrees in the same direction. Gagie, Saeidi and Sapucaia (2023) noted that if the last straight section of the ruler must be longest, then {\sc Ruler Wrapping} is equivalent to partitioning a string of positive integers into substrings whose sums are increasing such that the last substring sums to at most a given amount. They gave linear-time algorithms for the versions of {\sc Ruler Wrapping} both with and without this assumption. In real life we cannot repeatedly fold a carpenter's ruler 180 degrees in the same direction. In this paper we propose the more realistic problem of {\sc Ruler Rolling}, in which we repeatedly fold the segments 90 degrees in the same direction and thus fold the ruler into a rectangle instead of into an interval. We should report all the Pareto-optimal rollings. We note that if the last straight section of the ruler must be longer than the third to last -- analogously to Gagie et al.'s assumption -- then {\sc Ruler Rolling} is equivalent to partitioning a string of positive integers into substrings such that the sums of the even substrings are increasing, as are the sums of the odd substrings. We give a simple dynamic-programming algorithm that reports all the Pareto-optimal rollings in quadratic time under this assumption. Our algorithm still works even without the assumption, but then we are left with a quadratic number of two-dimensional feasible solutions, so finding the Pareto-optimal ones increases our running time by a logarithmic factor. If we have a nice objective function, however, we still use quadratic time.
Submitted 4 April, 2024; v1 submitted 4 October, 2022;
originally announced October 2022.
-
MARIA: Multiple-alignment $r$-index with aggregation
Authors:
Adrián Goga,
Andrej Baláž,
Alessia Petescia,
Travis Gagie
Abstract:
There now exist compact indexes that can efficiently list all the occurrences of a pattern in a dataset consisting of thousands of genomes, or even all the occurrences of all the pattern's maximal exact matches (MEMs) with respect to the dataset. Unless we are lucky and the pattern is specific to only a few genomes, however, we could be swamped by hundreds of matches -- or even hundreds per MEM -- only to discover that most or all of the matches are to substrings that occupy the same few columns in a multiple alignment. To address this issue, in this paper we present a simple and compact data index MARIA that stores a multiple alignment such that, given the position of one match of a pattern (or a MEM or other substring of a pattern) and its length, we can quickly list all the distinct columns of the multiple alignment where matches start.
Submitted 19 September, 2022;
originally announced September 2022.
-
Teaching the Burrows-Wheeler Transform via the Positional Burrows-Wheeler Transform
Authors:
Travis Gagie,
Giovanni Manzini,
Marinella Sciortino
Abstract:
The Burrows-Wheeler Transform (BWT) is often taught in undergraduate courses on algorithmic bioinformatics, because it underlies the FM-index and thus important tools such as Bowtie and BWA. Its admirers consider the BWT a thing of beauty but, despite thousands of pages being written about it over nearly thirty years, to undergraduates seeing it for the first time it still often seems like magic. Some who persevere are later shown the Positional BWT (PBWT), which was published twenty years after the BWT. In this paper we argue that the PBWT should be taught {\em before} the BWT.
We first use the PBWT's close relation to a right-to-left radix sort to explain how to use it as a fast and space-efficient index for {\em positional search} on a set of strings (that is, given a pattern and a position, quickly list the strings containing that pattern starting in that position). We then observe that {\em prefix search} (listing all the strings that start with the pattern) is an easy special case of positional search, and that prefix search on the suffixes of a single string is equivalent to {\em substring search} in that string (listing all the starting positions of occurrences of the pattern in the string).
Storing naïvely a PBWT of the suffixes of a string is space-{\em inefficient} but, in even reasonably small examples, most of its columns are nearly the same. It is not difficult to show that if we store a PBWT of the cyclic shifts of the string, instead of its suffixes, then all the columns are exactly the same -- and equal to the BWT of the string. Thus we can teach the BWT and the FM-index via the PBWT.
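The two observations above, that substring search is prefix search on the suffixes and that the BWT arises from the sorted cyclic shifts, can be illustrated with a small brute-force sketch. It does not build a PBWT; the function names and the use of `$` as a unique end-of-string sentinel are assumptions made for illustration.

```python
def bwt_via_cyclic_shifts(s: str) -> str:
    """Return the last column of the sorted cyclic shifts of s.

    Assumes s ends with a unique sentinel (here '$') that sorts before
    every other character.
    """
    shifts = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(shift[-1] for shift in shifts)


def substring_search(text: str, pattern: str) -> list:
    """Substring search expressed as prefix search over the suffixes of text."""
    return [i for i in range(len(text)) if text[i:].startswith(pattern)]


if __name__ == "__main__":
    print(bwt_via_cyclic_shifts("banana$"))    # annb$aa
    print(substring_search("banana$", "ana"))  # [1, 3]
```

Sorting all cyclic shifts explicitly takes quadratic space, so this is only a classroom-scale illustration of the definition.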
Submitted 21 August, 2022;
originally announced August 2022.
-
KATKA: A KRAKEN-like tool with $k$ given at query time
Authors:
Travis Gagie,
Sana Kashgouli,
Ben Langmead
Abstract:
We describe a new tool, KATKA, that stores a phylogenetic tree $T$ such that later, given a pattern $P [1..m]$ and an integer $k$, it can quickly return the root of the smallest subtree of $T$ containing all the genomes in which the $k$-mer $P [i..i + k - 1]$ occurs, for $1 \leq i \leq m - k + 1$. This is similar to KRAKEN's functionality but with $k$ given at query time instead of at construction time.
Submitted 22 August, 2022; v1 submitted 13 June, 2022;
originally announced June 2022.
-
On representing the degree sequences of sublogarithmic-degree Wheeler graphs
Authors:
Travis Gagie
Abstract:
We show how to store a searchable partial-sums data structure with constant query time for a static sequence $S$ of $n$ positive integers, each in $o \left( \frac{\log n}{(\log \log n)^2} \right)$, in $n H_k (S) + o (n)$ bits for $k \in o \left( \frac{\log n}{(\log \log n)^2} \right)$. It follows that if a Wheeler graph on $n$ vertices has maximum degree in $o \left( \frac{\log n}{(\log \log n)^2} \right)$, then we can store its in- and out-degree sequences $D_{\mathrm{in}}$ and $D_{\mathrm{out}}$ in $n H_k (D_{\mathrm{in}}) + o (n)$ and $n H_k (D_{\mathrm{out}}) + o (n)$ bits, for $k \in o \left( \frac{\log n}{(\log \log n)^2} \right)$, such that querying them for pattern matching in the graph takes constant time.
Submitted 22 August, 2022; v1 submitted 16 April, 2022;
originally announced April 2022.
-
Improving Matrix-vector Multiplication via Lossless Grammar-Compressed Matrices
Authors:
Paolo Ferragina,
Travis Gagie,
Dominik Köppl,
Giovanni Manzini,
Gonzalo Navarro,
Manuel Striani,
Francesco Tosoni
Abstract:
As nowadays Machine Learning (ML) techniques are generating huge data collections, the problem of how to efficiently engineer their storage and operations is becoming of paramount importance. In this article we propose a new lossless compression scheme for real-valued matrices which achieves efficient performance in terms of compression ratio and time for linear-algebra operations. Experiments show that, as a compressor, our tool is clearly superior to gzip and it is usually within 20% of xz in terms of compression ratio. In addition, our compressed format supports matrix-vector multiplications in time and space proportional to the size of the compressed representation, unlike gzip and xz that require the full decompression of the compressed matrix. To our knowledge our lossless compressor is the first one achieving time and space complexities which match the theoretical limit expressed by the $k$-th order statistical entropy of the input.
To achieve further time/space reductions, we propose column-reordering algorithms hinging on a novel column-similarity score. Our experiments on various data sets of ML matrices show that, with a modest preprocessing time, our column reordering can yield a further reduction of up to 16% in the peak memory usage during matrix-vector multiplication.
Finally, we compare our proposal against the state-of-the-art Compressed Linear Algebra (CLA) approach, showing that ours always runs at least twice as fast (in a multi-threaded setting) and achieves better compressed space occupancy for most of the tested data sets. This experimentally confirms the provably effective theoretical bounds we show for our compressed-matrix approach.
Submitted 30 March, 2022; v1 submitted 28 March, 2022;
originally announced March 2022.
-
MONI can find k-MEMs
Authors:
Igor Tatarnikov,
Ardavan Shahrabi Farahani,
Sana Kashgouli,
Travis Gagie
Abstract:
Suppose we are asked to index a text $T [0..n - 1]$ such that, given a pattern $P [0..m - 1]$, we can quickly report the maximal substrings of $P$ that each occur in $T$ at least $k$ times. We first show how we can add $O (r \log n)$ bits to Rossi et al.'s recent MONI index, where $r$ is the number of runs in the Burrows-Wheeler Transform of $T$, such that it supports such queries in $O (k m \log n)$ time. We then show how, if we are given $k$ at construction time, we can reduce the query time to $O (m \log n)$.
Submitted 21 December, 2022; v1 submitted 10 February, 2022;
originally announced February 2022.
-
RLBWT Tricks
Authors:
Nathaniel K. Brown,
Travis Gagie,
Massimiliano Rossi
Abstract:
Until recently, most experts would probably have agreed we cannot backwards-step in constant time with a run-length compressed Burrows-Wheeler Transform (RLBWT), since doing so relies on rank queries on sparse bitvectors and those inherit lower bounds from predecessor queries. At ICALP '21, however, Nishimoto and Tabei described a new, simple and constant-time implementation. For a permutation $π$, it stores an $O (r)$-space table -- where $r$ is the number of positions $i$ where either $i = 0$ or $π(i + 1) \neq π(i) + 1$ -- that enables the computation of successive values of $π(i)$ by table look-ups and linear scans. Nishimoto and Tabei showed how to increase the number of rows in the table to bound the length of the linear scans such that the query time for computing $π(i)$ is constant while maintaining $O (r)$-space.
In this paper we refine Nishimoto and Tabei's approach, including a time-space tradeoff, and experimentally evaluate different implementations demonstrating the practicality of part of their result. We show that even without adding rows to the table, in practice we almost always scan only a few entries during queries. We propose a decomposition scheme of the permutation $π$ corresponding to the LF-mapping that allows an improved compression of the data structure, while limiting the query time. We tested our implementation on real-world genomic datasets and found that without compression of the table, backward-stepping is drastically faster than with sparse bitvector implementations but, unfortunately, also uses drastically more space. After compression, backward-stepping is competitive both in time and space with the best existing implementations.
Submitted 13 July, 2022; v1 submitted 8 December, 2021;
originally announced December 2021.
-
Ruler Wrapping
Authors:
Travis Gagie,
Mozhgan Saeidi,
Allan Sapucaia
Abstract:
In 1985 Hopcroft, Joseph and Whitesides showed it is NP-complete to decide whether a carpenter's ruler with segments of given positive lengths can be folded into a line of at most a given length, such that the folded hinges alternate between 180 degrees clockwise and 180 degrees counter-clockwise. At the open-problem session of the 33rd Canadian Conference on Computational Geometry (CCCG '21), O'Rourke proposed a natural variation of this problem called {\em ruler wrapping}, in which all folded hinges must be folded the same way. In this paper we show O'Rourke's variation has a linear-time solution. We also show how, given a sequence of positive numbers, in linear time we can partition it into the maximum number of substrings whose totals are non-decreasing.
Submitted 9 January, 2022; v1 submitted 29 September, 2021;
originally announced September 2021.
-
Simple Worst-Case Optimal Adaptive Prefix-Free Coding
Authors:
Travis Gagie
Abstract:
We give a new and simple worst-case optimal algorithm for adaptive prefix-free coding that matches Gagie and Nekrich's bounds except for lower-order terms, and uses no data structures more complicated than a lookup table. Moreover, when Gagie and Nekrich's algorithm is modified for adaptive alphabetic prefix-free coding its decoding time slows down to $O (\log \log n)$ per character, but ours can be modified for this problem with no asymptotic slowdown. As far as we know, this gives the first algorithm for this alphabetic problem that is simultaneously worst-case optimal in terms of encoding and decoding time and of encoding length.
Submitted 21 April, 2025; v1 submitted 7 September, 2021;
originally announced September 2021.
-
Succinct Euler-Tour Trees
Authors:
Travis Gagie,
Sebastian Wild
Abstract:
We show how a collection of Euler-tour trees for a forest on $n$ vertices can be stored in $2 n + o (n)$ bits such that simple queries take constant time, more complex queries take logarithmic time and updates take polylogarithmic amortized time.
Submitted 29 June, 2021; v1 submitted 11 May, 2021;
originally announced May 2021.
-
A Fast and Small Subsampled R-index
Authors:
Dustin Cobas,
Travis Gagie,
Gonzalo Navarro
Abstract:
The $r$-index (Gagie et al., JACM 2020) represented a breakthrough in compressed indexing of repetitive text collections, outperforming its alternatives by orders of magnitude. Its space usage, $\mathcal{O}(r)$ where $r$ is the number of runs in the Burrows-Wheeler Transform of the text, is however larger than Lempel-Ziv and grammar-based indexes, and makes it uninteresting in various real-life scenarios of milder repetitiveness. In this paper we introduce the $sr$-index, a variant that limits the space to $\mathcal{O}(\min(r,n/s))$ for a text of length $n$ and a given parameter $s$, at the expense of multiplying by $s$ the time per occurrence reported. The $sr$-index is obtained by carefully subsampling the text positions indexed by the $r$-index, in a way that we prove is still able to support pattern matching with guaranteed performance. Our experiments demonstrate that the $sr$-index sharply outperforms virtually every other compressed index on repetitive texts, both in time and space, even matching the performance of the $r$-index while using 1.5--3.0 times less space. Only some Lempel-Ziv-based indexes achieve better compression than the $sr$-index, using about half the space, but they are an order of magnitude slower.
Submitted 29 March, 2021;
originally announced March 2021.
-
$r$-indexing Wheeler graphs
Authors:
Travis Gagie
Abstract:
Let $G$ be a Wheeler graph and $r$ be the number of runs in a Burrows-Wheeler Transform of $G$, and suppose $G$ can be decomposed into $\upsilon$ edge-disjoint directed paths whose internal vertices each have in- and out-degree exactly 1. We show how to store $G$ in $O (r + \upsilon)$ space such that later, given a pattern $P$, in $O (|P| \log \log |G|)$ time we can count the vertices of $G$ reachable by directed paths labelled $P$, and then report those vertices in $O (\log \log |G|)$ time per vertex.
Submitted 28 January, 2021;
originally announced January 2021.
-
PHONI: Streamed Matching Statistics with Multi-Genome References
Authors:
Christina Boucher,
Travis Gagie,
Tomohiro I,
Dominik Köppl,
Ben Langmead,
Giovanni Manzini,
Gonzalo Navarro,
Alejandro Pacheco,
Massimiliano Rossi
Abstract:
Computing the matching statistics of patterns with respect to a text is a fundamental task in bioinformatics, but a formidable one when the text is a highly compressed genomic database. Bannai et al. gave an efficient solution for this case, which Rossi et al. recently implemented, but it uses two passes over the patterns and buffers a pointer for each character during the first pass. In this paper, we simplify their solution and make it streaming, at the cost of slowing it down slightly. This means that, first, we can compute the matching statistics of several long patterns (such as whole human chromosomes) in parallel while still using a reasonable amount of RAM; second, we can compute matching statistics online with low latency and thus quickly recognize when a pattern becomes incompressible relative to the database.
Submitted 11 February, 2021; v1 submitted 11 November, 2020;
originally announced November 2020.
-
PFP Data Structures
Authors:
Christina Boucher,
Ondřej Cvacho,
Travis Gagie,
Jan Holub,
Giovanni Manzini,
Gonzalo Navarro,
Massimiliano Rossi
Abstract:
Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string $S$, it produces a dictionary $D$ and a parse $P$ of overlapping phrases such that $\mathrm{BWT} (S)$ can be computed from $D$ and $P$ in time and workspace bounded in terms of their combined size $|\mathrm{PFP} (S)|$. In practice $D$ and $P$ are significantly smaller than $S$ and computing $\mathrm{BWT} (S)$ from them is more efficient than computing it from $S$ directly, at least when $S$ consists of genomes from individuals of the same species. In this paper, we consider $\mathrm{PFP} (S)$ as a {\em data structure} and show how it can be augmented to support the following queries quickly, still in $O (|\mathrm{PFP} (S)|)$ space: longest common extension (LCE), suffix array (SA), longest common prefix (LCP) and BWT. Lastly, we provide experimental evidence that the PFP data structure can be efficiently constructed for very large repetitive datasets: it takes one hour and 54 GB peak memory for $1000$ variants of human chromosome 19, initially occupying roughly 56 GB.
Submitted 20 June, 2020;
originally announced June 2020.
-
Faster Dynamic Compressed d-ary Relations
Authors:
Diego Arroyuelo,
Guillermo de Bernardo,
Travis Gagie,
Gonzalo Navarro
Abstract:
The $k^2$-tree is a successful compact representation of binary relations that exhibit sparseness and/or clustering properties. It can be extended to $d$ dimensions, where it is called a $k^d$-tree. The representation boils down to a long bitvector. We show that interpreting the $k^d$-tree as a dynamic trie on the Morton codes of the points, instead of as a dynamic representation of the bitvector as done in previous work, yields operation times that are below the lower bound of dynamic bitvectors and offers improved time performance in practice.
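For orientation, a Morton code interleaves the bits of a point's coordinates, so that with $k = 2$ each group of $d$ bits selects one child at one level of the tree. The sketch below only computes such codes; the function name, the most-significant-bit-first order and the choice of which coordinate comes first are assumptions, not details taken from the paper.

```python
def morton_code(point: tuple, bits: int) -> int:
    """Interleave the bits of the coordinates of point into one integer.

    Each coordinate is assumed to be a non-negative integer that fits in
    `bits` bits. Bits are taken from most to least significant and, within
    each bit position, in coordinate order.
    """
    code = 0
    for bit in range(bits - 1, -1, -1):
        for coord in point:
            code = (code << 1) | ((coord >> bit) & 1)
    return code


if __name__ == "__main__":
    # x = 3 = 011, y = 5 = 101 interleave to 01 10 11 = 27.
    print(morton_code((3, 5), 3))
```

Inserting the resulting codes into a trie, one group of $d$ bits per level, gives the view of the $k^d$-tree that the abstract describes.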
Submitted 20 November, 2019;
originally announced November 2019.
-
Practical Random Access to SLP-Compressed Texts
Authors:
Travis Gagie,
Tomohiro I,
Giovanni Manzini,
Gonzalo Navarro,
Hiroshi Sakamoto,
Louisa Seelbach Benkner,
Yoshimasa Takabatake
Abstract:
Grammar-based compression is a popular and powerful approach to compressing repetitive texts but until recently its relatively poor time-space trade-offs during real-life construction made it impractical for truly massive datasets such as genomic databases. In a recent paper (SPIRE 2019) we showed how simple pre-processing can dramatically improve those trade-offs, and in this paper we turn our attention to one of the features that make grammar-based compression so attractive: the possibility of supporting fast random access. This is an essential primitive in many algorithms that process grammar-compressed texts without decompressing them and so many theoretical bounds have been published about it, but experimentation has lagged behind. We give a new encoding of grammars that is about as small as the practical state of the art (Maruyama et al., SPIRE 2013) but with significantly faster queries.
Submitted 19 July, 2020; v1 submitted 15 October, 2019;
originally announced October 2019.
-
Matching reads to many genomes with the $r$-index
Authors:
Taher Mun,
Alan Kuhnle,
Christina Boucher,
Travis Gagie,
Ben Langmead,
Giovanni Manzini
Abstract:
The $r$-index is a tool for compressed indexing of genomic databases for exact pattern matching, which can be used to completely align reads that perfectly match some part of a genome in the database or to find seeds for reads that do not. This paper shows how to download and install the programs ri-buildfasta and ri-align; how to call ri-buildfasta on a FASTA file to build an $r$-index for that file; and how to query that index with ri-align.
Availability: The source code for these programs is released under GPLv3 and available at https://github.com/alshai/r-index .
Submitted 3 August, 2019;
originally announced August 2019.
-
Rpair: Rescaling RePair with Rsync
Authors:
Travis Gagie,
Tomohiro I,
Giovanni Manzini,
Gonzalo Navarro,
Hiroshi Sakamoto,
Yoshimasa Takabatake
Abstract:
Data compression is a powerful tool for managing massive but repetitive datasets, especially schemes such as grammar-based compression that support computation over the data without decompressing it. In the best case such a scheme takes a dataset so big that it must be stored on disk and shrinks it enough that it can be stored and processed in internal memory. Even then, however, the scheme is essentially useless unless it can be built on the original dataset reasonably quickly while keeping the dataset on disk. In this paper we show how we can preprocess such datasets with context-triggered piecewise hashing such that afterwards we can apply RePair and other grammar-based compressors more easily. We first give our algorithm, then show how a variant of it can be used to approximate the LZ77 parse, then leverage that to prove theoretical bounds on compression, and finally give experimental evidence that our approach is competitive in practice.
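The preprocessing idea can be illustrated with a small sketch of context-triggered (content-defined) chunking: a chunk ends wherever a rolling hash of the last few bytes hits a chosen residue, so boundaries depend only on local content. This is not the authors' implementation; the function names, the Karp-Rabin style hash, and the `window` and `modulus` values are assumptions chosen for illustration.

```python
def chunk_boundaries(data: bytes, window: int = 4, modulus: int = 8):
    """Return the end offsets of chunks of `data`.

    A chunk ends after byte i when the hash of the last `window` bytes
    is 0 modulo `modulus`; the final offset len(data) is always included.
    """
    base = 257                 # illustrative hash parameters (assumptions)
    prime = (1 << 61) - 1
    top = pow(base, window - 1, prime)
    h = 0
    cuts = []
    for i, byte in enumerate(data):
        if i >= window:
            # Drop the byte that just left the window.
            h = (h - data[i - window] * top) % prime
        h = (h * base + byte) % prime
        if i + 1 >= window and h % modulus == 0:
            cuts.append(i + 1)
    if data and (not cuts or cuts[-1] != len(data)):
        cuts.append(len(data))
    return cuts


def chunks(data: bytes, window: int = 4, modulus: int = 8):
    """Split `data` at the boundaries chosen by `chunk_boundaries`."""
    out, start = [], 0
    for end in chunk_boundaries(data, window, modulus):
        out.append(data[start:end])
        start = end
    return out
```

Because each boundary is determined by a fixed-size window, prepending bytes to the input shifts the later boundaries without changing which windows trigger them. Repeated regions of a dataset are therefore cut into the same chunks, which is the kind of regularity that makes a subsequent grammar-based compressor easier to apply.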
Submitted 3 June, 2019;
originally announced June 2019.
-
Simulating the DNA String Graph in Succinct Space
Authors:
Diego Díaz-Domínguez,
Travis Gagie,
Gonzalo Navarro
Abstract:
Converting a set of sequencing reads into a lossless compact data structure that encodes all the relevant biological information is a major challenge. The classical approaches are to build the string graph or the de Bruijn graph. Each has advantages over the other depending on the application. Still, the ideal setting would be to have an index of the reads that is easy to build and can be adapted to any type of biological analysis. In this paper, we propose a new data structure we call rBOSS, which gets close to that ideal. Our rBOSS is a de Bruijn graph in practice, but it simulates any length up to k and can compute overlaps of size at least m between the labels of the nodes, with k and m being parameters. If we choose the parameter k equal to the size of the reads, then we can simulate a complete string graph. Like most BWT-based structures, rBOSS is unidirectional, but it exploits the property of the DNA reverse complements to simulate bi-directionality with some time-space trade-offs. We implemented a genome assembler on top of rBOSS to demonstrate its usefulness. Our experimental results show that using k = 100, rBOSS can assemble 185 MB of reads in less than 15 minutes using 110 MB in total. It produces contigs of mean sizes over 10,000, which is twice the size obtained by using a pure de Bruijn graph of fixed length k.
Submitted 29 November, 2019; v1 submitted 29 January, 2019;
originally announced January 2019.
-
Efficient Construction of a Complete Index for Pan-Genomics Read Alignment
Authors:
Alan Kuhnle,
Taher Mun,
Christina Boucher,
Travis Gagie,
Ben Langmead,
Giovanni Manzini
Abstract:
While short read aligners, which predominantly use the FM-index, are able to easily index one or a few human genomes, they do not scale well to indexing databases containing thousands of genomes. To understand why, it helps to examine the main components of the FM-index in more detail: first, a rank data structure over the Burrows-Wheeler Transform (BWT) of the string, which allows us to find the interval in the string's suffix array (SA) containing pointers to the starting positions of occurrences of a given pattern; second, a sample of the SA that, when used with the rank data structure, allows us to access the SA. The rank data structure can be kept small even for large genomic databases, by run-length compressing the BWT, but until recently there was no known means to keep the SA sample small without greatly slowing down access to the SA. Now that Gagie et al. (SODA 2018) have defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018 we showed how to build the BWT of large genomic databases efficiently (WABI 2018) but the problem of building Gagie et al.'s SA sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method for indexing partial and whole human genomes, and show that it improves over Bowtie with respect to both memory and time.
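The two FM-index components named in this abstract can be sketched on a toy string. The code below is a naive illustration, not the construction method of the paper: it builds the full suffix array by sorting, uses `str.count` as a stand-in for a rank data structure, and all function names are assumptions. It also counts the runs in the BWT, the quantity that run-length compression exploits.

```python
from collections import Counter


def bwt_via_suffix_array(text):
    """Return (SA, BWT) of `text`, which must end with a unique '$' sentinel."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] wraps to the sentinel
    return sa, bwt


def count_runs(s):
    """Number of maximal runs of equal characters in `s`."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def backward_search(bwt, pattern):
    """Return the half-open SA interval [lo, hi) of suffixes prefixed by `pattern`."""
    counts = Counter(bwt)
    first = {}  # first[c] = number of characters in the text smaller than c
    total = 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in first:
            return 0, 0
        lo = first[c] + bwt[:lo].count(c)  # naive rank query
        hi = first[c] + bwt[:hi].count(c)
        if lo >= hi:
            return 0, 0
    return lo, hi
```

For `"banana$"`, the interval returned for the pattern `"ana"` covers two suffix array cells, and reading those cells gives the starting positions of the occurrences. A real index keeps only a sample of the SA instead of the whole array, which is the part whose construction the abstract discusses.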
Submitted 16 November, 2018;
originally announced November 2018.
-
Tunneling on Wheeler Graphs
Authors:
Jarno Alanko,
Travis Gagie,
Gonzalo Navarro,
Louisa Seelbach Benkner
Abstract:
The Burrows-Wheeler Transform (BWT) is an important technique both in data compression and in the design of compact indexing data structures. It has been generalized from single strings to collections of strings and some classes of labeled directed graphs, such as tries and de Bruijn graphs. The BWTs of repetitive datasets are often compressible using run-length compression, but recently Baier (CPM 2018) described how they could be even further compressed using an idea he called tunneling. In this paper we show that tunneled BWTs can still be used for indexing and extend tunneling to the BWTs of Wheeler graphs, a framework that includes all the generalizations mentioned above.
Submitted 29 May, 2019; v1 submitted 6 November, 2018;
originally announced November 2018.
-
Relative compression of trajectories
Authors:
Nieves R. Brisaboa,
Travis Gagie,
Adrián Gómez-Brandón,
Gonzalo Navarro,
José R. Paramá
Abstract:
We present RCT, a new compact data structure to represent trajectories of objects. It is based on a relative compression technique called Relative Lempel-Ziv (RLZ), which compresses sequences by applying an LZ77 encoding with respect to an artificial reference. RCT combines this encoding with $O(z)$-sized data structures on the sequence of phrases that allow trajectory and spatio-temporal queries to be solved efficiently. We expect RCT to improve on the compression and time performance of the previous compressed representations in the state of the art.
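As a rough illustration of the RLZ idea that RCT builds on, the sketch below greedily parses a sequence into phrases that are copied from a reference. It is a quadratic-time toy on plain strings, not the RCT structure: the phrase format, the literal fallback for symbols missing from the reference, and the function names are assumptions made for the example.

```python
def rlz_parse(reference, sequence):
    """Greedily parse `sequence` into longest matches against `reference`.

    Each phrase is ("copy", position, length), or ("literal", symbol) when
    the next symbol does not occur in the reference at all.
    """
    phrases = []
    i = 0
    while i < len(sequence):
        best_pos, best_len = -1, 0
        for start in range(len(reference)):
            length = 0
            while (start + length < len(reference)
                   and i + length < len(sequence)
                   and reference[start + length] == sequence[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = start, length
        if best_len == 0:
            phrases.append(("literal", sequence[i]))
            i += 1
        else:
            phrases.append(("copy", best_pos, best_len))
            i += best_len
    return phrases


def rlz_decode(reference, phrases):
    """Rebuild the original sequence from its phrases."""
    out = []
    for phrase in phrases:
        if phrase[0] == "literal":
            out.append(phrase[1])
        else:
            _, pos, length = phrase
            out.append(reference[pos:pos + length])
    return "".join(out)
```

The number of phrases produced plays the role of $z$ in the abstract: the better the reference matches the sequence, the fewer phrases are needed and the smaller the auxiliary structures built over them.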
Submitted 12 October, 2018;
originally announced October 2018.
-
Compressing and Indexing Aligned Readsets
Authors:
Travis Gagie,
Garance Gourdel,
Giovanni Manzini
Abstract:
In this paper we show how to use one or more assembled or partially assembled genomes as the basis for a compressed full-text index of the corresponding readset. Specifically, we build a labelled tree by taking the assembled genome as a trunk and grafting onto it the reads that align to it, at the starting positions of their alignments. Next, we compute the eXtended Burrows-Wheeler Transform (XBWT) of the resulting labelled tree and build a compressed full-text index on that. Although this index can occasionally return false positives, it is usually much more compact than the alternatives. Following the established practice for datasets with many repetitions, we compare different full-text indices by looking at the number of runs in the transformed strings. For a human Chr19 readset our preliminary experiments show that eliminating separator characters from the EBWT reduces the number of runs by 19\%, from 220 million to 178 million, and using the XBWT reduces it by a further 15\%, to 150 million.
Submitted 1 June, 2021; v1 submitted 19 September, 2018;
originally announced September 2018.
-
Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space
Authors:
Travis Gagie,
Gonzalo Navarro,
Nicola Prezza
Abstract:
Indexing highly repetitive texts - such as genomic databases, software repositories and versioned text collections - has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently within O(r) space (in loglogarithmic time each), and reaching optimal time, O(m + occ), within O(r log log_w(σ + n/r)) space, for a text of length n over an alphabet of size σ on a RAM machine with words of w = Ω(log n) bits. Within that space, our index can also count in optimal time, O(m). Multiplying the space by O(w / log σ), we support count and locate in O(⌈m log(σ)/w⌉) and O(⌈m log(σ)/w⌉ + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and extracts any text substring of length ℓ in almost-optimal time O(log(n/r) + ℓ log(σ)/w). Within that space, we similarly provide direct access to suffix array, inverse suffix array, and longest common prefix array cells, and extend these capabilities to full suffix tree functionality, typically in O(log(n/r)) time per operation.
Submitted 4 July, 2019; v1 submitted 8 September, 2018;
originally announced September 2018.
-
Tree Path Majority Data Structures
Authors:
Travis Gagie,
Meng He,
Gonzalo Navarro
Abstract:
We present the first solution to $τ$-majorities on tree paths. Given a tree of $n$ nodes, each with a label from $[1..σ]$, and a fixed threshold $0<τ<1$, such a query gives two nodes $u$ and $v$ and asks for all the labels that appear more than $τ\cdot |P_{uv}|$ times in the path $P_{uv}$ from $u$ to $v$, where $|P_{uv}|$ denotes the number of nodes in $P_{uv}$. Note that the answer to any query is of size up to $1/τ$. On a $w$-bit RAM, we obtain a linear-space data structure with $O((1/τ)\log^* n \log\log_w σ)$ query time. For any $κ> 1$, we can also build a structure that uses $O(n\log^{[κ]} n)$ space, where $\log^{[κ]} n$ denotes the function that applies logarithm $κ$ times to $n$, and answers queries in time $O((1/τ)\log\log_w σ)$. The construction time of both structures is $O(n\log n)$. We also describe two succinct-space solutions with the same query time as the linear-space structure. One uses $2nH + 4n + o(n)(H+1)$ bits, where $H \le \lgσ$ is the entropy of the label distribution, and can be built in $O(n\log n)$ time. The other uses $nH + O(n) + o(nH)$ bits and is built in $O(n\log n)$ time w.h.p.
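The query itself can be pinned down with a brute-force baseline that walks the path and counts labels. This is only a reference definition of the problem, taking time proportional to the path length, and has nothing to do with the data structures of the paper; the parent-array tree representation and the function names are assumptions.

```python
from collections import Counter


def tree_path(parent, u, v):
    """Nodes on the path from u to v in a rooted tree.

    `parent[x]` is the parent of node x, with -1 for the root.
    """
    ancestors_u = []
    x = u
    while x != -1:
        ancestors_u.append(x)
        x = parent[x]
    index_in_u = {node: k for k, node in enumerate(ancestors_u)}
    tail = []
    x = v
    while x not in index_in_u:  # climb from v until we meet u's ancestor chain
        tail.append(x)
        x = parent[x]
    # x is now the lowest common ancestor of u and v.
    return ancestors_u[:index_in_u[x] + 1] + tail[::-1]


def path_tau_majorities(parent, labels, u, v, tau):
    """Labels occurring more than tau * |P_uv| times on the path from u to v."""
    path = tree_path(parent, u, v)
    counts = Counter(labels[x] for x in path)
    threshold = tau * len(path)
    return sorted(label for label, c in counts.items() if c > threshold)
```

On a path of five nodes with labels 2, 2, 1, 1, 1, a threshold of $τ = 0.5$ reports only label 1, while $τ = 0.3$ reports both labels, consistent with the bound of at most $1/τ$ answers per query.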
Submitted 6 September, 2018; v1 submitted 5 June, 2018;
originally announced June 2018.