-
Prefix-free parsing for merging big BWTs
Authors:
Diego Diaz-Dominguez,
Travis Gagie,
Veronica Guerrini,
Ben Langmead,
Zsuzsanna Liptak,
Giovanni Manzini,
Francesco Masillo,
Vikram Shivakumar
Abstract:
When building Burrows-Wheeler Transforms (BWTs) of truly huge datasets, prefix-free parsing (PFP) can use an unreasonable amount of memory. In this paper we show that if a dataset can be broken down into small datasets that are not very similar to each other -- such as collections of many copies of genomes of each of several species, or collections of many copies of each of the human chromosomes -- then we can drastically reduce PFP's memory footprint by building the BWTs of the small datasets and then merging them into the BWT of the whole dataset.
Submitted 6 June, 2025; v1 submitted 3 June, 2025;
originally announced June 2025.
-
Efficient terabyte-scale text compression via stable local consistency and parallel grammar processing
Authors:
Diego Diaz-Dominguez
Abstract:
We present a highly parallelizable text compression algorithm that scales efficiently to terabyte-sized datasets. Our method builds on locally consistent grammars, a lightweight form of compression, combined with simple recompression techniques to achieve further space reductions. Locally consistent grammar algorithms are particularly suitable for scaling, as they need minimal satellite information to compact the text. We introduce a novel concept to enable parallelisation: stable local consistency. A grammar algorithm ALG is stable if, for any pattern $P$ occurring in a collection $\mathcal{T}=\{T_1, T_2, \ldots, T_k\}$, the instances $ALG(T_1), ALG(T_2), \ldots, ALG(T_k)$ independently produce cores for $P$ with the same topology. In a locally consistent grammar, the core of $P$ is a subset of nodes and edges in $\mathcal{T}$'s parse tree that remains the same in all the occurrences of $P$. This feature is important to achieve compression, but it only holds if ALG synchronises the parsing of the strings, for instance, by defining a common set of nonterminal symbols for them. Stability removes the need for synchronisation during the parsing phase. Consequently, we can run $ALG(T_1), ALG(T_2), \ldots, ALG(T_k)$ fully in parallel and then merge the resulting grammars into a single compressed output equivalent to $ALG(\mathcal{T})$. We implemented our ideas and tested them on massive datasets. Our results showed that our method could process a diverse collection of bacterial genomes (7.9 TB) in around nine hours, requiring 16 threads and 0.43 bits/symbol of working memory, producing a compressed representation 85 times smaller than the original input.
Submitted 24 February, 2025; v1 submitted 19 November, 2024;
originally announced November 2024.
-
Computing all-vs-all MEMs in grammar-compressed text
Authors:
Diego Diaz-Dominguez,
Leena Salmela
Abstract:
We describe a compression-aware method to compute all-vs-all maximal exact matches (MEM) among strings of a repetitive collection $\mathcal{T}$. The key concept in our work is the construction of a fully-balanced grammar $\mathcal{G}$ from $\mathcal{T}$ that meets a property that we call \emph{fix-free}: the expansions of the nonterminals that have the same height in the parse tree form a fix-free set (i.e., prefix-free and suffix-free). The fix-free property allows us to compute the MEMs of $\mathcal{T}$ incrementally over $\mathcal{G}$ using a standard suffix-tree-based MEM algorithm, which runs on a subset of grammar rules at a time and does not decompress nonterminals. By modifying the locally-consistent grammar of Christiansen et al. (2020), we show how we can build $\mathcal{G}$ from $\mathcal{T}$ in linear time and space. We also demonstrate that our MEM algorithm runs on top of $\mathcal{G}$ in $O(G +occ)$ time and uses $O(\log G(G+occ))$ bits, where $G$ is the grammar size, and $occ$ is the number of MEMs in $\mathcal{T}$. In the conclusions, we discuss how our idea can be modified to implement approximate pattern matching in compressed space.
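As a small illustration of the fix-free property defined in this abstract, the following Python sketch checks whether a set of strings is both prefix-free and suffix-free. The function name `is_fix_free` and the quadratic pairwise check are assumptions made for illustration only; they are not taken from the paper, which works on grammar nonterminals rather than plain strings.

```python
def is_fix_free(strings):
    """Return True if no string is a proper prefix or suffix of another.

    Hypothetical helper: a naive O(n^2) pairwise check over distinct strings.
    """
    words = set(strings)  # duplicates are treated as one string
    for u in words:
        for v in words:
            if u != v and (v.startswith(u) or v.endswith(u)):
                return False
    return True


print(is_fix_free(["ab", "ba"]))   # neither is a prefix or suffix of the other
print(is_fix_free(["a", "ab"]))    # "a" is a prefix of "ab"
print(is_fix_free(["b", "ab"]))    # "b" is a suffix of "ab"
```

In the setting of the abstract, the strings passed to such a check would be the expansions of the nonterminals at one height of the parse tree.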
Submitted 29 June, 2023;
originally announced June 2023.
-
Computing all-vs-all MEMs in run-length encoded collections of HiFi reads
Authors:
Diego Díaz-Domínguez,
Simon J. Puglisi,
Leena Salmela
Abstract:
We describe an algorithm to find maximal exact matches (MEMs) among HiFi reads with homopolymer errors. The main novelty in our work is that we resort to run-length compression to help deal with errors. Our method receives as input a run-length-encoded string collection containing the HiFi reads along with their reverse complements. Subsequently, it splits the encoding into two arrays, one storing the sequence of symbols for equal-symbol runs and another storing the run lengths. The purpose of the split is to get the BWT of the run symbols and reorder their lengths accordingly. We show that this special BWT, as it encodes the HiFi reads and their reverse complements, supports bi-directional queries for the HiFi reads. Then, we propose a variation of the MEM algorithm of Belazzougui et al. (2013) that exploits the run-length encoding and the implicit bi-directional property of our BWT to compute approximate MEMs. Concretely, if the algorithm finds that two substrings, $a_1 \ldots a_p$ and $b_1 \ldots b_p$, have a MEM, then it reports the MEM only if their corresponding length sequences, $\ell^{a}_1 \ldots \ell^{a}_p$ and $\ell^{b}_1 \ldots \ell^{b}_p$, do not differ beyond an input threshold. We use a simple metric to calculate the similarity of the length sequences that we call the {\em run-length excess}. Our technique facilitates the detection of MEMs with homopolymer errors as it does not require dynamic programming to find approximate matches where the only edits are the lengths of the equal-symbol runs. Finally, we present a method that relies on a geometric data structure to report the text occurrences of the MEMs detected by our algorithm.
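The split of a run-length encoding into a symbol array and a length array can be sketched in a few lines of Python. The function names below are hypothetical, and `total_length_difference` is only an illustrative stand-in for comparing two length sequences: the abstract does not define the run-length excess, so this sketch does not attempt to reproduce it.

```python
from itertools import groupby


def run_length_split(read):
    """Split a read into its run symbols and the matching run lengths."""
    symbols = []
    lengths = []
    for symbol, run in groupby(read):
        symbols.append(symbol)
        lengths.append(len(list(run)))
    return "".join(symbols), lengths


def total_length_difference(lengths_a, lengths_b):
    """Assumed stand-in metric: sum of absolute run-length differences.

    This is NOT the paper's run-length excess, only a simple example of
    comparing two length sequences of equal size p.
    """
    if len(lengths_a) != len(lengths_b):
        raise ValueError("length sequences must have the same size")
    return sum(abs(x - y) for x, y in zip(lengths_a, lengths_b))


symbols_a, lengths_a = run_length_split("AAACGGT")
symbols_b, lengths_b = run_length_split("AACGGGT")
# Both reads share the run symbols "ACGT"; only the run lengths differ.
print(symbols_a, lengths_a)
print(symbols_b, lengths_b)
print(total_length_difference(lengths_a, lengths_b))
```

Two reads that differ only by homopolymer errors have identical run-symbol strings, so an exact match on the symbols followed by a thresholded comparison of the lengths avoids dynamic programming, which is the idea the abstract describes.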
Submitted 31 August, 2022;
originally announced August 2022.
-
Efficient Construction of the BWT for Repetitive Text Using String Compression
Authors:
Diego Díaz-Domínguez,
Gonzalo Navarro
Abstract:
We present a new semi-external algorithm that builds the Burrows--Wheeler transform variant of Bauer et al. (a.k.a., BCR BWT) in linear expected time. Our method uses compression techniques to reduce computational costs when the input is massive and repetitive. Concretely, we build on induced suffix sorting (ISS) and resort to run-length and grammar compression to maintain our intermediate results in compact form. Our compression format not only saves space but also speeds up the required computations. Our experiments show important space and computation time savings when the text is repetitive. In moderate-size collections of real human genome assemblies (14.2 GB - 75.05 GB), our memory peak is, on average, 1.7x smaller than the peak of the state-of-the-art BCR BWT construction algorithm (\texttt{ropebwt2}), while running 5x faster. Our current implementation was also able to compute the BCR BWT of 400 real human genome assemblies (1.2 TB) in 41.21 hours using 118.83 GB of working memory (around 10\% of the input size). Interestingly, the results we report in the 1.2 TB file are dominated by the difficulties of scanning huge files under memory constraints (specifically, I/O operations). This fact indicates we can perform much better with a more careful implementation of our method, thus scaling to even bigger sizes efficiently.
Submitted 14 August, 2023; v1 submitted 12 April, 2022;
originally announced April 2022.
-
A grammar compressor for collections of reads with applications to the construction of the BWT
Authors:
Diego Díaz-Domínguez,
Gonzalo Navarro
Abstract:
We describe a grammar for DNA sequencing reads from which we can compute the BWT directly. Our motivation is to perform, in succinct space, genomic analyses that require complex string queries not yet supported by repetition-based self-indexes. Our approach is to store the set of reads as a grammar, but when required, compute its BWT to carry out the analysis by using self-indexes. Our experiments on real data showed that the space reduction we achieve with our compressor is competitive with LZ-based methods and better than entropy-based approaches. Compared to other popular grammars, on this kind of data we achieve, on average, 12\% extra compression and require less working space and time.
Submitted 12 November, 2020;
originally announced November 2020.
-
An Index for Sequencing Reads Based on The Colored de Bruijn Graph
Authors:
Diego Diaz-Domínguez
Abstract:
In this article, we show how to transform a colored de Bruijn graph (dBG) into a practical index for processing massive sets of sequencing reads. Similar to previous works, we encode an instance of a colored dBG of the set using BOSS and a color matrix C. To reduce the space requirements, we devise an algorithm that produces a smaller and more sparse version of C. The novelties in this algorithm are (i) an incomplete coloring of the graph and (ii) a greedy coloring approach that tries to reuse the same colors for different strings when possible. We also propose two algorithms that work on top of the index; one is for reconstructing reads, and the other is for contig assembly. Experimental results show that our data structure uses about half the space of the plain representation of the set (1 Byte per DNA symbol) and that more than 99% of the reads can be reconstructed just from the index.
Submitted 29 November, 2019; v1 submitted 6 August, 2019;
originally announced August 2019.
-
Simulating the DNA String Graph in Succinct Space
Authors:
Diego Díaz-Domínguez,
Travis Gagie,
Gonzalo Navarro
Abstract:
Converting a set of sequencing reads into a lossless compact data structure that encodes all the relevant biological information is a major challenge. The classical approaches are to build the string graph or the de Bruijn graph. Each has advantages over the other depending on the application. Still, the ideal setting would be to have an index of the reads that is easy to build and can be adapted to any type of biological analysis. In this paper, we propose a new data structure we call rBOSS, which gets close to that ideal. Our rBOSS is a de Bruijn graph in practice, but it simulates any length up to k and can compute overlaps of size at least m between the labels of the nodes, with k and m being parameters. If we choose the parameter k equal to the size of the reads, then we can simulate a complete string graph. Like most BWT-based structures, rBOSS is unidirectional, but it exploits the property of the DNA reverse complements to simulate bi-directionality with some time-space trade-offs. We implemented a genome assembler on top of rBOSS to demonstrate its usefulness. Our experimental results show that using k = 100, rBOSS can assemble 185 MB of reads in less than 15 minutes using 110 MB in total. It produces contigs of mean sizes over 10,000, which is twice the size obtained by using a pure de Bruijn graph of fixed length k.
Submitted 29 November, 2019; v1 submitted 29 January, 2019;
originally announced January 2019.
-
Assembling Omnitigs using Hidden-Order de Bruijn Graphs
Authors:
Diego Díaz-Domínguez,
Djamal Belazzougui,
Travis Gagie,
Veli Mäkinen,
Gonzalo Navarro,
Simon J. Puglisi
Abstract:
De novo DNA assembly is a fundamental task in Bioinformatics, and finding Eulerian paths on de Bruijn graphs is one of the dominant approaches to it. In most of the cases, there may be no one order for the de Bruijn graph that works well for assembling all of the reads. For this reason, some de Bruijn-based assemblers try assembling on several graphs of increasing order, in turn. Boucher et al. (2015) went further and gave a representation making it possible to navigate in the graph and change order on the fly, up to a maximum $K$, but they can use up to $\lg K$ extra bits per edge because they use an LCP array. In this paper, we replace the LCP array by a succinct representation of that array's Cartesian tree, which takes only 2 extra bits per edge and still lets us support interesting navigation operations efficiently. These operations are not enough to let us easily extract unitigs and only unitigs from the graph but they do let us extract a set of safe strings that contains all unitigs. Suppose we are navigating in a variable-order de Bruijn graph representation, following these rules: if there are no outgoing edges then we reduce the order, hoping one appears; if there is exactly one outgoing edge then we take it (increasing the current order, up to $K$); if there are two or more outgoing edges then we stop. Then we traverse a (variable-order) path such that we cross edges only when we have no choice or, equivalently, we generate a string appending characters only when we have no choice. It follows that the strings we extract are safe. Our experiments show we extract a set of strings more informative than the unitigs, while using a reasonable amount of memory.
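The three navigation rules in this abstract can be sketched with a naive, set-based simulation in Python. Everything here is an assumption for illustration: the function name `traverse_without_choice`, the use of a plain set of substrings in place of the succinct variable-order de Bruijn graph, the requirement that `K >= 1`, and the `max_steps` cap that guards against cycles. It shows only the traversal rule, not the paper's data structure.

```python
def traverse_without_choice(reads, start, K, max_steps=1000):
    """Extend `start` by appending characters only when there is no choice.

    Naive stand-in for the graph: a context of length <= K has an outgoing
    edge labelled c if context + c occurs as a substring of some read.
    Assumes K >= 1.
    """
    substrings = set()
    for read in reads:
        for i in range(len(read)):
            for j in range(i + 1, min(len(read), i + K + 1) + 1):
                substrings.add(read[i:j])
    alphabet = sorted(set("".join(reads)))

    context = start[-K:]  # current node label, at most K characters
    spelled = start
    for _ in range(max_steps):
        outgoing = [c for c in alphabet if context + c in substrings]
        if not outgoing:
            if not context:
                break  # order is already zero and nothing can follow
            context = context[1:]  # no edges: reduce the order
        elif len(outgoing) == 1:
            spelled += outgoing[0]  # exactly one edge: take it
            context = (context + outgoing[0])[-K:]  # order grows, up to K
        else:
            break  # two or more edges: stop
    return spelled


print(traverse_without_choice(["ACGT"], "A", K=2))
print(traverse_without_choice(["ACGA", "ACGT"], "AC", K=2))
```

In the second call the traversal stops after `ACG`, because the context `CG` is followed by both `A` and `T` in the reads.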
Submitted 14 May, 2018;
originally announced May 2018.