Search | arXiv e-print repository

Compressing Suffix Trees by Path Decompositions

Authors: Ruben Becker, Davide Cenzato, Travis Gagie, Sung-Hwan Kim, Ragnar Groot Koerkamp, Giovanni Manzini, Nicola Prezza

Abstract: In classic suffix trees, path compression works by replacing unary suffix trie paths with pairs of pointers to $T$, which must be available in the form of some random access oracle at query time. In this paper, we revisit path compression and show that a more careful choice of pointers leads to a new elegant, simple, and remarkably efficient way to compress the suffix tree. We begin by observing that an alternative way to path-compress the suffix trie of $T$ is to decompose it into a set of (disjoint) node-to-leaf paths and then represent each path as a pointer $i$ to one of the string's suffixes $T[i,n]$. At this point, we show that the array $A$ of such indices $i$, sorted by the colexicographic order of the corresponding text prefixes $T[1,i]$, possesses the following properties: (i) it supports \emph{cache-efficient} pattern matching queries via simple binary search on $A$ and random access on $T$, and (ii) it contains a number of entries proportional to the size of the \emph{compressed text}. Of particular interest is the path decomposition given by the colexicographic rank of $T$'s prefixes. The resulting index is smaller and orders of magnitude faster than the $r$-index on the task of locating all occurrences of a query pattern.

Submitted 17 June, 2025; originally announced June 2025.

Comments: preliminary incomplete draft. Many details missing!

arXiv:2506.03294 [pdf, ps, other]

Prefix-free parsing for merging big BWTs

Authors: Diego Diaz-Dominguez, Travis Gagie, Veronica Guerrini, Ben Langmead, Zsuzsanna Liptak, Giovanni Manzini, Francesco Masillo, Vikram Shivakumar

Abstract: When building Burrows-Wheeler Transforms (BWTs) of truly huge datasets, prefix-free parsing (PFP) can use an unreasonable amount of memory. In this paper we show that if a dataset can be broken down into small datasets that are not very similar to each other -- such as collections of many copies of genomes of each of several species, or collections of many copies of each of the human chromosomes -- then we can drastically reduce PFP's memory footprint by building the BWTs of the small datasets and then merging them into the BWT of the whole dataset.

Submitted 6 June, 2025; v1 submitted 3 June, 2025; originally announced June 2025.

arXiv:2505.11302 [pdf, ps, other]

Depth first representations of $k^2$-trees

Authors: Gabriel Carmona, Giovanni Manzini

Abstract: The $k^2$-tree is a compact data structure designed to efficiently store sparse binary matrices by leveraging both sparsity and clustering of nonzero elements. This representation efficiently supports navigational operations and complex binary operations, such as matrix-matrix multiplication, while maintaining space efficiency. The standard $k^2$-tree follows a level-by-level representation, which, while effective, prevents further compression of identical subtrees and is not cache friendly when accessing individual subtrees. In this work, we introduce some novel depth-first representations of the $k^2$-tree and propose an efficient linear-time algorithm to identify and compress identical subtrees within these structures. Our experimental results show that the use of depth-first representations is a strategy worth pursuing: for the adjacency matrices of web graphs, exploiting the presence of identical subtrees does improve the compression ratio, and for some matrices depth-first representations turn out to be faster than the standard $k^2$-tree in computing the matrix-matrix multiplication.

Submitted 16 May, 2025; originally announced May 2025.

Comments: extended submission for SPIRE 2025
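To make the idea of a depth-first layout concrete, the following is a minimal sketch of a pre-order bit serialization of a $k^2$-tree. It is not the representation proposed in the paper: the function names `encode_dfs` and `decode_dfs` are hypothetical, the matrix side is assumed to be a power of $k$, and no subtree compression is attempted.

```python
def encode_dfs(matrix, k=2):
    """Serialize a square 0/1 matrix as a k^2-tree in pre-order (depth-first).

    Each submatrix larger than 1x1 emits 1 if it contains a nonzero and 0
    otherwise; only nonempty submatrices are expanded into k*k children.
    A 1x1 submatrix emits the cell value itself.
    """
    n = len(matrix)
    bits = []

    def visit(r, c, size):
        if size == 1:
            bits.append(matrix[r][c])
            return
        nonempty = any(
            matrix[i][j]
            for i in range(r, r + size)
            for j in range(c, c + size)
        )
        bits.append(1 if nonempty else 0)
        if not nonempty:
            return
        step = size // k
        for dr in range(k):
            for dc in range(k):
                visit(r + dr * step, c + dc * step, step)

    visit(0, 0, n)
    return bits


def decode_dfs(bits, n, k=2):
    """Rebuild the n x n matrix from the bit sequence produced by encode_dfs."""
    matrix = [[0] * n for _ in range(n)]
    pos = 0

    def visit(r, c, size):
        nonlocal pos
        bit = bits[pos]
        pos += 1
        if size == 1:
            matrix[r][c] = bit
            return
        if bit == 0:
            return  # empty submatrix: nothing below it was serialized
        step = size // k
        for dr in range(k):
            for dc in range(k):
                visit(r + dr * step, c + dc * step, step)

    visit(0, 0, n)
    return matrix
```

In this layout all bits of a subtree are contiguous, which is the property that makes identical subtrees show up as identical bit substrings.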

arXiv:2505.10680 [pdf, ps, other]

Generalization of Repetitiveness Measures for Two-Dimensional Strings

Authors: Lorenzo Carfagna, Giovanni Manzini, Giuseppe Romana, Marinella Sciortino, Cristian Urbina

Abstract: The problem of detecting and measuring the repetitiveness of one-dimensional strings has been extensively studied in data compression and text indexing. Our understanding of these issues has been significantly improved by the introduction of the notion of string attractor [Kempa and Prezza, STOC 2018] and by the results showing the relationship between attractors and other measures of compressibility. When the input data are structured in a non-linear way, as in two-dimensional strings, inherent redundancy often offers an even richer source for compression. However, systematic studies on repetitiveness measures for two-dimensional strings are still scarce. In this paper we extend to two or more dimensions the main measures of complexity introduced for one-dimensional strings. We distinguish between the measures $δ$ and $γ$, defined in terms of the substrings of the input, and the measures $g$, $g_{rl}$, and $b$, which are based on copy-paste mechanisms. We study the properties and mutual relationships between these two classes and we show that the two classes become incomparable for $d$-dimensional inputs as soon as $d\geq 2$. Moreover, we show that our grammar-based representation of a $d$-dimensional string of size $N$ enables direct access to any symbol in $O(\log N)$ time. We also compare our measures for two-dimensional strings with the 2D Block Tree data structure [Brisaboa et al., Computer J., 2024] and provide some insights for the design of future effective two-dimensional compressors.

Submitted 15 May, 2025; originally announced May 2025.

Comments: 37 pages

arXiv:2409.18620 [pdf, other]

Toward Greener Matrix Operations by Lossless Compressed Formats

Authors: Francesco Tosoni, Philip Bille, Valerio Brunacci, Alessio De Angelis, Paolo Ferragina, Giovanni Manzini

Abstract: Sparse matrix-vector multiplication (SpMV) is a fundamental operation in machine learning, scientific computing, and graph algorithms. In this paper, we investigate the space, time, and energy efficiency of SpMV using various compressed formats for large sparse matrices, focusing specifically on Boolean matrices and real-valued vectors. Through extensive analysis and experiments conducted on server and edge devices, we found that different matrix compression formats offer distinct trade-offs among space usage, execution time, and energy consumption. Notably, by employing the appropriate compressed format, we can reduce energy consumption by an order of magnitude on both server and single-board computers. Furthermore, our experiments indicate that while data parallelism can enhance execution speed and energy efficiency, achieving simultaneous time and energy efficiency presents partially distinct challenges. Specifically, we show that for certain compression schemes, the optimal degree of parallelism for time does not align with that for energy, thereby challenging prevailing assumptions about a straightforward linear correlation between execution time and energy consumption. Our results have significant implications for software engineers in all domains where SpMV operations are prevalent. They also suggest that similar studies exploring the trade-offs between time, space, and energy for other compressed data structures can substantially contribute to designing more energy-efficient software components.

Submitted 27 September, 2024; originally announced September 2024.

Comments: 19 pages, 10 figures, 2 tables
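As a baseline for the setting studied in this entry (Boolean matrix, real-valued vector), here is a minimal sketch of SpMV over a CSR layout that stores no values, since every stored entry is 1. This is a plain textbook format, not one of the compressed formats evaluated in the paper; `to_boolean_csr` and `spmv_boolean` are hypothetical names.

```python
def to_boolean_csr(dense):
    """Convert a dense 0/1 matrix into (row_ptr, col_idx).

    No value array is needed: every stored entry is implicitly 1.
    """
    row_ptr = [0]
    col_idx = []
    for row in dense:
        for j, v in enumerate(row):
            if v:
                col_idx.append(j)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx


def spmv_boolean(row_ptr, col_idx, x):
    """Compute y = M x for a Boolean CSR matrix M and a real-valued vector x."""
    y = []
    for i in range(len(row_ptr) - 1):
        start, end = row_ptr[i], row_ptr[i + 1]
        # Each row contributes the sum of the x entries at its nonzero columns.
        y.append(sum(x[j] for j in col_idx[start:end]))
    return y
```

Because the multiplication degenerates into sums of selected vector entries, any lossless encoding of `col_idx` that can be scanned row by row supports the same loop.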

arXiv:2408.04537 [pdf, other]

Faster run-length compressed suffix arrays

Authors: Nathaniel K. Brown, Travis Gagie, Giovanni Manzini, Gonzalo Navarro, Marinella Sciortino

Abstract: We first review how we can store a run-length compressed suffix array (RLCSA) for a text $T$ of length $n$ over an alphabet of size $σ$ whose Burrows-Wheeler Transform (BWT) consists of $r$ runs in $O \left( r \log (n / r) + r \log σ + σ \right)$ bits such that later, given character $a$ and the suffix array interval for $P$, we can find the suffix-array (SA) interval for $a P$ in $O (\log r_a + \log \log n)$ time, where $r_a$ is the number of runs of copies of $a$ in the BWT. We then show how to modify the RLCSA such that we find the SA interval for $a P$ in only $O (\log r_a)$ time, without increasing its asymptotic space bound. Our key idea is applying a result by Nishimoto and Tabei (ICALP 2021) and then replacing rank queries on sparse bitvectors by a constant number of select queries. We also review two-level indexing and discuss how our faster RLCSA may be useful in improving it. Finally, we briefly discuss how two-level indexing may speed up a recent heuristic for finding maximal exact matches of a pattern with respect to an indexed text.

Submitted 19 April, 2025; v1 submitted 8 August, 2024; originally announced August 2024.
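The query discussed in this entry, mapping the SA interval of $P$ to the SA interval of $a P$, can be illustrated on a plain, uncompressed BWT. The sketch below is not the RLCSA: it uses naive linear-time rank counting instead of run-length structures, assumes the text does not contain `$` and that `$` sorts before every text character, and all function names are hypothetical.

```python
def bwt_of(text):
    """BWT of text + '$', built from naively sorted suffixes (small inputs only)."""
    s = text + "$"
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT character of a suffix is the character preceding it (cyclically).
    return "".join(s[i - 1] for i in sa)


def count_runs(bwt):
    """Number r of maximal runs of equal characters in the BWT."""
    return sum(1 for i, ch in enumerate(bwt) if i == 0 or ch != bwt[i - 1])


def backward_step(bwt, a, lo, hi):
    """Map the half-open SA interval [lo, hi) of P to the SA interval of aP."""
    smaller = sum(1 for ch in bwt if ch < a)  # suffixes starting with a smaller character
    return smaller + bwt[:lo].count(a), smaller + bwt[:hi].count(a)


def count_occurrences(text, pattern):
    """Count occurrences of pattern in text with repeated backward steps."""
    bwt = bwt_of(text)
    lo, hi = 0, len(bwt)
    for a in reversed(pattern):
        lo, hi = backward_step(bwt, a, lo, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

The two `count` calls in `backward_step` are the rank queries whose cost the abstract's time bounds refer to; here they take linear time.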

arXiv:2407.18753 [pdf, other]

Suffixient Arrays: a New Efficient Suffix Array Compression Technique

Authors: Davide Cenzato, Lore Depuydt, Travis Gagie, Sung-Hwan Kim, Giovanni Manzini, Francisco Olivares, Nicola Prezza

Abstract: The Suffix Array is a classic text index enabling on-line pattern matching queries via simple binary search. The main drawback of the Suffix Array is that it takes linear space in the text's length, even if the text itself is extremely compressible. Several works in the literature showed that the Suffix Array can be compressed, but they all rely on complex succinct data structures which in practice tend to exhibit poor cache locality and thus significantly slow down queries. In this paper, we propose a new simple and very efficient solution to this problem by presenting the \emph{Suffixient Array}: a tiny subset of the Suffix Array \emph{sufficient} to locate on-line one pattern occurrence (in general, all its Maximal Exact Matches) via binary search, provided that random access to the text is available. We prove that: (i) the Suffixient Array length $χ$ is a strong repetitiveness measure, (ii) unlike most existing repetition-aware indexes such as the $r$-index, our new index is efficient in the I/O model, and (iii) Suffixient Arrays can be computed in linear time and compressed working space. We show experimentally that, when using well-established compressed random access data structures on repetitive collections, the Suffixient Array $\SuA$ is \emph{simultaneously} (i) faster and orders of magnitude smaller than the Suffix Array $\SA$ and (ii) smaller and \emph{one to two orders of magnitude faster} than the $r$-index. With an average pattern matching query time as low as 3.5 ns per character, our new index gets very close to the ultimate lower bound: the RAM throughput of our workstation (1.18 ns per character).

Submitted 18 March, 2025; v1 submitted 26 July, 2024; originally announced July 2024.

Comments: 40 pages, 7 figures, 1 table and 7 pseudocodes
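For reference, the classic technique this entry starts from (pattern matching by binary search on the full Suffix Array with random access to the text) can be sketched as follows. This is the uncompressed baseline, not the Suffixient Array; the function names are hypothetical and the construction is a naive sort suitable only for small inputs.

```python
def build_suffix_array(text):
    """Suffix array by naive sorting of all suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted starting positions of pattern in text.

    Two binary searches over sa delimit the suffixes whose first
    len(pattern) characters equal the pattern.
    """
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])
```

Every comparison reads up to `len(pattern)` characters of the text, which is why random access to the text is a prerequisite for this kind of index.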

arXiv:2404.14235 [pdf, other]

Computing the LCP Array of a Labeled Graph

Authors: Jarno Alanko, Davide Cenzato, Nicola Cotumaccio, Sung-Hwan Kim, Giovanni Manzini, Nicola Prezza

Abstract: The LCP array is an important tool in stringology, making it possible to speed up pattern matching algorithms and enabling compact representations of the suffix tree. Recently, Conte et al. [DCC 2023] and Cotumaccio et al. [SPIRE 2023] extended the definition of this array to Wheeler DFAs and, ultimately, to arbitrary labeled graphs, proving that it can be used to efficiently solve matching statistics queries on the graph's paths. In this paper, we provide the first efficient algorithm building the LCP array of a directed labeled graph with $n$ nodes and $m$ edges labeled over an alphabet of size $σ$. After arguing that the natural generalization of a compact-space LCP-construction algorithm by Beller et al. [J. Discrete Algorithms 2013] runs in time $Ω(nσ)$, we present a new algorithm based on dynamic range stabbing building the LCP array in $O(n\log σ)$ time and $O(n\log σ)$ bits of working space.

Submitted 22 April, 2024; originally announced April 2024.
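For readers unfamiliar with the string case that this entry generalizes, the sketch below computes the LCP array of a single string with the well-known linear-time algorithm of Kasai et al. It does not handle labeled graphs and is unrelated to the range-stabbing algorithm of the paper; the function names are hypothetical.

```python
def build_suffix_array(text):
    """Suffix array by naive sorting of all suffixes (small inputs only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """LCP array via Kasai's algorithm.

    lcp[r] is the length of the longest common prefix of the suffixes
    of rank r and r - 1 in the suffix array; lcp[0] is 0.
    """
    n = len(text)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r

    lcp = [0] * n
    h = 0  # number of characters already known to match
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]  # suffix preceding suffix i in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1  # dropping the first character loses at most one match
    return lcp
```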

arXiv:2402.06935 [pdf, other]

Taxonomic classification with maximal exact matches in KATKA kernels and minimizer digests

Authors: Dominika Draesslerová, Omar Ahmed, Travis Gagie, Jan Holub, Ben Langmead, Giovanni Manzini, Gonzalo Navarro

Abstract: For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use $k$-mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can build an augmented FM-index over the genomes in the tree concatenated in left-to-right order; for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM's occurrences in those genomes; find the minimum and maximum values stored in that interval; take the lowest common ancestor (LCA) of the genomes containing the characters at those positions. This solution is practical, however, only when the total size of the genomes in the tree is fairly small. In this paper we consider applying the same solution to three lossily compressed representations of the genomes' concatenation: a KATKA kernel, which discards characters that are not in the first or last occurrence of any $k_{\max}$-tuple, for a parameter $k_{\max}$; a minimizer digest; a KATKA kernel of a minimizer digest. With a test dataset and these three representations of it, simulated reads and various parameter settings, we checked how many reads' longest MEMs occurred only in the sequences from which those reads were generated ("true positive" reads). For some parameter settings we achieved significant compression while only slightly decreasing the true-positive rate.

Submitted 4 April, 2024; v1 submitted 10 February, 2024; originally announced February 2024.

arXiv:2312.01359 [pdf, other]

Suffixient Sets

Authors: Lore Depuydt, Travis Gagie, Ben Langmead, Giovanni Manzini, Nicola Prezza

Abstract: We define a suffixient set for a text $T [1..n]$ to be a set $S$ of positions between 1 and $n$ such that, for any edge descending from a node $u$ to a node $v$ in the suffix tree of $T$, there is an element $s \in S$ such that $u$'s path label is a suffix of $T [1..s - 1]$ and $T [s]$ is the first character of $(u, v)$'s edge label. We first show there is a suffixient set of cardinality at most $2 \bar{r}$, where $\bar{r}$ is the number of runs in the Burrows-Wheeler Transform of the reverse of $T$. We then show that, given a straight-line program for $T$ with $g$ rules, we can build an $O (\bar{r} + g)$-space index with which, given a pattern $P [1..m]$, we can find the maximal exact matches (MEMs) of $P$ with respect to $T$ in $O (m \log (σ) / \log n + d \log n)$ time, where $σ$ is the size of the alphabet and $d$ is the number of times we would fully or partially descend edges in the suffix tree of $T$ while finding those MEMs.

Submitted 4 June, 2024; v1 submitted 3 December, 2023; originally announced December 2023.

arXiv:2307.02629 [pdf, other]

The landscape of compressibility measures for two-dimensional data

Authors: Lorenzo Carfagna, Giovanni Manzini

Abstract: In this paper we extend to two-dimensional data two recently introduced one-dimensional compressibility measures: the $γ$ measure defined in terms of the smallest string attractor, and the $δ$ measure defined in terms of the number of distinct substrings of the input string. Concretely, we introduce the two-dimensional measures $γ_{2D}$ and $δ_{2D}$, as natural generalizations of $γ$ and $δ$, and we initiate the study of their properties. Among other things, we prove that $δ_{2D}$ is monotone and can be computed in linear time, and we show that, although it is still true that $δ_{2D} \leq γ_{2D}$, the gap between the two measures can be $Ω(\sqrt{n})$ and therefore asymptotically larger than the gap between $γ$ and $δ$. To complete the scenario of two-dimensional compressibility measures, we introduce the measure $b_{2D}$ which generalizes to two dimensions the notion of optimal parsing. We prove that, somewhat surprisingly, the relationship between $b_{2D}$ and $γ_{2D}$ is significantly different than in the one-dimensional case. As an application of our results we provide the first analysis of the space usage of the two-dimensional block tree introduced in [Brisaboa et al., Two-dimensional block trees, The Computer Journal, 2024]. Our analysis shows that the space usage can be bounded in terms of both $γ_{2D}$ and $δ_{2D}$. Finally, using insights from our analysis, we design the first linear time and space algorithm for constructing the two-dimensional block tree for arbitrary matrices.

Submitted 20 May, 2024; v1 submitted 5 July, 2023; originally announced July 2023.
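The one-dimensional $δ$ measure mentioned in this entry has a short definition: $δ = \max_k d_k / k$, where $d_k$ is the number of distinct substrings of length $k$. The sketch below computes it by brute force for a single string. It covers only the one-dimensional case, not $δ_{2D}$, runs in quadratic time and space rather than linear time, and `delta` is a hypothetical name.

```python
from fractions import Fraction


def delta(s):
    """One-dimensional delta measure: max over k of (distinct length-k substrings) / k."""
    n = len(s)
    best = Fraction(0)
    for k in range(1, n + 1):
        distinct = len({s[i:i + k] for i in range(n - k + 1)})
        best = max(best, Fraction(distinct, k))
    return best
```

Exact fractions are used so that values such as 3/2 are compared without floating-point rounding. For example, `delta("abab")` is 2 because there are two distinct substrings of length 1, while longer lengths give ratios of at most 1.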

arXiv:2301.05338 [pdf, ps, other]

Computing matching statistics on Wheeler DFAs

Authors: Alessio Conte, Nicola Cotumaccio, Travis Gagie, Giovanni Manzini, Nicola Prezza, Marinella Sciortino

Abstract: Matching statistics were introduced to solve the approximate string matching problem, which is a recurrent subroutine in bioinformatics applications. In 2010, Ohlebusch et al. [SPIRE 2010] proposed a time and space efficient algorithm for computing matching statistics which relies on some components of a compressed suffix tree - notably, the longest common prefix (LCP) array. In this paper, we show how their algorithm can be generalized from strings to Wheeler deterministic finite automata. Most importantly, we introduce a notion of LCP array for Wheeler automata, thus establishing a first clear step towards extending (compressed) suffix tree functionalities to labeled graphs.

Submitted 12 January, 2023; originally announced January 2023.
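As a reminder of what is being computed in the string case, the sketch below evaluates matching statistics directly from their usual definition: for each position $i$ of a pattern, the length of the longest prefix of the pattern's suffix starting at $i$ that occurs somewhere in the text. It is a brute-force illustration, not the algorithm of Ohlebusch et al. and not its generalization to Wheeler automata; `matching_statistics` is a hypothetical name.

```python
def matching_statistics(pattern, text):
    """ms[i] = length of the longest prefix of pattern[i:] that occurs in text."""
    ms = []
    for i in range(len(pattern)):
        length = 0
        # Extend the match one character at a time while it still occurs in text.
        while i + length < len(pattern) and pattern[i:i + length + 1] in text:
            length += 1
        ms.append(length)
    return ms
```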

arXiv:2208.09840 [pdf, ps, other]

Teaching the Burrows-Wheeler Transform via the Positional Burrows-Wheeler Transform

Authors: Travis Gagie, Giovanni Manzini, Marinella Sciortino

Abstract: The Burrows-Wheeler Transform (BWT) is often taught in undergraduate courses on algorithmic bioinformatics, because it underlies the FM-index and thus important tools such as Bowtie and BWA. Its admirers consider the BWT a thing of beauty but, despite thousands of pages being written about it over nearly thirty years, to undergraduates seeing it for the first time it still often seems like magic. Some who persevere are later shown the Positional BWT (PBWT), which was published twenty years after the BWT. In this paper we argue that the PBWT should be taught {\em before} the BWT. We first use the PBWT's close relation to a right-to-left radix sort to explain how to use it as a fast and space-efficient index for {\em positional search} on a set of strings (that is, given a pattern and a position, quickly list the strings containing that pattern starting in that position). We then observe that {\em prefix search} (listing all the strings that start with the pattern) is an easy special case of positional search, and that prefix search on the suffixes of a single string is equivalent to {\em substring search} in that string (listing all the starting positions of occurrences of the pattern in the string). Storing naïvely a PBWT of the suffixes of a string is space-{\em inefficient} but, in even reasonably small examples, most of its columns are nearly the same. It is not difficult to show that if we store a PBWT of the cyclic shifts of the string, instead of its suffixes, then all the columns are exactly the same -- and equal to the BWT of the string. Thus we can teach the BWT and the FM-index via the PBWT.

Submitted 21 August, 2022; originally announced August 2022.
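The link between the PBWT and radix sorting can be seen in a few lines. The sketch below follows the standard PBWT construction, in which one stable bucketing pass per column keeps the strings ordered by their reversed prefixes. It assumes equal-length strings, the function names are hypothetical, and the paper's own presentation may orient the sort differently.

```python
def positional_prefix_orders(strings):
    """Return orders[k] for k = 0..width.

    orders[k] lists the string indices sorted by the reverse of their
    length-k prefix, with ties broken by index. Each step is one stable
    pass of a radix sort on column k.
    """
    order = list(range(len(strings)))
    orders = [order[:]]
    width = len(strings[0]) if strings else 0
    for k in range(width):
        # Python's sort is stable, so earlier columns act as tie-breakers.
        order = sorted(order, key=lambda i: strings[i][k])
        orders.append(order[:])
    return orders


def pbwt_column(strings, order, k):
    """Column k of the strings, read in the given order."""
    return "".join(strings[i][k] for i in order)
```

Calling `pbwt_column(strings, orders[k], k)` gives column `k` read in the order induced by the preceding columns, which is the permuted column that a PBWT stores.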

arXiv:2205.05643 [pdf, other]

A New Class of String Transformations for Compressed Text Indexing

Authors: Raffaele Giancarlo, Giovanni Manzini, Antonio Restivo, Giovanna Rosone, Marinella Sciortino

Abstract: Introduced about thirty years ago in the field of Data Compression, the Burrows-Wheeler Transform (BWT) is a string transformation that, besides being a booster of the performance of memoryless compressors, plays a fundamental role in the design of efficient self-indexing compressed data structures. Finding other string transformations with the same remarkable properties as the BWT has been a challenge for many researchers for a long time. Among the known BWT variants, the only one that has been recently shown to be a valid alternative to BWT is the Alternating BWT (ABWT), another invertible string transformation introduced about ten years ago in connection with a generalization of Lyndon words. In this paper, we introduce a whole class of new string transformations, called local orderings-based transformations, which have all the myriad virtues of BWT. We show that this new family is a special case of a much larger class of transformations, based on context adaptive alphabet orderings, that includes BWT and ABWT. Although all transformations support pattern search, we show that, in the general case, the transformations within our larger class may take quadratic time for inversion and pattern search. As a further result, we show that the local orderings-based transformations can be used for the construction of the recently introduced r-index, which makes them suitable also for highly repetitive collections. In this context, we consider the problem of finding, for a given string, the BWT variant that minimizes the number of runs in the transformed string, and we provide an algorithm solving this problem in linear time.

Submitted 8 May, 2023; v1 submitted 11 May, 2022; originally announced May 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:1902.01280

arXiv:2203.14540 [pdf, other]

Improving Matrix-vector Multiplication via Lossless Grammar-Compressed Matrices

Authors: Paolo Ferragina, Travis Gagie, Dominik Köppl, Giovanni Manzini, Gonzalo Navarro, Manuel Striani, Francesco Tosoni

Abstract: As Machine Learning (ML) techniques are nowadays generating huge data collections, the problem of how to efficiently engineer their storage and operations is becoming of paramount importance. In this article we propose a new lossless compression scheme for real-valued matrices which achieves efficient performance in terms of compression ratio and time for linear-algebra operations. Experiments show that, as a compressor, our tool is clearly superior to gzip and it is usually within 20% of xz in terms of compression ratio. In addition, our compressed format supports matrix-vector multiplications in time and space proportional to the size of the compressed representation, unlike gzip and xz that require the full decompression of the compressed matrix. To our knowledge our lossless compressor is the first one achieving time and space complexities which match the theoretical limit expressed by the $k$-th order statistical entropy of the input. To achieve further time/space reductions, we propose column-reordering algorithms hinging on a novel column-similarity score. Our experiments on various data sets of ML matrices show that, with a modest preprocessing time, our column reordering can yield a further reduction of up to 16% in the peak memory usage during matrix-vector multiplication. Finally, we compare our proposal against the state-of-the-art Compressed Linear Algebra (CLA) approach showing that ours always runs at least twice as fast (in a multi-thread setting) and achieves better compressed space occupancy for most of the tested data sets. This experimentally confirms the provably effective theoretical bounds we show for our compressed-matrix approach.

Submitted 30 March, 2022; v1 submitted 28 March, 2022; originally announced March 2022.

arXiv:2111.04150 [pdf, other]

doi 10.1016/j.cma.2021.114352

Extended virtual element method for two-dimensional linear elastic fracture

Authors: Elena Benvenuti, Andrea Chiozzi, Gianmarco Manzini, N. Sukumar

Abstract: In this paper, we propose an eXtended Virtual Element Method (X-VEM) for two-dimensional linear elastic fracture. This approach, which is an extension of the standard Virtual Element Method (VEM), facilitates mesh-independent modeling of crack discontinuities and elastic crack-tip singularities on general polygonal meshes. For elastic fracture in the X-VEM, the standard virtual element space is augmented by additional basis functions that are constructed by multiplying standard virtual basis functions by suitable enrichment fields, such as asymptotic mixed-mode crack-tip solutions. The design of the X-VEM requires an extended projector that maps functions lying in the extended virtual element space onto a set spanned by linear polynomials and the enrichment fields. An efficient scheme to compute the mixed-mode stress intensity factors using the domain form of the interaction integral is described. The formulation permits integration of weakly singular functions to be performed over the boundary edges of the element. Numerical experiments are conducted on benchmark mixed-mode linear elastic fracture problems that demonstrate the sound accuracy and optimal convergence in energy of the proposed formulation.

Submitted 7 November, 2021; originally announced November 2021.

arXiv:2104.04096 [pdf, other]

The virtual element method for the coupled system of magneto-hydrodynamics

Authors: Sebastian Naranjo-Alvarez, Vrushali Bokil, Vitaliy Gyrya, Gianmarco Manzini

Abstract: In this work, we review the framework of the Virtual Element Method (VEM) for a model in magneto-hydrodynamics (MHD) that incorporates a coupling between electromagnetics and fluid flow, and allows us to construct novel discretizations for simulating realistic phenomena in MHD. First, we study two chains of spaces approximating the electromagnetic and fluid flow components of the model. Then, we show that this VEM approximation will yield divergence-free discrete magnetic fields, an important property in any simulation in MHD. We present a linearization strategy to solve the VEM approximation which respects the divergence-free condition on the magnetic field. This linearization will require that, at each non-linear iteration, a linear system be solved. We study these linear systems and show that they represent well-posed saddle point problems. We conclude by presenting numerical experiments exploring the performance of the VEM applied to the subsystem describing the electromagnetics. The first set of experiments provides evidence regarding the speed of convergence of the method as well as the divergence-free condition on the magnetic field. In the second set we present a model for magnetic reconnection in a mesh that includes a series of hanging nodes, which we use to calibrate the resolution of the method. The magnetic reconnection phenomenon happens near the center of the domain where the mesh resolution is finer and high resolution is achieved.

Submitted 8 April, 2021; originally announced April 2021.

Comments: 36 pages, 7 figures

arXiv:2011.05610 [pdf, ps, other]

PHONI: Streamed Matching Statistics with Multi-Genome References

Authors: Christina Boucher, Travis Gagie, Tomohiro I, Dominik Köppl, Ben Langmead, Giovanni Manzini, Gonzalo Navarro, Alejandro Pacheco, Massimiliano Rossi

Abstract: Computing the matching statistics of patterns with respect to a text is a fundamental task in bioinformatics, but a formidable one when the text is a highly compressed genomic database. Bannai et al. gave an efficient solution for this case, which Rossi et al. recently implemented, but it uses two passes over the patterns and buffers a pointer for each character during the first pass. In this paper, we simplify their solution and make it streaming, at the cost of slowing it down slightly. This means that, first, we can compute the matching statistics of several long patterns (such as whole human chromosomes) in parallel while still using a reasonable amount of RAM; second, we can compute matching statistics online with low latency and thus quickly recognize when a pattern becomes incompressible relative to the database.

Submitted 11 February, 2021; v1 submitted 11 November, 2020; originally announced November 2020.

Comments: Our code is available at https://github.com/koeppl/phoni

arXiv:2009.03675 [pdf, other]

Space efficient merging of de Bruijn graphs and Wheeler graphs

Authors: Lavinia Egidi, Felipe A. Louza, Giovanni Manzini

Abstract: The merging of succinct data structures is a well established technique for the space efficient construction of large succinct indexes. In the first part of the paper we propose a new algorithm for merging succinct representations of de Bruijn graphs. Our algorithm has the same asymptotic cost as the state of the art algorithm for the same problem but it uses less than half of its working space. A novel important feature of our algorithm, not found in any of the existing tools, is that it can compute the Variable Order succinct representation of the union graph within the same asymptotic time/space bounds. In the second part of the paper we consider the more general problem of merging succinct representations of Wheeler graphs, a recently introduced graph family which includes as special cases de Bruijn graphs and many other known succinct indexes based on the BWT or one of its variants. We show that merging Wheeler graphs is in general a much more difficult problem, and we provide a space efficient algorithm for the slightly simplified problem of determining whether the union graph has an ordering that satisfies the Wheeler conditions.

Submitted 12 July, 2021; v1 submitted 5 September, 2020; originally announced September 2020.

Comments: 24 pages, 10 figures. arXiv admin note: text overlap with arXiv:1902.02889

arXiv:2006.11687 [pdf, other]

PFP Data Structures

Authors: Christina Boucher, Ondřej Cvacho, Travis Gagie, Jan Holub, Giovanni Manzini, Gonzalo Navarro, Massimiliano Rossi

Abstract: Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string $S$, it produces a dictionary $D$ and a parse $P$ of overlapping phrases such that $\mathrm{BWT} (S)$ can be computed from $D$ and $P$ in time and workspace bounded in terms of their combined size $|\mathrm{PFP} (S)|$. In practice $D$ and $P$ are significantly smaller than $S$ and computing $\mathrm{BWT} (S)$ from them is more efficient than computing it from $S$ directly, at least when $S$ consists of genomes from individuals of the same species. In this paper, we consider $\mathrm{PFP} (S)$ as a {\em data structure} and show how it can be augmented to support the following queries quickly, still in $O (|\mathrm{PFP} (S)|)$ space: longest common extension (LCE), suffix array (SA), longest common prefix (LCP) and BWT. Lastly, we provide experimental evidence that the PFP data structure can be efficiently constructed for very large repetitive datasets: it takes one hour and 54 GB peak memory for $1000$ variants of human chromosome 19, initially occupying roughly 56 GB.

Submitted 20 June, 2020; originally announced June 2020.

arXiv:1910.07145 [pdf, other]

Practical Random Access to SLP-Compressed Texts

Authors: Travis Gagie, Tomohiro I, Giovanni Manzini, Gonzalo Navarro, Hiroshi Sakamoto, Louisa Seelbach Benkner, Yoshimasa Takabatake

Abstract: Grammar-based compression is a popular and powerful approach to compressing repetitive texts but until recently its relatively poor time-space trade-offs during real-life construction made it impractical for truly massive datasets such as genomic databases. In a recent paper (SPIRE 2019) we showed how simple pre-processing can dramatically improve those trade-offs, and in this paper we turn our attention to one of the features that make grammar-based compression so attractive: the possibility of supporting fast random access. This is an essential primitive in many algorithms that process grammar-compressed texts without decompressing them and so many theoretical bounds have been published about it, but experimentation has lagged behind. We give a new encoding of grammars that is about as small as the practical state of the art (Maruyama et al., SPIRE 2013) but with significantly faster queries.

Submitted 19 July, 2020; v1 submitted 15 October, 2019; originally announced October 2019.

Comments: Accepted to SPIRE 2020

arXiv:1908.01263 [pdf, ps, other]

Matching reads to many genomes with the $r$-index

Authors: Taher Mun, Alan Kuhnle, Christina Boucher, Travis Gagie, Ben Langmead, Giovanni Manzini

Abstract: The $r$-index is a tool for compressed indexing of genomic databases for exact pattern matching, which can be used to completely align reads that perfectly match some part of a genome in the database or to find seeds for reads that do not. This paper shows how to download and install the programs ri-buildfasta and ri-align; how to call ri-buildfasta on a FASTA file to build an $r$-index for that file; and how to query that index with ri-align. Availability: The source code for these programs is released under GPLv3 and available at https://github.com/alshai/r-index.

Submitted 3 August, 2019; originally announced August 2019.

arXiv:1907.02308 [pdf, ps, other]

The Alternating BWT: an algorithmic perspective

Authors: Raffaele Giancarlo, Giovanni Manzini, Antonio Restivo, Giovanna Rosone, Marinella Sciortino

Abstract: The Burrows-Wheeler Transform (BWT) is a word transformation introduced in 1994 for Data Compression. It has become a fundamental tool for designing self-indexing data structures, with important applications in several areas of science and engineering. The Alternating Burrows-Wheeler Transform (ABWT) is another transformation recently introduced in [Gessel et al. 2012] and studied in the field of Combinatorics on Words. It is analogous to the BWT, except that it uses an alternating lexicographical order instead of the usual one. Building on results in [Giancarlo et al. 2018], where we have shown that BWT and ABWT are part of a larger class of reversible transformations, here we provide a combinatorial and algorithmic study of the novel transform ABWT. We establish a deep analogy between BWT and ABWT by proving they are the only ones in the above mentioned class to be rank-invertible, a novel notion guaranteeing efficient invertibility. In addition, we show that the backward-search procedure can be efficiently generalized to the ABWT; this result implies that the ABWT can also be used as a basis for efficient compressed full text indices. Finally, we prove that the ABWT can be efficiently computed by using a combination of the Difference Cover suffix sorting algorithm [Kärkkäinen et al., 2006] with a linear time algorithm for finding the minimal cyclic rotation of a word with respect to the alternating lexicographical order.

Submitted 4 July, 2019; originally announced July 2019.

arXiv:1906.00809 [pdf, ps, other]

Rpair: Rescaling RePair with Rsync

Authors: Travis Gagie, Tomohiro I, Giovanni Manzini, Gonzalo Navarro, Hiroshi Sakamoto, Yoshimasa Takabatake

Abstract: Data compression is a powerful tool for managing massive but repetitive datasets, especially schemes such as grammar-based compression that support computation over the data without decompressing it. In the best case such a scheme takes a dataset so big that it must be stored on disk and shrinks it enough that it can be stored and processed in internal memory. Even then, however, the scheme is essentially useless unless it can be built on the original dataset reasonably quickly while keeping the dataset on disk. In this paper we show how we can preprocess such datasets with context-triggered piecewise hashing such that afterwards we can apply RePair and other grammar-based compressors more easily. We first give our algorithm, then show how a variant of it can be used to approximate the LZ77 parse, then leverage that to prove theoretical bounds on compression, and finally give experimental evidence that our approach is competitive in practice.

Submitted 3 June, 2019; originally announced June 2019.

arXiv:1905.12987 [pdf, other]

doi 10.1007/978-3-030-32686-9_10

Inducing the Lyndon Array

Authors: Felipe A. Louza, Sabrina Mantaci, Giovanni Manzini, Marinella Sciortino, Guilherme P. Telles

Abstract: In this paper we propose a variant of the induced suffix sorting algorithm by Nong (TOIS, 2013) that computes simultaneously the Lyndon array and the suffix array of a text in $O(n)$ time using $σ + O(1)$ words of working space, where $n$ is the length of the text and $σ$ is the alphabet size. Our result improves the previous best space requirement for linear time computation of the Lyndon array. In fact, all the known linear algorithms for Lyndon array computation use suffix sorting as a preprocessing step and use $O(n)$ words of working space in addition to the Lyndon array and suffix array. Experimental results with real and synthetic datasets show that our algorithm is not only space-efficient but also fast in practice.

Submitted 26 July, 2019; v1 submitted 30 May, 2019; originally announced May 2019.

Comments: Accepted to SPIRE'19

arXiv:1903.01465 [pdf, other]

Lightweight merging of compressed indices based on BWT variants

Authors: Lavinia Egidi, Giovanni Manzini

Abstract: In this paper we propose a flexible and lightweight technique for merging compressed indices based on variants of the Burrows-Wheeler transform (BWT), thus addressing the need for algorithms that compute compressed indices over large collections using a limited amount of working memory. Merge procedures make it possible to use an incremental strategy for building large indices based on merging indices for progressively larger subcollections. Starting with a known lightweight algorithm for merging BWTs [Holt and McMillan, Bioinformatics 2014], we show how to modify it in order to also merge, or compute from scratch, the Longest Common Prefix (LCP) array. We then extend our technique to merging compressed tries and circular/permuterm compressed indices, two compressed data structures for which there were hitherto no known merging algorithms.

Submitted 4 March, 2019; originally announced March 2019.

Comments: 23 pages. A preliminary version appeared in Proc. SPIRE 2017, Springer Verlag LNCS 10508. arXiv admin note: text overlap with arXiv:1609.04618

arXiv:1902.02889 [pdf, other]

doi 10.1007/978-3-030-32686-9_24

Space-efficient merging of succinct de Bruijn graphs

Authors: Lavinia Egidi, Felipe A. Louza, Giovanni Manzini

Abstract: We propose a new algorithm for merging succinct representations of de Bruijn graphs introduced in [Bowe et al. WABI 2012]. Our algorithm is based on the lightweight BWT merging approach by Holt and McMillan [Bioinformatics 2014, ACM-BCB 2014]. Our algorithm has the same asymptotic cost as the state of the art tool for the same problem presented by Muggli et al. [bioRxiv 2017, Bioinformatics 2019], but it uses less than half of its working space. A novel important feature of our algorithm, not found in any of the existing tools, is that it can compute the Variable Order succinct representation of the union graph within the same asymptotic time/space bounds.

Submitted 26 July, 2019; v1 submitted 7 February, 2019; originally announced February 2019.

Comments: Accepted to SPIRE'19

arXiv:1902.01280 [pdf, other]

A New Class of Searchable and Provably Highly Compressible String Transformations

Authors: Raffaele Giancarlo, Giovanni Manzini, Giovanna Rosone, Marinella Sciortino

Abstract: The Burrows-Wheeler Transform is a string transformation that plays a fundamental role in the design of self-indexing compressed data structures. Over the years, researchers have successfully extended this transformation outside the domain of strings. However, efforts to find non-trivial alternatives to the original, now 25 years old, Burrows-Wheeler string transformation have met limited success. In this paper we bring new life to this area by introducing a whole new family of transformations that have all the myriad virtues of the BWT: they can be computed and inverted in linear time, they produce provably highly compressible strings, and they support linear time pattern search directly on the transformed string. This new family is a special case of a more general class of transformations based on context adaptive alphabet orderings, a concept introduced here. This more general class also includes the Alternating BWT, another invertible string transform recently introduced in connection with a generalization of Lyndon words.

Submitted 4 February, 2019; originally announced February 2019.

arXiv:1811.06933 [pdf, other]

Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

Authors: Alan Kuhnle, Taher Mun, Christina Boucher, Travis Gagie, Ben Langmead, Giovanni Manzini

Abstract: While short read aligners, which predominantly use the FM-index, are able to easily index one or a few human genomes, they do not scale well to indexing databases containing thousands of genomes. To understand why, it helps to examine the main components of the FM-index in more detail: first, a rank data structure over the Burrows-Wheeler Transform (BWT) of the string that will allow us to find the interval in the string's suffix array (SA) containing pointers to starting positions of occurrences of a given pattern; second, a sample of the SA that, when used with the rank data structure, allows us to access the SA. The rank data structure can be kept small even for large genomic databases, by run-length compressing the BWT, but until recently there was no means known to keep the SA sample small without greatly slowing down access to the SA. Now that Gagie et al. (SODA 2018) have defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018 we showed how to build the BWT of large genomic databases efficiently (WABI 2018) but the problem of building Gagie et al.'s SA sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method for indexing partial and whole human genomes, and show that it improves over Bowtie with respect to both memory and time.

Submitted 16 November, 2018; originally announced November 2018.

arXiv:1809.07320 [pdf, other]

Compressing and Indexing Aligned Readsets

Authors: Travis Gagie, Garance Gourdel, Giovanni Manzini

Abstract: In this paper we show how to use one or more assembled or partially assembled genomes as the basis for a compressed full-text index of its readset. Specifically, we build a labelled tree by taking the assembled genome as a trunk and grafting onto it the reads that align to it, at the starting positions of their alignments. Next, we compute the eXtended Burrows-Wheeler Transform (XBWT) of the resulting labelled tree and build a compressed full-text index on that. Although this index can occasionally return false positives, it is usually much more compact than the alternatives. Following the established practice for datasets with many repetitions, we compare different full-text indices by looking at the number of runs in the transformed strings. For a human Chr19 readset our preliminary experiments show that eliminating separator characters from the EBWT reduces the number of runs by 19\%, from 220 million to 178 million, and using the XBWT reduces it by a further 15\%, to 150 million.

Submitted 1 June, 2021; v1 submitted 19 September, 2018; originally announced September 2018.
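
The abstract above compares indices by the number of runs in the transformed string. As a rough illustration of that measure only, the following Python sketch computes a plain single-string BWT by sorting rotations and counts its runs. It is not the EBWT or XBWT construction from the paper; the function names and the `$` sentinel (assumed absent from the text and smaller than every text character) are assumptions made for this example.

```python
def bwt(text, sentinel="$"):
    """Naive BWT: sort all rotations of text + sentinel, take the last column.

    Assumes the sentinel does not occur in text and sorts before all of
    its characters. Quadratic space, for illustration only.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s):
    """Number of maximal blocks of equal consecutive characters in s."""
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed, count_runs(transformed))
```

A smaller run count means run-length encoding of the transformed string takes less space, which is why it serves as a proxy for index size on repetitive data.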

arXiv:1805.06821 [pdf, other]

doi 10.4230/LIPIcs.WABI.2018.10

External memory BWT and LCP computation for sequence collections with applications

Authors: Lavinia Egidi, Felipe A. Louza, Giovanni Manzini, Guilherme P. Telles

Abstract: We propose an external memory algorithm for the computation of the BWT and LCP array for a collection of sequences. Our algorithm takes the amount of available memory as an input parameter, and tries to make the best use of it by splitting the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external memory and in the process it also computes the LCP values. We prove that our algorithm performs O(n AveLcp) sequential I/Os, where n is the total length of the collection, and AveLcp is the average Longest Common Prefix of the collection. This bound is an improvement over the known algorithms for the same task. The experimental results show that our algorithm outperforms the current best algorithm for collections of sequences with different lengths and for collections with relatively small average Longest Common Prefix. In the second part of the paper, we show that our algorithm can be modified to output two additional arrays that, used with the BWT and LCP arrays, provide simple, scan based, external memory algorithms for three well known problems in bioinformatics: the computation of maximal repeats, the all pairs suffix-prefix overlaps, and the construction of succinct de Bruijn graphs. To our knowledge, there are no other known external memory algorithms for these problems.

Submitted 17 May, 2018; originally announced May 2018.

arXiv:1803.11245 [pdf, other]

Prefix-Free Parsing for Building Big BWTs

Authors: Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, Taher Mun

Abstract: High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly repetitive, a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as {\em prefix-free parsing}, that takes a text $T$ as input, and in one pass generates a dictionary $D$ and a parse $P$ of $T$ with the property that the BWT of $T$ can be constructed from $D$ and $P$ using workspace proportional to their total size and $O (|T|)$ time. Our experiments show that $D$ and $P$ are significantly smaller than $T$ in practice, and thus can fit in a reasonable internal memory even when $T$ is very large. In particular, we show that with prefix-free parsing we can build a 131-megabyte run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 hours using 21 gigabytes of memory, suggesting that we can build a 6.73 gigabyte index for 1000 complete human-genome haplotypes in approximately 102 hours using about 1 terabyte of memory.

Submitted 16 November, 2018; v1 submitted 29 March, 2018; originally announced March 2018.

Comments: Preliminary version appeared at WABI '18; full version submitted to a journal

arXiv:1710.10105 [pdf, other]

doi 10.1016/j.jda.2018.08.001

Lyndon Array Construction during Burrows-Wheeler Inversion

Authors: Felipe A. Louza, W. F. Smyth, Giovanni Manzini, Guilherme P. Telles

Abstract: In this paper we present an algorithm to compute the Lyndon array of a string $T$ of length $n$ as a byproduct of the inversion of the Burrows-Wheeler transform of $T$. Our algorithm runs in linear time using only a stack in addition to the data structures used for Burrows-Wheeler inversion. We compare our algorithm with two other linear-time algorithms for Lyndon array construction and show that computing the Burrows-Wheeler transform and then constructing the Lyndon array is competitive compared to the known approaches. We also propose a new balanced parenthesis representation for the Lyndon array that uses $2n+o(n)$ bits of space and supports constant time access. This representation can be built in linear time using $O(n)$ words of space, or in $O(n\log n/\log\log n)$ time using asymptotically the same space as $T$.

Submitted 27 October, 2017; originally announced October 2017.

Journal ref: Journal of Discrete Algorithms, 50 (2018), 2-9
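
For readers unfamiliar with the object being computed: entry $i$ of the Lyndon array is the length of the longest Lyndon word starting at position $i$, which equals the distance from $i$ to the next lexicographically smaller suffix (or to the end of the string if there is none). The Python sketch below uses that characterization directly. It is a naive quadratic-time illustration, not the linear-time BWT-inversion algorithm of the paper, and the function name is an assumption for this example.

```python
def lyndon_array(text):
    """Naive Lyndon array via the next-smaller-suffix characterization.

    lam[i] = j - i, where j is the first position after i whose suffix is
    lexicographically smaller than the suffix at i (j = len(text) if none).
    Worst-case quadratic or worse; for illustration only.
    """
    n = len(text)
    lam = []
    for i in range(n):
        j = i + 1
        # Suffixes of one string are pairwise distinct, so ">" suffices.
        while j < n and text[j:] > text[i:]:
            j += 1
        lam.append(j - i)
    return lam


if __name__ == "__main__":
    print(lyndon_array("banana"))
```

For "banana" this yields lengths corresponding to the Lyndon factors "b", "an", "n", "an", "n", "a" at positions 0 through 5.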

arXiv:1610.02865 [pdf, other]

An Encoding for Order-Preserving Matching

Authors: Travis Gagie, Giovanni Manzini, Rossano Venturini

Abstract: Encoding data structures store enough information to answer the queries they are meant to support but not enough to recover their underlying datasets. In this paper we give the first encoding data structure for the challenging problem of order-preserving pattern matching. This problem was introduced only a few years ago but has already attracted significant attention because of its applications in data analysis. Two strings are said to be an order-preserving match if the {\em relative order} of their characters is the same: e.g., $4, 1, 3, 2$ and $10, 3, 7, 5$ are an order-preserving match. We show how, given a string $S [1..n]$ over an arbitrary alphabet and a constant $c \geq 1$, we can build an $O (n \log \log n)$-bit encoding such that later, given a pattern $P [1..m]$ with $m \leq \lg^c n$, we can return the number of order-preserving occurrences of $P$ in $S$ in $O (m)$ time. Within the same time bound we can also return the starting position of some order-preserving match for $P$ in $S$ (if such a match exists). We prove that our space bound is within a constant factor of optimal; our query time is optimal if $\log σ = Ω(\log n)$. Our space bound contrasts with the $Ω(n \log n)$ bits needed in the worst case to store $S$ itself, an index for order-preserving pattern matching with no restrictions on the pattern length, or an index for standard pattern matching even with restrictions on the pattern length. Moreover, we can build our encoding knowing only how each character compares to $O (\lg^c n)$ neighbouring characters.

Submitted 17 February, 2017; v1 submitted 10 October, 2016; originally announced October 2016.
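
The definition of an order-preserving match in the abstract above can be illustrated with a minimal sketch. It is not the encoding from the paper: it is a naive check that compares rank patterns, and the function names (`rank_pattern`, `is_op_match`, `count_op_occurrences`) are hypothetical. It also assumes the elements within each sequence are distinct.

```python
def rank_pattern(seq):
    """Return, for each position, the rank of its element within seq.

    Assumption: elements of seq are pairwise distinct.
    """
    order = sorted(range(len(seq)), key=lambda i: seq[i])
    ranks = [0] * len(seq)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


def is_op_match(a, b):
    """True if a and b have the same length and the same relative order."""
    return len(a) == len(b) and rank_pattern(a) == rank_pattern(b)


def count_op_occurrences(pattern, text):
    """Naively count windows of text that order-preserving match pattern."""
    m = len(pattern)
    return sum(
        is_op_match(pattern, text[i:i + m])
        for i in range(len(text) - m + 1)
    )
```

With these definitions, `is_op_match([4, 1, 3, 2], [10, 3, 7, 5])` holds, matching the example in the abstract. The naive count re-sorts every window of the text, so it is far from the $O (m)$ query time stated above.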

arXiv:1609.04618 [pdf, ps, other]

From H&M to Gap for Lightweight BWT Merging

Authors: Giovanni Manzini

Abstract: Recently, Holt and McMillan [Bioinformatics 2014, ACM-BCB 2014] have proposed a simple and elegant algorithm to merge the Burrows-Wheeler transforms of a family of strings. In this paper we show that the H&M algorithm can be improved so that, in addition to merging the BWTs, it can also merge the Longest Common Prefix (LCP) arrays. The new algorithm, called Gap because of how it operates, has the same asymptotic cost as the H&M algorithm and requires additional space only for storing the LCP values.

Submitted 15 September, 2016; originally announced September 2016.

Comments: 11 pages

ACM Class: F.2.2; E.1

arXiv:1606.05724 [pdf, ps, other]

A Compact Index for Order-Preserving Pattern Matching

Authors: Gianni Decaroli, Travis Gagie, Giovanni Manzini

Abstract: Order-preserving pattern matching was introduced recently but it has already attracted much attention. Given a reference sequence and a pattern, we want to locate all substrings of the reference sequence whose elements have the same relative order as the pattern elements. For this problem we consider the offline version in which we build an index for the reference sequence so that subsequent searches can be completed very efficiently. We propose a space-efficient index that works well in practice despite its lack of good worst-case time bounds. Our solution is based on the new approach of decomposing the indexed sequence into an order component, containing ordering information, and a delta component, containing information on the absolute values. Experiments show that this approach is viable, faster than the available alternatives, and it is the first one offering simultaneously small space usage and fast retrieval.

Submitted 10 December, 2018; v1 submitted 18 June, 2016; originally announced June 2016.

Comments: 16 pages. A preliminary version appeared in the Proc. IEEE Data Compression Conference, DCC 2017, Snowbird, UT, USA, 2017

ACM Class: F.2.2; E.4; H.3.3

arXiv:1605.06615 [pdf, other]

Efficient and Compact Representations of Some Non-Canonical Prefix-Free Codes

Authors: Antonio Fariña, Travis Gagie, Szymon Grabowski, Giovanni Manzini, Gonzalo Navarro, Alberto Ordóñez

Abstract: For many kinds of prefix-free codes there are efficient and compact alternatives to the traditional tree-based representation. Since these put the codes into canonical form, however, they can only be used when we can choose the order in which codewords are assigned to symbols. In this paper we first show how, given a probability distribution over an alphabet of $σ$ symbols, we can store an optimal alphabetic prefix-free code in $O(σ \log L)$ bits such that we can encode and decode any codeword of length $\ell$ in $O(\min(\ell, \log L))$ time, where $L$ is the maximum codeword length. With $O(2^{L^ε})$ further bits, for any constant $ε>0$, we can encode and decode in $O(\log \ell)$ time. We then show how to store a nearly optimal alphabetic prefix-free code in $o (σ)$ bits such that we can encode and decode in constant time. We also consider a kind of optimal prefix-free code introduced recently where the codewords' lengths are non-decreasing if arranged in lexicographic order of their reverses. We reduce their storage space to $O(σ \log L)$ while maintaining encoding and decoding times in $O(\ell)$. We also show how, with $O(2^{εL})$ further bits, we can encode and decode in constant time. All of our results hold in the word-RAM model.

Submitted 1 April, 2021; v1 submitted 21 May, 2016; originally announced May 2016.

Comments: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941. A preliminary version was presented at the 23rd International Symposium on String Processing and Information Retrieval (SPIRE '16)

arXiv:1506.03262 [pdf, other]

Relative Select

Authors: Christina Boucher, Alexander Bowe, Travis Gagie, Giovanni Manzini, Jouni Sirén

Abstract: Motivated by the problem of storing coloured de Bruijn graphs, we show how, if we can already support fast select queries on one string, then we can store a little extra information and support fairly fast select queries on a similar string.

Submitted 10 June, 2015; originally announced June 2015.

arXiv:1404.4814 [pdf, ps, other]

Reusing an FM-index

Authors: Djamal Belazzougui, Travis Gagie, Simon Gog, Giovanni Manzini, Jouni Sirén

Abstract: Intuitively, if two strings $S_1$ and $S_2$ are sufficiently similar and we already have an FM-index for $S_1$ then, by storing a little extra information, we should be able to reuse parts of that index in an FM-index for $S_2$. We formalize this intuition and show that it can lead to significant space savings in practice, as well as to some interesting theoretical problems.

Submitted 9 May, 2014; v1 submitted 18 April, 2014; originally announced April 2014.

arXiv:1312.3422 [pdf, ps, other]

Compressed Spaced Suffix Arrays

Authors: Travis Gagie, Giovanni Manzini, Daniel Valenzuela

Abstract: Spaced seeds are important tools for similarity search in bioinformatics, and using several seeds together often significantly improves their performance. With existing approaches, however, for each seed we keep a separate linear-size data structure, either a hash table or a spaced suffix array (SSA). In this paper we show how to compress SSAs relative to normal suffix arrays (SAs) and still support fast random access to them. We first prove a theoretical upper bound on the space needed to store an SSA when we already have the SA. We then present experiments indicating that our approach works even better in practice.

Submitted 9 March, 2014; v1 submitted 12 December, 2013; originally announced December 2013.

arXiv:0909.4341 [pdf, ps, other]

Lightweight Data Indexing and Compression in External Memory

Authors: Paolo Ferragina, Travis Gagie, Giovanni Manzini

Abstract: In this paper we describe algorithms for computing the BWT and for building (compressed) indexes in external memory. The innovative feature of our algorithms is that they are lightweight in the sense that, for an input of size $n$, they use only $n$ bits of disk working space while all previous approaches use $Θ(n \log n)$ bits of disk working space. Moreover, our algorithms access disk data only via sequential scans, thus they take full advantage of modern disk features that make sequential disk accesses much faster than random accesses. We also present a scan-based algorithm for inverting the BWT that uses $Θ(n)$ bits of working space, and a lightweight {\em internal-memory} algorithm for computing the BWT which is the fastest in the literature when the available working space is $o(n)$ bits. Finally, we prove {\em lower} bounds on the complexity of computing and inverting the BWT via sequential scans in terms of the classic product: internal-memory space $\times$ number of passes over the disk data.

Submitted 24 September, 2009; originally announced September 2009.

Showing 1–41 of 41 results for author: Manzini, G