Showing 1–41 of 41 results for author: Manzini, G

Searching in archive cs.
  1. arXiv:2506.14734  [pdf, ps, other]

    cs.DS

    Compressing Suffix Trees by Path Decompositions

    Authors: Ruben Becker, Davide Cenzato, Travis Gagie, Sung-Hwan Kim, Ragnar Groot Koerkamp, Giovanni Manzini, Nicola Prezza

    Abstract: In classic suffix trees, path compression works by replacing unary suffix trie paths with pairs of pointers to $T$, which must be available in the form of some random access oracle at query time. In this paper, we revisit path compression and show that a more careful choice of pointers leads to a new elegant, simple, and remarkably efficient way to compress the suffix tree. We begin by observing t…

    Submitted 17 June, 2025; originally announced June 2025.

    Comments: preliminary incomplete draft. Many details missing!

  2. arXiv:2506.03294  [pdf, ps, other]

    cs.DS

    Prefix-free parsing for merging big BWTs

    Authors: Diego Diaz-Dominguez, Travis Gagie, Veronica Guerrini, Ben Langmead, Zsuzsanna Liptak, Giovanni Manzini, Francesco Masillo, Vikram Shivakumar

    Abstract: When building Burrows-Wheeler Transforms (BWTs) of truly huge datasets, prefix-free parsing (PFP) can use an unreasonable amount of memory. In this paper we show how if a dataset can be broken down into small datasets that are not very similar to each other -- such as collections of many copies of genomes of each of several species, or collections of many copies of each of the human chromosomes --…

    Submitted 6 June, 2025; v1 submitted 3 June, 2025; originally announced June 2025.

  3. arXiv:2505.11302  [pdf, ps, other]

    cs.DS

    Depth first representations of $k^2$-trees

    Authors: Gabriel Carmona, Giovanni Manzini

    Abstract: The $k^2$-tree is a compact data structure designed to efficiently store sparse binary matrices by leveraging both sparsity and clustering of nonzero elements. This representation efficiently supports navigational operations and complex binary operations, such as matrix-matrix multiplication, while maintaining space efficiency. The standard $k^2$-tree follows a level-by-level representation, which…

    Submitted 16 May, 2025; originally announced May 2025.

    Comments: extended submission for SPIRE 2025

  4. arXiv:2505.10680  [pdf, ps, other]

    cs.DS

    Generalization of Repetitiveness Measures for Two-Dimensional Strings

    Authors: Lorenzo Carfagna, Giovanni Manzini, Giuseppe Romana, Marinella Sciortino, Cristian Urbina

    Abstract: The problem of detecting and measuring the repetitiveness of one-dimensional strings has been extensively studied in data compression and text indexing. Our understanding of these issues has been significantly improved by the introduction of the notion of string attractor [Kempa and Prezza, STOC 2018] and by the results showing the relationship between attractors and other measures of compressibil…

    Submitted 15 May, 2025; originally announced May 2025.

    Comments: 37 pages

  5. arXiv:2409.18620  [pdf, other]

    cs.DS cs.PF

    Toward Greener Matrix Operations by Lossless Compressed Formats

    Authors: Francesco Tosoni, Philip Bille, Valerio Brunacci, Alessio De Angelis, Paolo Ferragina, Giovanni Manzini

    Abstract: Sparse matrix-vector multiplication (SpMV) is a fundamental operation in machine learning, scientific computing, and graph algorithms. In this paper, we investigate the space, time, and energy efficiency of SpMV using various compressed formats for large sparse matrices, focusing specifically on Boolean matrices and real-valued vectors. Through extensive analysis and experiments conducted on ser…

    Submitted 27 September, 2024; originally announced September 2024.

    Comments: 19 pages, 10 figures, 2 tables

  6. arXiv:2408.04537  [pdf, other]

    cs.DS

    Faster run-length compressed suffix arrays

    Authors: Nathaniel K. Brown, Travis Gagie, Giovanni Manzini, Gonzalo Navarro, Marinella Sciortino

    Abstract: We first review how we can store a run-length compressed suffix array (RLCSA) for a text $T$ of length $n$ over an alphabet of size $σ$ whose Burrows-Wheeler Transform (BWT) consists of $r$ runs in $O(r \log (n / r) + r \log σ + σ)$ bits such that later, given character $a$ and the suffix array interval for $P$, we can find the suffix-array (SA) interval for $a P$ in…

    Submitted 19 April, 2025; v1 submitted 8 August, 2024; originally announced August 2024.

  7. arXiv:2407.18753  [pdf, other]

    cs.DS

    Suffixient Arrays: a New Efficient Suffix Array Compression Technique

    Authors: Davide Cenzato, Lore Depuydt, Travis Gagie, Sung-Hwan Kim, Giovanni Manzini, Francisco Olivares, Nicola Prezza

    Abstract: The Suffix Array is a classic text index enabling on-line pattern matching queries via simple binary search. The main drawback of the Suffix Array is that it takes linear space in the text's length, even if the text itself is extremely compressible. Several works in the literature showed that the Suffix Array can be compressed, but they all rely on complex succinct data structures which in practic…

    Submitted 18 March, 2025; v1 submitted 26 July, 2024; originally announced July 2024.

    Comments: 40 pages, 7 figures, 1 table and 7 pseudocodes

  8. arXiv:2404.14235  [pdf, other]

    cs.DS

    Computing the LCP Array of a Labeled Graph

    Authors: Jarno Alanko, Davide Cenzato, Nicola Cotumaccio, Sung-Hwan Kim, Giovanni Manzini, Nicola Prezza

    Abstract: The LCP array is an important tool in stringology, making it possible to speed up pattern matching algorithms and enabling compact representations of the suffix tree. Recently, Conte et al. [DCC 2023] and Cotumaccio et al. [SPIRE 2023] extended the definition of this array to Wheeler DFAs and, ultimately, to arbitrary labeled graphs, proving that it can be used to efficiently solve matching statistics queri…

    Submitted 22 April, 2024; originally announced April 2024.

  9. arXiv:2402.06935  [pdf, other]

    cs.DS q-bio.GN q-bio.PE

    Taxonomic classification with maximal exact matches in KATKA kernels and minimizer digests

    Authors: Dominika Draesslerová, Omar Ahmed, Travis Gagie, Jan Holub, Ben Langmead, Giovanni Manzini, Gonzalo Navarro

    Abstract: For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use $k$-mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can…

    Submitted 4 April, 2024; v1 submitted 10 February, 2024; originally announced February 2024.

  10. arXiv:2312.01359  [pdf, other]

    cs.DS

    Suffixient Sets

    Authors: Lore Depuydt, Travis Gagie, Ben Langmead, Giovanni Manzini, Nicola Prezza

    Abstract: We define a suffixient set for a text $T [1..n]$ to be a set $S$ of positions between 1 and $n$ such that, for any edge descending from a node $u$ to a node $v$ in the suffix tree of $T$, there is an element $s \in S$ such that $u$'s path label is a suffix of $T [1..s - 1]$ and $T [s]$ is the first character of $(u, v)$'s edge label. We first show there is a suffixient set of cardinality at most…

    Submitted 4 June, 2024; v1 submitted 3 December, 2023; originally announced December 2023.

  11. arXiv:2307.02629  [pdf, other]

    cs.DS math.CO

    The landscape of compressibility measures for two-dimensional data

    Authors: Lorenzo Carfagna, Giovanni Manzini

    Abstract: In this paper we extend to two-dimensional data two recently introduced one-dimensional compressibility measures: the $γ$ measure defined in terms of the smallest string attractor, and the $δ$ measure defined in terms of the number of distinct substrings of the input string. Concretely, we introduce the two-dimensional measures $γ_{2D}$ and $δ_{2D}$, as natural generalizations of $γ$ and $δ$, and…

    Submitted 20 May, 2024; v1 submitted 5 July, 2023; originally announced July 2023.

  12. arXiv:2301.05338  [pdf, ps, other]

    cs.DS

    Computing matching statistics on Wheeler DFAs

    Authors: Alessio Conte, Nicola Cotumaccio, Travis Gagie, Giovanni Manzini, Nicola Prezza, Marinella Sciortino

    Abstract: Matching statistics were introduced to solve the approximate string matching problem, which is a recurrent subroutine in bioinformatics applications. In 2010, Ohlebusch et al. [SPIRE 2010] proposed a time and space efficient algorithm for computing matching statistics which relies on some components of a compressed suffix tree - notably, the longest common prefix (LCP) array. In this paper, we sho…

    Submitted 12 January, 2023; originally announced January 2023.

  13. arXiv:2208.09840  [pdf, ps, other]

    cs.DS

    Teaching the Burrows-Wheeler Transform via the Positional Burrows-Wheeler Transform

    Authors: Travis Gagie, Giovanni Manzini, Marinella Sciortino

    Abstract: The Burrows-Wheeler Transform (BWT) is often taught in undergraduate courses on algorithmic bioinformatics, because it underlies the FM-index and thus important tools such as Bowtie and BWA. Its admirers consider the BWT a thing of beauty but, despite thousands of pages being written about it over nearly thirty years, to undergraduates seeing it for the first time it still often seems like magic.…

    Submitted 21 August, 2022; originally announced August 2022.

  14. arXiv:2205.05643  [pdf, other]

    cs.DS

    A New Class of String Transformations for Compressed Text Indexing

    Authors: Raffaele Giancarlo, Giovanni Manzini, Antonio Restivo, Giovanna Rosone, Marinella Sciortino

    Abstract: Introduced about thirty years ago in the field of Data Compression, the Burrows-Wheeler Transform (BWT) is a string transformation that, besides being a booster of the performance of memoryless compressors, plays a fundamental role in the design of efficient self-indexing compressed data structures. Finding other string transformations with the same remarkable properties as the BWT has been a challeng…

    Submitted 8 May, 2023; v1 submitted 11 May, 2022; originally announced May 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:1902.01280

  15. arXiv:2203.14540  [pdf, other]

    cs.DS

    Improving Matrix-vector Multiplication via Lossless Grammar-Compressed Matrices

    Authors: Paolo Ferragina, Travis Gagie, Dominik Köppl, Giovanni Manzini, Gonzalo Navarro, Manuel Striani, Francesco Tosoni

    Abstract: As Machine Learning (ML) techniques now generate huge data collections, the problem of how to efficiently engineer their storage and operations is becoming paramount. In this article we propose a new lossless compression scheme for real-valued matrices which achieves efficient performance in terms of compression ratio and time for linear-algebra operations. Experiments sho…

    Submitted 30 March, 2022; v1 submitted 28 March, 2022; originally announced March 2022.

  16. Extended virtual element method for two-dimensional linear elastic fracture

    Authors: Elena Benvenuti, Andrea Chiozzi, Gianmarco Manzini, N. Sukumar

    Abstract: In this paper, we propose an eXtended Virtual Element Method (X-VEM) for two-dimensional linear elastic fracture. This approach, which is an extension of the standard Virtual Element Method (VEM), facilitates mesh-independent modeling of crack discontinuities and elastic crack-tip singularities on general polygonal meshes. For elastic fracture in the X-VEM, the standard virtual element space is au…

    Submitted 7 November, 2021; originally announced November 2021.

  17. arXiv:2104.04096  [pdf, other]

    math.NA cs.CE

    The virtual element method for the coupled system of magneto-hydrodynamics

    Authors: Sebastian Naranjo-Alvarez, Vrushali Bokil, Vitaliy Gyrya, Gianmarco Manzini

    Abstract: In this work, we review the framework of the Virtual Element Method (VEM) for a model in magneto-hydrodynamics (MHD) that incorporates a coupling between electromagnetics and fluid flow, and allows us to construct novel discretizations for simulating realistic phenomena in MHD. First, we study two chains of spaces approximating the electromagnetic and fluid flow components of the model. Then, we…

    Submitted 8 April, 2021; originally announced April 2021.

    Comments: 36 pages, 7 figures

  18. arXiv:2011.05610  [pdf, ps, other]

    cs.DS

    PHONI: Streamed Matching Statistics with Multi-Genome References

    Authors: Christina Boucher, Travis Gagie, Tomohiro I, Dominik Köppl, Ben Langmead, Giovanni Manzini, Gonzalo Navarro, Alejandro Pacheco, Massimiliano Rossi

    Abstract: Computing the matching statistics of patterns with respect to a text is a fundamental task in bioinformatics, but a formidable one when the text is a highly compressed genomic database. Bannai et al. gave an efficient solution for this case, which Rossi et al. recently implemented, but it uses two passes over the patterns and buffers a pointer for each character during the first pass. In this pape…

    Submitted 11 February, 2021; v1 submitted 11 November, 2020; originally announced November 2020.

    Comments: Our code is available at https://github.com/koeppl/phoni

  19. arXiv:2009.03675  [pdf, other]

    cs.DS

    Space efficient merging of de Bruijn graphs and Wheeler graphs

    Authors: Lavinia Egidi, Felipe A. Louza, Giovanni Manzini

    Abstract: The merging of succinct data structures is a well established technique for the space efficient construction of large succinct indexes. In the first part of the paper we propose a new algorithm for merging succinct representations of de Bruijn graphs. Our algorithm has the same asymptotic cost as the state-of-the-art algorithm for the same problem but uses less than half of its working space. A…

    Submitted 12 July, 2021; v1 submitted 5 September, 2020; originally announced September 2020.

    Comments: 24 pages, 10 figures. arXiv admin note: text overlap with arXiv:1902.02889

  20. arXiv:2006.11687  [pdf, other]

    cs.DS

    PFP Data Structures

    Authors: Christina Boucher, Ondřej Cvacho, Travis Gagie, Jan Holub, Giovanni Manzini, Gonzalo Navarro, Massimiliano Rossi

    Abstract: Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string $S$, it produces a dictionary $D$ and a parse $P$ of overlapping phrases such that $\mathrm{BWT} (S)$ can be computed from $D$ and $P$ in time and workspace bounded in terms of their combined size…

    Submitted 20 June, 2020; originally announced June 2020.

  21. arXiv:1910.07145  [pdf, other]

    cs.DS

    Practical Random Access to SLP-Compressed Texts

    Authors: Travis Gagie, Tomohiro I, Giovanni Manzini, Gonzalo Navarro, Hiroshi Sakamoto, Louisa Seelbach Benkner, Yoshimasa Takabatake

    Abstract: Grammar-based compression is a popular and powerful approach to compressing repetitive texts but until recently its relatively poor time-space trade-offs during real-life construction made it impractical for truly massive datasets such as genomic databases. In a recent paper (SPIRE 2019) we showed how simple pre-processing can dramatically improve those trade-offs, and in this paper we turn our at…

    Submitted 19 July, 2020; v1 submitted 15 October, 2019; originally announced October 2019.

    Comments: Accepted to SPIRE 2020

  22. arXiv:1908.01263  [pdf, ps, other]

    cs.DS q-bio.GN

    Matching reads to many genomes with the $r$-index

    Authors: Taher Mun, Alan Kuhnle, Christina Boucher, Travis Gagie, Ben Langmead, Giovanni Manzini

    Abstract: The $r$-index is a tool for compressed indexing of genomic databases for exact pattern matching, which can be used to completely align reads that perfectly match some part of a genome in the database or to find seeds for reads that do not. This paper shows how to download and install the programs ri-buildfasta and ri-align; how to call ri-buildfasta on a FASTA file to build an $r$-index for that f…

    Submitted 3 August, 2019; originally announced August 2019.

  23. arXiv:1907.02308  [pdf, ps, other]

    cs.DS

    The Alternating BWT: an algorithmic perspective

    Authors: Raffaele Giancarlo, Giovanni Manzini, Antonio Restivo, Giovanna Rosone, Marinella Sciortino

    Abstract: The Burrows-Wheeler Transform (BWT) is a word transformation introduced in 1994 for Data Compression. It has become a fundamental tool for designing self-indexing data structures, with important applications in several areas of science and engineering. The Alternating Burrows-Wheeler Transform (ABWT) is another transformation recently introduced in [Gessel et al. 2012] and studied in the field of C…

    Submitted 4 July, 2019; originally announced July 2019.

  24. arXiv:1906.00809  [pdf, ps, other]

    cs.DS

    Rpair: Rescaling RePair with Rsync

    Authors: Travis Gagie, Tomohiro I, Giovanni Manzini, Gonzalo Navarro, Hiroshi Sakamoto, Yoshimasa Takabatake

    Abstract: Data compression is a powerful tool for managing massive but repetitive datasets, especially schemes such as grammar-based compression that support computation over the data without decompressing it. In the best case such a scheme takes a dataset so big that it must be stored on disk and shrinks it enough that it can be stored and processed in internal memory. Even then, however, the scheme is ess…

    Submitted 3 June, 2019; originally announced June 2019.

  25. Inducing the Lyndon Array

    Authors: Felipe A. Louza, Sabrina Mantaci, Giovanni Manzini, Marinella Sciortino, Guilherme P. Telles

    Abstract: In this paper we propose a variant of the induced suffix sorting algorithm by Nong (TOIS, 2013) that computes simultaneously the Lyndon array and the suffix array of a text in $O(n)$ time using $σ + O(1)$ words of working space, where $n$ is the length of the text and $σ$ is the alphabet size. Our result improves the previous best space requirement for linear time computation of the Lyndon array. I…

    Submitted 26 July, 2019; v1 submitted 30 May, 2019; originally announced May 2019.

    Comments: Accepted to SPIRE'19

  26. arXiv:1903.01465  [pdf, other]

    cs.DS

    Lightweight merging of compressed indices based on BWT variants

    Authors: Lavinia Egidi, Giovanni Manzini

    Abstract: In this paper we propose a flexible and lightweight technique for merging compressed indices based on variants of the Burrows-Wheeler transform (BWT), thus addressing the need for algorithms that compute compressed indices over large collections using a limited amount of working memory. Merge procedures make it possible to use an incremental strategy for building large indices based on merging indices…

    Submitted 4 March, 2019; originally announced March 2019.

    Comments: 23 pages. A preliminary version appeared in Proc. SPIRE 2017, Springer Verlag LNCS 10508. arXiv admin note: text overlap with arXiv:1609.04618

  27. Space-efficient merging of succinct de Bruijn graphs

    Authors: Lavinia Egidi, Felipe A. Louza, Giovanni Manzini

    Abstract: We propose a new algorithm for merging succinct representations of de Bruijn graphs introduced in [Bowe et al. WABI 2012]. Our algorithm is based on the lightweight BWT merging approach by Holt and McMillan [Bioinformatics 2014, ACM-BCB 2014]. Our algorithm has the same asymptotic cost as the state-of-the-art tool for the same problem presented by Muggli et al. [bioRxiv 2017, Bioinformatics 2019],…

    Submitted 26 July, 2019; v1 submitted 7 February, 2019; originally announced February 2019.

    Comments: Accepted to SPIRE'19

  28. arXiv:1902.01280  [pdf, other]

    cs.DS

    A New Class of Searchable and Provably Highly Compressible String Transformations

    Authors: Raffaele Giancarlo, Giovanni Manzini, Giovanna Rosone, Marinella Sciortino

    Abstract: The Burrows-Wheeler Transform is a string transformation that plays a fundamental role for the design of self-indexing compressed data structures. Over the years, researchers have successfully extended this transformation outside the domains of strings. However, efforts to find non-trivial alternatives to the original, now 25 years old, Burrows-Wheeler string transformation have met limited succes…

    Submitted 4 February, 2019; originally announced February 2019.

  29. arXiv:1811.06933  [pdf, other]

    cs.DS

    Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

    Authors: Alan Kuhnle, Taher Mun, Christina Boucher, Travis Gagie, Ben Langmead, Giovanni Manzini

    Abstract: While short read aligners, which predominantly use the FM-index, are able to easily index one or a few human genomes, they do not scale well to indexing databases containing thousands of genomes. To understand why, it helps to examine the main components of the FM-index in more detail, which is a rank data structure over the Burrows-Wheeler Transform (BWT) of the string that will allow us to find…

    Submitted 16 November, 2018; originally announced November 2018.

  30. arXiv:1809.07320  [pdf, other]

    cs.DS q-bio.GN

    Compressing and Indexing Aligned Readsets

    Authors: Travis Gagie, Garance Gourdel, Giovanni Manzini

    Abstract: In this paper we show how to use one or more assembled or partially assembled genomes as the basis for a compressed full-text index of its readset. Specifically, we build a labelled tree by taking the assembled genome as a trunk and grafting onto it the reads that align to it, at the starting positions of their alignments. Next, we compute the eXtended Burrows-Wheeler Transform (XBWT) of the result…

    Submitted 1 June, 2021; v1 submitted 19 September, 2018; originally announced September 2018.

  31. External memory BWT and LCP computation for sequence collections with applications

    Authors: Lavinia Egidi, Felipe A. Louza, Giovanni Manzini, Guilherme P. Telles

    Abstract: We propose an external memory algorithm for the computation of the BWT and LCP array for a collection of sequences. Our algorithm takes the amount of available memory as an input parameter, and tries to make the best use of it by splitting the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the par…

    Submitted 17 May, 2018; originally announced May 2018.

  32. arXiv:1803.11245  [pdf, other]

    cs.DS

    Prefix-Free Parsing for Building Big BWTs

    Authors: Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, Taher Mun

    Abstract: High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly repetitive, a characteristic that can be exploited to ease the computation of…

    Submitted 16 November, 2018; v1 submitted 29 March, 2018; originally announced March 2018.

    Comments: Preliminary version appeared at WABI '18; full version submitted to a journal

  33. Lyndon Array Construction during Burrows-Wheeler Inversion

    Authors: Felipe A. Louza, W. F. Smyth, Giovanni Manzini, Guilherme P. Telles

    Abstract: In this paper we present an algorithm to compute the Lyndon array of a string $T$ of length $n$ as a byproduct of the inversion of the Burrows-Wheeler transform of $T$. Our algorithm runs in linear time using only a stack in addition to the data structures used for Burrows-Wheeler inversion. We compare our algorithm with two other linear-time algorithms for Lyndon array construction and show that…

    Submitted 27 October, 2017; originally announced October 2017.

    Journal ref: Journal of Discrete Algorithms, 50 (2018), 2-9

  34. arXiv:1610.02865  [pdf, other]

    cs.DS

    An Encoding for Order-Preserving Matching

    Authors: Travis Gagie, Giovanni Manzini, Rossano Venturini

    Abstract: Encoding data structures store enough information to answer the queries they are meant to support but not enough to recover their underlying datasets. In this paper we give the first encoding data structure for the challenging problem of order-preserving pattern matching. This problem was introduced only a few years ago but has already attracted significant attention because of its applications in…

    Submitted 17 February, 2017; v1 submitted 10 October, 2016; originally announced October 2016.

  35. arXiv:1609.04618  [pdf, ps, other]

    cs.DS

    From H&M to Gap for Lightweight BWT Merging

    Authors: Giovanni Manzini

    Abstract: Recently, Holt and McMillan [Bioinformatics 2014, ACM-BCB 2014] have proposed a simple and elegant algorithm to merge the Burrows-Wheeler transforms of a family of strings. In this paper we show that the H&M algorithm can be improved so that, in addition to merging the BWTs, it can also merge the Longest Common Prefix (LCP) arrays. The new algorithm, called Gap because of how it operates, has the s…

    Submitted 15 September, 2016; originally announced September 2016.

    Comments: 11 pages

    ACM Class: F.2.2; E.1

  36. arXiv:1606.05724  [pdf, ps, other]

    cs.DS

    A Compact Index for Order-Preserving Pattern Matching

    Authors: Gianni Decaroli, Travis Gagie, Giovanni Manzini

    Abstract: Order-preserving pattern matching was introduced recently but it has already attracted much attention. Given a reference sequence and a pattern, we want to locate all substrings of the reference sequence whose elements have the same relative order as the pattern elements. For this problem we consider the offline version in which we build an index for the reference sequence so that subsequent searc…

    Submitted 10 December, 2018; v1 submitted 18 June, 2016; originally announced June 2016.

    Comments: 16 pages. A preliminary version appeared in the Proc. IEEE Data Compression Conference, DCC 2017, Snowbird, UT, USA, 2017

    ACM Class: F.2.2; E.4; H.3.3

  37. arXiv:1605.06615  [pdf, other]

    cs.DS

    Efficient and Compact Representations of Some Non-Canonical Prefix-Free Codes

    Authors: Antonio Fariña, Travis Gagie, Szymon Grabowski, Giovanni Manzini, Gonzalo Navarro, Alberto Ordóñez

    Abstract: For many kinds of prefix-free codes there are efficient and compact alternatives to the traditional tree-based representation. Since these put the codes into canonical form, however, they can only be used when we can choose the order in which codewords are assigned to symbols. In this paper we first show how, given a probability distribution over an alphabet of $σ$ symbols, we can store an optimal…

    Submitted 1 April, 2021; v1 submitted 21 May, 2016; originally announced May 2016.

    Comments: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941. A preliminary version was presented at the 23rd International Symposium on String Processing and Information Retrieval (SPIRE '16)

  38. arXiv:1506.03262  [pdf, other]

    cs.DS

    Relative Select

    Authors: Christina Boucher, Alexander Bowe, Travis Gagie, Giovanni Manzini, Jouni Sirén

    Abstract: Motivated by the problem of storing coloured de Bruijn graphs, we show how, if we can already support fast select queries on one string, then we can store a little extra information and support fairly fast select queries on a similar string.

    Submitted 10 June, 2015; originally announced June 2015.

  39. arXiv:1404.4814  [pdf, ps, other]

    cs.DS

    Reusing an FM-index

    Authors: Djamal Belazzougui, Travis Gagie, Simon Gog, Giovanni Manzini, Jouni Sirén

    Abstract: Intuitively, if two strings $S_1$ and $S_2$ are sufficiently similar and we already have an FM-index for $S_1$ then, by storing a little extra information, we should be able to reuse parts of that index in an FM-index for $S_2$. We formalize this intuition and show that it can lead to significant space savings in practice, as well as to some interesting theoretical problems.

    Submitted 9 May, 2014; v1 submitted 18 April, 2014; originally announced April 2014.

  40. arXiv:1312.3422  [pdf, ps, other]

    cs.DS

    Compressed Spaced Suffix Arrays

    Authors: Travis Gagie, Giovanni Manzini, Daniel Valenzuela

    Abstract: Spaced seeds are important tools for similarity search in bioinformatics, and using several seeds together often significantly improves their performance. With existing approaches, however, for each seed we keep a separate linear-size data structure, either a hash table or a spaced suffix array (SSA). In this paper we show how to compress SSAs relative to normal suffix arrays (SAs) and still suppo…

    Submitted 9 March, 2014; v1 submitted 12 December, 2013; originally announced December 2013.

  41. arXiv:0909.4341  [pdf, ps, other]

    cs.DS

    Lightweight Data Indexing and Compression in External Memory

    Authors: Paolo Ferragina, Travis Gagie, Giovanni Manzini

    Abstract: In this paper we describe algorithms for computing the BWT and for building (compressed) indexes in external memory. The innovative feature of our algorithms is that they are lightweight in the sense that, for an input of size $n$, they use only $n$ bits of disk working space while all previous approaches use $\Theta(n \log n)$ bits of disk working space. Moreover, our algorithms access disk data…

    Submitted 24 September, 2009; originally announced September 2009.
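
Note on entry 10 (Suffixient Sets): the definition quoted in that abstract can be checked by brute force on small inputs. The sketch below is only an illustration, not the construction from the paper. It assumes the text ends with a unique terminator such as `$`, and it treats the suffix-tree nodes that have descending edges as the root (the empty string) plus every substring followed by at least two distinct characters. The function names `right_extensions` and `is_suffixient` are hypothetical.

```python
def right_extensions(text):
    """Map each substring of text (including "") to the set of characters following it."""
    ext = {}
    n = len(text)
    for i in range(n):
        for j in range(i, n):
            # text[i:j] occurs at position i and is followed by text[j]
            ext.setdefault(text[i:j], set()).add(text[j])
    return ext


def is_suffixient(text, positions):
    """Brute-force check of the definition; positions are 1-based as in the abstract.

    Assumption: text ends with a unique terminator, so internal suffix-tree
    nodes are the root plus the substrings with >= 2 distinct right extensions.
    """
    for alpha, chars in right_extensions(text).items():
        if alpha != "" and len(chars) < 2:
            continue  # alpha is not the path label of a branching node
        for c in chars:
            # Need some s in S with T[s] = c and alpha a suffix of T[1..s-1].
            covered = any(
                text[s - 1] == c and text[: s - 1].endswith(alpha)
                for s in positions
            )
            if not covered:
                return False
    return True


if __name__ == "__main__":
    print(is_suffixient("abab$", {2, 3, 5}))
    print(is_suffixient("abab$", {1, 2, 5}))
```

The check enumerates all substrings, so it is meant only for experimenting with short strings.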
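
Note on entry 36 (A Compact Index for Order-Preserving Pattern Matching): the problem statement in that abstract, locating all substrings of the reference whose elements have the same relative order as the pattern, can be illustrated with a naive scan. This is a minimal sketch of the matching problem only, not the compact index the paper describes. The names `rank_signature` and `order_preserving_matches` are hypothetical, reported positions are 0-based, and equal values are assumed to share a rank.

```python
def rank_signature(seq):
    """Replace each element by its rank among the distinct values of seq."""
    order = {value: rank for rank, value in enumerate(sorted(set(seq)))}
    return tuple(order[value] for value in seq)


def order_preserving_matches(reference, pattern):
    """Return 0-based start positions of windows with the same relative order as pattern."""
    m = len(pattern)
    target = rank_signature(pattern)
    return [
        i
        for i in range(len(reference) - m + 1)
        if rank_signature(reference[i:i + m]) == target
    ]


if __name__ == "__main__":
    # Windows [10, 20, 15] and [15, 30, 25] both follow the low-high-middle shape of [1, 3, 2].
    print(order_preserving_matches([10, 20, 15, 30, 25, 40], [1, 3, 2]))
```

Each window is re-ranked from scratch, so the scan costs roughly O(n m log m) time; an index is what avoids repeating this work across queries.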