Showing 1–9 of 9 results for author: Diaz-Dominguez, D

Searching in archive cs.
  1. arXiv:2506.03294  [pdf, ps, other]

    cs.DS

    Prefix-free parsing for merging big BWTs

    Authors: Diego Diaz-Dominguez, Travis Gagie, Veronica Guerrini, Ben Langmead, Zsuzsanna Liptak, Giovanni Manzini, Francesco Masillo, Vikram Shivakumar

    Abstract: When building Burrows-Wheeler Transforms (BWTs) of truly huge datasets, prefix-free parsing (PFP) can use an unreasonable amount of memory. In this paper we show how if a dataset can be broken down into small datasets that are not very similar to each other -- such as collections of many copies of genomes of each of several species, or collections of many copies of each of the human chromosomes --…

    Submitted 6 June, 2025; v1 submitted 3 June, 2025; originally announced June 2025.

  2. arXiv:2411.12439  [pdf, other]

    cs.DS

    Efficient terabyte-scale text compression via stable local consistency and parallel grammar processing

    Authors: Diego Diaz-Dominguez

    Abstract: We present a highly parallelizable text compression algorithm that scales efficiently to terabyte-sized datasets. Our method builds on locally consistent grammars, a lightweight form of compression, combined with simple recompression techniques to achieve further space reductions. Locally consistent grammar algorithms are particularly suitable for scaling, as they need minimal satellite informatio…

    Submitted 24 February, 2025; v1 submitted 19 November, 2024; originally announced November 2024.

  3. arXiv:2306.16815  [pdf, other]

    cs.IR cs.DS

    Computing all-vs-all MEMs in grammar-compressed text

    Authors: Diego Diaz-Dominguez, Leena Salmela

    Abstract: We describe a compression-aware method to compute all-vs-all maximal exact matches (MEM) among strings of a repetitive collection $\mathcal{T}$. The key concept in our work is the construction of a fully-balanced grammar $\mathcal{G}$ from $\mathcal{T}$ that meets a property that we call \emph{fix-free}: the expansions of the nonterminals that have the same height in the parse tree form a fix-free…

    Submitted 29 June, 2023; originally announced June 2023.

  4. arXiv:2208.14787  [pdf, other]

    cs.DS

    Computing all-vs-all MEMs in run-length encoded collections of HiFi reads

    Authors: Diego Díaz-Domínguez, Simon J. Puglisi, Leena Salmela

    Abstract: We describe an algorithm to find maximal exact matches (MEMs) among HiFi reads with homopolymer errors. The main novelty in our work is that we resort to run-length compression to help deal with errors. Our method receives as input a run-length-encoded string collection containing the HiFi reads along with their reverse complements. Subsequently, it splits the encoding into two arrays, one storing…

    Submitted 31 August, 2022; originally announced August 2022.

    Comments: Accepted in SPIRE'22

  5. arXiv:2204.05969  [pdf, other]

    cs.DS

    Efficient Construction of the BWT for Repetitive Text Using String Compression

    Authors: Diego Díaz-Domínguez, Gonzalo Navarro

    Abstract: We present a new semi-external algorithm that builds the Burrows--Wheeler transform variant of Bauer et al. (a.k.a., BCR BWT) in linear expected time. Our method uses compression techniques to reduce computational costs when the input is massive and repetitive. Concretely, we build on induced suffix sorting (ISS) and resort to run-length and grammar compression to maintain our intermediate results…

    Submitted 14 August, 2023; v1 submitted 12 April, 2022; originally announced April 2022.

    Comments: Under peer review

  6. arXiv:2011.07999  [pdf, other]

    cs.DS cs.IR

    A grammar compressor for collections of reads with applications to the construction of the BWT

    Authors: Diego Díaz-Domínguez, Gonzalo Navarro

    Abstract: We describe a grammar for DNA sequencing reads from which we can compute the BWT directly. Our motivation is to perform in succinct space genomic analyses that require complex string queries not yet supported by repetition-based self-indexes. Our approach is to store the set of reads as a grammar, but when required, compute its BWT to carry out the analysis by using self-indexes. Our experiments i…

    Submitted 12 November, 2020; originally announced November 2020.

  7. arXiv:1908.02211  [pdf, other]

    cs.DS

    An Index for Sequencing Reads Based on The Colored de Bruijn Graph

    Authors: Diego Diaz-Domínguez

    Abstract: In this article, we show how to transform a colored de Bruijn graph (dBG) into a practical index for processing massive sets of sequencing reads. Similar to previous works, we encode an instance of a colored dBG of the set using BOSS and a color matrix C. To reduce the space requirements, we devise an algorithm that produces a smaller and more sparse version of C. The novelties in this algorithm a…

    Submitted 29 November, 2019; v1 submitted 6 August, 2019; originally announced August 2019.

    Comments: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941

  8. arXiv:1901.10453  [pdf, other]

    cs.DS

    Simulating the DNA String Graph in Succinct Space

    Authors: Diego Díaz-Domínguez, Travis Gagie, Gonzalo Navarro

    Abstract: Converting a set of sequencing reads into a lossless compact data structure that encodes all the relevant biological information is a major challenge. The classical approaches are to build the string graph or the de Bruijn graph. Each has advantages over the other depending on the application. Still, the ideal setting would be to have an index of the reads that is easy to build and can be adapted…

    Submitted 29 November, 2019; v1 submitted 29 January, 2019; originally announced January 2019.

    Comments: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941

    MSC Class: J.3; E.1; G.2.2

    ACM Class: J.3; E.1; G.2.2

  9. arXiv:1805.05228  [pdf, other]

    cs.DS

    Assembling Omnitigs using Hidden-Order de Bruijn Graphs

    Authors: Diego Díaz-Domínguez, Djamal Belazzougui, Travis Gagie, Veli Mäkinen, Gonzalo Navarro, Simon J. Puglisi

    Abstract: De novo DNA assembly is a fundamental task in Bioinformatics, and finding Eulerian paths on de Bruijn graphs is one of the dominant approaches to it. In most of the cases, there may be no one order for the de Bruijn graph that works well for assembling all of the reads. For this reason, some de Bruijn-based assemblers try assembling on several graphs of increasing order, in turn. Boucher et al. (2…

    Submitted 14 May, 2018; originally announced May 2018.