-
CARGO: Effective format-free compressed storage of genomic information
Authors:
Łukasz Roguski,
Paolo Ribeca
Abstract:
The recent super-exponential growth in the amount of sequencing data generated worldwide has brought techniques for compressed storage into focus. Most available solutions, however, are strictly tied to specific bioinformatics formats, sometimes inheriting suboptimal design choices from them; this hinders flexible and effective data sharing. Here we present CARGO (Compressed ARchiving for GenOmics), a high-level framework to automatically generate software systems optimized for the compressed storage of arbitrary types of large genomic data collections. Straightforward applications of our approach to FASTQ and SAM archives require a few lines of code, produce solutions that match and sometimes outperform specialized format-tailored compressors, and scale well to multi-TB datasets.
Submitted 16 June, 2015;
originally announced June 2015.
-
Faster exact Markovian probability functions for motif occurrences: a DFA-only approach
Authors:
Paolo Ribeca,
Emanuele Raineri
Abstract:
Background: The computation of the statistical properties of motif occurrences has an obviously relevant practical application: for example, patterns that are significantly over- or under-represented in the genome are interesting candidates for biological roles. However, the problem is computationally hard; as a result, virtually all the existing pipelines use fast but approximate scoring functions, even though they have been shown to systematically produce incorrect results. A few interesting exact approaches are known, but they are very slow and hence not practical in the case of realistic sequences. Results: We give an exact solution, solely based on deterministic finite-state automata (DFAs), to the problem of finding not only the p-value, but the whole relevant part of the Markovian probability distribution function of a motif in a biological sequence. In particular, the time complexity of the algorithm in the most interesting regimes is far better than that of Nuel (2006), which was the fastest similar exact algorithm known to date; in many cases, even approximate methods are outperformed. Conclusions: DFAs are a standard tool of computer science for the study of patterns, but so far they have been sparingly used in the study of biological motifs. Previous works do propose algorithms involving automata, but there they are used respectively as a first step to build a Finite Markov Chain Imbedding (FMCI), or to write a generating function; we, by contrast, rely only on the concept of DFA to perform the calculations. This innovative approach can realistically be used for exact statistical studies of very long genomes and protein sequences, as we illustrate with some examples on the scale of the human genome.
Submitted 24 January, 2008;
originally announced January 2008.
-
The Topology of Pseudoknotted Homopolymers
Authors:
G. Vernizzi,
P. Ribeca,
H. Orland,
A. Zee
Abstract:
We consider the folding of a self-avoiding homopolymer on a lattice, with saturating hydrogen bond interactions. Our goal is to numerically evaluate the statistical distribution of the topological genus of pseudoknotted configurations. The genus has been recently proposed for classifying pseudoknots (and their topological complexity) in the context of RNA folding. We compare our results on the distribution of the genus of pseudoknots with the theoretical predictions of an existing combinatorial model for an infinitely flexible and stretchable homopolymer. We thus find that steric and geometric constraints considerably limit the topological complexity of pseudoknotted configurations, as occurs, for instance, in real RNA molecules. We also analyze the scaling properties at large homopolymer length, and the genus distributions above and below the critical temperature between the swollen phase and the compact-globule phase, both in two and three dimensions.
Submitted 30 August, 2005; v1 submitted 29 August, 2005;
originally announced August 2005.