-
Efficiently listing bounded length st-paths
Authors:
Romeo Rizzi,
Gustavo Sacomoto,
Marie-France Sagot
Abstract:
The problem of listing the $K$ shortest simple (loopless) $st$-paths in a graph has been studied since the early 1960s. For a non-negatively weighted graph with $n$ vertices and $m$ edges, the most efficient solution is an $O(K(mn + n^2 \log n))$ algorithm for directed graphs by Yen and Lawler [Management Science, 1971 and 1972], and an $O(K(m+n \log n))$ algorithm for the undirected version by Katoh et al. [Networks, 1982], both using $O(Kn + m)$ space. In this work, we consider a different parameterization for this problem: instead of bounding the number of $st$-paths output, we bound their length. For the bounded length parameterization, we propose new non-trivial algorithms matching the time complexity of the classic algorithms but using only $O(m+n)$ space. Moreover, we provide a unified framework such that the solutions to both parameterizations -- the classic $K$-shortest and the new length-bounded paths -- can be seen as two different traversals of the same tree, a Dijkstra-like and a DFS-like traversal, respectively.
Submitted 25 November, 2014;
originally announced November 2014.
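To make the bounded-length parameterization concrete, here is a minimal Python sketch of a DFS-like listing of all simple $st$-paths of weight at most a bound. It is a naive illustration and not the algorithm of the paper: it prunes with distances to $t$ computed once in the whole graph, so it can still wander into dead ends and does not achieve the stated time bounds. The names `distances_to_target`, `bounded_st_paths` and `alpha`, and the adjacency-list input format, are assumptions made for this example.

```python
import heapq

def distances_to_target(graph, t):
    """Dijkstra on the reversed graph: shortest distance from each vertex to t."""
    reverse = {}
    for u, edges in graph.items():
        for v, w in edges:
            reverse.setdefault(v, []).append((u, w))
    dist = {t: 0}
    heap = [(0, t)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in reverse.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def bounded_st_paths(graph, s, t, alpha):
    """Yield every simple s-t path whose total weight is at most alpha.

    graph maps a vertex to a list of (neighbor, non-negative weight) pairs.
    """
    dist = distances_to_target(graph, t)
    path, on_path = [s], {s}

    def dfs(u, used):
        if u == t:
            yield list(path)
            return
        for v, w in graph.get(u, []):
            if v in on_path:
                continue  # keep the path simple
            # Prune when even the shortest v-t route would exceed the bound.
            if used + w + dist.get(v, float("inf")) > alpha:
                continue
            path.append(v)
            on_path.add(v)
            yield from dfs(v, used + w)
            path.pop()
            on_path.remove(v)

    if dist.get(s, float("inf")) <= alpha:
        yield from dfs(s, 0)

# Hypothetical example graph.
example = {
    "s": [("a", 1), ("b", 2)],
    "a": [("t", 1), ("b", 1)],
    "b": [("t", 1)],
    "t": [],
}
for p in bounded_st_paths(example, "s", "t", 3):
    print(p)
```

Apart from the input graph and the precomputed distances, the sketch stores only the current path, which is the flavor of the $O(m+n)$ space claim; the paper's contribution lies in obtaining this together with the classic time bounds.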
-
Computing an Evolutionary Ordering is Hard
Authors:
Laurent Bulteau,
Gustavo Sacomoto,
Blerina Sinaimeri
Abstract:
We prove that computing an evolutionary ordering of a family of sets, i.e. an ordering where each set intersects with, but is not included in, the union of the earlier sets, is NP-hard.
Submitted 24 October, 2014;
originally announced October 2014.
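The definition in the abstract above can be checked mechanically for a given ordering. The Python sketch below verifies the condition and, for tiny inputs, searches all permutations; the exhaustive search takes factorial time, which is in line with (though of course does not prove) the hardness result. The function names are hypothetical, and the first set is assumed to be exempt from the condition since it has no earlier sets.

```python
from itertools import permutations

def is_evolutionary_ordering(sets):
    """Check that every set after the first intersects, but is not
    included in, the union of the sets that precede it."""
    seen = set()
    for i, current in enumerate(sets):
        current = set(current)
        if i > 0:
            if not (current & seen):
                return False  # does not intersect the earlier union
            if current <= seen:
                return False  # entirely included in the earlier union
        seen |= current
    return True

def find_evolutionary_ordering(sets):
    """Brute-force search over all orderings; only usable for tiny families."""
    for candidate in permutations(sets):
        if is_evolutionary_ordering(candidate):
            return list(candidate)
    return None

family = [{1, 2}, {3, 4}, {2, 3}]
print(is_evolutionary_ordering(family))
print(find_evolutionary_ordering(family))
```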
-
Amortized $\tilde{O}(|V|)$-Delay Algorithm for Listing Chordless Cycles in Undirected Graphs
Authors:
Rui Ferreira,
Roberto Grossi,
Romeo Rizzi,
Gustavo Sacomoto,
Marie-France Sagot
Abstract:
Chordless cycles are very natural structures in undirected graphs, with an important history and distinguished role in graph theory. Motivated also by previous work on the classical problem of listing cycles, we study how to list chordless cycles. The best known solution to list all the $C$ chordless cycles contained in an undirected graph $G = (V,E)$ takes $O(|E|^2 + |E| \cdot C)$ time. In this paper we provide an algorithm taking $\tilde{O}(|E| + |V| \cdot C)$ time. We also show how to obtain the same complexity for listing all the $P$ chordless $st$-paths in $G$ (where $C$ is replaced by $P$).
Submitted 6 August, 2014;
originally announced August 2014.
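For readers unfamiliar with the term, a cycle is chordless when no edge of the graph joins two of its vertices that are not consecutive on the cycle. The Python sketch below only checks this property for one given cycle; it is not the listing algorithm of the paper, and the function name and the adjacency-set representation are assumptions made for the example.

```python
def is_chordless_cycle(adj, cycle):
    """Return True if `cycle` (a list of distinct vertices, in order) is a
    cycle of the undirected graph `adj` with no chord.

    adj maps each vertex to the set of its neighbors.
    """
    k = len(cycle)
    if k < 3 or len(set(cycle)) != k:
        return False
    # Consecutive vertices (including last -> first) must be adjacent.
    for i in range(k):
        if cycle[(i + 1) % k] not in adj[cycle[i]]:
            return False
    # No edge may join two non-consecutive vertices of the cycle.
    for i in range(k):
        for j in range(i + 2, k):
            if i == 0 and j == k - 1:
                continue  # first and last vertex are consecutive on the cycle
            if cycle[j] in adj[cycle[i]]:
                return False
    return True

# A 4-cycle without a chord, then the same cycle with the chord 1-3 added.
square = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {3, 1}}
print(is_chordless_cycle(square, [1, 2, 3, 4]))
with_chord = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 1}}
print(is_chordless_cycle(with_chord, [1, 2, 3, 4]))
```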
-
Efficient Algorithms for de novo Assembly of Alternative Splicing Events from RNA-seq Data
Authors:
Gustavo Sacomoto
Abstract:
In this thesis, we address the problem of identifying and quantifying variants (alternative splicing and genomic polymorphism) in RNA-seq data when no reference genome is available, without assembling the full transcripts. Based on the fundamental idea that each variant corresponds to a recognizable pattern, a bubble, in a de Bruijn graph constructed from the RNA-seq reads, we propose a general model for all variants in such graphs. We then introduce an exact method, called KisSplice, to extract alternative splicing events. Finally, we show that it identifies more correct events than general-purpose transcriptome assemblers.
In order to deal with ever-increasing volumes of NGS data, we made an extra effort to make KisSplice as scalable as possible. First, to improve its running time, we propose a new polynomial delay algorithm to enumerate bubbles. We show that it is several orders of magnitude faster than previous approaches. Then, to reduce its memory consumption, we propose a new compact way to build and represent a de Bruijn graph. We show that our approach uses 30% to 40% less memory than the state of the art, with an insignificant impact on the construction time.
Additionally, we apply the techniques developed for listing bubbles to two classical problems: cycle enumeration and the K-shortest paths problem. We give the first optimal algorithm to list cycles in undirected graphs, improving over Johnson's algorithm. This is the first improvement on this problem in almost 40 years. We then consider a different parameterization of the classical K-shortest (simple) paths problem: instead of bounding the number of st-paths, we bound the weight of the st-paths. We present new algorithms with the same time complexities but using exponentially less memory than previous approaches.
Submitted 23 June, 2014;
originally announced June 2014.
-
Navigating in a sea of repeats in RNA-seq without drowning
Authors:
Gustavo Sacomoto,
Blerina Sinaimeri,
Camille Marchet,
Vincent Miele,
Marie-France Sagot,
Vincent Lacroix
Abstract:
The main challenge in de novo assembly of NGS data is certainly to deal with repeats that are longer than the reads. This is particularly true for RNA-seq data, since coverage information cannot be used to flag repeated sequences, of which transposable elements are one of the main examples. Most transcriptome assemblers are based on de Bruijn graphs and have no clear and explicit model for repeats in RNA-seq data, relying instead on heuristics to deal with them. The results of this work are twofold. First, we introduce a formal model for representing high copy number repeats in RNA-seq data and exploit its properties for inferring a combinatorial characteristic of repeat-associated subgraphs. We show that the problem of identifying in a de Bruijn graph a subgraph with this characteristic is NP-complete. In a second step, we show that in the specific case of a local assembly of alternative splicing (AS) events, we can implicitly avoid such subgraphs. In particular, we designed and implemented an algorithm to efficiently identify AS events that are not included in repeated regions. Finally, we validate our results using synthetic data. We also give an indication of the usefulness of our method on real data.
Submitted 4 June, 2014;
originally announced June 2014.
-
A polynomial delay algorithm for the enumeration of bubbles with length constraints in directed graphs and its application to the detection of alternative splicing in RNA-seq data
Authors:
Gustavo Sacomoto,
Vincent Lacroix,
Marie-France Sagot
Abstract:
We present a new algorithm for enumerating bubbles with length constraints in directed graphs. This problem arises in transcriptomics, where the question is to identify all alternative splicing events present in a sample of mRNAs sequenced by RNA-seq. This is the first polynomial-delay algorithm for this problem and we show that in practice, it is faster than previous approaches. This enables us to deal with larger instances and therefore to discover novel alternative splicing events, especially long ones, that were previously overlooked by existing methods.
Submitted 30 July, 2013;
originally announced July 2013.
-
Using cascading Bloom filters to improve the memory usage for de Brujin graphs
Authors:
Kamil Salikhov,
Gustavo Sacomoto,
Gregory Kucherov
Abstract:
De Bruijn graphs are widely used in bioinformatics for processing next-generation sequencing data. Due to the very large size of NGS datasets, it is essential to represent de Bruijn graphs compactly, and several approaches to this problem have been proposed recently. In this work, we show how to reduce the memory required by the algorithm of [3] that represents de Bruijn graphs using Bloom filters. Our method requires 30% to 40% less memory with respect to the method of [3], with an insignificant impact on construction time. At the same time, our experiments showed a better query time compared to [3]. This is, to our knowledge, the best practical representation for de Bruijn graphs.
Submitted 21 May, 2013; v1 submitted 28 February, 2013;
originally announced February 2013.
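As background for the abstract above, the Python sketch below shows the basic idea of storing the k-mers (nodes) of a de Bruijn graph in a single Bloom filter and recovering edges by querying the four possible one-letter extensions. It illustrates only that underlying building block under simplifying assumptions; it is neither the cascading construction of the paper nor the method of [3]. The class, the function names and the SHA-256 based hashing are choices made for this example.

```python
import hashlib

class BloomFilter:
    """Plain Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )

def kmers(read, k):
    """All substrings of length k of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def successors(bloom, kmer):
    """Candidate out-neighbors of a k-mer; may contain false positives."""
    return [kmer[1:] + c for c in "ACGT" if kmer[1:] + c in bloom]

bloom = BloomFilter(num_bits=1024, num_hashes=3)
for km in kmers("ACGTAC", 3):
    bloom.add(km)
print(successors(bloom, "ACG"))
```

Because a Bloom filter can answer "present" for k-mers that were never inserted, a practical representation needs some way to deal with these false positives, and the memory spent on that is the kind of cost the abstract reports reducing.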
-
Optimal Listing of Cycles and st-Paths in Undirected Graphs
Authors:
Rui Ferreira,
Roberto Grossi,
Andrea Marino,
Nadia Pisanti,
Romeo Rizzi,
Gustavo Sacomoto
Abstract:
We present the first optimal algorithm for the classical problem of listing all the cycles in an undirected graph. We exploit their properties so that the total cost is the time taken to read the input graph plus the time to list the output, namely, the edges in each of the cycles. The algorithm uses a reduction to the problem of listing all the paths from a vertex s to a vertex t which we also solve optimally.
Submitted 5 July, 2012; v1 submitted 12 May, 2012;
originally announced May 2012.