-
The theoretical analysis of sequencing bioinformatics algorithms and beyond
Authors:
Paul Medvedev
Abstract:
The theoretical analysis of performance has been an important tool in the engineering of algorithms in many application domains. Its goals are to predict the empirical performance of an algorithm and to be a yardstick that drives the design of novel algorithms that perform well in practice. While these goals have been achieved in many instances, they have not been achieved ubiquitously across crucial application domains. I provide a case study in the area of sequencing bioinformatics, an inter-disciplinary field that uses algorithms to extract biological meaning from genome sequencing data. In particular, I give three concrete examples: two showing how theoretical analysis has failed to achieve its goals and one showing how it has been successful. I will then catalog some of the challenges of applying theoretical analysis to sequencing bioinformatics, argue why empirical analysis is not enough, and give a vision for improving the relevance of theoretical analysis to sequencing bioinformatics. By recognizing the problem, understanding its roots, and providing potential solutions, this work can hopefully be a crucial first step towards making theoretical analysis more relevant in sequencing bioinformatics and potentially other fast-paced application domains.
Submitted 14 November, 2022; v1 submitted 3 May, 2022;
originally announced May 2022.
-
Theoretical analysis of edit distance algorithms: an applied perspective
Authors:
Paul Medvedev
Abstract:
Given its status as a classic problem and its importance to both theoreticians and practitioners, edit distance provides an excellent lens through which to understand how the theoretical analysis of algorithms impacts practical implementations. From an applied perspective, the goals of theoretical analysis are to predict the empirical performance of an algorithm and to serve as a yardstick to design novel algorithms that perform well in practice. In this paper, we systematically survey the types of theoretical analysis techniques that have been applied to edit distance and evaluate the extent to which each one has achieved these two goals. These techniques include traditional worst-case analysis, worst-case analysis parametrized by edit distance or entropy or compressibility, average-case analysis, semi-random models, and advice-based models. We find that the track record is mixed. On one hand, two algorithms widely used in practice have been born out of theoretical analysis and their empirical performance is captured well by theoretical predictions. On the other hand, all the algorithms developed using theoretical analysis as a yardstick since then have not had any practical relevance. We conclude by discussing the remaining open problems and how they can be tackled.
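For readers unfamiliar with the problem, the following is a minimal sketch of the textbook quadratic-time dynamic program for edit distance with unit costs. It is included only as a baseline illustration and is not claimed to be one of the algorithms the paper evaluates; the function name `edit_distance` is a hypothetical choice made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance (insertions, deletions, substitutions).

    Textbook dynamic program, O(len(a) * len(b)) time. Only two rows of
    the table are kept, so memory is O(min(len(a), len(b))).
    """
    # Keep the shorter string along the row to minimise memory.
    if len(a) < len(b):
        a, b = b, a

    # prev[j] = distance between the empty prefix of a and b[:j].
    prev = list(range(len(b) + 1))

    for i, ca in enumerate(a, start=1):
        # curr[0] = distance between a[:i] and the empty string.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1      # drop ca from a
            insertion = curr[j - 1] + 1  # insert cb into a
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1]
```

The worst-case cost of this sketch is quadratic regardless of how similar the inputs are, which is the kind of gap between worst-case bounds and observed behaviour that parametrized analyses (for example, by the edit distance itself) aim to capture.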
Submitted 30 January, 2023; v1 submitted 20 April, 2022;
originally announced April 2022.
-
Data structures to represent a set of k-long DNA sequences
Authors:
Rayan Chikhi,
Jan Holub,
Paul Medvedev
Abstract:
The analysis of biological sequencing data has been one of the biggest applications of string algorithms. The approaches used in many such applications are based on the analysis of k-mers, which are short fixed-length strings present in a dataset. While these approaches are rather diverse, storing and querying a k-mer set has emerged as a shared underlying component. A set of k-mers has unique features and applications that, over the last ten years, have resulted in many specialized approaches for its representation. In this survey, we give a unified presentation and comparison of the data structures that have been proposed to store and query a k-mer set. We hope this survey will serve as a resource for researchers in the field as well as make the area more accessible to researchers outside the field.
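As a point of reference, the simplest possible k-mer set representation can be sketched with a built-in hash set. This is an illustrative baseline only, not one of the specialized data structures the survey compares; the function name `kmer_set` is hypothetical, and the sketch assumes plain strings and ignores reverse complements.

```python
from typing import Iterable, Set


def kmer_set(sequences: Iterable[str], k: int) -> Set[str]:
    """Collect every length-k substring (k-mer) occurring in the sequences.

    Assumption for this sketch: k-mers are stored as plain Python strings
    in a hash set, and a k-mer is not merged with its reverse complement.
    """
    if k <= 0:
        raise ValueError("k must be positive")

    kmers: Set[str] = set()
    for seq in sequences:
        # A sequence shorter than k contributes no k-mers.
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return kmers


# Membership query: is a given k-mer present in the dataset?
example = kmer_set(["ACGTA", "GTAC"], 3)
has_cgt = "CGT" in example
```

A hash set of strings answers membership queries quickly but spends many bytes per k-mer, which is one reason more compact specialized representations are of interest.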
Submitted 11 June, 2020; v1 submitted 28 March, 2019;
originally announced March 2019.
-
Bipartite Graphs of Small Readability
Authors:
Rayan Chikhi,
Vladan Jovicic,
Stefan Kratsch,
Paul Medvedev,
Martin Milanic,
Sofya Raskhodnikova,
Nithin Varma
Abstract:
We study a parameter of bipartite graphs called readability, introduced by Chikhi et al. (Discrete Applied Mathematics, 2016) and motivated by applications of overlap graphs in bioinformatics. The behavior of the parameter is poorly understood. The complexity of computing it is open and it is not known whether the decision version of the problem is in NP. The only known upper bound on the readability of a bipartite graph (following from a work of Braga and Meidanis, LATIN 2002) is exponential in the maximum degree of the graph.
Graphs that arise in bioinformatic applications have low readability. In this paper, we focus on graph families with readability $o(n)$, where $n$ is the number of vertices. We show that the readability of $n$-vertex bipartite chain graphs is between $\Omega(\log n)$ and $O(\sqrt{n})$. We give an efficiently testable characterization of bipartite graphs of readability at most $2$ and completely determine the readability of grids, showing in particular that their readability never exceeds $3$. As a consequence, we obtain a polynomial time algorithm to determine the readability of induced subgraphs of grids. One of the highlights of our techniques is the appearance of Euler's totient function in the analysis of the readability of bipartite chain graphs. We also develop a new technique for proving lower bounds on readability, which is applicable to dense graphs with a large number of distinct degrees.
Submitted 12 May, 2018;
originally announced May 2018.
-
Modeling Biological Problems in Computer Science: A Case Study in Genome Assembly
Authors:
Paul Medvedev
Abstract:
As computer scientists working in bioinformatics/computational biology, we often face the challenge of coming up with an algorithm to answer a biological question. This occurs in many areas, such as variant calling, alignment, and assembly. In this tutorial, we use the example of the genome assembly problem to demonstrate how to go from a question in the biological realm to a solution in the computer science realm. We show the modeling process step-by-step, including all the intermediate failed attempts.
Submitted 2 January, 2018; v1 submitted 16 June, 2017;
originally announced June 2017.
-
TwoPaCo: An efficient algorithm to build the compacted de Bruijn graph from many complete genomes
Authors:
Ilia Minkin,
Son Pham,
Paul Medvedev
Abstract:
Motivation: De Bruijn graphs have been proposed as a data structure to facilitate the analysis of related whole genome sequences, in both population and comparative genomic settings. However, current approaches do not scale well to many genomes of large size (such as mammalian genomes). Results: In this paper, we present TwoPaCo, a simple and scalable low memory algorithm for the direct construction of the compacted de Bruijn graph from a set of complete genomes. We demonstrate that it can construct the graph for 100 simulated human genomes in less than a day and eight real primates in less than two hours, on a typical shared-memory machine. We believe that this progress will enable novel biological analyses of hundreds of mammalian-sized genomes. Availability: Our code and data are available for download from github.com/medvedevgroup/TwoPaCo. Contact: [email protected]
Submitted 18 February, 2016;
originally announced February 2016.
-
Safe and complete contig assembly via omnitigs
Authors:
Alexandru I. Tomescu,
Paul Medvedev
Abstract:
Contig assembly is the first stage that most assemblers solve when reconstructing a genome from a set of reads. Its output consists of contigs -- a set of strings that are promised to appear in any genome that could have generated the reads. From the introduction of contigs 20 years ago, assemblers have tried to obtain longer and longer contigs, but the following question was never solved: given a genome graph $G$ (e.g. a de Bruijn, or a string graph), what are all the strings that can be safely reported from $G$ as contigs? In this paper we finally answer this question, and also give a polynomial time algorithm to find them. Our experiments show that these strings, which we call omnitigs, are 66% to 82% longer on average than the popular unitigs, and 29% of dbSNP locations have more neighbors in omnitigs than in unitigs.
Submitted 16 August, 2016; v1 submitted 12 January, 2016;
originally announced January 2016.
-
On the readability of overlap digraphs
Authors:
Rayan Chikhi,
Paul Medvedev,
Martin Milanic,
Sofya Raskhodnikova
Abstract:
We introduce the graph parameter readability and study it as a function of the number of vertices in a graph. Given a digraph D, an injective overlap labeling assigns a unique string to each vertex such that there is an arc from x to y if and only if x properly overlaps y. The readability of D is the minimum string length for which an injective overlap labeling exists. In applications that utilize overlap digraphs (e.g., in bioinformatics), readability reflects the length of the strings from which the overlap digraph is constructed. We study the asymptotic behaviour of readability by casting it in purely graph theoretic terms (without any reference to strings). We prove upper and lower bounds on readability for certain graph families and general graphs.
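The definition of an injective overlap labeling can be made concrete with a small checker. The sketch below rests on assumptions not spelled out in the abstract: "x properly overlaps y" is taken to mean that some non-empty suffix of x, shorter than both strings, equals a prefix of y, and the arc condition is tested for every ordered pair of vertices, including a vertex with itself. The function names are hypothetical. The code only verifies a given labeling; it does not compute readability.

```python
from typing import Dict, Hashable, Iterable, Tuple


def properly_overlaps(x: str, y: str) -> bool:
    """True if a non-empty proper suffix of x equals a proper prefix of y.

    Assumption for this sketch: "proper" means the overlap is strictly
    shorter than both strings.
    """
    limit = min(len(x), len(y))
    return any(x[-i:] == y[:i] for i in range(1, limit))


def is_injective_overlap_labeling(
    arcs: Iterable[Tuple[Hashable, Hashable]],
    labels: Dict[Hashable, str],
) -> bool:
    """Check that `labels` is an injective overlap labeling of the digraph.

    The vertex set is taken to be the keys of `labels`. The labeling is
    valid when all labels are distinct and the arcs induced by proper
    overlaps are exactly the given arcs.
    """
    # Injectivity: no two vertices may share a string.
    if len(set(labels.values())) != len(labels):
        return False

    induced = {
        (u, v)
        for u in labels
        for v in labels
        if properly_overlaps(labels[u], labels[v])
    }
    return induced == set(arcs)
```

For instance, labeling vertices `a` and `b` with `"AB"` and `"BC"` induces the single arc from `a` to `b`, so this labeling, with strings of length 2, is valid for the digraph consisting of just that arc.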
Submitted 17 April, 2015;
originally announced April 2015.
-
On the representation of de Bruijn graphs
Authors:
Rayan Chikhi,
Antoine Limasset,
Shaun Jackman,
Jared Simpson,
Paul Medvedev
Abstract:
The de Bruijn graph plays an important role in bioinformatics, especially in the context of de novo assembly. However, the representation of the de Bruijn graph in memory is a computational bottleneck for many assemblers. Recent papers proposed a navigational data structure approach in order to improve memory usage. We prove several theoretical space lower bounds to show the limitation of these types of approaches. We further design and implement a general data structure (DBGFM) and demonstrate its use on a human whole-genome dataset, achieving space usage of 1.5 GB and a 46% improvement over previous approaches. As part of DBGFM, we develop the notion of frequency-based minimizers and show how it can be used to enumerate all maximal simple paths of the de Bruijn graph using only 43 MB of memory. Finally, we demonstrate that our approach can be integrated into an existing assembler by modifying the ABySS software to use DBGFM.
Submitted 6 October, 2014; v1 submitted 21 January, 2014;
originally announced January 2014.
-
Shortest paths between shortest paths and independent sets
Authors:
Marcin Kaminski,
Paul Medvedev,
Martin Milanic
Abstract:
We study problems of reconfiguration of shortest paths in graphs. We prove that the shortest reconfiguration sequence can be exponential in the size of the graph and that it is NP-hard to compute the shortest reconfiguration sequence even when we know that the sequence has polynomial length. Moreover, we also study reconfiguration of independent sets in three different models and analyze relationships between these models, observing that shortest path reconfiguration is a special case of independent set reconfiguration in perfect graphs, under any of the three models. Finally, we give polynomial results for restricted classes of graphs (even-hole-free and $P_4$-free graphs).
Submitted 7 February, 2011; v1 submitted 26 August, 2010;
originally announced August 2010.
-
The Plane-Width of Graphs
Authors:
Marcin Kaminski,
Paul Medvedev,
Martin Milanic
Abstract:
Map vertices of a graph to (not necessarily distinct) points of the plane so that two adjacent vertices are mapped at least a unit distance apart. The plane-width of a graph is the minimum diameter of the image of the vertex set over all such mappings. We establish a relation between the plane-width of a graph and its chromatic number, and connect it to other well-known areas, including the circular chromatic number and the problem of packing unit discs in the plane. We also investigate how plane-width behaves under various operations, such as homomorphism, disjoint union, complement, and the Cartesian product.
Submitted 23 December, 2008;
originally announced December 2008.