Showing 1–23 of 23 results for author: Mäkinen, V

  1. arXiv:2305.09752

    cs.DS

    Finding Maximal Exact Matches in Graphs

    Authors: Nicola Rizzo, Manuel Cáceres, Veli Mäkinen

    Abstract: We study the problem of finding maximal exact matches (MEMs) between a query string $Q$ and a labeled graph $G$. MEMs are an important class of seeds, often used in seed-chain-extend type of practical alignment methods because of their strong connections to classical metrics. A principled way to speed up chaining is to limit the number of MEMs by considering only MEMs of length at least $κ$ ($κ$-M…

    Submitted 3 July, 2023; v1 submitted 16 May, 2023; originally announced May 2023.

    Comments: 21 pages, 3 figures. To be published in the proceedings of WABI 2023. This article supersedes part of arXiv:2302.01748

  2. arXiv:2303.05336

    cs.DS

    Elastic Founder Graphs Improved and Enhanced

    Authors: Nicola Rizzo, Massimo Equi, Tuukka Norri, Veli Mäkinen

    Abstract: Indexing labeled graphs for pattern matching is a central challenge of pangenomics. Equi et al. (Algorithmica, 2022) developed the Elastic Founder Graph ($\mathsf{EFG}$) representing an alignment of $m$ sequences of length $n$, drawn from alphabet $Σ$ plus the special gap character: the paths spell the original sequences or their recombination. By enforcing the semi-repeat-free property, the…

    Submitted 9 March, 2023; originally announced March 2023.

    Comments: 47 pages, 10 figures. Extension of conference papers IWOCA 2022 (https://doi.org/10.1007/978-3-031-06678-8_35, preprint arXiv:2201.06492), CPM 2022 (https://doi.org/10.4230/LIPIcs.CPM.2022.19), and of some results from PhD dissertation projects of Massimo Equi (http://urn.fi/URN:ISBN:978-951-51-8217-3) and Tuukka Norri (http://urn.fi/URN:ISBN:978-951-51-8215-9)

    ACM Class: E.1; E.4; F.1.3; F.2.2

  3. arXiv:2302.01748

    cs.DS

    Chaining of Maximal Exact Matches in Graphs

    Authors: Nicola Rizzo, Manuel Cáceres, Veli Mäkinen

    Abstract: We show how to chain maximal exact matches (MEMs) between a query string $Q$ and a labeled directed acyclic graph (DAG) $G=(V,E)$ to solve the longest common subsequence (LCS) problem between $Q$ and $G$. We obtain our result via a new symmetric formulation of chaining in DAGs that we solve in $O(m+n+k^2|V| + |E| + kN\log N)$ time, where $m=|Q|$, $n$ is the total length of node labels, $k$ is the…

    Submitted 5 July, 2023; v1 submitted 3 February, 2023; originally announced February 2023.

    Comments: 17 pages, 2 figures

  4. Linear Time Construction of Indexable Elastic Founder Graphs

    Authors: Nicola Rizzo, Veli Mäkinen

    Abstract: Pattern matching on graphs has been widely studied lately due to its importance in genomics applications. Unfortunately, even the simplest problem of deciding if a string appears as a subpath of a graph admits a quadratic lower bound under the Orthogonal Vectors Hypothesis (Equi et al. ICALP 2019, SOFSEM 2021). To avoid this bottleneck, the research has shifted towards more specific graph classes,…

    Submitted 17 January, 2022; originally announced January 2022.

    Comments: 18 pages, 4 figures

    ACM Class: E.1; E.4; F.2.2

  5. arXiv:2112.13005   

    quant-ph cs.CC cs.DS

    Quantum Linear Algorithm for Edit Distance Using the Word QRAM Model

    Authors: Massimo Equi, Arianne Meijer-van de Griend, Veli Mäkinen

    Abstract: Many problems that can be solved in quadratic time have bit-parallel speed-ups with factor $w$, where $w$ is the computer word size. For example, edit distance of two strings of length $n$ can be solved in $O(n^2/w)$ time. In a reasonable classical model of computation, one can assume $w=Θ(\log n)$. There are conditional lower bounds for such problems stating that speed-ups with factor $n^ε$ for a…

    Submitted 6 February, 2023; v1 submitted 24 December, 2021; originally announced December 2021.

    Comments: An incorrect assumption invalidates the results

    MSC Class: 81P68 ACM Class: E.1; E.4; F.1.3; F.2.2

  6. arXiv:2102.12822

    cs.DS cs.CC

    Algorithms and Complexity on Indexing Founder Graphs

    Authors: Massimo Equi, Tuukka Norri, Jarno Alanko, Bastien Cazaux, Alexandru I. Tomescu, Veli Mäkinen

    Abstract: We study the problem of matching a string in a labeled graph. Previous research has shown that unless the Orthogonal Vectors Hypothesis (OVH) is false, one cannot solve this problem in strongly sub-quadratic time, nor index the graph in polynomial time to answer queries efficiently (Equi et al. ICALP 2019, SOFSEM 2021). These conditional lower-bounds cover even deterministic graphs with binary alp…

    Submitted 10 June, 2022; v1 submitted 25 February, 2021; originally announced February 2021.

    Comments: This is an extended full version of WABI 2020 paper (https://doi.org/10.4230/LIPIcs.WABI.2020.7), whose preprint is in arXiv:2005.09342, and of ISAAC 2021 paper (to appear)

    ACM Class: E.1; E.4; F.1.3; F.2.2

  7. arXiv:2006.05871

    cs.DS

    Tailoring r-index for metagenomics

    Authors: Dustin Cobas, Veli Mäkinen, Massimiliano Rossi

    Abstract: A basic problem in metagenomics is to assign a sequenced read to the correct species in the reference collection. In typical applications in genomic epidemiology and viral metagenomics the reference collection consists of a set of species with each species represented by its highly similar strains. It has been recently shown that accurate read assignment can be achieved with $k$-mer hashing-based ps…

    Submitted 10 June, 2020; originally announced June 2020.

    Comments: 17 pages, 2 figures, 1 table

  8. arXiv:2005.09342

    cs.DS

    Linear Time Construction of Indexable Founder Block Graphs

    Authors: Veli Mäkinen, Bastien Cazaux, Massimo Equi, Tuukka Norri, Alexandru I. Tomescu

    Abstract: We introduce a compact pangenome representation based on an optimal segmentation concept that aims to reconstruct founder sequences from a multiple sequence alignment (MSA). Such founder sequences have the feature that each row of the MSA is a recombination of the founders. Several linear time dynamic programming algorithms have been previously devised to optimize segmentations that induce founder…

    Submitted 19 May, 2020; originally announced May 2020.

    ACM Class: E.1; F.2.2; J.3

  9. arXiv:2002.00629

    cs.CC

    Graphs cannot be indexed in polynomial time for sub-quadratic time string matching, unless SETH fails

    Authors: Massimo Equi, Veli Mäkinen, Alexandru I. Tomescu

    Abstract: We consider the following string matching problem on a node-labeled graph $G=(V,E)$: given a pattern string $P$, decide whether there exists a path in $G$ whose concatenation of node labels equals $P$. This is a basic primitive in various problems in bioinformatics, graph databases, or networks. The hardness results of Backurs and Indyk (FOCS 2016) imply that this problem cannot be solved in bette…

    Submitted 4 March, 2020; v1 submitted 3 February, 2020; originally announced February 2020.

    ACM Class: E.1; F.1; F.2.2; G.2.2

  10. arXiv:2001.06864

    cs.DS

    Chaining with overlaps revisited

    Authors: Veli Mäkinen, Kristoffer Sahlin

    Abstract: Chaining algorithms aim to form a semi-global alignment of two sequences based on a set of anchoring local alignments as input. Depending on the optimization criteria and the exact definition of a chain, there are several $O(n \log n)$ time algorithms to solve this problem optimally, where $n$ is the number of input anchors. In this paper, we focus on a formulation allowing the anchors to overla…

    Submitted 24 April, 2020; v1 submitted 19 January, 2020; originally announced January 2020.

    Comments: Final version to appear in CPM 2020

    ACM Class: F.2.2; J.3

  11. arXiv:1902.03560

    cs.CC cs.DS

    On the Complexity of Exact Pattern Matching in Graphs: Determinism and Zig-Zag Matching

    Authors: Massimo Equi, Roberto Grossi, Alexandru I. Tomescu, Veli Mäkinen

    Abstract: Exact pattern matching in labeled graphs is the problem of searching paths of a graph $G=(V,E)$ that spell the same string as the given pattern $P[1..m]$. This basic problem can be found at the heart of more complex operations on variation graphs in computational biology, query operations in graph databases, and analysis of heterogeneous networks, where the nodes of some paths must match a sequenc…

    Submitted 10 February, 2019; originally announced February 2019.

    Comments: Further developments on our previous work: arXiv:1901.05264

    ACM Class: E.1; F.1; F.2.2; G.2.2; H.2.3; H.2.8; H.3.3; J.3

  12. arXiv:1901.05264

    cs.CC cs.DS

    On the Complexity of Exact Pattern Matching in Graphs: Binary Strings and Bounded Degree

    Authors: Massimo Equi, Roberto Grossi, Veli Mäkinen

    Abstract: Exact pattern matching in labeled graphs is the problem of searching paths of a graph $G=(V,E)$ that spell the same string as the pattern $P[1..m]$. This basic problem can be found at the heart of more complex operations on variation graphs in computational biology, of query operations in graph databases, and of analysis operations in heterogeneous networks, where the nodes of some paths must matc…

    Submitted 3 June, 2020; v1 submitted 16 January, 2019; originally announced January 2019.

    Comments: Using Lemma 12 and Lemma 13 might not be enough to prove Lemma 14. However, the proof of Lemma 14 is correct if you assume that the graph used in the reduction is a DAG. Hence, since the problem is already quadratic for a DAG and a binary alphabet, it has to be quadratic also for a general graph and a binary alphabet

    ACM Class: E.1; F.1; F.2.2; G.2.2; H.2.3; H.2.8; H.3.3; J.3

  13. arXiv:1805.05228

    cs.DS

    Assembling Omnitigs using Hidden-Order de Bruijn Graphs

    Authors: Diego Díaz-Domínguez, Djamal Belazzougui, Travis Gagie, Veli Mäkinen, Gonzalo Navarro, Simon J. Puglisi

    Abstract: De novo DNA assembly is a fundamental task in Bioinformatics, and finding Eulerian paths on de Bruijn graphs is one of the dominant approaches to it. In most of the cases, there may be no one order for the de Bruijn graph that works well for assembling all of the reads. For this reason, some de Bruijn-based assemblers try assembling on several graphs of increasing order, in turn. Boucher et al. (2…

    Submitted 14 May, 2018; originally announced May 2018.

  14. Minimum Segmentation for Pan-genomic Founder Reconstruction in Linear Time

    Authors: Tuukka Norri, Bastien Cazaux, Dmitry Kosolobov, Veli Mäkinen

    Abstract: Given a threshold $L$ and a set $\mathcal{R} = \{R_1, \ldots, R_m\}$ of $m$ haplotype sequences, each having length $n$, the minimum segmentation problem for founder reconstruction is to partition the sequences into disjoint segments $\mathcal{R}[i_1{+}1,i_2], \mathcal{R}[i_2{+}1, i_3], \ldots, \mathcal{R}[i_{r-1}{+}1, i_r]$, where $0 = i_1 < \cdots < i_r = n$ and $\mathcal{R}[i_{j-1}{+}1, i_j]$ i…

    Submitted 8 January, 2019; v1 submitted 9 May, 2018; originally announced May 2018.

    Journal ref: In Proc. WABI 2018

  15. arXiv:1705.08754

    cs.DS

    Using Minimum Path Cover to Boost Dynamic Programming on DAGs: Co-Linear Chaining Extended

    Authors: Anna Kuosmanen, Topi Paavilainen, Travis Gagie, Rayan Chikhi, Alexandru I. Tomescu, Veli Mäkinen

    Abstract: Aligning sequencing reads on graph representations of genomes is an important ingredient of pan-genomics. Such approaches typically find a set of local anchors that indicate plausible matches between substrings of a read to subpaths of the graph. These anchor matches are then combined to form a (semi-local) alignment of the complete read on a subpath. Co-linear chaining is an algorithmically rigor…

    Submitted 29 January, 2018; v1 submitted 24 May, 2017; originally announced May 2017.

    ACM Class: G.2.2; F.2.2; J.3

  16. Hardness of Covering Alignment: Phase Transition in Post-Sequence Genomics

    Authors: Romeo Rizzi, Massimo Cairo, Veli Mäkinen, Alexandru I. Tomescu, Daniel Valenzuela

    Abstract: Covering alignment problems arise from recent developments in genomics; so-called pan-genome graphs are replacing reference genomes, and advances in haplotyping enable full content of diploid genomes to be used as basis of sequence analysis. In this paper, we show that the computational complexity will change for natural extensions of alignments to pan-genome representations and to diploid genomes…

    Submitted 22 May, 2018; v1 submitted 15 November, 2016; originally announced November 2016.

    Journal ref: IEEE/ACM Trans. on Computational Biology and Bioinformatics, 30 April 2018

  17. arXiv:1609.06378

    cs.DS

    Linear-time string indexing and analysis in small space

    Authors: Djamal Belazzougui, Fabio Cunial, Juha Kärkkäinen, Veli Mäkinen

    Abstract: The field of succinct data structures has flourished over the last 16 years. Starting from the compressed suffix array (CSA) by Grossi and Vitter (STOC 2000) and the FM-index by Ferragina and Manzini (FOCS 2000), a number of generalizations and applications of string indexes based on the Burrows-Wheeler transform (BWT) have been developed, all taking an amount of space that is close to the input s…

    Submitted 20 September, 2016; originally announced September 2016.

    Comments: Journal submission (52 pages, 2 figures)

  18. arXiv:1607.04909

    cs.DS

    Fully Dynamic de Bruijn Graphs

    Authors: Djamal Belazzougui, Travis Gagie, Veli Mäkinen, Marco Previtali

    Abstract: We present a space- and time-efficient fully dynamic implementation of de Bruijn graphs, which can also support fixed-length jumbled pattern matching.

    Submitted 19 July, 2016; v1 submitted 17 July, 2016; originally announced July 2016.

    Comments: Presented at the 23rd edition of the International Symposium on String Processing and Information Retrieval (SPIRE 2016)

  19. arXiv:1508.07820

    cs.DS

    Interval scheduling maximizing minimum coverage

    Authors: Veli Mäkinen, Valeria Staneva, Alexandru Tomescu, Daniel Valenzuela

    Abstract: In the classical interval scheduling type of problems, a set of $n$ jobs, characterized by their start and end time, need to be executed by a set of machines, under various constraints. In this paper we study a new variant in which the jobs need to be assigned to at most $k$ identical machines, such that the minimum number of machines that are busy at the same time is maximized. This is relevant i…

    Submitted 30 October, 2015; v1 submitted 31 August, 2015; originally announced August 2015.

  20. arXiv:1307.7811

    q-bio.QM cs.CE cs.DS

    A Novel Combinatorial Method for Estimating Transcript Expression with RNA-Seq: Bounding the Number of Paths

    Authors: Alexandru I. Tomescu, Anna Kuosmanen, Romeo Rizzi, Veli Mäkinen

    Abstract: RNA-Seq technology offers new high-throughput ways for transcript identification and quantification based on short reads, and has recently attracted great interest. The problem is usually modeled by a weighted splicing graph whose nodes stand for exons and whose edges stand for split alignments to the exons. The task consists of finding a number of paths, together with their expression levels, whi…

    Submitted 30 July, 2013; originally announced July 2013.

    Comments: Peer-reviewed and presented as part of the 13th Workshop on Algorithms in Bioinformatics (WABI2013)

  21. arXiv:1011.3491

    cs.DS

    Pattern Kits

    Authors: Travis Gagie, Kalle Karhu, Juha Kärkkäinen, Veli Mäkinen, Leena Salmela

    Abstract: Suppose we have just performed searches in a self-index for two patterns $A$ and $B$ and now we want to search for their concatenation $A B$; how can we best make use of our previous computations? In this paper we consider this problem and, more generally, how we can store a dynamic library of patterns that we can easily manipulate in interesting ways. We give a space- and time-efficient data stru…

    Submitted 2 April, 2011; v1 submitted 15 November, 2010; originally announced November 2010.

  22. arXiv:1010.2656

    cs.DS cs.CE q-bio.QM

    Indexing Finite Language Representation of Population Genotypes

    Authors: Jouni Sirén, Niko Välimäki, Veli Mäkinen

    Abstract: With the recent advances in DNA sequencing, it is now possible to have complete genomes of individuals sequenced and assembled. This rich and focused genotype information can be used to do different population-wide studies, now for the first time directly at the whole-genome level. We propose a way to index population genotype information together with the complete genome sequence, so that one can use the ind…

    Submitted 7 September, 2011; v1 submitted 13 October, 2010; originally announced October 2010.

    Comments: This is the full version of the paper that was presented at WABI 2011. The implementation is available at http://www.cs.helsinki.fi/group/suds/gcsa/

  23. arXiv:0907.2089

    cs.DB cs.IR

    Fast In-Memory XPath Search over Compressed Text and Tree Indexes

    Authors: A. Arroyuelo, F. Claude, S. Maneth, V. Mäkinen, G. Navarro, K. Nguyen, J. Siren, N. Välimäki

    Abstract: A large fraction of an XML document typically consists of text data. The XPath query language allows text search via the equal, contains, and starts-with predicates. Such predicates can efficiently be implemented using a compressed self-index of the document's text nodes. Most queries, however, contain some parts of querying the text of the document, plus some parts of querying the tree structure.…

    Submitted 5 October, 2011; v1 submitted 12 July, 2009; originally announced July 2009.