Showing 1–19 of 19 results for author: Karkkainen, J

Searching in archive cs.

  1. arXiv:1911.06985  [pdf, ps, other]

    cs.DS

    Constructing the Bijective and the Extended Burrows-Wheeler Transform in Linear Time

    Authors: Hideo Bannai, Juha Kärkkäinen, Dominik Köppl, Marcin Piątkowski

    Abstract: The Burrows-Wheeler transform (BWT) is a permutation whose applications are prevalent in data compression and text indexing. The bijective BWT (BBWT) is a bijective variant of it. Although it is known that the BWT can be constructed in linear time for integer alphabets by using a linear time suffix array construction algorithm, it was up to now only conjectured that the BBWT can also be constructe…

    Submitted 22 April, 2021; v1 submitted 16 November, 2019; originally announced November 2019.

  2. arXiv:1711.02910  [pdf, ps, other]

    cs.DS

    Run Compressed Rank/Select for Large Alphabets

    Authors: José Fuentes-Sepúlveda, Juha Kärkkäinen, Dmitry Kosolobov, Simon J. Puglisi

    Abstract: Given a string of length $n$ that is composed of $r$ runs of letters from the alphabet $\{0,1,\ldots,σ{-}1\}$ such that $2 \le σ\le r$, we describe a data structure that, provided $r \le n / \log^{ω(1)} n$, stores the string in $r\log\frac{nσ}{r} + o(r\log\frac{nσ}{r})$ bits and supports select and access queries in $O(\log\frac{\log(n/r)}{\log\log n})$ time and rank queries in…

    Submitted 26 February, 2018; v1 submitted 8 November, 2017; originally announced November 2017.

    Comments: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941. 10 pages, 1 figure, 4 tables; published in DCC'2018

  3. On the Size of Lempel-Ziv and Lyndon Factorizations

    Authors: Juha Kärkkäinen, Dominik Kempa, Yuto Nakashima, Simon J. Puglisi, Arseny M. Shur

    Abstract: Lyndon factorization and Lempel-Ziv (LZ) factorization are both important tools for analysing the structure and complexity of strings, but their combinatorial structure is very different. In this paper, we establish the first direct connection between the two by showing that while the Lyndon factorization can be bigger than the non-overlapping LZ factorization (which we demonstrate by describing a…

    Submitted 27 November, 2016; originally announced November 2016.

    Comments: 12 pages

  4. arXiv:1609.06378  [pdf, ps, other]

    cs.DS

    Linear-time string indexing and analysis in small space

    Authors: Djamal Belazzougui, Fabio Cunial, Juha Kärkkäinen, Veli Mäkinen

    Abstract: The field of succinct data structures has flourished over the last 16 years. Starting from the compressed suffix array (CSA) by Grossi and Vitter (STOC 2000) and the FM-index by Ferragina and Manzini (FOCS 2000), a number of generalizations and applications of string indexes based on the Burrows-Wheeler transform (BWT) have been developed, all taking an amount of space that is close to the input s…

    Submitted 20 September, 2016; originally announced September 2016.

    Comments: Journal submission (52 pages, 2 figures)

  5. arXiv:1606.04573  [pdf, ps, other]

    cs.DS

    String Inference from the LCP Array

    Authors: Juha Kärkkäinen, Marcin Piątkowski, Simon J. Puglisi

    Abstract: The suffix array, perhaps the most important data structure in modern string processing, is often augmented with the longest common prefix (LCP) array which stores the lengths of the LCPs for lexicographically adjacent suffixes of a string. Together the two arrays are roughly equivalent to the suffix tree with the LCP array representing the tree shape. In order to better understand the combinato…

    Submitted 23 February, 2017; v1 submitted 14 June, 2016; originally announced June 2016.

    Comments: Added algorithm for general alphabets

    ACM Class: F.2.2; G.2.1; G.2.2

  6. arXiv:1605.09362  [pdf, other]

    cs.IR

    Document Retrieval on Repetitive String Collections

    Authors: Travis Gagie, Aleksi Hartikainen, Kalle Karhu, Juha Kärkkäinen, Gonzalo Navarro, Simon J. Puglisi, Jouni Sirén

    Abstract: Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient docum…

    Submitted 18 May, 2017; v1 submitted 30 May, 2016; originally announced May 2016.

    Comments: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941. Accepted to the Information Retrieval Journal

  7. Lempel-Ziv Decoding in External Memory

    Authors: Djamal Belazzougui, Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi

    Abstract: Simple and fast decoding is one of the main advantages of LZ77-type text encoding used in many popular file compressors such as gzip and 7zip. With the recent introduction of external memory algorithms for Lempel-Ziv factorization there is a need for external memory LZ77 decoding but the standard algorithm makes random accesses to the text and cannot be trivially modified for external memory compu…

    Submitted 31 January, 2016; originally announced February 2016.

  8. Diverse Palindromic Factorization is NP-Complete

    Authors: Hideo Bannai, Travis Gagie, Shunsuke Inenaga, Juha Karkkainen, Dominik Kempa, Marcin Piatkowski, Simon J. Puglisi, Shiho Sugimoto

    Abstract: We prove that it is NP-complete to decide whether a given string can be factored into palindromes that are each unique in the factorization.

    Submitted 16 February, 2017; v1 submitted 13 March, 2015; originally announced March 2015.

  9. arXiv:1412.0967  [pdf, other]

    cs.DS

    Queries on LZ-Bounded Encodings

    Authors: Djamal Belazzougui, Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Alberto Ordóñez, Simon J. Puglisi, Yasuo Tabei

    Abstract: We describe a data structure that stores a string $S$ in space similar to that of its Lempel-Ziv encoding and efficiently supports access, rank and select queries. These queries are fundamental for implementing succinct and compressed data structures, such as compressed trees and graphs. We show that our data structure can be built in a scalable manner and is both small and fast in practice compar…

    Submitted 2 December, 2014; originally announced December 2014.

  10. arXiv:1409.6780  [pdf, other]

    cs.DS

    Document Counting in Practice

    Authors: Travis Gagie, Aleksi Hartikainen, Juha Kärkkäinen, Gonzalo Navarro, Simon J. Puglisi, Jouni Sirén

    Abstract: We address the problem of counting the number of strings in a collection where a given pattern appears, which has applications in information retrieval and data mining. Existing solutions are in a theoretical stage. We implement these solutions and develop some new variants, comparing them experimentally on various datasets. Our results not only show which are the best options for each situation a…

    Submitted 1 October, 2015; v1 submitted 23 September, 2014; originally announced September 2014.

    Comments: This is a slightly extended version of the paper that was presented at DCC 2015. The implementations are available at http://jltsiren.kapsi.fi/rlcsa and https://github.com/ahartik/succinct

  11. A Subquadratic Algorithm for Minimum Palindromic Factorization

    Authors: Gabriele Fici, Travis Gagie, Juha Kärkkäinen, Dominik Kempa

    Abstract: We give an $\mathcal{O}(n \log n)$-time, $\mathcal{O}(n)$-space algorithm for factoring a string into the minimum number of palindromic substrings. That is, given a string $S [1..n]$, in $\mathcal{O}(n \log n)$ time our algorithm returns the minimum number of palindromes $S_1,\ldots, S_\ell$ such that $S = S_1 \cdots S_\ell$. We also show that the time complexity is $\mathcal{O}(n)$ on average and…

    Submitted 7 August, 2014; v1 submitted 10 March, 2014; originally announced March 2014.

    Comments: Accepted for publication in Journal of Discrete Algorithms

  12. Lempel-Ziv Parsing in External Memory

    Authors: Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi

    Abstract: For decades, computing the LZ factorization (or LZ77 parsing) of a string has been a requisite and computationally intensive step in many diverse applications, including text indexing and data compression. Many algorithms for LZ77 parsing have been discovered over the years; however, despite the increasing need to apply LZ77 to massive data sets, no algorithm to date scales to inputs that exceed t…

    Submitted 4 July, 2013; originally announced July 2013.

    Comments: 10 pages

  13. Lightweight Lempel-Ziv Parsing

    Authors: Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi

    Abstract: We introduce a new approach to LZ77 factorization that uses O(n/d) words of working space and O(dn) time for any d >= 1 (for polylogarithmic alphabet sizes). We also describe carefully engineered implementations of alternative approaches to lightweight LZ77 factorization. Extensive experiments show that the new algorithm is superior in most cases, particularly at the lowest memory levels and for h…

    Submitted 6 February, 2013; v1 submitted 5 February, 2013; originally announced February 2013.

    Comments: 12 pages

  14. Linear Time Lempel-Ziv Factorization: Simple, Fast, Small

    Authors: Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi

    Abstract: Computing the LZ factorization (or LZ77 parsing) of a string is a computational bottleneck in many diverse applications, including data compression, text indexing, and pattern discovery. We describe new linear time LZ factorization algorithms, some of which require only 2n log n + O(log n) bits of working space to factorize a string of length n. These are the most space efficient linear time algor…

    Submitted 12 December, 2012; originally announced December 2012.

  15. arXiv:1111.1355  [pdf, ps, other]

    cs.DS

    A Compressed Self-Index for Genomic Databases

    Authors: Travis Gagie, Juha Kärkkäinen, Yakov Nekrich, Simon J. Puglisi

    Abstract: Advances in DNA sequencing technology will soon result in databases of thousands of genomes. Within a species, individuals' genomes are almost exact copies of each other; e.g., any two human genomes are 99.9% the same. Relative Lempel-Ziv (RLZ) compression takes advantage of this property: it stores the first genome uncompressed or as an FM-index, then compresses the other genomes with a variant o…

    Submitted 5 November, 2011; originally announced November 2011.

  16. arXiv:1109.3954  [pdf, other]

    cs.DS

    A Faster Grammar-Based Self-Index

    Authors: Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, Simon J. Puglisi

    Abstract: To store and search genomic databases efficiently, researchers have recently started building compressed self-indexes based on grammars. In this paper we show how, given a straight-line program with $r$ rules for a string $S [1..n]$ whose LZ77 parse consists of $z$ phrases, we can store a self-index for $S$ in $O(r + z \log \log n)$ space such that, given a pattern $P [1..m]$, we can list the…

    Submitted 26 September, 2012; v1 submitted 19 September, 2011; originally announced September 2011.

    Comments: journal version of LATA '12 paper

  17. arXiv:1104.3810  [pdf, ps, other]

    cs.DS cs.IR

    Fixed Block Compression Boosting in FM-Indexes

    Authors: Juha Kärkkäinen, Simon J. Puglisi

    Abstract: A compressed full-text self-index occupies space close to that of the compressed text and simultaneously allows fast pattern matching and random access to the underlying text. Among the best compressed self-indexes, in theory and in practice, are several members of the FM-index family. In this paper, we describe new FM-index variants that combine nice theoretical properties, simple implementation…

    Submitted 19 April, 2011; originally announced April 2011.

  18. arXiv:1011.3491  [pdf, other]

    cs.DS

    Pattern Kits

    Authors: Travis Gagie, Kalle Karhu, Juha Kärkkäinen, Veli Mäkinen, Leena Salmela

    Abstract: Suppose we have just performed searches in a self-index for two patterns $A$ and $B$ and now we want to search for their concatenation $A B$; how can we best make use of our previous computations? In this paper we consider this problem and, more generally, how we can store a dynamic library of patterns that we can easily manipulate in interesting ways. We give a space- and time-efficient data stru…

    Submitted 2 April, 2011; v1 submitted 15 November, 2010; originally announced November 2010.

  19. arXiv:1011.3480  [pdf, ps, other]

    cs.DS

    Counting Colours in Compressed Strings

    Authors: Travis Gagie, Juha Kärkkäinen

    Abstract: Suppose we are asked to preprocess a string $s [1..n]$ such that later, given a substring's endpoints, we can quickly count how many distinct characters it contains. In this paper we give a data structure for this problem that takes $n H_0 (s) + O(n) + o(n H_0 (s))$ bits, where $H_0 (s)$ is the 0th-order empirical entropy of $s$, and answers queries in $O(\log^{1 + ε} n)$ time for any…

    Submitted 15 November, 2010; originally announced November 2010.