-
RePair Grammars are the Smallest Grammars for Fibonacci Words
Authors:
Takuya Mieno,
Shunsuke Inenaga,
Takashi Horiyama
Abstract:
Grammar-based compression is a lossless data compression scheme that represents a given string $w$ by a context-free grammar that generates only $w$. While computing the smallest grammar that generates a given string $w$ is NP-hard in general, a number of polynomial-time grammar-based compressors that work well in practice have been proposed. RePair, proposed by Larsson and Moffat in 1999, is a grammar-based compressor that recursively replaces all possible occurrences of a most frequently occurring bigram in the string. Since there can be multiple choices of the most frequent bigram to replace, different implementations of RePair can result in different grammars. In this paper, we show that the smallest grammars generating the Fibonacci words $F_k$ can be completely characterized by RePair, where $F_k$ denotes the $k$-th Fibonacci word. Namely, all grammars for $F_k$ generated by any implementation of RePair are smallest grammars for $F_k$, and no other grammar can be smallest for $F_k$. To the best of our knowledge, Fibonacci words are the first non-trivial infinite family of strings for which RePair is optimal.
Submitted 14 April, 2022; v1 submitted 16 February, 2022;
originally announced February 2022.
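The replacement loop described in the abstract can be sketched as follows. This is a minimal, quadratic-time illustration, not the authors' implementation: the function names, the nonterminal names `X1`, `X2`, ..., the indexing convention $F_1 = \mathtt{b}$, $F_2 = \mathtt{a}$, and the tie-breaking rule (first most frequent bigram found) are all assumptions made here for illustration.

```python
from collections import Counter


def fibonacci_word(k):
    """Return F_k, assuming F_1 = 'b', F_2 = 'a', F_k = F_{k-1} F_{k-2}."""
    prev, cur = "b", "a"
    if k == 1:
        return prev
    for _ in range(k - 2):
        prev, cur = cur, cur + prev
    return cur


def count_bigrams(seq):
    """Count non-overlapping occurrences of each bigram, scanning left to right."""
    counts = Counter()
    last_start = {}  # bigram -> start index of its last counted occurrence
    for i in range(len(seq) - 1):
        bigram = (seq[i], seq[i + 1])
        # Only bigrams of the form (x, x) can overlap themselves.
        if last_start.get(bigram, -2) == i - 1:
            continue
        counts[bigram] += 1
        last_start[bigram] = i
    return counts


def repair(text):
    """Replace a most frequent bigram until none occurs at least twice."""
    seq = list(text)
    rules = {}
    while True:
        counts = count_bigrams(seq)
        if not counts:
            break
        # Ties are broken arbitrarily (first maximum); other choices are valid too.
        bigram, freq = max(counts.items(), key=lambda kv: kv[1])
        if freq < 2:
            break
        name = f"X{len(rules) + 1}"  # hypothetical nonterminal name
        rules[name] = bigram
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == bigram:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return rules, seq


def expand(symbol, rules):
    """Expand a grammar symbol back into the terminal string it derives."""
    if symbol not in rules:
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)
```

Calling `repair(fibonacci_word(k))` returns the rules and the final start sequence; joining `expand(s, rules)` over that sequence reproduces the input, which is a convenient sanity check for any variant of the tie-breaking rule.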
-
Counting Lyndon Subsequences
Authors:
Ryo Hirakawa,
Yuto Nakashima,
Shunsuke Inenaga,
Masayuki Takeda
Abstract:
Counting substrings/subsequences that preserve some property (e.g., palindromes, squares) is a topic of considerable mathematical interest in stringology. Recently, Glen et al. studied the number of Lyndon factors in a string. A string $w = uv$ is called a Lyndon word if it is the lexicographically smallest among all of its conjugates $vu$. In this paper, we consider a more general problem, "counting Lyndon subsequences". We determine (1) the maximum total number of Lyndon subsequences in a string, (2) the expected total number of Lyndon subsequences in a string, and (3) the expected number of distinct Lyndon subsequences in a string.
Submitted 13 July, 2021; v1 submitted 2 June, 2021;
originally announced June 2021.
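To make the counted quantities concrete, here is a brute-force sketch that enumerates every subsequence of a short string and tests the Lyndon property directly. It is exponential in the string length and is only meant to illustrate the definitions, not the paper's results. The function names are hypothetical, and the test uses the usual strict convention (a Lyndon word is strictly smaller than every proper rotation), which is an assumption about how the abstract's "smallest" should be read.

```python
from itertools import combinations


def is_lyndon(w):
    """True if w is non-empty and strictly smaller than all its proper rotations."""
    return len(w) > 0 and all(w < w[i:] + w[:i] for i in range(1, len(w)))


def lyndon_subsequences(s):
    """Return (total count by position set, set of distinct Lyndon subsequences)."""
    total = 0
    distinct = set()
    for length in range(1, len(s) + 1):
        for positions in combinations(range(len(s)), length):
            sub = "".join(s[i] for i in positions)
            if is_lyndon(sub):
                total += 1          # counted once per choice of positions
                distinct.add(sub)   # counted once per distinct string
    return total, distinct
```

For example, `lyndon_subsequences("aab")` counts `a` twice and `ab` twice in the total (they arise from different positions) but only once each in the distinct set.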
-
Combinatorics of minimal absent words for a sliding window
Authors:
Tooru Akagi,
Yuki Kuhara,
Takuya Mieno,
Yuto Nakashima,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda
Abstract:
A string $w$ is called a minimal absent word (MAW) for another string $T$ if $w$ does not occur in $T$ but all proper substrings of $w$ occur in $T$. For example, let $\Sigma = \{\mathtt{a, b, c}\}$ be the alphabet. Then, the set of MAWs for string $T = \mathtt{abaab}$ is $\{\mathtt{aaa, aaba, bab, bb, c}\}$. In this paper, we study combinatorial properties of MAWs in the sliding window model, namely, how the set of MAWs changes when a sliding window of fixed length $d$ is shifted over the input string $T$ of length $n$, where $1 \leq d < n$. We present tight upper and lower bounds on the maximum number of changes in the set of MAWs for a sliding window over $T$, both in the cases of general alphabets and binary alphabets. Our bounds improve on the previously known best bounds [Crochemore et al., 2020].
Submitted 15 April, 2022; v1 submitted 18 May, 2021;
originally announced May 2021.
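The MAW definition and the example in the abstract can be checked with a small brute-force sketch. It relies on the observation that every proper substring of a word lies inside either its longest proper prefix or its longest proper suffix, so it suffices to test those two. The function names are hypothetical, and the approach is a naive illustration rather than the technique used in the paper.

```python
def substrings(t):
    """All substrings of t, including the empty string."""
    return {t[i:j] for i in range(len(t) + 1) for j in range(i, len(t) + 1)}


def minimal_absent_words(t, alphabet):
    """Words absent from t whose longest proper prefix and suffix both occur in t."""
    present = substrings(t)
    maws = set()
    for prefix in present:            # longest proper prefix must occur in t
        for c in alphabet:
            w = prefix + c
            if w not in present and w[1:] in present:
                maws.add(w)
    return maws


def window_changes(t, d, alphabet):
    """Size of the symmetric difference of MAW sets for consecutive windows of length d."""
    return [
        len(minimal_absent_words(t[i:i + d], alphabet)
            ^ minimal_absent_words(t[i + 1:i + 1 + d], alphabet))
        for i in range(len(t) - d)
    ]
```

With the alphabet `"abc"`, `minimal_absent_words("abaab", "abc")` yields the five words listed in the abstract, and `window_changes` gives a direct (if slow) way to observe how the set changes as the window slides.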