Search | arXiv e-print repository

Computing Minimal Absent Words and Extended Bispecial Factors with CDAWG Space

Authors: Shunsuke Inenaga, Takuya Mieno, Hiroki Arimura, Mitsuru Funakoshi, Yuta Fujishige

Abstract: A string $w$ is said to be a minimal absent word (MAW) for a string $S$ if $w$ does not occur in $S$ and any proper substring of $w$ occurs in $S$. We focus on non-trivial MAWs which are of length at least 2. Finding such non-trivial MAWs for a given string is motivated for applications in bioinformatics and data compression. Fujishige et al. [TCS 2023] proposed a data structure of size $Θ(n)$ tha… ▽ More A string $w$ is said to be a minimal absent word (MAW) for a string $S$ if $w$ does not occur in $S$ and any proper substring of $w$ occurs in $S$. We focus on non-trivial MAWs which are of length at least 2. Finding such non-trivial MAWs for a given string is motivated for applications in bioinformatics and data compression. Fujishige et al. [TCS 2023] proposed a data structure of size $Θ(n)$ that can output the set $\mathsf{MAW}(S)$ of all MAWs for a given string $S$ of length $n$ in $O(n + |\mathsf{MAW}(S)|)$ time, based on the directed acyclic word graph (DAWG). In this paper, we present a more space efficient data structure based on the compact DAWG (CDAWG), which can output $\mathsf{MAW}(S)$ in $O(|\mathsf{MAW}(S)|)$ time with $O(\mathsf{e}_\min)$ space, where $\mathsf{e}_\min$ denotes the minimum of the sizes of the CDAWGs for $S$ and for its reversal $S^R$. For any strings of length $n$, it holds that $\mathsf{e}_\min < 2n$, and for highly repetitive strings $\mathsf{e}_\min$ can be sublinear (up to logarithmic) in $n$. We also show that MAWs and their generalization minimal rare words have close relationships with extended bispecial factors, via the CDAWG. △ Less

Submitted 19 May, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

Comments: Accepted for IWOCA 2024

arXiv:2310.06446 [pdf, other]

Rule Mining for Correcting Classification Models

Authors: Hirofumi Suzuki, Hiroaki Iwashita, Takuya Takagi, Yuta Fujishige, Satoshi Hara

Abstract: Machine learning models need to be continually updated or corrected to ensure that the prediction accuracy remains consistently high. In this study, we consider scenarios where developers should be careful to change the prediction results by the model correction, such as when the model is part of a complex system or software. In such scenarios, the developers want to control the specification of t… ▽ More Machine learning models need to be continually updated or corrected to ensure that the prediction accuracy remains consistently high. In this study, we consider scenarios where developers should be careful to change the prediction results by the model correction, such as when the model is part of a complex system or software. In such scenarios, the developers want to control the specification of the corrections. To achieve this, the developers need to understand which subpopulations of the inputs get inaccurate predictions by the model. Therefore, we propose correction rule mining to acquire a comprehensive list of rules that describe inaccurate subpopulations and how to correct them. We also develop an efficient correction rule mining algorithm that is a combination of frequent itemset mining and a unique pruning technique for correction rules. We observed that the proposed algorithm found various rules which help to collect data insufficiently learned, directly correct model outputs, and analyze concept drift. △ Less

Submitted 14 October, 2023; v1 submitted 10 October, 2023; originally announced October 2023.

arXiv:2307.01428 [pdf, other]

Linear-time Computation of DAWGs, Symmetric Indexing Structures, and MAWs for Integer Alphabets

Authors: Yuta Fujishige, Yuki Tsujimaru, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda

Abstract: The directed acyclic word graph (DAWG) of a string $y$ of length $n$ is the smallest (partial) DFA which recognizes all suffixes of $y$ with only $O(n)$ nodes and edges. In this paper, we show how to construct the DAWG for the input string $y$ from the suffix tree for $y$, in $O(n)$ time for integer alphabets of polynomial size in $n$. In so doing, we first describe a folklore algorithm which, giv… ▽ More The directed acyclic word graph (DAWG) of a string $y$ of length $n$ is the smallest (partial) DFA which recognizes all suffixes of $y$ with only $O(n)$ nodes and edges. In this paper, we show how to construct the DAWG for the input string $y$ from the suffix tree for $y$, in $O(n)$ time for integer alphabets of polynomial size in $n$. In so doing, we first describe a folklore algorithm which, given the suffix tree for $y$, constructs the DAWG for the reversed string of $y$ in $O(n)$ time. Then, we present our algorithm that builds the DAWG for $y$ in $O(n)$ time for integer alphabets, from the suffix tree for $y$. We also show that a straightforward modification to our DAWG construction algorithm leads to the first $O(n)$-time algorithm for constructing the affix tree of a given string $y$ over an integer alphabet. Affix trees are a text indexing structure supporting bidirectional pattern searches. We then discuss how our constructions can lead to linear-time algorithms for building other text indexing structures, such as linear-size suffix tries and symmetric CDAWGs in linear time in the case of integer alphabets. As a further application to our $O(n)$-time DAWG construction algorithm, we show that the set $\mathsf{MAW}(y)$ of all minimal absent words (MAWs) of $y$ can be computed in optimal, input- and output-sensitive $O(n + |\mathsf{MAW}(y)|)$ time and $O(n)$ working space for integer alphabets. △ Less

Submitted 3 July, 2023; originally announced July 2023.

Comments: This is an extended version of the paper "Computing DAWGs and Minimal Absent Words in Linear Time for Integer Alphabets" from MFCS 2016

arXiv:1909.02804 [pdf, ps, other]

Minimal Unique Substrings and Minimal Absent Words in a Sliding Window

Authors: Takuya Mieno, Yuki Kuhara, Tooru Akagi, Yuta Fujishige, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda

Abstract: A substring $u$ of a string $T$ is called a minimal unique substring (MUS) of $T$ if $u$ occurs exactly once in $T$ and any proper substring of $u$ occurs at least twice in $T$. A string $w$ is called a minimal absent word (MAW) of $T$ if $w$ does not occur in $T$ and any proper substring of $w$ occurs in $T$. In this paper, we study the problems of computing MUSs and MAWs in a sliding window over… ▽ More A substring $u$ of a string $T$ is called a minimal unique substring (MUS) of $T$ if $u$ occurs exactly once in $T$ and any proper substring of $u$ occurs at least twice in $T$. A string $w$ is called a minimal absent word (MAW) of $T$ if $w$ does not occur in $T$ and any proper substring of $w$ occurs in $T$. In this paper, we study the problems of computing MUSs and MAWs in a sliding window over a given string $T$. We first show how the set of MUSs can change in a sliding window over $T$, and present an $O(n\logσ)$-time and $O(d)$-space algorithm to compute MUSs in a sliding window of width $d$ over $T$, where $σ$ is the maximum number of distinct characters in every window. We then give tight upper and lower bounds on the maximum number of changes in the set of MAWs in a sliding window over $T$. Our bounds improve on the previous results in [Crochemore et al., 2017]. △ Less

Submitted 13 September, 2019; v1 submitted 6 September, 2019; originally announced September 2019.

arXiv:1705.09779 [pdf, ps, other]

Linear-size CDAWG: new repetition-aware indexing and grammar compression

Authors: Takuya Takagi, Keisuke Goto, Yuta Fujishige, Shunsuke Inenaga, Hiroki Arimura

Abstract: In this paper, we propose a novel approach to combine \emph{compact directed acyclic word graphs} (CDAWGs) and grammar-based compression. This leads us to an efficient self-index, called Linear-size CDAWGs (L-CDAWGs), which can be represented with $O(\tilde e_T \log n)$ bits of space allowing for $O(\log n)$-time random and $O(1)$-time sequential accesses to edge labels, and $O(m \log σ+ occ)$-tim… ▽ More In this paper, we propose a novel approach to combine \emph{compact directed acyclic word graphs} (CDAWGs) and grammar-based compression. This leads us to an efficient self-index, called Linear-size CDAWGs (L-CDAWGs), which can be represented with $O(\tilde e_T \log n)$ bits of space allowing for $O(\log n)$-time random and $O(1)$-time sequential accesses to edge labels, and $O(m \log σ+ occ)$-time pattern matching. Here, $\tilde e_T$ is the number of all extensions of maximal repeats in $T$, $n$ and $m$ are respectively the lengths of the text $T$ and a given pattern, $σ$ is the alphabet size, and $occ$ is the number of occurrences of the pattern in $T$. The repetitiveness measure $\tilde e_T$ is known to be much smaller than the text length $n$ for highly repetitive text. For constant alphabets, our L-CDAWGs achieve $O(m + occ)$ pattern matching time with $O(e_T^r \log n)$ bits of space, which improves the pattern matching time of Belazzougui et al.'s run-length BWT-CDAWGs by a factor of $\log \log n$, with the same space complexity. Here, $e_T^r$ is the number of right extensions of maximal repeats in $T$. As a byproduct, our result gives a way of constructing an SLP of size $O(\tilde e_T)$ for a given text $T$ in $O(n + \tilde e_T \log σ)$ time. △ Less

Submitted 27 July, 2017; v1 submitted 27 May, 2017; originally announced May 2017.

Comments: 12 pages, 2 figures

arXiv:1703.04954 [pdf, other]

Faster STR-IC-LCS computation via RLE

Authors: Keita Kuboi, Yuta Fujishige, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda

Abstract: The constrained LCS problem asks one to find a longest common subsequence of two input strings $A$ and $B$ with some constraints. The STR-IC-LCS problem is a variant of the constrained LCS problem, where the solution must include a given constraint string $C$ as a substring. Given two strings $A$ and $B$ of respective lengths $M$ and $N$, and a constraint string $C$ of length at most… ▽ More The constrained LCS problem asks one to find a longest common subsequence of two input strings $A$ and $B$ with some constraints. The STR-IC-LCS problem is a variant of the constrained LCS problem, where the solution must include a given constraint string $C$ as a substring. Given two strings $A$ and $B$ of respective lengths $M$ and $N$, and a constraint string $C$ of length at most $\min\{M, N\}$, the best known algorithm for the STR-IC-LCS problem, proposed by Deorowicz~({\em Inf. Process. Lett.}, 11:423--426, 2012), runs in $O(MN)$ time. In this work, we present an $O(mN + nM)$-time solution to the STR-IC-LCS problem, where $m$ and $n$ denote the sizes of the run-length encodings of $A$ and $B$, respectively. Since $m \leq M$ and $n \leq N$ always hold, our algorithm is always as fast as Deorowicz's algorithm, and is faster when input strings are compressible via RLE. △ Less

Submitted 15 March, 2017; originally announced March 2017.

Showing 1–6 of 6 results for author: Fujishige, Y