-
A parallel parser for regular expressions
Authors:
Angelo Borsotti,
Luca Breveglieri,
Stefano Crespi Reghizzi,
Angelo Morzenti
Abstract:
Regular expression (RE) matching is a very common functionality that scans a text to find occurrences of patterns specified by an RE; it includes the simpler function of RE recognition. Here we address RE parsing, which subsumes matching by providing not just the pattern positions in the text, but also the syntactic structure of each pattern occurrence, in the form of a tree representing how the RE operators produced the patterns. RE parsing increases the selectivity of matching while avoiding the complications of context-free grammar parsers. Our parser manages ambiguous REs and texts by returning the set of all syntax trees, compressed into a Shared-Packed-Parse-Forest data structure. We initially convert the RE into a serial parser, which simulates a finite automaton (FA) so that the states the automaton passes through encode the syntax tree of the input. On long texts, serial matching and parsing may be too slow for time-constrained applications. Therefore, we present a novel efficient parallel parser for multi-processor computing platforms; its speed-up over the serial algorithm scales well with the text length. We innovatively apply to RE parsing the approach typical of parallel RE matchers and recognizers, where the text is split into chunks that are parsed in parallel and then joined together. Such an approach suffers from the so-called speculation overhead: a chunk processor does not know the state reached at the end of the preceding chunk, which forces it to speculatively start in all its states. We introduce a novel technique that minimizes the speculation overhead. The multi-threaded parser program, written in Java, has been validated and its performance has been measured on a commodity multi-core computer, using public and synthetic RE benchmarks. The speed-up over serial parsing, parsing times, and parser construction times are reported.
Submitted 9 March, 2025;
originally announced March 2025.
-
Minimizing speculation overhead in a parallel recognizer for regular texts
Authors:
Angelo Borsotti,
Luca Breveglieri,
Stefano Crespi Reghizzi,
Angelo Morzenti
Abstract:
Speculative data-parallel algorithms for language recognition have been widely tested on various types of finite-state automata (FA), deterministic (DFA) and nondeterministic (NFA), often derived from regular expressions (RE). Such an algorithm cuts the input string into chunks, independently recognizes each chunk in parallel by means of identical FAs, and finally joins the chunk results and checks overall consistency. In chunk recognition, it is necessary to speculatively start the FAs in any state, thus causing an overhead that reduces the speedup compared to a serial algorithm. Existing data-parallel DFA-based recognizers suffer from the excessive number of starting states, and the NFA-based ones suffer from the number of nondeterministic transitions. Our data-parallel algorithm is based on a new FA type called reduced interface DFA (RI-DFA), which minimizes the speculation overhead without incurring the penalty of nondeterministic transitions or of impractically enlarged DFA machines. The algorithm is proved to be correct and theoretically efficient, because it combines the state reduction of an NFA with the speed of deterministic transitions, thus improving on both DFA-based and NFA-based existing implementations. The practical applicability of the RI-DFA approach is confirmed by a quantitative comparison of the number of starting states for a large public benchmark of complex FAs. On multi-core computing architectures, the RI-DFA recognizer is much faster than the NFA-based one on all benchmarks, while it matches the DFA-based one on some benchmarks and performs much better on some others. The extra time cost needed to construct an RI-DFA compared to a DFA is moderate and is compatible with practical use.
Submitted 10 January, 2025; v1 submitted 19 December, 2024;
originally announced December 2024.
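The chunk-and-join scheme described in the abstract above can be illustrated with a minimal Java sketch. It shows only the naive DFA-based baseline, in which every chunk is speculatively run from every state; it is not the RI-DFA construction, which the abstract does not detail. The toy automaton (texts over {a, b} that contain "ab"), the class name `SpeculativeDfaSketch`, and all method names are assumptions made for illustration.

```java
import java.util.stream.IntStream;

public class SpeculativeDfaSketch {
    // Hypothetical toy DFA over {a, b} that accepts texts containing "ab".
    // DELTA[state][symbol], with symbol 0 = 'a' and symbol 1 = 'b'.
    static final int[][] DELTA = { {1, 0}, {1, 2}, {2, 2} };
    static final int START = 0;
    static final int ACCEPT = 2;

    static int step(int state, char c) {
        if (c != 'a' && c != 'b') {
            throw new IllegalArgumentException("symbol outside {a, b}: " + c);
        }
        return DELTA[state][c == 'a' ? 0 : 1];
    }

    // Serial baseline: a single pass from the start state.
    static boolean recognizeSerial(String text) {
        int state = START;
        for (int i = 0; i < text.length(); i++) {
            state = step(state, text.charAt(i));
        }
        return state == ACCEPT;
    }

    // Speculative phase for one chunk: the preceding chunk's end state is
    // unknown, so the chunk is run from every state. The result maps each
    // possible start state to the state reached at the end of the chunk.
    static int[] runChunk(String text, int from, int to) {
        int[] end = new int[DELTA.length];
        for (int s = 0; s < DELTA.length; s++) {
            int state = s;
            for (int i = from; i < to; i++) {
                state = step(state, text.charAt(i));
            }
            end[s] = state;
        }
        return end;
    }

    // Chunks are processed in parallel, then joined serially by following
    // the per-chunk maps starting from the true start state.
    static boolean recognizeParallel(String text, int chunks) {
        int n = text.length();
        int k = Math.max(1, Math.min(chunks, n));
        int[][] maps = IntStream.range(0, k).parallel()
            .mapToObj(c -> runChunk(text,
                (int) ((long) c * n / k),
                (int) ((long) (c + 1) * n / k)))
            .toArray(int[][]::new);
        int state = START;
        for (int[] m : maps) {
            state = m[state];
        }
        return state == ACCEPT;
    }
}
```

In this baseline each chunk costs one pass per DFA state, which is the speculation overhead the abstract refers to; reducing the number of starting states lowers that cost directly.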
-
Parsing methods streamlined
Authors:
Luca Breveglieri,
Stefano Crespi Reghizzi,
Angelo Morzenti
Abstract:
This paper has the goals (1) of unifying top-down parsing with shift-reduce parsing to yield a single simple and consistent framework, and (2) of producing provably correct parsing methods, deterministic as well as tabular ones, for extended context-free grammars (EBNF) represented as state-transition networks. Departing from the traditional way of presenting the deterministic bottom-up LR(1), the top-down LL(1) and the general tabular (Earley) parsers as independent algorithms, we unify them in a coherent minimalist framework. We present a simple general construction method for EBNF ELR(1) parsers, where the new category of convergence conflicts is added to the classical shift-reduce and reduce-reduce conflicts; we prove its correctness and show two implementations, by deterministic push-down machines and by vector-stack machines, the latter to be also used for Earley parsers. Then Beatty's theoretical characterization of LL(1) grammars is adapted to derive the extended ELL(1) parsing method, first by minimizing the ELR(1) parser and then by simplifying its state information. Using the same notations as in the ELR(1) case, the extended Earley parser is obtained. Since all the parsers operate on compatible representations, it is feasible to combine them into mixed-mode algorithms.
Submitted 29 September, 2013;
originally announced September 2013.