-
A parallel parser for regular expressions
Authors:
Angelo Borsotti,
Luca Breveglieri,
Stefano Crespi Reghizzi,
Angelo Morzenti
Abstract:
Regular expression (RE) matching is a very common functionality that scans a text to find occurrences of patterns specified by an RE; it includes the simpler function of RE recognition. Here we address RE parsing, which subsumes matching by providing not just the pattern positions in the text, but also the syntactic structure of each pattern occurrence, in the form of a tree representing how the RE operators produced the patterns. RE parsing increases the selectivity of matching while avoiding the complications of context-free grammar parsers. Our parser manages ambiguous REs and texts by returning the set of all syntax trees, compressed into a Shared-Packed-Parse-Forest data structure. We initially convert the RE into a serial parser, which simulates a finite automaton (FA) so that the states the automaton passes through encode the syntax tree of the input. On long texts, serial matching and parsing may be too slow for time-constrained applications. Therefore, we present a novel efficient parallel parser for multi-processor computing platforms; its speed-up over the serial algorithm scales well with the text length. We innovatively apply to RE parsing the approach typical of parallel RE matchers/recognizers, where the text is split into chunks to be parsed in parallel and then joined together. Such an approach suffers from the so-called speculation overhead, due to the lack of knowledge by a chunk processor about the state reached at the end of the preceding chunk; this forces each chunk processor to speculatively start in all its states. We introduce a novel technique that minimizes the speculation overhead. The multi-threaded parser program, written in Java, has been validated and its performance has been measured on a commodity multi-core computer, using public and synthetic RE benchmarks. The speed-up over serial parsing, parsing times, and parser construction times are reported.
Submitted 9 March, 2025;
originally announced March 2025.
-
Minimizing speculation overhead in a parallel recognizer for regular texts
Authors:
Angelo Borsotti,
Luca Breveglieri,
Stefano Crespi Reghizzi,
Angelo Morzenti
Abstract:
Speculative data-parallel algorithms for language recognition have been widely experimented with for various types of finite-state automata (FA), deterministic (DFA) and nondeterministic (NFA), often derived from regular expressions (RE). Such an algorithm cuts the input string into chunks, independently recognizes each chunk in parallel by means of identical FAs, and finally joins the chunk results and checks overall consistency. In chunk recognition, it is necessary to speculatively start the FAs in any state, thus causing an overhead that reduces the speedup compared to a serial algorithm. Existing data-parallel DFA-based recognizers suffer from the excessive number of starting states, and the NFA-based ones suffer from the number of nondeterministic transitions. Our data-parallel algorithm is based on the new FA type called reduced interface DFA (RI-DFA), which minimizes the speculation overhead without incurring the penalty of nondeterministic transitions or of impractically enlarged DFA machines. The algorithm is proved to be correct and theoretically efficient, because it combines the state-reduction of an NFA with the speed of deterministic transitions, thus improving on both DFA-based and NFA-based existing implementations. The practical applicability of the RI-DFA approach is confirmed by a quantitative comparison of the number of starting states for a large public benchmark of complex FAs. On multi-core computing architectures, the RI-DFA recognizer is much faster than the NFA-based one on all benchmarks, while it matches the DFA-based one on some benchmarks and performs much better on some others. The extra time cost needed to construct an RI-DFA compared to a DFA is moderate and is compatible with practical use.
Submitted 10 January, 2025; v1 submitted 19 December, 2024;
originally announced December 2024.
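The chunk-and-join scheme that the abstract above takes as its starting point can be sketched as follows. This is a minimal illustration of the baseline speculative DFA recognizer only, not of the RI-DFA construction. The toy automaton (even number of `a`), the function names, and the chunk count are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy DFA over {a, b}: accepts words with an even number of 'a'.
STATES = (0, 1)
START, ACCEPTING = 0, {0}
DELTA = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}


def run_chunk(chunk, start_states):
    """Map every speculative start state to the state reached after the chunk."""
    reached = {}
    for s in start_states:
        q = s
        for ch in chunk:
            q = DELTA[(q, ch)]
        reached[s] = q
    return reached


def recognize_parallel(text, n_chunks=4):
    size = max(1, -(-len(text) // n_chunks))  # ceiling division
    chunks = [text[i:i + size] for i in range(0, len(text), size)] or [""]
    # Only the first chunk knows its start state; the others must
    # speculate on all states, which is the speculation overhead.
    starts = [(START,)] + [STATES] * (len(chunks) - 1)
    with ThreadPoolExecutor() as pool:
        maps = list(pool.map(run_chunk, chunks, starts))
    # Join phase: thread the actual state through the per-chunk maps.
    q = START
    for m in maps:
        q = m[q]
    return q in ACCEPTING


def recognize_serial(text):
    q = START
    for ch in text:
        q = DELTA[(q, ch)]
    return q in ACCEPTING
```

Every chunk after the first runs once per automaton state, so the work per chunk grows with the number of starting states; this is the cost that a reduced set of starting states is meant to lower. Python threads are used here only to show the structure and do not give a real speed-up for CPU-bound loops.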
-
Two-dimensional Dyck words
Authors:
Stefano Crespi Reghizzi,
Antonio Restivo,
Pierluigi San Pietro
Abstract:
We propose different ways of lifting the notion of Dyck language from words to 2-dimensional (2D) pictures, by means of new definitions of increasing comprehensiveness. Two of the proposals are based on alternative definitions of a Dyck language, which are equivalent over words but not on pictures. First, the property that any two pairs of matching parentheses are well-nested or disjoint is rephrased for rectangular boxes and leads to the well-nested Dyck, $DW_k$. This is a generalization of the known Chinese box language, but, unlike the Chinese boxes, $DW_k$ is not recognizable by a tiling system. Second, the Dyck cancellation rule is rephrased as a neutralization rule, mapping a quadruple of symbols representing the corners of a subpicture onto neutral symbols. The neutralizable Dyck language $DN_k$ is obtained by iterating neutralizations, starting from 2-by-2 subpictures, until the picture is wholly neutralized. Third, we define the Dyck crossword $DC_k$ as the row-column combination of Dyck word languages, which prescribes that each column and row is a Dyck word. The relation between matching parentheses is represented in $DC_k$ by an edge of a graph situated on the picture grid. Such edges form a circuit, of path length multiple of four, of alternating row and column matches. Length-four circuits have rectangular shape, while longer ones exhibit a large variety of forms. A proper subset of $DC_k$, called quaternate, is also introduced by excluding all circuits of length greater than 4. We prove that $DN_k$ properly includes $DW_k$, and that it coincides with the quaternate $DC_k$ such that the neutralizability relation between subpictures induces a partial order. The 2D languages well-nested, neutralizable, quaternate and Dyck crossword are ordered by strict inclusions. This work can also be seen as a first step towards the definition of context-free picture languages.
Submitted 21 September, 2023; v1 submitted 31 July, 2023;
originally announced July 2023.
-
Aperiodicity, Star-freeness, and First-order Logic Definability of Operator Precedence Languages
Authors:
Dino Mandrioli,
Matteo Pradella,
Stefano Crespi Reghizzi
Abstract:
A classic result in formal language theory is the equivalence among non-counting, or aperiodic, regular languages, and languages defined through star-free regular expressions, or first-order logic. Past attempts to extend this result beyond the realm of regular languages have met with difficulties: for instance it is known that star-free tree languages may violate the non-counting property and there are aperiodic tree languages that cannot be defined through first-order logic. We extend such classic equivalence results to a significant family of deterministic context-free languages, the operator-precedence languages (OPL), which strictly includes the widely investigated visibly pushdown, alias input-driven, family and other structured context-free languages. The OP model originated in the '60s for defining programming languages and is still used by high performance compilers; its rich algebraic properties have been investigated initially in connection with grammar learning and recently completed with further closure properties and with monadic second order logic definition. We introduce an extension of regular expressions, the OP-expressions (OPE) which define the OPLs and, under the star-free hypothesis, define first-order definable and non-counting OPLs. Then, we prove, through a fairly articulated grammar transformation, that aperiodic OPLs are first-order definable. Thus, the classic equivalence of star-freeness, aperiodicity, and first-order definability is established for the large and powerful class of OPLs. We argue that the same approach can be exploited to obtain analogous results for visibly pushdown languages too.
Submitted 21 November, 2023; v1 submitted 1 June, 2020;
originally announced June 2020.
-
Deque languages, automata and planar graphs
Authors:
Stefano Crespi Reghizzi,
Pierluigi San Pietro
Abstract:
The memory of a deque (double ended queue) automaton is more general than a queue or two stacks; to avoid overgeneralization, we consider quasi-real-time operation. Normal forms of such automata are given. Deque languages form an AFL but not a full one. We define the characteristic deque language, CDL, which combines Dyck and AntiDyck (or FIFO) languages, and homomorphically characterizes the deque languages. The notion of deque graph, from graph theory, well represents deque computation by means of a planar Hamiltonian graph on a cylinder, with edges visualizing producer-consumer relations for deque symbols. We give equivalent definitions of CDL by labeled deque graphs, by cancellation rules, and by means of shuffle and intersection of simpler languages. The labeled deque graph of a sentence generalizes traditional syntax trees. The layout of deque computations on a cylinder is reminiscent of 3D models used in theoretical (bio)chemistry.
Submitted 18 June, 2018;
originally announced June 2018.
-
Non-erasing Chomsky-Schützenberger theorem with grammar-independent alphabet
Authors:
Stefano Crespi Reghizzi,
Pierluigi San Pietro
Abstract:
The famous theorem by Chomsky and Schützenberger (CST) says that every context-free language $L$ over an alphabet $Σ$ is representable as $h(D \cap R)$, where $D$ is a Dyck language over a set $Ω$ of brackets, $R$ is a local language and $h$ is an alphabetic homomorphism that erases unboundedly many symbols. Berstel found that the number of erasures can be linearly limited if the grammar is in Greibach normal form; Berstel and Boasson (and later, independently, Okhotin) proved a non-erasing variant of CST for grammars in Double Greibach Normal Form. In all these CST statements, however, the size of the Dyck alphabet $Ω$ depends on the grammar size for $L$. In the Stanley variant of the CST, $|Ω|$ only depends on $|Σ|$ and not on the grammar, but the homomorphism erases many more symbols than in the other versions of CST; also, the regular language $R$ is strictly locally testable but not local. We prove a new version of CST which combines both features of being non-erasing and of using a grammar-independent alphabet. In our construction, $|Ω|$ is polynomial in $|Σ|$, namely $O(|Σ|^{46})$, and the regular language $R$ is strictly locally testable. Using a recent generalization of Medvedev's homomorphic characterization of regular languages, we prove that the degree in the polynomial dependence of $|Ω|$ on $|Σ|$ may be reduced to just 2 in the case of linear grammars in Double Greibach Normal Form.
Submitted 10 May, 2018;
originally announced May 2018.
-
Higher-Order Operator Precedence Languages
Authors:
Stefano Crespi Reghizzi,
Matteo Pradella
Abstract:
Floyd's Operator Precedence (OP) languages are a deterministic context-free family having many desirable properties. They can be parsed locally and in parallel, and languages having a compatible structure are closed under Boolean operations, concatenation and star; they properly include the family of Visibly Pushdown (or Input Driven) languages. OP languages are based on three relations between any two consecutive terminal symbols, which assign syntax structure to words. We extend such relations to k-tuples of consecutive terminal symbols, by using the model of strictly locally testable regular languages of order k at least 3. The new corresponding class of Higher-order Operator Precedence languages (HOP) properly includes the OP languages, and it is still included in the deterministic (also in reverse) context-free family. We prove Boolean closure for each subfamily of structurally compatible HOP languages. In each subfamily, the top language is called max-language. We show that such languages are defined by a simple cancellation rule and we prove several properties, in particular that max-languages make an infinite hierarchy ordered by parameter k. HOP languages are a candidate for replacing OP languages in the various applications where they have been successful though sometimes too restrictive.
Submitted 21 August, 2017; v1 submitted 25 May, 2017;
originally announced May 2017.
-
Commutative Languages and their Composition by Consensual Methods
Authors:
Stefano Crespi Reghizzi,
Pierluigi San Pietro
Abstract:
Commutative languages with the semilinear property (SLIP) can be naturally recognized by real-time NLOG-SPACE multi-counter machines. We show that unions and concatenations of such languages can be similarly recognized, relying on, and further developing, our recent results on the family of consensually regular (CREG) languages. A CREG language is defined by a regular language on the alphabet that includes the terminal alphabet and its marked copy. New conditions, for ensuring that the union or concatenation of CREG languages is closed, are presented and applied to the commutative SLIP languages. The paper contributes to the knowledge of the CREG family, and introduces novel techniques for language composition, based on arithmetic congruences that act as language signatures. Open problems are listed.
Submitted 21 May, 2014;
originally announced May 2014.
-
Parsing methods streamlined
Authors:
Luca Breveglieri,
Stefano Crespi Reghizzi,
Angelo Morzenti
Abstract:
This paper has the goals (1) of unifying top-down parsing with shift-reduce parsing to yield a single simple and consistent framework, and (2) of producing provably correct parsing methods, deterministic as well as tabular ones, for extended context-free grammars (EBNF) represented as state-transition networks. Departing from the traditional way of presenting as independent algorithms the deterministic bottom-up LR(1), the top-down LL(1) and the general tabular (Earley) parsers, we unify them in a coherent minimalist framework. We present a simple general construction method for EBNF ELR(1) parsers, where the new category of convergence conflicts is added to the classical shift-reduce and reduce-reduce conflicts; we prove its correctness and show two implementations by deterministic push-down machines and by vector-stack machines, the latter to be also used for Earley parsers. Then Beatty's theoretical characterization of LL(1) grammars is adapted to derive the extended ELL(1) parsing method, first by minimizing the ELR(1) parser and then by simplifying its state information. Through using the same notations in the ELR(1) case, the extended Earley parser is obtained. Since all the parsers operate on compatible representations, it is feasible to combine them into mixed-mode algorithms.
Submitted 29 September, 2013;
originally announced September 2013.
-
From Regular to Strictly Locally Testable Languages
Authors:
Stefano Crespi Reghizzi,
Pierluigi San Pietro
Abstract:
A classical result (often credited to Y. Medvedev) states that every language recognized by a finite automaton is the homomorphic image of a local language, over a much larger so-called local alphabet, namely the alphabet of the edges of the transition graph. Local languages are characterized by the value k=2 of the sliding window width in McNaughton and Papert's infinite hierarchy of strictly locally testable languages (k-slt). We generalize Medvedev's result in a new direction, studying the relationship between the width and the alphabetic ratio telling how much larger the local alphabet is. We prove that every regular language is the image of a k-slt language on an alphabet of doubled size, where the width logarithmically depends on the automaton size, and we exhibit regular languages for which any smaller alphabetic ratio is insufficient. More generally, we express the trade-off between alphabetic ratio and width as a mathematical relation derived from a careful encoding of the states. Finally, we mention some directions for theoretical development and application.
Submitted 17 August, 2011;
originally announced August 2011.
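The classical statement that the abstract above starts from (a regular language as the homomorphic image of a local language over the edge alphabet) can be made concrete with a small sketch. It covers only the k=2 case, not the paper's k-slt generalization. The toy automaton and all names (`EDGES`, `lift`, `h`, `in_local_language`) are assumptions chosen for illustration.

```python
# Hypothetical toy DFA over {a, b}: accepts words with an even number of 'a'.
START, ACCEPTING = 0, {0}
DELTA = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}

# Local alphabet: one symbol per edge (source, label, target) of the graph.
EDGES = [(p, c, q) for (p, c), q in DELTA.items()]

# A local language is fixed by its allowed first symbols, last symbols,
# and adjacent pairs (sliding window of width 2).
INITIAL = {e for e in EDGES if e[0] == START}
FINAL = {e for e in EDGES if e[2] in ACCEPTING}
ADJACENT = {(e, f) for e in EDGES for f in EDGES if e[2] == f[0]}


def in_local_language(edge_word):
    """Check membership using only first, last, and adjacent-pair tests."""
    if not edge_word:
        return START in ACCEPTING  # the empty word is handled separately
    return (
        edge_word[0] in INITIAL
        and edge_word[-1] in FINAL
        and all(pair in ADJACENT for pair in zip(edge_word, edge_word[1:]))
    )


def h(edge_word):
    """Alphabetic homomorphism: project each edge onto its label."""
    return "".join(label for _, label, _ in edge_word)


def lift(word):
    """Edge sequence traversed by the DFA on `word` (its unique run)."""
    q, run = START, []
    for ch in word:
        r = DELTA[(q, ch)]
        run.append((q, ch, r))
        q = r
    return run
```

For a word `w`, `h(lift(w))` returns `w`, and `in_local_language(lift(w))` holds exactly when the automaton accepts `w`. The local alphabet here has four symbols against two in the original alphabet, which is the kind of alphabetic ratio the abstract discusses.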
-
A unifying approach to picture grammars
Authors:
Matteo Pradella,
Alessandra Cherubini,
Stefano Crespi Reghizzi
Abstract:
Several old and recent classes of picture grammars, that variously extend context-free string grammars in two dimensions, are based on rules that rewrite arrays of pixels. Such grammars can be unified and extended using a tiling-based approach, whereby the right part of a rule is formalized by means of a finite set of permitted tiles. We focus on a simple type of tiling, named regional, and define the corresponding regional tile grammars. They include both Siromoney's (or Matz's) Kolam grammars and their generalization by Prusa, as well as Drewes's grid grammars. Regionally defined pictures can be recognized with polynomial-time complexity by an algorithm extending the CKY one for strings. Regional tile grammars and languages are strictly included in our previous tile grammars and languages, and are incomparable with Giammarresi-Restivo tiling systems (or Wang systems).
Submitted 8 January, 2011; v1 submitted 15 October, 2009;
originally announced October 2009.
-
Algebraic properties of structured context-free languages: old approaches and novel developments
Authors:
Stefano Crespi Reghizzi,
Dino Mandrioli
Abstract:
The historical research line on the algebraic properties of structured CF languages initiated by McNaughton's Parenthesis Languages has recently attracted much renewed interest with the Balanced Languages, the Visibly Pushdown Automata languages (VPDA), the Synchronized Languages, and the Height-deterministic ones. Such families preserve to a varying degree the basic algebraic properties of Regular languages: Boolean closure, closure under reversal, under concatenation, and Kleene star. We prove that the VPDA family is strictly contained within the Floyd Grammars (FG) family historically known as operator precedence. Languages over the same precedence matrix are known to be closed under Boolean operations, and are recognized by a machine whose pop or push operations on the stack are purely determined by terminal letters. We characterize VPDAs as the subclass of FG having a peculiarly structured set of precedence relations, and balanced grammars as a further restricted case. The non-counting invariance property of FG has a direct implication for VPDA too.
Submitted 13 July, 2009;
originally announced July 2009.
-
Formal semantics of language and the Richard-Berry paradox
Authors:
Stefano Crespi Reghizzi
Abstract:
The classical logical antinomy known as the Richard-Berry paradox is combined with plausible assumptions about the size, i.e., the descriptional complexity, of Turing machines formalizing certain sentences, to show that formalization of language leads to contradiction.
Submitted 24 July, 2008;
originally announced July 2008.