-
The Complexity of Aggregates over Extractions by Regular Expressions
Authors:
Johannes Doleschal,
Benny Kimelfeld,
Wim Martens
Abstract:
Regular expressions with capture variables, also known as regex-formulas, extract relations of spans (intervals identified by their start and end indices) from text. In turn, the class of regular document spanners is the closure of the regex formulas under the Relational Algebra. We investigate the computational complexity of querying text by aggregate functions, such as sum, average, and quantile…
▽ More
Regular expressions with capture variables, also known as regex-formulas, extract relations of spans (intervals identified by their start and end indices) from text. In turn, the class of regular document spanners is the closure of the regex formulas under the Relational Algebra. We investigate the computational complexity of querying text by aggregate functions, such as sum, average, and quantile, on top of regular document spanners. To this end, we formally define aggregate functions over regular document spanners and analyze the computational complexity of exact and approximate computation. More precisely, we show that in a restricted case, all studied aggregate functions can be computed in polynomial time. In general, however, even though exact computation is intractable, some aggregates can still be approximated with fully polynomial-time randomized approximation schemes (FPRAS).
△ Less
Submitted 8 August, 2023; v1 submitted 20 February, 2020;
originally announced February 2020.
-
Weight Annotation in Information Extraction
Authors:
Johannes Doleschal,
Benny Kimelfeld,
Wim Martens,
Liat Peterfreund
Abstract:
The framework of document spanners abstracts the task of information extraction from text as a function that maps every document (a string) into a relation over the document's spans (intervals identified by their start and end indices). For instance, the regular spanners are the closure under the Relational Algebra (RA) of the regular expressions with capture variables, and the expressive power of…
▽ More
The framework of document spanners abstracts the task of information extraction from text as a function that maps every document (a string) into a relation over the document's spans (intervals identified by their start and end indices). For instance, the regular spanners are the closure under the Relational Algebra (RA) of the regular expressions with capture variables, and the expressive power of the regular spanners is precisely captured by the class of VSet-automata -- a restricted class of transducers that mark the endpoints of selected spans.
In this work, we embark on the investigation of document spanners that can annotate extractions with auxiliary information such as confidence, support, and confidentiality measures. To this end, we adopt the abstraction of provenance semirings by Green et al., where tuples of a relation are annotated with the elements of a commutative semiring, and where the annotation propagates through the positive RA operators via the semiring operators. Hence, the proposed spanner extension, referred to as an annotator, maps every string into an annotated relation over the spans. As a specific instantiation, we explore weighted VSet-automata that, similarly to weighted automata and transducers, attach semiring elements to transitions. We investigate key aspects of expressiveness, such as the closure under the positive RA, and key aspects of computational complexity, such as the enumeration of annotated answers and their ranked enumeration in the case of ordered semirings. For a number of these problems, fundamental properties of the underlying semiring, such as positivity, are crucial for establishing tractability.
△ Less
Submitted 28 January, 2022; v1 submitted 30 August, 2019;
originally announced August 2019.
-
Split-Correctness in Information Extraction
Authors:
Johannes Doleschal,
Benny Kimelfeld,
Wim Martens,
Frank Neven,
Matthias Niewerth
Abstract:
Programs for extracting structured information from text, namely information extractors, often operate separately on document segments obtained from a generic splitting operation such as sentences, paragraphs, k-grams, HTTP requests, and so on. An automated detection of this behavior of extractors, which we refer to as split-correctness, would allow text analysis systems to devise query plans with…
▽ More
Programs for extracting structured information from text, namely information extractors, often operate separately on document segments obtained from a generic splitting operation such as sentences, paragraphs, k-grams, HTTP requests, and so on. An automated detection of this behavior of extractors, which we refer to as split-correctness, would allow text analysis systems to devise query plans with parallel evaluation on segments for accelerating the processing of large documents. Other applications include the incremental evaluation on dynamic content, where re-evaluation of information extractors can be restricted to revised segments, and debugging, where developers of information extractors are informed about potential boundary crossing of different semantic components. We propose a new formal framework for split-correctness within the formalism of document spanners. Our analysis studies the complexity of split-correctness over regular spanners. We also discuss different variants of split-correctness, for instance, in the presence of black-box extractors with split constraints.
△ Less
Submitted 20 May, 2021; v1 submitted 8 October, 2018;
originally announced October 2018.