-
Comparing Downward Fragments of the Relational Calculus with Transitive Closure on Trees
Authors:
Jelle Hellings,
Marc Gyssens,
Yuqing Wu,
Dirk Van Gucht,
Jan Van den Bussche,
Stijn Vansummeren,
George H. L. Fletcher
Abstract:
Motivated by the continuing interest in the tree data model, we study the expressive power of downward navigational query languages on trees and chains. Basic navigational queries are built from the identity relation and edge relations using composition and union. We study the effects on relative expressiveness when we add transitive closure, projections, coprojections, intersection, and difference; this for boolean queries and path queries on labeled and unlabeled structures. In all cases, we present the complete Hasse diagram. In particular, we establish, for each query language fragment that we study on trees, whether it is closed under difference and intersection.
Submitted 4 March, 2018;
originally announced March 2018.
-
G-CORE: A Core for Future Graph Query Languages
Authors:
Renzo Angles,
Marcelo Arenas,
Pablo Barceló,
Peter Boncz,
George H. L. Fletcher,
Claudio Gutierrez,
Tobias Lindaaker,
Marcus Paradies,
Stefan Plantikow,
Juan Sequeda,
Oskar van Rest,
Hannes Voigt
Abstract:
We report on a community effort between industry and academia to shape the future of graph query languages. We argue that existing graph database management systems should consider supporting a query language with two key characteristics. First, it should be composable, meaning that graphs are the input and the output of queries. Second, the graph query language should treat paths as first-class citizens. Our result is G-CORE, a powerful graph query language design that fulfills these goals, and strikes a careful balance between path query expressivity and evaluation complexity.
Submitted 6 December, 2017; v1 submitted 5 December, 2017;
originally announced December 2017.
-
gMark: Schema-Driven Generation of Graphs and Queries
Authors:
Guillaume Bagan,
Angela Bonifati,
Radu Ciucanu,
George H. L. Fletcher,
Aurélien Lemay,
Nicky Advokaat
Abstract:
Massive graph data sets are pervasive in contemporary application domains. Hence, graph database systems are becoming increasingly important. In the experimental study of these systems, it is vital that the research community has shared solutions for the generation of database instances and query workloads having predictable and controllable properties. In this paper, we present the design and engineering principles of gMark, a domain- and query language-independent graph instance and query workload generator. A core contribution of gMark is its ability to target and control the diversity of properties of both the generated instances and the generated workloads coupled to these instances. Further novelties include support for regular path queries, a fundamental graph query paradigm, and schema-driven selectivity estimation of queries, a key feature in controlling workload chokepoints. We illustrate the flexibility and practical usability of gMark by showcasing the framework's capabilities in generating high quality graphs and workloads, and its ability to encode user-defined schemas across a variety of application domains.
Submitted 6 December, 2016; v1 submitted 26 November, 2015;
originally announced November 2015.
-
Structural characterizations of the navigational expressiveness of relation algebras on a tree
Authors:
George H. L. Fletcher,
Marc Gyssens,
Jan Paredaens,
Dirk Van Gucht,
Yuqing Wu
Abstract:
Given a document D in the form of an unordered node-labeled tree, we study the expressiveness on D of various basic fragments of XPath, the core navigational language on XML documents. Working from the perspective of these languages as fragments of Tarski's relation algebra, we give characterizations, in terms of the structure of D, for when a binary relation on its nodes is definable by an expression in these algebras. Since each pair of nodes in such a relation represents a unique path in D, our results therefore capture the sets of paths in D definable in each of the fragments. We refer to this perspective on language semantics as the "global view." In contrast with this global view, there is also a "local view" where one is interested in the nodes to which one can navigate starting from a particular node in the document. In this view, we characterize when a set of nodes in D can be defined as the result of applying an expression to a given node of D. All these definability results, both in the global and the local view, are obtained by using a robust two-step methodology, which consists of first characterizing when two nodes cannot be distinguished by an expression in the respective fragments of XPath, and then bootstrapping these characterizations to the desired results.
Submitted 11 February, 2015;
originally announced February 2015.
-
Relative Expressive Power of Navigational Querying on Graphs
Authors:
George H. L. Fletcher,
Marc Gyssens,
Dirk Leinders,
Dimitri Surinx,
Jan Van den Bussche,
Dirk Van Gucht,
Stijn Vansummeren,
Yuqing Wu
Abstract:
Motivated by both established and new applications, we study navigational query languages for graphs (binary relations). The simplest language has only the two operators union and composition, together with the identity relation. We make more powerful languages by adding any of the following operators: intersection; set difference; projection; coprojection; converse; and the diversity relation. All these operators map binary relations to binary relations. We compare the expressive power of all resulting languages. We do this not only for general path queries (queries where the result may be any binary relation) but also for boolean or yes/no queries (expressed by the nonemptiness of an expression). For both cases, we present the complete Hasse diagram of relative expressiveness. In particular, the Hasse diagram for boolean queries contains some nontrivial separations and a few surprising collapses.
Submitted 28 November, 2014; v1 submitted 31 January, 2014;
originally announced January 2014.
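The operators named in the abstract above can be illustrated with a minimal sketch, assuming a binary relation is modelled as a Python set of (source, target) pairs. The function names, the choice of "first projection" semantics for projection and coprojection, and the toy graph are illustrative assumptions, not taken from the paper. Union, intersection, and set difference are the built-in set operators `|`, `&`, and `-`.

```python
# Hypothetical sketch: binary relations as sets of (source, target) pairs.

def identity(nodes):
    """Identity relation over the given node set."""
    return {(v, v) for v in nodes}

def diversity(nodes):
    """Diversity relation: all pairs of distinct nodes."""
    return {(u, v) for u in nodes for v in nodes if u != v}

def compose(r, s):
    """Composition: (a, d) whenever (a, b) is in r and (b, d) is in s."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}

def converse(r):
    """Converse: every pair reversed."""
    return {(b, a) for (a, b) in r}

def projection(r):
    """Assumed semantics: (a, a) for every node a that starts some pair of r."""
    return {(a, a) for (a, _) in r}

def coprojection(r, nodes):
    """Assumed semantics: (a, a) for every node a that starts no pair of r."""
    return identity(nodes) - projection(r)

# Toy graph: a chain 1 -> 2 -> 3.
nodes = {1, 2, 3}
edges = {(1, 2), (2, 3)}

two_steps = compose(edges, edges)        # path query: pairs linked by two edges
has_two_step_path = bool(two_steps)      # boolean query: nonemptiness of the result
sinks = coprojection(edges, nodes)       # nodes without outgoing edges
```

Every function returns another set of pairs, which mirrors the abstract's remark that all operators map binary relations to binary relations, and the boolean query is just a nonemptiness test on a path query result.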
-
Similarity and bisimilarity notions appropriate for characterizing indistinguishability in fragments of the calculus of relations
Authors:
George H. L. Fletcher,
Marc Gyssens,
Dirk Leinders,
Jan Van den Bussche,
Dirk Van Gucht,
Stijn Vansummeren
Abstract:
Motivated by applications in databases, this paper considers various fragments of the calculus of binary relations. The fragments are obtained by leaving out, or keeping in, some of the standard operators, along with some derived operators such as set difference, projection, coprojection, and residuation. For each considered fragment, a characterization is obtained for when two given binary relational structures are indistinguishable by expressions in that fragment. The characterizations are based on appropriately adapted notions of simulation and bisimulation.
Submitted 28 March, 2014; v1 submitted 9 October, 2012;
originally announced October 2012.
-
External memory bisimulation reduction of big graphs
Authors:
Yongming Luo,
George H. L. Fletcher,
Jan Hidders,
Yuqing Wu,
Paul De Bra
Abstract:
In this paper, we present what are, to our knowledge, the first I/O-efficient solutions for computing the k-bisimulation partition of a massive directed graph, and for maintaining such a partition upon updates to the underlying graph. Ubiquitous in the theory and application of graph data, bisimulation is a robust notion of node equivalence which intuitively groups together nodes in a graph which share fundamental structural features. k-bisimulation is the standard variant of bisimulation where the topological features of nodes are only considered within a local neighborhood of radius $k\geqslant 0$.
The I/O cost of our partition construction algorithm is bounded by $O(k\cdot \mathit{sort}(|\et|) + k\cdot \mathit{scan}(|\nt|) + \mathit{sort}(|\nt|))$, while our maintenance algorithms are bounded by $O(k\cdot \mathit{sort}(|\et|) + k\cdot \mathit{sort}(|\nt|))$. The space complexity bounds are $O(|\nt|+|\et|)$ and $O(k\cdot|\nt|+k\cdot|\et|)$, resp. Here, $|\et|$ and $|\nt|$ are the number of disk pages occupied by the input graph's edge set and node set, resp., and $\mathit{sort}(n)$ and $\mathit{scan}(n)$ are the cost of sorting and scanning, resp., a file occupying $n$ pages in external memory. Empirical analysis on a variety of massive real-world and synthetic graph datasets shows that our algorithms perform efficiently in practice, scaling gracefully as graphs grow in size.
Submitted 2 May, 2013; v1 submitted 2 October, 2012;
originally announced October 2012.
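To make the notion of a k-bisimulation partition in the abstract above concrete, here is a minimal in-memory sketch. It is not the paper's external-memory algorithm: it assumes node labels only (no edge labels), keeps everything in Python dictionaries, and uses the common signature-refinement idea, where a node's round-i signature combines its label block with the set of round-(i-1) blocks of its successors. The function names and the toy graph are hypothetical.

```python
def _block_ids(signatures):
    """Map each distinct signature to a small integer block id."""
    ids = {}
    return {v: ids.setdefault(sig, len(ids)) for v, sig in signatures.items()}

def k_bisimulation_partition(labels, edges, k):
    """Return a dict node -> block id of the k-bisimulation partition.

    labels: dict mapping each node to its label (assumed input format).
    edges:  iterable of (source, target) pairs.
    """
    successors = {v: set() for v in labels}
    for u, w in edges:
        successors[u].add(w)

    base = _block_ids(labels)   # round 0: nodes are grouped by label alone
    block = base
    for _ in range(k):
        # Round i signature: own label block plus the set of successor blocks
        # from round i-1, so each round looks one step further along the edges.
        block = _block_ids({
            v: (base[v], frozenset(block[w] for w in successors[v]))
            for v in labels
        })
    return block

# Toy graph: two chains 1 -> 2 -> 3 and 4 -> 5, all nodes carrying label "x".
labels = {1: "x", 2: "x", 3: "x", 4: "x", 5: "x"}
edges = [(1, 2), (2, 3), (4, 5)]

p1 = k_bisimulation_partition(labels, edges, 1)
p2 = k_bisimulation_partition(labels, edges, 2)
```

With radius 1, nodes 1 and 4 fall in the same block because each has one successor with label "x"; with radius 2 they are separated, since node 1 can take two steps and node 4 only one.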
-
I/O efficient bisimulation partitioning on very large directed acyclic graphs
Authors:
Jelle Hellings,
George H. L. Fletcher,
Herman Haverkort
Abstract:
In this paper we introduce the first efficient external-memory algorithm to compute the bisimilarity equivalence classes of a directed acyclic graph (DAG). DAGs are commonly used to model data in a wide variety of practical applications, ranging from XML documents and data provenance models, to web taxonomies and scientific workflows. In the study of efficient reasoning over massive graphs, the notion of node bisimilarity plays a central role. For example, grouping together bisimilar nodes in an XML data set is the first step in many sophisticated approaches to building indexing data structures for efficient XPath query evaluation. To date, however, only internal-memory bisimulation algorithms have been investigated. As the size of real-world DAG data sets often exceeds available main memory, storage in external memory becomes necessary. Hence, there is a practical need for an efficient approach to computing bisimulation in external memory.
Our general algorithm has a worst-case IO-complexity of O(Sort(|N| + |E|)), where |N| and |E| are the numbers of nodes and edges, resp., in the data graph and Sort(n) is the number of accesses to external memory needed to sort an input of size n. We also study specializations of this algorithm to common variations of bisimulation for tree-structured XML data sets. We empirically verify efficient performance of the algorithms on graphs and XML documents having billions of nodes and edges, and find that the algorithms can process such graphs efficiently even when very limited internal memory is available. The proposed algorithms are simple enough for practical implementation and use, and open the door for further study of external-memory bisimulation algorithms. To this end, the full open-source C++ implementation has been made freely available.
Submitted 5 December, 2011;
originally announced December 2011.
-
A role-free approach to indexing large RDF data sets in secondary memory for efficient SPARQL evaluation
Authors:
George H. L. Fletcher,
Peter W. Beck
Abstract:
Massive RDF data sets are becoming commonplace. RDF data is typically generated in social semantic domains (such as personal information management) wherein a fixed schema is often not available a priori. We propose a simple Three-way Triple Tree (TripleT) secondary-memory indexing technique to facilitate efficient SPARQL query evaluation on such data sets. The novelty of TripleT is that (1) the index is built over the atoms occurring in the data set, rather than at a coarser granularity, such as whole triples occurring in the data set; and (2) the atoms are indexed regardless of the roles (i.e., subjects, predicates, or objects) they play in the triples of the data set. We show through extensive empirical evaluation that TripleT exhibits multiple orders of magnitude improvement over the state of the art on RDF indexing, in terms of both storage and query processing costs.
△ Less
Submitted 7 November, 2008;
originally announced November 2008.
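The role-free idea in the abstract above (one index entry per atom, whatever role the atom plays) can be illustrated with a small in-memory sketch. This is only a stand-in under stated assumptions: the actual TripleT structure lives in secondary memory, and the per-atom layout used here (three lists keyed "s", "p", "o"), the function name, and the sample triples are all hypothetical.

```python
from collections import defaultdict

def build_atom_index(triples):
    """Index every atom once, with one bucket per role it can play.

    triples: iterable of (subject, predicate, object) tuples.
    Returns a dict atom -> {"s": [...], "p": [...], "o": [...]}, where each
    list holds the triples in which the atom occurs in that role.
    """
    index = defaultdict(lambda: {"s": [], "p": [], "o": []})
    for s, p, o in triples:
        triple = (s, p, o)
        index[s]["s"].append(triple)
        index[p]["p"].append(triple)
        index[o]["o"].append(triple)
    return index

# Sample data (made up): "knows" occurs both as a predicate and as a subject.
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("knows", "type", "Property"),
]
index = build_atom_index(triples)

# A single lookup on "bob" reaches triples where it is a subject or an object.
bob_as_subject = index["bob"]["s"]
bob_as_object = index["bob"]["o"]
```

The point of the sketch is that there is one key space for all atoms, rather than separate subject, predicate, and object indexes.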