-
How Expressive are Knowledge Graph Foundation Models?
Authors:
Xingyue Huang,
Pablo Barceló,
Michael M. Bronstein,
İsmail İlkan Ceylan,
Mikhail Galkin,
Juan L Reutter,
Miguel Romero Orth
Abstract:
Knowledge Graph Foundation Models (KGFMs) are at the frontier for deep learning on knowledge graphs (KGs), as they can generalize to completely novel knowledge graphs with different relational vocabularies. Despite their empirical success, our theoretical understanding of KGFMs remains very limited. In this paper, we conduct a rigorous study of the expressive power of KGFMs. Specifically, we show that the expressive power of KGFMs directly depends on the motifs that are used to learn the relation representations. We then observe that the most typical motifs used in the existing literature are binary, as the representations are learned based on how pairs of relations interact, which limits the model's expressiveness. As part of our study, we design more expressive KGFMs using richer motifs, which necessitate learning relation representations based on, e.g., how triples of relations interact with each other. Finally, we empirically validate our theoretical findings, showing that the use of richer motifs results in better performance on a wide range of datasets drawn from different domains.
Submitted 9 June, 2025; v1 submitted 18 February, 2025;
originally announced February 2025.
-
A Neuro-Symbolic Framework for Answering Graph Pattern Queries in Knowledge Graphs
Authors:
Tamara Cucumides,
Daniel Daza,
Pablo Barceló,
Michael Cochez,
Floris Geerts,
Juan L Reutter,
Miguel Romero
Abstract:
The challenge of answering graph queries over incomplete knowledge graphs is gaining significant attention in the machine learning community. Neuro-symbolic models have emerged as a promising approach, combining good performance with high interpretability. These models utilize trained architectures to execute atomic queries and integrate modules that mimic symbolic query operators. However, most neuro-symbolic query processors are constrained to tree-like graph pattern queries. These queries admit a bottom-up execution with constant values or anchors at the leaves and the target variable at the root. While expressive, tree-like queries fail to capture critical properties in knowledge graphs, such as the existence of multiple edges between entities or the presence of triangles. We introduce a framework for answering arbitrary graph pattern queries over incomplete knowledge graphs, encompassing both cyclic queries and tree-like queries with existentially quantified leaves. These classes of queries are vital for practical applications but are beyond the scope of most current neuro-symbolic models. Our approach employs an approximation scheme that facilitates acyclic traversals for cyclic patterns, thereby embedding additional symbolic bias into the query execution process. Our experimental evaluation demonstrates that our framework performs competitively on three datasets, effectively handling cyclic queries through our approximation strategy. Additionally, it maintains the performance of existing neuro-symbolic models on anchored tree-like queries and extends their capabilities to queries with existentially quantified variables.
Submitted 5 June, 2024; v1 submitted 6 October, 2023;
originally announced October 2023.
-
Expressiveness and Approximation Properties of Graph Neural Networks
Authors:
Floris Geerts,
Juan L. Reutter
Abstract:
Characterizing the separation power of graph neural networks (GNNs) provides an understanding of their limitations for graph learning tasks. Results regarding separation power are, however, usually geared at specific GNN architectures, and tools for understanding arbitrary GNN architectures are generally lacking. We provide an elegant way to easily obtain bounds on the separation power of GNNs in terms of the Weisfeiler-Leman (WL) tests, which have become the yardstick to measure the separation power of GNNs. The crux is to view GNNs as expressions in a procedural tensor language describing the computations in the layers of the GNNs. Then, by a simple analysis of the obtained expressions, in terms of the number of indexes and the nesting depth of summations, bounds on the separation power in terms of the WL-tests readily follow. We use tensor language to define Higher-Order Message-Passing Neural Networks (or k-MPNNs), a natural extension of MPNNs. Furthermore, the tensor language point of view allows for the derivation of universality results for classes of GNNs in a natural way. Our approach provides a toolbox with which GNN architecture designers can analyze the separation power of their GNNs, without needing to know the intricacies of the WL-tests. We also provide insights into what is needed to boost the separation power of GNNs.
Submitted 10 April, 2022;
originally announced April 2022.
-
Optimal Joins using Compact Data Structures
Authors:
Gonzalo Navarro,
Juan L. Reutter,
Javiel Rojas-Ledesma
Abstract:
Worst-case optimal join algorithms have gained a lot of attention in the database literature. We now have several algorithms that are optimal in the worst case, and many of them have been implemented and validated in practice. However, the implementation of these algorithms often requires an enhanced indexing structure: to achieve optimality we either need to build completely new indexes, or we must populate the database with several instantiations of indexes such as B+-trees. Either way, this means spending an extra amount of storage space that may be non-negligible.
We show that optimal algorithms can be obtained directly from a representation that regards the relations as point sets in variable-dimensional grids, without the need for extra storage. Our representation is a compact quadtree for the static indexes, and a dynamic quadtree sharing subtrees (which we dub a qdag) for intermediate results. We develop a compositional algorithm to process full join queries under this representation, and show that the running time of this algorithm is worst-case optimal in data complexity. Remarkably, we can extend our framework to evaluate more expressive queries from relational algebra by introducing a lazy version of qdags (lqdags). Once again, we can show that the running time of our algorithms is worst-case optimal.
Submitted 9 January, 2020; v1 submitted 5 August, 2019;
originally announced August 2019.
-
Assessing Achievability of Queries and Constraints
Authors:
Rada Chirkova,
Jon Doyle,
Juan L. Reutter
Abstract:
Assessing and improving the quality of data in data-intensive systems are fundamental challenges that have given rise to numerous applications targeting transformation and cleaning of data. However, while schema design, data cleaning, and data migration are nowadays reasonably well understood in isolation, not much attention has been given to the interplay between the tools that address issues in these areas. Our focus is on the problem of determining whether there exist sequences of data-transforming procedures that, when applied to the (untransformed) input data, would yield data satisfying the conditions required for performing the task in question. Our goal is to develop a framework that would address this problem, starting with the relational setting.
In this paper we abstract data-processing tools as black-box procedures. This abstraction describes procedures by a specification of which parts of the database might be modified by the procedure, as well as by the constraints that specify the required states of the database before and after applying the procedure. We then proceed to study fundamental algorithmic questions arising in this context, such as understanding when one can guarantee that sequences of procedures apply to original or transformed data, when they succeed at improving the data, and when knowledge bases can represent the outcomes of procedures. Finally, we turn to the problem of determining whether the application of a sequence of procedures to a database results in the satisfaction of properties specified by either queries or constraints. We show that this problem is decidable for some broad and realistic classes of procedures and properties, even when procedures are allowed to alter the schema of instances.
Submitted 9 December, 2017;
originally announced December 2017.
-
A Framework for Assessing Achievability of Data-Quality Constraints
Authors:
Rada Chirkova,
Jon Doyle,
Juan L. Reutter
Abstract:
Assessing and improving the quality of data are fundamental challenges for data-intensive systems that have given rise to applications targeting transformation and cleaning of data. However, while schema design, data cleaning, and data migration are now reasonably well understood in isolation, not much attention has been given to the interplay between the tools addressing issues in these areas. We focus on the problem of determining whether the available data-processing procedures can be used together to bring about the desired quality of the given data. For instance, consider an organization introducing new data-analysis tasks. Depending on the tasks, it may be a priority to determine whether the data can be processed and transformed using the available data-processing tools to satisfy certain properties or quality assurances needed for the success of the task. Here, while the organization may control some of its tools, some other tools may be external or proprietary, with only basic information available on how they process data. The problem is then, how to decide which tools to apply, and in which order, to make the data ready for the new tasks?
Toward addressing this problem, we develop a new framework that abstracts data-processing tools as black-box procedures with only some of the properties exposed, such as the applicability requirements, the parts of the data that the procedure modifies, and the conditions that the data satisfy once the procedure has been applied. We show how common tasks such as data cleaning and data migration are encapsulated into our framework and, as a proof of concept, we study basic properties of the framework for the case of procedures described by standard relational constraints. While reasoning in this framework may be computationally infeasible in general, we show that there exist well-behaved special cases with potential practical applications.
Submitted 27 March, 2017;
originally announced March 2017.
-
JSON: data model, query languages and schema specification
Authors:
Pierre Bourhis,
Juan L. Reutter,
Fernando Suárez,
Domagoj Vrgoč
Abstract:
Despite the fact that JSON is currently one of the most popular formats for exchanging data on the Web, there are very few studies on this topic and there is no agreed-upon theoretical framework for dealing with JSON. Therefore, in this paper we propose a formal data model for JSON documents and, based on the common features present in available systems using JSON, we define a lightweight query language allowing us to navigate through JSON documents. We also introduce a logic capturing the schema proposal for JSON and study the complexity of basic computational tasks associated with these two formalisms.
Submitted 9 January, 2017;
originally announced January 2017.
-
Containment of Nested Regular Expressions
Authors:
Juan L. Reutter
Abstract:
Nested regular expressions (NREs) have been proposed as a powerful formalism for querying RDFS graphs, but research in a more general graph database context has been scarce, and static analysis results are currently lacking. In this paper we investigate the problem of containment of NREs, and show that it can be solved in PSPACE, i.e., the same complexity as the problem of containment of regular expressions or regular path queries (RPQs).
Submitted 19 June, 2013; v1 submitted 9 April, 2013;
originally announced April 2013.