-
Goal-Driven Query Answering over First- and Second-Order Dependencies with Equality
Authors:
Efthymia Tsamoura,
Boris Motik
Abstract:
Query answering over data with dependencies plays a central role in most applications of dependencies. The problem is commonly solved by using a suitable variant of the chase algorithm to compute a universal model of the dependencies and the data and thus explicate all knowledge implicit in the dependencies. After this preprocessing step, an arbitrary conjunctive query over the dependencies and th…
▽ More
Query answering over data with dependencies plays a central role in most applications of dependencies. The problem is commonly solved by using a suitable variant of the chase algorithm to compute a universal model of the dependencies and the data and thus explicate all knowledge implicit in the dependencies. After this preprocessing step, an arbitrary conjunctive query over the dependencies and the data can be answered by evaluating it the computed universal model. If, however, the query to be answered is fixed and known in advance, computing the universal model is often inefficient as many inferences made during this process can be irrelevant to a given query. In such cases, a goal-driven approach, which avoids drawing unnecessary inferences, promises to be more efficient and thus preferable in practice.
In this paper we present what we believe to be the first technique for goal-driven query answering over first- and second-order dependencies with equality reasoning. Our technique transforms the input dependencies so that applying the chase to the output avoids many inferences that are irrelevant to the query. The transformation proceeds in several steps, which comprise the following three novel techniques. First, we present a variant of the singularisation technique by Marnette [60] that is applicable to second-order dependencies and that corrects an incompleteness of a related formulation by ten Cate et al. [74]. Second, we present a relevance analysis technique that can eliminate from the input dependencies that provably do not contribute to query answers. Third, we present a variant of the magic sets algorithm [19] that can handle second-order dependencies with equality reasoning. We also present the results of an extensive empirical evaluation, which show that goal-driven query answering can be orders of magnitude faster than computing the full universal model.
△ Less
Submitted 12 December, 2024;
originally announced December 2024.
-
Mapping the Neuro-Symbolic AI Landscape by Architectures: A Handbook on Augmenting Deep Learning Through Symbolic Reasoning
Authors:
Jonathan Feldstein,
Paulius Dilkas,
Vaishak Belle,
Efthymia Tsamoura
Abstract:
Integrating symbolic techniques with statistical ones is a long-standing problem in artificial intelligence. The motivation is that the strengths of either area match the weaknesses of the other, and $\unicode{x2013}$ by combining the two $\unicode{x2013}$ the weaknesses of either method can be limited. Neuro-symbolic AI focuses on this integration where the statistical methods are in particular n…
▽ More
Integrating symbolic techniques with statistical ones is a long-standing problem in artificial intelligence. The motivation is that the strengths of either area match the weaknesses of the other, and $\unicode{x2013}$ by combining the two $\unicode{x2013}$ the weaknesses of either method can be limited. Neuro-symbolic AI focuses on this integration where the statistical methods are in particular neural networks. In recent years, there has been significant progress in this research field, where neuro-symbolic systems outperformed logical or neural models alone. Yet, neuro-symbolic AI is, comparatively speaking, still in its infancy and has not been widely adopted by machine learning practitioners. In this survey, we present the first mapping of neuro-symbolic techniques into families of frameworks based on their architectures, with several benefits: Firstly, it allows us to link different strengths of frameworks to their respective architectures. Secondly, it allows us to illustrate how engineers can augment their neural networks while treating the symbolic methods as black-boxes. Thirdly, it allows us to map most of the field so that future researchers can identify closely related frameworks.
△ Less
Submitted 29 October, 2024;
originally announced October 2024.
-
Efficiently Learning Probabilistic Logical Models by Cheaply Ranking Mined Rules
Authors:
Jonathan Feldstein,
Dominic Phillips,
Efthymia Tsamoura
Abstract:
Probabilistic logical models are a core component of neurosymbolic AI and are important in their own right for tasks that require high explainability. Unlike neural networks, logical theories that underlie the model are often handcrafted using domain expertise, making their development costly and prone to errors. While there are algorithms that learn logical theories from data, they are generally…
▽ More
Probabilistic logical models are a core component of neurosymbolic AI and are important in their own right for tasks that require high explainability. Unlike neural networks, logical theories that underlie the model are often handcrafted using domain expertise, making their development costly and prone to errors. While there are algorithms that learn logical theories from data, they are generally prohibitively expensive, limiting their applicability in real-world settings. Here, we introduce precision and recall for logical rules and define their composition as rule utility -- a cost-effective measure of the predictive power of logical theories. We also introduce SPECTRUM, a scalable framework for learning logical theories from relational data. Its scalability derives from a linear-time algorithm that mines recurrent subgraphs in the data graph along with a second algorithm that, using the cheap utility measure, efficiently ranks rules derived from these subgraphs. Finally, we prove theoretical guarantees on the utility of the learnt logical theory. As a result, we demonstrate across various tasks that SPECTRUM scales to larger datasets, often learning more accurate logical theories on CPUs in < 1% the runtime of SOTA neural network approaches on GPUs.
△ Less
Submitted 28 February, 2025; v1 submitted 24 September, 2024;
originally announced September 2024.
-
On Characterizing and Mitigating Imbalances in Multi-Instance Partial Label Learning
Authors:
Kaifu Wang,
Efthymia Tsamoura,
Dan Roth
Abstract:
*Multi-Instance Partial Label Learning* (MI-PLL) is a weakly-supervised learning setting encompassing *partial label learning*, *latent structural learning*, and *neurosymbolic learning*. Unlike supervised learning, in MI-PLL, the inputs to the classifiers at training-time are tuples of instances $\mathbf{x}$. At the same time, the supervision signal is generated by a function $σ$ over the (hidden…
▽ More
*Multi-Instance Partial Label Learning* (MI-PLL) is a weakly-supervised learning setting encompassing *partial label learning*, *latent structural learning*, and *neurosymbolic learning*. Unlike supervised learning, in MI-PLL, the inputs to the classifiers at training-time are tuples of instances $\mathbf{x}$. At the same time, the supervision signal is generated by a function $σ$ over the (hidden) gold labels of $\mathbf{x}$. In this work, we make multiple contributions towards addressing a problem that hasn't been studied so far in the context of MI-PLL: that of characterizing and mitigating *learning imbalances*, i.e., major differences in the errors occurring when classifying instances of different classes (aka *class-specific risks*). In terms of theory, we derive class-specific risk bounds for MI-PLL, while making minimal assumptions. Our theory reveals a unique phenomenon: that $σ$ can greatly impact learning imbalances. This result is in sharp contrast with previous research on supervised and weakly-supervised learning, which only studies learning imbalances under the prism of data imbalances. On the practical side, we introduce a technique for estimating the marginal of the hidden labels using only MI-PLL data. Then, we introduce algorithms that mitigate imbalances at training- and testing-time, by treating the marginal of the hidden labels as a constraint. We demonstrate the effectiveness of our techniques using strong baselines from neurosymbolic and long-tail learning, suggesting performance improvements of up to 14\%.
△ Less
Submitted 15 December, 2024; v1 submitted 13 July, 2024;
originally announced July 2024.
-
On Learning Latent Models with Multi-Instance Weak Supervision
Authors:
Kaifu Wang,
Efthymia Tsamoura,
Dan Roth
Abstract:
We consider a weakly supervised learning scenario where the supervision signal is generated by a transition function $σ$ of labels associated with multiple input instances. We formulate this problem as \emph{multi-instance Partial Label Learning (multi-instance PLL)}, which is an extension to the standard PLL problem. Our problem is met in different fields, including latent structural learning and…
▽ More
We consider a weakly supervised learning scenario where the supervision signal is generated by a transition function $σ$ of labels associated with multiple input instances. We formulate this problem as \emph{multi-instance Partial Label Learning (multi-instance PLL)}, which is an extension to the standard PLL problem. Our problem is met in different fields, including latent structural learning and neuro-symbolic integration. Despite the existence of many learning techniques, limited theoretical analysis has been dedicated to this problem. In this paper, we provide the first theoretical study of multi-instance PLL with possibly an unknown transition $σ$. Our main contributions are as follows. Firstly, we propose a necessary and sufficient condition for the learnability of the problem. This condition non-trivially generalizes and relaxes the existing small ambiguity degree in the PLL literature, since we allow the transition to be deterministic. Secondly, we derive Rademacher-style error bounds based on a top-$k$ surrogate loss that is widely used in the neuro-symbolic literature. Furthermore, we conclude with empirical experiments for learning under unknown transitions. The empirical results align with our theoretical findings; however, they also expose the issue of scalability in the weak supervision literature.
△ Less
Submitted 13 July, 2024; v1 submitted 23 June, 2023;
originally announced June 2023.
-
Parallel Neurosymbolic Integration with Concordia
Authors:
Jonathan Feldstein,
Modestas Jurčius,
Efthymia Tsamoura
Abstract:
Parallel neurosymbolic architectures have been applied effectively in NLP by distilling knowledge from a logic theory into a deep model.However, prior art faces several limitations including supporting restricted forms of logic theories and relying on the assumption of independence between the logic and the deep network. We present Concordia, a framework overcoming the limitations of prior art. Co…
▽ More
Parallel neurosymbolic architectures have been applied effectively in NLP by distilling knowledge from a logic theory into a deep model.However, prior art faces several limitations including supporting restricted forms of logic theories and relying on the assumption of independence between the logic and the deep network. We present Concordia, a framework overcoming the limitations of prior art. Concordia is agnostic both to the deep network and the logic theory offering support for a wide range of probabilistic theories. Our framework can support supervised training of both components and unsupervised training of the neural component. Concordia has been successfully applied to tasks beyond NLP and data classification, improving the accuracy of state-of-the-art on collective activity detection, entity linking and recommendation tasks.
△ Less
Submitted 14 June, 2023; v1 submitted 1 June, 2023;
originally announced June 2023.
-
Probabilistic Reasoning at Scale: Trigger Graphs to the Rescue
Authors:
Efthymia Tsamoura,
Jaehun Lee,
Jacopo Urbani
Abstract:
The role of uncertainty in data management has become more prominent than ever before, especially because of the growing importance of machine learning-driven applications that produce large uncertain databases. A well-known approach to querying such databases is to blend rule-based reasoning with uncertainty. However, techniques proposed so far struggle with large databases. In this paper, we add…
▽ More
The role of uncertainty in data management has become more prominent than ever before, especially because of the growing importance of machine learning-driven applications that produce large uncertain databases. A well-known approach to querying such databases is to blend rule-based reasoning with uncertainty. However, techniques proposed so far struggle with large databases. In this paper, we address this problem by presenting a new technique for probabilistic reasoning that exploits Trigger Graphs (TGs) -- a notion recently introduced for the non-probabilistic setting. The intuition is that TGs can effectively store a probabilistic model by avoiding an explicit materialization of the lineage and by grouping together similar derivations of the same fact. Firstly, we show how TGs can be adapted to support the possible world semantics. Then, we describe techniques for efficiently computing a probabilistic model, and formally establish the correctness of our approach. We also present an extensive empirical evaluation using a prototype called LTGs. Our comparison against other leading engines shows that LTGs is not only faster, even against approximate reasoning techniques, but can also reason over probabilistic databases that existing engines cannot scale to.
△ Less
Submitted 11 April, 2023;
originally announced April 2023.
-
Principled and Efficient Motif Finding for Structure Learning of Lifted Graphical Models
Authors:
Jonathan Feldstein,
Dominic Phillips,
Efthymia Tsamoura
Abstract:
Structure learning is a core problem in AI central to the fields of neuro-symbolic AI and statistical relational learning. It consists in automatically learning a logical theory from data. The basis for structure learning is mining repeating patterns in the data, known as structural motifs. Finding these patterns reduces the exponential search space and therefore guides the learning of formulas. D…
▽ More
Structure learning is a core problem in AI central to the fields of neuro-symbolic AI and statistical relational learning. It consists in automatically learning a logical theory from data. The basis for structure learning is mining repeating patterns in the data, known as structural motifs. Finding these patterns reduces the exponential search space and therefore guides the learning of formulas. Despite the importance of motif learning, it is still not well understood. We present the first principled approach for mining structural motifs in lifted graphical models, languages that blend first-order logic with probabilistic models, which uses a stochastic process to measure the similarity of entities in the data. Our first contribution is an algorithm, which depends on two intuitive hyperparameters: one controlling the uncertainty in the entity similarity measure, and one controlling the softness of the resulting rules. Our second contribution is a preprocessing step where we perform hierarchical clustering on the data to reduce the search space to the most relevant data. Our third contribution is to introduce an O(n ln n) (in the size of the entities in the data) algorithm for clustering structurally-related data. We evaluate our approach using standard benchmarks and show that we outperform state-of-the-art structure learning approaches by up to 6% in terms of accuracy and up to 80% in terms of runtime.
△ Less
Submitted 18 June, 2023; v1 submitted 9 February, 2023;
originally announced February 2023.
-
Scalable Regularization of Scene Graph Generation Models using Symbolic Theories
Authors:
Davide Buffelli,
Efthymia Tsamoura
Abstract:
Several techniques have recently aimed to improve the performance of deep learning models for Scene Graph Generation (SGG) by incorporating background knowledge. State-of-the-art techniques can be divided into two families: one where the background knowledge is incorporated into the model in a subsymbolic fashion, and another in which the background knowledge is maintained in symbolic form. Despit…
▽ More
Several techniques have recently aimed to improve the performance of deep learning models for Scene Graph Generation (SGG) by incorporating background knowledge. State-of-the-art techniques can be divided into two families: one where the background knowledge is incorporated into the model in a subsymbolic fashion, and another in which the background knowledge is maintained in symbolic form. Despite promising results, both families of techniques face several shortcomings: the first one requires ad-hoc, more complex neural architectures increasing the training or inference cost; the second one suffers from limited scalability w.r.t. the size of the background knowledge. Our work introduces a regularization technique for injecting symbolic background knowledge into neural SGG models that overcomes the limitations of prior art. Our technique is model-agnostic, does not incur any cost at inference time, and scales to previously unmanageable background knowledge sizes. We demonstrate that our technique can improve the accuracy of state-of-the-art SGG models, by up to 33%.
△ Less
Submitted 6 September, 2022;
originally announced September 2022.
-
Materializing Knowledge Bases via Trigger Graphs
Authors:
Efthymia Tsamoura,
David Carral,
Enrico Malizia,
Jacopo Urbani
Abstract:
The chase is a well-established family of algorithms used to materialize Knowledge Bases (KBs), like Knowledge Graphs (KGs), to tackle important tasks like query answering under dependencies or data cleaning. A general problem of chase algorithms is that they might perform redundant computations. To counter this problem, we introduce the notion of Trigger Graphs (TGs), which guide the execution of…
▽ More
The chase is a well-established family of algorithms used to materialize Knowledge Bases (KBs), like Knowledge Graphs (KGs), to tackle important tasks like query answering under dependencies or data cleaning. A general problem of chase algorithms is that they might perform redundant computations. To counter this problem, we introduce the notion of Trigger Graphs (TGs), which guide the execution of the rules avoiding redundant computations. We present the results of an extensive theoretical and empirical study that seeks to answer when and how TGs can be computed and what are the benefits of TGs when applied over real-world KBs. Our results include introducing algorithms that compute (minimal) TGs. We implemented our approach in a new engine, and our experiments show that it can be significantly more efficient than the chase enabling us to materialize KBs with 17B facts in less than 40 min on commodity machines.
△ Less
Submitted 4 February, 2021;
originally announced February 2021.
-
Neural-Symbolic Integration: A Compositional Perspective
Authors:
Efthymia Tsamoura,
Loizos Michael
Abstract:
Despite significant progress in the development of neural-symbolic frameworks, the question of how to integrate a neural and a symbolic system in a \emph{compositional} manner remains open. Our work seeks to fill this gap by treating these two systems as black boxes to be integrated as modules into a single architecture, without making assumptions on their internal structure and semantics. Instead…
▽ More
Despite significant progress in the development of neural-symbolic frameworks, the question of how to integrate a neural and a symbolic system in a \emph{compositional} manner remains open. Our work seeks to fill this gap by treating these two systems as black boxes to be integrated as modules into a single architecture, without making assumptions on their internal structure and semantics. Instead, we expect only that each module exposes certain methods for accessing the functions that the module implements: the symbolic module exposes a deduction method for computing the function's output on a given input, and an abduction method for computing the function's inputs for a given output; the neural module exposes a deduction method for computing the function's output on a given input, and an induction method for updating the function given input-output training instances. We are, then, able to show that a symbolic module -- with any choice for syntax and semantics, as long as the deduction and abduction methods are exposed -- can be cleanly integrated with a neural module, and facilitate the latter's efficient training, achieving empirical performance that exceeds that of previous work.
△ Less
Submitted 22 October, 2020;
originally announced October 2020.
-
Balancing expressiveness and inexpressiveness in view design
Authors:
Michael Benedikt,
Pierre Bourhis,
Louis Jachiet,
Efthymia Tsamoura
Abstract:
We study the design of data publishing mechanisms that allow a collection of autonomous distributed datasources to collaborate to support queries. A common mechanism for data publishing is via views: functions that expose derived data to users, usually specified as declarative queries. Our autonomy assumption is that the views must be on individual sources, but with the intention of supporting int…
▽ More
We study the design of data publishing mechanisms that allow a collection of autonomous distributed datasources to collaborate to support queries. A common mechanism for data publishing is via views: functions that expose derived data to users, usually specified as declarative queries. Our autonomy assumption is that the views must be on individual sources, but with the intention of supporting integrated queries. In deciding what data to expose to users, two considerations must be balanced. The views must be sufficiently expressive to support queries that users want to ask -- the utility of the publishing mechanism. But there may also be some expressiveness restriction. Here we consider two restrictions, a minimal information requirement, saying that the views should reveal as little as possible while supporting the utility query, and a non-disclosure requirement, formalizing the need to prevent external users from computing information that data owners do not want revealed.
We investigate the problem of designing views that satisfy both an expressiveness and an inexpressiveness requirement, for views in a restricted declarative language (conjunctive queries), and for arbitrary views.
△ Less
Submitted 27 June, 2020; v1 submitted 1 June, 2020;
originally announced June 2020.
-
Beyond the Grounding Bottleneck: Datalog Techniques for Inference in Probabilistic Logic Programs (Technical Report)
Authors:
Efthymia Tsamoura,
Victor Gutierrez-Basulto,
Angelika Kimmig
Abstract:
State-of-the-art inference approaches in probabilistic logic programming typically start by computing the relevant ground program with respect to the queries of interest, and then use this program for probabilistic inference using knowledge compilation and weighted model counting. We propose an alternative approach that uses efficient Datalog techniques to integrate knowledge compilation with forw…
▽ More
State-of-the-art inference approaches in probabilistic logic programming typically start by computing the relevant ground program with respect to the queries of interest, and then use this program for probabilistic inference using knowledge compilation and weighted model counting. We propose an alternative approach that uses efficient Datalog techniques to integrate knowledge compilation with forward reasoning with a non-ground program. This effectively eliminates the grounding bottleneck that so far has prohibited the application of probabilistic logic programming in query answering scenarios over knowledge graphs, while also providing fast approximations on classical benchmarks in the field.
△ Less
Submitted 18 November, 2019;
originally announced November 2019.
-
Repairing mappings under policy views
Authors:
Angela Bonifati,
Ugo Comignani,
Efthymia Tsamoura
Abstract:
The problem of data exchange involves a source schema, a target schema and a set of mappings from transforming the data between the two schemas. We study the problem of data exchange in the presence of privacy restrictions on the source. The privacy restrictions are expressed as a set of policy views representing the information that is safe to expose over all instances of the source. We propose a…
▽ More
The problem of data exchange involves a source schema, a target schema and a set of mappings from transforming the data between the two schemas. We study the problem of data exchange in the presence of privacy restrictions on the source. The privacy restrictions are expressed as a set of policy views representing the information that is safe to expose over all instances of the source. We propose a protocol that provides formal privacy guarantees and is data-independent, i.e., if certain criteria are met, then the protocol guarantees that the mappings leak no sensitive information independently of the data that lies in the source. We also propose an algorithm for repairing an input mapping w.r.t. a set of policy views, in cases where the input mapping leaks sensitive information. The empirical evaluation of our work shows that the proposed algorithm is quite efficient, repairing sets of 300 s-t tgds in an average time of 5s on a commodity machine. To the best of our knowledge, our work is the first one that studies the problems of exchanging data and repairing mappings under such privacy restrictions. Furthermore, our work is the first to provide practical algorithms for a logical privacy-preservation paradigm, described as an open research challenge in previous work on this area.
△ Less
Submitted 21 March, 2019;
originally announced March 2019.
-
Goal-Driven Query Answering for Existential Rules with Equality
Authors:
Michael Benedikt,
Boris Motik,
Efthymia Tsamoura
Abstract:
Inspired by the magic sets for Datalog, we present a novel goal-driven approach for answering queries over terminating existential rules with equality (aka TGDs and EGDs). Our technique improves the performance of query answering by pruning the consequences that are not relevant for the query. This is challenging in our setting because equalities can potentially affect all predicates in a dataset.…
▽ More
Inspired by the magic sets for Datalog, we present a novel goal-driven approach for answering queries over terminating existential rules with equality (aka TGDs and EGDs). Our technique improves the performance of query answering by pruning the consequences that are not relevant for the query. This is challenging in our setting because equalities can potentially affect all predicates in a dataset. We address this problem by combining the existing singularization technique with two new ingredients: an algorithm for identifying the rules relevant to a query and a new magic sets algorithm. We show empirically that our technique can significantly improve the performance of query answering, and that it can mean the difference between answering a query in a few seconds or not being able to process the query at all.
△ Less
Submitted 20 November, 2017; v1 submitted 14 November, 2017;
originally announced November 2017.