Search | arXiv e-print repository

Feature Relevancy, Necessity and Usefulness: Complexity and Algorithms

Authors: Tomás Capdevielle, Santiago Cifuentes

Abstract: Given a classification model and a prediction for some input, there are heuristic strategies for ranking features according to their importance in regard to the prediction. One common approach to this task is rooted in propositional logic and the notion of \textit{sufficient reason}. Through this concept, the categories of relevant and necessary features were proposed in order to identify the cruc… ▽ More Given a classification model and a prediction for some input, there are heuristic strategies for ranking features according to their importance in regard to the prediction. One common approach to this task is rooted in propositional logic and the notion of \textit{sufficient reason}. Through this concept, the categories of relevant and necessary features were proposed in order to identify the crucial aspects of the input. This paper improves the existing techniques and algorithms for deciding which are the relevant and/or necessary features, showing in particular that necessity can be detected efficiently in complex models such as neural networks. We also generalize the notion of relevancy and study associated problems. Moreover, we present a new global notion (i.e. that intends to explain whether a feature is important for the behavior of the model in general, not depending on a particular input) of \textit{usefulness} and prove that it is related to relevancy and necessity. Furthermore, we develop efficient algorithms for detecting it in decision trees and other more complex models, and experiment on three datasets to analyze its practical utility. △ Less

Submitted 6 May, 2025; originally announced May 2025.

Comments: 22 pages, 7 figures

MSC Class: 68T01 ACM Class: I.2.0

arXiv:2410.13937 [pdf, ps, other]

Quantum computational complexity of matrix functions

Authors: Santiago Cifuentes, Samson Wang, Thais L. Silva, Mario Berta, Leandro Aolita

Abstract: We investigate the dividing line between classical and quantum computational power in estimating properties of matrix functions. More precisely, we study the computational complexity of two primitive problems: given a function $f$ and a Hermitian matrix $A$, compute a matrix element of $f(A)$ or compute a local measurement on $f(A)|0\rangle^{\otimes n}$, with $|0\rangle^{\otimes n}$ an $n$-qubit r… ▽ More We investigate the dividing line between classical and quantum computational power in estimating properties of matrix functions. More precisely, we study the computational complexity of two primitive problems: given a function $f$ and a Hermitian matrix $A$, compute a matrix element of $f(A)$ or compute a local measurement on $f(A)|0\rangle^{\otimes n}$, with $|0\rangle^{\otimes n}$ an $n$-qubit reference state vector, in both cases up to additive approximation error. We consider four functions -- monomials, Chebyshev polynomials, the time evolution function, and the inverse function -- and probe the complexity across a broad landscape covering different problem input regimes. Namely, we consider two types of matrix inputs (sparse and Pauli access), matrix properties (norm, sparsity), the approximation error, and function-specific parameters. We identify BQP-complete forms of both problems for each function and then toggle the problem parameters to easier regimes to see where hardness remains, or where the problem becomes classically easy. As part of our results, we make concrete a hierarchy of hardness across the functions; in parameter regimes where we have classically efficient algorithms for monomials, all three other functions remain robustly BQP-hard, or hard under usual computational complexity assumptions. In identifying classically easy regimes, among others, we show that for any polynomial of degree $\mathrm{poly}(n)$ both problems can be efficiently classically simulated when $A$ has $O(\log n)$ non-zero coefficients in the Pauli basis. This contrasts with the fact that the problems are BQP-complete in the sparse access model even for constant row sparsity, whereas the stated Pauli access efficiently constructs sparse access with row sparsity $O(\log n)$. Our work provides a catalog of efficient quantum and classical algorithms for fundamental linear-algebra tasks. △ Less

Submitted 22 April, 2025; v1 submitted 17 October, 2024; originally announced October 2024.

Comments: 10+30 pages, 1 table, 2 figures

arXiv:2407.01834 [pdf, other]

A Study of Nationality Bias in Names and Perplexity using Off-the-Shelf Affect-related Tweet Classifiers

Authors: Valentin Barriere, Sebastian Cifuentes

Abstract: In this paper, we apply a method to quantify biases associated with named entities from various countries. We create counterfactual examples with small perturbations on target-domain data instead of relying on templates or specific datasets for bias detection. On widely used classifiers for subjectivity analysis, including sentiment, emotion, hate speech, and offensive text using Twitter data, our… ▽ More In this paper, we apply a method to quantify biases associated with named entities from various countries. We create counterfactual examples with small perturbations on target-domain data instead of relying on templates or specific datasets for bias detection. On widely used classifiers for subjectivity analysis, including sentiment, emotion, hate speech, and offensive text using Twitter data, our results demonstrate positive biases related to the language spoken in a country across all classifiers studied. Notably, the presence of certain country names in a sentence can strongly influence predictions, up to a 23\% change in hate speech detection and up to a 60\% change in the prediction of negative emotions such as anger. We hypothesize that these biases stem from the training data of pre-trained language models (PLMs) and find correlations between affect predictions and PLMs likelihood in English and unknown languages like Basque and Maori, revealing distinct patterns with exacerbate correlations. Further, we followed these correlations in-between counterfactual examples from a same sentence to remove the syntactical component, uncovering interesting results suggesting the impact of the pre-training data was more important for English-speaking-country names. Our anonymized code is [https://anonymous.4open.science/r/biases_ppl-576B/README.md](available here). △ Less

Submitted 23 November, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

Comments: updated EMNLP camera ready version

Journal ref: (2024), Proceedings of EMNLP, 569-579

arXiv:2402.09265 [pdf, ps, other]

Computational Complexity of Preferred Subset Repairs on Data-Graphs

Authors: Nina Pardal, Santiago Cifuentes, Edwin Pin, Maria Vanina Martinez, Sergio Abriola

Abstract: Preferences are a pivotal component in practical reasoning, especially in tasks that involve decision-making over different options or courses of action that could be pursued. In this work, we focus on repairing and querying inconsistent knowledge bases in the form of graph databases, which involves finding a way to solve conflicts in the knowledge base and considering answers that are entailed fr… ▽ More Preferences are a pivotal component in practical reasoning, especially in tasks that involve decision-making over different options or courses of action that could be pursued. In this work, we focus on repairing and querying inconsistent knowledge bases in the form of graph databases, which involves finding a way to solve conflicts in the knowledge base and considering answers that are entailed from every possible repair, respectively. Without a priori domain knowledge, all possible repairs are equally preferred. Though that may be adequate for some settings, it seems reasonable to establish and exploit some form of preference order among the potential repairs. We study the problem of computing prioritized repairs over graph databases with data values, using a notion of consistency based on GXPath expressions as integrity constraints. We present several preference criteria based on the standard subset repair semantics, incorporating weights, multisets, and set-based priority levels. We show that it is possible to maintain the same computational complexity as in the case where no preference criterion is available for exploitation. Finally, we explore the complexity of consistent query answering in this setting and obtain tight lower and upper bounds for all the preference criteria introduced. △ Less

Submitted 27 May, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

Comments: Appendix

MSC Class: 68P15; 68T27; 03B70; 68T37

arXiv:2401.12731 [pdf, other]

The Distributional Uncertainty of the SHAP score in Explainable Machine Learning

Authors: Santiago Cifuentes, Leopoldo Bertossi, Nina Pardal, Sergio Abriola, Maria Vanina Martinez, Miguel Romero

Abstract: Attribution scores reflect how important the feature values in an input entity are for the output of a machine learning model. One of the most popular attribution scores is the SHAP score, which is an instantiation of the general Shapley value used in coalition game theory. The definition of this score relies on a probability distribution on the entity population. Since the exact distribution is g… ▽ More Attribution scores reflect how important the feature values in an input entity are for the output of a machine learning model. One of the most popular attribution scores is the SHAP score, which is an instantiation of the general Shapley value used in coalition game theory. The definition of this score relies on a probability distribution on the entity population. Since the exact distribution is generally unknown, it needs to be assigned subjectively or be estimated from data, which may lead to misleading feature scores. In this paper, we propose a principled framework for reasoning on SHAP scores under unknown entity population distributions. In our framework, we consider an uncertainty region that contains the potential distributions, and the SHAP score of a feature becomes a function defined over this region. We study the basic problems of finding maxima and minima of this function, which allows us to determine tight ranges for the SHAP scores of all features. In particular, we pinpoint the complexity of these problems, and other related ones, showing them to be NP-complete. Finally, we present experiments on a real-world dataset, showing that our framework may contribute to a more robust feature scoring. △ Less

Submitted 13 August, 2024; v1 submitted 23 January, 2024; originally announced January 2024.

Comments: In ECAI 2024 proceedings

MSC Class: 68T37; 68T27

arXiv:2304.00931 [pdf, ps, other]

Data-graph repairs: the preferred approach

Authors: Sergio Abriola, Santiago Cifuentes, Nina Pardal, Edwin Pin

Abstract: Repairing inconsistent knowledge bases is a task that has been assessed, with great advances over several decades, from within the knowledge representation and reasoning and the database theory communities. As information becomes more complex and interconnected, new types of repositories, representation languages and semantics are developed in order to be able to query and reason about it. Graph d… ▽ More Repairing inconsistent knowledge bases is a task that has been assessed, with great advances over several decades, from within the knowledge representation and reasoning and the database theory communities. As information becomes more complex and interconnected, new types of repositories, representation languages and semantics are developed in order to be able to query and reason about it. Graph databases provide an effective way to represent relationships among data, and allow processing and querying these connections efficiently. In this work, we focus on the problem of computing preferred (subset and superset) repairs for graph databases with data values, using a notion of consistency based on a set of Reg-GXPath expressions as integrity constraints. Specifically, we study the problem of computing preferred repairs based on two different preference criteria, one based on weights and the other based on multisets, showing that in most cases it is possible to retain the same computational complexity as in the case where no preference criterion is available for exploitation. △ Less

Submitted 3 April, 2023; originally announced April 2023.

Comments: arXiv admin note: text overlap with arXiv:2206.07504

Report number: Vol-3409; urn:nbn:de:0074-3409-5 MSC Class: 03B70; 68Q55; 68P05

Journal ref: Proceedings of the 15th Alberto Mendelzon International Workshop on Foundations of Data Management (AMW 2023)

arXiv:2211.05259 [pdf, ps, other]

Complexity of solving a system of difference constraints with variables restricted to a finite set

Authors: Santiago Cifuentes, Francisco J. Soulignac, Pablo Terlisky

Abstract: Fishburn developed an algorithm to solve a system of $m$ difference constraints whose $n$ unknowns must take values from a set with $k$ real numbers [Solving a system of difference constraints with variables restricted to a finite set, Inform Process Lett 82 (3) (2002) 143--144]. We provide an implementation of Fishburn's algorithm that runs in $O(n+km)$ time. Fishburn developed an algorithm to solve a system of $m$ difference constraints whose $n$ unknowns must take values from a set with $k$ real numbers [Solving a system of difference constraints with variables restricted to a finite set, Inform Process Lett 82 (3) (2002) 143--144]. We provide an implementation of Fishburn's algorithm that runs in $O(n+km)$ time. △ Less

Submitted 9 November, 2022; originally announced November 2022.

Comments: 4 pages

MSC Class: 68W05; 68W40

arXiv:2206.07504 [pdf, ps, other]

doi 10.1613/jair.1.13994

On the complexity of finding set repairs for data-graphs

Authors: Sergio Abriola, Santiago Cifuentes, María Vanina Martínez, Nina Pardal, Edwin Pin

Abstract: In the deeply interconnected world we live in, pieces of information link domains all around us. As graph databases embrace effectively relationships among data and allow processing and querying these connections efficiently, they are rapidly becoming a popular platform for storage that supports a wide range of domains and applications. As in the relational case, it is expected that data preserves… ▽ More In the deeply interconnected world we live in, pieces of information link domains all around us. As graph databases embrace effectively relationships among data and allow processing and querying these connections efficiently, they are rapidly becoming a popular platform for storage that supports a wide range of domains and applications. As in the relational case, it is expected that data preserves a set of integrity constraints that define the semantic structure of the world it represents. When a database does not satisfy its integrity constraints, a possible approach is to search for a 'similar' database that does satisfy the constraints, also known as a repair. In this work, we study the problem of computing subset and superset repairs for graph databases with data values using a notion of consistency based on a set of Reg-GXPath expressions as integrity constraints. We show that for positive fragments of Reg-GXPath these problems admit a polynomial-time algorithm, while the full expressive power of the language renders them intractable. △ Less

Submitted 15 June, 2022; originally announced June 2022.

Comments: 35 pages , including Appendix

ACM Class: H.2; I.2

arXiv:2109.14112 [pdf, ps, other]

doi 10.1016/j.ijar.2023.108948

An epistemic approach to model uncertainty in data-graphs

Authors: Sergio Abriola, Santiago Cifuentes, María Vanina Martínez, Nina Pardal, Edwin Pin

Abstract: Graph databases are becoming widely successful as data models that allow to effectively represent and process complex relationships among various types of data. As with any other type of data repository, graph databases may suffer from errors and discrepancies with respect to the real-world data they intend to represent. In this work we explore the notion of probabilistic unclean graph databases,… ▽ More Graph databases are becoming widely successful as data models that allow to effectively represent and process complex relationships among various types of data. As with any other type of data repository, graph databases may suffer from errors and discrepancies with respect to the real-world data they intend to represent. In this work we explore the notion of probabilistic unclean graph databases, previously proposed for relational databases, in order to capture the idea that the observed (unclean) graph database is actually the noisy version of a clean one that correctly models the world but that we know partially. As the factors that may be involved in the observation can be many, e.g, all different types of clerical errors or unintended transformations of the data, we assume a probabilistic model that describes the distribution over all possible ways in which the clean (uncertain) database could have been polluted. Based on this model we define two computational problems: data cleaning and probabilistic query answering and study for both of them their corresponding complexity when considering that the transformation of the database can be caused by either removing (subset) or adding (superset) nodes and edges. △ Less

Submitted 7 January, 2023; v1 submitted 28 September, 2021; originally announced September 2021.

Comments: 35 pages

ACM Class: I.2.4; G.3; E.1

Showing 1–9 of 9 results for author: Cifuentes, S