-
Hoeffding decomposition of black-box models with dependent inputs
Authors:
Marouane Il Idrissi,
Nicolas Bousquet,
Fabrice Gamboa,
Bertrand Iooss,
Jean-Michel Loubes
Abstract:
Performing an additive decomposition of arbitrary functions of random elements is paramount for global sensitivity analysis and, therefore, the interpretation of black-box models. The well-known seminal work of Hoeffding characterized the summands in such a decomposition in the particular case of mutually independent inputs. Going beyond the framework of independent inputs has been an ongoing challenge in the literature. Existing solutions have so far required constraining assumptions or suffer from a lack of interpretability. In this paper, we generalize Hoeffding's decomposition for dependent inputs under very mild conditions. For that purpose, we propose a novel framework to handle dependencies based on probability theory, functional analysis, and combinatorics. It allows for characterizing two reasonable assumptions on the dependence structure of the inputs: non-perfect functional dependence and non-degenerate stochastic dependence. We then show that any square-integrable, real-valued function of random elements respecting these two assumptions can be uniquely additively decomposed and offer a characterization of the summands using oblique projections. We then introduce and discuss the theoretical properties and practical benefits of the sensitivity indices that ensue from this decomposition. Finally, the decomposition is analytically illustrated on bivariate functions of Bernoulli inputs.
Submitted 11 September, 2024; v1 submitted 10 October, 2023;
originally announced October 2023.
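For orientation, the classical decomposition that the abstract above sets out to generalize can be written as follows. This is a sketch of the standard independent-input case only, not of the paper's oblique-projection construction; the notation ($G$, $X = (X_1, \dots, X_d)$, $D = \{1, \dots, d\}$, $G_A$) is assumed here for illustration and does not come from the listing.

$$
G(X) = \sum_{A \subseteq D} G_A(X_A),
\qquad
G_A(X_A) = \sum_{B \subseteq A} (-1)^{|A| - |B|} \, \mathbb{E}\left[ G(X) \mid X_B \right],
$$

$$
\operatorname{Var}\big(G(X)\big) = \sum_{\emptyset \neq A \subseteq D} \operatorname{Var}\big(G_A(X_A)\big)
\quad \text{when } X_1, \dots, X_d \text{ are mutually independent.}
$$

The variance identity relies on the summands being pairwise orthogonal, which is exactly what fails in general once the inputs are dependent.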
-
On the coalitional decomposition of parameters of interest
Authors:
Marouane Il Idrissi,
Nicolas Bousquet,
Fabrice Gamboa,
Bertrand Iooss,
Jean-Michel Loubes
Abstract:
Understanding the behavior of a black-box model with probabilistic inputs can be based on the decomposition of a parameter of interest (e.g., its variance) into contributions attributed to each coalition of inputs (i.e., subsets of inputs). In this paper, we produce conditions for obtaining unambiguous and interpretable decompositions of very general parameters of interest. This makes it possible to recover known decompositions, which hold under weaker assumptions than those stated in the literature.
Submitted 6 January, 2023;
originally announced January 2023.
-
Quantile-constrained Wasserstein projections for robust interpretability of numerical and machine learning models
Authors:
Marouane Il Idrissi,
Nicolas Bousquet,
Fabrice Gamboa,
Bertrand Iooss,
Jean-Michel Loubes
Abstract:
Robustness studies of black-box models are recognized as a necessary task for numerical models based on structural equations and predictive models learned from data. These studies must assess the model's robustness to possible misspecification of its inputs (e.g., covariate shift). The study of black-box models, through the prism of uncertainty quantification (UQ), is often based on sensitivity analysis involving a probabilistic structure imposed on the inputs, while ML models are solely constructed from observed data. Our work aims at unifying the UQ and ML interpretability approaches by providing relevant and easy-to-use tools for both paradigms. To provide a generic and understandable framework for robustness studies, we define perturbations of input information relying on quantile constraints and projections with respect to the Wasserstein distance between probability measures, while preserving their dependence structure. We show that this perturbation problem can be analytically solved. Ensuring regularity constraints by means of isotonic polynomial approximations leads to smoother perturbations, which can be more suitable in practice. Numerical experiments on real case studies, from the UQ and ML fields, highlight the computational feasibility of such studies and provide local and global insights on the robustness of black-box models to input perturbations.
Submitted 23 September, 2022;
originally announced September 2022.
-
Detecting and modeling worst-case dependence structures between random inputs of computational reliability models
Authors:
Nazih Benoumechiara,
Bertrand Michel,
Philippe Saint-Pierre,
Nicolas Bousquet
Abstract:
Uncertain information on input parameters of reliability models is usually modeled by considering these parameters as random, and described by marginal distributions and a dependence structure of these variables. In numerous real-world applications, while information is mainly provided by marginal distributions, typically from samples, little is really known about the dependence structure itself. Faced with this problem of incomplete or missing information, risk studies are often conducted by considering independence of input variables, at the risk of including irrelevant situations. This approach is especially used when reliability functions are considered as black-box computational models. Such analyses remain weakened in the absence of in-depth model exploration, at the possible price of a strong risk misestimation. Considering the frequent case where the reliability output is a quantile, this article provides a methodology to improve risk assessment, by exploring a set of pessimistic dependencies using a copula-based strategy. In dimension greater than two, a greedy algorithm is provided to build input regular vine copulas reaching a minimum quantile to which a reliability admissible limit value can be compared, by selecting pairwise components of sensitive influence on the result. The strategy is tested over toy models and a real industrial case-study. The results highlight that current approaches can provide non-conservative results, and that a nontrivial dependence structure can be exhibited to define a worst-case scenario.
Submitted 27 April, 2018;
originally announced April 2018.
-
An innovative solution for breast cancer textual big data analysis
Authors:
Nicolas Thiebaut,
Antoine Simoulin,
Karl Neuberger,
Issam Ibnouhsein,
Nicolas Bousquet,
Nathalie Reix,
Sébastien Molière,
Carole Mathelin
Abstract:
The digitalization of stored information in hospitals now allows for the exploitation of medical data in text format, as electronic health records (EHRs), initially gathered for purposes other than epidemiology. Manual search and analysis operations on such data become tedious. In recent years, the use of natural language processing (NLP) tools has been highlighted to automate the extraction of information contained in EHRs, structure it and perform statistical analysis on this structured information. The main difficulty with the existing approaches is the requirement of synonym or ontology dictionaries, which are mostly available in English only and do not include local or custom notations. In this work, a team composed of oncologists as domain experts and data scientists develops a custom NLP-based system to process and structure textual clinical reports of patients suffering from breast cancer. The tool relies on the combination of standard text mining techniques and an advanced synonym detection method. It allows for a global analysis by retrieval of indicators such as medical history, tumor characteristics, therapeutic responses, recurrences and prognosis. The versatility of the method makes it easy to obtain new indicators, thus opening up the way for retrospective studies with a substantial reduction of the amount of manual work. With no need for biomedical annotators or pre-defined ontologies, this language-agnostic method reached a good extraction accuracy for several concepts of interest, according to a comparison with a manually structured file, without requiring any existing corpus with local or new notations.
Submitted 6 December, 2017;
originally announced December 2017.
-
Bayesian prior elicitation and selection for extreme values
Authors:
Nicolas Bousquet,
Merlin Keller
Abstract:
A major issue of extreme value analysis is the determination of the shape parameter $ξ$ common to Generalized Extreme Value (GEV) and Generalized Pareto (GP) distributions, which drives the tail behavior, and is of major impact on the estimation of return levels and periods. Many practitioners choose a Bayesian framework to conduct this assessment in order to account for parametric uncertainties, which are typically high in such analyses characterized by a low number of observations. Nonetheless, such approaches can provide large credibility domains for $ξ$, including negative and positive values, which does not allow one to conclude on the nature of the tail. Considering the block maxima framework, a generic approach to the determination of the value and sign of $ξ$ arises from model selection between the Fréchet, Gumbel and Weibull possible domains of attraction conditionally on observations. In contrast to the common choice of the GEV as an appropriate model for sampling extreme values, this model selection must be conducted with great care. The elicitation of proper, informative and easy-to-use priors is conducted based on the following principle: for all parameter dimensions they act as posteriors of noninformative priors and virtual samples. Statistics of these virtual samples can be assessed from prior predictive information, and a compatibility rule can be carried out to complete the calibration, even though they are only semi-conjugated. Besides, the model selection is conducted using a mixture encompassing framework, which makes it possible to tackle the computation of Bayes factors. Motivated by a real case-study involving the elicitation of expert knowledge on meteorological magnitudes, the overall methodology is also illustrated by toy examples.
Submitted 17 June, 2018; v1 submitted 2 December, 2017;
originally announced December 2017.
-
An adaptive kriging method for solving nonlinear inverse statistical problems
Authors:
Shuai Fu,
Mathieu Couplet,
Nicolas Bousquet
Abstract:
In various industrial contexts, estimating the distribution of unobserved random vectors Xi from some noisy indirect observations H(Xi) + Ui is required. If the relation between Xi and the quantity H(Xi), measured with the error Ui, is implemented by a CPU-consuming computer model H, a major practical difficulty is to perform the statistical inference with a relatively small number of runs of H. Following Fu et al. (2014), a Bayesian statistical framework is considered to make use of possible prior knowledge on the parameters of the distribution of the Xi, which is assumed Gaussian. Moreover, a Markov Chain Monte Carlo (MCMC) algorithm is carried out to estimate their posterior distribution by replacing H by a kriging metamodel built from a limited number of simulated experiments. Two heuristics, involving two different criteria to be optimized, are proposed to sequentially design these computer experiments within the limits of a given computational budget. The first criterion is a Weighted Integrated Mean Square Error (WIMSE). The second one, called Expected Conditional Divergence (ECD), developed in the spirit of the Stepwise Uncertainty Reduction (SUR) criterion, is based on the discrepancy between two consecutive approximations of the target posterior distribution. Several numerical comparisons conducted over a toy example and then a motivating real case-study show that such adaptive designs can significantly outperform the classical choice of a maximin Latin Hypercube Design (LHD) of experiments. Dealing with a major concern in hydraulic engineering, a particular emphasis is placed upon the prior elicitation of the case-study, highlighting the overall feasibility of the methodology. Faster convergence and manageability considerations lead us to recommend the use of the ECD criterion in practical applications.
Submitted 23 August, 2015;
originally announced August 2015.
-
On the practical interest of discrete Inverse Polya and Weibull-1 models in industrial reliability studies
Authors:
Alberto Pasanisi,
Côme Roero,
Nicolas Bousquet,
Emmanuel Remy
Abstract:
Engineers often cope with the problem of assessing the lifetime of industrial components on the basis of observed industrial feedback data. Usually, lifetime is modelled as a continuous random variable, for instance exponentially or Weibull distributed. However, in some cases, the features of the piece of equipment under investigation rather suggest the use of discrete probabilistic models. This happens for equipment which only operates on cycles or on demand. In these cases, the lifetime is rather measured in number of cycles or number of demands before failure; therefore, in theory, discrete models should be more appropriate. This article aims at shedding some light on the practical interest for the reliability engineer in using two discrete models among the most popular: the Inverse Polya distribution (IPD), based on a Polya urn scheme, and the so-called Weibull-1 (W1) model. It is shown that, for different reasons, the practical use of both models should be restricted to specific industrial situations. In particular, when nothing is known a priori about the nature of ageing and/or data are heavily right-censored, they can remain of limited interest with respect to more flexible continuous lifetime models such as the usual Weibull distribution. Nonetheless, the intuitive meaning given to the IPD favors its use by engineers in low (decelerated) ageing situations.
Submitted 3 November, 2014;
originally announced November 2014.
-
Accelerated Monte Carlo estimation of failure probabilities in output of monotone computer codes
Authors:
Nicolas Bousquet
Abstract:
The problem of estimating the probability p = P(g(X) < 0) is considered when X represents a multivariate stochastic input of a monotone function g. First, a heuristic method to bound p is formally described, involving a specialized design of numerical experiments. Then a statistical estimation of p is considered based on a sequential stochastic exploration of the input space. A maximum likelihood estimator of p based on successive dependent Bernoulli data is defined and its theoretical convergence properties are studied. Under intuitive or mild conditions, the estimation is faster and more robust than the traditional Monte Carlo approach, and is therefore adapted to time-consuming computer codes g. The main result of the paper is related to the variance of the estimator. It appears as a new baseline measure of efficiency under monotone constraints, which could play a similar role to the usual Monte Carlo estimator variance in unconstrained frameworks. Furthermore, the bias of the estimator is shown to be correctable via bootstrap heuristics. The behavior of the method is illustrated by numerical tests conducted on a class of toy examples and a more realistic hydraulic case-study.
Keywords: monotone function, deterministic computer codes, Monte Carlo acceleration, failure probability
Submitted 18 May, 2012; v1 submitted 5 December, 2010;
originally announced December 2010.
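The "traditional Monte Carlo approach" that the abstract above uses as its baseline can be sketched as follows. This is only the crude baseline estimator of a failure probability, not the accelerated method of the paper; the toy function `g`, the sampler `sample_input`, and the sample size are hypothetical choices made for illustration.

```python
import random


def crude_monte_carlo(g, sample_input, n, seed=0):
    """Crude Monte Carlo estimate of p = P(g(X) < 0) with its standard error."""
    rng = random.Random(seed)
    # Count the draws that fall in the failure domain {g(x) < 0}.
    failures = sum(1 for _ in range(n) if g(sample_input(rng)) < 0)
    p_hat = failures / n
    # Binomial standard error of the empirical frequency.
    std_err = (p_hat * (1.0 - p_hat) / n) ** 0.5
    return p_hat, std_err


def g(x):
    # Hypothetical toy code, increasing in each input: failure when x0 + x1 < 0.5.
    return x[0] + x[1] - 0.5


def sample_input(rng):
    # Hypothetical input law: two independent uniforms on [0, 1].
    return (rng.random(), rng.random())


p_hat, std_err = crude_monte_carlo(g, sample_input, n=20000)
print(f"p_hat = {p_hat:.4f} (standard error {std_err:.4f})")
```

For this toy setting the exact value is 0.125 (the area of the triangle below the line x0 + x1 = 0.5), which gives a reference point for the estimate. Each evaluation of `g` is spent independently of the others, which is the cost that a monotonicity-aware design aims to reduce.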
-
Estimating Discrete Markov Models From Various Incomplete Data Schemes
Authors:
Alberto Pasanisi,
Shuai Fu,
Nicolas Bousquet
Abstract:
The parameters of a discrete stationary Markov model are transition probabilities between states. Traditionally, data consist of sequences of observed states for a given number of individuals over the whole observation period. In such a case, the estimation of transition probabilities is straightforwardly made by counting one-step moves from a given state to another. In many real-life problems, however, the inference is much more difficult as state sequences are not fully observed, namely the state of each individual is known only for some given values of the time variable. A review of the problem is given, focusing on Markov chain Monte Carlo (MCMC) algorithms to perform Bayesian inference and evaluate posterior distributions of the transition probabilities in this missing-data framework. Leaning on the dependence between the rows of the transition matrix, an adaptive MCMC mechanism accelerating the classical Metropolis-Hastings algorithm is then proposed and empirically studied.
Submitted 22 February, 2012; v1 submitted 7 September, 2010;
originally announced September 2010.
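The fully observed case described in the abstract above, where transition probabilities are estimated by counting one-step moves, can be sketched as follows. This is a minimal illustration of the counting estimator only, not of the MCMC schemes for incomplete data; the function name `estimate_transition_matrix` and the toy sequences are hypothetical.

```python
from collections import Counter


def estimate_transition_matrix(sequences, states):
    """Estimate transition probabilities by counting one-step moves."""
    counts = Counter()
    for seq in sequences:
        # Each consecutive pair (a, b) is one observed move from a to b.
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    matrix = {}
    for a in states:
        row_total = sum(counts[(a, b)] for b in states)
        # Rows with no observed departures are left at zero (an arbitrary choice).
        matrix[a] = {
            b: (counts[(a, b)] / row_total if row_total else 0.0) for b in states
        }
    return matrix


# Hypothetical toy data: two individuals observed at every time step.
sequences = [["A", "A", "B"], ["B", "A", "B", "B"]]
P = estimate_transition_matrix(sequences, ["A", "B"])
print(P)
```

When some time steps are unobserved, the consecutive pairs are no longer available and this count cannot be formed directly, which is the situation the paper addresses.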
-
Elicitation of Weibull priors
Authors:
Nicolas Bousquet
Abstract:
Based on expert opinions, informative prior elicitation for the common Weibull lifetime distribution usually presents some difficulties since it requires eliciting a two-dimensional joint prior. We consider here a reliability framework where the available expert information is stated directly in terms of prior predictive values (lifetimes) and not parameter values, which are less intuitive. The novelty of our procedure is to weigh the expert information by the size m of a virtual sample yielding similar information, the prior being seen as a reference posterior. Thus, the prior calibration by the Bayesian analyst, who has to moderate the subjective information with respect to the data information, is made simple. A main result is the full tractability of the prior under mild conditions, despite the conjugation issues encountered with the Weibull distribution. Besides, m is a practical focus point for discussion between analysts and experts, and a helpful parameter for conducting sensitivity studies and reducing the potential imbalance in posterior selection between Bayesian Weibull models, which can be due to arbitrarily favoring a prior. The calibration of m is discussed and a real example is treated throughout the paper.
Submitted 21 October, 2010; v1 submitted 27 July, 2010;
originally announced July 2010.