Disentangled Deep Smoothed Bootstrap for Fair Imbalanced Regression

Authors: Samuel Stocksieker, Denys Pommeret, Arthur Charpentier

Abstract: Imbalanced distribution learning is a common and significant challenge in predictive modeling, often reducing the performance of standard algorithms. Although various approaches address this issue, most are tailored to classification problems, with a limited focus on regression. This paper introduces a novel method to improve learning on tabular data within the Imbalanced Regression (IR) framework, which is a critical problem. We propose using Variational Autoencoders (VAEs) to model and define a latent representation of data distributions. However, VAEs can be inefficient with imbalanced data like other standard approaches. To address this, we develop an innovative data generation method that combines a disentangled VAE with a Smoothed Bootstrap applied in the latent space. We evaluate the efficiency of this method through numerical comparisons with competitors on benchmark datasets for IR.

Submitted 19 August, 2025; originally announced August 2025.

arXiv:2507.03628 [pdf, ps, other]

When Numbers Mislead Us

Authors: Arthur Charpentier

Abstract: The belief that numbers offer a single, objective description of reality overlooks a crucial truth: data does not speak for itself. Every dataset results from choices (what to measure, how, when, and with whom) which inevitably reflect implicit, and sometimes ideological, assumptions about what is worth quantifying. Moreover, in any analysis, what remains unmeasured can be just as significant as what is captured. When a key variable is omitted, whether by neglect, design, or ignorance, it can distort the observed relationships between other variables. This phenomenon, known as omitted variable bias, may produce misleading correlations or conceal genuine effects. In some cases, accounting for this hidden factor can completely overturn the conclusions drawn from a superficial analysis. This is precisely the mechanism behind Simpson's paradox.

Submitted 4 July, 2025; originally announced July 2025.
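
To make the reversal described in the abstract above concrete, here is a minimal sketch of Simpson's paradox. The counts, the treatment labels "A" and "B", and the stratum names are illustrative assumptions and are not taken from the paper: treatment A has the higher success rate within each stratum, yet the lower rate once the strata are pooled, because the stratum variable is omitted.

```python
# Illustrative (successes, trials) counts per treatment and stratum.
# These numbers are an assumption for demonstration, not data from the paper.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def stratum_rates(table):
    """Success rate of each treatment within each stratum."""
    return {
        treatment: {name: s / n for name, (s, n) in strata.items()}
        for treatment, strata in table.items()
    }


def pooled_rates(table):
    """Success rate of each treatment when the stratum variable is ignored."""
    rates = {}
    for treatment, strata in table.items():
        successes = sum(s for s, _ in strata.values())
        trials = sum(n for _, n in strata.values())
        rates[treatment] = successes / trials
    return rates


by_stratum = stratum_rates(counts)
pooled = pooled_rates(counts)

for stratum in ("small", "large"):
    print(stratum, {t: round(by_stratum[t][stratum], 3) for t in counts})
print("pooled", {t: round(r, 3) for t, r in pooled.items()})
```

The reversal arises here because treatment A is applied mostly to the harder "large" stratum, so its pooled rate is dragged down even though it does better in both strata.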

arXiv:2506.13900 [pdf, ps, other]

Beyond Shapley Values: Cooperative Games for the Interpretation of Machine Learning Models

Authors: Marouane Il Idrissi, Agathe Fernandes Machado, Arthur Charpentier

Abstract: Cooperative game theory has become a cornerstone of post-hoc interpretability in machine learning, largely through the use of Shapley values. Yet, despite their widespread adoption, Shapley-based methods often rest on axiomatic justifications whose relevance to feature attribution remains debatable. In this paper, we revisit cooperative game theory from an interpretability perspective and argue for a broader and more principled use of its tools. We highlight two general families of efficient allocations, the Weber and Harsanyi sets, that extend beyond Shapley values and offer richer interpretative flexibility. We present an accessible overview of these allocation schemes, clarify the distinction between value functions and aggregation rules, and introduce a three-step blueprint for constructing reliable and theoretically grounded feature attributions. Our goal is to move beyond fixed axioms and provide the XAI community with a coherent framework to design attribution methods that are both meaningful and robust to shifting methodological trends.

Submitted 16 June, 2025; originally announced June 2025.
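
As background for the Shapley values that the abstract above takes as its starting point, the following sketch computes them exactly for a small cooperative game by enumerating coalitions. It illustrates only the classical Shapley allocation, not the Weber or Harsanyi sets discussed in the paper. The function names `shapley_values` and `toy_value`, the feature names, and the toy value function are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition (exponential cost)."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain player i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                coalition = frozenset(subset)
                total += weight * (value(coalition | {i}) - value(coalition))
        phi[i] = total
    return phi


def toy_value(coalition):
    """Hypothetical value function: additive effects plus an x1-x2 interaction."""
    v = 0.0
    if "x1" in coalition:
        v += 1.0
    if "x2" in coalition:
        v += 2.0
    if "x1" in coalition and "x2" in coalition:
        v += 1.0  # interaction term, split equally by the Shapley allocation
    return v


features = ["x1", "x2", "x3"]
phi = shapley_values(features, toy_value)
print(phi)
```

In this toy game the attributions sum to the value of the full coalition (the efficiency property), and the feature that never changes the value receives zero.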

arXiv:2505.13118 [pdf, other]

Unveil Sources of Uncertainty: Feature Contribution to Conformal Prediction Intervals

Authors: Marouane Il Idrissi, Agathe Fernandes Machado, Ewen Gallic, Arthur Charpentier

Abstract: Cooperative game theory methods, notably Shapley values, have significantly enhanced machine learning (ML) interpretability. However, existing explainable AI (XAI) frameworks mainly attribute average model predictions, overlooking predictive uncertainty. This work addresses that gap by proposing a novel, model-agnostic uncertainty attribution (UA) method grounded in conformal prediction (CP). By defining cooperative games where CP interval properties, such as width and bounds, serve as value functions, we systematically attribute predictive uncertainty to input features. Extending beyond the traditional Shapley values, we use the richer class of Harsanyi allocations, and in particular the proportional Shapley values, which distribute attribution proportionally to feature importance. We propose a Monte Carlo approximation method with robust statistical guarantees to address computational feasibility, significantly improving runtime efficiency. Our comprehensive experiments on synthetic benchmarks and real-world datasets demonstrate the practical utility and interpretative depth of our approach. By combining cooperative game theory and conformal prediction, we offer a rigorous, flexible toolkit for understanding and communicating predictive uncertainty in high-stakes ML applications.

Submitted 19 May, 2025; originally announced May 2025.

arXiv:2501.15549 [pdf, other]

Optimal Transport on Categorical Data for Counterfactuals using Compositional Data and Dirichlet Transport

Authors: Agathe Fernandes Machado, Arthur Charpentier, Ewen Gallic

Abstract: Recently, optimal transport-based approaches have gained attention for deriving counterfactuals, e.g., to quantify algorithmic discrimination. However, in the general multivariate setting, these methods are often opaque and difficult to interpret. To address this, alternative methodologies have been proposed, using causal graphs combined with iterative quantile regressions (Plečko and Meinshausen (2020)) or sequential transport (Fernandes Machado et al. (2025)) to examine fairness at the individual level, often referred to as "counterfactual fairness." Despite these advancements, transporting categorical variables remains a significant challenge in practical applications with real datasets. In this paper, we propose a novel approach to address this issue. Our method involves (1) converting categorical variables into compositional data and (2) transporting these compositions within the probabilistic simplex of $\mathbb{R}^d$. We demonstrate the applicability and effectiveness of this approach through an illustration on real-world data, and discuss limitations.

Submitted 20 May, 2025; v1 submitted 26 January, 2025; originally announced January 2025.

arXiv:2408.03425 [pdf, other]

Sequential Conditional Transport on Probabilistic Graphs for Interpretable Counterfactual Fairness

Authors: Agathe Fernandes Machado, Arthur Charpentier, Ewen Gallic

Abstract: In this paper, we link two existing approaches to derive counterfactuals: adaptations based on a causal graph, and optimal transport. We extend "Knothe's rearrangement" and "triangular transport" to probabilistic graphical models, and use this counterfactual approach, referred to as sequential transport, to discuss fairness at the individual level. After establishing the theoretical foundations of the proposed method, we demonstrate its application through numerical experiments on both synthetic and real datasets.

Submitted 28 April, 2025; v1 submitted 6 August, 2024; originally announced August 2024.

arXiv:2408.03421 [pdf, other]

Probabilistic Scores of Classifiers, Calibration is not Enough

Authors: Agathe Fernandes Machado, Arthur Charpentier, Emmanuel Flachaire, Ewen Gallic, François Hu

Abstract: In binary classification tasks, accurate representation of probabilistic predictions is essential for various real-world applications such as predicting payment defaults or assessing medical risks. The model must then be well-calibrated to ensure alignment between predicted probabilities and actual outcomes. However, when score heterogeneity deviates from the underlying data probability distribution, traditional calibration metrics lose reliability, failing to align score distribution with actual probabilities. In this study, we highlight approaches that prioritize optimizing the alignment between predicted scores and true probability distributions over minimizing traditional performance or calibration metrics. When employing tree-based models such as Random Forest and XGBoost, our analysis emphasizes the flexibility these models offer in tuning hyperparameters to minimize the Kullback-Leibler (KL) divergence between predicted and true distributions. Through extensive empirical analysis across 10 UCI datasets and simulations, we demonstrate that optimizing tree-based models based on KL divergence yields superior alignment between predicted scores and actual probabilities without significant performance loss. In real-world scenarios, the reference probability is determined a priori as a Beta distribution estimated through maximum likelihood. Conversely, minimizing traditional calibration metrics may lead to suboptimal results, characterized by notable performance declines and inferior KL values. Our findings reveal limitations in traditional calibration metrics, which could undermine the reliability of predictive models for critical decision-making.

Submitted 6 August, 2024; originally announced August 2024.

arXiv:2403.15790 [pdf, other]

Boarding for ISS: Imbalanced Self-Supervised: Discovery of a Scaled Autoencoder for Mixed Tabular Datasets

Authors: Samuel Stocksieker, Denys Pommeret, Arthur Charpentier

Abstract: The field of imbalanced self-supervised learning, especially in the context of tabular data, has not been extensively studied. Existing research has predominantly focused on image datasets. This paper aims to fill this gap by examining the specific challenges posed by data imbalance in self-supervised learning in the domain of tabular data, with a primary focus on autoencoders. Autoencoders are widely employed for learning and constructing a new representation of a dataset, particularly for dimensionality reduction. They are also often used for generative model learning, as seen in variational autoencoders. When dealing with mixed tabular data, qualitative variables are often encoded using a one-hot encoder with a standard loss function (MSE or Cross Entropy). In this paper, we analyze the drawbacks of this approach, especially when categorical variables are imbalanced. We propose a novel metric to balance learning: a Multi-Supervised Balanced MSE. This approach reduces the reconstruction error by balancing the influence of variables. Finally, we empirically demonstrate that this new metric, compared to the standard MSE: i) outperforms when the dataset is imbalanced, especially when the learning process is insufficient, and ii) provides similar results in the opposite case.

Submitted 23 March, 2024; originally announced March 2024.

arXiv:2311.11900 [pdf, other]

Measuring and Mitigating Biases in Motor Insurance Pricing

Authors: Mulah Moriah, Franck Vermet, Arthur Charpentier

Abstract: The non-life insurance sector operates within a highly competitive and tightly regulated framework, confronting a pivotal juncture in the formulation of pricing strategies. Insurers are compelled to harness a range of statistical methodologies and available data to construct optimal pricing structures that align with the overarching corporate strategy while accommodating the dynamics of market competition. Given the fundamental societal role played by insurance, premium rates are subject to rigorous scrutiny by regulatory authorities. These rates must conform to principles of transparency, explainability, and ethical considerations. Consequently, the act of pricing transcends mere statistical calculations and carries the weight of strategic and societal factors. These multifaceted concerns may drive insurers to establish equitable premiums, taking into account various variables. For instance, regulations mandate the provision of equitable premiums, considering factors such as policyholder gender or mutualist group dynamics in accordance with respective corporate strategies. Age-based premium fairness is also mandated. In certain insurance domains, variables such as the presence of serious illnesses or disabilities are emerging as new dimensions for evaluating fairness. Regardless of the motivating factor prompting an insurer to adopt fairer pricing strategies for a specific variable, the insurer must possess the capability to define, measure, and ultimately mitigate any ethical biases inherent in its pricing practices while upholding standards of consistency and performance. This study seeks to provide a comprehensive set of tools for these endeavors and assess their effectiveness through practical application in the context of automobile insurance.

Submitted 20 June, 2024; v1 submitted 20 November, 2023; originally announced November 2023.

arXiv:2310.20508 [pdf, other]

Parametric Fairness with Statistical Guarantees

Authors: François Hu, Philipp Ratz, Arthur Charpentier

Abstract: Algorithmic fairness has gained prominence due to societal and regulatory concerns about biases in Machine Learning models. Common group fairness metrics like Equalized Odds for classification or Demographic Parity for both classification and regression are widely used and a host of computationally advantageous post-processing methods have been developed around them. However, these metrics often limit users from incorporating domain knowledge. Despite meeting traditional fairness criteria, they can obscure issues related to intersectional fairness and even replicate unwanted intra-group biases in the resulting fair solution. To avoid this narrow perspective, we extend the concept of Demographic Parity to incorporate distributional properties in the predictions, allowing expert knowledge to be used in the fair solution. We illustrate the use of this new metric through a practical example of wages, and develop a parametric method that efficiently addresses practical challenges like limited training data and constraints on total spending, offering a robust solution for real-life applications.

Submitted 31 October, 2023; originally announced October 2023.

arXiv:2309.06627 [pdf, other]

doi 10.1609/aaai.v38i11.29143

A Sequentially Fair Mechanism for Multiple Sensitive Attributes

Authors: François Hu, Philipp Ratz, Arthur Charpentier

Abstract: In the standard use case of Algorithmic Fairness, the goal is to eliminate the relationship between a sensitive variable and a corresponding score. Throughout recent years, the scientific community has developed a host of definitions and tools to solve this task, which work well in many practical applications. However, the applicability and effectiveness of these tools and definitions become less straightforward in the case of multiple sensitive attributes. To tackle this issue, we propose a sequential framework, which allows fairness to be achieved progressively across a set of sensitive features. We accomplish this by leveraging multi-marginal Wasserstein barycenters, which extend the standard notion of Strong Demographic Parity to the case with multiple sensitive characteristics. This method also provides a closed-form solution for the optimal, sequentially fair predictor, permitting a clear interpretation of inter-sensitive feature correlations. Our approach seamlessly extends to approximate fairness, enveloping a framework accommodating the trade-off between risk and unfairness. This extension permits a targeted prioritization of fairness improvements for a specific attribute within a set of sensitive attributes, allowing for a case-specific adaptation. A data-driven estimation procedure for the derived solution is developed, and comprehensive numerical experiments are conducted on both synthetic and real datasets. Our empirical findings decisively underscore the practical efficacy of our post-processing approach in fostering fair decision-making.

Submitted 14 January, 2024; v1 submitted 12 September, 2023; originally announced September 2023.

arXiv:2308.11090 [pdf, other]

Fairness Explainability using Optimal Transport with Applications in Image Classification

Authors: Philipp Ratz, François Hu, Arthur Charpentier

Abstract: Ensuring trust and accountability in Artificial Intelligence systems demands explainability of their outcomes. Despite significant progress in Explainable AI, human biases still taint a substantial portion of its training data, raising concerns about unfairness or discriminatory tendencies. Current approaches in the field of Algorithmic Fairness focus on mitigating such biases in the outcomes of a model, but few attempts have been made to try to explain why a model is biased. To bridge this gap between the two fields, we propose a comprehensive approach that uses optimal transport theory to uncover the causes of discrimination in Machine Learning applications, with a particular emphasis on image classification. We leverage Wasserstein barycenters to achieve fair predictions and introduce an extension to pinpoint bias-associated regions. This allows us to derive a cohesive system which uses the enforced fairness to measure each feature's influence on the bias. Taking advantage of this interplay of enforcing and explaining fairness, our method holds significant implications for the development of trustworthy and unbiased AI systems, fostering transparency, accountability, and fairness in critical decision-making scenarios across diverse domains.

Submitted 31 October, 2023; v1 submitted 21 August, 2023; originally announced August 2023.

arXiv:2308.02966 [pdf, other]

Generalized Oversampling for Learning from Imbalanced datasets and Associated Theory

Authors: Samuel Stocksieker, Denys Pommeret, Arthur Charpentier

Abstract: In supervised learning, it is quite frequent to be confronted with real imbalanced datasets. This situation leads to a learning difficulty for standard algorithms. Research and solutions in imbalanced learning have mainly focused on classification tasks. Despite its importance, very few solutions exist for imbalanced regression. In this paper, we propose a data augmentation procedure, the GOLIATH algorithm, based on kernel density estimates which can be used in classification and regression. This general approach encompasses two large families of synthetic oversampling: those based on perturbations, such as Gaussian Noise, and those based on interpolations, such as SMOTE. It also provides an explicit form of these machine learning algorithms and an expression of their conditional densities, in particular for SMOTE. New synthetic data generators are deduced. We apply GOLIATH in imbalanced regression combining such generator procedures with a wild-bootstrap resampling technique for the target values. We evaluate the performance of the GOLIATH algorithm in imbalanced regression situations. We empirically evaluate and compare our approach and demonstrate significant improvement over existing state-of-the-art techniques.

Submitted 5 August, 2023; originally announced August 2023.

Comments: This paper focuses specifically on the Imbalanced Regression issues but could be used for Imbalanced classification tasks

arXiv:2306.12912 [pdf, other]

Mitigating Discrimination in Insurance with Wasserstein Barycenters

Authors: Arthur Charpentier, François Hu, Philipp Ratz

Abstract: The insurance industry is heavily reliant on predictions of risks based on characteristics of potential customers. Although the use of said models is common, researchers have long pointed out that such practices perpetuate discrimination based on sensitive features such as gender or race. Given that such discrimination can often be attributed to historical data biases, an elimination or at least mitigation is desirable. With the shift from more traditional models to machine-learning based predictions, calls for greater mitigation have grown anew, as simply excluding sensitive variables in the pricing process can be shown to be ineffective. In this article, we first investigate why predictions are a necessity within the industry and why correcting biases is not as straightforward as simply identifying a sensitive variable. We then propose to ease the biases through the use of Wasserstein barycenters instead of simple scaling. To demonstrate the effects and effectiveness of the approach we employ it on real data and discuss its implications.

Submitted 22 June, 2023; originally announced June 2023.
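
The idea of easing bias with a Wasserstein barycenter, mentioned in the abstract above, can be sketched in one dimension, where the barycenter has a simple quantile form: each score is sent to its within-group quantile level and then replaced by the group-weighted average of all groups' quantile functions at that level. The code below is a minimal sketch under that assumption and is not the authors' implementation; the function name `barycenter_scores` and the simulated data are hypothetical.

```python
import numpy as np


def barycenter_scores(scores, groups):
    """Map scores to the 1D Wasserstein barycenter of the group-wise distributions.

    Assumes continuous scores (no ties) and uses empirical quantile functions.
    """
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    labels, counts = np.unique(groups, return_counts=True)
    weights = counts / counts.sum()  # group proportions used as barycenter weights

    fair = np.empty_like(scores)
    for g in labels:
        mask = groups == g
        x = scores[mask]
        # Mid-rank quantile level of each score within its own group.
        ranks = np.argsort(np.argsort(x))
        u = (ranks + 0.5) / len(x)
        # Weighted average of every group's empirical quantile function at level u.
        fair[mask] = sum(
            w * np.quantile(scores[groups == h], u) for h, w in zip(labels, weights)
        )
    return fair


# Simulated scores: two groups of equal size whose means differ by about 2.
rng = np.random.default_rng(0)
groups = np.repeat([0, 1], 500)
scores = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(2.0, 1.0, 500)])

fair = barycenter_scores(scores, groups)
gap_before = abs(scores[groups == 0].mean() - scores[groups == 1].mean())
gap_after = abs(fair[groups == 0].mean() - fair[groups == 1].mean())
print(f"mean gap before: {gap_before:.3f}, after: {gap_after:.2e}")
```

Because both simulated groups have the same size, their transformed scores share exactly the same empirical distribution, so the score no longer carries information about group membership while the within-group ranking is preserved.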

arXiv:2306.10155 [pdf, other]

doi 10.1007/978-3-031-43415-0_18

Fairness in Multi-Task Learning via Wasserstein Barycenters

Authors: François Hu, Philipp Ratz, Arthur Charpentier

Abstract: Algorithmic Fairness is an established field in machine learning that aims to reduce biases in data. Recent advances have proposed various methods to ensure fairness in a univariate environment, where the goal is to de-bias a single task. However, extending fairness to a multi-task setting, where more than one objective is optimised using a shared representation, remains underexplored. To bridge this gap, we develop a method that extends the definition of Strong Demographic Parity to multi-task learning using multi-marginal Wasserstein barycenters. Our approach provides a closed form solution for the optimal fair multi-task predictor including both regression and binary classification tasks. We develop a data-driven estimation procedure for the solution and run numerical experiments on both synthetic and real datasets. The empirical results highlight the practical value of our post-processing methodology in promoting fair decision-making.

Submitted 6 July, 2023; v1 submitted 16 June, 2023; originally announced June 2023.

arXiv:2302.09288 [pdf, other]

Data Augmentation for Imbalanced Regression

Authors: Samuel Stocksieker, Denys Pommeret, Arthur Charpentier

Abstract: In this work, we consider the problem of imbalanced data in a regression framework when the imbalanced phenomenon concerns continuous or discrete covariates. Such a situation can lead to biases in the estimates. In this case, we propose a data augmentation algorithm that combines a weighted resampling (WR) and a data augmentation (DA) procedure. In a first step, the DA procedure permits exploring a wider support than the initial one. In a second step, the WR method drives the exogenous distribution to a target one. We discuss the choice of the DA procedure through a numerical study that illustrates the advantages of this approach. Finally, an actuarial application is studied.

Submitted 18 February, 2023; originally announced February 2023.

Comments: paper accepted at the AISTATS 2023 conference, to be published in PMLR (Proceedings of Machine Learning Research)

arXiv:2202.12008 [pdf, other]

A Fair Pricing Model via Adversarial Learning

Authors: Vincent Grari, Arthur Charpentier, Marcin Detyniecki

Abstract: At the core of insurance business lies classification between risky and non-risky insureds, actuarial fairness meaning that risky insureds should contribute more and pay a higher premium than non-risky or less-risky ones. Actuaries, therefore, use econometric or machine learning techniques to classify, but the distinction between a fair actuarial classification and "discrimination" is subtle. For this reason, there is a growing interest in fairness and discrimination in the actuarial community (Lindholm, Richman, Tsanakas, and Wuthrich (2022)). Presumably, non-sensitive characteristics can serve as substitutes or proxies for protected attributes. For example, the color and model of a car, combined with the driver's occupation, may lead to an undesirable gender bias in the prediction of car insurance prices. Surprisingly, we will show that (1) debiasing the predictor alone may be insufficient to maintain adequate accuracy. Indeed, the traditional pricing model is currently built in a two-stage structure that considers many potentially biased components such as car or geographic risks. We will show that this traditional structure has significant limitations in achieving fairness. For this reason, we have developed a novel pricing model approach. Recently, some approaches (Blier-Wong, Cossette, Lamontagne, and Marceau (2021); Wuthrich and Merz (2021)) have shown the value of autoencoders in pricing. In this paper, we will show that (2) this can be generalized to multiple pricing factors (geographic, car type), and (3) it is well suited to a fairness context (since it allows debiasing the set of pricing components). We extend this main idea to a general framework in which a single whole pricing model is trained by generating the geographic and car pricing components needed to predict the pure premium while mitigating the unwanted bias according to the desired metric.

Submitted 26 December, 2022; v1 submitted 24 February, 2022; originally announced February 2022.

Comments: 20 pages, 12 figures

arXiv:2107.07668 [pdf, other]

doi 10.5194/nhess-22-2401-2022

Predicting Drought and Subsidence Risks in France

Authors: Arthur Charpentier, Molly James, Hani Ali

Abstract: The economic consequences of drought episodes are increasingly important, although they are often difficult to apprehend, in part because of the complexity of the underlying mechanisms. In this article, we study one of the consequences of drought, namely the risk of subsidence (or more specifically clay-shrinkage-induced subsidence), for which insurance has been mandatory in France for several decades. Using data obtained from several insurers, representing about a quarter of the household insurance market over the past twenty years, we propose statistical models to predict both the frequency and the intensity of these droughts for insurers, showing that climate change will probably have major economic consequences for this risk. But even with models more advanced than standard regression-type models (here, random forests, to capture nonlinearity and cross effects), it remains difficult to predict the economic cost of subsidence claims, even if all geophysical and climatic information is available.

Submitted 15 July, 2021; originally announced July 2021.

arXiv:2103.03635 [pdf, other]

Autocalibration and Tweedie-dominance for Insurance Pricing with Machine Learning

Authors: Michel Denuit, Arthur Charpentier, Julien Trufin

Abstract: Boosting techniques and neural networks are particularly effective machine learning methods for insurance pricing. Often in practice, there are nevertheless endless debates about the choice of the right loss function to be used to train the machine learning model, as well as about the appropriate metric to assess the performances of competing models. Also, the sum of fitted values can depart from the observed totals to a large extent and this often confuses actuarial analysts. The lack of balance inherent to training models by minimizing deviance outside the familiar GLM with canonical link setting has been empirically documented in Wüthrich (2019, 2020) who attributes it to the early stopping rule in gradient descent methods for model fitting. The present paper aims to further study this phenomenon when learning proceeds by minimizing Tweedie deviance. It is shown that minimizing deviance involves a trade-off between the integral of weighted differences of lower partial moments and the bias measured on a specific scale. Autocalibration is then proposed as a remedy. This new method to correct for bias adds an extra local GLM step to the analysis. Theoretically, it is shown that it implements the autocalibration concept in pure premium calculation and ensures that balance also holds on a local scale, not only at portfolio level as with existing bias-correction techniques. The convex order appears to be the natural tool to compare competing models, putting a new light on the diagnostic graphs and associated metrics proposed by Denuit et al. (2019).

Submitted 9 July, 2021; v1 submitted 5 March, 2021; originally announced March 2021.

arXiv:2006.08446 [pdf, other]

Modeling Joint Lives within Families

Authors: Olivier Cabrignac, Arthur Charpentier, Ewen Gallic

Abstract: Family history is usually seen as a significant factor that insurance companies look at when someone applies for a life insurance policy. Where it is used, family history of cardiovascular diseases, death by cancer, or family history of high blood pressure and diabetes could result in higher premiums or no coverage at all. In this article, we use massive (historical) data to study dependencies between life lengths within families. While joint life contracts (between a husband and a wife) have long been studied in the actuarial literature, little is known about dependencies between children and parents. We illustrate those dependencies using 19th century family trees in France, and quantify implications in annuities computations. For parents and children, we observe a modest but significant positive association between life lengths. It yields different estimates for remaining life expectancy, present values of annuities, or whole life insurance guarantee, given information about the parents (such as the number of parents alive). A similar but weaker pattern is observed when using information on grandparents.

Submitted 15 June, 2020; originally announced June 2020.

arXiv:1912.11736 [pdf, other]

Pareto models for risk management

Authors: Arthur Charpentier, Emmanuel Flachaire

Abstract: The Pareto model is very popular in risk management, since simple analytical formulas can be derived for financial downside risk measures (Value-at-Risk, Expected Shortfall) or reinsurance premiums and related quantities (Large Claim Index, Return Period). Nevertheless, in practice, distributions are (strictly) Pareto only in the tails, above a (possibly very) large threshold. Therefore, it could be interesting to take into account second-order behavior to provide a better fit. In this article, we present how to go from a strict Pareto model to Pareto-type distributions. We discuss inference, derive formulas for various measures and indices, and finally provide applications on insurance losses and financial risks.

Submitted 25 December, 2019; originally announced December 2019.
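
As an illustration of the "simple analytical formulas" mentioned in the abstract above, here is a minimal sketch for the strict Pareto case only, assuming a survival function P(X > x) = (x / u)^(-alpha) for x >= u. The function names and parameter names (`threshold` for u, `alpha` for the tail index) are hypothetical and chosen for this example; the sketch does not cover the Pareto-type (second-order) extensions the paper develops.

```python
def pareto_var(p, threshold, alpha):
    """Value-at-Risk at level p for a strict Pareto tail.

    Solves P(X > x) = (x / threshold) ** (-alpha) = 1 - p for x.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if threshold <= 0.0 or alpha <= 0.0:
        raise ValueError("threshold and alpha must be positive")
    return threshold * (1.0 - p) ** (-1.0 / alpha)


def pareto_expected_shortfall(p, threshold, alpha):
    """Expected Shortfall at level p: mean loss beyond the VaR.

    For a strict Pareto tail this is alpha / (alpha - 1) times the VaR,
    which is finite only when alpha > 1.
    """
    if alpha <= 1.0:
        raise ValueError("expected shortfall is infinite when alpha <= 1")
    return alpha / (alpha - 1.0) * pareto_var(p, threshold, alpha)
```

With `threshold=1.0` and `alpha=2.0`, the 99% Value-at-Risk is 10 and the Expected Shortfall is twice that, which shows how strongly the tail index drives both measures.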

arXiv:1905.10267 [pdf, other]

Extended Scale-Free Networks

Authors: Arthur Charpentier, Emmanuel Flachaire

Abstract: Recently, Broido & Clauset (2019) mentioned that (strict) scale-free networks were rare in real life. This might be related to the statement of Stumpf, Wiuf & May (2005) that sub-networks of scale-free networks are not scale-free. In the latter, those sub-networks are asymptotically scale-free, but one should not forget about second-order deviation (possibly also third-order, actually). In this article, we introduce a concept of extended scale-free network, inspired by the extended Pareto distribution, which may be more realistic for describing real networks than the strict scale-free property. This property is consistent with Stumpf, Wiuf & May (2005): sub-networks of larger scale-free networks are not strictly scale-free, but extended scale-free.

Submitted 28 May, 2019; v1 submitted 24 May, 2019; originally announced May 2019.

arXiv:1810.09214 [pdf, other]

A new GEE method to account for heteroscedasticity, using asymmetric least-square regressions

Authors: Amadou Barry, Karim Oualkacha, Arthur Charpentier

Abstract: Generalized estimating equations (GEE) are widely used to analyze longitudinal data; however, they are not appropriate for heteroscedastic data, because they only estimate regressor effects on the mean response, and therefore do not account for data heterogeneity. Here, we combine the GEE with the asymmetric least squares (expectile) regression to derive a new class of estimators, which we call generalized expectile estimating equations (GEEE). The GEEE model estimates regressor effects on the expectiles of the response distribution, which provides a detailed view of regressor effects on the entire response distribution. In addition to capturing data heteroscedasticity, the GEEE extends the various working correlation structures to account for within-subject dependence. We derive the asymptotic properties of the GEEE estimators and propose a robust estimator of their covariance matrix for inference (see our R package, github.com/AmBarry/expectgee). Our simulations show that the GEEE estimator is unbiased and efficient, and our real data analysis shows it captures heteroscedasticity.

Submitted 24 December, 2020; v1 submitted 22 October, 2018; originally announced October 2018.

Comments: 40 pages, 14 figures; all sections modified

arXiv:1708.06992 [pdf, other]

Econométrie et Machine Learning

Authors: Arthur Charpentier, Emmanuel Flachaire, Antoine Ly

Abstract: Econometrics and machine learning seem to have one common goal: to construct a predictive model, for a variable of interest, using explanatory variables (or features). However, these two fields developed in parallel, thus creating two different cultures, to paraphrase Breiman (2001). The first was to build probabilistic models to describe economic phenomena. The second uses algorithms that learn from their mistakes, most often with the aim of classifying (sounds, images, etc.). Recently, however, learning models have proven to be more effective than traditional econometric techniques (at the price of less explanatory power), and above all, they manage to handle much larger datasets. In this context, it becomes necessary for econometricians to understand what these two cultures are, what opposes them and especially what brings them closer together, in order to appropriate tools developed by the statistical learning community and integrate them into econometric models.

Submitted 19 March, 2018; v1 submitted 26 July, 2017; originally announced August 2017.

Comments: in French

arXiv:1707.07607 [pdf, other]

We are not alone ! (at least, most of us). Homonymy in large scale social groups

Authors: Arthur Charpentier, Baptiste Coulmont

Abstract: This article brings forward an estimation of the proportion of homonyms in large scale groups based on the distribution of first names and last names in a subset of these groups. The estimation is based on the generalization of the "birthday paradox problem". The main result is that, in societies such as France or the United States, identity collisions (based on first + last names) are frequent. The large majority of the population has at least one homonym. But in smaller settings, it is much less frequent: even if small groups of a few thousand people have at least one pair of homonyms, only a few individuals have a homonym.

Submitted 24 July, 2017; originally announced July 2017.
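
The contrast drawn in the abstract above (a group almost surely contains some pair of homonyms, yet few individuals have one) can be seen in a small birthday-problem calculation. The sketch below is an illustration under simplifying assumptions, not the paper's estimator: names are drawn independently, `prob_any_collision` assumes `k` equally likely names, and `expected_share_with_homonym` takes an arbitrary list of name probabilities. Both function names are hypothetical.

```python
def prob_any_collision(n, k):
    """Probability that at least two of n people share a name,
    assuming k equally likely names drawn independently."""
    prob_all_distinct = 1.0
    for i in range(n):
        # The (i+1)-th person must avoid the i names already taken.
        prob_all_distinct *= max(0.0, 1.0 - i / k)
    return 1.0 - prob_all_distinct


def expected_share_with_homonym(n, name_probs):
    """Expected share of individuals who have at least one homonym
    among the n - 1 others, for independent draws from name_probs."""
    return sum(p * (1.0 - (1.0 - p) ** (n - 1)) for p in name_probs)
```

In the classical setting of 23 people and 365 equally likely labels, `prob_any_collision(23, 365)` is roughly 0.51, while `expected_share_with_homonym(23, [1 / 365] * 365)` is only about 0.06: a collision somewhere in the group is likely even though a given individual rarely has a match.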

arXiv:1602.08773 [pdf, other]

Macro vs. Micro Methods in Non-Life Claims Reserving (an Econometric Perspective)

Authors: Arthur Charpentier, Mathieu Pigeon

Abstract: Traditionally, actuaries have used run-off triangles to estimate reserves ("macro" models, on aggregated data). But it is possible to model payments related to individual claims. Although those models provide similar estimates, we investigate the uncertainty related to reserves with "macro" and "micro" models. We study theoretical properties of econometric models (Gaussian, Poisson and quasi-Poisson) on individual data and on clustered data. Finally, applications to claims reserving are considered.

Submitted 28 February, 2016; originally announced February 2016.

arXiv:1404.4414 [pdf, other]

Probit transformation for nonparametric kernel estimation of the copula density

Authors: Gery Geenens, Arthur Charpentier, Davy Paindaveine

Abstract: Copula modelling has become ubiquitous in modern statistics. Here, the problem of nonparametrically estimating a copula density is addressed. Arguably the most popular nonparametric density estimator, the kernel estimator is not suitable for the unit-square-supported copula densities, mainly because it is heavily affected by boundary bias issues. In addition, most common copulas admit unbounded densities, and kernel methods are not consistent in that case. In this paper, a kernel-type copula density estimator is proposed. It is based on the idea of transforming the uniform marginals of the copula density into normal distributions via the probit function, estimating the density in the transformed domain, which can be accomplished without boundary problems, and obtaining an estimate of the copula density through back-transformation. Although natural, a raw application of this procedure was, however, seen not to perform very well in the earlier literature. Here, it is shown that, if combined with local likelihood density estimation methods, the idea yields very good and easy to implement estimators, fixing boundary issues in a natural way and able to cope with unbounded copula densities. The asymptotic properties of the suggested estimators are derived, and a practical way of selecting the crucially important smoothing parameters is devised. Finally, extensive simulation studies and a real data analysis evidence their excellent performance compared to their main competitors.

Submitted 16 April, 2014; originally announced April 2014.
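
The transform, estimate, back-transform idea described in the abstract above can be sketched as follows. This is the raw (naive) version with a fixed-bandwidth Gaussian product kernel, which the abstract itself notes does not perform very well; it is not the local-likelihood estimator the paper proposes. The function names, the rank / (n + 1) pseudo-observations, and the single `bandwidth` parameter are assumptions made for this example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: probit is its inverse CDF


def pseudo_observations(values):
    """Map a sample to (0, 1) via rank / (n + 1); assumes no ties."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return [r / (n + 1) for r in ranks]


def naive_probit_copula_density(u, v, us, vs, bandwidth):
    """Naive probit-transformation estimate of the copula density at (u, v).

    1. Send the data and the evaluation point to the normal scale.
    2. Estimate the density there with a Gaussian product kernel.
    3. Back-transform by dividing by the standard normal densities.
    """
    s0, t0 = STD_NORMAL.inv_cdf(u), STD_NORMAL.inv_cdf(v)
    kernel = NormalDist(0.0, bandwidth)  # pdf(x) equals phi(x / h) / h
    total = 0.0
    for ui, vi in zip(us, vs):
        si, ti = STD_NORMAL.inv_cdf(ui), STD_NORMAL.inv_cdf(vi)
        total += kernel.pdf(s0 - si) * kernel.pdf(t0 - ti)
    density_on_normal_scale = total / len(us)
    return density_on_normal_scale / (STD_NORMAL.pdf(s0) * STD_NORMAL.pdf(t0))
```

For independent samples the true copula density is 1 everywhere, so calling the function at (0.5, 0.5) on independent data gives a quick sanity check; the fixed bandwidth biases the estimate somewhat below 1 at the center.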

arXiv:1112.0929 [pdf, other]

Multivariate integer-valued autoregressive models applied to earthquake counts

Authors: Mathieu Boudreault, Arthur Charpentier

Abstract: In various situations in the insurance industry, in finance, in epidemiology, etc., one needs to represent the joint evolution of the number of occurrences of an event. In this paper, we present a multivariate integer-valued autoregressive (MINAR) model, derive its properties and apply the model to earthquake occurrences across various pairs of tectonic plates. The model is an extension of Pedelis & Karlis (2011) where cross autocorrelation (spatial contagion in a seismic context) is considered. We fit various bivariate count models and find that for many contiguous tectonic plates, spatial contagion is significant in both directions. Furthermore, ignoring cross autocorrelation can underestimate the potential for high numbers of occurrences over the short-term. Our overall findings seem to further confirm Parsons & Velasco (2001).

Submitted 5 December, 2011; originally announced December 2011.

Showing 1–28 of 28 results for author: Charpentier, A