-
Expectiles as basis risk-optimal payment schemes in parametric insurance
Authors:
Markus Johannes Maier,
Matthias Scherer
Abstract:
Payments in parametric insurance solutions are linked to an index and thus decoupled from policyholders' true losses. While this principle has appealing operational benefits compared to traditional indemnity coverage, i.e., it is very efficient and cost-effective, a downside is the discrepancy between payouts and actual damage, called basis risk. We show that in an asymmetrically weighted mean square error framework, the basis risk-minimizing payment schemes for pure parametric and parametric index insurance contracts can be expressed as conditional expectiles of policyholders' true loss given a compensation-triggering incident. We provide connections to stochastic orderings and demonstrate that regression approaches allow easy implementation in practice. Our results are illustrated with parametric coverage for cyber risks and agricultural insurance.
Submitted 5 May, 2025;
originally announced May 2025.
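The expectile idea in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `expectile`, the sample losses, and the convention that `tau` weights losses above the payment (underpayment) are all assumptions made here. The tau-expectile of a sample minimizes the asymmetrically weighted squared error, and it can be found by iterating a weighted mean.

```python
def expectile(losses, tau, tol=1e-12, max_iter=1000):
    """Sample tau-expectile: minimizer of sum |tau - 1{y <= e}| * (y - e)^2.

    Assumed convention (not from the source): losses above the candidate
    payment e get weight tau, losses at or below it get weight 1 - tau.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    e = sum(losses) / len(losses)  # start from the ordinary mean
    for _ in range(max_iter):
        weights = [tau if y > e else 1.0 - tau for y in losses]
        # First-order condition: e equals the weighted mean under these weights.
        new_e = sum(w * y for w, y in zip(weights, losses)) / sum(weights)
        if abs(new_e - e) < tol:
            return new_e
        e = new_e
    return e


# Hypothetical losses observed after a triggering incident.
losses = [0.0, 10.0, 20.0, 100.0]
for tau in (0.2, 0.5, 0.8):
    print(tau, expectile(losses, tau))
```

With `tau = 0.5` the weights are symmetric and the result is the ordinary mean; larger values of `tau` penalize underpayment more heavily and push the payment upward.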
-
Footprint of publication selection bias on meta-analyses in medicine, environmental sciences, psychology, and economics
Authors:
František Bartoš,
Maximilian Maier,
Eric-Jan Wagenmakers,
Franziska Nippold,
Hristos Doucouliagos,
John P. A. Ioannidis,
Willem M. Otte,
Martina Sladekova,
Teshome K. Deressa,
Stephan B. Bruns,
Daniele Fanelli,
T. D. Stanley
Abstract:
Publication selection bias undermines the systematic accumulation of evidence. To assess the extent of this problem, we survey over 68,000 meta-analyses containing over 700,000 effect size estimates from medicine (67,386/597,699), environmental sciences (199/12,707), psychology (605/23,563), and economics (327/91,421). Our results indicate that meta-analyses in economics are the most severely contaminated by publication selection bias, closely followed by meta-analyses in environmental sciences and psychology, whereas meta-analyses in medicine are contaminated the least. After adjusting for publication selection bias, the median probability of the presence of an effect decreased from 99.9% to 29.7% in economics, from 98.9% to 55.7% in psychology, from 99.8% to 70.7% in environmental sciences, and from 38.0% to 29.7% in medicine. The median absolute effect sizes (in terms of standardized mean differences) decreased from d = 0.20 to d = 0.07 in economics, from d = 0.37 to d = 0.26 in psychology, from d = 0.62 to d = 0.43 in environmental sciences, and from d = 0.24 to d = 0.13 in medicine.
Submitted 26 September, 2023; v1 submitted 25 August, 2022;
originally announced August 2022.
-
Interpretable bias mitigation for textual data: Reducing gender bias in patient notes while maintaining classification performance
Authors:
Joshua R. Minot,
Nicholas Cheney,
Marc Maier,
Danne C. Elbers,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
Medical systems in general, and patient treatment decisions and outcomes in particular, are affected by bias based on gender and other demographic elements. As language models are increasingly applied to medicine, there is a growing interest in building algorithmic fairness into processes impacting patient care. Much of the work addressing this question has focused on biases encoded in language models -- statistical estimates of the relationships between concepts derived from distant reading of corpora. Building on this work, we investigate how word choices made by healthcare practitioners and language models interact with regard to bias. We identify and remove gendered language from two clinical-note datasets and describe a new debiasing procedure using BERT-based gender classifiers. We show minimal degradation in health condition classification tasks for low to medium levels of bias removal via data augmentation. Finally, we compare the bias semantically encoded in the language models with the bias empirically observed in health records. This work outlines an interpretable approach for using data augmentation to identify and reduce the potential for bias in natural language processing pipelines.
Submitted 9 March, 2021;
originally announced March 2021.
-
Estimating Chlorophyll a Concentrations of Several Inland Waters with Hyperspectral Data and Machine Learning Models
Authors:
Philipp M. Maier,
Sina Keller
Abstract:
Water is a key component of life, the natural environment and human health. For monitoring the conditions of a water body, the chlorophyll a concentration can serve as a proxy for nutrients and oxygen supply. In situ measurements of water quality parameters are often time-consuming, expensive and limited in areal validity. Therefore, we apply remote sensing techniques. During field campaigns, we collected hyperspectral data with a spectrometer and in situ measured chlorophyll a concentrations of 13 inland water bodies with different spectral characteristics. One objective of this study is to estimate chlorophyll a concentrations of these inland waters by applying three machine learning regression models: Random Forest, Support Vector Machine and an Artificial Neural Network. Additionally, we simulate four different hyperspectral resolutions of the spectrometer data to investigate the effects on the estimation performance. Furthermore, we evaluate how applying first-order derivatives of the spectra affects the regression performance. This study reveals the potential of combining machine learning approaches and remote sensing data for inland waters. Each machine learning model achieves an R² score between 80 % and 90 % for the regression on chlorophyll a concentrations. The random forest model benefits clearly from the applied derivatives of the spectra. In further studies, we will focus on the application of machine learning models on spectral satellite data to enhance the area-wide estimation of chlorophyll a concentration for inland waters.
Submitted 3 April, 2019;
originally announced April 2019.
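Two of the preprocessing ideas named in the abstract above, coarser spectral resolution and first-order derivatives of the spectra, can be sketched as follows. The abstract does not say how either step was implemented, so block averaging of adjacent bands and simple finite differences are assumptions chosen here for illustration, as are the function names and the toy spectrum.

```python
def bin_average(values, factor):
    """Simulate a coarser spectral resolution by averaging groups of
    `factor` adjacent bands (assumed approach; an incomplete trailing
    group is dropped)."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return [
        sum(values[i:i + factor]) / factor
        for i in range(0, len(values) - factor + 1, factor)
    ]


def first_derivative(wavelengths, reflectance):
    """First-order derivative of a spectrum by forward finite differences
    between neighbouring bands; returns one value fewer than the input."""
    if len(wavelengths) != len(reflectance):
        raise ValueError("wavelengths and reflectance must have equal length")
    return [
        (reflectance[i + 1] - reflectance[i]) / (wavelengths[i + 1] - wavelengths[i])
        for i in range(len(wavelengths) - 1)
    ]


# Hypothetical toy spectrum: wavelengths in nm, unitless reflectance.
wavelengths = [400.0, 410.0, 420.0, 430.0]
reflectance = [0.10, 0.20, 0.40, 0.30]

print(first_derivative(wavelengths, reflectance))
print(bin_average(reflectance, 2))
```

The derivative values would then be used as regression features in place of, or alongside, the raw reflectance.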
-
Machine learning regression on hyperspectral data to estimate multiple water parameters
Authors:
Philipp M. Maier,
Sina Keller
Abstract:
In this paper, we present a regression framework involving several machine learning models to estimate water parameters based on hyperspectral data. Measurements from a multi-sensor field campaign, conducted on the River Elbe, Germany, represent the benchmark dataset. It contains hyperspectral data and the five water parameters chlorophyll a, green algae, diatoms, CDOM and turbidity. We apply a PCA for the high-dimensional data as a possible preprocessing step. Then, we evaluate the performance of the regression framework with and without this preprocessing step. The regression results of the framework clearly reveal the potential of estimating water parameters based on hyperspectral data with machine learning. The proposed framework provides the basis for further investigations, such as adapting the framework to estimate water parameters of different inland waters.
Submitted 7 August, 2018; v1 submitted 3 May, 2018;
originally announced May 2018.
-
Developing a machine learning framework for estimating soil moisture with VNIR hyperspectral data
Authors:
Sina Keller,
Felix M. Riese,
Johanna Stötzer,
Philipp M. Maier,
Stefan Hinz
Abstract:
In this paper, we investigate the potential of estimating the soil-moisture content based on VNIR hyperspectral data combined with LWIR data. Measurements from a multi-sensor field campaign conducted on a grassland site represent the benchmark dataset, which contains measured hyperspectral, LWIR, and soil-moisture data. We introduce a regression framework with three steps consisting of feature selection, preprocessing, and well-chosen regression models. The latter are mainly supervised machine learning models. The exception is the self-organizing maps, which combine unsupervised and supervised learning. We analyze the impact of the distinct preprocessing methods on the regression results. Of all regression models, the extremely randomized trees model without preprocessing provides the best estimation performance. Our results reveal the potential of the respective regression framework combined with the VNIR hyperspectral data to estimate soil moisture measured under real-world conditions. In conclusion, the results of this paper provide a basis for further improvements in different research directions.
Submitted 12 July, 2018; v1 submitted 24 April, 2018;
originally announced April 2018.
-
How the result of graph clustering methods depends on the construction of the graph
Authors:
Markus Maier,
Ulrike von Luxburg,
Matthias Hein
Abstract:
We study the scenario of graph-based clustering algorithms such as spectral clustering. Given a set of data points, one first has to construct a graph on the data points and then apply a graph clustering algorithm to find a suitable partition of the graph. Our main question is if and how the construction of the graph (choice of the graph, choice of parameters, choice of weights) influences the outcome of the final clustering result. To this end we study the convergence of cluster quality measures such as the normalized cut or the Cheeger cut on various kinds of random geometric graphs as the sample size tends to infinity. It turns out that the limit values of the same objective function are systematically different on different types of graphs. This implies that clustering results systematically depend on the graph and can be very different for different types of graph. We provide examples to illustrate the implications on spectral clustering.
Submitted 10 February, 2011;
originally announced February 2011.
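The two cluster quality measures named in the abstract above can be computed directly from a weighted adjacency matrix. The sketch below uses the standard definitions, NCut = cut(S) · (1/vol(S) + 1/vol(S^c)) and CheegerCut = cut(S) / min(vol(S), vol(S^c)); the helper names and the small example graph (two triangles joined by one edge) are illustrative assumptions, not taken from the paper.

```python
def cut_weight(w, part):
    """Total weight of edges leaving the node set `part`."""
    inside = set(part)
    return sum(w[i][j] for i in inside for j in range(len(w)) if j not in inside)


def volume(w, nodes):
    """Sum of weighted degrees of the given nodes."""
    return sum(sum(w[i]) for i in nodes)


def complement(w, part):
    inside = set(part)
    return [i for i in range(len(w)) if i not in inside]


def normalized_cut(w, part):
    rest = complement(w, part)
    return cut_weight(w, part) * (1.0 / volume(w, part) + 1.0 / volume(w, rest))


def cheeger_cut(w, part):
    rest = complement(w, part)
    return cut_weight(w, part) / min(volume(w, part), volume(w, rest))


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
W = [[0.0] * 6 for _ in range(6)]
for i, j in edges:
    W[i][j] = W[j][i] = 1.0  # unweighted, symmetric

print(normalized_cut(W, [0, 1, 2]), cheeger_cut(W, [0, 1, 2]))
print(normalized_cut(W, [0]), cheeger_cut(W, [0]))
```

Splitting along the bridging edge gives much smaller values than isolating a single node. Changing how `W` is built from the data (graph type, k, edge weights) changes these values, which is the dependence the abstract studies in the large-sample limit.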
-
Optimal construction of k-nearest neighbor graphs for identifying noisy clusters
Authors:
Markus Maier,
Matthias Hein,
Ulrike von Luxburg
Abstract:
We study clustering algorithms based on neighborhood graphs on a random sample of data points. The question we ask is how such a graph should be constructed in order to obtain optimal clustering results. Which type of neighborhood graph should one choose, mutual k-nearest neighbor or symmetric k-nearest neighbor? What is the optimal parameter k? In our setting, clusters are defined as connected components of the t-level set of the underlying probability distribution. Clusters are said to be identified in the neighborhood graph if connected components in the graph correspond to the true underlying clusters. Using techniques from random geometric graph theory, we prove bounds on the probability that clusters are identified successfully, both in a noise-free and in a noisy setting. Those bounds lead to several conclusions. First, k has to be chosen surprisingly high (of the order n rather than of the order log n) to maximize the probability of cluster identification. Second, the major difference between the mutual and the symmetric k-nearest neighbor graph occurs when one attempts to detect the most significant cluster only.
Submitted 17 December, 2009;
originally announced December 2009.
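The difference between the two graph types in the abstract above is easy to see in code. In the mutual k-nearest neighbor graph, i and j are connected only if each is among the other's k nearest neighbors; in the symmetric graph, one direction suffices. The sketch below is a brute-force illustration with hypothetical function names and toy points, not the construction analyzed in the paper.

```python
import math


def knn_sets(points, k):
    """For each point, the index set of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbors.append(set(others[:k]))
    return neighbors


def knn_graph_edges(points, k, mutual):
    """Undirected edges (i, j) with i < j of the mutual (AND) or
    symmetric (OR) k-nearest neighbor graph."""
    nbrs = knn_sets(points, k)
    edges = set()
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            i_picks_j = j in nbrs[i]
            j_picks_i = i in nbrs[j]
            keep = (i_picks_j and j_picks_i) if mutual else (i_picks_j or j_picks_i)
            if keep:
                edges.add((i, j))
    return edges


# Hypothetical one-dimensional sample with an outlying point at 10.0.
points = [(0.0,), (1.0,), (2.5,), (10.0,)]
print(sorted(knn_graph_edges(points, k=1, mutual=True)))
print(sorted(knn_graph_edges(points, k=1, mutual=False)))
```

On this toy sample the mutual graph leaves the outlying point disconnected, while the symmetric graph attaches it to its nearest neighbor.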