-
Revisiting Reweighted Risk for Calibration: AURC, Focal Loss, and Inverse Focal Loss
Authors:
Han Zhou,
Sebastian G. Gruber,
Teodora Popordanoska,
Matthew B. Blaschko
Abstract:
Several variants of reweighted risk functionals, such as focal loss, inverse focal loss, and the Area Under the Risk-Coverage Curve (AURC), have been proposed in the literature and claims have been made in relation to their calibration properties. However, focal loss and inverse focal loss propose vastly different weighting schemes. In this paper, we revisit a broad class of weighted risk functions commonly used in deep learning and establish a principled connection between these reweighting schemes and calibration errors. We show that minimizing calibration error is closely linked to the selective classification paradigm and demonstrate that optimizing a regularized variant of the AURC naturally leads to improved calibration. This regularized AURC shares a similar reweighting strategy with inverse focal loss, lending support to the idea that focal loss is less principled when calibration is a desired outcome. Direct AURC optimization offers greater flexibility through the choice of confidence score functions (CSFs). To enable gradient-based optimization, we introduce a differentiable formulation of the regularized AURC using the SoftRank technique. Empirical evaluations demonstrate that our AURC-based loss achieves competitive class-wise calibration performance across a range of datasets and model architectures.
Submitted 10 June, 2025; v1 submitted 29 May, 2025;
originally announced May 2025.
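The contrast between the two weighting schemes named in the abstract above can be sketched in a few lines. This is only an illustration, assuming the commonly cited forms: focal loss scales the per-sample log loss by (1 - p)^gamma and inverse focal loss by (1 + p)^gamma, where p is the predicted probability of the true class. The function names and the default gamma are hypothetical and are not taken from the paper.

```python
import math

def focal_weight(p, gamma=2.0):
    """Assumed focal-loss weight: shrinks toward 0 as confidence p grows."""
    return (1.0 - p) ** gamma

def inverse_focal_weight(p, gamma=2.0):
    """Assumed inverse-focal-loss weight: grows as confidence p grows."""
    return (1.0 + p) ** gamma

def weighted_log_loss(p, weight_fn, gamma=2.0):
    """Per-sample log loss -log(p), rescaled by the chosen weighting scheme."""
    return -weight_fn(p, gamma) * math.log(p)

# Compare how the two schemes treat a low-confidence and a high-confidence sample.
for p in (0.1, 0.9):
    print(p, focal_weight(p), inverse_focal_weight(p))
```

Under these assumed forms the two schemes move in opposite directions: the focal weight down-weights confident samples, while the inverse focal weight emphasizes them.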
-
Beyond Segmentation: Confidence-Aware and Debiased Estimation of Ratio-based Biomarkers
Authors:
Jiameng Li,
Teodora Popordanoska,
Sebastian G. Gruber,
Frederik Maes,
Matthew B. Blaschko
Abstract:
Ratio-based biomarkers -- such as the proportion of necrotic tissue within a tumor -- are widely used in clinical practice to support diagnosis, prognosis and treatment planning. These biomarkers are typically estimated from soft segmentation outputs by computing region-wise ratios. Despite the high-stakes nature of clinical decision making, existing methods provide only point estimates, offering no measure of uncertainty. In this work, we propose a unified confidence-aware framework for estimating ratio-based biomarkers. We conduct a systematic analysis of error propagation in the segmentation-to-biomarker pipeline and identify model miscalibration as the dominant source of uncertainty. To mitigate this, we incorporate a lightweight, post-hoc calibration module that can be applied using internal hospital data without retraining. We leverage a tunable parameter $Q$ to control the confidence level of the derived bounds, allowing adaptation towards clinical practice. Extensive experiments show that our method produces statistically sound confidence intervals, with tunable confidence levels, enabling more trustworthy application of predictive biomarkers in clinical workflows.
Submitted 26 May, 2025;
originally announced May 2025.
-
Optimizing Estimators of Squared Calibration Errors in Classification
Authors:
Sebastian G. Gruber,
Francis Bach
Abstract:
In this work, we propose a mean-squared error-based risk that enables the comparison and optimization of estimators of squared calibration errors in practical settings. Improving the calibration of classifiers is crucial for enhancing the trustworthiness and interpretability of machine learning models, especially in sensitive decision-making scenarios. Although various calibration (error) estimators exist in the current literature, there is a lack of guidance on selecting the appropriate estimator and tuning its hyperparameters. By leveraging the bilinear structure of squared calibration errors, we reformulate calibration estimation as a regression problem with independent and identically distributed (i.i.d.) input pairs. This reformulation allows us to quantify the performance of different estimators even for the most challenging calibration criterion, known as canonical calibration. Our approach advocates for a training-validation-testing pipeline when estimating a calibration error on an evaluation dataset. We demonstrate the effectiveness of our pipeline by optimizing existing calibration estimators and comparing them with novel kernel ridge regression-based estimators on standard image classification tasks.
Submitted 21 February, 2025; v1 submitted 9 October, 2024;
originally announced October 2024.
-
Disentangling Mean Embeddings for Better Diagnostics of Image Generators
Authors:
Sebastian G. Gruber,
Pascal Tobias Ziegler,
Florian Buettner
Abstract:
The evaluation of image generators remains a challenge due to the limitations of traditional metrics in providing nuanced insights into specific image regions. This is a critical problem as not all regions of an image may be learned with similar ease. In this work, we propose a novel approach to disentangle the cosine similarity of mean embeddings into the product of cosine similarities for individual pixel clusters via central kernel alignment. Consequently, we can quantify the contribution of the cluster-wise performance to the overall image generation performance. We demonstrate how this enhances the explainability and the likelihood of identifying pixel regions of model misbehavior across various real-world use cases.
Submitted 12 December, 2024; v1 submitted 2 September, 2024;
originally announced September 2024.
-
Neural Network Surrogate and Projected Gradient Descent for Fast and Reliable Finite Element Model Calibration: a Case Study on an Intervertebral Disc
Authors:
Matan Atad,
Gabriel Gruber,
Marx Ribeiro,
Luis Fernando Nicolini,
Robert Graf,
Hendrik Möller,
Kati Nispel,
Ivan Ezhov,
Daniel Rueckert,
Jan S. Kirschke
Abstract:
Accurate calibration of finite element (FE) models is essential across various biomechanical applications, including human intervertebral discs (IVDs), to ensure their reliability and use in diagnosing and planning treatments. However, traditional calibration methods are computationally intensive, requiring iterative, derivative-free optimization algorithms that often take days to converge. This study addresses these challenges by introducing a novel, efficient, and effective calibration method demonstrated on a human L4-L5 IVD FE model as a case study using a neural network (NN) surrogate. The NN surrogate predicts simulation outcomes with high accuracy, outperforming other machine learning models, and significantly reduces the computational cost associated with traditional FE simulations. Next, a Projected Gradient Descent (PGD) approach guided by gradients of the NN surrogate is proposed to efficiently calibrate FE models. Our method explicitly enforces feasibility with a projection step, thus maintaining material bounds throughout the optimization process. The proposed method is evaluated against SOTA Genetic Algorithm and inverse model baselines on synthetic and in vitro experimental datasets. Our approach demonstrates superior performance on synthetic data, achieving an MAE of 0.06 compared to the baselines' MAE of 0.18 and 0.54, respectively. On experimental specimens, our method outperforms the baseline in 5 out of 6 cases. While our approach requires initial dataset generation and surrogate training, these steps are performed only once, and the actual calibration takes under three seconds. In contrast, traditional calibration time scales linearly with the number of specimens, taking up to 8 days in the worst case. Such efficiency paves the way for applying more complex FE models, potentially extending beyond IVDs, and enabling accurate patient-specific simulations.
Submitted 9 December, 2024; v1 submitted 12 August, 2024;
originally announced August 2024.
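The projection step described in the abstract above can be illustrated with a generic projected gradient descent loop under box constraints. This is a minimal sketch and not the authors' implementation: in the paper the gradients come from the NN surrogate, whereas here a toy quadratic objective stands in for it, and every name, bound, and step size is a hypothetical choice made for illustration.

```python
def project_box(x, lower, upper):
    """Clip each coordinate of x into its interval [lower_i, upper_i]."""
    return [min(max(xi, lo), hi) for xi, lo, hi in zip(x, lower, upper)]

def projected_gradient_descent(grad_fn, x0, lower, upper, step=0.1, iters=200):
    """Take a gradient step, then project back onto the feasible box, repeatedly."""
    x = project_box(x0, lower, upper)
    for _ in range(iters):
        g = grad_fn(x)
        stepped = [xi - step * gi for xi, gi in zip(x, g)]
        x = project_box(stepped, lower, upper)
    return x

# Toy stand-in for the surrogate gradient (assumption): squared distance to a
# target that lies outside the feasible box, so the bounds become active.
target = [2.0, -1.0]

def toy_grad(x):
    return [2.0 * (xi - ti) for xi, ti in zip(x, target)]

solution = projected_gradient_descent(toy_grad, [0.5, 0.5], [0.0, 0.0], [1.0, 1.0])
print(solution)
```

Because every iterate is projected, the parameters never leave the box, which mirrors the abstract's point that feasibility is maintained throughout the optimization rather than checked only at the end.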
-
Consistent and Asymptotically Unbiased Estimation of Proper Calibration Errors
Authors:
Teodora Popordanoska,
Sebastian G. Gruber,
Aleksei Tiulpin,
Florian Buettner,
Matthew B. Blaschko
Abstract:
Proper scoring rules evaluate the quality of probabilistic predictions, playing an essential role in the pursuit of accurate and well-calibrated models. Every proper score decomposes into two fundamental components -- proper calibration error and refinement -- utilizing a Bregman divergence. While uncertainty calibration has gained significant attention, current literature lacks a general estimator for these quantities with known statistical properties. To address this gap, we propose a method that allows consistent and asymptotically unbiased estimation of all proper calibration errors and refinement terms. In particular, we introduce Kullback--Leibler calibration error, induced by the commonly used cross-entropy loss. As part of our results, we prove the relation between refinement and f-divergences, which implies information monotonicity in neural networks, regardless of which proper scoring rule is optimized. Our experiments empirically validate the claimed properties of the proposed estimator and suggest that the selection of a post-hoc calibration method should be determined by the particular calibration error of interest.
Submitted 13 December, 2023;
originally announced December 2023.
-
A Bias-Variance-Covariance Decomposition of Kernel Scores for Generative Models
Authors:
Sebastian G. Gruber,
Florian Buettner
Abstract:
Generative models, like large language models, are becoming increasingly relevant in our daily lives, yet a theoretical framework to assess their generalization behavior and uncertainty does not exist. Particularly, the problem of uncertainty estimation is commonly solved in an ad-hoc and task-dependent manner. For example, natural language approaches cannot be transferred to image generation. In this paper, we introduce the first bias-variance-covariance decomposition for kernel scores. This decomposition represents a theoretical framework from which we derive a kernel-based variance and entropy for uncertainty estimation. We propose unbiased and consistent estimators for each quantity which only require generated samples but not the underlying model itself. Based on the wide applicability of kernels, we demonstrate our framework via generalization and uncertainty experiments for image, audio, and language generation. Specifically, kernel entropy for uncertainty estimation is more predictive of performance on CoQA and TriviaQA question answering datasets than existing baselines and can also be applied to closed-source models.
Submitted 10 July, 2024; v1 submitted 9 October, 2023;
originally announced October 2023.
-
Uncertainty Estimates of Predictions via a General Bias-Variance Decomposition
Authors:
Sebastian G. Gruber,
Florian Buettner
Abstract:
Reliably estimating the uncertainty of a prediction throughout the model lifecycle is crucial in many safety-critical applications. The most common way to measure this uncertainty is via the predicted confidence. While this tends to work well for in-domain samples, these estimates are unreliable under domain drift and restricted to classification. Alternatively, proper scores can be used for most predictive tasks but a bias-variance decomposition for model uncertainty does not exist in the current literature. In this work we introduce a general bias-variance decomposition for proper scores, giving rise to the Bregman Information as the variance term. We discover how exponential families and the classification log-likelihood are special cases and provide novel formulations. Surprisingly, we can express the classification case purely in the logit space. We showcase the practical relevance of this decomposition on several downstream tasks, including model ensembles and confidence regions. Further, we demonstrate how different approximations of the instance-level Bregman Information allow reliable out-of-distribution detection for all degrees of domain drift.
Submitted 20 April, 2023; v1 submitted 21 October, 2022;
originally announced October 2022.
-
Better Uncertainty Calibration via Proper Scores for Classification and Beyond
Authors:
Sebastian G. Gruber,
Florian Buettner
Abstract:
With model trustworthiness being crucial for sensitive real-world applications, practitioners are putting more and more focus on improving the uncertainty calibration of deep neural networks. Calibration errors are designed to quantify the reliability of probabilistic predictions but their estimators are usually biased and inconsistent. In this work, we introduce the framework of proper calibration errors, which relates every calibration error to a proper score and provides a respective upper bound with optimal estimation properties. This relationship can be used to reliably quantify the model calibration improvement. We theoretically and empirically demonstrate the shortcomings of commonly used estimators compared to our approach. Due to the wide applicability of proper scores, this gives a natural extension of recalibration beyond classification.
Submitted 12 March, 2024; v1 submitted 15 March, 2022;
originally announced March 2022.
-
Mass sensing for the advanced fabrication of nanomechanical resonators
Authors:
G. Gruber,
C. Urgell,
A. Tavernarakis,
A. Stavrinadis,
S. Tepsic,
C. Magen,
S. Sangiao,
J. M. de Teresa,
P. Verlot,
A. Bachtold
Abstract:
We report on a nanomechanical engineering method to monitor matter growth in real time via e-beam electromechanical coupling. This method relies on the exceptional mass sensing capabilities of nanomechanical resonators. Focused electron beam induced deposition (FEBID) is employed to selectively grow platinum particles at the free end of singly clamped nanotube cantilevers. The electron beam has two functions: it both grows material on the nanotube and tracks the deposited mass in real time by probing the noise-driven mechanical resonance of the nanotube. On the one hand, this detection method is highly effective as it can resolve mass deposition with a resolution in the zeptogram range; on the other hand, this method is simple to use and readily available to a wide range of potential users, since it can be operated in existing commercial FEBID systems without any modification. The presented method makes it possible to engineer hybrid nanomechanical resonators with precisely tailored functionality. It also appears as a new tool for studying growth dynamics of ultra-thin nanostructures, opening new opportunities for investigating so far out-of-reach physics of FEBID and related methods.
Submitted 22 January, 2021;
originally announced January 2021.
-
Interrelation of elasticity and thermal bath in nanotube cantilevers
Authors:
S. Tepsic,
G. Gruber,
C. B. Moller,
C. Magen,
P. Belardinelli,
E. R. Hernadez,
F. Alijani,
P. Verlot,
A. Bachtold
Abstract:
We report the first study on the thermal behaviour of the stiffness of individual carbon nanotubes, which is achieved by measuring the resonance frequency of their fundamental mechanical bending modes. We observe a reduction of the Young's modulus over a large temperature range with a slope $-(173\pm 65)$ ppm/K in its relative shift. These findings are reproduced by two different theoretical models based on the thermal dynamics of the lattice. These results reveal how the measured fundamental bending modes depend on the phonons in the nanotube via the Young's modulus. An alternative description based on the coupling between the measured mechanical modes and the phonon thermal bath in the Akhiezer limit is discussed.
Submitted 23 March, 2021; v1 submitted 22 January, 2021;
originally announced January 2021.
-
Analyzing conformational changes in single FRET-labeled A1 parts of archaeal A1AO-ATP synthase
Authors:
Hendrik Sielaff,
Dhirendra Singh,
Gerhard Grueber,
Michael Börsch
Abstract:
ATP synthases utilize a proton motive force to synthesize ATP. In reverse, these membrane-embedded enzymes can also hydrolyze ATP to pump protons over the membrane. To prevent wasteful ATP hydrolysis, distinct control mechanisms exist for ATP synthases in bacteria, archaea, chloroplasts and mitochondria. Single-molecule Förster resonance energy transfer (smFRET) demonstrated that the C-terminus of the rotary subunit epsilon in the Escherichia coli enzyme changes its conformation to block ATP hydrolysis. Previously we investigated the related conformational changes of subunit F of the A1AO-ATP synthase from the archaeon Methanosarcina mazei Gö1. Here, we analyze the lifetimes of fluorescence donor and acceptor dyes to distinguish between smFRET signals for conformational changes and potential artefacts.
Submitted 15 January, 2018;
originally announced January 2018.
-
Electrically detected magnetic resonance of carbon dangling bonds at the Si-face 4H-SiC/SiO$_2$ interface
Authors:
Gernot Gruber,
Jonathon Cottom,
Robert Meszaros,
Markus Koch,
Gregor Pobegen,
Thomas Aichinger,
Dethard Peters,
Peter Hadley
Abstract:
SiC-based metal-oxide-semiconductor field-effect transistors (MOSFETs) have gained significant importance in power electronics applications. However, electrically active defects at the SiC/SiO$_2$ interface degrade the ideal behavior of the devices. The relevant microscopic defects can be identified by electron paramagnetic resonance (EPR) or electrically detected magnetic resonance (EDMR). This helps to decide which changes to the fabrication process will likely lead to further increases in device performance and reliability. EDMR measurements have shown very similar dominant hyperfine (HF) spectra in differently processed MOSFETs although some discrepancies were observed in the measured $g$-factors. Here, the HF spectra measured on different SiC MOSFETs are compared and it is argued that the same dominant defect is present in all devices. A comparison of the data with simulated spectra of the C dangling bond (P$_\textrm{bC}$) center and the silicon vacancy (V$_\textrm{Si}$) demonstrates that the P$_\textrm{bC}$ center is a more suitable candidate to explain the observed HF spectra.
Submitted 25 September, 2017;
originally announced September 2017.