-
Simple yet effective: a comparative study of statistical models for yearly hurricane forecasting
Authors:
Pietro Colombo,
Raffaele Mattera,
Philipp Otto
Abstract:
In this paper, we study the problem of forecasting the next year's number of Atlantic hurricanes, which is relevant in many fields of application such as land-use planning, hazard mitigation, reinsurance and long-term weather derivative markets. Considering a set of well-known predictors, we compare the forecasting accuracy of both machine learning and simpler models, showing that the latter may be more adequate than the former. Quantile regression models, which are adopted for the first time for forecasting hurricane numbers, provide the best results. Moreover, we construct a new index showing good properties in anticipating the direction of the future number of hurricanes. We consider different evaluation metrics based on both magnitude forecasting errors and directional accuracy.
Submitted 17 November, 2024;
originally announced November 2024.
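The two families of evaluation metrics named in the abstract above (magnitude errors and directional accuracy) can be illustrated with a minimal sketch. The function names, the exact definition of directional accuracy (comparing each forecast with the previous year's observed count), and the toy numbers are assumptions made for illustration; they are not taken from the paper.

```python
def pinball_loss(y_true, y_pred, tau):
    """Average pinball (quantile) loss at quantile level tau in (0, 1)."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction is weighted by tau, over-prediction by (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)


def directional_accuracy(y_true, y_pred):
    """Share of years in which the forecast moves in the same direction
    as the observation, both measured against the previous observed value.
    This definition is an assumption; other conventions exist."""
    hits = 0
    for t in range(1, len(y_true)):
        actual_up = y_true[t] > y_true[t - 1]
        forecast_up = y_pred[t] > y_true[t - 1]
        hits += actual_up == forecast_up
    return hits / (len(y_true) - 1)


# Toy yearly counts and forecasts (illustrative only, not real data).
observed = [7, 10, 8, 6]
forecast = [6, 9, 9, 7]
median_loss = pinball_loss(observed, forecast, tau=0.5)
direction_share = directional_accuracy(observed, forecast)
```

At tau = 0.5 the pinball loss equals half the mean absolute error, which is why median forecasts are often scored with it.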
-
Warped multifidelity Gaussian processes for data fusion of skewed environmental data
Authors:
Pietro Colombo,
Claire Miller,
Xiaochen Yang,
Ruth O'Donnell,
Paolo Maranzano
Abstract:
Understanding the dynamics of climate variables is paramount for numerous sectors, like energy and environmental monitoring. This study focuses on the critical need for a precise mapping of environmental variables for national or regional monitoring networks, a task notably challenging when dealing with skewed data. To address this issue, we propose a novel data fusion approach, the \textit{warped multifidelity Gaussian process} (WMFGP). The method performs prediction using multiple time-series, accommodating varying reliability and resolutions and effectively handling skewness. In an extended simulation experiment, the benefits and limitations of the method are explored, while as a case study we focus on the wind speed monitored by the network of ARPA Lombardia, one of the regional environmental agencies operating in Italy. ARPA grapples with data gaps, and due to the connection between wind speed and air quality, it struggles with effective air quality management. We illustrate the efficacy of our approach in filling the wind speed data gaps through two extensive simulation experiments. The case study provides more informative wind speed predictions crucial for predicting air pollutant concentrations, enhancing network maintenance, and advancing understanding of relevant meteorological and climatic phenomena.
Submitted 28 July, 2024;
originally announced July 2024.
-
A Functional Data Perspective and Baseline On Multi-Layer Out-of-Distribution Detection
Authors:
Eduardo Dadalto,
Pierre Colombo,
Guillaume Staerman,
Nathan Noiry,
Pablo Piantanida
Abstract:
A key feature of out-of-distribution (OOD) detection is to exploit a trained neural network by extracting statistical patterns and relationships through the multi-layer classifier to detect shifts in the expected input data distribution. Despite achieving solid results, several state-of-the-art methods rely on the penultimate or last layer outputs only, leaving behind valuable information for OOD detection. Methods that explore the multiple layers require either a special architecture or a supervised objective to do so. This work adopts an original approach based on a functional view of the network that exploits the sample's trajectories through the various layers and their statistical dependencies. It goes beyond multivariate feature aggregation and introduces a baseline rooted in functional anomaly detection. In this new framework, OOD detection translates into detecting samples whose trajectories differ from the typical behavior characterized by the training set. We validate our method and empirically demonstrate its effectiveness in OOD detection compared to strong state-of-the-art baselines on computer vision benchmarks.
Submitted 6 June, 2023;
originally announced June 2023.
-
A Differential Entropy Estimator for Training Neural Networks
Authors:
Georg Pichler,
Pierre Colombo,
Malik Boudiaf,
Günther Koliander,
Pablo Piantanida
Abstract:
Mutual Information (MI) has been widely used as a loss regularizer for training neural networks. This has been particularly effective when learning disentangled or compressed representations of high-dimensional data. However, differential entropy (DE), another fundamental measure of information, has not found widespread use in neural network training. Although DE offers a potentially wider range of applications than MI, off-the-shelf DE estimators are either non-differentiable, computationally intractable or fail to adapt to changes in the underlying distribution. These drawbacks prevent them from being used as regularizers in neural network training. To address shortcomings in previously proposed estimators for DE, here we introduce KNIFE, a fully parameterized, differentiable kernel-based estimator of DE. The flexibility of our approach also allows us to construct KNIFE-based estimators for conditional (on either discrete or continuous variables) DE, as well as MI. We empirically validate our method on high-dimensional synthetic data and further apply it to guide the training of neural networks for real-world tasks. Our experiments on a large variety of tasks, including visual domain adaptation, textual fair classification, and textual fine-tuning, demonstrate the effectiveness of KNIFE-based estimation. Code can be found at https://github.com/g-pichler/knife.
Submitted 19 June, 2022; v1 submitted 14 February, 2022;
originally announced February 2022.
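To give a feel for what a kernel-based differential entropy estimate looks like, the sketch below computes a plain plug-in estimate with a fixed-bandwidth Gaussian kernel in one dimension. It is not KNIFE: according to the abstract above, KNIFE is fully parameterized and differentiable, whereas here the kernel centers are the samples themselves and the bandwidth is a hand-picked constant. The function name `kde_entropy` and the sample values are hypothetical.

```python
import math


def kde_entropy(samples, bandwidth):
    """Plug-in differential entropy estimate (in nats) for 1-D data.

    Builds a Gaussian kernel density estimate centered on the samples
    and returns the negative average log-density at those same samples.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    total = 0.0
    for x in samples:
        density = norm * sum(
            math.exp(-0.5 * ((x - center) / bandwidth) ** 2)
            for center in samples
        )
        total -= math.log(density)
    return total / n


# Illustrative samples only.
data = [-1.2, -0.4, 0.0, 0.3, 0.9, 1.5]
h_base = kde_entropy(data, bandwidth=0.5)

# Stretching the data and the bandwidth by a factor c shifts the
# estimate by log(c), mirroring the scaling law of differential entropy.
h_scaled = kde_entropy([2.0 * x for x in data], bandwidth=1.0)
```

In a training setting, the centers and bandwidth would be learnable parameters updated by gradient descent, which is the aspect this fixed-parameter sketch leaves out.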
-
A Pseudo-Metric between Probability Distributions based on Depth-Trimmed Regions
Authors:
Guillaume Staerman,
Pavlo Mozharovskyi,
Pierre Colombo,
Stéphan Clémençon,
Florence d'Alché-Buc
Abstract:
The design of a metric between probability distributions is a longstanding problem motivated by numerous applications in Machine Learning. Focusing on continuous probability distributions on the Euclidean space $\mathbb{R}^d$, we introduce a novel pseudo-metric between probability distributions by leveraging the extension of univariate quantiles to multivariate spaces. Data depth is a nonparametric statistical tool that measures the centrality of any element $x\in\mathbb{R}^d$ with respect to (w.r.t.) a probability distribution or a data set. It is a natural median-oriented extension of the cumulative distribution function (cdf) to the multivariate case. Thus, its upper-level sets -- the depth-trimmed regions -- give rise to a definition of multivariate quantiles. The new pseudo-metric relies on the average of the Hausdorff distance between the depth-based quantile regions w.r.t. each distribution. Its good behavior w.r.t. major transformation groups, as well as its ability to factor out translations, are depicted. Robustness, an appealing feature of this pseudo-metric, is studied through the finite sample breakdown point. Moreover, we propose an efficient approximation method with linear time complexity w.r.t. the size of the data set and its dimension. The quality of this approximation as well as the performance of the proposed approach are illustrated in numerical experiments.
Submitted 10 October, 2022; v1 submitted 23 March, 2021;
originally announced March 2021.
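The pseudo-metric in the abstract above averages Hausdorff distances between depth-based quantile regions. As a minimal sketch of that one building block, the code below computes the Hausdorff distance between two finite point sets. Treating each region as a finite point cloud is an assumption made here for illustration; computing data depth, extracting the trimmed regions, and the paper's linear-time approximation are not shown.

```python
import math


def directed_hausdorff(set_a, set_b):
    """Largest distance from a point of set_a to its nearest point in set_b."""
    return max(min(math.dist(a, b) for b in set_b) for a in set_a)


def hausdorff(set_a, set_b):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(set_a, set_b),
               directed_hausdorff(set_b, set_a))


# Two toy point sets in the plane (illustrative only).
region_p = [(0.0, 0.0), (1.0, 0.0)]
region_q = [(0.0, 0.0), (3.0, 0.0)]
distance = hausdorff(region_p, region_q)
```

The brute-force loops cost O(|A| |B|) distance evaluations, which is acceptable for small point clouds but not for large samples.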
-
A patient-specific approach for quantitative and automatic analysis of computed tomography images in lung disease: application to COVID-19 patients
Authors:
L. Berta,
C. De Mattia,
F. Rizzetto,
S. Carrazza,
P. E. Colombo,
R. Fumagalli,
T. Langer,
D. Lizio,
A. Vanzulli,
A. Torresin
Abstract:
Quantitative metrics in lung computed tomography (CT) images have been widely used, often without a clear connection with physiology. This work proposes a patient-independent model for the estimation of well-aerated volume of lungs in CT images (WAVE). A Gaussian fit, with mean (Mu.f) and width (Sigma.f) values, was applied to the lower CT histogram data points of the lung to provide the estimation of the well-aerated lung volume (WAVE.f). Independence from CT reconstruction parameters and respiratory cycle was analysed using healthy lung CT images and 4DCT acquisitions. The Gaussian metrics and first-order radiomic features calculated for a third cohort of COVID-19 patients were compared with those of healthy lungs. Each lung was further segmented into 24 subregions, and a new biomarker derived from the Gaussian fit parameter Mu.f was proposed to represent the local density changes. WAVE.f was independent of the respiratory motion in 80% of the cases. Differences of 1%, 2% and up to 14% were found when comparing a moderate iterative strength with the FBP algorithm, 1 mm with 3 mm slice thickness, and different reconstruction kernels, respectively. Healthy subjects were significantly different from COVID-19 patients for all the metrics calculated. Graphical representation of the local biomarker provides spatial and quantitative information in a single 2D picture. Unlike other metrics based on fixed histogram thresholds, this model is able to consider the inter- and intra-subject variability. In addition, it defines a local biomarker to quantify the severity of the disease, independently of the observer.
Submitted 12 January, 2021;
originally announced January 2021.
-
Heavy-tailed Representations, Text Polarity Classification & Data Augmentation
Authors:
Hamid Jalalzai,
Pierre Colombo,
Chloé Clavel,
Eric Gaussier,
Giovanna Varni,
Emmanuel Vignon,
Anne Sabourin
Abstract:
The dominant approaches to text representation in natural language rely on learning embeddings on massive corpora which have convenient properties such as compositionality and distance preservation. In this paper, we develop a novel method to learn a heavy-tailed embedding with desirable regularity properties regarding the distributional tails, which makes it possible to analyze the points far away from the distribution bulk using the framework of multivariate extreme value theory. In particular, a classifier dedicated to the tails of the proposed embedding is obtained whose performance exceeds that of the baseline. This classifier exhibits a scale invariance property which we leverage by introducing a novel text generation method for label-preserving dataset augmentation. Numerical experiments on synthetic and real text data demonstrate the relevance of the proposed framework and confirm that this method generates meaningful sentences with a controllable attribute, e.g. positive or negative sentiment.
Submitted 25 March, 2021; v1 submitted 25 March, 2020;
originally announced March 2020.