-
Proportional infinite-width infinite-depth limit for deep linear neural networks
Authors:
Federico Bassetti,
Lucia Ladelli,
Pietro Rotondo
Abstract:
We study the distributional properties of linear neural networks with random parameters in the context of large networks, where the number of layers diverges in proportion to the number of neurons per layer. Prior works have shown that in the infinite-width regime, where the number of neurons per layer grows to infinity while the depth remains fixed, neural networks converge to a Gaussian process, known as the Neural Network Gaussian Process. However, this Gaussian limit sacrifices descriptive power, as it lacks the ability to learn dependent features and produce output correlations that reflect observed labels. Motivated by these limitations, we explore the joint proportional limit in which both depth and width diverge but maintain a constant ratio, yielding a non-Gaussian distribution that retains correlations between outputs. Our contribution extends previous works by rigorously characterizing, for linear activation functions, the limiting distribution as a nontrivial mixture of Gaussians.
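The mixture-of-Gaussians structure mentioned in this abstract can already be seen at finite width by conditioning on the hidden layers. The following is a sketch under assumed notation: the $1/\sqrt{N}$ scaling, the i.i.d. standard Gaussian weights, the scalar readout $w$, and the symbols $h^{(\ell)}$, $Q$, $P$ are illustrative choices, not taken from the paper, and the display says nothing about the specific limiting mixing law derived there.

$$
h^{(0)}(x) = x, \qquad
h^{(\ell)}(x) = \frac{1}{\sqrt{N}}\, W^{(\ell)} h^{(\ell-1)}(x), \quad \ell = 1,\dots,L, \qquad
f(x) = \frac{1}{\sqrt{N}}\, w^{\top} h^{(L)}(x).
$$

Conditionally on $W^{(1)},\dots,W^{(L)}$, the outputs on inputs $x_1,\dots,x_P$ are linear in the Gaussian vector $w$, hence jointly Gaussian:

$$
\big(f(x_1),\dots,f(x_P)\big) \,\big|\, W^{(1)},\dots,W^{(L)} \;\sim\; \mathcal{N}(0, Q),
\qquad
Q_{\mu\nu} = \frac{1}{N}\, h^{(L)}(x_\mu)^{\top} h^{(L)}(x_\nu).
$$

Averaging over the hidden-layer weights gives the prior over outputs as a Gaussian mixture with random covariance $Q$:

$$
p(f_1,\dots,f_P) = \mathbb{E}_{Q}\big[\, \mathcal{N}(f_1,\dots,f_P \,;\, 0, Q) \,\big].
$$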
Submitted 22 November, 2024;
originally announced November 2024.
-
Feature learning in finite-width Bayesian deep linear networks with multiple outputs and convolutional layers
Authors:
Federico Bassetti,
Marco Gherardi,
Alessandro Ingrosso,
Mauro Pastore,
Pietro Rotondo
Abstract:
Deep linear networks have been extensively studied, as they provide simplified models of deep learning. However, little is known in the case of finite-width architectures with multiple outputs and convolutional layers. In this manuscript, we provide rigorous results for the statistics of functions implemented by the aforementioned class of networks, thus moving closer to a complete characterization of feature learning in the Bayesian setting. Our results include: (i) an exact and elementary non-asymptotic integral representation for the joint prior distribution over the outputs, given in terms of a mixture of Gaussians; (ii) an analytical formula for the posterior distribution in the case of squared error loss function (Gaussian likelihood); (iii) a quantitative description of the feature learning infinite-width regime, using large deviation theory. From a physical perspective, deep architectures with multiple outputs or convolutional layers represent different manifestations of kernel shape renormalization, and our work provides a dictionary that translates this physics intuition and terminology into rigorous Bayesian statistics.
Submitted 16 June, 2025; v1 submitted 5 June, 2024;
originally announced June 2024.
-
A Spatiotemporal Gamma Shot Noise Cox Process
Authors:
Federico Bassetti,
Roberto Casarin,
Matteo Iacopini
Abstract:
A new discrete-time shot noise Cox process for spatiotemporal data is proposed. The random intensity is driven by a dependent sequence of latent gamma random measures. Some properties of the latent process are derived, such as an autoregressive representation and the Laplace functional. Moreover, these results are used to derive the moment, predictive, and pair correlation measures of the proposed shot noise Cox process. The model is flexible but still tractable and allows for capturing persistence, global trends, and latent spatial and temporal factors. A Bayesian inference approach is adopted, and an efficient Markov Chain Monte Carlo procedure based on conditional Sequential Monte Carlo is proposed. An application to georeferenced wildfire data illustrates the properties of the model and inference.
Submitted 16 August, 2023;
originally announced August 2023.
-
First-order integer-valued autoregressive processes with Generalized Katz innovations
Authors:
Ovielt Baltodano Lopez,
Federico Bassetti,
Giulia Carallo,
Roberto Casarin
Abstract:
A new integer-valued autoregressive process (INAR) with Generalised Lagrangian Katz (GLK) innovations is defined. This process family provides a flexible modelling framework for count data, allowing for under- and over-dispersion, asymmetry, and excess kurtosis, and includes standard INAR models such as Generalized Poisson and Negative Binomial as special cases. We show that the GLK-INAR process is discrete semi-self-decomposable, infinitely divisible, and stable by aggregation, and we provide stationarity conditions. Some extensions are discussed, such as the Markov-switching and the zero-inflated GLK-INARs. A Bayesian inference framework and an efficient posterior approximation procedure are introduced. The proposed models are applied to 130 time series from Google Trends, which proxy the worldwide public concern about climate change. New evidence is found of heterogeneity across time, countries and keywords in the persistence, uncertainty, and long-run public awareness level.
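For readers unfamiliar with INAR models, here is a minimal simulation sketch of a first-order INAR recursion built on binomial thinning. It uses plain Poisson innovations as a simple stand-in, since the GLK innovation law is not specified in this abstract; the function names and the parameters `alpha` (thinning probability) and `lam` (Poisson rate) are hypothetical.

```python
import math
import random


def binomial_thinning(count, alpha, rng):
    """Thin `count` units: each one survives independently with probability alpha."""
    return sum(1 for _ in range(count) if rng.random() < alpha)


def poisson_draw(lam, rng):
    """Draw a Poisson(lam) variate with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k = 0
    product = rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def simulate_inar1(n_steps, alpha, lam, x0=0, seed=0):
    """Simulate X_t = (alpha thinned X_{t-1}) + eps_t with eps_t ~ Poisson(lam).

    Poisson innovations are an assumption made for illustration only;
    they are not the GLK innovations of the paper.
    """
    rng = random.Random(seed)
    path = [x0]
    for _ in range(n_steps):
        survivors = binomial_thinning(path[-1], alpha, rng)
        path.append(survivors + poisson_draw(lam, rng))
    return path


if __name__ == "__main__":
    print(simulate_inar1(n_steps=20, alpha=0.5, lam=2.0, seed=1))
```

With `0 <= alpha < 1` the Poisson version is stationary with mean `lam / (1 - alpha)`, which gives a quick sanity check on long simulated paths.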
Submitted 17 December, 2024; v1 submitted 4 February, 2022;
originally announced February 2022.
-
Computing Kantorovich-Wasserstein Distances on $d$-dimensional histograms using $(d+1)$-partite graphs
Authors:
Gennaro Auricchio,
Federico Bassetti,
Stefano Gualandi,
Marco Veneroni
Abstract:
This paper presents a novel method to compute the exact Kantorovich-Wasserstein distance between a pair of $d$-dimensional histograms having $n$ bins each. We prove that this problem is equivalent to an uncapacitated minimum cost flow problem on a $(d+1)$-partite graph with $(d+1)n$ nodes and $dn^{\frac{d+1}{d}}$ arcs, whenever the cost is separable along the principal $d$-dimensional directions. We show numerically the benefits of our approach by computing the Kantorovich-Wasserstein distance of order 2 on two sets of instances: gray scale images and $d$-dimensional biomedical histograms. On these types of instances, our approach is competitive with state-of-the-art optimal transport algorithms.
Submitted 11 January, 2019; v1 submitted 18 May, 2018;
originally announced May 2018.
-
On the Computation of Kantorovich-Wasserstein Distances between 2D-Histograms by Uncapacitated Minimum Cost Flows
Authors:
Federico Bassetti,
Stefano Gualandi,
Marco Veneroni
Abstract:
In this work, we present a method to compute the Kantorovich-Wasserstein distance of order one between a pair of two-dimensional histograms. Recent works in Computer Vision and Machine Learning have shown the benefits of measuring Wasserstein distances of order one between histograms with $n$ bins, by solving a classical transportation problem on very large complete bipartite graphs with $n$ nodes and $n^2$ edges. The main contribution of our work is to approximate the original transportation problem by an uncapacitated min cost flow problem on a reduced flow network of size $O(n)$ that exploits the geometric structure of the cost function. More precisely, when the distance among the bin centers is measured with the 1-norm or the $\infty$-norm, our approach provides an optimal solution. When the distance among bins is measured with the 2-norm: (i) we derive a quantitative estimate of the error between the optimal and the approximate solution; (ii) given the error, we construct a reduced flow network of size $O(n)$. We numerically show the benefits of our approach by computing Wasserstein distances of order one on a set of grey scale images used as a benchmark in the literature. We show how our approach scales with the size of the images with 1-norm, 2-norm and $\infty$-norm ground distances, and we compare it with two other methods that are widely used in the literature.
Submitted 26 July, 2019; v1 submitted 2 April, 2018;
originally announced April 2018.
-
Hierarchical Species Sampling Models
Authors:
Federico Bassetti,
Roberto Casarin,
Luca Rossini
Abstract:
This paper introduces a general class of hierarchical nonparametric prior distributions. The random probability measures are constructed by a hierarchy of generalized species sampling processes with possibly non-diffuse base measures. The proposed framework provides a general probabilistic foundation for hierarchical random measures with either atomic or mixed base measures and allows for studying their properties, such as the distribution of the marginal and total number of clusters. We show that hierarchical species sampling models have a Chinese Restaurant Franchise representation and can be used as prior distributions to undertake Bayesian nonparametric inference. We provide a method to sample from the posterior distribution together with some numerical illustrations. Our class of priors includes some new hierarchical mixture priors such as the hierarchical Gnedin measures, and other well-known prior distributions such as the hierarchical Pitman-Yor and the hierarchical normalized random measures.
Submitted 15 March, 2018;
originally announced March 2018.
-
Bayesian Nonparametric Calibration and Combination of Predictive Distributions
Authors:
Federico Bassetti,
Roberto Casarin,
Francesco Ravazzolo
Abstract:
We introduce a Bayesian approach to predictive density calibration and combination that accounts for parameter uncertainty and model set incompleteness through the use of random calibration functionals and random combination weights. Building on the work of Ranjan, R. and Gneiting, T. (2010) and Gneiting, T. and Ranjan, R. (2013), we use infinite beta mixtures for the calibration. The proposed Bayesian nonparametric approach takes advantage of the flexibility of Dirichlet process mixtures to achieve any continuous deformation of linearly combined predictive distributions. The inference procedure is based on Gibbs sampling and allows accounting for uncertainty in the number of mixture components, mixture weights, and calibration parameters. The weak posterior consistency of the Bayesian nonparametric calibration is established under suitable conditions for an unknown true density. We study the methodology in simulation examples with fat tails and multimodal densities and apply it to density forecasts of daily S&P returns and daily maximum wind speed at the Frankfurt airport.
Submitted 25 October, 2016; v1 submitted 25 February, 2015;
originally announced February 2015.
-
Beta-Product Poisson-Dirichlet Processes
Authors:
Federico Bassetti,
Roberto Casarin,
Fabrizio Leisen
Abstract:
Time series data may exhibit clustering over time and, in a multiple time series context, the clustering behavior may differ across the series. This paper is motivated by the Bayesian nonparametric modeling of the dependence between the clustering structures and the distributions of different time series. We follow a Dirichlet process mixture approach and introduce a new class of multivariate dependent Dirichlet processes (DDP). The proposed DDP are represented in terms of a vector of stick-breaking processes with dependent weights. The weights are beta random vectors that determine different and dependent clustering effects along the dimension of the DDP vector. We discuss some theoretical properties and provide an efficient Markov Chain Monte Carlo algorithm for posterior computation. The effectiveness of the method is illustrated with a simulation study and an application to the United States and the European Union industrial production indexes.
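As background for the stick-breaking representation mentioned above, the sketch below draws truncated stick-breaking weights for a single standard Dirichlet process, with independent Beta(1, concentration) stick fractions. This is the univariate building block only; the dependent beta random vectors of the proposed DDP are not reproduced here. The function name and the parameters `concentration` and `n_atoms` are hypothetical.

```python
import random


def stick_breaking_weights(concentration, n_atoms, seed=0):
    """Return the first n_atoms stick-breaking weights and the leftover stick mass.

    Each fraction V_k ~ Beta(1, concentration) breaks off a piece of the
    remaining stick, so w_k = V_k * prod_{j<k} (1 - V_j).
    Assumption: a single standard Dirichlet process, not the DDP of the paper.
    """
    rng = random.Random(seed)
    weights = []
    remaining = 1.0
    for _ in range(n_atoms):
        fraction = rng.betavariate(1.0, concentration)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights, remaining


if __name__ == "__main__":
    weights, leftover = stick_breaking_weights(concentration=2.0, n_atoms=10, seed=3)
    print([round(w, 4) for w in weights], round(leftover, 4))
```

The weights plus the leftover mass always sum to one; a larger `concentration` spreads the mass over more atoms, which corresponds to more clusters in a mixture model.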
Submitted 22 September, 2011;
originally announced September 2011.
-
Generalized Species Sampling Priors with Latent Beta reinforcements
Authors:
Edoardo M. Airoldi,
Thiago Costa,
Federico Bassetti,
Fabrizio Leisen,
Michele Guindani
Abstract:
Many popular Bayesian nonparametric priors can be characterized in terms of exchangeable species sampling sequences. However, in some applications, exchangeability may not be appropriate. We introduce a novel and probabilistically coherent family of non-exchangeable species sampling sequences characterized by a tractable predictive probability function with weights driven by a sequence of independent Beta random variables. We compare their theoretical clustering properties with those of the Dirichlet process and the two-parameter Poisson-Dirichlet process. Unlike existing work, the proposed construction provides a complete characterization of the joint process. We then propose the use of such a process as a prior distribution in a hierarchical Bayes modeling framework, and we describe a Markov Chain Monte Carlo sampler for posterior inference. We evaluate the performance of the prior and the robustness of the resulting inference in a simulation study, providing a comparison with popular Dirichlet process mixtures and Hidden Markov Models. Finally, we develop an application to the detection of chromosomal aberrations in breast cancer by leveraging array CGH data.
Submitted 1 August, 2014; v1 submitted 3 December, 2010;
originally announced December 2010.
-
Quantitative comparisons between finitary posterior distributions and Bayesian posterior distributions
Authors:
Federico Bassetti
Abstract:
The main object of Bayesian statistical inference is the determination of posterior distributions. Sometimes these laws are given for quantities devoid of empirical value. This serious drawback vanishes when one confines oneself to considering a finite horizon framework. However, assuming infinite exchangeability gives rise to fairly tractable a posteriori quantities, which is very attractive in applications. Hence, with a view to a reconciliation between these two aspects of the Bayesian way of reasoning, in this paper we provide quantitative comparisons between posterior distributions of finitary parameters and posterior distributions of allied parameters appearing in usual statistical models.
Submitted 8 July, 2008;
originally announced July 2008.