Search | arXiv e-print repository

Sufficient digits and density estimation: A Bayesian nonparametric approach using generalized finite Pólya trees

Abstract: This paper proposes a novel approach for statistical modelling of a continuous random variable $X$ on $[0, 1)$, based on its digit representation $X=.X_1X_2\ldots$. In general, $X$ can be coupled with a random variable $N$ so that if a prior on $N$ is imposed, $(X_1,\ldots,X_N)$ becomes a sufficient statistic and $.X_{N+1}X_{N+2}\ldots$ is uniformly distributed. In line with this fact, and focusing on binary digits for simplicity, we propose a family of generalized finite Pólya trees that induces a random density for a sample, which becomes a flexible tool for density estimation. Here, the digit system may be random and learned from the data. We provide a detailed Bayesian analysis, including a closed-form expression for the posterior distribution, which sidesteps the need for MCMC methods for posterior inference. We analyse the frequentist properties as the sample size increases, and provide sufficient conditions for consistency of the posterior distributions of the random density and $N$. We consider an extension to data spanning multiple orders of magnitude, and propose a prior distribution that encodes the so-called extended Newcomb-Benford law. Such a model shows promising results for density estimation of human-activity data. Our methodology is illustrated on several synthetic and real datasets.

Submitted 11 June, 2025; originally announced June 2025.

arXiv:2505.19643 [pdf, ps, other]

Online activity prediction via generalized Indian buffet process models

Authors: Mario Beraha, Lorenzo Masoero, Stefano Favaro, Thomas S. Richardson

Abstract: Online A/B experiments generate millions of user-activity records each day, yet experimenters need timely forecasts to guide roll-outs and safeguard user experience. Motivated by the problem of activity prediction for A/B tests at Amazon, we introduce a Bayesian nonparametric model for predicting both first-time and repeat triggers in web experiments. The model is based on the stable beta-scaled process prior, which allows for capturing heavy-tailed behaviour without strict parametric assumptions. All posterior and predictive quantities are available in closed form, allowing for fast inference even on large-scale datasets. Simulation studies and a retrospective analysis of 1,774 production experiments show improved accuracy in forecasting new users and total triggers compared with state-of-the-art competitors, especially when only a few pilot days are observed. The framework enables shorter tests while preserving calibrated uncertainty estimates. Although motivated by Amazon's experimentation platform, the method extends to other applications that require rapid, distribution-free prediction of sparse count processes.

Submitted 26 May, 2025; originally announced May 2025.

Comments: This paper supersedes the two technical reports by the same authors arXiv:2401.14722 and arXiv:2402.03231

arXiv:2502.10257 [pdf, other]

Bayesian calculus and predictive characterizations of extended feature allocation models

Authors: Mario Beraha, Federico Camerlenghi, Lorenzo Ghilotti

Abstract: We introduce and study a unified Bayesian framework for extended feature allocations which flexibly captures interactions -- such as repulsion or attraction -- among features and their associated weights. We provide a complete Bayesian analysis of the proposed model and specialize our general theory to noteworthy classes of priors. This includes a novel prior based on determinantal point processes, for which we show promising results in a spatial statistics application. Within the general class of extended feature allocations, we further characterize those priors that yield predictive probabilities of discovering new features depending either solely on the sample size or on both the sample size and the distinct number of observed features. These predictive characterizations, known as "sufficientness" postulates, have been extensively studied in the literature on species sampling models starting from the seminal contribution of the English philosopher W.E. Johnson for the Dirichlet distribution. Within the feature allocation setting, existing predictive characterizations are limited to very specific examples; in contrast, our results are general, providing practical guidance for prior selection. Additionally, our approach, based on Palm calculus, is analytical in nature and yields a novel characterization of the Poisson point process through its reduced Palm kernel.

Submitted 3 March, 2025; v1 submitted 14 February, 2025; originally announced February 2025.

arXiv:2402.03231 [pdf, other]

Improved prediction of future user activity in online A/B testing

Authors: Lorenzo Masoero, Mario Beraha, Thomas Richardson, Stefano Favaro

Abstract: In online randomized experiments or A/B tests, accurate predictions of participant inclusion rates are of paramount importance. These predictions not only guide experimenters in optimizing the experiment's duration but also enhance the precision of treatment effect estimates. In this paper we present a novel, straightforward, and scalable Bayesian nonparametric approach for predicting the rate at which individuals will be exposed to interventions within the realm of online A/B testing. Our approach stands out by offering dual prediction capabilities: it forecasts both the quantity of new customers expected in future time windows and, unlike available alternative methods, the number of times they will be observed. We derive closed-form expressions for the posterior distributions of the quantities needed to form predictions about future user activity, thereby bypassing the need for numerical algorithms such as Markov chain Monte Carlo. After a comprehensive exposition of our model, we test its performance on experiments on real and simulated data, where we show its superior performance with respect to existing alternatives in the literature.

Submitted 5 February, 2024; originally announced February 2024.

arXiv:2401.14722 [pdf, other]

A Nonparametric Bayes Approach to Online Activity Prediction

Authors: Mario Beraha, Lorenzo Masoero, Stefano Favaro, Thomas S. Richardson

Abstract: Accurately predicting the onset of specific activities within defined timeframes holds significant importance in several applied contexts. In particular, accurate prediction of the number of future users that will be exposed to an intervention is an important piece of information for experimenters running online experiments (A/B tests). In this work, we propose a novel approach to predict the number of users that will be active in a given time period, as well as the temporal trajectory needed to attain a desired user participation threshold. We model user activity using a Bayesian nonparametric approach which allows us to capture the underlying heterogeneity in user engagement. We derive closed-form expressions for the number of new users expected in a given period, and a simple Monte Carlo algorithm targeting the posterior distribution of the number of days needed to attain a desired number of users; the latter is important for experimental planning. We illustrate the performance of our approach via several experiments on synthetic and real world data, in which we show that our novel method outperforms existing competitors.

Submitted 26 January, 2024; originally announced January 2024.

arXiv:2312.13992 [pdf, other]

Bayesian nonparametric boundary detection for income areal data

Authors: Matteo Gianella, Mario Beraha, Alessandra Guglielmi

Abstract: Recent discussions on the future of metropolitan cities underscore the pivotal role of (social) equity, driven by demographic and economic trends. More equal policies can foster and contribute to a city's economic success and social stability. In this work, we focus on identifying metropolitan areas with distinct economic and social levels in the greater Los Angeles area, one of the most diverse yet unequal areas in the United States. Utilising American Community Survey data, we propose a Bayesian model for boundary detection based on areal income distributions. The model identifies areas with significant income disparities, offering actionable insights for policymakers to address social and economic inequalities. We have multiple observations (i.e., personal income of survey respondents) for each area, and our approach, formalised as a Bayesian structural learning framework, models areal densities through mixtures of finite mixtures. We address boundary detection by identifying boundaries for which the associated geographically contiguous areal densities are estimated as being very different without resorting to dissimilarity metrics or covariates. Efficient posterior computation is facilitated by a transdimensional Markov Chain Monte Carlo sampler. The methodology is validated via extensive simulations and applied to the income data in the greater Los Angeles area. We identify several boundaries in the income distributions, which can be explained ex-post in terms of the percentage of the population without health insurance, though not in terms of the total number of crimes, showing the usefulness of such an analysis to policymakers.

Submitted 29 January, 2025; v1 submitted 21 December, 2023; originally announced December 2023.

arXiv:2310.09818 [pdf, other]

MCMC for Bayesian nonparametric mixture modeling under differential privacy

Authors: Mario Beraha, Stefano Favaro, Vinayak Rao

Abstract: Estimating the probability density of a population while preserving the privacy of individuals in that population is an important and challenging problem that has received considerable attention in recent years. While the previous literature focused on frequentist approaches, in this paper, we propose a Bayesian nonparametric mixture model under differential privacy (DP) and present two Markov chain Monte Carlo (MCMC) algorithms for posterior inference. One is a marginal approach, resembling Neal's Algorithm 5 with a pseudo-marginal Metropolis-Hastings move, and the other is a conditional approach. Although our focus is primarily on local DP, we show that our MCMC algorithms can be easily extended to deal with global differential privacy mechanisms. Moreover, for some carefully chosen mechanisms and mixture kernels, we show how auxiliary parameters can be analytically marginalized, allowing standard MCMC algorithms (i.e., non-privatized, such as Neal's Algorithm 2) to be efficiently employed. Our approach is general and applicable to any mixture model and privacy mechanism. In several simulations and a real case study, we discuss the performance of our algorithms and evaluate different privacy mechanisms proposed in the frequentist literature.

Submitted 21 May, 2024; v1 submitted 15 October, 2023; originally announced October 2023.

arXiv:2309.15408 [pdf, other]

A smoothed-Bayesian approach to frequency recovery from sketched data

Authors: Mario Beraha, Stefano Favaro, Matteo Sesia

Abstract: We provide a novel statistical perspective on a classical problem at the intersection of computer science and information theory: recovering the empirical frequency of a symbol in a large discrete dataset using only a compressed representation, or sketch, obtained via random hashing. Departing from traditional algorithmic approaches, recent works have proposed Bayesian nonparametric (BNP) methods that can provide more informative frequency estimates by leveraging modeling assumptions about the distribution of the sketched data. In this paper, we propose a smoothed-Bayesian method, inspired by existing BNP approaches but designed in a frequentist framework to overcome the computational limitations of the BNP approaches when dealing with large-scale data from realistic distributions, including those with power-law tail behaviors. For sketches obtained with a single hash function, our approach is supported by rigorous frequentist properties, including unbiasedness and optimality under a squared error loss function within an intuitive class of linear estimators. For sketches with multiple hash functions, we introduce an approach based on multi-view learning to construct computationally efficient frequency estimators. We validate our method on synthetic and real data, comparing its performance to that of existing alternatives.

Submitted 10 April, 2025; v1 submitted 27 September, 2023; originally announced September 2023.

arXiv:2304.02402 [pdf, other]

Wasserstein Principal Component Analysis for Circular Measures

Authors: Mario Beraha, Matteo Pegoraro

Abstract: We consider the 2-Wasserstein space of probability measures supported on the unit-circle, and propose a framework for Principal Component Analysis (PCA) for data living in such a space. We build on a detailed investigation of the optimal transportation problem for measures on the unit-circle which might be of independent interest. In particular, we derive an expression for optimal transport maps in (almost) closed form and propose an alternative definition of the tangent space at an absolutely continuous probability measure, together with the associated exponential and logarithmic maps. PCA is performed by mapping data on the tangent space at the Wasserstein barycentre, which we approximate via an iterative scheme, and for which we establish a sufficient a posteriori condition to assess its convergence. Our methodology is illustrated on several simulated scenarios and a real data analysis of measurements of optic nerve thickness.

Submitted 5 April, 2023; originally announced April 2023.

arXiv:2303.17844 [pdf, other]

Transform-scaled process priors for trait allocations in Bayesian nonparametrics

Authors: Mario Beraha, Stefano Favaro

Abstract: Completely random measures (CRMs) provide a broad class of priors, arguably the most popular, for Bayesian nonparametric (BNP) analysis of trait allocations. As a peculiar property, CRM priors lead to predictive distributions that share the following common structure: for fixed prior parameters, a new data point exhibits a Poisson (random) number of "new" traits, i.e., not appearing in the sample, which depends on the sampling information only through the sample size. While the Poisson posterior distribution is appealing for analytical tractability and ease of interpretation, its independence from the sampling information is a critical drawback, as it makes the posterior distribution of "new" traits completely determined by the estimation of the unknown prior parameters. In this paper, we introduce the class of transform-scaled process (T-SP) priors as a tool to enrich the posterior distribution of "new" traits arising from CRM priors, while maintaining the same analytical tractability and ease of interpretation. In particular, we present a framework for posterior analysis of trait allocations under T-SP priors, showing that Stable T-SP priors, i.e., T-SP priors built from Stable CRMs, lead to predictive distributions such that, for fixed prior parameters, a new data point displays a negative-Binomial (random) number of "new" traits, which depends on the sampling information through the number of distinct traits and the sample size. Then, by relying on a hierarchical version of T-SP priors, we extend our analysis to the more general setting of trait allocations with multiple groups of data or subpopulations. The empirical effectiveness of our methods is demonstrated through numerical experiments and applications to real data.

Submitted 31 March, 2023; originally announced March 2023.

arXiv:2303.15029 [pdf, other]

Random measure priors in Bayesian recovery from sketches

Authors: Mario Beraha, Stefano Favaro, Matteo Sesia

Abstract: This paper introduces a Bayesian nonparametric approach to frequency recovery from lossy-compressed discrete data, leveraging all information contained in a sketch obtained through random hashing. By modeling the data points as random samples from an unknown discrete distribution endowed with a Poisson-Kingman prior, we derive the posterior distribution of a symbol's empirical frequency given the sketch. This leads to principled frequency estimates through mean functionals, e.g., the posterior mean, median and mode. We highlight applications of this general result to Dirichlet process and Pitman-Yor process priors. Notably, we prove that the former prior uniquely satisfies a sufficiency property that simplifies the posterior distribution, while the latter enables a convenient large-sample asymptotic approximation. Additionally, we extend our approach to the problem of cardinality recovery, estimating the number of distinct symbols in the sketched dataset. Our approach to frequency recovery also adapts to a more general "traits" setting, where each data point has integer levels of association with multiple symbols, typically referred to as "traits". By employing a generalized Indian buffet process, we compute the posterior distribution of a trait's frequency using both the Poisson and Bernoulli distributions for the trait association levels, respectively yielding exact and approximate posterior frequency distributions.

Submitted 4 June, 2024; v1 submitted 27 March, 2023; originally announced March 2023.

arXiv:2303.02438 [pdf, other]

Bayesian clustering of high-dimensional data via latent repulsive mixtures

Authors: Lorenzo Ghilotti, Mario Beraha, Alessandra Guglielmi

Abstract: Model-based clustering of moderate or large dimensional data is notoriously difficult. We propose a model for simultaneous dimensionality reduction and clustering by assuming a mixture model for a set of latent scores, which are then linked to the observations via a Gaussian latent factor model. This approach was recently investigated by Chandra et al. (2023). The authors use a factor-analytic representation and assume a mixture model for the latent factors. However, performance can deteriorate in the presence of model misspecification. Assuming a repulsive point process prior for the component-specific means of the mixture for the latent scores is shown to yield a more robust model that outperforms the standard mixture model for the latent factors in several simulated scenarios. The repulsive point process must be anisotropic to favor well-separated clusters of data, and its density should be tractable for efficient posterior inference. We address these issues by proposing a general construction for anisotropic determinantal point processes. We illustrate our model in simulations as well as on a plant species co-occurrence dataset.

Submitted 1 June, 2024; v1 submitted 4 March, 2023; originally announced March 2023.

arXiv:2302.09034 [pdf, other]

Bayesian Mixture Models with Repulsive and Attractive Atoms

Authors: Mario Beraha, Raffaele Argiento, Federico Camerlenghi, Alessandra Guglielmi

Abstract: The study of almost surely discrete random probability measures is an active line of research in Bayesian nonparametrics. The idea of assuming interaction across the atoms of the random probability measure has recently spurred significant interest in the context of Bayesian mixture models. This allows the definition of priors that encourage well-separated and interpretable clusters. In this work, we provide a unified framework for the construction and the Bayesian analysis of random probability measures with interacting atoms, encompassing both repulsive and attractive behaviours. Specifically, we derive closed-form expressions for the posterior distribution, the marginal and predictive distributions, which were not previously available except for the case of measures with i.i.d. atoms. We show how these quantities are fundamental both for prior elicitation and to develop new posterior simulation algorithms for hierarchical mixture models. Our results are obtained without any assumption on the finite point process that governs the atoms of the random measure. Their proofs rely on analytical tools borrowed from the Palm calculus theory, which might be of independent interest. We specialise our treatment to the classes of Poisson, Gibbs, and determinantal point processes, as well as in the case of shot-noise Cox processes. Finally, we illustrate the performance of different modelling strategies on simulated and real datasets.

Submitted 24 April, 2025; v1 submitted 17 February, 2023; originally announced February 2023.

MSC Class: 60G57; 62G05; 62F15; 62H30

arXiv:2205.15654 [pdf, other]

Normalized Latent Measure Factor Models

Authors: Mario Beraha, Jim E. Griffin

Abstract: We propose a methodology for modeling and comparing probability distributions within a Bayesian nonparametric framework. Building on dependent normalized random measures, we consider a prior distribution for a collection of discrete random measures where each measure is a linear combination of a set of latent measures, interpretable as characteristic traits shared by different distributions, with positive random weights. The model is non-identified and a method for post-processing posterior samples to achieve identified inference is developed. This uses Riemannian optimization to solve a non-trivial optimization problem over a Lie group of matrices. The effectiveness of our approach is validated on simulated data and in two applications to two real-world data sets: school student test scores and personal incomes in California. Our approach leads to interesting insights for populations and easily interpretable posterior inference.

Submitted 31 May, 2022; originally announced May 2022.

arXiv:2205.08144 [pdf, other]

BayesMix: Bayesian Mixture Models in C++

Authors: Mario Beraha, Bruno Guindani, Matteo Gianella, Alessandra Guglielmi

Abstract: We describe BayesMix, a C++ library for MCMC posterior simulation for general Bayesian mixture models. The goal of BayesMix is to provide a self-contained ecosystem to perform inference for mixture models to computer scientists, statisticians and practitioners. The key idea of this library is extensibility, as we wish the users to easily adapt our software to their specific Bayesian mixture models. In addition to the several models and MCMC algorithms for posterior inference included in the library, new users with little familiarity with mixture models and the related MCMC algorithms can extend our library with minimal coding effort. Our library is computationally very efficient when compared to competitor software. Examples show that the typical code runtimes are from two to 25 times faster than competitors for data dimension from one to ten. Our library is publicly available on GitHub at https://github.com/bayesmix-dev/bayesmix/.

Submitted 17 May, 2022; originally announced May 2022.

arXiv:2203.12280 [pdf, other]

Bayesian Nonparametric Vector Autoregressive Models via a Logit Stick-breaking Prior: an Application to Child Obesity

Authors: Mario Beraha, Alessandra Guglielmi, Fernando A. Quintana, Maria de Iorio, Johan Gunnar Eriksson, Fabian Yap

Abstract: Overweight and obesity in adults are known to be associated with risks of metabolic and cardiovascular diseases. Because obesity is an epidemic, increasingly affecting children, it is important to understand if this condition persists from early life to childhood and if different patterns of obesity growth can be detected. Our motivation starts from a study of obesity over time in children from South Eastern Asia. Our main focus is on clustering obesity patterns after adjusting for the effect of baseline information. Specifically, we consider a joint model for height and weight patterns taken every 6 months from birth. We propose a novel model that facilitates clustering by combining a vector autoregressive sampling model with a dependent logit stick-breaking prior. Simulation studies show the superiority of the model in capturing patterns, compared to other alternatives. We apply the model to the motivating dataset, and discuss the main features of the detected clusters. We also compare alternative models with ours in terms of predictive performance.

Submitted 23 March, 2022; originally announced March 2022.

arXiv:2112.10393 [pdf, other]

Bayesian nonparametric model based clustering with intractable distributions: an ABC approach

Authors: Mario Beraha, Riccardo Corradin

Abstract: Bayesian nonparametric mixture models offer a rich framework for model based clustering. We consider the situation where the kernel of the mixture is available only up to an intractable normalizing constant. In this case, most of the commonly used Markov chain Monte Carlo (MCMC) methods are not suitable. We propose an approximate Bayesian computational (ABC) strategy, whereby we approximate the posterior to avoid the intractability of the kernel. We derive an ABC-MCMC algorithm which combines (i) the use of the predictive distribution induced by the nonparametric prior as proposal and (ii) the use of the Wasserstein distance and its connection to optimal matching problems. To overcome the sensitivity with respect to the parameters of our algorithm, we further propose an adaptive strategy. We illustrate the use of the proposed algorithm with several simulation studies and an application on real data, where we cluster a population of networks, comparing its performance with standard MCMC algorithms and validating the adaptive strategy.

Submitted 20 December, 2021; originally announced December 2021.

Comments: 20 pages, 4 figures

arXiv:2107.09357 [pdf, other]

JAGS, NIMBLE, Stan: a detailed comparison among Bayesian MCMC software

Authors: Mario Beraha, Daniele Falco, Alessandra Guglielmi

Abstract: The aim of this work is the comparison of the performance of the three popular software platforms JAGS, NIMBLE and Stan. These probabilistic programming languages are able to automatically generate samples from the posterior distribution of interest using MCMC algorithms, starting from the specification of a Bayesian model, i.e. the likelihood and the prior. The final goal is to present a detailed analysis of their strengths and weaknesses to statisticians or applied scientists. In this way, we wish to contribute to make them fully aware of the pros and cons of this software. We carry out a systematic comparison of the three platforms on a wide class of models, prior distributions, and data generating mechanisms. Our extensive simulation studies evaluate the quality of the MCMC chains produced, the efficiency of the software and the goodness of fit of the output. We also consider the efficiency of the parallelization made by the three platforms.

Submitted 20 July, 2021; originally announced July 2021.

arXiv:2101.09039 [pdf, other]

Projected Statistical Methods for Distributional Data on the Real Line with the Wasserstein Metric

Authors: Matteo Pegoraro, Mario Beraha

Abstract: We present a novel class of projected methods, to perform statistical analysis on a data set of probability distributions on the real line, with the 2-Wasserstein metric. We focus in particular on Principal Component Analysis (PCA) and regression. To define these models, we exploit a representation of the Wasserstein space closely related to its weak Riemannian structure, by mapping the data to a suitable linear space and using a metric projection operator to constrain the results in the Wasserstein space. By carefully choosing the tangent point, we are able to derive fast empirical methods, exploiting a constrained B-spline approximation. As a byproduct of our approach, we are also able to derive faster routines for previous work on PCA for distributions. By means of simulation studies, we compare our approaches to previously proposed methods, showing that our projected PCA has similar performance for a fraction of the computational cost and that the projected regression is extremely flexible even under misspecification. Several theoretical properties of the models are investigated and asymptotic consistency is proven. Two real world applications to Covid-19 mortality in the US and wind speed forecasting are discussed.

Submitted 29 November, 2021; v1 submitted 22 January, 2021; originally announced January 2021.

arXiv:2011.06444 [pdf, other]

MCMC computations for Bayesian mixture models using repulsive point processes

Authors: Mario Beraha, Raffaele Argiento, Jesper Møller, Alessandra Guglielmi

Abstract: Repulsive mixture models have recently gained popularity for Bayesian cluster detection. Compared to more traditional mixture models, repulsive mixture models produce a smaller number of well separated clusters. The most commonly used methods for posterior inference either require to fix a priori the number of components or are based on reversible jump MCMC computation. We present a general framework for mixture models, when the prior of the `cluster centres' is a finite repulsive point process depending on a hyperparameter, specified by a density which may depend on an intractable normalizing constant. By investigating the posterior characterization of this class of mixture models, we derive a MCMC algorithm which avoids the well-known difficulties associated to reversible jump MCMC computation. In particular, we use an ancillary variable method, which eliminates the problem of having intractable normalizing constants in the Hastings ratio. The ancillary variable method relies on a perfect simulation algorithm, and we demonstrate this is fast because the number of components is typically small. In several simulation studies and an application on sociological data, we illustrate the advantage of our new methodology over existing methods, and we compare the use of a determinantal or a repulsive Gibbs point process prior model.

Submitted 19 April, 2021; v1 submitted 12 November, 2020; originally announced November 2020.

arXiv:2007.14961 [pdf, other]

Spatially dependent mixture models via the Logistic Multivariate CAR prior

Authors: Mario Beraha, Matteo Pegoraro, Riccardo Peli, Alessandra Guglielmi

Abstract: We consider the problem of spatially dependent areal data, where for each area independent observations are available, and propose to model the density of each area through a finite mixture of Gaussian distributions. The spatial dependence is introduced via a novel joint distribution for a collection of vectors in the simplex, that we term logisticMCAR. We show that salient features of the logisticMCAR distribution can be described analytically, and that a suitable augmentation scheme based on the Pólya-Gamma identity allows to derive an efficient Markov Chain Monte Carlo algorithm. When compared to competitors, our model has proved to better estimate densities in different (disconnected) areal locations when they have different characteristics. We discuss an application on a real dataset of Airbnb listings in the city of Amsterdam, also showing how to easily incorporate for additional covariate information in the model.

Submitted 8 June, 2021; v1 submitted 29 July, 2020; originally announced July 2020.

arXiv:2005.10287 [pdf, other]

The semi-hierarchical Dirichlet Process and its application to clustering homogeneous distributions

Authors: Mario Beraha, Alessandra Guglielmi, Fernando A. Quintana

Abstract: Assessing homogeneity of distributions is an old problem that has received considerable attention, especially in the nonparametric Bayesian literature. To this effect, we propose the semi-hierarchical Dirichlet process, a novel hierarchical prior that extends the hierarchical Dirichlet process of Teh et al. (2006) and that avoids the degeneracy issues of nested processes recently described by Camerlenghi et al. (2019a). We go beyond the simple yes/no answer to the homogeneity question and embed the proposed prior in a random partition model; this procedure allows us to give a more comprehensive response to the above question and in fact find groups of populations that are internally homogeneous when I greater or equal than 2 such populations are considered. We study theoretical properties of the semi-hierarchical Dirichlet process and of the Bayes factor for the homogeneity test when I = 2. Extensive simulation studies and applications to educational data are also discussed.

Submitted 16 June, 2021; v1 submitted 20 May, 2020; originally announced May 2020.

arXiv:1907.07384 [pdf, other]

Feature Selection via Mutual Information: New Theoretical Insights

Authors: Mario Beraha, Alberto Maria Metelli, Matteo Papini, Andrea Tirinzoni, Marcello Restelli

Abstract: Mutual information has been successfully adopted in filter feature-selection methods to assess both the relevancy of a subset of features in predicting the target variable and the redundancy with respect to other variables. However, existing algorithms are mostly heuristic and do not offer any guarantee on the proposed solution. In this paper, we provide novel theoretical results showing that conditional mutual information naturally arises when bounding the ideal regression/classification errors achieved by different subsets of features. Leveraging on these insights, we propose a novel stopping condition for backward and forward greedy methods which ensures that the ideal prediction error using the selected feature subset remains bounded by a user-specified threshold. We provide numerical simulations to support our theoretical claims and compare to common heuristic methods.

Submitted 17 July, 2019; originally announced July 2019.

Comments: Accepted for presentation at the International Joint Conference on Neural Networks (IJCNN) 2019

Showing 1–23 of 23 results for author: Beraha, M