-
Bayesian information criteria for clustering normally distributed data
Authors:
Anthony J. Webster
Abstract:
Maximum likelihood estimates (MLEs) are asymptotically normally distributed, and this property is used in meta-analyses to test the heterogeneity of estimates, either for a single cluster or for several sub-groups. More recently, MLEs for associations between risk factors and diseases have been hierarchically clustered to search for diseases with shared underlying causes, but an objective statistical criterion is needed to determine the number and composition of clusters. To tackle this problem, conventional statistical tests are briefly reviewed, before considering the posterior distribution for a partition of data into clusters. The posterior distribution is calculated by marginalising out the unknown cluster centres, and is different to the likelihood associated with mixture models. The calculation is equivalent to that used to obtain the Bayesian Information Criterion (BIC), but is exact, without a Laplace approximation. The result includes a sum of squares term, and terms that depend on the number and composition of clusters, that penalise the number of free parameters in the model. The usual BIC is shown to be unsuitable for clustering applications unless the number of items in each individual cluster is sufficiently large.
Submitted 24 February, 2022; v1 submitted 10 August, 2020;
originally announced August 2020.
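For orientation, the following is a minimal sketch of the usual BIC applied to a partition of normally distributed estimates with known variances. It is not the exact criterion derived in the paper. The function names (`weighted_mean`, `usual_bic`) and the example data are hypothetical, and additive constants that do not depend on the partition are dropped.

```python
import math


def weighted_mean(values, variances):
    """Inverse-variance weighted mean: the MLE of a shared cluster centre."""
    weights = [1.0 / v for v in variances]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def usual_bic(values, variances, partition):
    """Usual BIC (up to a partition-independent constant) for a partition.

    partition is a list of clusters, each a list of indices into values.
    The sum-of-squares term is -2 log-likelihood at the fitted centres;
    the penalty is k * ln(n) for k free cluster centres and n items.
    """
    n = len(values)
    sum_of_squares = 0.0
    for cluster in partition:
        xs = [values[i] for i in cluster]
        vs = [variances[i] for i in cluster]
        centre = weighted_mean(xs, vs)
        sum_of_squares += sum((x - centre) ** 2 / v for x, v in zip(xs, vs))
    return sum_of_squares + len(partition) * math.log(n)


# Hypothetical estimates with unit variances, forming two obvious groups.
values = [0.0, 0.2, 5.0, 5.2]
variances = [1.0, 1.0, 1.0, 1.0]
one_cluster = usual_bic(values, variances, [[0, 1, 2, 3]])
two_clusters = usual_bic(values, variances, [[0, 1], [2, 3]])
```

The lower value indicates the preferred partition. As the abstract notes, this usual form is unsuitable unless each cluster contains sufficiently many items, which is the case the exact calculation addresses.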
-
Calculation of Exact Estimators by Integration Over the Surface of an n-Dimensional Sphere
Authors:
Anthony J Webster
Abstract:
This paper reconsiders the problem of calculating the expected set of probabilities <p_i>, given the observed set of items {m_i}, that are distributed among n bins with an (unknown) set of probabilities {p_i} for being placed in the ith bin. The problem is often formulated using Bayes' theorem and the multinomial distribution, along with a constant prior for the values of the p_i, leading to a Dirichlet distribution for the {p_i}. The moments of the p_i can then be calculated exactly. Here a new approach is suggested for the calculation of the moments, which uses a change of variables that reduces the problem to an integration over a portion of the surface of an n-dimensional sphere. This greatly simplifies the calculation by allowing a straightforward integration over (n-1) independent variables, with the constraints on the set of p_i being automatically satisfied. For the Dirichlet and similar distributions the problem simplifies even further, with the resulting integrals subsequently factorising, allowing their easy evaluation in terms of Beta functions. A proof by induction confirms existing calculations for the moments. The advantage of the approach presented here is that the methods and results apply with minimal or no modifications to numerical calculations that involve more complicated distributions or non-constant prior distributions, for which cases the numerical calculations will be greatly simplified.
Submitted 3 May, 2013;
originally announced May 2013.
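As a small illustration of the existing results that the paper confirms, the sketch below evaluates the first two moments of the p_i under a constant prior, using the standard Dirichlet posterior with parameters m_i + 1. It does not reproduce the paper's spherical-surface integration. The function names and example counts are hypothetical.

```python
from fractions import Fraction


def posterior_mean(counts):
    """E[p_i] = (m_i + 1) / (N + n) for a constant prior over the p_i."""
    total = sum(counts) + len(counts)  # N + n
    return [Fraction(m + 1, total) for m in counts]


def posterior_second_moment(counts):
    """E[p_i^2] = (m_i + 1)(m_i + 2) / ((N + n)(N + n + 1))."""
    total = sum(counts) + len(counts)  # N + n
    return [Fraction((m + 1) * (m + 2), total * (total + 1)) for m in counts]


# Hypothetical observation: 4 items distributed over n = 3 bins.
counts = [3, 1, 0]
means = posterior_mean(counts)
second_moments = posterior_second_moment(counts)
```

Exact rational arithmetic is used so that the expected probabilities can be checked to sum to one without rounding error. Note that the empty bin still receives a non-zero expected probability.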
-
Estimating Omissions from Searches
Authors:
Anthony J Webster,
Richard Kemp
Abstract:
The mark-recapture method was devised by Petersen in 1896 to estimate the number of fish migrating into the Limfjord, and independently by Lincoln in 1930 to estimate waterfowl abundance. The technique applies to any search for a finite number of items by two or more people or agents, allowing the number of searched-for items to be estimated. This ubiquitous problem appears in fields from ecology and epidemiology, through to mathematics, social sciences, and computing. Here we exactly calculate the moments of the hypergeometric distribution associated with this long-standing problem, confirming that widely used estimates conjectured in 1951 are often too small. Our Bayesian approach highlights how different search strategies will modify the estimates. As an example, we assess the accuracy of a systematic literature review, an application we recommend.
Submitted 31 May, 2013; v1 submitted 5 May, 2012;
originally announced May 2012.
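To make the two-searcher setting concrete, the sketch below computes the classical Lincoln-Petersen estimate and the Chapman variant for two independent searches. It is assumed here that the "widely used estimates" of 1951 refer to Chapman's form; the paper's exact Bayesian moments are not reproduced. The function names and example counts are hypothetical.

```python
from fractions import Fraction


def lincoln_petersen(found_first, found_second, found_both):
    """Classical estimate of the total: n1 * n2 / m."""
    if found_both == 0:
        raise ValueError("estimate is undefined when the searches share no items")
    return Fraction(found_first * found_second, found_both)


def chapman(found_first, found_second, found_both):
    """Chapman's variant: (n1 + 1)(n2 + 1) / (m + 1) - 1."""
    return Fraction((found_first + 1) * (found_second + 1), found_both + 1) - 1


# Hypothetical literature review: reviewer A finds 40 relevant papers,
# reviewer B finds 30, and 12 papers appear in both result sets.
n1, n2, m = 40, 30, 12
total_lp = lincoln_petersen(n1, n2, m)
total_chapman = chapman(n1, n2, m)
found_so_far = n1 + n2 - m                      # distinct papers found
estimated_missing = total_lp - found_so_far     # papers both searches missed
```

In this example the two estimators differ by a few items, which gives a sense of the scale of the discrepancies the abstract discusses.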