-
A Rate-Distortion View of Uncertainty Quantification
Authors:
Ifigeneia Apostolopoulou,
Benjamin Eysenbach,
Frank Nielsen,
Artur Dubrawski
Abstract:
In supervised learning, understanding an input's proximity to the training data can help a model decide whether it has sufficient evidence for reaching a reliable prediction. While powerful probabilistic models such as Gaussian Processes naturally have this property, deep neural networks often lack it. In this paper, we introduce Distance Aware Bottleneck (DAB), i.e., a new method for enriching deep neural networks with this property. Building on prior information bottleneck approaches, our method learns a codebook that stores a compressed representation of all inputs seen during training. The distance of a new example from this codebook can serve as an uncertainty estimate for the example. The resulting model is simple to train and provides deterministic uncertainty estimates by a single forward pass. Finally, our method achieves better out-of-distribution (OOD) detection and misclassification prediction than prior methods, including expensive ensemble methods, deep kernel Gaussian Processes, and approaches based on the standard information bottleneck.
Submitted 18 June, 2024; v1 submitted 15 June, 2024;
originally announced June 2024.
-
A bivariate two-state Markov modulated Poisson process for failure modelling
Authors:
Yoel G. Yera,
Rosa E. Lillo,
Bo F. Nielsen,
Pepa Ramírez-Cobo,
Fabrizio Ruggeri
Abstract:
Motivated by a real failure dataset in a two-dimensional context, this paper presents an extension of the Markov modulated Poisson process (MMPP) to two dimensions. The one-dimensional MMPP has been proposed for the modeling of dependent and non-exponential inter-failure times (in contexts such as queuing, risk or reliability, among others). The novel two-dimensional MMPP allows for dependence between the two sequences of inter-failure times, while at the same time preserving the MMPP properties marginally. The generalization is based on the Marshall-Olkin exponential distribution. Inference is undertaken for the new model through a method combining a matching moments approach with an Approximate Bayesian Computation (ABC) algorithm. The performance of the method is shown on simulated and real datasets representing times and distances covered between consecutive failures in a public transport company. For the real dataset, some quantities of importance associated with the reliability of the system are estimated, such as the probabilities and expected number of failures at different times and distances covered by trains until the occurrence of a failure.
Submitted 26 January, 2024;
originally announced January 2024.
-
The Tempered Hilbert Simplex Distance and Its Application To Non-linear Embeddings of TEMs
Authors:
Ehsan Amid,
Frank Nielsen,
Richard Nock,
Manfred K. Warmuth
Abstract:
Tempered Exponential Measures (TEMs) are a parametric generalization of the exponential family of distributions maximizing the tempered entropy function among positive measures subject to a probability normalization of their power densities. Calculus on TEMs relies on a deformed algebra of arithmetic operators induced by the deformed logarithms used to define the tempered entropy. In this work, we introduce three different parameterizations of finite discrete TEMs via Legendre functions of the negative tempered entropy function. In particular, we establish an isometry between such parameterizations in terms of a generalization of the Hilbert log cross-ratio simplex distance to a tempered Hilbert co-simplex distance. Similar to the Hilbert geometry, the tempered Hilbert distance is characterized as a $t$-symmetrization of the oriented tempered Funk distance. We motivate our construction by introducing the notion of $t$-lengths of smooth curves in a tautological Finsler manifold. We then demonstrate the properties of our generalized structure in different settings and numerically examine the quality of its differentiable approximations for optimization in machine learning settings.
Submitted 22 November, 2023;
originally announced November 2023.
-
Fisher-Rao distance and pullback SPD cone distances between multivariate normal distributions
Authors:
Frank Nielsen
Abstract:
Data sets of multivariate normal distributions abound in many scientific areas like diffusion tensor imaging, structure tensor computer vision, radar signal processing, machine learning, just to name a few. In order to process those normal data sets for downstream tasks like filtering, classification or clustering, one needs to define proper notions of dissimilarities between normals and paths joining them. The Fisher-Rao distance, defined as the Riemannian geodesic distance induced by the Fisher information metric, is such a principled metric distance which, however, is not known in closed form except for a few particular cases. In this work, we first report a fast and robust method to approximate arbitrarily finely the Fisher-Rao distance between multivariate normal distributions. Second, we introduce a class of distances based on diffeomorphic embeddings of the normal manifold into a submanifold of the higher-dimensional symmetric positive-definite cone corresponding to the manifold of centered normal distributions. We show that the projective Hilbert distance on the cone yields a metric on the embedded normal submanifold and we pull back that cone distance with its associated straight line Hilbert cone geodesics to obtain a distance and smooth paths between normal distributions. Compared to the Fisher-Rao distance approximation, the pullback Hilbert cone distance is computationally light since it requires computing only the extreme minimal and maximal eigenvalues of matrices. Finally, we show how to use those distances in clustering tasks.
Submitted 9 June, 2024; v1 submitted 20 July, 2023;
originally announced July 2023.
-
Product Jacobi-Theta Boltzmann machines with score matching
Authors:
Andrea Pasquale,
Daniel Krefl,
Stefano Carrazza,
Frank Nielsen
Abstract:
The estimation of probability density functions is a nontrivial task that in recent years has been tackled with machine learning techniques. Successful applications can be obtained using models inspired by the Boltzmann machine (BM) architecture. In this manuscript, the product Jacobi-Theta Boltzmann machine (pJTBM) is introduced as a restricted version of the Riemann-Theta Boltzmann machine (RTBM) with diagonal hidden sector connection matrix. We show that score matching, based on the Fisher divergence, can be used to fit probability densities with the pJTBM more efficiently than with the original RTBM.
Submitted 12 January, 2024; v1 submitted 10 March, 2023;
originally announced March 2023.
-
Simplifying Momentum-based Positive-definite Submanifold Optimization with Applications to Deep Learning
Authors:
Wu Lin,
Valentin Duruisseaux,
Melvin Leok,
Frank Nielsen,
Mohammad Emtiyaz Khan,
Mark Schmidt
Abstract:
Riemannian submanifold optimization with momentum is computationally challenging because, to ensure that the iterates remain on the submanifold, we often need to solve difficult differential equations. Here, we simplify such difficulties for a class of sparse or structured symmetric positive-definite matrices with the affine-invariant metric. We do so by proposing a generalized version of the Riemannian normal coordinates that dynamically orthonormalizes the metric and locally converts the problem into an unconstrained problem in the Euclidean space. We use our approach to simplify existing approaches for structured covariances and develop matrix-inverse-free $2^\text{nd}$-order optimizers for deep learning with low precision by using only matrix multiplications. Code: https://github.com/yorkerlin/StructuredNGD-DL
Submitted 16 March, 2024; v1 submitted 19 February, 2023;
originally announced February 2023.
-
Variational Representations of Annealing Paths: Bregman Information under Monotonic Embedding
Authors:
Rob Brekelmans,
Frank Nielsen
Abstract:
Markov Chain Monte Carlo methods for sampling from complex distributions and estimating normalization constants often simulate samples from a sequence of intermediate distributions along an annealing path, which bridges between a tractable initial distribution and a target density of interest. Prior works have constructed annealing paths using quasi-arithmetic means, and interpreted the resulting intermediate densities as minimizing an expected divergence to the endpoints. To analyze these variational representations of annealing paths, we extend known results showing that the arithmetic mean over arguments minimizes the expected Bregman divergence to a single representative point. In particular, we obtain an analogous result for quasi-arithmetic means, when the inputs to the Bregman divergence are transformed under a monotonic embedding function. Our analysis highlights the interplay between quasi-arithmetic means, parametric families, and divergence functionals using the rho-tau representational Bregman divergence framework, and associates common divergence functionals with intermediate densities along an annealing path.
Submitted 6 February, 2024; v1 submitted 15 September, 2022;
originally announced September 2022.
-
On the Influence of Enforcing Model Identifiability on Learning dynamics of Gaussian Mixture Models
Authors:
Pascal Mattia Esser,
Frank Nielsen
Abstract:
A common way to learn and analyze statistical models is to consider operations in the model parameter space. But what happens if we optimize in the parameter space and there is no one-to-one mapping between the parameter space and the underlying statistical model space? Such cases frequently occur for hierarchical models which include statistical mixtures or stochastic neural networks, and these models are said to be singular. Singular models reveal several important and well-studied problems in machine learning like the decrease in convergence speed of learning trajectories due to attractor behaviors. In this work, we propose a relative reparameterization technique of the parameter space, which yields a general method for extracting regular submodels from singular models. Our method enforces model identifiability during training and we study the learning dynamics for gradient descent and expectation maximization for Gaussian Mixture Models (GMMs) under relative parameterization, showing faster experimental convergence and an improved manifold shape of the dynamics around the singularity. Extending the analysis beyond GMMs, we furthermore analyze the Fisher information matrix under relative reparameterization and its influence on the generalization error, and show how the method can be applied to more complex models like deep neural networks.
Submitted 17 June, 2022;
originally announced June 2022.
-
Structured second-order methods via natural gradient descent
Authors:
Wu Lin,
Frank Nielsen,
Mohammad Emtiyaz Khan,
Mark Schmidt
Abstract:
In this paper, we propose new structured second-order methods and structured adaptive-gradient methods obtained by performing natural-gradient descent on structured parameter spaces. Natural-gradient descent is an attractive approach to design new algorithms in many settings such as gradient-free, adaptive-gradient, and second-order methods. Our structured methods not only enjoy a structural invariance but also admit a simple expression. Finally, we test the efficiency of our proposed methods on both deterministic non-convex problems and deep learning problems.
Submitted 19 February, 2022; v1 submitted 22 July, 2021;
originally announced July 2021.
-
q-Paths: Generalizing the Geometric Annealing Path using Power Means
Authors:
Vaden Masrani,
Rob Brekelmans,
Thang Bui,
Frank Nielsen,
Aram Galstyan,
Greg Ver Steeg,
Frank Wood
Abstract:
Many common machine learning methods involve the geometric annealing path, a sequence of intermediate densities between two distributions of interest constructed using the geometric average. While alternatives such as the moment-averaging path have demonstrated performance gains in some settings, their practical applicability remains limited by exponential family endpoint assumptions and a lack of closed form energy function. In this work, we introduce $q$-paths, a family of paths which is derived from a generalized notion of the mean, includes the geometric and arithmetic mixtures as special cases, and admits a simple closed form involving the deformed logarithm function from nonextensive thermodynamics. Following previous analysis of the geometric path, we interpret our $q$-paths as corresponding to a $q$-exponential family of distributions, and provide a variational representation of intermediate densities as minimizing a mixture of $α$-divergences to the endpoints. We show that small deviations away from the geometric path yield empirical gains for Bayesian inference using Sequential Monte Carlo and generative model evaluation using Annealed Importance Sampling.
Submitted 1 July, 2021;
originally announced July 2021.
-
Tractable structured natural gradient descent using local parameterizations
Authors:
Wu Lin,
Frank Nielsen,
Mohammad Emtiyaz Khan,
Mark Schmidt
Abstract:
Natural-gradient descent (NGD) on structured parameter spaces (e.g., low-rank covariances) is computationally challenging due to difficult Fisher-matrix computations. We address this issue by using \emph{local-parameter coordinates} to obtain a flexible and efficient NGD method that works well for a wide variety of structured parameterizations. We show four applications where our method (1) generalizes the exponential natural evolutionary strategy, (2) recovers existing Newton-like algorithms, (3) yields new structured second-order algorithms via matrix groups, and (4) gives new algorithms to learn covariances of Gaussian and Wishart-based distributions. We show results on a range of problems from deep learning, variational inference, and evolution strategies. Our work opens a new direction for scalable structured geometric methods.
Submitted 17 January, 2022; v1 submitted 15 February, 2021;
originally announced February 2021.
-
Likelihood Ratio Exponential Families
Authors:
Rob Brekelmans,
Frank Nielsen,
Alireza Makhzani,
Aram Galstyan,
Greg Ver Steeg
Abstract:
The exponential family is well known in machine learning and statistical physics as the maximum entropy distribution subject to a set of observed constraints, while the geometric mixture path is common in MCMC methods such as annealed importance sampling. Linking these two ideas, recent work has interpreted the geometric mixture path as an exponential family of distributions to analyze the thermodynamic variational objective (TVO).
We extend these likelihood ratio exponential families to include solutions to rate-distortion (RD) optimization, the information bottleneck (IB) method, and recent rate-distortion-classification approaches which combine RD and IB. This provides a common mathematical framework for understanding these methods via the conjugate duality of exponential families and hypothesis testing. Further, we collect existing results to provide a variational representation of intermediate RD or TVO distributions as minimizing an expectation of KL divergences. This solution also corresponds to a size-power tradeoff using the likelihood ratio test and the Neyman-Pearson lemma. In thermodynamic integration bounds such as the TVO, we identify the intermediate distribution whose expected sufficient statistics match the log partition function.
Submitted 15 January, 2021; v1 submitted 31 December, 2020;
originally announced December 2020.
-
Clustering patterns connecting COVID-19 dynamics and Human mobility using optimal transport
Authors:
Frank Nielsen,
Gautier Marti,
Sumanta Ray,
Saumyadipta Pyne
Abstract:
Social distancing and stay-at-home are among the few measures that are known to be effective in checking the spread of a pandemic such as COVID-19 in a given population. The patterns of dependency between such measures and their effects on disease incidence may vary dynamically and across different populations. We described a new computational framework to measure and compare the temporal relationships between human mobility and new cases of COVID-19 across more than 150 cities of the United States with relatively high incidence of the disease. We used a novel application of Optimal Transport for computing the distance between the normalized patterns induced by bivariate time series for each pair of cities. Thus, we identified 10 clusters of cities with similar temporal dependencies, and computed the Wasserstein barycenter to describe the overall dynamic pattern for each cluster. Finally, we used city-specific socioeconomic covariates to analyze the composition of each cluster.
Submitted 21 July, 2020;
originally announced July 2020.
-
Schoenberg-Rao distances: Entropy-based and geometry-aware statistical Hilbert distances
Authors:
Gaëtan Hadjeres,
Frank Nielsen
Abstract:
Distances between probability distributions that take into account the geometry of their sample space, like the Wasserstein or the Maximum Mean Discrepancy (MMD) distances, have received a lot of attention in machine learning as they can, for instance, be used to compare probability distributions with disjoint supports. In this paper, we study a class of statistical Hilbert distances that we term the Schoenberg-Rao distances, a generalization of the MMD that allows one to consider a broader class of kernels, namely the conditionally negative semi-definite kernels. In particular, we introduce a principled way to construct such kernels and derive novel closed-form distances between mixtures of Gaussian distributions. These distances, derived from the concave Rao's quadratic entropy, enjoy nice theoretical properties and possess interpretable hyperparameters which can be tuned for specific applications. Our method constitutes a practical alternative to Wasserstein distances and we illustrate its efficiency on a broad range of machine learning tasks such as density estimation, generative modeling and mixture simplification.
Submitted 28 April, 2020; v1 submitted 19 February, 2020;
originally announced February 2020.
-
Information-Geometric Set Embeddings (IGSE): From Sets to Probability Distributions
Authors:
Ke Sun,
Frank Nielsen
Abstract:
This letter introduces an abstract learning problem called the "set embedding": The objective is to map sets into probability distributions so as to lose less information. We relate set union and intersection operations with corresponding interpolations of probability distributions. We also demonstrate a preliminary solution with experimental results on toy set embedding examples.
Submitted 11 December, 2019; v1 submitted 27 November, 2019;
originally announced November 2019.
-
A Geometric Modeling of Occam's Razor in Deep Learning
Authors:
Ke Sun,
Frank Nielsen
Abstract:
Why do deep neural networks (DNNs) benefit from very high dimensional parameter spaces? Their huge parameter complexities vs stunning performances in practice is all the more intriguing and not explainable using the standard theory of model selection for regular models. In this work, we propose a geometrically flavored information-theoretic approach to study this phenomenon. With the belief that simplicity is linked to better generalization, as grounded in the theory of minimum description length, the objective of our analysis is to examine and bound the complexity of DNNs. We introduce the locally varying dimensionality of the parameter space of neural network models by considering the number of significant dimensions of the Fisher information matrix, and model the parameter space as a manifold using the framework of singular semi-Riemannian geometry. We derive model complexity measures which yield short description lengths for deep neural network models based on their singularity analysis thus explaining the good performance of DNNs despite their large number of parameters.
Submitted 26 March, 2025; v1 submitted 27 May, 2019;
originally announced May 2019.
-
The statistical Minkowski distances: Closed-form formula for Gaussian Mixture Models
Authors:
Frank Nielsen
Abstract:
The traditional Minkowski distances are induced by the corresponding Minkowski norms in real-valued vector spaces. In this work, we propose novel statistical symmetric distances based on the Minkowski's inequality for probability densities belonging to Lebesgue spaces. These statistical Minkowski distances admit closed-form formula for Gaussian mixture models when parameterized by integer exponents. This result extends to arbitrary mixtures of exponential families with natural parameter spaces being cones: This includes the binomial, the multinomial, the zero-centered Laplacian, the Gaussian and the Wishart mixtures, among others. We also derive a Minkowski's diversity index of a normalized weighted set of probability distributions from Minkowski's inequality.
Submitted 17 January, 2019; v1 submitted 9 January, 2019;
originally announced January 2019.
-
Variation Network: Learning High-level Attributes for Controlled Input Manipulation
Authors:
Gaëtan Hadjeres,
Frank Nielsen
Abstract:
This paper presents the Variation Network (VarNet), a generative model providing means to manipulate the high-level attributes of a given input. The originality of our approach is that VarNet is not only capable of handling pre-defined attributes but can also learn the relevant attributes of the dataset by itself. These two settings can also be easily considered at the same time, which makes this model applicable to a wide variety of tasks. Further, VarNet has a sound information-theoretic interpretation which grants us with interpretable means to control how these high-level attributes are learned. We demonstrate experimentally that this model is capable of performing interesting input manipulation and that the learned attributes are relevant and meaningful.
Submitted 16 September, 2019; v1 submitted 11 January, 2019;
originally announced January 2019.
-
On The Chain Rule Optimal Transport Distance
Authors:
Frank Nielsen,
Ke Sun
Abstract:
We define a novel class of distances between statistical multivariate distributions by modeling an optimal transport problem on their marginals with respect to a ground distance defined on their conditionals. These new distances are metrics whenever the ground distance between the marginals is a metric, generalize both the Wasserstein distances between discrete measures and a recently introduced metric distance between statistical mixtures, and provide an upper bound for jointly convex distances between statistical mixtures. By entropic regularization of the optimal transport, we obtain a fast differentiable Sinkhorn-type distance. We experimentally evaluate our new family of distances by quantifying the upper bounds of several jointly convex distances between statistical mixtures, and by proposing a novel efficient method to learn Gaussian mixture models (GMMs) by simplifying kernel density estimators with respect to our distance. Our GMM learning technique experimentally improves significantly over the EM implementation of {\tt sklearn} on the {\tt MNIST} and {\tt Fashion MNIST} datasets.
Submitted 2 November, 2020; v1 submitted 19 December, 2018;
originally announced December 2018.
-
Geometry and clustering with metrics derived from separable Bregman divergences
Authors:
Erika Gomes-Gonçalves,
Henryk Gzyl,
Frank Nielsen
Abstract:
Separable Bregman divergences induce Riemannian metric spaces that are isometric to the Euclidean space after monotone embeddings. We investigate fixed rate quantization and its codebook Voronoi diagrams, and report on experimental performances of partition-based, hierarchical, and soft clustering algorithms with respect to these Riemann-Bregman distances.
Submitted 25 October, 2018;
originally announced October 2018.
-
The Bregman chord divergence
Authors:
Frank Nielsen,
Richard Nock
Abstract:
Distances are fundamental primitives whose choice significantly impacts the performance of algorithms in machine learning and signal processing. However, selecting the most appropriate distance for a given task is an endeavor. Instead of testing one by one the entries of an ever-expanding dictionary of ad hoc distances, one rather prefers to consider parametric classes of distances that are exhaustively characterized by axioms derived from first principles. Bregman divergences are such a class. However, fine-tuning a Bregman divergence is delicate since it requires smoothly adjusting a functional generator. In this work, we propose an extension of Bregman divergences called the Bregman chord divergences. This new class of distances does not require gradient calculations, uses two scalar parameters that can be easily tailored in applications, and generalizes asymptotically Bregman divergences.
Submitted 22 October, 2018;
originally announced October 2018.
-
Sinkhorn AutoEncoders
Authors:
Giorgio Patrini,
Rianne van den Berg,
Patrick Forré,
Marcello Carioni,
Samarth Bhargav,
Max Welling,
Tim Genewein,
Frank Nielsen
Abstract:
Optimal transport offers an alternative to maximum likelihood for learning generative autoencoding models. We show that minimizing the p-Wasserstein distance between the generator and the true data distribution is equivalent to the unconstrained min-min optimization of the p-Wasserstein distance between the encoder aggregated posterior and the prior in latent space, plus a reconstruction error. We also identify the role of its trade-off hyperparameter as the capacity of the generator: its Lipschitz constant. Moreover, we prove that optimizing the encoder over any class of universal approximators, such as deterministic neural networks, is enough to come arbitrarily close to the optimum. We therefore advertise this framework, which holds for any metric space and prior, as a sweet-spot of current generative autoencoding objectives. We then introduce the Sinkhorn auto-encoder (SAE), which approximates and minimizes the p-Wasserstein distance in latent space via backpropagation through the Sinkhorn algorithm. SAE directly works on samples, i.e. it models the aggregated posterior as an implicit distribution, with no need for a reparameterization trick for gradient estimation. SAE is thus able to work with different metric spaces and priors with minimal adaptations. We demonstrate the flexibility of SAE on latent spaces with different geometries and priors and compare with other methods on benchmark data sets.
Submitted 15 July, 2019; v1 submitted 2 October, 2018;
originally announced October 2018.
-
An elementary introduction to information geometry
Authors:
Frank Nielsen
Abstract:
In this survey, we describe the fundamental differential-geometric structures of information manifolds, state the fundamental theorem of information geometry, and illustrate some use cases of these information manifolds in information sciences. The exposition is self-contained by concisely introducing the necessary concepts of differential geometry, but proofs are omitted for brevity.
Submitted 6 September, 2020; v1 submitted 16 August, 2018;
originally announced August 2018.
-
Guaranteed Deterministic Bounds on the Total Variation Distance between Univariate Mixtures
Authors:
Frank Nielsen,
Ke Sun
Abstract:
The total variation distance is a core statistical distance between probability measures that satisfies the metric axioms, with value always falling in $[0,1]$. This distance plays a fundamental role in machine learning and signal processing: It is a member of the broader class of $f$-divergences, and it is related to the probability of error in Bayesian hypothesis testing. Since the total variation distance does not admit closed-form expressions for statistical mixtures (like Gaussian mixture models), one often has to rely in practice on costly numerical integrations or on fast Monte Carlo approximations that however do not guarantee deterministic lower and upper bounds. In this work, we consider two methods for bounding the total variation of univariate mixture models: The first method is based on the information monotonicity property of the total variation to design guaranteed nested deterministic lower bounds. The second method relies on computing the geometric lower and upper envelopes of weighted mixture components to derive deterministic bounds based on density ratio. We demonstrate the tightness of our bounds in a series of experiments on Gaussian, Gamma and Rayleigh mixture models.
Submitted 29 June, 2018;
originally announced June 2018.
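The first bounding idea mentioned in this abstract, information monotonicity, can be illustrated with a minimal sketch. It assumes Gaussian mixture components and a hand-picked set of cut points; the function names (`normal_cdf`, `mixture_cdf`, `tv_lower_bound`) and the example mixtures are hypothetical and are not taken from the paper. Coarse-graining the real line into bins can only lose information, so the total variation between the binned distributions is a lower bound on the true total variation, and refining the bins can only raise that bound.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a univariate Gaussian, written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def mixture_cdf(x, weights, mus, sigmas):
    """CDF of a Gaussian mixture: weighted sum of the component CDFs."""
    return sum(w * normal_cdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

def tv_lower_bound(mix_p, mix_q, cuts):
    """Total variation between the two mixtures after binning the real line.

    `cuts` are the finite bin boundaries; the outer bins extend to infinity.
    By information monotonicity this never exceeds the true total variation.
    """
    edges = [-math.inf] + sorted(cuts) + [math.inf]
    total = 0.0
    for a, b in zip(edges[:-1], edges[1:]):
        p_mass = mixture_cdf(b, *mix_p) - mixture_cdf(a, *mix_p)
        q_mass = mixture_cdf(b, *mix_q) - mixture_cdf(a, *mix_q)
        total += abs(p_mass - q_mass)
    return 0.5 * total

# Hypothetical example mixtures: (weights, means, standard deviations).
mix_p = ([0.5, 0.5], [-1.0, 1.0], [0.5, 0.5])
mix_q = ([0.3, 0.7], [0.0, 2.0], [1.0, 0.8])

coarse = tv_lower_bound(mix_p, mix_q, [0.0])
fine = tv_lower_bound(mix_p, mix_q, [-2.0, -1.0, 0.0, 1.0, 2.0])
print(coarse, fine)
```

Because the finer partition contains every cut of the coarser one, the second value is at least as large as the first, which is the nesting property the abstract refers to.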
-
Probabilistic PARAFAC2
Authors:
Philip J. H. Jørgensen,
Søren F. V. Nielsen,
Jesper L. Hinrich,
Mikkel N. Schmidt,
Kristoffer H. Madsen,
Morten Mørup
Abstract:
The PARAFAC2 is a multimodal factor analysis model suitable for analyzing multi-way data when one of the modes has incomparable observation units, for example because of differences in signal sampling or batch sizes. A fully probabilistic treatment of the PARAFAC2 is desirable in order to improve robustness to noise and provide a well founded principle for determining the number of factors, but challenging because the factor loadings are constrained to be orthogonal. We develop two probabilistic formulations of the PARAFAC2 along with variational procedures for inference: In the one approach, the mean values of the factor loadings are orthogonal leading to closed form variational updates, and in the other, the factor loadings themselves are orthogonal using a matrix Von Mises-Fisher distribution. We contrast our probabilistic formulation to the conventional direct fitting algorithm based on maximum likelihood. On simulated data and real fluorescence spectroscopy and gas chromatography-mass spectrometry data, we compare our approach to the conventional PARAFAC2 model estimation and find that the probabilistic formulation is more robust to noise and model order misspecification. The probabilistic PARAFAC2 thus forms a promising framework for modeling multi-way data accounting for uncertainty.
Submitted 21 June, 2018;
originally announced June 2018.
-
q-Neurons: Neuron Activations based on Stochastic Jackson's Derivative Operators
Authors:
Frank Nielsen,
Ke Sun
Abstract:
We propose a new generic type of stochastic neurons, called $q$-neurons, that considers activation functions based on Jackson's $q$-derivatives with stochastic parameters $q$. Our generalization of neural network architectures with $q$-neurons is shown to be both scalable and very easy to implement. We demonstrate experimentally consistently improved performances over state-of-the-art standard activation functions, both on training and testing loss functions.
Submitted 13 June, 2018; v1 submitted 31 May, 2018;
originally announced June 2018.
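For readers unfamiliar with the operator named in the title, the following is a minimal sketch of Jackson's $q$-derivative, $D_q f(x) = \frac{f(qx) - f(x)}{qx - x}$ for $x \neq 0$ and $q \neq 1$. It shows only the deterministic operator; how the paper samples $q$ and builds activations from it is not reproduced here, and the function name `jackson_q_derivative` is hypothetical.

```python
import math

def jackson_q_derivative(f, x, q):
    """Jackson's q-derivative of f at x: (f(q*x) - f(x)) / (q*x - x).

    Undefined for x == 0 or q == 1, where the denominator vanishes.
    """
    if x == 0.0 or q == 1.0:
        raise ValueError("q-derivative needs x != 0 and q != 1")
    return (f(q * x) - f(x)) / (q * x - x)

# For f(x) = x**2 the q-derivative is (q + 1) * x, here (2 + 1) * 3.
square_q2 = jackson_q_derivative(lambda t: t * t, 3.0, 2.0)

# As q approaches 1 the q-derivative approaches the ordinary derivative.
tanh_near_1 = jackson_q_derivative(math.tanh, 0.5, 1.0 + 1e-6)
tanh_exact = 1.0 - math.tanh(0.5) ** 2
print(square_q2, tanh_near_1, tanh_exact)
```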
-
Monte Carlo Information Geometry: The dually flat case
Authors:
Frank Nielsen,
Gaëtan Hadjeres
Abstract:
Exponential families and mixture families are parametric probability models that can be geometrically studied as smooth statistical manifolds with respect to any statistical divergence like the Kullback-Leibler (KL) divergence or the Hellinger divergence. When equipping a statistical manifold with the KL divergence, the induced manifold structure is dually flat, and the KL divergence between distributions amounts to an equivalent Bregman divergence on their corresponding parameters. In practice, the corresponding Bregman generators of mixture/exponential families require evaluating definite integrals that can either be too time-consuming (in the case of exponentially large discrete support) or not admit a closed-form formula (in the case of continuous support). In these cases, the dually flat construction remains theoretical and cannot be used by information-geometric algorithms. To bypass this problem, we consider performing stochastic Monte Carlo (MC) estimation of those integral-based mixture/exponential family Bregman generators. We show that, under natural assumptions, these MC generators are almost surely Bregman generators. We define a series of dually flat information geometries, termed Monte Carlo Information Geometries, that increasingly finely approximate the intractable geometry. The advantage of this MCIG is that it allows a practical use of the Bregman algorithmic toolbox on a wide range of probability distribution families. We demonstrate our approach with a clustering task on a mixture family manifold.
Submitted 19 March, 2018;
originally announced March 2018.
-
Wembedder: Wikidata entity embedding web service
Authors:
Finn Årup Nielsen
Abstract:
I present a web service for querying an embedding of entities in the Wikidata knowledge graph. The embedding is trained on the Wikidata dump using Gensim's Word2Vec implementation and a simple graph walk. A REST API is implemented. Together with the Wikidata API the web service exposes a multilingual resource for over 600'000 Wikidata items and properties.
Submitted 11 October, 2017;
originally announced October 2017.
-
Interactive Music Generation with Positional Constraints using Anticipation-RNNs
Authors:
Gaëtan Hadjeres,
Frank Nielsen
Abstract:
Recurrent Neural Networks (RNNs) are now widely used on sequence generation tasks due to their ability to learn long-range dependencies and to generate sequences of arbitrary length. However, their left-to-right generation procedure allows only limited control by a potential user, which makes them unsuitable for interactive and creative usages such as interactive music generation. This paper introduces a novel architecture called Anticipation-RNN which possesses the assets of the RNN-based generative models while allowing user-defined positional constraints to be enforced. We demonstrate its efficiency on the task of generating melodies satisfying positional constraints in the style of the soprano parts of the J.S. Bach chorale harmonizations. Sampling using the Anticipation-RNN is of the same order of complexity as sampling from the traditional RNN model. This fast and interactive generation of musical sequences opens ways to devise real-time systems that could be used for creative purposes.
Submitted 19 September, 2017;
originally announced September 2017.
-
GLSR-VAE: Geodesic Latent Space Regularization for Variational AutoEncoder Architectures
Authors:
Gaëtan Hadjeres,
Frank Nielsen,
François Pachet
Abstract:
VAEs (Variational AutoEncoders) have proved to be powerful in the context of density modeling and have been used in a variety of contexts for creative purposes. In many settings, the data we model possesses continuous attributes that we would like to take into account at generation time. We propose in this paper GLSR-VAE, a Geodesic Latent Space Regularization for the Variational AutoEncoder architecture and its generalizations which allows a fine control on the embedding of the data into the latent space. When augmenting the VAE loss with this regularization, changes in the learned latent space reflect changes of the attributes of the data. This deeper understanding of the VAE latent space structure offers the possibility to modulate the attributes of the generated data in a continuous way. We demonstrate its efficiency on a monophonic music generation task where we manage to generate variations of discrete sequences in an intended and playful way.
Submitted 14 July, 2017;
originally announced July 2017.
-
Scalable Group Level Probabilistic Sparse Factor Analysis
Authors:
Jesper L. Hinrich,
Søren F. V. Nielsen,
Nicolai A. B. Riis,
Casper T. Eriksen,
Jacob Frøsig,
Marco D. F. Kristensen,
Mikkel N. Schmidt,
Kristoffer H. Madsen,
Morten Mørup
Abstract:
Many data-driven approaches exist to extract neural representations of functional magnetic resonance imaging (fMRI) data, but most of them lack a proper probabilistic formulation. We propose a group level scalable probabilistic sparse factor analysis (psFA) allowing spatially sparse maps, component pruning using automatic relevance determination (ARD) and subject specific heteroscedastic spatial noise modeling. For task-based and resting state fMRI, we show that the sparsity constraint gives rise to components similar to those obtained by group independent component analysis. The noise modeling shows that noise is reduced in areas typically associated with activation by the experimental design. The psFA model identifies sparse components and the probabilistic setting provides a natural way to handle parameter uncertainties. The variational Bayesian framework easily extends to more complex noise models than the presently considered.
Submitted 14 December, 2016;
originally announced December 2016.
-
Exploring and measuring non-linear correlations: Copulas, Lightspeed Transportation and Clustering
Authors:
Gautier Marti,
Sebastien Andler,
Frank Nielsen,
Philippe Donnat
Abstract:
We propose a methodology to explore and measure the pairwise correlations that exist between variables in a dataset. The methodology leverages copulas for encoding dependence between two variables, state-of-the-art optimal transport for providing a relevant geometry to the copulas, and clustering for summarizing the main dependence patterns found between the variables. Some of the cluster centers can be used to parameterize a novel dependence coefficient which can target or forget specific dependence patterns. Finally, we illustrate and benchmark the methodology on several datasets. Code and numerical experiments are available online for reproducible research.
Submitted 30 October, 2016;
originally announced October 2016.
-
Guaranteed bounds on the Kullback-Leibler divergence of univariate mixtures using piecewise log-sum-exp inequalities
Authors:
Frank Nielsen,
Ke Sun
Abstract:
Information-theoretic measures such as the entropy, cross-entropy and the Kullback-Leibler divergence between two mixture models are core primitives in many signal processing tasks. Since the Kullback-Leibler divergence of mixtures provably does not admit a closed-form formula, it is in practice either estimated using costly Monte-Carlo stochastic integration, approximated, or bounded using various techniques. We present a fast and generic method that builds algorithmically closed-form lower and upper bounds on the entropy, the cross-entropy and the Kullback-Leibler divergence of mixtures. We illustrate the versatile method by reporting on our experiments for approximating the Kullback-Leibler divergence between univariate exponential mixtures, Gaussian mixtures, Rayleigh mixtures, and Gamma mixtures.
Submitted 16 August, 2016; v1 submitted 19 June, 2016;
originally announced June 2016.
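The title refers to log-sum-exp inequalities. As a hedged illustration, the sketch below checks the basic sandwich $\max_i x_i \le \log \sum_{i=1}^{k} e^{x_i} \le \max_i x_i + \log k$, which applies to a mixture log-density because $\log \sum_i w_i p_i(x)$ is the log-sum-exp of the terms $\log w_i + \log p_i(x)$. This is only the elementary inequality; the paper's piecewise construction and its integration into closed-form bounds are not reproduced, and the function names and sample values are hypothetical.

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def lse_bounds(xs):
    """Elementary bounds: max(xs) <= log_sum_exp(xs) <= max(xs) + log(len(xs))."""
    m = max(xs)
    return m, m + math.log(len(xs))

# Hypothetical values of log(w_i) + log(p_i(x)) for a 3-component mixture at one x.
terms = [-2.3, -0.7, -5.1]
lower, upper = lse_bounds(terms)
value = log_sum_exp(terms)
print(lower, value, upper)
```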
-
Optimal Transport vs. Fisher-Rao distance between Copulas for Clustering Multivariate Time Series
Authors:
Gautier Marti,
Sébastien Andler,
Frank Nielsen,
Philippe Donnat
Abstract:
We present a methodology for clustering N objects which are described by multivariate time series, i.e. several sequences of real-valued random variables. This clustering methodology leverages copulas which are distributions encoding the dependence structure between several random variables. To take fully into account the dependence information while clustering, we need a distance between copulas. In this work, we compare renowned distances between distributions: the Fisher-Rao geodesic distance, related divergences and optimal transport, and discuss their advantages and disadvantages. Applications of such methodology can be found in the clustering of financial assets. A tutorial, experiments and implementation for reproducible research can be found at www.datagrapple.com/Tech.
Submitted 14 November, 2016; v1 submitted 28 April, 2016;
originally announced April 2016.
-
On clustering financial time series: a need for distances between dependent random variables
Authors:
Gautier Marti,
Frank Nielsen,
Philippe Donnat,
Sébastien Andler
Abstract:
The following working document summarizes our work on the clustering of financial time series. It was written for a workshop on information geometry and its application for image and signal processing. This workshop brought several experts in pure and applied mathematics together with applied researchers from medical imaging, radar signal processing and finance. The authors belong to the latter group. This document was written as a long introduction to further development of geometric tools in financial applications such as risk or portfolio analysis. Indeed, risk and portfolio analysis essentially rely on covariance matrices. Besides that the Gaussian assumption is known to be inaccurate, covariance matrices are difficult to estimate from empirical data. To filter noise from the empirical estimate, Mantegna proposed using hierarchical clustering. In this work, we first show that this procedure is statistically consistent. Then, we propose to use clustering with a much broader application than the filtering of empirical covariance matrices from the estimate correlation coefficients. To be able to do that, we need to obtain distances between the financial time series that incorporate all the available information in these cross-dependent random processes.
Submitted 25 March, 2016;
originally announced March 2016.
-
Clustering Financial Time Series: How Long is Enough?
Authors:
Gautier Marti,
Sébastien Andler,
Frank Nielsen,
Philippe Donnat
Abstract:
Researchers have used from 30 days to several years of daily returns as source data for clustering financial time series based on their correlations. This paper sets up a statistical framework to study the validity of such practices. We first show that clustering correlated random variables from their observed values is statistically consistent. Then, we also give a first empirical answer to the much debated question: How long should the time series be? If too short, the clusters found can be spurious; if too long, dynamics can be smoothed out.
Submitted 14 April, 2016; v1 submitted 13 March, 2016;
originally announced March 2016.
-
Loss factorization, weakly supervised learning and label noise robustness
Authors:
Giorgio Patrini,
Frank Nielsen,
Richard Nock,
Marcello Carioni
Abstract:
We prove that the empirical risk of most well-known loss functions factors into a linear term aggregating all labels with a term that is label free, and can further be expressed by sums of the loss. This holds true even for non-smooth, non-convex losses and in any RKHS. The first term is a (kernel) mean operator --the focal quantity of this work-- which we characterize as the sufficient statistic for the labels. The result tightens known generalization bounds and sheds new light on their interpretation.
Factorization has a direct application on weakly supervised learning. In particular, we demonstrate that algorithms like SGD and proximal methods can be adapted with minimal effort to handle weak supervision, once the mean operator has been estimated. We apply this idea to learning with asymmetric noisy labels, connecting and extending prior work. Furthermore, we show that most losses enjoy a data-dependent (by the mean operator) form of noise robustness, in contrast with known negative results.
Submitted 9 February, 2016; v1 submitted 7 February, 2016;
originally announced February 2016.
-
Nonparametric Modeling of Dynamic Functional Connectivity in fMRI Data
Authors:
Søren F. V. Nielsen,
Kristoffer H. Madsen,
Rasmus Røge,
Mikkel N. Schmidt,
Morten Mørup
Abstract:
Dynamic functional connectivity (FC) has in recent years become a topic of interest in the neuroimaging community. Several models and methods exist for both functional magnetic resonance imaging (fMRI) and electroencephalography (EEG), and the results point towards the conclusion that FC exhibits dynamic changes. The existing approaches modeling dynamic connectivity have primarily been based on time-windowing the data and k-means clustering. We propose a non-parametric generative model for dynamic FC in fMRI that does not rely on specifying window lengths and number of dynamic states. Rooted in Bayesian statistical modeling we use the predictive likelihood to investigate if the model can discriminate between a motor task and rest both within and across subjects. We further investigate what drives dynamic states using the model on the entire data collated across subjects and task/rest. We find that the number of states extracted are driven by subject variability and preprocessing differences while the individual states are almost purely defined by either task or rest. This questions how we in general interpret dynamic FC and points to the need for more research on what drives dynamic FC.
Submitted 8 June, 2016; v1 submitted 4 January, 2016;
originally announced January 2016.
-
Optimal Copula Transport for Clustering Multivariate Time Series
Authors:
Gautier Marti,
Frank Nielsen,
Philippe Donnat
Abstract:
This paper presents a new methodology for clustering multivariate time series leveraging optimal transport between copulas. Copulas are used to encode both (i) intra-dependence of a multivariate time series, and (ii) inter-dependence between two time series. Then, optimal copula transport allows us to define two distances between multivariate time series: (i) one for measuring intra-dependence dissimilarity, (ii) another one for measuring inter-dependence dissimilarity based on a new multivariate dependence coefficient which is robust to noise, deterministic, and which can target specified dependencies.
Submitted 11 January, 2016; v1 submitted 27 September, 2015;
originally announced September 2015.
-
Comment partitionner automatiquement des marches aléatoires ? Avec application à la finance quantitative
Authors:
Gautier Marti,
Frank Nielsen,
Philippe Very,
Philippe Donnat
Abstract:
We present in this paper a novel non-parametric approach useful for clustering Markov processes. We introduce a pre-processing step consisting in mapping multivariate independent and identically distributed samples from random variables to a generic non-parametric representation which factorizes dependency and marginal distribution apart without losing any information. An associated metric is defined where the balance between random variables dependency and distribution information is controlled by a single parameter. This mixing parameter can be learned or played with by a practitioner; such use is illustrated on the case of clustering financial time series. Experiments, implementation and results obtained on public financial time series are online on a web portal http://www.datagrapple.com.
Submitted 30 June, 2015;
originally announced June 2015.
-
Further heuristics for $k$-means: The merge-and-split heuristic and the $(k,l)$-means
Authors:
Frank Nielsen,
Richard Nock
Abstract:
Finding the optimal $k$-means clustering is NP-hard in general and many heuristics have been designed for minimizing monotonically the $k$-means objective. We first show how to extend Lloyd's batched relocation heuristic and Hartigan's single-point relocation heuristic to take into account empty-cluster and single-point cluster events, respectively. Those events tend to increasingly occur when $k$ or $d$ increases, or when performing several restarts. First, we show that those special events are a blessing because they allow partially re-seeding some cluster centers while further minimizing the $k$-means objective function. Second, we describe a novel heuristic, merge-and-split $k$-means, that consists in merging two clusters and splitting this merged cluster again with two new centers provided it improves the $k$-means objective. This novel heuristic can improve Hartigan's $k$-means when it has converged to a local minimum. We show empirically that this merge-and-split $k$-means improves over Hartigan's heuristic, which is the de facto method of choice. Finally, we propose the $(k,l)$-means objective that generalizes the $k$-means objective by associating the data points to their $l$ closest cluster centers, and show how to either directly convert or iteratively relax the $(k,l)$-means into a $k$-means in order to reach better local minima.
Submitted 22 June, 2014;
originally announced June 2014.
-
On the symmetrical Kullback-Leibler Jeffreys centroids
Authors:
Frank Nielsen
Abstract:
Due to the success of the bag-of-words modeling paradigm, clustering histograms has become an important ingredient of modern information processing. Clustering histograms can be performed using the celebrated $k$-means centroid-based algorithm. Applications usually require symmetric distances. In this letter, we consider the Jeffreys divergence, which symmetrizes the Kullback-Leibler divergence, and investigate the computation of Jeffreys centroids. We first prove that the Jeffreys centroid can be expressed analytically using the Lambert $W$ function for positive histograms. We then show how to obtain a fast guaranteed approximation when dealing with frequency histograms. Finally, we conclude with some remarks on $k$-means histogram clustering.
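The Lambert $W$ expression for positive histograms can be sketched as follows. The per-bin formula used here, $c_i = a_i / W(e\, a_i / g_i)$ with $a_i$ the weighted arithmetic mean and $g_i$ the weighted geometric mean of bin $i$, is derived from the first-order optimality condition of the Jeffreys cost and is not quoted from the abstract, so treat it as an assumption. The helper names (`lambert_w0`, `jeffreys_divergence`, `jeffreys_centroid`) are hypothetical, and the small Newton solver stands in for a library Lambert $W$ routine.

```python
import math
from typing import List, Optional, Sequence


def lambert_w0(z: float, tol: float = 1e-12, max_iter: int = 100) -> float:
    """Principal branch W(z) for z >= e, by Newton's method on w * exp(w) = z."""
    if z < math.e:
        raise ValueError("this sketch only handles z >= e")
    w = math.log(z)  # starts at or above the root, so Newton decreases monotonically
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w


def jeffreys_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """J(p, q) = sum_i (p_i - q_i) * log(p_i / q_i) for positive histograms."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


def jeffreys_centroid(hists: Sequence[Sequence[float]],
                      weights: Optional[Sequence[float]] = None) -> List[float]:
    """Per-bin minimizer of sum_j w_j * J(c, h_j) for positive histograms."""
    n = len(hists)
    w = list(weights) if weights is not None else [1.0 / n] * n
    centroid = []
    for i in range(len(hists[0])):
        a = sum(wj * h[i] for wj, h in zip(w, hists))                      # arithmetic mean
        g = math.exp(sum(wj * math.log(h[i]) for wj, h in zip(w, hists)))  # geometric mean
        ratio = max(1.0, a / g)  # a >= g by the AM-GM inequality; guard against rounding
        centroid.append(a / lambert_w0(math.e * ratio))
    return centroid


hists = [[1.0, 4.0], [4.0, 1.0]]
c = jeffreys_centroid(hists)
print(c, sum(jeffreys_divergence(c, h) for h in hists) / len(hists))
```

Each centroid coordinate lies between the geometric and arithmetic means of that bin. The result is a positive histogram that is not normalized to sum to one.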
Submitted 22 January, 2014; v1 submitted 28 March, 2013;
originally announced March 2013.
-
Online open neuroimaging mass meta-analysis
Authors:
Finn Årup Nielsen,
Matthew J. Kempton,
Steven C. R. Williams
Abstract:
We describe a system for meta-analysis where a wiki stores numerical data in a simple format and a web service performs the numerical computation.
We initially apply the system to multiple meta-analyses of structural neuroimaging results. The described system allows for mass meta-analysis, e.g., meta-analysis across multiple brain regions and multiple mental disorders.
Submitted 13 June, 2012;
originally announced June 2012.
-
Maximum A Posteriori Covariance Estimation Using a Power Inverse Wishart Prior
Authors:
Søren Feodor Nielsen,
Jon Sporring
Abstract:
The estimation of the covariance matrix is an initial step in many multivariate statistical methods such as principal components analysis and factor analysis, but in many practical applications the dimensionality of the sample space is large compared to the number of samples, and the usual maximum likelihood estimate is poor. Typically, improvements are obtained by modelling or regularization. From a practical point of view, these methods are often computationally heavy and rely on approximations. As a fast substitute, we propose an easily calculable maximum a posteriori (MAP) estimator based on a new class of prior distributions generalizing the inverse Wishart prior, discuss its properties, and demonstrate the estimator on simulated and real data.
Submitted 10 June, 2012;
originally announced June 2012.
-
$k$-MLE: A fast algorithm for learning statistical mixture models
Authors:
Frank Nielsen
Abstract:
We describe $k$-MLE, a fast and efficient local search algorithm for learning finite statistical mixtures of exponential families such as Gaussian mixture models. Mixture models are traditionally learned using the expectation-maximization (EM) soft clustering technique that monotonically increases the incomplete (expected complete) likelihood. Given prescribed mixture weights, the hard clustering $k$-MLE algorithm iteratively assigns data to the most likely weighted component and updates the component models using Maximum Likelihood Estimators (MLEs). Using the duality between exponential families and Bregman divergences, we prove that the local convergence of the complete likelihood of $k$-MLE follows directly from the convergence of a dual additively weighted Bregman hard clustering. The inner loop of $k$-MLE can be implemented using any $k$-means heuristic, such as the celebrated Lloyd's batched or Hartigan's greedy swap updates. We then show how to update the mixture weights by minimizing a cross-entropy criterion, which amounts to setting each weight to the relative proportion of points in its cluster, and we repeat the mixture parameter update and mixture weight update until convergence. Hard EM is interpreted as a special case of $k$-MLE in which both the component update and the weight update are performed successively in the inner loop. To initialize $k$-MLE, we propose $k$-MLE++, a careful initialization of $k$-MLE that probabilistically guarantees a global bound on the best possible complete likelihood.
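The two-level loop described in the abstract above can be sketched for one-dimensional Gaussian components. This is a simplified illustration under assumptions not in the abstract: a Lloyd-style batched inner loop, a variance floor `min_var`, empty components left unchanged, and user-supplied initial parameters rather than $k$-MLE++. The names `log_gauss` and `k_mle_1d` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def log_gauss(x: float, mu: float, var: float) -> float:
    """Log-density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)


def k_mle_1d(data: Sequence[float], mus: Sequence[float], variances: Sequence[float],
             weights: Sequence[float], max_outer: int = 50, max_inner: int = 50,
             min_var: float = 1e-6) -> Tuple[List[float], List[float], List[float], List[int]]:
    """Hard-clustering mixture fit: inner loop at fixed weights, then weight update."""
    mus, variances, weights = list(mus), list(variances), list(weights)
    k = len(mus)

    def score(x: float, j: int) -> float:
        # Weighted log-likelihood of x under component j; empty components never win.
        if weights[j] <= 0.0:
            return -math.inf
        return math.log(weights[j]) + log_gauss(x, mus[j], variances[j])

    labels: List[int] = []
    for _ in range(max_outer):
        for _ in range(max_inner):
            # Assign each point to its most likely weighted component.
            new_labels = [max(range(k), key=lambda j: score(x, j)) for x in data]
            if new_labels == labels:
                break
            labels = new_labels
            # Update each non-empty component with its maximum likelihood estimates.
            for j in range(k):
                members = [x for x, lab in zip(data, labels) if lab == j]
                if not members:
                    continue
                mus[j] = sum(members) / len(members)
                var = sum((x - mus[j]) ** 2 for x in members) / len(members)
                variances[j] = max(min_var, var)
        # Weight update: relative proportion of points in each cluster.
        new_weights = [labels.count(j) / len(data) for j in range(k)]
        if new_weights == weights:
            break
        weights = new_weights
    return mus, variances, weights, labels


data = [0.0, 0.2, -0.2, 10.0, 10.2, 9.8]
print(k_mle_1d(data, mus=[1.0, 9.0], variances=[1.0, 1.0], weights=[0.5, 0.5]))
```

Moving the weight update inside the inner loop would give the hard EM variant that the abstract describes as a special case.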
Submitted 23 March, 2012;
originally announced March 2012.