Search | arXiv e-print repository

A Rate-Distortion View of Uncertainty Quantification

Authors: Ifigeneia Apostolopoulou, Benjamin Eysenbach, Frank Nielsen, Artur Dubrawski

Abstract: In supervised learning, understanding an input's proximity to the training data can help a model decide whether it has sufficient evidence for reaching a reliable prediction. While powerful probabilistic models such as Gaussian Processes naturally have this property, deep neural networks often lack it. In this paper, we introduce Distance Aware Bottleneck (DAB), i.e., a new method for enriching de… ▽ More In supervised learning, understanding an input's proximity to the training data can help a model decide whether it has sufficient evidence for reaching a reliable prediction. While powerful probabilistic models such as Gaussian Processes naturally have this property, deep neural networks often lack it. In this paper, we introduce Distance Aware Bottleneck (DAB), i.e., a new method for enriching deep neural networks with this property. Building on prior information bottleneck approaches, our method learns a codebook that stores a compressed representation of all inputs seen during training. The distance of a new example from this codebook can serve as an uncertainty estimate for the example. The resulting model is simple to train and provides deterministic uncertainty estimates by a single forward pass. Finally, our method achieves better out-of-distribution (OOD) detection and misclassification prediction than prior methods, including expensive ensemble methods, deep kernel Gaussian Processes, and approaches based on the standard information bottleneck. △ Less

Submitted 18 June, 2024; v1 submitted 15 June, 2024; originally announced June 2024.

Journal ref: International Conference on Machine Learning, 2024

arXiv:2401.15225 [pdf, other]

doi 10.1016/j.ress.2020.107318

A bivariate two-state Markov modulated Poisson process for failure modelling

Authors: Yoel G. Yera, Rosa E. Lillo, Bo F. Nielsen, Pepa Ramírez-Cobo, Fabrizio Ruggeri

Abstract: Motivated by a real failure dataset in a two-dimensional context, this paper presents an extension of the Markov modulated Poisson process (MMPP) to two dimensions. The one-dimensional MMPP has been proposed for the modeling of dependent and non-exponential inter-failure times (in contexts as queuing, risk or reliability, among others). The novel two-dimensional MMPP allows for dependence between… ▽ More Motivated by a real failure dataset in a two-dimensional context, this paper presents an extension of the Markov modulated Poisson process (MMPP) to two dimensions. The one-dimensional MMPP has been proposed for the modeling of dependent and non-exponential inter-failure times (in contexts as queuing, risk or reliability, among others). The novel two-dimensional MMPP allows for dependence between the two sequences of inter-failure times, while at the same time preserves the MMPP properties, marginally. The generalization is based on the Marshall-Olkin exponential distribution. Inference is undertaken for the new model through a method combining a matching moments approach with an Approximate Bayesian Computation (ABC) algorithm. The performance of the method is shown on simulated and real datasets representing times and distances covered between consecutive failures in a public transport company. For the real dataset, some quantities of importance associated with the reliability of the system are estimated as the probabilities and expected number of failures at different times and distances covered by trains until the occurrence of a failure. △ Less

Submitted 26 January, 2024; originally announced January 2024.

Journal ref: Reliability Engineering and System Safety 208(2021) 107318

arXiv:2311.13459 [pdf, other]

The Tempered Hilbert Simplex Distance and Its Application To Non-linear Embeddings of TEMs

Authors: Ehsan Amid, Frank Nielsen, Richard Nock, Manfred K. Warmuth

Abstract: Tempered Exponential Measures (TEMs) are a parametric generalization of the exponential family of distributions maximizing the tempered entropy function among positive measures subject to a probability normalization of their power densities. Calculus on TEMs relies on a deformed algebra of arithmetic operators induced by the deformed logarithms used to define the tempered entropy. In this work, we… ▽ More Tempered Exponential Measures (TEMs) are a parametric generalization of the exponential family of distributions maximizing the tempered entropy function among positive measures subject to a probability normalization of their power densities. Calculus on TEMs relies on a deformed algebra of arithmetic operators induced by the deformed logarithms used to define the tempered entropy. In this work, we introduce three different parameterizations of finite discrete TEMs via Legendre functions of the negative tempered entropy function. In particular, we establish an isometry between such parameterizations in terms of a generalization of the Hilbert log cross-ratio simplex distance to a tempered Hilbert co-simplex distance. Similar to the Hilbert geometry, the tempered Hilbert distance is characterized as a $t$-symmetrization of the oriented tempered Funk distance. We motivate our construction by introducing the notion of $t$-lengths of smooth curves in a tautological Finsler manifold. We then demonstrate the properties of our generalized structure in different settings and numerically examine the quality of its differentiable approximations for optimization in machine learning settings. △ Less

Submitted 22 November, 2023; originally announced November 2023.

arXiv:2307.10644 [pdf, other]

Fisher-Rao distance and pullback SPD cone distances between multivariate normal distributions

Authors: Frank Nielsen

Abstract: Data sets of multivariate normal distributions abound in many scientific areas like diffusion tensor imaging, structure tensor computer vision, radar signal processing, machine learning, just to name a few. In order to process those normal data sets for downstream tasks like filtering, classification or clustering, one needs to define proper notions of dissimilarities between normals and paths joi… ▽ More Data sets of multivariate normal distributions abound in many scientific areas like diffusion tensor imaging, structure tensor computer vision, radar signal processing, machine learning, just to name a few. In order to process those normal data sets for downstream tasks like filtering, classification or clustering, one needs to define proper notions of dissimilarities between normals and paths joining them. The Fisher-Rao distance defined as the Riemannian geodesic distance induced by the Fisher information metric is such a principled metric distance which however is not known in closed-form excepts for a few particular cases. In this work, we first report a fast and robust method to approximate arbitrarily finely the Fisher-Rao distance between multivariate normal distributions. Second, we introduce a class of distances based on diffeomorphic embeddings of the normal manifold into a submanifold of the higher-dimensional symmetric positive-definite cone corresponding to the manifold of centered normal distributions. We show that the projective Hilbert distance on the cone yields a metric on the embedded normal submanifold and we pullback that cone distance with its associated straight line Hilbert cone geodesics to obtain a distance and smooth paths between normal distributions. Compared to the Fisher-Rao distance approximation, the pullback Hilbert cone distance is computationally light since it requires to compute only the extreme minimal and maximal eigenvalues of matrices. Finally, we show how to use those distances in clustering tasks. △ Less

Submitted 9 June, 2024; v1 submitted 20 July, 2023; originally announced July 2023.

Comments: 38 pages, 11 figures

Journal ref: 2nd Annual Topology, Algebra, and Geometry in Machine Learning Workshop, ICML TAG-ML, 2023

arXiv:2303.05910 [pdf, ps, other]

Product Jacobi-Theta Boltzmann machines with score matching

Authors: Andrea Pasquale, Daniel Krefl, Stefano Carrazza, Frank Nielsen

Abstract: The estimation of probability density functions is a non trivial task that over the last years has been tackled with machine learning techniques. Successful applications can be obtained using models inspired by the Boltzmann machine (BM) architecture. In this manuscript, the product Jacobi-Theta Boltzmann machine (pJTBM) is introduced as a restricted version of the Riemann-Theta Boltzmann machine… ▽ More The estimation of probability density functions is a non trivial task that over the last years has been tackled with machine learning techniques. Successful applications can be obtained using models inspired by the Boltzmann machine (BM) architecture. In this manuscript, the product Jacobi-Theta Boltzmann machine (pJTBM) is introduced as a restricted version of the Riemann-Theta Boltzmann machine (RTBM) with diagonal hidden sector connection matrix. We show that score matching, based on the Fisher divergence, can be used to fit probability densities with the pJTBM more efficiently than with the original RTBM. △ Less

Submitted 12 January, 2024; v1 submitted 10 March, 2023; originally announced March 2023.

Comments: 7 pages, 3 figures, ACAT22 proceedings

Report number: TIF-UNIMI-2023-8

arXiv:2302.09738 [pdf, other]

Simplifying Momentum-based Positive-definite Submanifold Optimization with Applications to Deep Learning

Authors: Wu Lin, Valentin Duruisseaux, Melvin Leok, Frank Nielsen, Mohammad Emtiyaz Khan, Mark Schmidt

Abstract: Riemannian submanifold optimization with momentum is computationally challenging because, to ensure that the iterates remain on the submanifold, we often need to solve difficult differential equations. Here, we simplify such difficulties for a class of sparse or structured symmetric positive-definite matrices with the affine-invariant metric. We do so by proposing a generalized version of the Riem… ▽ More Riemannian submanifold optimization with momentum is computationally challenging because, to ensure that the iterates remain on the submanifold, we often need to solve difficult differential equations. Here, we simplify such difficulties for a class of sparse or structured symmetric positive-definite matrices with the affine-invariant metric. We do so by proposing a generalized version of the Riemannian normal coordinates that dynamically orthonormalizes the metric and locally converts the problem into an unconstrained problem in the Euclidean space. We use our approach to simplify existing approaches for structured covariances and develop matrix-inverse-free $2^\text{nd}$-order optimizers for deep learning with low precision by using only matrix multiplications. Code: https://github.com/yorkerlin/StructuredNGD-DL △ Less

Submitted 16 March, 2024; v1 submitted 19 February, 2023; originally announced February 2023.

Comments: A long version of the ICML 2023 paper. Updated the main text to emphasize challenges of using existing Riemannian methods to estimate sparse and structured SPD matrices

arXiv:2209.07481 [pdf, other]

doi 10.1007/s41884-023-00129-6

Variational Representations of Annealing Paths: Bregman Information under Monotonic Embedding

Authors: Rob Brekelmans, Frank Nielsen

Abstract: Markov Chain Monte Carlo methods for sampling from complex distributions and estimating normalization constants often simulate samples from a sequence of intermediate distributions along an annealing path, which bridges between a tractable initial distribution and a target density of interest. Prior works have constructed annealing paths using quasi-arithmetic means, and interpreted the resulting… ▽ More Markov Chain Monte Carlo methods for sampling from complex distributions and estimating normalization constants often simulate samples from a sequence of intermediate distributions along an annealing path, which bridges between a tractable initial distribution and a target density of interest. Prior works have constructed annealing paths using quasi-arithmetic means, and interpreted the resulting intermediate densities as minimizing an expected divergence to the endpoints. To analyze these variational representations of annealing paths, we extend known results showing that the arithmetic mean over arguments minimizes the expected Bregman divergence to a single representative point. In particular, we obtain an analogous result for quasi-arithmetic means, when the inputs to the Bregman divergence are transformed under a monotonic embedding function. Our analysis highlights the interplay between quasi-arithmetic means, parametric families, and divergence functionals using the rho-tau representational Bregman divergence framework, and associates common divergence functionals with intermediate densities along an annealing path. △ Less

Submitted 6 February, 2024; v1 submitted 15 September, 2022; originally announced September 2022.

Comments: Published in Information Geometry (Info. Geo. 2024)

arXiv:2206.08598 [pdf, other]

On the Influence of Enforcing Model Identifiability on Learning dynamics of Gaussian Mixture Models

Authors: Pascal Mattia Esser, Frank Nielsen

Abstract: A common way to learn and analyze statistical models is to consider operations in the model parameter space. But what happens if we optimize in the parameter space and there is no one-to-one mapping between the parameter space and the underlying statistical model space? Such cases frequently occur for hierarchical models which include statistical mixtures or stochastic neural networks, and these m… ▽ More A common way to learn and analyze statistical models is to consider operations in the model parameter space. But what happens if we optimize in the parameter space and there is no one-to-one mapping between the parameter space and the underlying statistical model space? Such cases frequently occur for hierarchical models which include statistical mixtures or stochastic neural networks, and these models are said to be singular. Singular models reveal several important and well-studied problems in machine learning like the decrease in convergence speed of learning trajectories due to attractor behaviors. In this work, we propose a relative reparameterization technique of the parameter space, which yields a general method for extracting regular submodels from singular models. Our method enforces model identifiability during training and we study the learning dynamics for gradient descent and expectation maximization for Gaussian Mixture Models (GMMs) under relative parameterization, showing faster experimental convergence and a improved manifold shape of the dynamics around the singularity. Extending the analysis beyond GMMs, we furthermore analyze the Fisher information matrix under relative reparameterization and its influence on the generalization error, and show how the method can be applied to more complex models like deep neural networks. △ Less

Submitted 17 June, 2022; originally announced June 2022.

arXiv:2107.10884 [pdf, other]

Structured second-order methods via natural gradient descent

Authors: Wu Lin, Frank Nielsen, Mohammad Emtiyaz Khan, Mark Schmidt

Abstract: In this paper, we propose new structured second-order methods and structured adaptive-gradient methods obtained by performing natural-gradient descent on structured parameter spaces. Natural-gradient descent is an attractive approach to design new algorithms in many settings such as gradient-free, adaptive-gradient, and second-order methods. Our structured methods not only enjoy a structural invar… ▽ More In this paper, we propose new structured second-order methods and structured adaptive-gradient methods obtained by performing natural-gradient descent on structured parameter spaces. Natural-gradient descent is an attractive approach to design new algorithms in many settings such as gradient-free, adaptive-gradient, and second-order methods. Our structured methods not only enjoy a structural invariance but also admit a simple expression. Finally, we test the efficiency of our proposed methods on both deterministic non-convex problems and deep learning problems. △ Less

Submitted 19 February, 2022; v1 submitted 22 July, 2021; originally announced July 2021.

Comments: Fixed some typos and added a new figure. ICML 2021 workshop paper. A short version of arXiv:2102.07405 with a focus on optimization tasks

arXiv:2107.00745 [pdf, other]

q-Paths: Generalizing the Geometric Annealing Path using Power Means

Authors: Vaden Masrani, Rob Brekelmans, Thang Bui, Frank Nielsen, Aram Galstyan, Greg Ver Steeg, Frank Wood

Abstract: Many common machine learning methods involve the geometric annealing path, a sequence of intermediate densities between two distributions of interest constructed using the geometric average. While alternatives such as the moment-averaging path have demonstrated performance gains in some settings, their practical applicability remains limited by exponential family endpoint assumptions and a lack of… ▽ More Many common machine learning methods involve the geometric annealing path, a sequence of intermediate densities between two distributions of interest constructed using the geometric average. While alternatives such as the moment-averaging path have demonstrated performance gains in some settings, their practical applicability remains limited by exponential family endpoint assumptions and a lack of closed form energy function. In this work, we introduce $q$-paths, a family of paths which is derived from a generalized notion of the mean, includes the geometric and arithmetic mixtures as special cases, and admits a simple closed form involving the deformed logarithm function from nonextensive thermodynamics. Following previous analysis of the geometric path, we interpret our $q$-paths as corresponding to a $q$-exponential family of distributions, and provide a variational representation of intermediate densities as minimizing a mixture of $α$-divergences to the endpoints. We show that small deviations away from the geometric path yield empirical gains for Bayesian inference using Sequential Monte Carlo and generative model evaluation using Annealed Importance Sampling. △ Less

Submitted 1 July, 2021; originally announced July 2021.

Comments: arXiv admin note: text overlap with arXiv:2012.07823

arXiv:2102.07405 [pdf, other]

Tractable structured natural gradient descent using local parameterizations

Authors: Wu Lin, Frank Nielsen, Mohammad Emtiyaz Khan, Mark Schmidt

Abstract: Natural-gradient descent (NGD) on structured parameter spaces (e.g., low-rank covariances) is computationally challenging due to difficult Fisher-matrix computations. We address this issue by using \emph{local-parameter coordinates} to obtain a flexible and efficient NGD method that works well for a wide-variety of structured parameterizations. We show four applications where our method (1) genera… ▽ More Natural-gradient descent (NGD) on structured parameter spaces (e.g., low-rank covariances) is computationally challenging due to difficult Fisher-matrix computations. We address this issue by using \emph{local-parameter coordinates} to obtain a flexible and efficient NGD method that works well for a wide-variety of structured parameterizations. We show four applications where our method (1) generalizes the exponential natural evolutionary strategy, (2) recovers existing Newton-like algorithms, (3) yields new structured second-order algorithms via matrix groups, and (4) gives new algorithms to learn covariances of Gaussian and Wishart-based distributions. We show results on a range of problems from deep learning, variational inference, and evolution strategies. Our work opens a new direction for scalable structured geometric methods. △ Less

Submitted 17 January, 2022; v1 submitted 15 February, 2021; originally announced February 2021.

Comments: An extended version of the ICML 2021 paper. Note: A workshop (short) paper with a focus on optimization tasks can be found at arXiv:2107.10884

arXiv:2012.15480 [pdf, other]

Likelihood Ratio Exponential Families

Authors: Rob Brekelmans, Frank Nielsen, Alireza Makhzani, Aram Galstyan, Greg Ver Steeg

Abstract: The exponential family is well known in machine learning and statistical physics as the maximum entropy distribution subject to a set of observed constraints, while the geometric mixture path is common in MCMC methods such as annealed importance sampling. Linking these two ideas, recent work has interpreted the geometric mixture path as an exponential family of distributions to analyze the thermod… ▽ More The exponential family is well known in machine learning and statistical physics as the maximum entropy distribution subject to a set of observed constraints, while the geometric mixture path is common in MCMC methods such as annealed importance sampling. Linking these two ideas, recent work has interpreted the geometric mixture path as an exponential family of distributions to analyze the thermodynamic variational objective (TVO). We extend these likelihood ratio exponential families to include solutions to rate-distortion (RD) optimization, the information bottleneck (IB) method, and recent rate-distortion-classification approaches which combine RD and IB. This provides a common mathematical framework for understanding these methods via the conjugate duality of exponential families and hypothesis testing. Further, we collect existing results to provide a variational representation of intermediate RD or TVO distributions as a minimizing an expectation of KL divergences. This solution also corresponds to a size-power tradeoff using the likelihood ratio test and the Neyman Pearson lemma. In thermodynamic integration bounds such as the TVO, we identify the intermediate distribution whose expected sufficient statistics match the log partition function. △ Less

Submitted 15 January, 2021; v1 submitted 31 December, 2020; originally announced December 2020.

Comments: NeurIPS Workshop on Deep Learning through Information Geometry

arXiv:2007.10677 [pdf, other]

doi 10.1007/s13571-021-00255-0

Clustering patterns connecting COVID-19 dynamics and Human mobility using optimal transport

Authors: Frank Nielsen, Gautier Marti, Sumanta Ray, Saumyadipta Pyne

Abstract: Social distancing and stay-at-home are among the few measures that are known to be effective in checking the spread of a pandemic such as COVID-19 in a given population. The patterns of dependency between such measures and their effects on disease incidence may vary dynamically and across different populations. We described a new computational framework to measure and compare the temporal relation… ▽ More Social distancing and stay-at-home are among the few measures that are known to be effective in checking the spread of a pandemic such as COVID-19 in a given population. The patterns of dependency between such measures and their effects on disease incidence may vary dynamically and across different populations. We described a new computational framework to measure and compare the temporal relationships between human mobility and new cases of COVID-19 across more than 150 cities of the United States with relatively high incidence of the disease. We used a novel application of Optimal Transport for computing the distance between the normalized patterns induced by bivariate time series for each pair of cities. Thus, we identified 10 clusters of cities with similar temporal dependencies, and computed the Wasserstein barycenter to describe the overall dynamic pattern for each cluster. Finally, we used city-specific socioeconomic covariates to analyze the composition of each cluster. △ Less

Submitted 21 July, 2020; originally announced July 2020.

Comments: 16 pages

Journal ref: Sankhya B (16th March 2021)

arXiv:2002.08345

Schoenberg-Rao distances: Entropy-based and geometry-aware statistical Hilbert distances

Authors: Gaëtan Hadjeres, Frank Nielsen

Abstract: Distances between probability distributions that take into account the geometry of their sample space,like the Wasserstein or the Maximum Mean Discrepancy (MMD) distances have received a lot of attention in machine learning as they can, for instance, be used to compare probability distributions with disjoint supports. In this paper, we study a class of statistical Hilbert distances that we term th… ▽ More Distances between probability distributions that take into account the geometry of their sample space,like the Wasserstein or the Maximum Mean Discrepancy (MMD) distances have received a lot of attention in machine learning as they can, for instance, be used to compare probability distributions with disjoint supports. In this paper, we study a class of statistical Hilbert distances that we term the Schoenberg-Rao distances, a generalization of the MMD that allows one to consider a broader class of kernels, namely the conditionally negative semi-definite kernels. In particular, we introduce a principled way to construct such kernels and derive novel closed-form distances between mixtures of Gaussian distributions. These distances, derived from the concave Rao's quadratic entropy, enjoy nice theoretical properties and possess interpretable hyperparameters which can be tuned for specific applications. Our method constitutes a practical alternative to Wasserstein distances and we illustrate its efficiency on a broad range of machine learning tasks such as density estimation, generative modeling and mixture simplification. △ Less

Submitted 28 April, 2020; v1 submitted 19 February, 2020; originally announced February 2020.

Comments: Most results were already known. The distances described therein do not generalize MMD: it is an MMD with a distance-induced kernel (see [Sejdinovic et al. (2013)]

arXiv:1911.12463 [pdf, other]

Information-Geometric Set Embeddings (IGSE): From Sets to Probability Distributions

Authors: Ke Sun, Frank Nielsen

Abstract: This letter introduces an abstract learning problem called the "set embedding": The objective is to map sets into probability distributions so as to lose less information. We relate set union and intersection operations with corresponding interpolations of probability distributions. We also demonstrate a preliminary solution with experimental results on toy set embedding examples. This letter introduces an abstract learning problem called the "set embedding": The objective is to map sets into probability distributions so as to lose less information. We relate set union and intersection operations with corresponding interpolations of probability distributions. We also demonstrate a preliminary solution with experimental results on toy set embedding examples. △ Less

Submitted 11 December, 2019; v1 submitted 27 November, 2019; originally announced November 2019.

Comments: To be presented at Sets & Partitions (NeurIPS 2019 workshop)

arXiv:1905.11027 [pdf, other]

A Geometric Modeling of Occam's Razor in Deep Learning

Authors: Ke Sun, Frank Nielsen

Abstract: Why do deep neural networks (DNNs) benefit from very high dimensional parameter spaces? Their huge parameter complexities vs stunning performances in practice is all the more intriguing and not explainable using the standard theory of model selection for regular models. In this work, we propose a geometrically flavored information-theoretic approach to study this phenomenon. With the belief that s… ▽ More Why do deep neural networks (DNNs) benefit from very high dimensional parameter spaces? Their huge parameter complexities vs stunning performances in practice is all the more intriguing and not explainable using the standard theory of model selection for regular models. In this work, we propose a geometrically flavored information-theoretic approach to study this phenomenon. With the belief that simplicity is linked to better generalization, as grounded in the theory of minimum description length, the objective of our analysis is to examine and bound the complexity of DNNs. We introduce the locally varying dimensionality of the parameter space of neural network models by considering the number of significant dimensions of the Fisher information matrix, and model the parameter space as a manifold using the framework of singular semi-Riemannian geometry. We derive model complexity measures which yield short description lengths for deep neural network models based on their singularity analysis thus explaining the good performance of DNNs despite their large number of parameters. △ Less

Submitted 26 March, 2025; v1 submitted 27 May, 2019; originally announced May 2019.

Comments: This work first appeared under the former title "Lightlike Neuromanifolds, Occam's Razor and Deep Learning"

arXiv:1901.03732 [pdf, ps, other]

The statistical Minkowski distances: Closed-form formula for Gaussian Mixture Models

Authors: Frank Nielsen

Abstract: The traditional Minkowski distances are induced by the corresponding Minkowski norms in real-valued vector spaces. In this work, we propose novel statistical symmetric distances based on the Minkowski's inequality for probability densities belonging to Lebesgue spaces. These statistical Minkowski distances admit closed-form formula for Gaussian mixture models when parameterized by integer exponent… ▽ More The traditional Minkowski distances are induced by the corresponding Minkowski norms in real-valued vector spaces. In this work, we propose novel statistical symmetric distances based on the Minkowski's inequality for probability densities belonging to Lebesgue spaces. These statistical Minkowski distances admit closed-form formula for Gaussian mixture models when parameterized by integer exponents. This result extends to arbitrary mixtures of exponential families with natural parameter spaces being cones: This includes the binomial, the multinomial, the zero-centered Laplacian, the Gaussian and the Wishart mixtures, among others. We also derive a Minkowski's diversity index of a normalized weighted set of probability distributions from Minkowski's inequality. △ Less

Submitted 17 January, 2019; v1 submitted 9 January, 2019; originally announced January 2019.

Comments: 14 pages

arXiv:1901.03634 [pdf, other]

Variation Network: Learning High-level Attributes for Controlled Input Manipulation

Authors: Gaëtan Hadjeres, Frank Nielsen

Abstract: This paper presents the Variation Network (VarNet), a generative model providing means to manipulate the high-level attributes of a given input. The originality of our approach is that VarNet is not only capable of handling pre-defined attributes but can also learn the relevant attributes of the dataset by itself. These two settings can also be easily considered at the same time, which makes this… ▽ More This paper presents the Variation Network (VarNet), a generative model providing means to manipulate the high-level attributes of a given input. The originality of our approach is that VarNet is not only capable of handling pre-defined attributes but can also learn the relevant attributes of the dataset by itself. These two settings can also be easily considered at the same time, which makes this model applicable to a wide variety of tasks. Further, VarNet has a sound information-theoretic interpretation which grants us with interpretable means to control how these high-level attributes are learned. We demonstrate experimentally that this model is capable of performing interesting input manipulation and that the learned attributes are relevant and meaningful. △ Less

Submitted 16 September, 2019; v1 submitted 11 January, 2019; originally announced January 2019.

Comments: 15 pages, 7 figures

arXiv:1812.08113 [pdf, other]

On The Chain Rule Optimal Transport Distance

Authors: Frank Nielsen, Ke Sun

Abstract: We define a novel class of distances between statistical multivariate distributions by modeling an optimal transport problem on their marginals with respect to a ground distance defined on their conditionals. These new distances are metrics whenever the ground distance between the marginals is a metric, generalize both the Wasserstein distances between discrete measures and a recently introduced m… ▽ More We define a novel class of distances between statistical multivariate distributions by modeling an optimal transport problem on their marginals with respect to a ground distance defined on their conditionals. These new distances are metrics whenever the ground distance between the marginals is a metric, generalize both the Wasserstein distances between discrete measures and a recently introduced metric distance between statistical mixtures, and provide an upper bound for jointly convex distances between statistical mixtures. By entropic regularization of the optimal transport, we obtain a fast differentiable Sinkhorn-type distance. We experimentally evaluate our new family of distances by quantifying the upper bounds of several jointly convex distances between statistical mixtures, and by proposing a novel efficient method to learn Gaussian mixture models (GMMs) by simplifying kernel density estimators with respect to our distance. Our GMM learning technique experimentally improves significantly over the EM implementation of {\tt sklearn} on the {\tt MNIST} and {\tt Fashion MNIST} datasets. △ Less

Submitted 2 November, 2020; v1 submitted 19 December, 2018; originally announced December 2018.

Comments: 23 pages, 6 figures

arXiv:1810.10770 [pdf, other]

Geometry and clustering with metrics derived from separable Bregman divergences

Authors: Erika Gomes-Gonçalves, Henryk Gzyl, Frank Nielsen

Abstract: Separable Bregman divergences induce Riemannian metric spaces that are isometric to the Euclidean space after monotone embeddings. We investigate fixed rate quantization and its codebook Voronoi diagrams, and report on experimental performances of partition-based, hierarchical, and soft clustering algorithms with respect to these Riemann-Bregman distances. Separable Bregman divergences induce Riemannian metric spaces that are isometric to the Euclidean space after monotone embeddings. We investigate fixed rate quantization and its codebook Voronoi diagrams, and report on experimental performances of partition-based, hierarchical, and soft clustering algorithms with respect to these Riemann-Bregman distances. △ Less

Submitted 25 October, 2018; originally announced October 2018.

Comments: 23 pages

arXiv:1810.09113 [pdf, other]

doi 10.1007/978-3-030-26980-7_31

The Bregman chord divergence

Authors: Frank Nielsen, Richard Nock

Abstract: Distances are fundamental primitives whose choice significantly impacts the performances of algorithms in machine learning and signal processing. However selecting the most appropriate distance for a given task is an endeavor. Instead of testing one by one the entries of an ever-expanding dictionary of {\em ad hoc} distances, one rather prefers to consider parametric classes of distances that are… ▽ More Distances are fundamental primitives whose choice significantly impacts the performances of algorithms in machine learning and signal processing. However selecting the most appropriate distance for a given task is an endeavor. Instead of testing one by one the entries of an ever-expanding dictionary of {\em ad hoc} distances, one rather prefers to consider parametric classes of distances that are exhaustively characterized by axioms derived from first principles. Bregman divergences are such a class. However fine-tuning a Bregman divergence is delicate since it requires to smoothly adjust a functional generator. In this work, we propose an extension of Bregman divergences called the Bregman chord divergences. This new class of distances does not require gradient calculations, uses two scalar parameters that can be easily tailored in applications, and generalizes asymptotically Bregman divergences. △ Less

Submitted 22 October, 2018; originally announced October 2018.

Comments: 10 pages

Journal ref: GSI 2019: Geometric Science of Information pp 299-308

arXiv:1810.01118 [pdf, other]

Sinkhorn AutoEncoders

Authors: Giorgio Patrini, Rianne van den Berg, Patrick Forré, Marcello Carioni, Samarth Bhargav, Max Welling, Tim Genewein, Frank Nielsen

Abstract: Optimal transport offers an alternative to maximum likelihood for learning generative autoencoding models. We show that minimizing the p-Wasserstein distance between the generator and the true data distribution is equivalent to the unconstrained min-min optimization of the p-Wasserstein distance between the encoder aggregated posterior and the prior in latent space, plus a reconstruction error. We… ▽ More Optimal transport offers an alternative to maximum likelihood for learning generative autoencoding models. We show that minimizing the p-Wasserstein distance between the generator and the true data distribution is equivalent to the unconstrained min-min optimization of the p-Wasserstein distance between the encoder aggregated posterior and the prior in latent space, plus a reconstruction error. We also identify the role of its trade-off hyperparameter as the capacity of the generator: its Lipschitz constant. Moreover, we prove that optimizing the encoder over any class of universal approximators, such as deterministic neural networks, is enough to come arbitrarily close to the optimum. We therefore advertise this framework, which holds for any metric space and prior, as a sweet-spot of current generative autoencoding objectives. We then introduce the Sinkhorn auto-encoder (SAE), which approximates and minimizes the p-Wasserstein distance in latent space via backprogation through the Sinkhorn algorithm. SAE directly works on samples, i.e. it models the aggregated posterior as an implicit distribution, with no need for a reparameterization trick for gradients estimations. SAE is thus able to work with different metric spaces and priors with minimal adaptations. We demonstrate the flexibility of SAE on latent spaces with different geometries and priors and compare with other methods on benchmark data sets. △ Less

Submitted 15 July, 2019; v1 submitted 2 October, 2018; originally announced October 2018.

Comments: Accepted for oral presentation at UAI19

arXiv:1808.08271 [pdf, other]

doi 10.3390/e22101100

An elementary introduction to information geometry

Authors: Frank Nielsen

Abstract: In this survey, we describe the fundamental differential-geometric structures of information manifolds, state the fundamental theorem of information geometry, and illustrate some use cases of these information manifolds in information sciences. The exposition is self-contained by concisely introducing the necessary concepts of differential geometry, but proofs are omitted for brevity. In this survey, we describe the fundamental differential-geometric structures of information manifolds, state the fundamental theorem of information geometry, and illustrate some use cases of these information manifolds in information sciences. The exposition is self-contained by concisely introducing the necessary concepts of differential geometry, but proofs are omitted for brevity. △ Less

Submitted 6 September, 2020; v1 submitted 16 August, 2018; originally announced August 2018.

Comments: 56 pages, 16 figures

Journal ref: Entropy 2020, 22(10), 1100

arXiv:1806.11311 [pdf, other]

Guaranteed Deterministic Bounds on the Total Variation Distance between Univariate Mixtures

Authors: Frank Nielsen, Ke Sun

Abstract: The total variation distance is a core statistical distance between probability measures that satisfies the metric axioms, with value always falling in $[0,1]$. This distance plays a fundamental role in machine learning and signal processing: It is a member of the broader class of $f$-divergences, and it is related to the probability of error in Bayesian hypothesis testing. Since the total variati… ▽ More The total variation distance is a core statistical distance between probability measures that satisfies the metric axioms, with value always falling in $[0,1]$. This distance plays a fundamental role in machine learning and signal processing: It is a member of the broader class of $f$-divergences, and it is related to the probability of error in Bayesian hypothesis testing. Since the total variation distance does not admit closed-form expressions for statistical mixtures (like Gaussian mixture models), one often has to rely in practice on costly numerical integrations or on fast Monte Carlo approximations that however do not guarantee deterministic lower and upper bounds. In this work, we consider two methods for bounding the total variation of univariate mixture models: The first method is based on the information monotonicity property of the total variation to design guaranteed nested deterministic lower bounds. The second method relies on computing the geometric lower and upper envelopes of weighted mixture components to derive deterministic bounds based on density ratio. We demonstrate the tightness of our bounds in a series of experiments on Gaussian, Gamma and Rayleigh mixture models. △ Less

Submitted 29 June, 2018; originally announced June 2018.

Comments: 11 pages, 2 figures

arXiv:1806.08195 [pdf, other]

Probabilistic PARAFAC2

Authors: Philip J. H. Jørgensen, Søren F. V. Nielsen, Jesper L. Hinrich, Mikkel N. Schmidt, Kristoffer H. Madsen, Morten Mørup

Abstract: The PARAFAC2 is a multimodal factor analysis model suitable for analyzing multi-way data when one of the modes has incomparable observation units, for example because of differences in signal sampling or batch sizes. A fully probabilistic treatment of the PARAFAC2 is desirable in order to improve robustness to noise and provide a well founded principle for determining the number of factors, but ch… ▽ More The PARAFAC2 is a multimodal factor analysis model suitable for analyzing multi-way data when one of the modes has incomparable observation units, for example because of differences in signal sampling or batch sizes. A fully probabilistic treatment of the PARAFAC2 is desirable in order to improve robustness to noise and provide a well founded principle for determining the number of factors, but challenging because the factor loadings are constrained to be orthogonal. We develop two probabilistic formulations of the PARAFAC2 along with variational procedures for inference: In the one approach, the mean values of the factor loadings are orthogonal leading to closed form variational updates, and in the other, the factor loadings themselves are orthogonal using a matrix Von Mises-Fisher distribution. We contrast our probabilistic formulation to the conventional direct fitting algorithm based on maximum likelihood. On simulated data and real fluorescence spectroscopy and gas chromatography-mass spectrometry data, we compare our approach to the conventional PARAFAC2 model estimation and find that the probabilistic formulation is more robust to noise and model order misspecification. The probabilistic PARAFAC2 thus forms a promising framework for modeling multi-way data accounting for uncertainty. △ Less

Submitted 21 June, 2018; originally announced June 2018.

Comments: 16 pages (incl. 4 pages of supplemental material), 5 figures

arXiv:1806.00149 [pdf, other]

doi 10.1109/TNNLS.2020.3005167

q-Neurons: Neuron Activations based on Stochastic Jackson's Derivative Operators

Authors: Frank Nielsen, Ke Sun

Abstract: We propose a new generic type of stochastic neurons, called $q$-neurons, that considers activation functions based on Jackson's $q$-derivatives with stochastic parameters $q$. Our generalization of neural network architectures with $q$-neurons is shown to be both scalable and very easy to implement. We demonstrate experimentally consistently improved performances over state-of-the-art standard act… ▽ More We propose a new generic type of stochastic neurons, called $q$-neurons, that considers activation functions based on Jackson's $q$-derivatives with stochastic parameters $q$. Our generalization of neural network architectures with $q$-neurons is shown to be both scalable and very easy to implement. We demonstrate experimentally consistently improved performances over state-of-the-art standard activation functions, both on training and testing loss functions. △ Less

Submitted 13 June, 2018; v1 submitted 31 May, 2018; originally announced June 2018.

Comments: 12 pages, 5 figures, 1 table

Journal ref: IEEE Transactions on Neural Networks and Learning Systems (2020)

arXiv:1803.07225 [pdf, other]

Monte Carlo Information Geometry: The dually flat case

Authors: Frank Nielsen, Gaëtan Hadjeres

Abstract: Exponential families and mixture families are parametric probability models that can be geometrically studied as smooth statistical manifolds with respect to any statistical divergence like the Kullback-Leibler (KL) divergence or the Hellinger divergence. When equipping a statistical manifold with the KL divergence, the induced manifold structure is dually flat, and the KL divergence between distr… ▽ More Exponential families and mixture families are parametric probability models that can be geometrically studied as smooth statistical manifolds with respect to any statistical divergence like the Kullback-Leibler (KL) divergence or the Hellinger divergence. When equipping a statistical manifold with the KL divergence, the induced manifold structure is dually flat, and the KL divergence between distributions amounts to an equivalent Bregman divergence on their corresponding parameters. In practice, the corresponding Bregman generators of mixture/exponential families require to perform definite integral calculus that can either be too time-consuming (for exponentially large discrete support case) or even do not admit closed-form formula (for continuous support case). In these cases, the dually flat construction remains theoretical and cannot be used by information-geometric algorithms. To bypass this problem, we consider performing stochastic Monte Carlo (MC) estimation of those integral-based mixture/exponential family Bregman generators. We show that, under natural assumptions, these MC generators are almost surely Bregman generators. We define a series of dually flat information geometries, termed Monte Carlo Information Geometries, that increasingly-finely approximate the untractable geometry. The advantage of this MCIG is that it allows a practical use of the Bregman algorithmic toolbox on a wide range of probability distribution families. We demonstrate our approach with a clustering task on a mixture family manifold. △ Less

Submitted 19 March, 2018; originally announced March 2018.

Comments: 25 pages

arXiv:1710.04099 [pdf, other]

Wembedder: Wikidata entity embedding web service

Authors: Finn Årup Nielsen

Abstract: I present a web service for querying an embedding of entities in the Wikidata knowledge graph. The embedding is trained on the Wikidata dump using Gensim's Word2Vec implementation and a simple graph walk. A REST API is implemented. Together with the Wikidata API the web service exposes a multilingual resource for over 600'000 Wikidata items and properties. I present a web service for querying an embedding of entities in the Wikidata knowledge graph. The embedding is trained on the Wikidata dump using Gensim's Word2Vec implementation and a simple graph walk. A REST API is implemented. Together with the Wikidata API the web service exposes a multilingual resource for over 600'000 Wikidata items and properties. △ Less

Submitted 11 October, 2017; originally announced October 2017.

Comments: 3 pages, 2 figures

ACM Class: I.2.4; H.3.5

arXiv:1709.06404 [pdf, other]

Interactive Music Generation with Positional Constraints using Anticipation-RNNs

Authors: Gaëtan Hadjeres, Frank Nielsen

Abstract: Recurrent Neural Networks (RNNS) are now widely used on sequence generation tasks due to their ability to learn long-range dependencies and to generate sequences of arbitrary length. However, their left-to-right generation procedure only allows a limited control from a potential user which makes them unsuitable for interactive and creative usages such as interactive music generation. This paper in… ▽ More Recurrent Neural Networks (RNNS) are now widely used on sequence generation tasks due to their ability to learn long-range dependencies and to generate sequences of arbitrary length. However, their left-to-right generation procedure only allows a limited control from a potential user which makes them unsuitable for interactive and creative usages such as interactive music generation. This paper introduces a novel architecture called Anticipation-RNN which possesses the assets of the RNN-based generative models while allowing to enforce user-defined positional constraints. We demonstrate its efficiency on the task of generating melodies satisfying positional constraints in the style of the soprano parts of the J.S. Bach chorale harmonizations. Sampling using the Anticipation-RNN is of the same order of complexity than sampling from the traditional RNN model. This fast and interactive generation of musical sequences opens ways to devise real-time systems that could be used for creative purposes. △ Less

Submitted 19 September, 2017; originally announced September 2017.

Comments: 9 pages, 7 figures

arXiv:1707.04588 [pdf, other]

GLSR-VAE: Geodesic Latent Space Regularization for Variational AutoEncoder Architectures

Authors: Gaëtan Hadjeres, Frank Nielsen, François Pachet

Abstract: VAEs (Variational AutoEncoders) have proved to be powerful in the context of density modeling and have been used in a variety of contexts for creative purposes. In many settings, the data we model possesses continuous attributes that we would like to take into account at generation time. We propose in this paper GLSR-VAE, a Geodesic Latent Space Regularization for the Variational AutoEncoder archi… ▽ More VAEs (Variational AutoEncoders) have proved to be powerful in the context of density modeling and have been used in a variety of contexts for creative purposes. In many settings, the data we model possesses continuous attributes that we would like to take into account at generation time. We propose in this paper GLSR-VAE, a Geodesic Latent Space Regularization for the Variational AutoEncoder architecture and its generalizations which allows a fine control on the embedding of the data into the latent space. When augmenting the VAE loss with this regularization, changes in the learned latent space reflects changes of the attributes of the data. This deeper understanding of the VAE latent space structure offers the possibility to modulate the attributes of the generated data in a continuous way. We demonstrate its efficiency on a monophonic music generation task where we manage to generate variations of discrete sequences in an intended and playful way. △ Less

Submitted 14 July, 2017; originally announced July 2017.

Comments: 11 pages

arXiv:1612.04555 [pdf, ps, other]

Scalable Group Level Probabilistic Sparse Factor Analysis

Authors: Jesper L. Hinrich, Søren F. V. Nielsen, Nicolai A. B. Riis, Casper T. Eriksen, Jacob Frøsig, Marco D. F. Kristensen, Mikkel N. Schmidt, Kristoffer H. Madsen, Morten Mørup

Abstract: Many data-driven approaches exist to extract neural representations of functional magnetic resonance imaging (fMRI) data, but most of them lack a proper probabilistic formulation. We propose a group level scalable probabilistic sparse factor analysis (psFA) allowing spatially sparse maps, component pruning using automatic relevance determination (ARD) and subject specific heteroscedastic spatial n… ▽ More Many data-driven approaches exist to extract neural representations of functional magnetic resonance imaging (fMRI) data, but most of them lack a proper probabilistic formulation. We propose a group level scalable probabilistic sparse factor analysis (psFA) allowing spatially sparse maps, component pruning using automatic relevance determination (ARD) and subject specific heteroscedastic spatial noise modeling. For task-based and resting state fMRI, we show that the sparsity constraint gives rise to components similar to those obtained by group independent component analysis. The noise modeling shows that noise is reduced in areas typically associated with activation by the experimental design. The psFA model identifies sparse components and the probabilistic setting provides a natural way to handle parameter uncertainties. The variational Bayesian framework easily extends to more complex noise models than the presently considered. △ Less

Submitted 14 December, 2016; originally announced December 2016.

Comments: 10 pages plus 5 pages appendix, Submitted to ICASSP 17

arXiv:1610.09659 [pdf, other]

Exploring and measuring non-linear correlations: Copulas, Lightspeed Transportation and Clustering

Authors: Gautier Marti, Sebastien Andler, Frank Nielsen, Philippe Donnat

Abstract: We propose a methodology to explore and measure the pairwise correlations that exist between variables in a dataset. The methodology leverages copulas for encoding dependence between two variables, state-of-the-art optimal transport for providing a relevant geometry to the copulas, and clustering for summarizing the main dependence patterns found between the variables. Some of the clusters centers… ▽ More We propose a methodology to explore and measure the pairwise correlations that exist between variables in a dataset. The methodology leverages copulas for encoding dependence between two variables, state-of-the-art optimal transport for providing a relevant geometry to the copulas, and clustering for summarizing the main dependence patterns found between the variables. Some of the clusters centers can be used to parameterize a novel dependence coefficient which can target or forget specific dependence patterns. Finally, we illustrate and benchmark the methodology on several datasets. Code and numerical experiments are available online for reproducible research. △ Less

Submitted 30 October, 2016; originally announced October 2016.

arXiv:1606.05850 [pdf, other]

doi 10.3390/e18120442

Guaranteed bounds on the Kullback-Leibler divergence of univariate mixtures using piecewise log-sum-exp inequalities

Authors: Frank Nielsen, Ke Sun

Abstract: Information-theoretic measures such as the entropy, cross-entropy and the Kullback-Leibler divergence between two mixture models is a core primitive in many signal processing tasks. Since the Kullback-Leibler divergence of mixtures provably does not admit a closed-form formula, it is in practice either estimated using costly Monte-Carlo stochastic integration, approximated, or bounded using variou… ▽ More Information-theoretic measures such as the entropy, cross-entropy and the Kullback-Leibler divergence between two mixture models is a core primitive in many signal processing tasks. Since the Kullback-Leibler divergence of mixtures provably does not admit a closed-form formula, it is in practice either estimated using costly Monte-Carlo stochastic integration, approximated, or bounded using various techniques. We present a fast and generic method that builds algorithmically closed-form lower and upper bounds on the entropy, the cross-entropy and the Kullback-Leibler divergence of mixtures. We illustrate the versatile method by reporting on our experiments for approximating the Kullback-Leibler divergence between univariate exponential mixtures, Gaussian mixtures, Rayleigh mixtures, and Gamma mixtures. △ Less

Submitted 16 August, 2016; v1 submitted 19 June, 2016; originally announced June 2016.

Comments: 20 pages, 3 figures

arXiv:1604.08634 [pdf, other]

doi 10.1109/SSP.2016.7551770

Optimal Transport vs. Fisher-Rao distance between Copulas for Clustering Multivariate Time Series

Authors: Gautier Marti, Sébastien Andler, Frank Nielsen, Philippe Donnat

Abstract: We present a methodology for clustering N objects which are described by multivariate time series, i.e. several sequences of real-valued random variables. This clustering methodology leverages copulas which are distributions encoding the dependence structure between several random variables. To take fully into account the dependence information while clustering, we need a distance between copulas.… ▽ More We present a methodology for clustering N objects which are described by multivariate time series, i.e. several sequences of real-valued random variables. This clustering methodology leverages copulas which are distributions encoding the dependence structure between several random variables. To take fully into account the dependence information while clustering, we need a distance between copulas. In this work, we compare renowned distances between distributions: the Fisher-Rao geodesic distance, related divergences and optimal transport, and discuss their advantages and disadvantages. Applications of such methodology can be found in the clustering of financial assets. A tutorial, experiments and implementation for reproducible research can be found at www.datagrapple.com/Tech. △ Less

Submitted 14 November, 2016; v1 submitted 28 April, 2016; originally announced April 2016.

Comments: Accepted at IEEE Workshop on Statistical Signal Processing (SSP 2016)

arXiv:1603.07822 [pdf, other]

On clustering financial time series: a need for distances between dependent random variables

Authors: Gautier Marti, Frank Nielsen, Philippe Donnat, Sébastien Andler

Abstract: The following working document summarizes our work on the clustering of financial time series. It was written for a workshop on information geometry and its application for image and signal processing. This workshop brought several experts in pure and applied mathematics together with applied researchers from medical imaging, radar signal processing and finance. The authors belong to the latter gr… ▽ More The following working document summarizes our work on the clustering of financial time series. It was written for a workshop on information geometry and its application for image and signal processing. This workshop brought several experts in pure and applied mathematics together with applied researchers from medical imaging, radar signal processing and finance. The authors belong to the latter group. This document was written as a long introduction to further development of geometric tools in financial applications such as risk or portfolio analysis. Indeed, risk and portfolio analysis essentially rely on covariance matrices. Besides that the Gaussian assumption is known to be inaccurate, covariance matrices are difficult to estimate from empirical data. To filter noise from the empirical estimate, Mantegna proposed using hierarchical clustering. In this work, we first show that this procedure is statistically consistent. Then, we propose to use clustering with a much broader application than the filtering of empirical covariance matrices from the estimate correlation coefficients. To be able to do that, we need to obtain distances between the financial time series that incorporate all the available information in these cross-dependent random processes. △ Less

Submitted 25 March, 2016; originally announced March 2016.

Comments: Work presented during a workshop on Information Geometry at the International Centre for Mathematical Sciences, Edinburgh, UK

arXiv:1603.04017 [pdf, other]

Clustering Financial Time Series: How Long is Enough?

Authors: Gautier Marti, Sébastien Andler, Frank Nielsen, Philippe Donnat

Abstract: Researchers have used from 30 days to several years of daily returns as source data for clustering financial time series based on their correlations. This paper sets up a statistical framework to study the validity of such practices. We first show that clustering correlated random variables from their observed values is statistically consistent. Then, we also give a first empirical answer to the m… ▽ More Researchers have used from 30 days to several years of daily returns as source data for clustering financial time series based on their correlations. This paper sets up a statistical framework to study the validity of such practices. We first show that clustering correlated random variables from their observed values is statistically consistent. Then, we also give a first empirical answer to the much debated question: How long should the time series be? If too short, the clusters found can be spurious; if too long, dynamics can be smoothed out. △ Less

Submitted 14 April, 2016; v1 submitted 13 March, 2016; originally announced March 2016.

Comments: Accepted at IJCAI 2016

arXiv:1602.02450 [pdf, ps, other]

Loss factorization, weakly supervised learning and label noise robustness

Authors: Giorgio Patrini, Frank Nielsen, Richard Nock, Marcello Carioni

Abstract: We prove that the empirical risk of most well-known loss functions factors into a linear term aggregating all labels with a term that is label free, and can further be expressed by sums of the loss. This holds true even for non-smooth, non-convex losses and in any RKHS. The first term is a (kernel) mean operator --the focal quantity of this work-- which we characterize as the sufficient statistic… ▽ More We prove that the empirical risk of most well-known loss functions factors into a linear term aggregating all labels with a term that is label free, and can further be expressed by sums of the loss. This holds true even for non-smooth, non-convex losses and in any RKHS. The first term is a (kernel) mean operator --the focal quantity of this work-- which we characterize as the sufficient statistic for the labels. The result tightens known generalization bounds and sheds new light on their interpretation. Factorization has a direct application on weakly supervised learning. In particular, we demonstrate that algorithms like SGD and proximal methods can be adapted with minimal effort to handle weak supervision, once the mean operator has been estimated. We apply this idea to learning with asymmetric noisy labels, connecting and extending prior work. Furthermore, we show that most losses enjoy a data-dependent (by the mean operator) form of noise robustness, in contrast with known negative results. △ Less

Submitted 9 February, 2016; v1 submitted 7 February, 2016; originally announced February 2016.

arXiv:1601.00496 [pdf, other]

Nonparametric Modeling of Dynamic Functional Connectivity in fMRI Data

Authors: Søren F. V. Nielsen, Kristoffer H. Madsen, Rasmus Røge, Mikkel N. Schmidt, Morten Mørup

Abstract: Dynamic functional connectivity (FC) has in recent years become a topic of interest in the neuroimaging community. Several models and methods exist for both functional magnetic resonance imaging (fMRI) and electroencephalography (EEG), and the results point towards the conclusion that FC exhibits dynamic changes. The existing approaches modeling dynamic connectivity have primarily been based on ti… ▽ More Dynamic functional connectivity (FC) has in recent years become a topic of interest in the neuroimaging community. Several models and methods exist for both functional magnetic resonance imaging (fMRI) and electroencephalography (EEG), and the results point towards the conclusion that FC exhibits dynamic changes. The existing approaches modeling dynamic connectivity have primarily been based on time-windowing the data and k-means clustering. We propose a non-parametric generative model for dynamic FC in fMRI that does not rely on specifying window lengths and number of dynamic states. Rooted in Bayesian statistical modeling we use the predictive likelihood to investigate if the model can discriminate between a motor task and rest both within and across subjects. We further investigate what drives dynamic states using the model on the entire data collated across subjects and task/rest. We find that the number of states extracted are driven by subject variability and preprocessing differences while the individual states are almost purely defined by either task or rest. This questions how we in general interpret dynamic FC and points to the need for more research on what drives dynamic FC. △ Less

Submitted 8 June, 2016; v1 submitted 4 January, 2016; originally announced January 2016.

Comments: 8 pages, 1 figure. Presented at the Machine Learning and Interpretation in Neuroimaging Workshop (MLINI-2015), 2015 (arXiv:1605.04435)

Report number: MLINI/2015/08

arXiv:1509.08144 [pdf, other]

Optimal Copula Transport for Clustering Multivariate Time Series

Authors: Gautier Marti, Frank Nielsen, Philippe Donnat

Abstract: This paper presents a new methodology for clustering multivariate time series leveraging optimal transport between copulas. Copulas are used to encode both (i) intra-dependence of a multivariate time series, and (ii) inter-dependence between two time series. Then, optimal copula transport allows us to define two distances between multivariate time series: (i) one for measuring intra-dependence dis… ▽ More This paper presents a new methodology for clustering multivariate time series leveraging optimal transport between copulas. Copulas are used to encode both (i) intra-dependence of a multivariate time series, and (ii) inter-dependence between two time series. Then, optimal copula transport allows us to define two distances between multivariate time series: (i) one for measuring intra-dependence dissimilarity, (ii) another one for measuring inter-dependence dissimilarity based on a new multivariate dependence coefficient which is robust to noise, deterministic, and which can target specified dependencies. △ Less

Submitted 11 January, 2016; v1 submitted 27 September, 2015; originally announced September 2015.

Comments: Accepted at ICASSP 2016

arXiv:1506.09163 [pdf, other]

Comment partitionner automatiquement des marches aléatoires ? Avec application à la finance quantitative

Authors: Gautier Marti, Frank Nielsen, Philippe Very, Philippe Donnat

Abstract: We present in this paper a novel non-parametric approach useful for clustering Markov processes. We introduce a pre-processing step consisting in mapping multivariate independent and identically distributed samples from random variables to a generic non-parametric representation which factorizes dependency and marginal distribution apart without losing any. An associated metric is defined where th… ▽ More We present in this paper a novel non-parametric approach useful for clustering Markov processes. We introduce a pre-processing step consisting in mapping multivariate independent and identically distributed samples from random variables to a generic non-parametric representation which factorizes dependency and marginal distribution apart without losing any. An associated metric is defined where the balance between random variables dependency and distribution information is controlled by a single parameter. This mixing parameter can be learned or played with by a practitioner, such use is illustrated on the case of clustering financial time series. Experiments, implementation and results obtained on public financial time series are online on a web portal \url{http://www.datagrapple.com}. △ Less

Submitted 30 June, 2015; originally announced June 2015.

Comments: in French

arXiv:1406.6314 [pdf, other]

Further heuristics for $k$-means: The merge-and-split heuristic and the $(k,l)$-means

Authors: Frank Nielsen, Richard Nock

Abstract: Finding the optimal $k$-means clustering is NP-hard in general and many heuristics have been designed for minimizing monotonically the $k$-means objective. We first show how to extend Lloyd's batched relocation heuristic and Hartigan's single-point relocation heuristic to take into account empty-cluster and single-point cluster events, respectively. Those events tend to increasingly occur when… ▽ More Finding the optimal $k$-means clustering is NP-hard in general and many heuristics have been designed for minimizing monotonically the $k$-means objective. We first show how to extend Lloyd's batched relocation heuristic and Hartigan's single-point relocation heuristic to take into account empty-cluster and single-point cluster events, respectively. Those events tend to increasingly occur when $k$ or $d$ increases, or when performing several restarts. First, we show that those special events are a blessing because they allow to partially re-seed some cluster centers while further minimizing the $k$-means objective function. Second, we describe a novel heuristic, merge-and-split $k$-means, that consists in merging two clusters and splitting this merged cluster again with two new centers provided it improves the $k$-means objective. This novel heuristic can improve Hartigan's $k$-means when it has converged to a local minimum. We show empirically that this merge-and-split $k$-means improves over the Hartigan's heuristic which is the {\em de facto} method of choice. Finally, we propose the $(k,l)$-means objective that generalizes the $k$-means objective by associating the data points to their $l$ closest cluster centers, and show how to either directly convert or iteratively relax the $(k,l)$-means into a $k$-means in order to reach better local minima. △ Less

Submitted 22 June, 2014; originally announced June 2014.

Comments: 14 pages

arXiv:1303.7286 [pdf, other]

doi 10.1109/LSP.2013.2260538

On the symmetrical Kullback-Leibler Jeffreys centroids

Authors: Frank Nielsen

Abstract: Due to the success of the bag-of-word modeling paradigm, clustering histograms has become an important ingredient of modern information processing. Clustering histograms can be performed using the celebrated $k$-means centroid-based algorithm. From the viewpoint of applications, it is usually required to deal with symmetric distances. In this letter, we consider the Jeffreys divergence that symmet… ▽ More Due to the success of the bag-of-word modeling paradigm, clustering histograms has become an important ingredient of modern information processing. Clustering histograms can be performed using the celebrated $k$-means centroid-based algorithm. From the viewpoint of applications, it is usually required to deal with symmetric distances. In this letter, we consider the Jeffreys divergence that symmetrizes the Kullback-Leibler divergence, and investigate the computation of Jeffreys centroids. We first prove that the Jeffreys centroid can be expressed analytically using the Lambert $W$ function for positive histograms. We then show how to obtain a fast guaranteed approximation when dealing with frequency histograms. Finally, we conclude with some remarks on the $k$-means histogram clustering. △ Less

Submitted 22 January, 2014; v1 submitted 28 March, 2013; originally announced March 2013.

Comments: 17 pages, 1 figure, source code in R

Journal ref: IEEE Signal Processing Letters (Volume:20 , Issue: 7 ), pp. 657-660, 2013

arXiv:1206.2742 [pdf, other]

Online open neuroimaging mass meta-analysis

Authors: Finn Årup Nielsen, Matthew J. Kempton, Steven C. R. Williams

Abstract: We describe a system for meta-analysis where a wiki stores numerical data in a simple format and a web service performs the numerical computation. We initially apply the system on multiple meta-analyses of structural neuroimaging data results. The described system allows for mass meta-analysis, e.g., meta-analysis across multiple brain regions and multiple mental disorders. We describe a system for meta-analysis where a wiki stores numerical data in a simple format and a web service performs the numerical computation. We initially apply the system on multiple meta-analyses of structural neuroimaging data results. The described system allows for mass meta-analysis, e.g., meta-analysis across multiple brain regions and multiple mental disorders. △ Less

Submitted 13 June, 2012; originally announced June 2012.

Comments: 5 pages, 4 figures SePublica 2012, ESWC 2012 Workshop, 28 May 2012, Heraklion, Greece

MSC Class: 68U35 ACM Class: H.5.4; J.3; G.3

arXiv:1206.2054 [pdf, other]

Maximum A Posteriori Covariance Estimation Using a Power Inverse Wishart Prior

Authors: Søren Feodor Nielsen, Jon Sporring

Abstract: The estimation of the covariance matrix is an initial step in many multivariate statistical methods such as principal components analysis and factor analysis, but in many practical applications the dimensionality of the sample space is large compared to the number of samples, and the usual maximum likelihood estimate is poor. Typically, improvements are obtained by modelling or regularization. Fro… ▽ More The estimation of the covariance matrix is an initial step in many multivariate statistical methods such as principal components analysis and factor analysis, but in many practical applications the dimensionality of the sample space is large compared to the number of samples, and the usual maximum likelihood estimate is poor. Typically, improvements are obtained by modelling or regularization. From a practical point of view, these methods are often computationally heavy and rely on approximations. As a fast substitute, we propose an easily calculable maximum a posteriori (MAP) estimator based on a new class of prior distributions generalizing the inverse Wishart prior, discuss its properties, and demonstrate the estimator on simulated and real data. △ Less

Submitted 10 June, 2012; originally announced June 2012.

Comments: 29 pages, 8 figures, 2 tables

arXiv:1203.5181 [pdf, other]

doi 10.1109/ICASSP.2012.6288022

$k$-MLE: A fast algorithm for learning statistical mixture models

Authors: Frank Nielsen

Abstract: We describe $k$-MLE, a fast and efficient local search algorithm for learning finite statistical mixtures of exponential families such as Gaussian mixture models. Mixture models are traditionally learned using the expectation-maximization (EM) soft clustering technique that monotonically increases the incomplete (expected complete) likelihood. Given prescribed mixture weights, the hard clustering… ▽ More We describe $k$-MLE, a fast and efficient local search algorithm for learning finite statistical mixtures of exponential families such as Gaussian mixture models. Mixture models are traditionally learned using the expectation-maximization (EM) soft clustering technique that monotonically increases the incomplete (expected complete) likelihood. Given prescribed mixture weights, the hard clustering $k$-MLE algorithm iteratively assigns data to the most likely weighted component and update the component models using Maximum Likelihood Estimators (MLEs). Using the duality between exponential families and Bregman divergences, we prove that the local convergence of the complete likelihood of $k$-MLE follows directly from the convergence of a dual additively weighted Bregman hard clustering. The inner loop of $k$-MLE can be implemented using any $k$-means heuristic like the celebrated Lloyd's batched or Hartigan's greedy swap updates. We then show how to update the mixture weights by minimizing a cross-entropy criterion that implies to update weights by taking the relative proportion of cluster points, and reiterate the mixture parameter update and mixture weight update processes until convergence. Hard EM is interpreted as a special case of $k$-MLE when both the component update and the weight update are performed successively in the inner loop. To initialize $k$-MLE, we propose $k$-MLE++, a careful initialization of $k$-MLE guaranteeing probabilistically a global bound on the best possible complete likelihood. △ Less

Submitted 23 March, 2012; originally announced March 2012.

Comments: 31 pages, Extend preliminary paper presented at IEEE ICASSP 2012

Showing 1–45 of 45 results for author: Nielsen, F