-
Cryo-em images are intrinsically low dimensional
Authors:
Luke Evans,
Octavian-Vlad Murad,
Lars Dingeldein,
Pilar Cossio,
Roberto Covino,
Marina Meila
Abstract:
Simulation-based inference provides a powerful framework for cryo-electron microscopy, employing neural networks in methods like CryoSBI to infer biomolecular conformations via learned latent representations. This latent space represents a rich opportunity, encoding valuable information about the physical system and the inference process. Harnessing this potential hinges on understanding the under…
▽ More
Simulation-based inference provides a powerful framework for cryo-electron microscopy, employing neural networks in methods like CryoSBI to infer biomolecular conformations via learned latent representations. This latent space represents a rich opportunity, encoding valuable information about the physical system and the inference process. Harnessing this potential hinges on understanding the underlying geometric structure of these representations. We investigate this structure by applying manifold learning techniques to CryoSBI representations of hemagglutinin (simulated and experimental). We reveal that these high-dimensional data inherently populate low-dimensional, smooth manifolds, with simulated data effectively covering the experimental counterpart. By characterizing the manifold's geometry using Diffusion Maps and identifying its principal axes of variation via coordinate interpretation methods, we establish a direct link between the latent structure and key physical parameters. Discovering this intrinsic low-dimensionality and interpretable geometric organization not only validates the CryoSBI approach but enables us to learn more from the data structure and provides opportunities for improving future inference strategies by exploiting this revealed manifold geometry.
△ Less
Submitted 3 September, 2025; v1 submitted 15 April, 2025;
originally announced April 2025.
-
The Entropy and Crossentropy of Generalized Mallows Models
Authors:
Marina Meilă
Abstract:
The Generalized Mallows Model (GMM) is a well known family of models for ranking data. A GMM is a distribution over $\mathbb{S}_n$, the set of permutations of n objects, characterized by a location parameter $σ\in \mathbb{S}_n$, known as central permutation and a set of dispersion parameters $θ_{1:n-1}\in(0,1]$. The GMM shares many properties, such as having sufficient statistics, with exponential…
▽ More
The Generalized Mallows Model (GMM) is a well known family of models for ranking data. A GMM is a distribution over $\mathbb{S}_n$, the set of permutations of n objects, characterized by a location parameter $σ\in \mathbb{S}_n$, known as central permutation and a set of dispersion parameters $θ_{1:n-1}\in(0,1]$. The GMM shares many properties, such as having sufficient statistics, with exponential models, thus it can be seen as an exponential family with a discrete parameter $σ$. This paper shows that computing entropy, crossentropy and Kullback-Leibler divergence in the the class of GMM is tractable, paving the way for a better understanding of this exponential family.
△ Less
Submitted 21 March, 2025;
originally announced March 2025.
-
How well behaved is finite dimensional Diffusion Maps?
Authors:
Wenyu Bo,
Marina Meilă
Abstract:
Under a set of assumptions on a family of submanifolds $\subset {\mathbb R}^D$, we derive a series of geometric properties that remain valid after finite-dimensional and almost isometric Diffusion Maps (DM), including almost uniform density, finite polynomial approximation and reach. Leveraging these properties, we establish rigorous bounds on the embedding errors introduced by the DM algorithm is…
▽ More
Under a set of assumptions on a family of submanifolds $\subset {\mathbb R}^D$, we derive a series of geometric properties that remain valid after finite-dimensional and almost isometric Diffusion Maps (DM), including almost uniform density, finite polynomial approximation and reach. Leveraging these properties, we establish rigorous bounds on the embedding errors introduced by the DM algorithm is $O\left((\frac{\log n}{n})^{\frac{1}{8d+16}}\right)$. Furthermore, we quantify the error between the estimated tangent spaces and the true tangent spaces over the submanifolds after the DM embedding,
$\sup_{P\in \mathcal{P}}\mathbb{E}_{P^{\otimes \tilde{n}}} \max_{1\leq j \angle (T_{Y_{\varphi(M),j}}\varphi(M),\hat{T}_j)\leq \tilde{n}} \leq C \left(\frac{\log n }{n}\right)^\frac{k-1}{(8d+16)k}$,
which providing a precise characterization of the geometric accuracy of the embeddings. These results offer a solid theoretical foundation for understanding the performance and reliability of DM in practical applications.
△ Less
Submitted 21 March, 2025; v1 submitted 5 December, 2024;
originally announced December 2024.
-
Isometry pursuit
Authors:
Samson Koelle,
Marina Meila
Abstract:
Isometry pursuit is a convex algorithm for identifying orthonormal column-submatrices of wide matrices. It consists of a novel normalization method followed by multitask basis pursuit. Applied to Jacobians of putative coordinate functions, it helps identity isometric embeddings from within interpretable dictionaries. We provide theoretical and experimental results justifying this method. For probl…
▽ More
Isometry pursuit is a convex algorithm for identifying orthonormal column-submatrices of wide matrices. It consists of a novel normalization method followed by multitask basis pursuit. Applied to Jacobians of putative coordinate functions, it helps identity isometric embeddings from within interpretable dictionaries. We provide theoretical and experimental results justifying this method. For problems involving coordinate selection and diversification, it offers a synergistic alternative to greedy and brute force search.
△ Less
Submitted 27 November, 2024;
originally announced November 2024.
-
Manifold learning: what, how, and why
Authors:
Marina Meilă,
Hanyu Zhang
Abstract:
Manifold learning (ML), known also as non-linear dimension reduction, is a set of methods to find the low dimensional structure of data. Dimension reduction for large, high dimensional data is not merely a way to reduce the data; the new representations and descriptors obtained by ML reveal the geometric shape of high dimensional point clouds, and allow one to visualize, de-noise and interpret the…
▽ More
Manifold learning (ML), known also as non-linear dimension reduction, is a set of methods to find the low dimensional structure of data. Dimension reduction for large, high dimensional data is not merely a way to reduce the data; the new representations and descriptors obtained by ML reveal the geometric shape of high dimensional point clouds, and allow one to visualize, de-noise and interpret them. This survey presents the principles underlying ML, the representative methods, as well as their statistical foundations from a practicing statistician's perspective. It describes the trade-offs, and what theory tells us about the parameter and algorithmic choices we make in order to obtain reliable conclusions.
△ Less
Submitted 7 November, 2023;
originally announced November 2023.
-
Dictionary-based Manifold Learning
Authors:
Hanyu Zhang,
Samson Koelle,
Marina Meila
Abstract:
We propose a paradigm for interpretable Manifold Learning for scientific data analysis, whereby we parametrize a manifold with $d$ smooth functions from a scientist-provided dictionary of meaningful, domain-related functions. When such a parametrization exists, we provide an algorithm for finding it based on sparse non-linear regression in the manifold tangent bundle, bypassing more standard manif…
▽ More
We propose a paradigm for interpretable Manifold Learning for scientific data analysis, whereby we parametrize a manifold with $d$ smooth functions from a scientist-provided dictionary of meaningful, domain-related functions. When such a parametrization exists, we provide an algorithm for finding it based on sparse non-linear regression in the manifold tangent bundle, bypassing more standard manifold learning algorithms. We also discuss conditions for the existence of such parameterizations in function space and for successful recovery from finite samples. We demonstrate our method with experimental results from a real scientific domain.
△ Less
Submitted 4 February, 2023; v1 submitted 1 February, 2023;
originally announced February 2023.
-
The Parametric Stability of Well-separated Spherical Gaussian Mixtures
Authors:
Hanyu Zhang,
Marina Meila
Abstract:
We quantify the parameter stability of a spherical Gaussian Mixture Model (sGMM) under small perturbations in distribution space. Namely, we derive the first explicit bound to show that for a mixture of spherical Gaussian $P$ (sGMM) in a pre-defined model class, all other sGMM close to $P$ in this model class in total variation distance has a small parameter distance to $P$. Further, this upper bo…
▽ More
We quantify the parameter stability of a spherical Gaussian Mixture Model (sGMM) under small perturbations in distribution space. Namely, we derive the first explicit bound to show that for a mixture of spherical Gaussian $P$ (sGMM) in a pre-defined model class, all other sGMM close to $P$ in this model class in total variation distance has a small parameter distance to $P$. Further, this upper bound only depends on $P$. The motivation for this work lies in providing guarantees for fitting Gaussian mixtures; with this aim in mind, all the constants involved are well defined and distribution free conditions for fitting mixtures of spherical Gaussians. Our results tighten considerably the existing computable bounds, and asymptotically match the known sharp thresholds for this problem.
△ Less
Submitted 31 January, 2023;
originally announced February 2023.
-
Double Diffusion Maps and their Latent Harmonics for Scientific Computations in Latent Space
Authors:
Nikolaos Evangelou,
Felix Dietrich,
Eliodoro Chiavazzo,
Daniel Lehmberg,
Marina Meila,
Ioannis G. Kevrekidis
Abstract:
We introduce a data-driven approach to building reduced dynamical models through manifold learning; the reduced latent space is discovered using Diffusion Maps (a manifold learning technique) on time series data. A second round of Diffusion Maps on those latent coordinates allows the approximation of the reduced dynamical models. This second round enables mapping the latent space coordinates back…
▽ More
We introduce a data-driven approach to building reduced dynamical models through manifold learning; the reduced latent space is discovered using Diffusion Maps (a manifold learning technique) on time series data. A second round of Diffusion Maps on those latent coordinates allows the approximation of the reduced dynamical models. This second round enables mapping the latent space coordinates back to the full ambient space (what is called lifting); it also enables the approximation of full state functions of interest in terms of the reduced coordinates. In our work, we develop and test three different reduced numerical simulation methodologies, either through pre-tabulation in the latent space and integration on the fly or by going back and forth between the ambient space and the latent space. The data-driven latent space simulation results, based on the three different approaches, are validated through (a) the latent space observation of the full simulation through the Nyström Extension formula, or through (b) lifting the reduced trajectory back to the full ambient space, via Latent Harmonics. Latent space modeling often involves additional regularization to favor certain properties of the space over others, and the mapping back to the ambient space is then constructed mostly independently from these properties; here, we use the same data-driven approach to construct the latent space and then map back to the ambient space.
△ Less
Submitted 26 April, 2022;
originally announced April 2022.
-
To ArXiv or not to ArXiv: A Study Quantifying Pros and Cons of Posting Preprints Online
Authors:
Charvi Rastogi,
Ivan Stelmakh,
Xinwei Shen,
Marina Meila,
Federico Echenique,
Shuchi Chawla,
Nihar B. Shah
Abstract:
Double-blind conferences have engaged in debates over whether to allow authors to post their papers online on arXiv or elsewhere during the review process. Independently, some authors of research papers face the dilemma of whether to put their papers on arXiv due to its pros and cons. We conduct a study to substantiate this debate and dilemma via quantitative measurements. Specifically, we conduct…
▽ More
Double-blind conferences have engaged in debates over whether to allow authors to post their papers online on arXiv or elsewhere during the review process. Independently, some authors of research papers face the dilemma of whether to put their papers on arXiv due to its pros and cons. We conduct a study to substantiate this debate and dilemma via quantitative measurements. Specifically, we conducted surveys of reviewers in two top-tier double-blind computer science conferences -- ICML 2021 (5361 submissions and 4699 reviewers) and EC 2021 (498 submissions and 190 reviewers). Our two main findings are as follows. First, more than a third of the reviewers self-report searching online for a paper they are assigned to review. Second, outside the review process, we find that preprints from better-ranked affiliations see a weakly higher visibility, with a correlation of 0.06 in ICML and 0.05 in EC. In particular, papers associated with the top-10-ranked affiliations had a visibility of approximately 11% in ICML and 22% in EC, whereas the remaining papers had a visibility of 7% and 18% respectively.
△ Less
Submitted 11 June, 2022; v1 submitted 31 March, 2022;
originally announced March 2022.
-
Distribution free optimality intervals for clustering
Authors:
Marina Meilă,
Hanyu Zhang
Abstract:
We address the problem of validating the ouput of clustering algorithms. Given data $\mathcal{D}$ and a partition $\mathcal{C}$ of these data into $K$ clusters, when can we say that the clusters obtained are correct or meaningful for the data? This paper introduces a paradigm in which a clustering $\mathcal{C}$ is considered meaningful if it is good with respect to a loss function such as the K-me…
▽ More
We address the problem of validating the ouput of clustering algorithms. Given data $\mathcal{D}$ and a partition $\mathcal{C}$ of these data into $K$ clusters, when can we say that the clusters obtained are correct or meaningful for the data? This paper introduces a paradigm in which a clustering $\mathcal{C}$ is considered meaningful if it is good with respect to a loss function such as the K-means distortion, and stable, i.e. the only good clustering up to small perturbations. Furthermore, we present a generic method to obtain post-inference guarantees of near-optimality and stability for a clustering $\mathcal{C}$. The method can be instantiated for a variety of clustering criteria (also called loss functions) for which convex relaxations exist. Obtaining the guarantees amounts to solving a convex optimization problem. We demonstrate the practical relevance of this method by obtaining guarantees for the K-means and the Normalized Cut clustering criteria on realistic data sets. We also prove that asymptotic instability implies finite sample instability w.h.p., allowing inferences about the population clusterability from a sample. The guarantees do not depend on any distributional assumptions, but they depend on the data set $\mathcal{D}$ admitting a stable clustering.
△ Less
Submitted 1 February, 2023; v1 submitted 30 July, 2021;
originally announced July 2021.
-
The decomposition of the higher-order homology embedding constructed from the $k$-Laplacian
Authors:
Yu-Chia Chen,
Marina Meilă
Abstract:
The null space of the $k$-th order Laplacian $\mathbf{\mathcal L}_k$, known as the {\em $k$-th homology vector space}, encodes the non-trivial topology of a manifold or a network. Understanding the structure of the homology embedding can thus disclose geometric or topological information from the data. The study of the null space embedding of the graph Laplacian $\mathbf{\mathcal L}_0$ has spurred…
▽ More
The null space of the $k$-th order Laplacian $\mathbf{\mathcal L}_k$, known as the {\em $k$-th homology vector space}, encodes the non-trivial topology of a manifold or a network. Understanding the structure of the homology embedding can thus disclose geometric or topological information from the data. The study of the null space embedding of the graph Laplacian $\mathbf{\mathcal L}_0$ has spurred new research and applications, such as spectral clustering algorithms with theoretical guarantees and estimators of the Stochastic Block Model. In this work, we investigate the geometry of the $k$-th homology embedding and focus on cases reminiscent of spectral clustering. Namely, we analyze the {\em connected sum} of manifolds as a perturbation to the direct sum of their homology embeddings. We propose an algorithm to factorize the homology embedding into subspaces corresponding to a manifold's simplest topological components. The proposed framework is applied to the {\em shortest homologous loop detection} problem, a problem known to be NP-hard in general. Our spectral loop detection algorithm scales better than existing methods and is effective on diverse data such as point clouds and images.
△ Less
Submitted 2 August, 2021; v1 submitted 22 July, 2021;
originally announced July 2021.
-
A class of network models recoverable by spectral clustering
Authors:
Yali Wan,
Marina Meila
Abstract:
Finding communities in networks is a problem that remains difficult, in spite of the amount of attention it has recently received. The Stochastic Block-Model (SBM) is a generative model for graphs with "communities" for which, because of its simplicity, the theoretical understanding has advanced fast in recent years. In particular, there have been various results showing that simple versions of sp…
▽ More
Finding communities in networks is a problem that remains difficult, in spite of the amount of attention it has recently received. The Stochastic Block-Model (SBM) is a generative model for graphs with "communities" for which, because of its simplicity, the theoretical understanding has advanced fast in recent years. In particular, there have been various results showing that simple versions of spectral clustering using the Normalized Laplacian of the graph can recover the communities almost perfectly with high probability. Here we show that essentially the same algorithm used for the SBM and for its extension called Degree-Corrected SBM, works on a wider class of Block-Models, which we call Preference Frame Models, with essentially the same guarantees. Moreover, the parametrization we introduce clearly exhibits the free parameters needed to specify this class of models, and results in bounds that expose with more clarity the parameters that control the recovery error in this model class.
△ Less
Submitted 21 April, 2021;
originally announced April 2021.
-
Helmholtzian Eigenmap: Topological feature discovery & edge flow learning from point cloud data
Authors:
Yu-Chia Chen,
Weicheng Wu,
Marina Meilă,
Ioannis G. Kevrekidis
Abstract:
The manifold Helmholtzian (1-Laplacian) operator $Δ_1$ elegantly generalizes the Laplace-Beltrami operator to vector fields on a manifold $\mathcal M$. In this work, we propose the estimation of the manifold Helmholtzian from point cloud data by a weighted 1-Laplacian $\mathcal L_1$. While higher order Laplacians have been introduced and studied, this work is the first to present a graph Helmholtz…
▽ More
The manifold Helmholtzian (1-Laplacian) operator $Δ_1$ elegantly generalizes the Laplace-Beltrami operator to vector fields on a manifold $\mathcal M$. In this work, we propose the estimation of the manifold Helmholtzian from point cloud data by a weighted 1-Laplacian $\mathcal L_1$. While higher order Laplacians have been introduced and studied, this work is the first to present a graph Helmholtzian constructed from a simplicial complex as a consistent estimator for the continuous operator in a non-parametric setting. Equipped with the geometric and topological information about $\mathcal M$, the Helmholtzian is a useful tool for the analysis of flows and vector fields on $\mathcal M$ via the Helmholtz-Hodge theorem. In addition, the $\mathcal L_1$ allows the smoothing, prediction, and feature extraction of the flows. We demonstrate these possibilities on substantial sets of synthetic and real point cloud datasets with non-trivial topological structures; and provide theoretical results on the limit of $\mathcal L_1$ to $Δ_1$.
△ Less
Submitted 31 October, 2023; v1 submitted 13 March, 2021;
originally announced March 2021.
-
Guarantees for Hierarchical Clustering by the Sublevel Set method
Authors:
Marina Meila
Abstract:
Meila (2018) introduces an optimization based method called the Sublevel Set method, to guarantee that a clustering is nearly optimal and "approximately correct" without relying on any assumptions about the distribution that generated the data. This paper extends the Sublevel Set method to the cost-based hierarchical clustering paradigm proposed by Dasgupta (2016).
Meila (2018) introduces an optimization based method called the Sublevel Set method, to guarantee that a clustering is nearly optimal and "approximately correct" without relying on any assumptions about the distribution that generated the data. This paper extends the Sublevel Set method to the cost-based hierarchical clustering paradigm proposed by Dasgupta (2016).
△ Less
Submitted 5 July, 2020; v1 submitted 18 June, 2020;
originally announced June 2020.
-
Selecting the independent coordinates of manifolds with large aspect ratios
Authors:
Yu-Chia Chen,
Marina Meilă
Abstract:
Many manifold embedding algorithms fail apparently when the data manifold has a large aspect ratio (such as a long, thin strip). Here, we formulate success and failure in terms of finding a smooth embedding, showing also that the problem is pervasive and more complex than previously recognized. Mathematically, success is possible under very broad conditions, provided that embedding is done by care…
▽ More
Many manifold embedding algorithms fail apparently when the data manifold has a large aspect ratio (such as a long, thin strip). Here, we formulate success and failure in terms of finding a smooth embedding, showing also that the problem is pervasive and more complex than previously recognized. Mathematically, success is possible under very broad conditions, provided that embedding is done by carefully selected eigenfunctions of the Laplace-Beltrami operator $Δ$. Hence, we propose a bicriterial Independent Eigencoordinate Selection (IES) algorithm that selects smooth embeddings with few eigenvectors. The algorithm is grounded in theory, has low computational overhead, and is successful on synthetic and large real data.
△ Less
Submitted 2 August, 2021; v1 submitted 2 July, 2019;
originally announced July 2019.
-
Measuring the Robustness of Graph Properties
Authors:
Yali Wan,
Marina Meila
Abstract:
In this paper, we propose a perturbation framework to measure the robustness of graph properties. Although there are already perturbation methods proposed to tackle this problem, they are limited by the fact that the strength of the perturbation cannot be well controlled. We firstly provide a perturbation framework on graphs by introducing weights on the nodes, of which the magnitude of perturbati…
▽ More
In this paper, we propose a perturbation framework to measure the robustness of graph properties. Although there are already perturbation methods proposed to tackle this problem, they are limited by the fact that the strength of the perturbation cannot be well controlled. We firstly provide a perturbation framework on graphs by introducing weights on the nodes, of which the magnitude of perturbation can be easily controlled through the variance of the weights. Meanwhile, the topology of the graphs are also preserved to avoid uncontrollable strength in the perturbation. We then extend the measure of robustness in the robust statistics literature to the graph properties.
△ Less
Submitted 3 December, 2018;
originally announced January 2019.
-
Manifold Coordinates with Physical Meaning
Authors:
Samson Koelle,
Hanyu Zhang,
Marina Meila,
Yu-Chia Chen
Abstract:
Manifold embedding algorithms map high-dimensional data down to coordinates in a much lower-dimensional space. One of the aims of dimension reduction is to find intrinsic coordinates that describe the data manifold. The coordinates returned by the embedding algorithm are abstract, and finding their physical or domain-related meaning is not formalized and often left to domain experts. This paper st…
▽ More
Manifold embedding algorithms map high-dimensional data down to coordinates in a much lower-dimensional space. One of the aims of dimension reduction is to find intrinsic coordinates that describe the data manifold. The coordinates returned by the embedding algorithm are abstract, and finding their physical or domain-related meaning is not formalized and often left to domain experts. This paper studies the problem of recovering the meaning of the new low-dimensional representation in an automatic, principled fashion. We propose a method to explain embedding coordinates of a manifold as non-linear compositions of functions from a user-defined dictionary. We show that this problem can be set up as a sparse linear Group Lasso recovery problem, find sufficient recovery conditions, and demonstrate its effectiveness on data.
△ Less
Submitted 29 July, 2021; v1 submitted 28 November, 2018;
originally announced November 2018.
-
How to sample connected $K$-partitions of a graph
Authors:
Marina Meila
Abstract:
A connected undirected graph $G=(V,E)$ is given. This paper presents an algorithm that samples (non-uniformly) a $K$ partition $U_1,\ldots U_K$ of the graph nodes $V$, such that the subgraph induced by each $U_k$, with $k=1:K$, is connected. Moreover, the probability induced by the algorithm over the set ${\mathcal C}_K$ of all such partitions is obtained in closed form.
A connected undirected graph $G=(V,E)$ is given. This paper presents an algorithm that samples (non-uniformly) a $K$ partition $U_1,\ldots U_K$ of the graph nodes $V$, such that the subgraph induced by each $U_k$, with $k=1:K$, is connected. Moreover, the probability induced by the algorithm over the set ${\mathcal C}_K$ of all such partitions is obtained in closed form.
△ Less
Submitted 20 July, 2018;
originally announced August 2018.
-
megaman: Manifold Learning with Millions of points
Authors:
James McQueen,
Marina Meila,
Jacob VanderPlas,
Zhongyue Zhang
Abstract:
Manifold Learning is a class of algorithms seeking a low-dimensional non-linear representation of high-dimensional data. Thus manifold learning algorithms are, at least in theory, most applicable to high-dimensional data and sample sizes to enable accurate estimation of the manifold. Despite this, most existing manifold learning implementations are not particularly scalable. Here we present a Pyth…
▽ More
Manifold Learning is a class of algorithms seeking a low-dimensional non-linear representation of high-dimensional data. Thus manifold learning algorithms are, at least in theory, most applicable to high-dimensional data and sample sizes to enable accurate estimation of the manifold. Despite this, most existing manifold learning implementations are not particularly scalable. Here we present a Python package that implements a variety of manifold learning algorithms in a modular and scalable fashion, using fast approximate neighbors searches and fast sparse eigendecompositions. The package incorporates theoretical advances in manifold learning, such as the unbiased Laplacian estimator and the estimation of the embedding distortion by the Riemannian metric method. In benchmarks, even on a single-core desktop computer, our code embeds millions of data points in minutes, and takes just 200 minutes to embed the main sample of galaxy spectra from the Sloan Digital Sky Survey --- consisting of 0.6 million samples in 3750-dimensions --- a task which has not previously been possible.
△ Less
Submitted 8 March, 2016;
originally announced March 2016.
-
Graph Sensitive Indices for Comparing Clusterings
Authors:
Zaeem Hussain,
Marina Meila
Abstract:
This report discusses two new indices for comparing clusterings of a set of points. The motivation for looking at new ways for comparing clusterings stems from the fact that the existing clustering indices are based on set cardinality alone and do not consider the positions of data points. The new indices, namely, the Random Walk index (RWI) and Variation of Information with Neighbors (VIN), are b…
▽ More
This report discusses two new indices for comparing clusterings of a set of points. The motivation for looking at new ways for comparing clusterings stems from the fact that the existing clustering indices are based on set cardinality alone and do not consider the positions of data points. The new indices, namely, the Random Walk index (RWI) and Variation of Information with Neighbors (VIN), are both inspired by the clustering metric Variation of Information (VI). VI possesses some interesting theoretical properties which are also desirable in a metric for comparing clusterings. We define our indices and discuss some of their explored properties which appear relevant for a clustering index. We also include the results of these indices on clusterings of some example data sets.
△ Less
Submitted 27 November, 2014;
originally announced November 2014.
-
Improved graph Laplacian via geometric self-consistency
Authors:
Dominique Perrault-Joncas,
Marina Meila
Abstract:
We address the problem of setting the kernel bandwidth used by Manifold Learning algorithms to construct the graph Laplacian. Exploiting the connection between manifold geometry, represented by the Riemannian metric, and the Laplace-Beltrami operator, we set the bandwidth by optimizing the Laplacian's ability to preserve the geometry of the data. Experiments show that this principled approach is e…
▽ More
We address the problem of setting the kernel bandwidth used by Manifold Learning algorithms to construct the graph Laplacian. Exploiting the connection between manifold geometry, represented by the Riemannian metric, and the Laplace-Beltrami operator, we set the bandwidth by optimizing the Laplacian's ability to preserve the geometry of the data. Experiments show that this principled approach is effective and robust.
△ Less
Submitted 31 May, 2014;
originally announced June 2014.
-
Estimating Vector Fields on Manifolds and the Embedding of Directed Graphs
Authors:
Dominique Perrault-Joncas,
Marina Meila
Abstract:
This paper considers the problem of embedding directed graphs in Euclidean space while retaining directional information. We model a directed graph as a finite set of observations from a diffusion on a manifold endowed with a vector field. This is the first generative model of its kind for directed graphs. We introduce a graph embedding algorithm that estimates all three features of this model: th…
▽ More
This paper considers the problem of embedding directed graphs in Euclidean space while retaining directional information. We model a directed graph as a finite set of observations from a diffusion on a manifold endowed with a vector field. This is the first generative model of its kind for directed graphs. We introduce a graph embedding algorithm that estimates all three features of this model: the low-dimensional embedding of the manifold, the data density and the vector field. In the process, we also obtain new theoretical results on the limits of "Laplacian type" matrices derived from directed graphs. The application of our method to both artificially constructed and real data highlights its strengths.
△ Less
Submitted 30 May, 2014;
originally announced June 2014.
-
An Experimental Comparison of Several Clustering and Initialization Methods
Authors:
Marina Meila,
David Heckerman
Abstract:
We examine methods for clustering in high dimensions. In the first part of the paper, we perform an experimental comparison between three batch clustering algorithms: the Expectation-Maximization (EM) algorithm, a winner take all version of the EM algorithm reminiscent of the K-means algorithm, and model-based hierarchical agglomerative clustering. We learn naive-Bayes models with a hidden root no…
▽ More
We examine methods for clustering in high dimensions. In the first part of the paper, we perform an experimental comparison between three batch clustering algorithms: the Expectation-Maximization (EM) algorithm, a winner take all version of the EM algorithm reminiscent of the K-means algorithm, and model-based hierarchical agglomerative clustering. We learn naive-Bayes models with a hidden root node, using high-dimensional discrete-variable data sets (both real and synthetic). We find that the EM algorithm significantly outperforms the other methods, and proceed to investigate the effect of various initialization schemes on the final solution produced by the EM algorithm. The initializations that we consider are (1) parameters sampled from an uninformative prior, (2) random perturbations of the marginal distribution of the data, and (3) the output of hierarchical agglomerative clustering. Although the methods are substantially different, they lead to learned models that are strikingly similar in quality.
△ Less
Submitted 16 May, 2015; v1 submitted 30 January, 2013;
originally announced January 2013.
-
Tractable Bayesian Learning of Tree Belief Networks
Authors:
Marina Meila,
Tommi S. Jaakkola
Abstract:
In this paper we present decomposable priors, a family of priors over structure and parameters of tree belief nets for which Bayesian learning with complete observations is tractable, in the sense that the posterior is also decomposable and can be completely determined analytically in polynomial time. This follows from two main results: First, we show that factored distributions over spanning tre…
▽ More
In this paper we present decomposable priors, a family of priors over structure and parameters of tree belief nets for which Bayesian learning with complete observations is tractable, in the sense that the posterior is also decomposable and can be completely determined analytically in polynomial time. This follows from two main results: First, we show that factored distributions over spanning trees in a graph can be integrated in closed form. Second, we examine priors over tree parameters and show that a set of assumptions similar to (Heckerman and al. 1995) constrain the tree parameter priors to be a compactly parameterized product of Dirichlet distributions. Beside allowing for exact Bayesian learning, these results permit us to formulate a new class of tractable latent variable models in which the likelihood of a data point is computed through an ensemble average over tree structures.
△ Less
Submitted 16 January, 2013;
originally announced January 2013.
-
Unsupervised spectral learning
Authors:
Susan Shortreed,
Marina Meila
Abstract:
In spectral clustering and spectral image segmentation, the data is partioned starting from a given matrix of pairwise similarities S. the matrix S is constructed by hand, or learned on a separate training set. In this paper we show how to achieve spectral clustering in unsupervised mode. Our algorithm starts with a set of observed pairwise features, which are possible components of an unknown, pa…
▽ More
In spectral clustering and spectral image segmentation, the data is partioned starting from a given matrix of pairwise similarities S. the matrix S is constructed by hand, or learned on a separate training set. In this paper we show how to achieve spectral clustering in unsupervised mode. Our algorithm starts with a set of observed pairwise features, which are possible components of an unknown, parametric similarity function. This function is learned iteratively, at the same time as the clustering of the data. The algorithm shows promosing results on synthetic and real data.
△ Less
Submitted 4 July, 2012;
originally announced July 2012.
-
Consensus ranking under the exponential model
Authors:
Marina Meila,
Kapil Phadnis,
Arthur Patterson,
Jeff A. Bilmes
Abstract:
We analyze the generalized Mallows model, a popular exponential model over rankings. Estimating the central (or consensus) ranking from data is NP-hard. We obtain the following new results: (1) We show that search methods can estimate both the central ranking pi0 and the model parameters theta exactly. The search is n! in the worst case, but is tractable when the true distribution is concentrated…
▽ More
We analyze the generalized Mallows model, a popular exponential model over rankings. Estimating the central (or consensus) ranking from data is NP-hard. We obtain the following new results: (1) We show that search methods can estimate both the central ranking pi0 and the model parameters theta exactly. The search is n! in the worst case, but is tractable when the true distribution is concentrated around its mode; (2) We show that the generalized Mallows model is jointly exponential in (pi0; theta), and introduce the conjugate prior for this model class; (3) The sufficient statistics are the pairwise marginal probabilities that item i is preferred to item j. Preliminary experiments confirm the theoretical predictions and compare the new algorithm and existing heuristics.
△ Less
Submitted 20 June, 2012;
originally announced June 2012.
-
Estimation and Clustering with Infinite Rankings
Authors:
Marina Meila,
Le Bao
Abstract:
This paper presents a natural extension of stagewise ranking to the the case of infinitely many items. We introduce the infinite generalized Mallows model (IGM), describe its properties and give procedures to estimate it from data. For estimation of multimodal distributions we introduce the Exponential-Blurring-Mean-Shift nonparametric clustering algorithm. The experiments highlight the properties…
▽ More
This paper presents a natural extension of stagewise ranking to the the case of infinitely many items. We introduce the infinite generalized Mallows model (IGM), describe its properties and give procedures to estimate it from data. For estimation of multimodal distributions we introduce the Exponential-Blurring-Mean-Shift nonparametric clustering algorithm. The experiments highlight the properties of the new model and demonstrate that infinite models can be simple, elegant and practical.
△ Less
Submitted 13 June, 2012;
originally announced June 2012.
-
Dirichlet Process Mixtures of Generalized Mallows Models
Authors:
Marina Meila,
Harr Chen
Abstract:
We present a Dirichlet process mixture model over discrete incomplete rankings and study two Gibbs sampling inference techniques for estimating posterior clusterings. The first approach uses a slice sampling subcomponent for estimating cluster parameters. The second approach marginalizes out several cluster parameters by taking advantage of approximations to the conditional posteriors. We empirica…
▽ More
We present a Dirichlet process mixture model over discrete incomplete rankings and study two Gibbs sampling inference techniques for estimating posterior clusterings. The first approach uses a slice sampling subcomponent for estimating cluster parameters. The second approach marginalizes out several cluster parameters by taking advantage of approximations to the conditional posteriors. We empirically demonstrate (1) the effectiveness of this approximation for improving convergence, (2) the benefits of the Dirichlet process model over alternative clustering techniques for ranked data, and (3) the applicability of the approach to exploring large realworld ranking datasets.
△ Less
Submitted 15 March, 2012;
originally announced March 2012.