Showing 1–28 of 28 results for author: Meila, M

Searching in archive cs.
  1. arXiv:2504.11249  [pdf, ps, other]

    q-bio.QM cs.CV cs.LG q-bio.BM stat.ML

    Cryo-em images are intrinsically low dimensional

    Authors: Luke Evans, Octavian-Vlad Murad, Lars Dingeldein, Pilar Cossio, Roberto Covino, Marina Meila

    Abstract: Simulation-based inference provides a powerful framework for cryo-electron microscopy, employing neural networks in methods like CryoSBI to infer biomolecular conformations via learned latent representations. This latent space represents a rich opportunity, encoding valuable information about the physical system and the inference process. Harnessing this potential hinges on understanding the under…

    Submitted 3 September, 2025; v1 submitted 15 April, 2025; originally announced April 2025.

  2. arXiv:2503.17521  [pdf, ps, other]

    math.ST cs.DS stat.CO

    The Entropy and Crossentropy of Generalized Mallows Models

    Authors: Marina Meilă

    Abstract: The Generalized Mallows Model (GMM) is a well-known family of models for ranking data. A GMM is a distribution over $\mathbb{S}_n$, the set of permutations of $n$ objects, characterized by a location parameter $σ\in \mathbb{S}_n$, known as the central permutation, and a set of dispersion parameters $θ_{1:n-1}\in(0,1]$. The GMM shares many properties, such as having sufficient statistics, with exponential…

    Submitted 21 March, 2025; originally announced March 2025.

    Comments: 15 pages

  3. arXiv:2412.03992  [pdf, ps, other]

    stat.ML cs.LG math.ST

    How well behaved is finite dimensional Diffusion Maps?

    Authors: Wenyu Bo, Marina Meilă

    Abstract: Under a set of assumptions on a family of submanifolds $\subset {\mathbb R}^D$, we derive a series of geometric properties that remain valid after finite-dimensional and almost isometric Diffusion Maps (DM), including almost uniform density, finite polynomial approximation and reach. Leveraging these properties, we establish rigorous bounds on the embedding errors introduced by the DM algorithm is…

    Submitted 21 March, 2025; v1 submitted 5 December, 2024; originally announced December 2024.

    Comments: 33 pages, 4 figures

  4. arXiv:2411.18502  [pdf, other]

    stat.ML cs.AI cs.IR cs.LG stat.ME

    Isometry pursuit

    Authors: Samson Koelle, Marina Meila

    Abstract: Isometry pursuit is a convex algorithm for identifying orthonormal column-submatrices of wide matrices. It consists of a novel normalization method followed by multitask basis pursuit. Applied to Jacobians of putative coordinate functions, it helps identify isometric embeddings from within interpretable dictionaries. We provide theoretical and experimental results justifying this method. For probl…

    Submitted 27 November, 2024; originally announced November 2024.

  5. arXiv:2311.03757  [pdf, other]

    stat.ML cs.LG

    Manifold learning: what, how, and why

    Authors: Marina Meilă, Hanyu Zhang

    Abstract: Manifold learning (ML), known also as non-linear dimension reduction, is a set of methods to find the low dimensional structure of data. Dimension reduction for large, high dimensional data is not merely a way to reduce the data; the new representations and descriptors obtained by ML reveal the geometric shape of high dimensional point clouds, and allow one to visualize, de-noise and interpret the…

    Submitted 7 November, 2023; originally announced November 2023.

  6. arXiv:2302.00263  [pdf, other]

    cs.LG

    Dictionary-based Manifold Learning

    Authors: Hanyu Zhang, Samson Koelle, Marina Meila

    Abstract: We propose a paradigm for interpretable Manifold Learning for scientific data analysis, whereby we parametrize a manifold with $d$ smooth functions from a scientist-provided dictionary of meaningful, domain-related functions. When such a parametrization exists, we provide an algorithm for finding it based on sparse non-linear regression in the manifold tangent bundle, bypassing more standard manif…

    Submitted 4 February, 2023; v1 submitted 1 February, 2023; originally announced February 2023.

  7. arXiv:2302.00242  [pdf, other]

    stat.ML cs.LG

    The Parametric Stability of Well-separated Spherical Gaussian Mixtures

    Authors: Hanyu Zhang, Marina Meila

    Abstract: We quantify the parameter stability of a spherical Gaussian Mixture Model (sGMM) under small perturbations in distribution space. Namely, we derive the first explicit bound to show that for a mixture of spherical Gaussians $P$ (sGMM) in a pre-defined model class, every other sGMM in this model class that is close to $P$ in total variation distance has a small parameter distance to $P$. Further, this upper bo…

    Submitted 31 January, 2023; originally announced February 2023.

  8. arXiv:2204.12536  [pdf, other]

    stat.ML cs.LG math.DS

    Double Diffusion Maps and their Latent Harmonics for Scientific Computations in Latent Space

    Authors: Nikolaos Evangelou, Felix Dietrich, Eliodoro Chiavazzo, Daniel Lehmberg, Marina Meila, Ioannis G. Kevrekidis

    Abstract: We introduce a data-driven approach to building reduced dynamical models through manifold learning; the reduced latent space is discovered using Diffusion Maps (a manifold learning technique) on time series data. A second round of Diffusion Maps on those latent coordinates allows the approximation of the reduced dynamical models. This second round enables mapping the latent space coordinates back…

    Submitted 26 April, 2022; originally announced April 2022.

    Comments: 25 pages, 21 figures, 4 tables

  9. arXiv:2203.17259  [pdf, other]

    cs.DL stat.AP

    To ArXiv or not to ArXiv: A Study Quantifying Pros and Cons of Posting Preprints Online

    Authors: Charvi Rastogi, Ivan Stelmakh, Xinwei Shen, Marina Meila, Federico Echenique, Shuchi Chawla, Nihar B. Shah

    Abstract: Double-blind conferences have engaged in debates over whether to allow authors to post their papers online on arXiv or elsewhere during the review process. Independently, some authors of research papers face the dilemma of whether to put their papers on arXiv due to its pros and cons. We conduct a study to substantiate this debate and dilemma via quantitative measurements. Specifically, we conduct…

    Submitted 11 June, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

    Comments: 17 pages, 3 figures

  10. arXiv:2107.14442  [pdf, other]

    stat.ML cs.LG

    Distribution free optimality intervals for clustering

    Authors: Marina Meilă, Hanyu Zhang

    Abstract: We address the problem of validating the output of clustering algorithms. Given data $\mathcal{D}$ and a partition $\mathcal{C}$ of these data into $K$ clusters, when can we say that the clusters obtained are correct or meaningful for the data? This paper introduces a paradigm in which a clustering $\mathcal{C}$ is considered meaningful if it is good with respect to a loss function such as the K-me…

    Submitted 1 February, 2023; v1 submitted 30 July, 2021; originally announced July 2021.

  11. arXiv:2107.10970  [pdf, other]

    stat.ML cs.LG

    The decomposition of the higher-order homology embedding constructed from the $k$-Laplacian

    Authors: Yu-Chia Chen, Marina Meilă

    Abstract: The null space of the $k$-th order Laplacian $\mathbf{\mathcal L}_k$, known as the {\em $k$-th homology vector space}, encodes the non-trivial topology of a manifold or a network. Understanding the structure of the homology embedding can thus disclose geometric or topological information from the data. The study of the null space embedding of the graph Laplacian $\mathbf{\mathcal L}_0$ has spurred…

    Submitted 2 August, 2021; v1 submitted 22 July, 2021; originally announced July 2021.

  12. arXiv:2104.10347  [pdf, ps, other]

    stat.ML cs.LG

    A class of network models recoverable by spectral clustering

    Authors: Yali Wan, Marina Meila

    Abstract: Finding communities in networks is a problem that remains difficult, in spite of the amount of attention it has recently received. The Stochastic Block-Model (SBM) is a generative model for graphs with "communities" for which, because of its simplicity, the theoretical understanding has advanced fast in recent years. In particular, there have been various results showing that simple versions of sp…

    Submitted 21 April, 2021; originally announced April 2021.

    Comments: 15 pages

  13. arXiv:2103.07626  [pdf, other]

    stat.ML cs.LG

    Helmholtzian Eigenmap: Topological feature discovery & edge flow learning from point cloud data

    Authors: Yu-Chia Chen, Weicheng Wu, Marina Meilă, Ioannis G. Kevrekidis

    Abstract: The manifold Helmholtzian (1-Laplacian) operator $Δ_1$ elegantly generalizes the Laplace-Beltrami operator to vector fields on a manifold $\mathcal M$. In this work, we propose the estimation of the manifold Helmholtzian from point cloud data by a weighted 1-Laplacian $\mathcal L_1$. While higher order Laplacians have been introduced and studied, this work is the first to present a graph Helmholtz…

    Submitted 31 October, 2023; v1 submitted 13 March, 2021; originally announced March 2021.

  14. arXiv:2006.10274  [pdf, ps, other]

    stat.ML cs.LG

    Guarantees for Hierarchical Clustering by the Sublevel Set method

    Authors: Marina Meila

    Abstract: Meila (2018) introduces an optimization based method called the Sublevel Set method, to guarantee that a clustering is nearly optimal and "approximately correct" without relying on any assumptions about the distribution that generated the data. This paper extends the Sublevel Set method to the cost-based hierarchical clustering paradigm proposed by Dasgupta (2016).

    Submitted 5 July, 2020; v1 submitted 18 June, 2020; originally announced June 2020.

    Comments: 7 pages

  15. arXiv:1907.01651  [pdf, other]

    stat.ML cs.LG

    Selecting the independent coordinates of manifolds with large aspect ratios

    Authors: Yu-Chia Chen, Marina Meilă

    Abstract: Many manifold embedding algorithms fail apparently when the data manifold has a large aspect ratio (such as a long, thin strip). Here, we formulate success and failure in terms of finding a smooth embedding, showing also that the problem is pervasive and more complex than previously recognized. Mathematically, success is possible under very broad conditions, provided that embedding is done by care…

    Submitted 2 August, 2021; v1 submitted 2 July, 2019; originally announced July 2019.

  16. arXiv:1901.09661  [pdf, other]

    cs.SI stat.ME stat.ML

    Measuring the Robustness of Graph Properties

    Authors: Yali Wan, Marina Meila

    Abstract: In this paper, we propose a perturbation framework to measure the robustness of graph properties. Although there are already perturbation methods proposed to tackle this problem, they are limited by the fact that the strength of the perturbation cannot be well controlled. We first provide a perturbation framework on graphs by introducing weights on the nodes, of which the magnitude of perturbati…

    Submitted 3 December, 2018; originally announced January 2019.

  17. arXiv:1811.11891  [pdf, other]

    stat.ML cs.LG

    Manifold Coordinates with Physical Meaning

    Authors: Samson Koelle, Hanyu Zhang, Marina Meila, Yu-Chia Chen

    Abstract: Manifold embedding algorithms map high-dimensional data down to coordinates in a much lower-dimensional space. One of the aims of dimension reduction is to find intrinsic coordinates that describe the data manifold. The coordinates returned by the embedding algorithm are abstract, and finding their physical or domain-related meaning is not formalized and often left to domain experts. This paper st…

    Submitted 29 July, 2021; v1 submitted 28 November, 2018; originally announced November 2018.

    Comments: Submitted to JMLR. Improved over v2 (added appendix). Improved over v1 (revisions)

  18. arXiv:1808.00050  [pdf, ps, other]

    cs.DM math.CO

    How to sample connected $K$-partitions of a graph

    Authors: Marina Meila

    Abstract: A connected undirected graph $G=(V,E)$ is given. This paper presents an algorithm that samples (non-uniformly) a $K$-partition $U_1,\ldots,U_K$ of the graph nodes $V$, such that the subgraph induced by each $U_k$, with $k=1:K$, is connected. Moreover, the probability induced by the algorithm over the set ${\mathcal C}_K$ of all such partitions is obtained in closed form.

    Submitted 20 July, 2018; originally announced August 2018.

    Comments: 3 pages

  19. arXiv:1603.02763  [pdf, other]

    cs.LG cs.CG stat.ML

    megaman: Manifold Learning with Millions of points

    Authors: James McQueen, Marina Meila, Jacob VanderPlas, Zhongyue Zhang

    Abstract: Manifold Learning is a class of algorithms seeking a low-dimensional non-linear representation of high-dimensional data. Thus manifold learning algorithms are, at least in theory, most applicable to high-dimensional data and sample sizes to enable accurate estimation of the manifold. Despite this, most existing manifold learning implementations are not particularly scalable. Here we present a Pyth…

    Submitted 8 March, 2016; originally announced March 2016.

    Comments: 12 pages, 6 figures

  20. arXiv:1411.7582  [pdf, other]

    cs.LG

    Graph Sensitive Indices for Comparing Clusterings

    Authors: Zaeem Hussain, Marina Meila

    Abstract: This report discusses two new indices for comparing clusterings of a set of points. The motivation for looking at new ways for comparing clusterings stems from the fact that the existing clustering indices are based on set cardinality alone and do not consider the positions of data points. The new indices, namely, the Random Walk index (RWI) and Variation of Information with Neighbors (VIN), are b…

    Submitted 27 November, 2014; originally announced November 2014.

  21. arXiv:1406.0118  [pdf, other]

    stat.ML cs.LG

    Improved graph Laplacian via geometric self-consistency

    Authors: Dominique Perrault-Joncas, Marina Meila

    Abstract: We address the problem of setting the kernel bandwidth used by Manifold Learning algorithms to construct the graph Laplacian. Exploiting the connection between manifold geometry, represented by the Riemannian metric, and the Laplace-Beltrami operator, we set the bandwidth by optimizing the Laplacian's ability to preserve the geometry of the data. Experiments show that this principled approach is e…

    Submitted 31 May, 2014; originally announced June 2014.

    Comments: 12 pages

  22. arXiv:1406.0013  [pdf, other]

    stat.ML cs.LG

    Estimating Vector Fields on Manifolds and the Embedding of Directed Graphs

    Authors: Dominique Perrault-Joncas, Marina Meila

    Abstract: This paper considers the problem of embedding directed graphs in Euclidean space while retaining directional information. We model a directed graph as a finite set of observations from a diffusion on a manifold endowed with a vector field. This is the first generative model of its kind for directed graphs. We introduce a graph embedding algorithm that estimates all three features of this model: th…

    Submitted 30 May, 2014; originally announced June 2014.

    Comments: 16 pages

  23. arXiv:1301.7401  [pdf]

    cs.LG stat.ML

    An Experimental Comparison of Several Clustering and Initialization Methods

    Authors: Marina Meila, David Heckerman

    Abstract: We examine methods for clustering in high dimensions. In the first part of the paper, we perform an experimental comparison between three batch clustering algorithms: the Expectation-Maximization (EM) algorithm, a winner-take-all version of the EM algorithm reminiscent of the K-means algorithm, and model-based hierarchical agglomerative clustering. We learn naive-Bayes models with a hidden root no…

    Submitted 16 May, 2015; v1 submitted 30 January, 2013; originally announced January 2013.

    Comments: Appears in Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI1998)

    Report number: UAI-P-1998-PG-386-395

  24. arXiv:1301.3875  [pdf]

    cs.LG cs.AI stat.ML

    Tractable Bayesian Learning of Tree Belief Networks

    Authors: Marina Meila, Tommi S. Jaakkola

    Abstract: In this paper we present decomposable priors, a family of priors over structure and parameters of tree belief nets for which Bayesian learning with complete observations is tractable, in the sense that the posterior is also decomposable and can be completely determined analytically in polynomial time. This follows from two main results: First, we show that factored distributions over spanning tre…

    Submitted 16 January, 2013; originally announced January 2013.

    Comments: Appears in Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI2000)

    Report number: UAI-P-2000-PG-380-388

  25. arXiv:1207.1358  [pdf]

    cs.LG stat.ML

    Unsupervised spectral learning

    Authors: Susan Shortreed, Marina Meila

    Abstract: In spectral clustering and spectral image segmentation, the data is partitioned starting from a given matrix of pairwise similarities S. The matrix S is constructed by hand, or learned on a separate training set. In this paper we show how to achieve spectral clustering in unsupervised mode. Our algorithm starts with a set of observed pairwise features, which are possible components of an unknown, pa…

    Submitted 4 July, 2012; originally announced July 2012.

    Comments: Appears in Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI2005)

    Report number: UAI-P-2005-PG-534-541

  26. arXiv:1206.5265  [pdf]

    cs.LG cs.AI stat.ML

    Consensus ranking under the exponential model

    Authors: Marina Meila, Kapil Phadnis, Arthur Patterson, Jeff A. Bilmes

    Abstract: We analyze the generalized Mallows model, a popular exponential model over rankings. Estimating the central (or consensus) ranking from data is NP-hard. We obtain the following new results: (1) We show that search methods can estimate both the central ranking pi0 and the model parameters theta exactly. The search is n! in the worst case, but is tractable when the true distribution is concentrated…

    Submitted 20 June, 2012; originally announced June 2012.

    Comments: Appears in Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence (UAI2007)

    Report number: UAI-P-2007-PG-285-294

  27. arXiv:1206.3270  [pdf]

    cs.LG stat.ML

    Estimation and Clustering with Infinite Rankings

    Authors: Marina Meila, Le Bao

    Abstract: This paper presents a natural extension of stagewise ranking to the case of infinitely many items. We introduce the infinite generalized Mallows model (IGM), describe its properties and give procedures to estimate it from data. For estimation of multimodal distributions we introduce the Exponential-Blurring-Mean-Shift nonparametric clustering algorithm. The experiments highlight the properties…

    Submitted 13 June, 2012; originally announced June 2012.

    Comments: Appears in Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI2008)

    Report number: UAI-P-2008-PG-393-402

  28. arXiv:1203.3496  [pdf]

    cs.LG stat.ML

    Dirichlet Process Mixtures of Generalized Mallows Models

    Authors: Marina Meila, Harr Chen

    Abstract: We present a Dirichlet process mixture model over discrete incomplete rankings and study two Gibbs sampling inference techniques for estimating posterior clusterings. The first approach uses a slice sampling subcomponent for estimating cluster parameters. The second approach marginalizes out several cluster parameters by taking advantage of approximations to the conditional posteriors. We empirica…

    Submitted 15 March, 2012; originally announced March 2012.

    Comments: Appears in Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI2010)

    Report number: UAI-P-2010-PG-358-367