Multilevel Sampling in Algebraic Statistics

Authors: Nathan Kirk, Ivan Gvozdanović, Sonja Petrović

Abstract: This paper proposes a multilevel sampling algorithm for fiber sampling problems in algebraic statistics, inspired by Henry Wynn's suggestion to adapt multilevel Monte Carlo (MLMC) ideas to discrete models. Focusing on log-linear models, we sample from high-dimensional lattice fibers defined by algebraic constraints. Building on Markov basis methods and results from Diaconis and Sturmfels, our algorithm uses variable step sizes to accelerate exploration and reduce the need for long burn-in. We introduce a novel Fiber Coverage Score (FCS) based on Voronoi partitioning to assess sample quality, and highlight the utility of the Maximum Mean Discrepancy (MMD) quality metric. Simulations on benchmark fibers show that multilevel sampling outperforms naive MCMC approaches. Our results demonstrate that multilevel methods, when properly applied, provide practical benefits for discrete sampling in algebraic statistics.

Submitted 6 May, 2025; originally announced May 2025.

Comments: 21 pages, 7 figures

MSC Class: 62R01 (Primary) 62-08; 52B20 (Secondary)

arXiv:2405.13950 [pdf, other]

Learning to sample fibers for goodness-of-fit testing

Authors: Ivan Gvozdanović, Sonja Petrović

Abstract: We consider the problem of constructing exact goodness-of-fit tests for discrete exponential family models. This classical problem remains practically unsolved for many types of structured or sparse data, as it rests on a computationally difficult core task: to produce a reliable sample from lattice points in a high-dimensional polytope. We translate the problem into a Markov decision process and demonstrate a reinforcement learning approach for learning `good moves' for sampling. We illustrate the approach on data sets and models for which traditional MCMC samplers converge too slowly due to problem size, sparsity structure, and the requirement to use prohibitive non-linear algebra computations in the process. The differentiating factor is the use of scalable tools from \emph{linear} algebra in the context of theoretical guarantees provided by \emph{non-linear} algebra. Our algorithm is based on an actor-critic sampling scheme, with provable convergence. The discovered moves can be used to efficiently obtain an exchangeable sample, significantly cutting computational times with regards to statistical testing.

Submitted 15 April, 2025; v1 submitted 22 May, 2024; originally announced May 2024.

MSC Class: 62R01

arXiv:2307.02428 [pdf, other]

doi 10.2140/astat.2024.15.61

Sampling lattice points in a polytope: a Bayesian biased algorithm with random updates

Authors: Miles Bakenhus, Sonja Petrović

Abstract: The set of nonnegative integer lattice points in a polytope, also known as the fiber of a linear map, makes an appearance in several applications including optimization and statistics. We address the problem of sampling from this set using three ingredients: an easy-to-compute lattice basis of the constraint matrix, a biased sampling algorithm with a Bayesian framework, and a step-wise selection method. The bias embedded in our algorithm updates sampler parameters to improve fiber discovery rate at each step chosen from previously discovered elements. We showcase the performance of the algorithm on several examples, including fibers that are out of reach for the state-of-the-art Markov bases samplers.

Submitted 5 July, 2023; originally announced July 2023.

Comments: 22 pages, 12 figures

MSC Class: 62R01 (Primary) 62-08; 52B20 (Secondary)

Journal ref: Alg. Stat. 15 (2024) 61-83

arXiv:2306.06270 [pdf, other]

Markov bases: a 25 year update

Authors: Félix Almendra-Hernández, Jesús A. De Loera, Sonja Petrović

Abstract: In this paper, we evaluate the challenges and best practices associated with the Markov bases approach to sampling from conditional distributions. We provide insights and clarifications after 25 years of the publication of the fundamental theorem for Markov bases by Diaconis and Sturmfels. In addition to a literature review we prove three new results on the complexity of Markov bases in hierarchical models, relaxations of the fibers in log-linear models, and limitations of partial sets of moves in providing an irreducible Markov chain.

Submitted 9 January, 2024; v1 submitted 9 June, 2023; originally announced June 2023.

Comments: 24 pages, 3 figures

MSC Class: 62R01; 62-08; 62P10; 62H17

arXiv:2108.05555 [pdf, other]

doi 10.1111/sjos.12630

Longitudinal Network Models and Permutation-Uniform Markov Chains

Authors: William K. Schwartz, Sonja Petrović, Hemanshu Kaul

Abstract: Consider longitudinal networks whose edges turn on and off according to a discrete-time Markov chain with exponential-family transition probabilities. We characterize when their joint distributions are also exponential families with the same parameter, improving data reduction. Further we show that the permutation-uniform subclass of these chains permits interpretation as an independent, identically distributed sequence on the same state space. We then apply these ideas to temporal exponential random graph models, for which permutation uniformity is well suited, and discuss mean-parameter convergence, dyadic independence, and exchangeability. Our framework facilitates our introducing a new network model; simplifies analysis of some network and autoregressive models from the literature, including by permitting closed-form expressions for maximum likelihood estimates for some models; and facilitates applying standard tools to longitudinal-network Markov chains from either asymptotics or single-observation exponential random graph models.

Submitted 10 March, 2024; v1 submitted 12 August, 2021; originally announced August 2021.

Comments: 22 pages plus references and appendices. This is the accepted version of the final published article

MSC Class: 60J10 (Primary); 05C80; 60B20; 62M05; 62M02; 62B05; 62F10; 60F99; 60G50; 62R01 (Secondary)

Journal ref: Scandinavian Journal of Statistics 50.3 (September 2023) 1201-1231

arXiv:2106.03676 [pdf, other]

doi 10.2140/involve.2023.16.227

Learning a performance metric of Buchberger's algorithm

Authors: Jelena Mojsilović, Dylan Peifer, Sonja Petrović

Abstract: What can be (machine) learned about the complexity of Buchberger's algorithm? Given a system of polynomials, Buchberger's algorithm computes a Gröbner basis of the ideal these polynomials generate using an iterative procedure based on multivariate long division. The runtime of each step of the algorithm is typically dominated by a series of polynomial additions, and the total number of these additions is a hardware independent performance metric that is often used to evaluate and optimize various implementation choices. In this work we attempt to predict, using just the starting input, the number of polynomial additions that take place during one run of Buchberger's algorithm. Good predictions are useful for quickly estimating difficulty and understanding what features make Gröbner basis computation hard. Our features and methods could also be used for value models in the reinforcement learning approach to optimize Buchberger's algorithm introduced in [Peifer, Stillman, and Halpern-Leistner, 2020]. We show that a multiple linear regression model built from a set of easy-to-compute ideal generator statistics can predict the number of polynomial additions somewhat well, better than an uninformed model, and better than regression models built on some intuitive commutative algebra invariants that are more difficult to compute. We also train a simple recursive neural network that outperforms these linear models. Our work serves as a proof of concept, demonstrating that predicting the number of polynomial additions in Buchberger's algorithm is a feasible problem from the point of view of machine learning.

Submitted 31 May, 2022; v1 submitted 7 June, 2021; originally announced June 2021.

Journal ref: Involve 16 (2023) 227-248

arXiv:2104.03167 [pdf, other]

Goodness of fit for log-linear ERGMs

Authors: Elizabeth Gross, Sonja Petrović, Despina Stasi

Abstract: Many popular models from the networks literature can be viewed through a common lens of contingency tables on network dyads, resulting in \emph{log-linear ERGMs}: exponential family models for random graphs whose sufficient statistics are linear on the dyads. We propose a new model in this family, the \emph{$p_1$-SBM}, which combines node and group effects common in network formation mechanisms. In particular, it is a generalization of several well-known ERGMs including the stochastic blockmodel for undirected graphs with known block assignment, the degree-corrected version of it, and the directed $p_1$ model without group structure. We frame the problem of testing model fit for the log-linear ERGM class through an exact conditional test whose $p$-value can be approximated efficiently in networks of both small and moderately large sizes. The sampling methods we build rely on a dynamic adaptation of Markov bases. We use quick estimation algorithms adapted from the contingency table literature and effective sampling methods rooted in graph theory and algebraic statistics. The performance and scalability of the method is demonstrated on two data sets from biology: the connectome of \emph{C. elegans} and the interactome of \emph{Arabidopsis thaliana}. These two networks -- a network and a protein-protein interaction network -- have been popular examples in the network science literature. Our work provides a model-based approach to studying them.

Submitted 3 March, 2024; v1 submitted 7 April, 2021; originally announced April 2021.

Comments: Link to supplementary code provided

MSC Class: 62R01; 62P10; 62-08; 62H17

arXiv:1910.01692 [pdf, other]

Algebraic statistics, tables, and networks: The Fienberg advantage

Authors: Elizabeth Gross, Vishesh Karwa, Sonja Petrović

Abstract: Stephen Fienberg's affinity for contingency table problems and reinterpreting models with a fresh look gave rise to a new approach for hypothesis testing of network models that are linear exponential families. We outline his vision and influence in this fundamental problem, as well as generalizations to multigraphs and hypergraphs.

Submitted 3 October, 2019; originally announced October 2019.

arXiv:1907.07320 [pdf, other]

What is... a Markov basis?

Authors: Sonja Petrović

Abstract: This short piece defines a Markov basis. The aim is to introduce the statistical concept to mathematicians.

Submitted 15 July, 2019; originally announced July 2019.

Comments: AMS Notices piece

arXiv:1612.06040 [pdf, other]

Monte Carlo goodness-of-fit tests for degree corrected and related stochastic blockmodels

Authors: Vishesh Karwa, Debdeep Pati, Sonja Petrović, Liam Solus, Nikita Alexeev, Mateja Raič, Dane Wilburne, Robert Williams, Bowei Yan

Abstract: We construct Bayesian and frequentist finite-sample goodness-of-fit tests for three different variants of the stochastic blockmodel for network data. Since all of the stochastic blockmodel variants are log-linear in form when block assignments are known, the tests for the \emph{latent} block model versions combine a block membership estimator with the algebraic statistics machinery for testing goodness-of-fit in log-linear models. We describe Markov bases and marginal polytopes of the variants of the stochastic blockmodel, and discuss how both facilitate the development of goodness-of-fit tests and understanding of model behavior. The general testing methodology developed here extends to any finite mixture of log-linear models on discrete data, and as such is the first application of the algebraic statistics machinery for latent-variable models.

Submitted 6 March, 2024; v1 submitted 18 December, 2016; originally announced December 2016.

Comments: substantial revision from v3, updated simulations and theoretical discussions

MSC Class: 62R01; 05C82

Journal ref: Journal of the Royal Statistical Society Series B: Statistical Methodology, Volume 86, Issue 1, February 2024, Pages 90-121

arXiv:1612.03054 [pdf, other]

DERGMs: Degeneracy-restricted exponential random graph models

Authors: Vishesh Karwa, Sonja Petrović, Denis Bajić

Abstract: Exponential random graph models, or ERGMs, are a flexible and general class of models for modeling dependent data. While the early literature has shown them to be powerful in capturing many network features of interest, recent work highlights difficulties related to the models' ill behavior, such as most of the probability mass being concentrated on a very small subset of the parameter space. This behavior limits both the applicability of an ERGM as a model for real data and inference and parameter estimation via the usual Markov chain Monte Carlo algorithms. To address this problem, we propose a new exponential family of models for random graphs that build on the standard ERGM framework. Specifically, we solve the problem of computational intractability and `degenerate' model behavior by an interpretable support restriction. We introduce a new parameter based on the graph-theoretic notion of degeneracy, a measure of sparsity whose value is commonly low in real-world networks. The new model family is supported on the sample space of graphs with bounded degeneracy and is called degeneracy-restricted ERGMs, or DERGMs for short. Since DERGMs generalize ERGMs -- the latter is obtained from the former by setting the degeneracy parameter to be maximal -- they inherit good theoretical properties, while at the same time place their mass more uniformly over realistic graphs. The support restriction allows the use of new (and fast) Monte Carlo methods for inference, thus making the models scalable and computationally tractable. We study various theoretical properties of DERGMs and illustrate how the support restriction improves the model behavior. We also present a fast Monte Carlo algorithm for parameter estimation that avoids many issues faced by Markov Chain Monte Carlo algorithms used for inference in ERGMs.

Submitted 7 January, 2022; v1 submitted 9 December, 2016; originally announced December 2016.

Comments: Version 3

arXiv:1608.06667 [pdf, other]

Coauthorship and citation networks for statisticians: Comment

Authors: Vishesh Karwa, Sonja Petrović

Abstract: This is a comment on the paper arXiv:1410.2840 by Ji and Jin, to appear in the AOAS.

Submitted 23 August, 2016; originally announced August 2016.

arXiv:1510.02838 [pdf, other]

A survey of discrete methods in (algebraic) statistics for networks

Authors: Sonja Petrović

Abstract: Sampling algorithms, hypergraph degree sequences, and polytopes play a crucial role in statistical analysis of network data. This article offers a brief overview of open problems in this area of discrete mathematics from the point of view of a particular family of statistical models for networks called exponential random graph models. The problems and underlying constructions are also related to well-known concepts in commutative algebra and graph-theoretic concepts in computer science. We outline a few lines of recent work that highlight the natural connection between these fields and unify them into some open problems. While these problems are often relevant in discrete mathematics in their own right, the emphasis here is on statistical relevance with the hope that these lines of research do not remain disjoint. Suggested specific open problems and general research questions should advance algebraic statistics theory as well as applied statistical tools for rigorous statistical analysis of networks.

Submitted 8 January, 2016; v1 submitted 9 October, 2015; originally announced October 2015.

Comments: Revised for clarity, minor updates, added example, upon suggestions of people mentioned in the acknowledgements section

arXiv:1410.7357 [pdf, other]

Statistical models for cores decomposition of an undirected random graph

Authors: Vishesh Karwa, Michael J. Pelsmajer, Sonja Petrović, Despina Stasi, Dane Wilburne

Abstract: The $k$-core decomposition is a widely studied summary statistic that describes a graph's global connectivity structure. In this paper, we move beyond using $k$-core decomposition as a tool to summarize a graph and propose using $k$-core decomposition as a tool to model random graphs. We propose using the shell distribution vector, a way of summarizing the decomposition, as a sufficient statistic for a family of exponential random graph models. We study the properties and behavior of the model family, implement a Markov chain Monte Carlo algorithm for simulating graphs from the model, implement a direct sampler from the set of graphs with a given shell distribution, and explore the sampling distributions of some of the commonly used complementary statistics as good candidates for heuristic model fitting. These algorithms provide first fundamental steps necessary for solving the following problems: parameter estimation in this ERGM, extending the model to its Bayesian relative, and developing a rigorous methodology for testing goodness of fit of the model and model selection. The methods are applied to a synthetic network as well as the well-known Sampson monks dataset.

Submitted 28 November, 2016; v1 submitted 27 October, 2014; originally announced October 2014.

Comments: Subsection 3.1 is new: `Sample space restriction and degeneracy of real-world networks'. Several clarifying comments have been added. Discussion now mentions 2 additional specific open problems. Bibliography updated. 25 pages (including appendix), ~10 figures

arXiv:1401.4896 [pdf, other]

Goodness-of-fit for log-linear network models: Dynamic Markov bases using hypergraphs

Authors: Elizabeth Gross, Sonja Petrović, Despina Stasi

Abstract: Social networks and other large sparse data sets pose significant challenges for statistical inference, as many standard statistical methods for testing model fit are not applicable in such settings. Algebraic statistics offers a theoretically justified approach to goodness-of-fit testing that relies on the theory of Markov bases and is intimately connected with the geometry of the model as described by its fibers. Most current practices require the computation of the entire basis, which is infeasible in many practical settings. We present a dynamic approach to explore the fiber of a model, which bypasses this issue, and is based on the combinatorics of hypergraphs arising from the toric algebra structure of log-linear models. We demonstrate the approach on the Holland-Leinhardt $p_1$ model for random directed graphs that allows for reciprocated edges.

Submitted 20 January, 2014; originally announced January 2014.

arXiv:1208.6550 [pdf, ps, other]

Graphical models in Macaulay2

Authors: Luis David García-Puente, Sonja Petrović, Seth Sullivant

Abstract: The Macaulay2 package GraphicalModels contains algorithms for the algebraic study of graphical models associated to undirected, directed and mixed graphs, and associated collections of conditional independence statements. Among the algorithms implemented are procedures for computing the vanishing ideal of graphical models, for generating conditional independence ideals of families of independence statements associated to graphs, and for checking for identifiable parameters in Gaussian mixed graph models. These procedures can be used to study fundamental problems about graphical models.

Submitted 8 January, 2013; v1 submitted 31 August, 2012; originally announced August 2012.

Comments: Several changes to address referee comments and suggestions. We will eventually include this package in the standard distribution of Macaulay2. But until then, the associated Macaulay2 file can be found at http://www.shsu.edu/~ldg005/papers.html

MSC Class: 13P25 (Primary) 62-04; 14Q15; 68W30 (Secondary)

arXiv:1105.6145 [pdf, ps, other]

doi 10.1214/12-AOS1078

Maximum likelihood estimation in the $β$-model

Authors: Alessandro Rinaldo, Sonja Petrović, Stephen E. Fienberg

Abstract: We study maximum likelihood estimation for the statistical model for undirected random graphs, known as the $β$-model, in which the degree sequences are minimal sufficient statistics. We derive necessary and sufficient conditions, based on the polytope of degree sequences, for the existence of the maximum likelihood estimator (MLE) of the model parameters. We characterize in a combinatorial fashion sample points leading to a nonexistent MLE, and nonestimability of the probability parameters under a nonexistent MLE. We formulate conditions that guarantee that the MLE exists with probability tending to one as the number of nodes increases.

Submitted 18 June, 2013; v1 submitted 30 May, 2011; originally announced May 2011.

Comments: Published at http://dx.doi.org/10.1214/12-AOS1078 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS1078

Journal ref: Annals of Statistics 2013, Vol. 41, No. 3, 1085-1110

Showing 1–17 of 17 results for author: Petrović, S