-
Efficient Online Random Sampling via Randomness Recycling
Authors:
Thomas L. Draper,
Feras A. Saad
Abstract:
``Randomness recycling'' is a powerful algorithmic technique for reusing a fraction of the random information consumed by a randomized algorithm to reduce its entropy requirements. This article presents a family of efficient randomness recycling algorithms for sampling a sequence $X_1, X_2, X_3, \dots$ of discrete random variables whose joint distribution follows an arbitrary stochastic process. W…
▽ More
``Randomness recycling'' is a powerful algorithmic technique for reusing a fraction of the random information consumed by a randomized algorithm to reduce its entropy requirements. This article presents a family of efficient randomness recycling algorithms for sampling a sequence $X_1, X_2, X_3, \dots$ of discrete random variables whose joint distribution follows an arbitrary stochastic process. We develop randomness recycling strategies to reduce the entropy cost of a variety of prominent sampling algorithms, which include uniform sampling, inverse transform sampling, lookup table sampling, alias sampling, and discrete distribution generating (DDG) tree sampling. Our method achieves an expected amortized entropy cost of $H(X_1,\dots,X_k)/k + \varepsilon$ input random bits per output sample using $O(\log(1/\varepsilon))$ space, which is arbitrarily close to the optimal Shannon entropy rate. The combination of space, time, and entropy properties of our method improve upon the Han and Hoshi interval algorithm and Knuth and Yao entropy-optimal algorithm for sampling a discrete random sequence. An empirical evaluation of the algorithm shows that it achieves state-of-the-art runtime performance on the Fisher-Yates shuffle when using a cryptographically secure pseudorandom number generator. Accompanying the manuscript is a performant random sampling library in the C programming language.
△ Less
Submitted 24 May, 2025;
originally announced May 2025.
-
Efficient Rejection Sampling in the Entropy-Optimal Range
Authors:
Thomas L. Draper,
Feras A. Saad
Abstract:
The problem of generating a random variate $X$ from a finite discrete probability distribution $P$ using an entropy source of independent unbiased coin flips is considered. The Knuth and Yao complexity theory of nonuniform random number generation furnishes a family of "entropy-optimal" sampling algorithms that consume between $H(P)$ and $H(P)+2$ coin flips per generated output, where $H$ is the S…
▽ More
The problem of generating a random variate $X$ from a finite discrete probability distribution $P$ using an entropy source of independent unbiased coin flips is considered. The Knuth and Yao complexity theory of nonuniform random number generation furnishes a family of "entropy-optimal" sampling algorithms that consume between $H(P)$ and $H(P)+2$ coin flips per generated output, where $H$ is the Shannon entropy function. However, the space complexity of entropy-optimal samplers scales exponentially with the number of bits required to encode $P$. This article introduces a family of efficient rejection samplers and characterizes their entropy, space, and time complexity. Within this family is a distinguished sampling algorithm that requires linearithmic space and preprocessing time, and whose expected entropy cost always falls in the entropy-optimal range $[H(P), H(P)+2)$. No previous sampler for discrete probability distributions is known to achieve these characteristics. Numerical experiments demonstrate performance improvements in runtime and entropy of the proposed algorithm compared to the celebrated alias method.
△ Less
Submitted 5 April, 2025;
originally announced April 2025.
-
Sequential Monte Carlo Learning for Time Series Structure Discovery
Authors:
Feras A. Saad,
Brian J. Patton,
Matthew D. Hoffman,
Rif A. Saurous,
Vikash K. Mansinghka
Abstract:
This paper presents a new approach to automatically discovering accurate models of complex time series data. Working within a Bayesian nonparametric prior over a symbolic space of Gaussian process time series models, we present a novel structure learning algorithm that integrates sequential Monte Carlo (SMC) and involutive MCMC for highly effective posterior inference. Our method can be used both…
▽ More
This paper presents a new approach to automatically discovering accurate models of complex time series data. Working within a Bayesian nonparametric prior over a symbolic space of Gaussian process time series models, we present a novel structure learning algorithm that integrates sequential Monte Carlo (SMC) and involutive MCMC for highly effective posterior inference. Our method can be used both in "online" settings, where new data is incorporated sequentially in time, and in "offline" settings, by using nested subsets of historical data to anneal the posterior. Empirical measurements on real-world time series show that our method can deliver 10x--100x runtime speedups over previous MCMC and greedy-search structure learning algorithms targeting the same model family. We use our method to perform the first large-scale evaluation of Gaussian process time series structure learning on a prominent benchmark of 1,428 econometric datasets. The results show that our method discovers sensible models that deliver more accurate point forecasts and interval forecasts over multiple horizons as compared to widely used statistical and neural baselines that struggle on this challenging data.
△ Less
Submitted 13 July, 2023;
originally announced July 2023.
-
Estimators of Entropy and Information via Inference in Probabilistic Models
Authors:
Feras A. Saad,
Marco Cusumano-Towner,
Vikash K. Mansinghka
Abstract:
Estimating information-theoretic quantities such as entropy and mutual information is central to many problems in statistics and machine learning, but challenging in high dimensions. This paper presents estimators of entropy via inference (EEVI), which deliver upper and lower bounds on many information quantities for arbitrary variables in a probabilistic generative model. These estimators use imp…
▽ More
Estimating information-theoretic quantities such as entropy and mutual information is central to many problems in statistics and machine learning, but challenging in high dimensions. This paper presents estimators of entropy via inference (EEVI), which deliver upper and lower bounds on many information quantities for arbitrary variables in a probabilistic generative model. These estimators use importance sampling with proposal distribution families that include amortized variational inference and sequential Monte Carlo, which can be tailored to the target model and used to squeeze true information values with high accuracy. We present several theoretical properties of EEVI and demonstrate scalability and efficacy on two problems from the medical domain: (i) in an expert system for diagnosing liver disorders, we rank medical tests according to how informative they are about latent diseases, given a pattern of observed symptoms and patient attributes; and (ii) in a differential equation model of carbohydrate metabolism, we find optimal times to take blood glucose measurements that maximize information about a diabetic patient's insulin sensitivity, given their meal and medication schedule.
△ Less
Submitted 12 December, 2022; v1 submitted 24 February, 2022;
originally announced February 2022.
-
Hierarchical Infinite Relational Model
Authors:
Feras A. Saad,
Vikash K. Mansinghka
Abstract:
This paper describes the hierarchical infinite relational model (HIRM), a new probabilistic generative model for noisy, sparse, and heterogeneous relational data. Given a set of relations defined over a collection of domains, the model first infers multiple non-overlapping clusters of relations using a top-level Chinese restaurant process. Within each cluster of relations, a Dirichlet process mixt…
▽ More
This paper describes the hierarchical infinite relational model (HIRM), a new probabilistic generative model for noisy, sparse, and heterogeneous relational data. Given a set of relations defined over a collection of domains, the model first infers multiple non-overlapping clusters of relations using a top-level Chinese restaurant process. Within each cluster of relations, a Dirichlet process mixture is then used to partition the domain entities and model the probability distribution of relation values. The HIRM generalizes the standard infinite relational model and can be used for a variety of data analysis tasks including dependence detection, clustering, and density estimation. We present new algorithms for fully Bayesian posterior inference via Gibbs sampling. We illustrate the efficacy of the method on a density estimation benchmark of twenty object-attribute datasets with up to 18 million cells and use it to discover relational structure in real-world datasets from politics and genomics.
△ Less
Submitted 16 August, 2021;
originally announced August 2021.
-
SPPL: Probabilistic Programming with Fast Exact Symbolic Inference
Authors:
Feras A. Saad,
Martin C. Rinard,
Vikash K. Mansinghka
Abstract:
We present the Sum-Product Probabilistic Language (SPPL), a new probabilistic programming language that automatically delivers exact solutions to a broad range of probabilistic inference queries. SPPL translates probabilistic programs into sum-product expressions, a new symbolic representation and associated semantic domain that extends standard sum-product networks to support mixed-type distribut…
▽ More
We present the Sum-Product Probabilistic Language (SPPL), a new probabilistic programming language that automatically delivers exact solutions to a broad range of probabilistic inference queries. SPPL translates probabilistic programs into sum-product expressions, a new symbolic representation and associated semantic domain that extends standard sum-product networks to support mixed-type distributions, numeric transformations, logical formulas, and pointwise and set-valued constraints. We formalize SPPL via a novel translation strategy from probabilistic programs to sum-product expressions and give sound exact algorithms for conditioning on and computing probabilities of events. SPPL imposes a collection of restrictions on probabilistic programs to ensure they can be translated into sum-product expressions, which allow the system to leverage new techniques for improving the scalability of translation and inference by automatically exploiting probabilistic structure. We implement a prototype of SPPL with a modular architecture and evaluate it on benchmarks the system targets, showing that it obtains up to 3500x speedups over state-of-the-art symbolic systems on tasks such as verifying the fairness of decision tree classifiers, smoothing hidden Markov models, conditioning transformed random variables, and computing rare event probabilities.
△ Less
Submitted 11 June, 2021; v1 submitted 7 October, 2020;
originally announced October 2020.
-
The Fast Loaded Dice Roller: A Near-Optimal Exact Sampler for Discrete Probability Distributions
Authors:
Feras A. Saad,
Cameron E. Freer,
Martin C. Rinard,
Vikash K. Mansinghka
Abstract:
This paper introduces a new algorithm for the fundamental problem of generating a random integer from a discrete probability distribution using a source of independent and unbiased random coin flips. We prove that this algorithm, which we call the Fast Loaded Dice Roller (FLDR), is highly efficient in both space and time: (i) the size of the sampler is guaranteed to be linear in the number of bits…
▽ More
This paper introduces a new algorithm for the fundamental problem of generating a random integer from a discrete probability distribution using a source of independent and unbiased random coin flips. We prove that this algorithm, which we call the Fast Loaded Dice Roller (FLDR), is highly efficient in both space and time: (i) the size of the sampler is guaranteed to be linear in the number of bits needed to encode the input distribution; and (ii) the expected number of bits of entropy it consumes per sample is at most 6 bits more than the information-theoretically optimal rate. We present fast implementations of the linear-time preprocessing and near-optimal sampling algorithms using unsigned integer arithmetic. Empirical evaluations on a broad set of probability distributions establish that FLDR is 2x-10x faster in both preprocessing and sampling than multiple baseline algorithms, including the widely-used alias and interval samplers. It also uses up to 10000x less space than the information-theoretically optimal sampler, at the expense of less than 1.5x runtime overhead.
△ Less
Submitted 1 June, 2020; v1 submitted 8 March, 2020;
originally announced March 2020.
-
Optimal Approximate Sampling from Discrete Probability Distributions
Authors:
Feras A. Saad,
Cameron E. Freer,
Martin C. Rinard,
Vikash K. Mansinghka
Abstract:
This paper addresses a fundamental problem in random variate generation: given access to a random source that emits a stream of independent fair bits, what is the most accurate and entropy-efficient algorithm for sampling from a discrete probability distribution $(p_1, \dots, p_n)$, where the probabilities of the output distribution $(\hat{p}_1, \dots, \hat{p}_n)$ of the sampling algorithm must be…
▽ More
This paper addresses a fundamental problem in random variate generation: given access to a random source that emits a stream of independent fair bits, what is the most accurate and entropy-efficient algorithm for sampling from a discrete probability distribution $(p_1, \dots, p_n)$, where the probabilities of the output distribution $(\hat{p}_1, \dots, \hat{p}_n)$ of the sampling algorithm must be specified using at most $k$ bits of precision? We present a theoretical framework for formulating this problem and provide new techniques for finding sampling algorithms that are optimal both statistically (in the sense of sampling accuracy) and information-theoretically (in the sense of entropy consumption). We leverage these results to build a system that, for a broad family of measures of statistical accuracy, delivers a sampling algorithm whose expected entropy usage is minimal among those that induce the same distribution (i.e., is "entropy-optimal") and whose output distribution $(\hat{p}_1, \dots, \hat{p}_n)$ is a closest approximation to the target distribution $(p_1, \dots, p_n)$ among all entropy-optimal sampling algorithms that operate within the specified $k$-bit precision. This optimal approximate sampler is also a closer approximation than any (possibly entropy-suboptimal) sampler that consumes a bounded amount of entropy with the specified precision, a class which includes floating-point implementations of inversion sampling and related methods found in many software libraries. We evaluate the accuracy, entropy consumption, precision requirements, and wall-clock runtime of our optimal approximate sampling algorithms on a broad set of distributions, demonstrating the ways that they are superior to existing approximate samplers and establishing that they often consume significantly fewer resources than are needed by exact samplers.
△ Less
Submitted 13 January, 2020;
originally announced January 2020.
-
Bayesian Synthesis of Probabilistic Programs for Automatic Data Modeling
Authors:
Feras A. Saad,
Marco F. Cusumano-Towner,
Ulrich Schaechtle,
Martin C. Rinard,
Vikash K. Mansinghka
Abstract:
We present new techniques for automatically constructing probabilistic programs for data analysis, interpretation, and prediction. These techniques work with probabilistic domain-specific data modeling languages that capture key properties of a broad class of data generating processes, using Bayesian inference to synthesize probabilistic programs in these modeling languages given observed data. We…
▽ More
We present new techniques for automatically constructing probabilistic programs for data analysis, interpretation, and prediction. These techniques work with probabilistic domain-specific data modeling languages that capture key properties of a broad class of data generating processes, using Bayesian inference to synthesize probabilistic programs in these modeling languages given observed data. We provide a precise formulation of Bayesian synthesis for automatic data modeling that identifies sufficient conditions for the resulting synthesis procedure to be sound. We also derive a general class of synthesis algorithms for domain-specific languages specified by probabilistic context-free grammars and establish the soundness of our approach for these languages. We apply the techniques to automatically synthesize probabilistic programs for time series data and multivariate tabular data. We show how to analyze the structure of the synthesized programs to compute, for key qualitative properties of interest, the probability that the underlying data generating process exhibits each of these properties. Second, we translate probabilistic programs in the domain-specific language into probabilistic programs in Venture, a general-purpose probabilistic programming system. The translated Venture programs are then executed to obtain predictions of new time series data and new multivariate data records. Experimental results show that our techniques can accurately infer qualitative structure in multiple real-world data sets and outperform standard data analysis methods in forecasting and predicting new data.
△ Less
Submitted 14 July, 2019;
originally announced July 2019.
-
A Family of Exact Goodness-of-Fit Tests for High-Dimensional Discrete Distributions
Authors:
Feras A. Saad,
Cameron E. Freer,
Nathanael L. Ackerman,
Vikash K. Mansinghka
Abstract:
The objective of goodness-of-fit testing is to assess whether a dataset of observations is likely to have been drawn from a candidate probability distribution. This paper presents a rank-based family of goodness-of-fit tests that is specialized to discrete distributions on high-dimensional domains. The test is readily implemented using a simulation-based, linear-time procedure. The testing procedu…
▽ More
The objective of goodness-of-fit testing is to assess whether a dataset of observations is likely to have been drawn from a candidate probability distribution. This paper presents a rank-based family of goodness-of-fit tests that is specialized to discrete distributions on high-dimensional domains. The test is readily implemented using a simulation-based, linear-time procedure. The testing procedure can be customized by the practitioner using knowledge of the underlying data domain. Unlike most existing test statistics, the proposed test statistic is distribution-free and its exact (non-asymptotic) sampling distribution is known in closed form. We establish consistency of the test against all alternatives by showing that the test statistic is distributed as a discrete uniform if and only if the samples were drawn from the candidate distribution. We illustrate its efficacy for assessing the sample quality of approximate sampling algorithms over combinatorially large spaces with intractable probabilities, including random partitions in Dirichlet process mixture models and random lattices in Ising models.
△ Less
Submitted 26 February, 2019;
originally announced February 2019.
-
Temporally-Reweighted Chinese Restaurant Process Mixtures for Clustering, Imputing, and Forecasting Multivariate Time Series
Authors:
Feras A. Saad,
Vikash K. Mansinghka
Abstract:
This article proposes a Bayesian nonparametric method for forecasting, imputation, and clustering in sparsely observed, multivariate time series data. The method is appropriate for jointly modeling hundreds of time series with widely varying, non-stationary dynamics. Given a collection of $N$ time series, the Bayesian model first partitions them into independent clusters using a Chinese restaurant…
▽ More
This article proposes a Bayesian nonparametric method for forecasting, imputation, and clustering in sparsely observed, multivariate time series data. The method is appropriate for jointly modeling hundreds of time series with widely varying, non-stationary dynamics. Given a collection of $N$ time series, the Bayesian model first partitions them into independent clusters using a Chinese restaurant process prior. Within a cluster, all time series are modeled jointly using a novel "temporally-reweighted" extension of the Chinese restaurant process mixture. Markov chain Monte Carlo techniques are used to obtain samples from the posterior distribution, which are then used to form predictive inferences. We apply the technique to challenging forecasting and imputation tasks using seasonal flu data from the US Center for Disease Control and Prevention, demonstrating superior forecasting accuracy and competitive imputation accuracy as compared to multiple widely used baselines. We further show that the model discovers interpretable clusters in datasets with hundreds of time series, using macroeconomic data from the Gapminder Foundation.
△ Less
Submitted 1 April, 2018; v1 submitted 18 October, 2017;
originally announced October 2017.