-
Distillation of Discrete Diffusion through Dimensional Correlations
Authors:
Satoshi Hayakawa,
Yuhta Takida,
Masaaki Imaizumi,
Hiromi Wakaki,
Yuki Mitsufuji
Abstract:
Diffusion models have demonstrated exceptional performance in various fields of generative modeling, but suffer from slow sampling speed due to their iterative nature. While this issue is being addressed in continuous domains, discrete diffusion models face unique challenges, particularly in capturing dependencies between elements (e.g., pixel relationships in images, sequential dependencies in language) mainly due to the computational cost of processing high-dimensional joint distributions. In this paper, (i) we propose "mixture" models for discrete diffusion that are capable of treating dimensional correlations while remaining scalable, and (ii) we provide a set of loss functions for distilling the iterations of existing models. Two primary theoretical insights underpin our approach: First, conventional models with element-wise independence can well approximate the data distribution, but essentially require many sampling steps. Second, our loss functions enable the mixture models to distill such many-step conventional models into just a few steps by learning the dimensional correlations. Our experimental results show the effectiveness of the proposed method in distilling pretrained discrete diffusion models across image and language domains. The code used in the paper is available at https://github.com/sony/di4c.
Submitted 8 May, 2025; v1 submitted 11 October, 2024;
originally announced October 2024.
-
A Quadrature Approach for General-Purpose Batch Bayesian Optimization via Probabilistic Lifting
Authors:
Masaki Adachi,
Satoshi Hayakawa,
Martin Jørgensen,
Saad Hamid,
Harald Oberhauser,
Michael A. Osborne
Abstract:
Parallelisation in Bayesian optimisation is a common strategy but faces several challenges: the need for flexibility in acquisition functions and kernel choices, flexibility dealing with discrete and continuous variables simultaneously, model misspecification, and lastly fast massive parallelisation. To address these challenges, we introduce a versatile and modular framework for batch Bayesian optimisation via probabilistic lifting with kernel quadrature, called SOBER, which we present as a Python library based on GPyTorch/BoTorch. Our framework offers the following unique benefits: (1) Versatility in downstream tasks under a unified approach. (2) A gradient-free sampler, which does not require the gradient of acquisition functions, offering domain-agnostic sampling (e.g., discrete and mixed variables, non-Euclidean space). (3) Flexibility in domain prior distribution. (4) Adaptive batch size (autonomous determination of the optimal batch size). (5) Robustness against a misspecified reproducing kernel Hilbert space. (6) Natural stopping criterion.
Submitted 19 April, 2024; v1 submitted 18 April, 2024;
originally announced April 2024.
-
Adaptive Batch Sizes for Active Learning: A Probabilistic Numerics Approach
Authors:
Masaki Adachi,
Satoshi Hayakawa,
Martin Jørgensen,
Xingchen Wan,
Vu Nguyen,
Harald Oberhauser,
Michael A. Osborne
Abstract:
Active learning parallelization is widely used, but typically relies on fixing the batch size throughout experimentation. This fixed approach is inefficient because of a dynamic trade-off between cost and speed -- larger batches are more costly, smaller batches lead to slower wall-clock run-times -- and the trade-off may change over the run (larger batches are often preferable earlier). To address this trade-off, we propose a novel Probabilistic Numerics framework that adaptively changes batch sizes. By framing batch selection as a quadrature task, our integration-error-aware algorithm facilitates the automatic tuning of batch sizes to meet predefined quadrature precision objectives, akin to how typical optimizers terminate based on convergence thresholds. This approach obviates the necessity for exhaustive searches across all potential batch sizes. We also extend this to scenarios with constrained active learning and constrained optimization, interpreting constraint violations as reductions in the precision requirement, to subsequently adapt batch construction. Through extensive experiments, we demonstrate that our approach significantly enhances learning efficiency and flexibility in diverse Bayesian batch active learning and Bayesian optimization applications.
Submitted 21 February, 2024; v1 submitted 9 June, 2023;
originally announced June 2023.
-
SOBER: Highly Parallel Bayesian Optimization and Bayesian Quadrature over Discrete and Mixed Spaces
Authors:
Masaki Adachi,
Satoshi Hayakawa,
Saad Hamid,
Martin Jørgensen,
Harald Oberhauser,
Michael A. Osborne
Abstract:
Batch Bayesian optimisation and Bayesian quadrature have been shown to be sample-efficient methods of performing optimisation and quadrature where expensive-to-evaluate objective functions can be queried in parallel. However, current methods do not scale to large batch sizes -- a frequent desideratum in practice (e.g. drug discovery or simulation-based inference). We present a novel algorithm, SOBER, which permits scalable and diversified batch global optimisation and quadrature with arbitrary acquisition functions and kernels over discrete and mixed spaces. The key to our approach is to reformulate batch selection for global optimisation as a quadrature problem, which relaxes acquisition function maximisation (non-convex) to kernel recombination (convex). Bridging global optimisation and quadrature can efficiently solve both tasks by balancing the merits of exploitative Bayesian optimisation and explorative Bayesian quadrature. We show that SOBER outperforms 11 competitive baselines on 12 synthetic and diverse real-world tasks.
Submitted 5 July, 2023; v1 submitted 27 January, 2023;
originally announced January 2023.
-
Sampling-based Nyström Approximation and Kernel Quadrature
Authors:
Satoshi Hayakawa,
Harald Oberhauser,
Terry Lyons
Abstract:
We analyze the Nyström approximation of a positive definite kernel associated with a probability measure. We first prove an improved error bound for the conventional Nyström approximation with i.i.d. sampling and singular-value decomposition in the continuous regime; the proof techniques are borrowed from statistical learning theory. We further introduce a refined selection of subspaces in Nyström approximation with theoretical guarantees that is applicable to non-i.i.d. landmark points. Finally, we discuss their application to convex kernel quadrature and give novel theoretical guarantees as well as numerical observations.
Submitted 22 May, 2023; v1 submitted 23 January, 2023;
originally announced January 2023.
-
Hypercontractivity Meets Random Convex Hulls: Analysis of Randomized Multivariate Cubatures
Authors:
Satoshi Hayakawa,
Harald Oberhauser,
Terry Lyons
Abstract:
Given a probability measure $μ$ on a set $\mathcal{X}$ and a vector-valued function $\varphi$, a common problem is to construct a discrete probability measure on $\mathcal{X}$ such that the push-forward of these two probability measures under $\varphi$ is the same. This construction is at the heart of numerical integration methods that run under various names such as quadrature, cubature, or recombination. A natural approach is to sample points from $μ$ until the convex hull of their image under $\varphi$ includes the mean of $\varphi$. Here we analyze the computational complexity of this approach when $\varphi$ exhibits a graded structure by using so-called hypercontractivity. The resulting theorem not only covers the classical cubature case of multivariate polynomials, but also integration on pathspace, as well as kernel quadrature for product measures.
Submitted 11 October, 2022;
originally announced October 2022.
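The sampling approach described in this abstract can be illustrated in the simplest possible setting. The sketch below is not taken from the paper: it assumes a one-dimensional case where $μ$ is uniform on [0, 1] and $\varphi(x) = x$, so the "convex hull" is just the interval between the smallest and largest sampled value. The function names `sample_until_hull_contains_mean` and `two_point_rule` are hypothetical, chosen for illustration only.

```python
import random


def sample_until_hull_contains_mean(sampler, phi, mean_phi, rng, max_draws=10_000):
    """Draw points until mean_phi lies in the convex hull of the phi-values.

    In one dimension the convex hull of the phi-values is the interval
    [min, max], so the containment check is a pair of comparisons.
    """
    points = []
    for _ in range(max_draws):
        points.append(sampler(rng))
        values = [phi(x) for x in points]
        if min(values) <= mean_phi <= max(values):
            return points
    raise RuntimeError("mean not covered within max_draws")


def two_point_rule(points, phi, mean_phi):
    """Build a discrete probability measure on two sampled points
    whose phi-average equals mean_phi (assumes the hull contains it)."""
    lo = min(points, key=phi)
    hi = max(points, key=phi)
    if phi(hi) == phi(lo):
        # Degenerate case: every phi-value already equals the mean.
        return [(lo, 1.0)]
    w_hi = (mean_phi - phi(lo)) / (phi(hi) - phi(lo))
    return [(lo, 1.0 - w_hi), (hi, w_hi)]


# Assumed toy setting: mu = Uniform[0, 1], phi(x) = x, so the mean of phi is 0.5.
rng = random.Random(0)
points = sample_until_hull_contains_mean(
    lambda r: r.random(), lambda x: x, 0.5, rng
)
rule = two_point_rule(points, lambda x: x, 0.5)
```

In higher dimensions the containment check and the weight computation would typically be posed as a linear feasibility problem rather than a pair of comparisons; the one-dimensional case is used here only because it needs no solver.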
-
Fast Bayesian Inference with Batch Bayesian Quadrature via Kernel Recombination
Authors:
Masaki Adachi,
Satoshi Hayakawa,
Martin Jørgensen,
Harald Oberhauser,
Michael A. Osborne
Abstract:
Calculation of Bayesian posteriors and model evidences typically requires numerical integration. Bayesian quadrature (BQ), a surrogate-model-based approach to numerical integration, is capable of superb sample efficiency, but its lack of parallelisation has hindered its practical applications. In this work, we propose a parallelised (batch) BQ method, employing techniques from kernel quadrature, that possesses an empirically exponential convergence rate. Additionally, just as with Nested Sampling, our method permits simultaneous inference of both posteriors and model evidence. Samples from our BQ surrogate model are re-selected to give a sparse set of samples, via a kernel recombination algorithm, requiring negligible additional time to increase the batch size. Empirically, we find that our approach significantly outperforms the sampling efficiency of both state-of-the-art BQ techniques and Nested Sampling in various real-world datasets, including lithium-ion battery analytics.
Submitted 27 January, 2023; v1 submitted 9 June, 2022;
originally announced June 2022.
-
Positively Weighted Kernel Quadrature via Subsampling
Authors:
Satoshi Hayakawa,
Harald Oberhauser,
Terry Lyons
Abstract:
We study kernel quadrature rules with convex weights. Our approach combines the spectral properties of the kernel with recombination results about point measures. This results in effective algorithms that construct convex quadrature rules using only access to i.i.d. samples from the underlying measure and evaluation of the kernel and that result in a small worst-case error. In addition to our theoretical results and the benefits resulting from convex weights, our experiments indicate that this construction can compete with the optimal bounds in well-known examples.
Submitted 11 October, 2022; v1 submitted 20 July, 2021;
originally announced July 2021.
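To make the notion of worst-case error for a convex quadrature rule concrete, here is a small worked sketch. It is not the paper's algorithm: it assumes the kernel $k(x, y) = \min(x, y)$ on [0, 1] with the uniform measure, for which $\int k(x, y)\,dy = x - x^2/2$ and $\iint k = 1/3$ in closed form, and it evaluates the standard squared worst-case error $\sum_{i,j} w_i w_j k(x_i, x_j) - 2 \sum_i w_i \int k(x_i, y)\,dy + \iint k$. The helper names are hypothetical.

```python
def worst_case_error_sq(nodes, weights):
    """Squared worst-case error of a quadrature rule in the RKHS of
    k(x, y) = min(x, y) on [0, 1] with respect to the uniform measure."""
    pairs = list(zip(nodes, weights))
    # sum_{i,j} w_i w_j k(x_i, x_j)
    double_sum = sum(wi * wj * min(xi, xj) for xi, wi in pairs for xj, wj in pairs)
    # sum_i w_i * integral of k(x_i, y) dy, where the integral is x - x^2 / 2
    cross = sum(w * (x - 0.5 * x * x) for x, w in pairs)
    # The double integral of min(x, y) over the unit square is 1/3.
    return double_sum - 2.0 * cross + 1.0 / 3.0


def midpoint_rule(n):
    """Equal-weight rule on n midpoints: weights are nonnegative and sum to one."""
    nodes = [(i + 0.5) / n for i in range(n)]
    weights = [1.0 / n] * n
    return nodes, weights
```

For this particular kernel, the midpoint rule with n nodes has squared worst-case error 1/(12 n^2), which gives a simple reference value against which any other convex rule on the same kernel can be compared.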
-
Estimating the probability that a given vector is in the convex hull of a random sample
Authors:
Satoshi Hayakawa,
Terry Lyons,
Harald Oberhauser
Abstract:
For a $d$-dimensional random vector $X$, let $p_{n, X}(θ)$ be the probability that the convex hull of $n$ independent copies of $X$ contains a given point $θ$. We provide several sharp inequalities regarding $p_{n, X}(θ)$ and $N_X(θ)$ denoting the smallest $n$ for which $p_{n, X}(θ)\ge1/2$. As a main result, we derive the totally general inequality $1/2 \le α_X(θ)N_X(θ)\le 3d + 1$, where $α_X(θ)$ (a.k.a. the Tukey depth) is the minimum probability that $X$ is in a fixed closed halfspace containing the point $θ$. We also show several applications of our general results: one is a moment-based bound on $N_X(\mathbb{E}[X])$, which is an important quantity in randomized approaches to cubature construction or measure reduction problems. Another application is the determination of the canonical convex body included in a random convex polytope given by independent copies of $X$, where our combinatorial approach allows us to generalize existing results in the random matrix community significantly.
Submitted 22 March, 2021; v1 submitted 11 January, 2021;
originally announced January 2021.
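The inequality $1/2 \le α_X(θ)N_X(θ)\le 3d + 1$ can be checked by hand in one classical special case. The sketch below is an illustration and not part of the paper: it assumes $X$ is symmetric about $θ = 0$ with an absolutely continuous distribution, where Wendel's theorem gives $p_{n, X}(0) = 1 - 2^{-(n-1)} \sum_{k=0}^{d-1} \binom{n-1}{k}$ and the Tukey depth at the centre is $1/2$. The function names are hypothetical.

```python
from math import comb


def wendel_probability(n, d):
    """Probability that the convex hull of n i.i.d. points contains the origin,
    for a distribution on R^d that is symmetric about the origin and
    absolutely continuous (Wendel's formula)."""
    if n < 1:
        return 0.0
    return 1.0 - sum(comb(n - 1, k) for k in range(d)) / 2 ** (n - 1)


def smallest_n_reaching_half(d):
    """Smallest n with wendel_probability(n, d) >= 1/2, i.e. N_X(0) in this setting."""
    n = 1
    while wendel_probability(n, d) < 0.5:
        n += 1
    return n


# Under the stated symmetry assumption the Tukey depth at the origin is 1/2.
TUKEY_DEPTH_AT_CENTRE = 0.5
products = {d: TUKEY_DEPTH_AT_CENTRE * smallest_n_reaching_half(d) for d in range(1, 6)}
```

In this special case $N_X(0) = 2d$, so the product $α_X(0)N_X(0)$ equals $d$, which sits inside the interval $[1/2, 3d + 1]$ stated in the abstract.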
-
Monte Carlo construction of cubature on Wiener space
Authors:
Satoshi Hayakawa,
Ken'ichiro Tanaka
Abstract:
In this paper, we investigate the application of mathematical optimization to the construction of a cubature formula on Wiener space, which is a weak approximation method of stochastic differential equations introduced by Lyons and Victoir (Cubature on Wiener Space, Proc. R. Soc. Lond. A 460, 169--198). After giving a brief review of the cubature theory on Wiener space, we show that a cubature formula of general dimension and degree can be obtained through Monte Carlo sampling and linear programming. This paper also includes an extension of stochastic Tchakaloff's theorem, which technically yields the proof of our primary result.
Submitted 19 October, 2021; v1 submitted 18 August, 2020;
originally announced August 2020.
-
Monte Carlo Cubature Construction
Authors:
Satoshi Hayakawa
Abstract:
In numerical integration, cubature methods are effective, especially when the integrands can be well-approximated by known test functions, such as polynomials. However, the construction of cubature formulas has not generally been known, and existing examples only represent the particular domains of integrands, such as hypercubes and spheres. In this study, we show that cubature formulas can be constructed for probability measures provided that we have an i.i.d. sampler from the measure and the mean values of given test functions. Moreover, the proposed method also works as a means of data compression, even if sufficient prior information of the measure is not available.
Submitted 24 January, 2020; v1 submitted 3 January, 2020;
originally announced January 2020.
-
Convergence analysis of approximation formulas for analytic functions via duality for potential energy minimization
Authors:
Satoshi Hayakawa,
Ken'ichiro Tanaka
Abstract:
We investigate the approximation formulas that were proposed by Tanaka & Sugihara (2019), in weighted Hardy spaces, which are analytic function spaces with certain asymptotic decay. Under the criterion of minimum worst error of $n$-point approximation formulas, we demonstrate that the formulas are nearly optimal. We also obtain the upper bounds of the approximation errors that coincide with the existing heuristic bounds in asymptotic order by duality theorem for the minimization problem of potential energy.
Submitted 21 October, 2022; v1 submitted 7 June, 2019;
originally announced June 2019.
-
On the minimax optimality and superiority of deep neural network learning over sparse parameter spaces
Authors:
Satoshi Hayakawa,
Taiji Suzuki
Abstract:
Deep learning has been applied to various tasks in the field of machine learning and has shown superiority to other common procedures such as kernel methods. To provide a better theoretical understanding of the reasons for its success, we discuss the performance of deep learning and other methods on a nonparametric regression problem with Gaussian noise. Whereas existing theoretical studies of deep learning have been based mainly on mathematical theories of well-known function classes such as Hölder and Besov classes, we focus on function classes with discontinuity and sparsity, which are those naturally assumed in practice. To highlight the effectiveness of deep learning, we compare deep learning with a class of linear estimators representative of a class of shallow estimators. It is shown that the minimax risk of a linear estimator on the convex hull of a target function class does not differ from that of the original target function class. This results in the suboptimality of linear methods over a simple but non-convex function class, on which deep learning can attain nearly the minimax-optimal rate. In addition to this extreme case, we consider function classes with sparse wavelet coefficients. On these function classes, deep learning also attains the minimax rate up to log factors of the sample size, and linear methods are still suboptimal if the assumed sparsity is strong. We also point out that the parameter sharing of deep neural networks can remarkably reduce the complexity of the model in our setting.
Submitted 20 September, 2019; v1 submitted 22 May, 2019;
originally announced May 2019.