-
Information-theoretic Bayesian Optimization: Survey and Tutorial
Authors:
Eduardo C. Garrido-Merchán
Abstract:
Several scenarios require the optimization of non-convex black-box functions, that are noisy expensive to evaluate functions with unknown analytical expression, whose gradients are hence not accessible. For example, the hyper-parameter tuning problem of machine learning models. Bayesian optimization is a class of methods with state-of-the-art performance delivering a solution to this problem in re…
▽ More
Several scenarios require the optimization of non-convex black-box functions, that are noisy expensive to evaluate functions with unknown analytical expression, whose gradients are hence not accessible. For example, the hyper-parameter tuning problem of machine learning models. Bayesian optimization is a class of methods with state-of-the-art performance delivering a solution to this problem in real scenarios. It uses an iterative process that employs a probabilistic surrogate model, typically a Gaussian process, of the objective function to be optimized computing a posterior predictive distribution of the black-box function. Based on the information given by this posterior predictive distribution, Bayesian optimization includes the computation of an acquisition function that represents, for every input space point, the utility of evaluating that point in the next iteraiton if the objective of the process is to retrieve a global extremum. This paper is a survey of the information theoretical acquisition functions, whose performance typically outperforms the rest of acquisition functions. The main concepts of the field of information theory are also described in detail to make the reader aware of why information theory acquisition functions deliver great results in Bayesian optimization and how can we approximate them when they are intractable. We also cover how information theory acquisition functions can be adapted to complex optimization scenarios such as the multi-objective, constrained, non-myopic, multi-fidelity, parallel and asynchronous settings and provide further lines of research.
△ Less
Submitted 22 January, 2025;
originally announced February 2025.
-
Alpha Entropy Search for New Information-based Bayesian Optimization
Authors:
Daniel Fernández-Sánchez,
Eduardo C. Garrido-Merchán,
Daniel Hernández-Lobato
Abstract:
Bayesian optimization (BO) methods based on information theory have obtained state-of-the-art results in several tasks. These techniques heavily rely on the Kullback-Leibler (KL) divergence to compute the acquisition function. In this work, we introduce a novel information-based class of acquisition functions for BO called Alpha Entropy Search (AES). AES is based on the α-divergence, that generali…
▽ More
Bayesian optimization (BO) methods based on information theory have obtained state-of-the-art results in several tasks. These techniques heavily rely on the Kullback-Leibler (KL) divergence to compute the acquisition function. In this work, we introduce a novel information-based class of acquisition functions for BO called Alpha Entropy Search (AES). AES is based on the α-divergence, that generalizes the KL divergence. Iteratively, AES selects the next evaluation point as the one whose associated target value has the highest level of the dependency with respect to the location and associated value of the global maximum of the optimization problem. Dependency is measured in terms of the α-divergence, as an alternative to the KL divergence. Intuitively, this favors the evaluation of the objective function at the most informative points about the global maximum. The α-divergence has a free parameter α, which determines the behavior of the divergence, trading-off evaluating differences between distributions at a single mode, and evaluating differences globally. Therefore, different values of α result in different acquisition functions. AES acquisition lacks a closed-form expression. However, we propose an efficient and accurate approximation using a truncated Gaussian distribution. In practice, the value of α can be chosen by the practitioner, but here we suggest to use a combination of acquisition functions obtained by simultaneously considering a range of values of α. We provide an implementation of AES in BOTorch and we evaluate its performance in both synthetic, benchmark and real-world experiments involving the tuning of the hyper-parameters of a deep neural network. These experiments show that the performance of AES is competitive with respect to other information-based acquisition functions such as JES, MES or PES.
△ Less
Submitted 25 November, 2024;
originally announced November 2024.
-
Multi-Objective Hyperparameter Optimization in Machine Learning -- An Overview
Authors:
Florian Karl,
Tobias Pielok,
Julia Moosbauer,
Florian Pfisterer,
Stefan Coors,
Martin Binder,
Lennart Schneider,
Janek Thomas,
Jakob Richter,
Michel Lang,
Eduardo C. Garrido-Merchán,
Juergen Branke,
Bernd Bischl
Abstract:
Hyperparameter optimization constitutes a large part of typical modern machine learning workflows. This arises from the fact that machine learning methods and corresponding preprocessing steps often only yield optimal performance when hyperparameters are properly tuned. But in many applications, we are not only interested in optimizing ML pipelines solely for predictive accuracy; additional metric…
▽ More
Hyperparameter optimization constitutes a large part of typical modern machine learning workflows. This arises from the fact that machine learning methods and corresponding preprocessing steps often only yield optimal performance when hyperparameters are properly tuned. But in many applications, we are not only interested in optimizing ML pipelines solely for predictive accuracy; additional metrics or constraints must be considered when determining an optimal configuration, resulting in a multi-objective optimization problem. This is often neglected in practice, due to a lack of knowledge and readily available software implementations for multi-objective hyperparameter optimization. In this work, we introduce the reader to the basics of multi-objective hyperparameter optimization and motivate its usefulness in applied ML. Furthermore, we provide an extensive survey of existing optimization strategies, both from the domain of evolutionary algorithms and Bayesian optimization. We illustrate the utility of MOO in several specific ML applications, considering objectives such as operating conditions, prediction time, sparseness, fairness, interpretability and robustness.
△ Less
Submitted 6 June, 2024; v1 submitted 15 June, 2022;
originally announced June 2022.
-
Many Objective Bayesian Optimization
Authors:
Lucia Asencio Martín,
Eduardo C. Garrido-Merchán
Abstract:
Some real problems require the evaluation of expensive and noisy objective functions. Moreover, the analytical expression of these objective functions may be unknown. These functions are known as black-boxes, for example, estimating the generalization error of a machine learning algorithm and computing its prediction time in terms of its hyper-parameters. Multi-objective Bayesian optimization (MOB…
▽ More
Some real problems require the evaluation of expensive and noisy objective functions. Moreover, the analytical expression of these objective functions may be unknown. These functions are known as black-boxes, for example, estimating the generalization error of a machine learning algorithm and computing its prediction time in terms of its hyper-parameters. Multi-objective Bayesian optimization (MOBO) is a set of methods that has been successfully applied for the simultaneous optimization of black-boxes. Concretely, BO methods rely on a probabilistic model of the objective functions, typically a Gaussian process. This model generates a predictive distribution of the objectives. However, MOBO methods have problems when the number of objectives in a multi-objective optimization problem are 3 or more, which is the many objective setting. In particular, the BO process is more costly as more objectives are considered, computing the quality of the solution via the hyper-volume is also more costly and, most importantly, we have to evaluate every objective function, wasting expensive computational, economic or other resources. However, as more objectives are involved in the optimization problem, it is highly probable that some of them are redundant and not add information about the problem solution. A measure that represents how similar are GP predictive distributions is proposed. We also propose a many objective Bayesian optimization algorithm that uses this metric to determine whether two objectives are redundant. The algorithm stops evaluating one of them if the similarity is found, saving resources and not hurting the performance of the multi-objective BO algorithm. We show empirical evidence in a set of toy, synthetic, benchmark and real experiments that GPs predictive distributions of the effectiveness of the metric and the algorithm.
△ Less
Submitted 8 July, 2021;
originally announced July 2021.
-
A Similarity Measure of Gaussian Process Predictive Distributions
Authors:
Lucia Asencio-Martín,
Eduardo C. Garrido-Merchán
Abstract:
Some scenarios require the computation of a predictive distribution of a new value evaluated on an objective function conditioned on previous observations. We are interested on using a model that makes valid assumptions on the objective function whose values we are trying to predict. Some of these assumptions may be smoothness or stationarity. Gaussian process (GPs) are probabilistic models that c…
▽ More
Some scenarios require the computation of a predictive distribution of a new value evaluated on an objective function conditioned on previous observations. We are interested on using a model that makes valid assumptions on the objective function whose values we are trying to predict. Some of these assumptions may be smoothness or stationarity. Gaussian process (GPs) are probabilistic models that can be interpreted as flexible distributions over functions. They encode the assumptions through covariance functions, making hypotheses about new data through a predictive distribution by being fitted to old observations. We can face the case where several GPs are used to model different objective functions. GPs are non-parametric models whose complexity is cubic on the number of observations. A measure that represents how similar is one GP predictive distribution with respect to another would be useful to stop using one GP when they are modelling functions of the same input space. We are really inferring that two objective functions are correlated, so one GP is enough to model both of them by performing a transformation of the prediction of the other function in case of inverse correlation. We show empirical evidence in a set of synthetic and benchmark experiments that GPs predictive distributions can be compared and that one of them is enough to predict two correlated functions in the same input space. This similarity metric could be extremely useful used to discard objectives in Bayesian many-objective optimization.
△ Less
Submitted 20 January, 2021;
originally announced January 2021.
-
Improved Max-value Entropy Search for Multi-objective Bayesian Optimization with Constraints
Authors:
Daniel Fernández-Sánchez,
Eduardo C. Garrido-Merchán,
Daniel Hernández-Lobato
Abstract:
We present MESMOC+, an improved version of Max-value Entropy search for Multi-Objective Bayesian optimization with Constraints (MESMOC). MESMOC+ can be used to solve constrained multi-objective problems when the objectives and the constraints are expensive to evaluate. MESMOC+ works by minimizing the entropy of the solution of the optimization problem in function space, i.e., the Pareto frontier,…
▽ More
We present MESMOC+, an improved version of Max-value Entropy search for Multi-Objective Bayesian optimization with Constraints (MESMOC). MESMOC+ can be used to solve constrained multi-objective problems when the objectives and the constraints are expensive to evaluate. MESMOC+ works by minimizing the entropy of the solution of the optimization problem in function space, i.e., the Pareto frontier, to guide the search for the optimum. The cost of MESMOC+ is linear in the number of objectives and constraints. Furthermore, it is often significantly smaller than the cost of alternative methods based on minimizing the entropy of the Pareto set. The reason for this is that it is easier to approximate the required computations in MESMOC+. Moreover, MESMOC+'s acquisition function is expressed as the sum of one acquisition per each black-box (objective or constraint). Thus, it can be used in a decoupled evaluation setting in which one chooses not only the next input location to evaluate, but also which black-box to evaluate there. We compare MESMOC+ with related methods in synthetic and real optimization problems. These experiments show that the entropy estimation provided by MESMOC+ is more accurate than that of previous methods. This leads to better optimization results. MESMOC+ is also competitive with other information-based methods for constrained multi-objective Bayesian optimization, but it is significantly faster.
△ Less
Submitted 2 April, 2021; v1 submitted 2 November, 2020;
originally announced November 2020.
-
Comparing BERT against traditional machine learning text classification
Authors:
Santiago González-Carvajal,
Eduardo C. Garrido-Merchán
Abstract:
The BERT model has arisen as a popular state-of-the-art machine learning model in the recent years that is able to cope with multiple NLP tasks such as supervised text classification without human supervision. Its flexibility to cope with any type of corpus delivering great results has make this approach very popular not only in academia but also in the industry. Although, there are lots of differ…
▽ More
The BERT model has arisen as a popular state-of-the-art machine learning model in the recent years that is able to cope with multiple NLP tasks such as supervised text classification without human supervision. Its flexibility to cope with any type of corpus delivering great results has make this approach very popular not only in academia but also in the industry. Although, there are lots of different approaches that have been used throughout the years with success. In this work, we first present BERT and include a little review on classical NLP approaches. Then, we empirically test with a suite of experiments dealing different scenarios the behaviour of BERT against the traditional TF-IDF vocabulary fed to machine learning algorithms. Our purpose of this work is to add empirical evidence to support or refuse the use of BERT as a default on NLP tasks. Experiments show the superiority of BERT and its independence of features of the NLP problem such as the language of the text adding empirical evidence to use BERT as a default technique to be used in NLP problems.
△ Less
Submitted 12 January, 2021; v1 submitted 26 May, 2020;
originally announced May 2020.
-
Parallel Predictive Entropy Search for Multi-objective Bayesian Optimization with Constraints
Authors:
Eduardo C. Garrido-Merchán,
Daniel Hernández-Lobato
Abstract:
Real-world problems often involve the optimization of several objectives under multiple constraints. An example is the hyper-parameter tuning problem of machine learning algorithms. In particular, the minimization of the estimation of the generalization error of a deep neural network and at the same time the minimization of its prediction time. We may also consider as a constraint that the deep ne…
▽ More
Real-world problems often involve the optimization of several objectives under multiple constraints. An example is the hyper-parameter tuning problem of machine learning algorithms. In particular, the minimization of the estimation of the generalization error of a deep neural network and at the same time the minimization of its prediction time. We may also consider as a constraint that the deep neural network must be implemented in a chip with an area below some size. Here, both the objectives and the constraint are black boxes, i.e., functions whose analytical expressions are unknown and are expensive to evaluate. Bayesian optimization (BO) methodologies have given state-of-the-art results for the optimization of black-boxes. Nevertheless, most BO methods are sequential and evaluate the objectives and the constraints at just one input location, iteratively. Sometimes, however, we may have resources to evaluate several configurations in parallel. Notwithstanding, no parallel BO method has been proposed to deal with the optimization of multiple objectives under several constraints. If the expensive evaluations can be carried out in parallel (as when a cluster of computers is available), sequential evaluations result in a waste of resources. This article introduces PPESMOC, Parallel Predictive Entropy Search for Multi-objective Bayesian Optimization with Constraints, an information-based batch method for the simultaneous optimization of multiple expensive-to-evaluate black-box functions under the presence of several constraints. Iteratively, PPESMOC selects a batch of input locations at which to evaluate the black-boxes so as to maximally reduce the entropy of the Pareto set of the optimization problem. We present empirical evidence in the form of synthetic, benchmark and real-world experiments that illustrate the effectiveness of PPESMOC.
△ Less
Submitted 1 July, 2021; v1 submitted 1 April, 2020;
originally announced April 2020.
-
Multi-class Gaussian Process Classification with Noisy Inputs
Authors:
Carlos Villacampa-Calvo,
Bryan Zaldivar,
Eduardo C. Garrido-Merchán,
Daniel Hernández-Lobato
Abstract:
It is a common practice in the machine learning community to assume that the observed data are noise-free in the input attributes. Nevertheless, scenarios with input noise are common in real problems, as measurements are never perfectly accurate. If this input noise is not taken into account, a supervised machine learning method is expected to perform sub-optimally. In this paper, we focus on mult…
▽ More
It is a common practice in the machine learning community to assume that the observed data are noise-free in the input attributes. Nevertheless, scenarios with input noise are common in real problems, as measurements are never perfectly accurate. If this input noise is not taken into account, a supervised machine learning method is expected to perform sub-optimally. In this paper, we focus on multi-class classification problems and use Gaussian processes (GPs) as the underlying classifier. Motivated by a data set coming from the astrophysics domain, we hypothesize that the observed data may contain noise in the inputs. Therefore, we devise several multi-class GP classifiers that can account for input noise. Such classifiers can be efficiently trained using variational inference to approximate the posterior distribution of the latent variables of the model. Moreover, in some situations, the amount of noise can be known before-hand. If this is the case, it can be readily introduced in the proposed methods. This prior information is expected to lead to better performance results. We have evaluated the proposed methods by carrying out several experiments, involving synthetic and real data. These include several data sets from the UCI repository, the MNIST data set and a data set coming from astrophysics. The results obtained show that, although the classification error is similar across methods, the predictive distribution of the proposed methods is better, in terms of the test log-likelihood, than the predictive distribution of a classifier based on GPs that ignores input noise.
△ Less
Submitted 30 December, 2020; v1 submitted 28 January, 2020;
originally announced January 2020.
-
Bayesian optimization of the PC algorithm for learning Gaussian Bayesian networks
Authors:
Irene Córdoba,
Eduardo C. Garrido-Merchán,
Daniel Hernández-Lobato,
Concha Bielza,
Pedro Larrañaga
Abstract:
The PC algorithm is a popular method for learning the structure of Gaussian Bayesian networks. It carries out statistical tests to determine absent edges in the network. It is hence governed by two parameters: (i) The type of test, and (ii) its significance level. These parameters are usually set to values recommended by an expert. Nevertheless, such an approach can suffer from human bias, leading…
▽ More
The PC algorithm is a popular method for learning the structure of Gaussian Bayesian networks. It carries out statistical tests to determine absent edges in the network. It is hence governed by two parameters: (i) The type of test, and (ii) its significance level. These parameters are usually set to values recommended by an expert. Nevertheless, such an approach can suffer from human bias, leading to suboptimal reconstruction results. In this paper we consider a more principled approach for choosing these parameters in an automatic way. For this we optimize a reconstruction score evaluated on a set of different Gaussian Bayesian networks. This objective is expensive to evaluate and lacks a closed-form expression, which means that Bayesian optimization (BO) is a natural choice. BO methods use a model to guide the search and are hence able to exploit smoothness properties of the objective surface. We show that the parameters found by a BO method outperform those found by a random search strategy and the expert recommendation. Importantly, we have found that an often overlooked statistical test provides the best over-all reconstruction results.
△ Less
Submitted 28 June, 2018;
originally announced June 2018.
-
Dealing with Categorical and Integer-valued Variables in Bayesian Optimization with Gaussian Processes
Authors:
Eduardo C. Garrido-Merchán,
Daniel Hernández-Lobato
Abstract:
Bayesian Optimization (BO) methods are useful for optimizing functions that are expen- sive to evaluate, lack an analytical expression and whose evaluations can be contaminated by noise. These methods rely on a probabilistic model of the objective function, typically a Gaussian process (GP), upon which an acquisition function is built. The acquisition function guides the optimization process and m…
▽ More
Bayesian Optimization (BO) methods are useful for optimizing functions that are expen- sive to evaluate, lack an analytical expression and whose evaluations can be contaminated by noise. These methods rely on a probabilistic model of the objective function, typically a Gaussian process (GP), upon which an acquisition function is built. The acquisition function guides the optimization process and measures the expected utility of performing an evaluation of the objective at a new point. GPs assume continous input variables. When this is not the case, for example when some of the input variables take categorical or integer values, one has to introduce extra approximations. Consider a suggested input location taking values in the real line. Before doing the evaluation of the objective, a common approach is to use a one hot encoding approximation for categorical variables, or to round to the closest integer, in the case of integer-valued variables. We show that this can lead to problems in the optimization process and describe a more principled approach to account for input variables that are categorical or integer-valued. We illustrate in both synthetic and a real experiments the utility of our approach, which significantly improves the results of standard BO methods using Gaussian processes on problems with categorical or integer-valued variables.
△ Less
Submitted 22 May, 2018; v1 submitted 9 May, 2018;
originally announced May 2018.
-
Dealing with Integer-valued Variables in Bayesian Optimization with Gaussian Processes
Authors:
Eduardo C. Garrido-Merchán,
Daniel Hernández-Lobato
Abstract:
Bayesian optimization (BO) methods are useful for optimizing functions that are expensive to evaluate, lack an analytical expression and whose evaluations can be contaminated by noise. These methods rely on a probabilistic model of the objective function, typically a Gaussian process (GP), upon which an acquisition function is built. This function guides the optimization process and measures the e…
▽ More
Bayesian optimization (BO) methods are useful for optimizing functions that are expensive to evaluate, lack an analytical expression and whose evaluations can be contaminated by noise. These methods rely on a probabilistic model of the objective function, typically a Gaussian process (GP), upon which an acquisition function is built. This function guides the optimization process and measures the expected utility of performing an evaluation of the objective at a new point. GPs assume continous input variables. When this is not the case, such as when some of the input variables take integer values, one has to introduce extra approximations. A common approach is to round the suggested variable value to the closest integer before doing the evaluation of the objective. We show that this can lead to problems in the optimization process and describe a more principled approach to account for input variables that are integer-valued. We illustrate in both synthetic and a real experiments the utility of our approach, which significantly improves the results of standard BO methods on problems involving integer-valued variables.
△ Less
Submitted 13 June, 2017; v1 submitted 12 June, 2017;
originally announced June 2017.
-
Predictive Entropy Search for Multi-objective Bayesian Optimization with Constraints
Authors:
Eduardo C. Garrido-Merchán,
Daniel Hernández-Lobato
Abstract:
This work presents PESMOC, Predictive Entropy Search for Multi-objective Bayesian Optimization with Constraints, an information-based strategy for the simultaneous optimization of multiple expensive-to-evaluate black-box functions under the presence of several constraints. PESMOC can hence be used to solve a wide range of optimization problems. Iteratively, PESMOC chooses an input location on whic…
▽ More
This work presents PESMOC, Predictive Entropy Search for Multi-objective Bayesian Optimization with Constraints, an information-based strategy for the simultaneous optimization of multiple expensive-to-evaluate black-box functions under the presence of several constraints. PESMOC can hence be used to solve a wide range of optimization problems. Iteratively, PESMOC chooses an input location on which to evaluate the objective functions and the constraints so as to maximally reduce the entropy of the Pareto set of the corresponding optimization problem. The constraints considered in PESMOC are assumed to have similar properties to those of the objective functions in typical Bayesian optimization problems. That is, they do not have a known expression (which prevents gradient computation), their evaluation is considered to be very expensive, and the resulting observations may be corrupted by noise. These constraints arise in a plethora of expensive black-box optimization problems. We carry out synthetic experiments to illustrate the effectiveness of PESMOC, where we sample both the objectives and the constraints from a Gaussian process prior. The results obtained show that PESMOC is able to provide better recommendations with a smaller number of evaluations than a strategy based on random search.
△ Less
Submitted 26 September, 2016; v1 submitted 5 September, 2016;
originally announced September 2016.