Bayesian Optimization

In many real-world problems, from scientific research to industrial design, we face the challenge of finding the optimal inputs to a system when our knowledge of it is limited and every evaluation is costly.

The goal is to solve the optimization problem:

$$x^{*} = \arg \max_{x \in X} f(x)$$

Where $x$ represents a vector of input parameters (e.g., hyperparameters of a machine learning model) and $f(x)$ is the objective function we want to maximize (e.g., model accuracy).

This task becomes exceptionally difficult when $f(x)$ operates as a "black box." This means we can provide an input $x$ and observe the output $y$, but we do not know the analytical form of the function itself.

Key Optimization Constraints

Traditional optimization methods often fail because of a combination of the following constraints:

  - Expensive evaluations: each query of $f(x)$ may take hours of computation or require a physical experiment, so the total number of evaluations is severely limited.
  - No analytical form or gradients: because $f(x)$ is a black box, we cannot inspect its structure or compute derivatives, which rules out gradient-based methods.
  - Noisy observations: we may only observe $y = f(x) + \epsilon$, a noise-corrupted version of the true function value.
  - Non-convexity: $f(x)$ may have many local optima, so purely local search can get stuck far from the global optimum.

The Bayesian Optimization Solution

Bayesian Optimization is a sequential, model-based strategy designed specifically for these "expensive black-box" problems. Instead of blindly guessing parameters, it intelligently selects new points to evaluate.

It works by building a probabilistic surrogate model that approximates $f(x)$. This surrogate model is cheap to evaluate and provides not only a prediction for $f(x)$ but also a measure of uncertainty about that prediction. The core idea is to use this uncertainty to guide the search, balancing exploitation (sampling where the model predicts a high value) and exploration (sampling where the model is most uncertain) to efficiently converge on the global optimum in as few evaluations as possible.

Gaussian Processes (GPs) for Surrogate Modeling

To intelligently select new points, we first need a way to represent our beliefs about the objective function $f(x)$. We use a Gaussian Process (GP) for this. A GP is a powerful tool for building a probabilistic model from data.

Part A: The Intuition First – Predicting Temperature in a Room

Imagine you want to find the hottest spot in a large room. You can only place a limited number of temperature sensors, and each measurement takes time. How do you guess the temperature at a spot where you don't have a sensor?

You'd use intuition:

  - Points close to a sensor probably have a temperature similar to that sensor's reading.
  - Points far from every sensor could be almost anything; your guess there is much less certain.

A Gaussian Process is simply a mathematical tool that formalizes exactly this intuition. It provides two crucial pieces of information for any point in the room:

  - A best guess of the temperature at that point (the predictive mean).
  - A measure of confidence in that guess (the predictive variance), which shrinks near sensors and grows far away from them.

Part B: The Engine Room – How a Gaussian Process "Thinks"

A GP uses a component called a kernel function to define this notion of "closeness" or "similarity." The kernel is a rulebook that tells the GP how correlated two points are based on how far apart they are. The most popular kernel is the Squared Exponential Kernel (also called RBF).

Let's look at its hyperparameters in simple terms:

  - The length scale $l$: how far apart two points can be while still "influencing" each other. A large $l$ means temperatures stay similar over long distances (a smooth map); a small $l$ means readings can change rapidly from spot to spot.
  - The signal variance $\sigma_f^2$: how large the temperature swings can be overall, i.e., the expected vertical range of the function.

What Is a Gaussian Process?

A Gaussian Process is not a single function but rather a distribution over functions. A formal definition states:

A Gaussian Process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

This means that for any set of input points $x_1, x_2, \dots, x_n$, the corresponding function values $f(x_1), f(x_2), \dots, f(x_n)$ are modeled as being drawn from a multivariate Gaussian distribution.

A GP is completely specified by two components:

  - A mean function $m(x)$, which encodes our prior expectation of the function's value at $x$ (often simply set to zero).
  - A covariance (kernel) function $k(x, x')$, which encodes how the function values at two inputs $x$ and $x'$ co-vary.

We write the GP prior as: $f(x) \sim GP(m(x), k(x, x'))$.

Deep Dive: The Kernel Function

The kernel function measures the similarity between points. A common choice for smooth functions is the Squared Exponential Kernel (also known as the Radial Basis Function or RBF kernel):

$$k(x_i, x_j) = \sigma_f^2 \exp\left(-\frac{1}{2l^2} ||x_i - x_j||^2\right)$$

Let's break down its components:

  - $\sigma_f^2$ is the signal variance, controlling the overall vertical scale of the function's variation.
  - $l$ is the length scale, controlling how quickly the correlation decays with distance; points much farther apart than $l$ are nearly uncorrelated.
  - $||x_i - x_j||$ is the Euclidean distance between the two input points.

The kernel's power lies in its ability to encode assumptions. If we believe our function changes rapidly, we choose a smaller length scale. If we believe it has periodic behavior, we could choose a periodic kernel.
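To make this concrete, here is a minimal NumPy sketch of the squared exponential kernel, together with a few sample functions drawn from the resulting zero-mean GP prior. The function name rbf_kernel and all settings here are illustrative, not taken from any particular library:

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, signal_variance=1.0):
    """Squared Exponential (RBF) kernel: sigma_f^2 * exp(-||xi - xj||^2 / (2 l^2))."""
    # Pairwise squared Euclidean distances between the rows of X1 and X2.
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return signal_variance * np.exp(-0.5 * sq_dists / length_scale**2)

# Draw a few functions from the GP prior f(x) ~ GP(0, k) on a 1-D grid.
X = np.linspace(0.0, 10.0, 100).reshape(-1, 1)
K = rbf_kernel(X, X) + 1e-8 * np.eye(len(X))  # small jitter for numerical stability
prior_samples = np.random.multivariate_normal(np.zeros(len(X)), K, size=3)
```

Each row of prior_samples is one plausible function under the prior; shrinking length_scale makes the draws wigglier, while increasing signal_variance stretches them vertically.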

The Posterior Update: Learning from Data

The magic happens when we combine our GP prior with observed data $D = \{(x_i, y_i)\}_{i=1}^n$ to form a posterior distribution. A key property of multivariate Gaussians is that conditioning a joint Gaussian distribution on a subset of variables results in another Gaussian distribution.

We start by defining the joint distribution between the observed data points $y$ and the predictions $f_*$ at new test points $x_*$:

$$\begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N} \left( 0, \begin{bmatrix} K + \sigma_n^2 I & K_*^T \\ K_* & K_{**} \end{bmatrix} \right)$$

Where:

  - $K = k(X, X)$ is the $n \times n$ covariance matrix between the training inputs.
  - $K_* = k(x_*, X)$ is the cross-covariance between the test points and the training inputs.
  - $K_{**} = k(x_*, x_*)$ is the covariance among the test points.
  - $\sigma_n^2$ is the observation noise variance, i.e., we observe $y_i = f(x_i) + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$.

By applying the rules of Gaussian conditioning, we can derive the posterior distribution for $f_*$ given the observations $D$. This posterior distribution gives us a predictive mean and variance for any new point $x_*$:

$$\mu_* = K_* \left(K + \sigma_n^2 I\right)^{-1} y$$

$$\Sigma_* = K_{**} - K_* \left(K + \sigma_n^2 I\right)^{-1} K_*^T$$

This ability to quantify uncertainty is what makes GPs the ideal surrogate model for Bayesian Optimization.
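As a minimal illustration of these conditioning equations, the sketch below computes the posterior mean and covariance with NumPy, reusing the rbf_kernel function from the previous sketch. A plain matrix inverse is used for readability; in practice a Cholesky factorization is preferred for stability:

```python
def gp_posterior(X_train, y_train, X_test,
                 length_scale=1.0, signal_variance=1.0, noise_variance=1e-4):
    """Posterior mean and covariance at X_test via Gaussian conditioning."""
    K = (rbf_kernel(X_train, X_train, length_scale, signal_variance)
         + noise_variance * np.eye(len(X_train)))                      # K + sigma_n^2 I
    K_s = rbf_kernel(X_test, X_train, length_scale, signal_variance)   # K_*
    K_ss = rbf_kernel(X_test, X_test, length_scale, signal_variance)   # K_**
    K_inv = np.linalg.inv(K)
    mu = K_s @ K_inv @ y_train          # posterior mean: K_* (K + sigma_n^2 I)^-1 y
    cov = K_ss - K_s @ K_inv @ K_s.T    # posterior covariance: K_** - K_* (...)^-1 K_*^T
    return mu, cov
```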

Acquisition Functions for Smart Selection

We have now built our surrogate model (the Gaussian Process), which acts as a "map" of our expensive function. This map tells us two things for any point: a prediction (mean) and our confidence in that prediction (variance). Now, we need a strategy to decide where to sample next. This strategy is called an acquisition function. Its purpose is to look at the map and suggest the single best point to investigate, balancing risk versus reward.

Part 1: The Intuition

Let's explore the core challenge using an analogy: finding the best restaurant in a new city when you only have a few nights to experiment.

The Exploration vs. Exploitation Dilemma

You face a constant choice:

  - Exploitation: go back to the best restaurant you have found so far. You are guaranteed a good meal, but you may miss an even better place.
  - Exploration: try a restaurant in a neighborhood you know nothing about. It might be a disappointment, or it might be the best meal of your trip.

The perfect strategy balances these two ideas. You want to exploit a promising area, but you also want to explore uncertain areas in case they are even better. An acquisition function is simply a mathematical rule that calculates a "desirability score" for every possible restaurant, helping you make the best choice each night.

Part 2: Mathematical Deep Dive

For a technical audience, an acquisition function $a(x)$ is a heuristic function that we optimize at each iteration to select the next query point $x_{\text{next}} = \arg \max_{x} a(x)$. This optimization is cheap because $a(x)$ depends only on the surrogate model's posterior mean $\mu(x)$ and variance $\sigma^2(x)$, not the expensive black-box function $f(x)$. Let $f(x^+)$ be the value of the best sample observed so far (the "incumbent").

1. Probability of Improvement (PI)

Core Question: What is the probability that sampling at point $x$ will result in a value greater than our current best $f(x^+)$?

Mathematical Formulation: We want to calculate $P(f(x) > f(x^+))$. Since our surrogate model assumes $f(x) \sim \mathcal{N}(\mu(x), \sigma^2(x))$, we can standardize this. Let $Z = \frac{f(x) - \mu(x)}{\sigma(x)}$, where $Z \sim \mathcal{N}(0, 1)$.

$$PI(x) = P\left(Z > \frac{f(x^+) - \mu(x)}{\sigma(x)}\right) = \Phi\left(\frac{\mu(x) - f(x^+)}{\sigma(x)}\right)$$

To control the exploitation-exploration balance, a small trade-off parameter $\xi \ge 0$ is often introduced:

$$PI(x) = \Phi\left(\frac{\mu(x) - f(x^+) - \xi}{\sigma(x)}\right)$$

where $\Phi$ is the Cumulative Distribution Function (CDF) of the standard normal distribution.

Pros and Cons:

  - Pros: conceptually simple and cheap to compute.
  - Cons: it measures only the probability of improvement, not its magnitude, so it tends to favor small, near-certain gains and can be overly exploitative unless $\xi$ is tuned carefully.
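Given the GP's posterior mean and standard deviation at a set of candidate points, PI is essentially a one-liner. The sketch below is vectorized over candidates; the small floor on sigma is a numerical safeguard we add for points with near-zero uncertainty:

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    """PI(x) = Phi((mu(x) - f(x+) - xi) / sigma(x))."""
    sigma = np.maximum(sigma, 1e-12)  # avoid division by zero where sigma ~ 0
    return norm.cdf((mu - f_best - xi) / sigma)
```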

2. Expected Improvement (EI)

Core Question: By how much do we expect a new point $x$ to improve upon our current best $f(x^+)$? This considers both the probability and magnitude of improvement.

Mathematical Formulation: First, define the improvement function $I(x) = \max(0, f(x) - f(x^+))$.

We want to calculate the expectation of this improvement: $EI(x) = E[I(x)]$. The calculation involves integrating over the posterior distribution of $f(x)$.

The final closed-form solution (after introducing the trade-off parameter $\xi$) is:

$$EI(x) = (\mu(x) - f(x^+) - \xi)\Phi(Z) + \sigma(x)\phi(Z)$$

where $Z = \frac{\mu(x) - f(x^+) - \xi}{\sigma(x)}$, $\Phi$ is the standard normal CDF, and $\phi$ is the standard normal Probability Density Function (PDF).

Pros and Cons:

  - Pros: accounts for both the probability and the expected magnitude of improvement, has a convenient closed form, and is the most widely used acquisition function in practice.
  - Cons: can still under-explore in some settings, and its behavior depends on the choice of the trade-off parameter $\xi$.
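A matching sketch of the closed-form EI, using the same conventions (and the same norm import) as the PI sketch above:

```python
def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI(x) = (mu - f_best - xi) * Phi(Z) + sigma * phi(Z)."""
    sigma = np.maximum(sigma, 1e-12)
    improvement = mu - f_best - xi
    Z = improvement / sigma
    return improvement * norm.cdf(Z) + sigma * norm.pdf(Z)
```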

3. Upper Confidence Bound (UCB)

Core Question: How can we directly balance the predicted mean (exploitation) against the uncertainty (exploration) using a tunable knob?

Mathematical Formulation: This method directly implements the principle of "optimism under uncertainty." We calculate a score based on an upper confidence bound of the function value.

$$UCB(x) = \mu(x) + \kappa\sigma(x)$$

where $\kappa \ge 0$ is a tunable parameter: larger values weight the uncertainty term more heavily (more exploration), while smaller values favor the predicted mean (more exploitation).

Pros and Cons:

  - Pros: extremely simple and cheap to compute, with a single, interpretable exploration knob $\kappa$; variants such as GP-UCB come with theoretical regret guarantees.
  - Cons: performance is sensitive to the choice of $\kappa$, which often has to be tuned or scheduled by hand.
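UCB is the simplest of the three to implement; a one-line sketch:

```python
def upper_confidence_bound(mu, sigma, kappa=2.0):
    """UCB(x) = mu(x) + kappa * sigma(x); larger kappa explores more."""
    return mu + kappa * sigma
```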

The Bayesian Optimization Loop: Step-by-Step Algorithm

We have now covered both core components: the Gaussian Process (GP), which acts as our predictive map, and the Acquisition Function, which acts as our search strategy. The Bayesian Optimization algorithm simply iterates between these two components to find the optimum.

Part 1: The Intuition

Let's return to our analogy of finding the hottest spot in a room using a limited number of sensors. The process works like this:

  1. Initial Guess: Place 2-3 sensors at random locations in the room. This gives you a very basic, low-confidence map of the temperature.
  2. Model and Decide: Look at your map (the Gaussian Process). Your strategy (the acquisition function) points to a new location and says, "Based on our current data, this spot offers the best combination of being potentially very hot and being very uncertain. We should measure here next."
  3. Measure and Update: You place a new sensor at that exact location and take a measurement (the expensive function evaluation).
  4. Repeat: You add this new measurement to your data. Now, you recalculate your entire map (the GP). The map becomes more accurate, especially around the new sensor. You then go back to step 2 to decide where to place the next sensor.

This cycle repeats. Each new sensor placement is more intelligent than the last because it leverages all previous measurements. You quickly stop exploring cold or well-understood areas and focus your efforts on the most promising, uncertain regions.

Part 2: The Detailed Algorithm

Here is the formal, iterative process used in Bayesian Optimization. Let $f$ be our expensive black-box objective function.

  1. Step 1: Initialization

    Action: Select an initial set of $n_0$ sample points, often chosen randomly or using a space-filling design. Evaluate the function at these points to create the initial dataset $D_0 = \{(x_i, y_i)\}_{i=1}^{n_0}$.

    Technical Detail: While simple random sampling can work for initialization, methods like Latin Hypercube Sampling (LHS) are often preferred. LHS provides better coverage across the entire search space, ensuring the initial Gaussian Process model has a more representative view of the function landscape from the start; the end-to-end sketch following this list uses an LHS design for exactly this reason.

  2. Step 2: Surrogate Model Fitting

    Action: For each iteration $t=1,2,...$, fit the Gaussian Process surrogate model to all currently observed data $D_{t-1}$.

    Technical Detail: This step involves more than just calculating the posterior. To ensure the GP accurately reflects the observed data's characteristics, we must first optimize the kernel's hyperparameters (e.g., length scale $l$ and signal variance $\sigma_f^2$). This is done by maximizing the marginal log-likelihood on the data $D_{t-1}$. Optimizing these hyperparameters ensures the model's assumptions about smoothness and variance best match the data collected so far.

  3. Step 3: Acquisition Function Optimization

    Action: Propose the next point to sample, $x_t$, by finding the location that maximizes the chosen acquisition function $a(x)$.

    $$x_t = \arg \max_{x \in X} a(x)$$

    Technical Detail: This is a crucial point: we are not optimizing the expensive function $f(x)$ here. We are optimizing the acquisition function $a(x)$, which is very cheap to calculate because it only depends on the GP's posterior mean and variance. Standard numerical optimization algorithms (like L-BFGS, CMA-ES, or even random search on a dense grid) can be used to solve for $x_t$ quickly.

  4. Step 4: Evaluation and Data Augmentation

    Action: Evaluate the true objective function $f$ at the new point $x_t$ to get the result $y_t = f(x_t) + \epsilon$.

    Technical Detail: Augment the dataset with the new observation: $D_t = D_{t-1} \cup \{(x_t, y_t)\}$. Then, return to Step 2 for the next iteration. The loop continues until a predefined budget of evaluations is exhausted or convergence criteria are met.
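Putting the four steps together, here is a minimal end-to-end sketch of the loop on a toy 1-D problem, reusing the rbf_kernel, gp_posterior, and expected_improvement functions sketched earlier. The objective f is purely illustrative, the acquisition function is maximized on a dense grid for simplicity, and the kernel hyperparameters are held fixed rather than re-optimized by marginal likelihood at each iteration:

```python
import numpy as np
from scipy.stats import qmc

def f(x):
    """Toy 1-D objective standing in for the expensive black box."""
    return (-np.sin(3.0 * x) - x**2 + 0.7 * x).ravel()

lower, upper = -1.0, 2.0

# Step 1: initialization with a small Latin Hypercube design.
sampler = qmc.LatinHypercube(d=1, seed=0)
X_train = qmc.scale(sampler.random(n=3), lower, upper)
y_train = f(X_train)

for t in range(15):
    # Steps 2-3: condition the GP on all data so far, then maximize the
    # acquisition function over a dense grid (cheap: it needs only the
    # posterior mean and variance, never the expensive function itself).
    X_grid = np.linspace(lower, upper, 500).reshape(-1, 1)
    mu, cov = gp_posterior(X_train, y_train, X_grid)
    sigma = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    acq = expected_improvement(mu, sigma, y_train.max())
    x_next = X_grid[np.argmax(acq)].reshape(1, -1)

    # Step 4: evaluate the expensive function and augment the dataset.
    X_train = np.vstack([X_train, x_next])
    y_train = np.concatenate([y_train, f(x_next)])

print("Best observed x:", X_train[np.argmax(y_train)].item())
print("Best observed y:", y_train.max())
```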

Conclusion: Why and When to Use Bayesian Optimization

We have now explored Bayesian Optimization from first principles. We started with a common challenge: optimizing an expensive function where every evaluation costs significant time or resources. We saw how traditional methods like grid search are too inefficient for this task. The solution, Bayesian Optimization, works by building an intelligent, probabilistic map of the objective function using a Gaussian Process. This surrogate model provides not only predictions but also a vital measure of uncertainty. We then use an acquisition function to leverage this uncertainty, creating a smart search strategy that constantly balances exploitation (investigating promising areas) with exploration (reducing uncertainty in unknown areas). The result is a powerful and sample-efficient algorithm that can find better solutions in fewer steps than most other methods, making it ideal for high-stakes problems.

Practical Applications

Bayesian Optimization is not just theoretical; it's used extensively across various fields:

  - Hyperparameter tuning: choosing learning rates, regularization strengths, and architectural settings for machine learning models, where every training run is expensive.
  - Scientific experimental design: guiding laboratory experiments in chemistry, materials science, and drug discovery, where each experiment consumes time and materials.
  - Engineering design: tuning parameters of costly simulations or physical prototypes, such as aerodynamic shapes.
  - A/B testing and robotics: selecting configurations to try when real-world trials are slow, limited, or risky.