Winners curse in Bayesian optimisation

2023/03/14

Problem

Take an objective function f(x) f(x) with GP prior and i.i.d. Gaussian noise with variance σ2 \sigma^2 , so yi=f(xi)+ϵi y_i = f(x_i) + \epsilon_i with ϵiN(0,σ2) \epsilon_i \sim \mathcal{N}(0, \sigma^2) . Assume two locations x1,x2 x_1, x_2 are sufficiently separated that their function values f(x1) f(x_1) and f(x2) f(x_2) are independent. Compute the posterior p(ϵ1ϵ2y1,y2) p(\epsilon_1-\epsilon_2|y_1, y_2) and use it to justify that the expected improvement in Bayesian optimisation suffers winners curse for a noisy objective. The winners curse says that for noisy optimisation problems, the lowest value is also likely the most noisy.

Source: Probabilistic Numerics by Henning, Osborne and Kersting, Ex. 32.2

Solution

First we need to compute p(ϵiyi) p(\epsilon_i| y_i) , which we can do using Bayes rule,

p(ϵiyi)=p(yiϵi)p(ϵi)p(yi) p(\epsilon_i| y_i) = \frac{p(y_i| \epsilon_i)p(\epsilon_i)}{p(y_i)}

we have that p(ϵi)=N(ϵi;0,σ2) p(\epsilon_i) = \mathcal{N}(\epsilon_i; 0, \sigma^2) , and that p(yiϵi)=N(yi;ϵi,σf2) p(y_i| \epsilon_i) = \mathcal{N}(y_i; \epsilon_i, \sigma_f^2) where we have assumed that the prior over f f has 0 mean, and variance σf2 \sigma_f^2 . So we have,

p(ϵiyi)N(yi;ϵi,σf2)N(ϵi;0,σ2) p(\epsilon_i| y_i) \propto \mathcal{N}(y_i; \epsilon_i, \sigma_f^2)\mathcal{N}(\epsilon_i; 0, \sigma^2)

which we can identify as a product of Gaussian PDFs. This results in another Gaussian PDF, see e.g these notes,

p(ϵiyi)=N(ϵi;μϵi,σϵi2), p(\epsilon_i| y_i) = \mathcal{N}(\epsilon_i; \mu_{\epsilon_i}, \sigma^2_{\epsilon_i}),

where

μϵi=yiσ2σ2+σf2,σϵi2=σf2σ2σ2+σf2. \begin{align} \mu_{\epsilon_i} &= \frac{y_i\sigma^2}{\sigma^2 + \sigma_f^2}, \\ \sigma^2_{\epsilon_i} &= \frac{\sigma_f^2\sigma^2}{\sigma^2 + \sigma_f^2}. \end{align}

We can see here that the variance is independent of data. Now we have computed the posterior of the noise given the data, we can compute the posterior of the difference between the noises given the data. Since the function values at the input points are independent then the posterior over the difference is just the difference of independent Gaussian random variables, in which case we simply subtract the means and sum the variances, to give

p(ϵ1ϵ2=Δy1,y2)=N(Δ;σ2σ2+σf2(y1y2),2σϵ2). p(\epsilon_1 - \epsilon_2=\Delta| y_1, y_2) = \mathcal{N}\left(\Delta; \frac{\sigma^2}{\sigma^2 + \sigma_f^2}(y_1-y_2), 2\sigma^2_{\epsilon}\right).

The key point here is that the mean of this distribution depends on the difference between the data values. So

y1>y2    E[ϵ1ϵ2]>0    E[ϵ1]>E[ϵ2]. y_1>y_2 \implies \mathbb{E}[\epsilon_1 - \epsilon_2]> 0 \implies \mathbb{E}[\epsilon_1] > \mathbb{E}[\epsilon_2].

If we assume that the values are lower that the prior mean (i.e. 0), which is to be expected, then we expect the noise for the lower point to be a larger (absolute) value. This is the winners curse described in the book.