Problem
Take an objective function f(x) with GP prior and i.i.d. Gaussian noise with variance σ2, so yi=f(xi)+ϵi with ϵi∼N(0,σ2). Assume two locations x1,x2 are sufficiently separated that their function values f(x1) and f(x2) are independent. Compute the posterior p(ϵ1−ϵ2∣y1,y2) and use it to justify that the expected improvement in Bayesian optimisation suffers winners curse for a noisy objective. The winners curse says that for noisy optimisation problems, the lowest value is also likely the most noisy.
Source: Probabilistic Numerics by Henning, Osborne and Kersting, Ex. 32.2
Solution
First we need to compute p(ϵi∣yi), which we can do using Bayes rule,
p(ϵi∣yi)=p(yi)p(yi∣ϵi)p(ϵi)
we have that p(ϵi)=N(ϵi;0,σ2), and that p(yi∣ϵi)=N(yi;ϵi,σf2) where we have assumed that the prior over f has 0 mean, and variance σf2. So we have,
p(ϵi∣yi)∝N(yi;ϵi,σf2)N(ϵi;0,σ2)
which we can identify as a product of Gaussian PDFs. This results in another Gaussian PDF, see e.g these notes,
p(ϵi∣yi)=N(ϵi;μϵi,σϵi2),
where
μϵiσϵi2=σ2+σf2yiσ2,=σ2+σf2σf2σ2.
We can see here that the variance is independent of data. Now we have computed the posterior of the noise given the data, we can compute the posterior of the difference between the noises given the data. Since the function values at the input points are independent then the posterior over the difference is just the difference of independent Gaussian random variables, in which case we simply subtract the means and sum the variances, to give
p(ϵ1−ϵ2=Δ∣y1,y2)=N(Δ;σ2+σf2σ2(y1−y2),2σϵ2).
The key point here is that the mean of this distribution depends on the difference between the data values. So
y1>y2⟹E[ϵ1−ϵ2]>0⟹E[ϵ1]>E[ϵ2].
If we assume that the values are lower that the prior mean (i.e. 0), which is to be expected, then we expect the noise for the lower point to be a larger (absolute) value. This is the winners curse described in the book.