It is only the third week and I have already failed to keep to my goal of posting one solution per week, not a good sign for the future… In my defense, I was busy last week attending the Probabilistic Numerics Spring School, where I learned a lot about probabilistic ODE solvers, on which this week's late question is based.
Probabilistic ODE solvers work by placing a Markovian GP prior (usually the \( q \)-times integrated Wiener process) over the solution. Given this prior, we condition on the fact that samples of this GP must solve the ODE at a given set of time steps; that is to say, the derivatives of the GP prior must match the derivative function of the ODE. We can then infer the solution to the ODE using filtering and smoothing techniques from signal processing. In this question, the task is to form an SDE for the state misalignment, i.e. the difference between the derivative of the prior and the ODE derivative function. It is interesting to note that we can also condition on other properties of the system; for example, conditioning on energy conservation in a Hamiltonian system leads to a symplectic solver, see here for details. The whole idea is fascinating and deserves a look if you have never come across it.
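To make the conditioning step concrete, here is a minimal sketch of the linearized update used in extended-Kalman-filter-based solvers, where the "measurement" is that the misalignment between the prior's derivative and the ODE vector field is zero. The function name, the logistic example, and the noise-free observation model are my own illustrative choices, not code from the book.

```python
import numpy as np

def ek1_update(m, P, f, Jf, H, H0):
    # Condition the Gaussian filtering state N(m, P) on a zero observation
    # of the misalignment z = H m - f(H0 m), linearized about the mean.
    x = H0 @ m
    z = H @ m - f(x)                  # predicted misalignment
    J = H - Jf(x) @ H0                # Jacobian of the misalignment map
    S = J @ P @ J.T                   # innovation covariance (noise-free obs.)
    K = P @ J.T @ np.linalg.inv(S)    # Kalman gain
    return m - K @ z, P - K @ S @ K.T

# Toy usage: once-integrated Wiener prior with state (x, x'), logistic ODE.
f = lambda x: x * (1.0 - x)
Jf = lambda x: np.diag(1.0 - 2.0 * x)   # Jacobian of f
H0 = np.array([[1.0, 0.0]])             # extracts x
H = np.array([[0.0, 1.0]])              # extracts x'
m, P = ek1_update(np.array([0.2, 0.0]), np.eye(2), f, Jf, H, H0)
```

Running updates like this at each time step, interleaved with the usual Gaussian prediction step for the prior SDE, is the filtering loop referred to above.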
Problem
We take our prior to be the SDE
$$ dX(t)=FX(t)dt + Ld\omega, $$with \( X \in\mathbb{R}^{D} \), \( F,L\in\mathbb{R}^{D \times D} \) and \( d\omega \in\mathbb{R}^{D} \) being an increment of the Wiener process with diffusion \( Q\in\mathbb{R}^{D \times D} \). We also have two projection matrices which allow us to obtain the parts of the SDE prior that represent the ODE state and the ODE state derivative,
$$ \mathbf x(t) = H_0X(t), \qquad \mathbf x'(t) = HX(t). $$with \( H,H_0\in\mathbb{R}^{d \times D} \) where \( d \) is our ODE dimension. The state misalignment is given by
$$ Z(t) = \phi(X(t))=HX(t) - f(H_0X(t)). $$where \( f: \mathbb{R}^d \to \mathbb{R}^d \) is our ODE derivative function and \( Z(t)\in\mathbb{R}^d \) is called the observation process, and is given by another SDE. Our task is to find an expression for this SDE and to state under which conditions it is a Gaussian process.
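To make the setup concrete, here is a small sketch (my own code, assuming one common stacking convention, so treat the details as illustrative) of the matrices \( F, L, Q, H_0, H \) for the \( q \)-times integrated Wiener process prior mentioned above.

```python
import numpy as np

def iwp_matrices(q: int, d: int, sigma: float = 1.0):
    """Matrices of the q-times integrated Wiener process prior in d dimensions.

    Convention assumed here: X = (x, x', ..., x^(q)) stacked in blocks,
    so D = d * (q + 1).
    """
    D1 = q + 1
    F1 = np.diag(np.ones(q), k=1)    # shift: d/dt x^(i) = x^(i+1)
    L1 = np.zeros((D1, D1))
    L1[-1, -1] = 1.0                 # noise enters the highest derivative
    E0 = np.eye(D1)[[0]]             # selects the x block
    E1 = np.eye(D1)[[1]]             # selects the x' block
    I = np.eye(d)
    F = np.kron(F1, I)
    L = np.kron(L1, I)
    Q = sigma**2 * np.eye(D1 * d)
    H0 = np.kron(E0, I)
    H = np.kron(E1, I)
    return F, L, Q, H0, H

F, L, Q, H0, H = iwp_matrices(q=2, d=2)
X = np.arange(6.0)                   # dummy state, just for shape checking
print(H0 @ X, H @ X)                 # -> the x block and the x' block
```

With this convention, \( H_0 \) and \( H \) are simple block selectors, and the Wiener noise enters only through the highest derivative.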
Source: Probabilistic Numerics by Hennig, Osborne and Kersting, Ex. 38.1
Solution
The Itô formula for transformations of SDEs by a scalar function \( g(X(t)) \) is given by
$$ dg = (\nabla g) dX + \frac{1}{2}\operatorname{tr}[(\nabla^\top \nabla g) dXdX^\top]. $$(Note: here I have used the gradient as a row vector.) We need to use this to compute the transformation represented by the state misalignment, which is a vector-valued function. Fortunately, the Itô formula applies to each component separately, so
$$ d\phi_i = (\nabla \phi_i) dX + \frac{1}{2}\operatorname{tr}[(\nabla^\top \nabla \phi_i) dXdX^\top]. $$with
$$ \phi_i = \sum_j H_{i,j} X_j - f_i(H_0 X). $$We now need to compute the gradient and the Hessian appearing in the formula. This took me a long time as my vector calculus is a little rusty, and I am not 100% sure I have the correct answer, but I got them to be
$$ \nabla \phi_i = H_{i,-} - \nabla f_i(H_0 X)H_0 $$where \( \nabla f_i \) is the \( i \)-th row of the Jacobian of \( f \) (the gradient of \( f_i \), as a row vector) and \( H_{i,-} \) is the \( i \)-th row of \( H \), and for the Hessian,
$$ \nabla^\top\nabla \phi_i = -H_0^\top[\nabla^\top\nabla f_i(H_0 X)]H_0 $$where \( \nabla^\top\nabla f_i \) is the Hessian of \( f \) for output \( i \). We also need that \begin{align} dXdX^\top &= (FX(t)dt + Ld\omega)(FX(t)dt + Ld\omega)^\top \\\ &=LQL^\top dt \end{align} where we use the combination rules for mixed differentials from Itô calculus (\( dt^2 = 0 \), \( dt\,d\omega = 0 \) and \( d\omega\, d\omega^\top = Q\,dt \)). Substituting all this into the formula and collecting terms, we obtain the SDEs \begin{align} d\phi_i = &\left(H_{i,-}FX(t) - \nabla f_i(H_0 X)H_0FX(t) - \frac{1}{2}\operatorname{tr}(H_0^\top[\nabla^\top\nabla f_i(H_0 X)]H_0 LQL^\top)\right)dt \\\ &+ (H_{i,-} - \nabla f_i(H_0 X)H_0)Ld\omega \end{align} which can then be simulated to give the observation process. This SDE will be a GP only when \( f \) is linear (more precisely, affine): GP representations of SDEs have drift that is affine in the state and noise that does not depend on the state (see Särkkä and Solin, Applied Stochastic Differential Equations, Sec. 6.1), so we require the gradient of \( f \) to be independent of \( X \), which is only true for an affine derivative function (and which also makes the Hessian term vanish).
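As a sanity check, we can simulate the prior SDE with Euler–Maruyama, drive the derived SDE for \( \phi \) with the same Wiener increments, and confirm the two paths agree up to discretization error. The sketch below (my own code; the logistic derivative function, step size, diffusion and initial state are all arbitrary illustrative choices) does this for a once-integrated Wiener process prior in one dimension.

```python
import numpy as np

rng = np.random.default_rng(0)

# Once-integrated Wiener process prior in d = 1: state X = (x, x'), D = 2.
F = np.array([[0.0, 1.0],
              [0.0, 0.0]])
L = np.array([[0.0, 0.0],
              [0.0, 1.0]])            # noise enters x' only
sigma = 0.5
Q = sigma**2 * np.eye(2)              # diffusion of the Wiener process
H0 = np.array([[1.0, 0.0]])           # extracts x
H = np.array([[0.0, 1.0]])            # extracts x'

# Logistic derivative function and its derivatives (scalars since d = 1).
f = lambda x: x * (1.0 - x)
df = lambda x: 1.0 - 2.0 * x
d2f = lambda x: -2.0

phi = lambda X: (H @ X - f(H0 @ X)).item()    # misalignment Z(t)

dt, n_steps = 1e-3, 5000
X = np.array([0.1, f(0.1)])           # start exactly on the ODE, so Z(0) = 0
z = phi(X)
chol = np.linalg.cholesky(Q)
max_gap = 0.0

for _ in range(n_steps):
    dw = chol @ rng.standard_normal(2) * np.sqrt(dt)  # shared Wiener increment
    x = (H0 @ X).item()
    grad_phi = H - df(x) * H0                  # gradient derived above
    hess_phi = -d2f(x) * (H0.T @ H0)           # Hessian derived above
    drift = (grad_phi @ F @ X).item() \
            + 0.5 * np.trace(hess_phi @ L @ Q @ L.T)
    z += drift * dt + (grad_phi @ L @ dw).item()  # derived SDE for phi
    X = X + (F @ X) * dt + L @ dw                 # Euler-Maruyama on the prior
    max_gap = max(max_gap, abs(z - phi(X)))

print(f"max |z - phi(X)| along the path: {max_gap:.2e}")
```

The gap should shrink as the step size does. Note that for this particular prior the trace term vanishes identically: the Hessian of \( \phi \) acts only on the \( x \)-block of the state, while the noise enters only the \( x' \)-block.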