Note

This tutorial was generated from an IPython notebook that can be downloaded here.

Hyperparameter optimization

This notebook was made with the following version of george:

import george
george.__version__
'1.0.0.dev0'

In this tutorial, we’ll reproduce the analysis for Figure 5.6 in Chapter 5 of Rasmussen & Williams (R&W). The data are measurements of the atmospheric CO2 concentration made at Mauna Loa, Hawaii (Keeling & Whorf 2004). The dataset is said to be available online but I couldn’t seem to download it from the original source. Luckily the statsmodels package includes a copy that we can load as follows:

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as pl

data = sm.datasets.get_rdataset("co2").data
t = np.array(data.time)
y = np.array(data.co2)

pl.plot(t, y, ".k")
pl.xlim(t.min(), t.max())
pl.xlabel("year")
pl.ylabel("CO$_2$ in ppm");
../../_images/hyper_4_0.png

In this figure, you can see that there is periodic (or quasi-periodic) signal with a year-long period superimposed on a long term trend. We will follow R&W and model these effects non-parametrically using a complicated covariance function. The covariance function that we’ll use is:

\[k(r) = k_1(r) + k_2(r) + k_3(r) + k_4(r)\]

where

\[\begin{split}\begin{eqnarray} k_1(r) &=& \theta_1^2 \, \exp \left(-\frac{r^2}{2\,\theta_2} \right) \\ k_2(r) &=& \theta_3^2 \, \exp \left(-\frac{r^2}{2\,\theta_4} -\theta_5\,\sin^2\left( \frac{\pi\,r}{\theta_6}\right) \right) \\ k_3(r) &=& \theta_7^2 \, \left [ 1 + \frac{r^2}{2\,\theta_8\,\theta_9} \right ]^{-\theta_8} \\ k_4(r) &=& \theta_{10}^2 \, \exp \left(-\frac{r^2}{2\,\theta_{11}} \right) + \theta_{12}^2\,\delta_{ij} \end{eqnarray}\end{split}\]

We can implement this kernel in George as follows (we’ll use the R&W results as the hyperparameters for now):

from george import kernels

k1 = 66**2 * kernels.ExpSquaredKernel(metric=67**2)
k2 = 2.4**2 * kernels.ExpSquaredKernel(90**2) * kernels.ExpSine2Kernel(gamma=2/1.3**2, period=1)
k3 = 0.66**2 * kernels.RationalQuadraticKernel(alpha=0.78, metric=1.2**2)
k4 = 0.18**2 * kernels.ExpSquaredKernel(1.6**2)
kernel = k1 + k2 + k3 + k4

Optimization

If we want to find the “best-fit” hyperparameters, we should optimize an objective function. The two standard functions (as described in Chapter 5 of R&W) are the marginalized ln-likelihood and the cross validation likelihood. George implements the former in the GP.lnlikelihood function and the gradient with respect to the hyperparameters in the GP.grad_lnlikelihood function:

import george
gp = george.GP(kernel, mean=np.mean(y), fit_mean=True,
               white_noise=np.log(0.19**2), fit_white_noise=True)
gp.compute(t)
print(gp.lnlikelihood(y))
print(gp.grad_lnlikelihood(y))
-85.5547708528
[  1.31881763e-02  -1.64386469e+01  -4.94655690e-01   3.78791021e-01
   1.08296383e+00  -1.54394027e+00  -1.14486240e+00  -3.98205196e+01
  -4.34545333e+00  -1.02722671e+00   3.23160562e+00   1.24565433e+00
  -4.89958197e-02]

We’ll use a gradient based optimization routine from SciPy to fit this model as follows:

import scipy.optimize as op

# Define the objective function (negative log-likelihood in this case).
def nll(p):
    gp.set_vector(p)
    ll = gp.lnlikelihood(y, quiet=True)
    return -ll if np.isfinite(ll) else 1e25

# And the gradient of the objective function.
def grad_nll(p):
    gp.set_vector(p)
    return -gp.grad_lnlikelihood(y, quiet=True)

# You need to compute the GP once before starting the optimization.
gp.compute(t)

# Print the initial ln-likelihood.
print(gp.lnlikelihood(y))

# Run the optimization routine.
p0 = gp.get_vector()
results = op.minimize(nll, p0, jac=grad_nll, method="L-BFGS-B")

# Update the kernel and print the final log-likelihood.
gp.set_vector(results.x)
print(gp.lnlikelihood(y))
-85.5547708528
-83.1662492912

Warning: An optimization code something like this should work on most problems but the results can be very sensitive to your choice of initialization and algorithm. If the results are nonsense, try choosing a better initial guess or try a different value of the ``method`` parameter in ``op.minimize``.

After running this optimization, we find a final ln-likelihood of -83; slightly better than the result in R&W. We can plot our prediction of the CO2 concentration into the future using our optimized Gaussian process model by running:

x = np.linspace(max(t), 2025, 2000)
mu, var = gp.predict(y, x, return_var=True)
std = np.sqrt(var)

pl.plot(t, y, ".k")
pl.fill_between(x, mu+std, mu-std, color="g", alpha=0.5)

pl.xlim(t.min(), 2025)
pl.xlabel("year")
pl.ylabel("CO$_2$ in ppm");
../../_images/hyper_12_0.png

Sampling & Marginalization

The prediction made in the previous section take into account uncertainties due to the fact that a Gaussian process is stochastic but it doesn’t take into account any uncertainties in the values of the hyperparameters. This won’t matter if the hyperparameters are very well constrained by the data but in this case, many of the parameters are actually poorly constrained. To take this effect into account, we can apply prior probability functions to the hyperparameters and marginalize using Markov chain Monte Carlo (MCMC). To do this, we’ll use the emcee package.

First, we define the probabilistic model:

def lnprob(p):
    # Trivial uniform prior.
    if np.any((-100 > p[1:]) + (p[1:] > 100)):
        return -np.inf

    # Update the kernel and compute the lnlikelihood.
    gp.set_vector(p)
    return gp.lnlikelihood(y, quiet=True)

In this function, we’ve applied a prior on every parameter that is uniform between -100 and 100 for every parameter. In real life, you should probably use something more intelligent but this will work for this problem. The quiet argument in the call to GP.lnlikelihood() means that that function will return -numpy.inf if the kernel is invalid or if there are any linear algebra errors (otherwise it would raise an exception).

Then, we run the sampler (this will probably take a while to run if you want to repeat this analysis):

import emcee

gp.compute(t)

# Set up the sampler.
nwalkers, ndim = 36, len(gp)
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob)

# Initialize the walkers.
p0 = gp.get_vector() + 1e-4 * np.random.randn(nwalkers, ndim)

print("Running burn-in")
p0, _, _ = sampler.run_mcmc(p0, 200)

print("Running production chain")
sampler.run_mcmc(p0, 200);
Running burn-in
Running production chain

After this run, you can plot 50 samples from the marginalized predictive probability distribution:

x = np.linspace(max(t), 2025, 250)
for i in range(50):
    # Choose a random walker and step.
    w = np.random.randint(sampler.chain.shape[0])
    n = np.random.randint(sampler.chain.shape[1])
    gp.set_vector(sampler.chain[w, n])

    # Plot a single sample.
    pl.plot(x, gp.sample_conditional(y, x), "g", alpha=0.1)

pl.plot(t, y, ".k")

pl.xlim(t.min(), 2025)
pl.xlabel("year")
pl.ylabel("CO$_2$ in ppm");
../../_images/hyper_18_0.png

Comparing this to the same figure in the previous section, you’ll notice that the error bars on the prediction are now substantially larger than before. This is because we are now considering all the predictions that are consistent with the data, not just the “best” prediction. In general, even though it requires much more computation, it is more conservative (and honest) to take all these sources of uncertainty into account.