Note
This tutorial was generated from an IPython notebook that can be downloaded here.
Hyperparameter optimization¶
This notebook was made with the following version of george:
import george
george.__version__
'1.0.0.dev0'
In this tutorial, we’ll reproduce the analysis for Figure 5.6 in Chapter 5 of Rasmussen & Williams (R&W). The data are measurements of the atmospheric CO2 concentration made at Mauna Loa, Hawaii (Keeling & Whorf 2004). The dataset is said to be available online but I couldn’t seem to download it from the original source. Luckily the statsmodels package includes a copy that we can load as follows:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as pl
data = sm.datasets.get_rdataset("co2").data
t = np.array(data.time)
y = np.array(data.co2)
pl.plot(t, y, ".k")
pl.xlim(t.min(), t.max())
pl.xlabel("year")
pl.ylabel("CO$_2$ in ppm");
In this figure, you can see that there is periodic (or quasi-periodic) signal with a year-long period superimposed on a long term trend. We will follow R&W and model these effects non-parametrically using a complicated covariance function. The covariance function that we’ll use is:
where
We can implement this kernel in George as follows (we’ll use the R&W results as the hyperparameters for now):
from george import kernels
k1 = 66**2 * kernels.ExpSquaredKernel(metric=67**2)
k2 = 2.4**2 * kernels.ExpSquaredKernel(90**2) * kernels.ExpSine2Kernel(gamma=2/1.3**2, period=1)
k3 = 0.66**2 * kernels.RationalQuadraticKernel(alpha=0.78, metric=1.2**2)
k4 = 0.18**2 * kernels.ExpSquaredKernel(1.6**2)
kernel = k1 + k2 + k3 + k4
Optimization¶
If we want to find the “best-fit” hyperparameters, we should optimize
an objective function. The two standard functions (as described in
Chapter 5 of R&W) are the marginalized ln-likelihood and the cross
validation likelihood. George implements the former in the
GP.lnlikelihood
function and the gradient with respect to the
hyperparameters in the GP.grad_lnlikelihood
function:
import george
gp = george.GP(kernel, mean=np.mean(y), fit_mean=True,
white_noise=np.log(0.19**2), fit_white_noise=True)
gp.compute(t)
print(gp.lnlikelihood(y))
print(gp.grad_lnlikelihood(y))
-85.5547708528
[ 1.31881763e-02 -1.64386469e+01 -4.94655690e-01 3.78791021e-01
1.08296383e+00 -1.54394027e+00 -1.14486240e+00 -3.98205196e+01
-4.34545333e+00 -1.02722671e+00 3.23160562e+00 1.24565433e+00
-4.89958197e-02]
We’ll use a gradient based optimization routine from SciPy to fit this model as follows:
import scipy.optimize as op
# Define the objective function (negative log-likelihood in this case).
def nll(p):
gp.set_vector(p)
ll = gp.lnlikelihood(y, quiet=True)
return -ll if np.isfinite(ll) else 1e25
# And the gradient of the objective function.
def grad_nll(p):
gp.set_vector(p)
return -gp.grad_lnlikelihood(y, quiet=True)
# You need to compute the GP once before starting the optimization.
gp.compute(t)
# Print the initial ln-likelihood.
print(gp.lnlikelihood(y))
# Run the optimization routine.
p0 = gp.get_vector()
results = op.minimize(nll, p0, jac=grad_nll, method="L-BFGS-B")
# Update the kernel and print the final log-likelihood.
gp.set_vector(results.x)
print(gp.lnlikelihood(y))
-85.5547708528
-83.1662492912
Warning: An optimization code something like this should work on most problems but the results can be very sensitive to your choice of initialization and algorithm. If the results are nonsense, try choosing a better initial guess or try a different value of the ``method`` parameter in ``op.minimize``.
After running this optimization, we find a final ln-likelihood of -83; slightly better than the result in R&W. We can plot our prediction of the CO2 concentration into the future using our optimized Gaussian process model by running:
x = np.linspace(max(t), 2025, 2000)
mu, var = gp.predict(y, x, return_var=True)
std = np.sqrt(var)
pl.plot(t, y, ".k")
pl.fill_between(x, mu+std, mu-std, color="g", alpha=0.5)
pl.xlim(t.min(), 2025)
pl.xlabel("year")
pl.ylabel("CO$_2$ in ppm");
Sampling & Marginalization¶
The prediction made in the previous section take into account uncertainties due to the fact that a Gaussian process is stochastic but it doesn’t take into account any uncertainties in the values of the hyperparameters. This won’t matter if the hyperparameters are very well constrained by the data but in this case, many of the parameters are actually poorly constrained. To take this effect into account, we can apply prior probability functions to the hyperparameters and marginalize using Markov chain Monte Carlo (MCMC). To do this, we’ll use the emcee package.
First, we define the probabilistic model:
def lnprob(p):
# Trivial uniform prior.
if np.any((-100 > p[1:]) + (p[1:] > 100)):
return -np.inf
# Update the kernel and compute the lnlikelihood.
gp.set_vector(p)
return gp.lnlikelihood(y, quiet=True)
In this function, we’ve applied a prior on every parameter that is
uniform between -100 and 100 for every parameter. In real life, you
should probably use something more intelligent but this will work for
this problem. The quiet argument in the call to GP.lnlikelihood()
means that that function will return -numpy.inf
if the kernel is
invalid or if there are any linear algebra errors (otherwise it would
raise an exception).
Then, we run the sampler (this will probably take a while to run if you want to repeat this analysis):
import emcee
gp.compute(t)
# Set up the sampler.
nwalkers, ndim = 36, len(gp)
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob)
# Initialize the walkers.
p0 = gp.get_vector() + 1e-4 * np.random.randn(nwalkers, ndim)
print("Running burn-in")
p0, _, _ = sampler.run_mcmc(p0, 200)
print("Running production chain")
sampler.run_mcmc(p0, 200);
Running burn-in
Running production chain
After this run, you can plot 50 samples from the marginalized predictive probability distribution:
x = np.linspace(max(t), 2025, 250)
for i in range(50):
# Choose a random walker and step.
w = np.random.randint(sampler.chain.shape[0])
n = np.random.randint(sampler.chain.shape[1])
gp.set_vector(sampler.chain[w, n])
# Plot a single sample.
pl.plot(x, gp.sample_conditional(y, x), "g", alpha=0.1)
pl.plot(t, y, ".k")
pl.xlim(t.min(), 2025)
pl.xlabel("year")
pl.ylabel("CO$_2$ in ppm");
Comparing this to the same figure in the previous section, you’ll notice that the error bars on the prediction are now substantially larger than before. This is because we are now considering all the predictions that are consistent with the data, not just the “best” prediction. In general, even though it requires much more computation, it is more conservative (and honest) to take all these sources of uncertainty into account.