Forecasting of CO2 level on Mauna Loa dataset using Gaussian process regression (GPR)#

This example is based on Section 5.4.3 of “Gaussian Processes for Machine Learning” [1]. It illustrates complex kernel engineering and hyperparameter optimization using gradient ascent on the log-marginal-likelihood. The data consists of the monthly average atmospheric CO2 concentrations (in parts per million by volume (ppm)) collected at the Mauna Loa Observatory in Hawaii between 1958 and 2001. The objective is to model the CO2 concentration as a function of the time \(t\) and to extrapolate it to the years after 2001.

References

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

Build the dataset#

We will derive a dataset from the air samples collected at the Mauna Loa Observatory. We are interested in estimating the concentration of CO2 and extrapolating it to later years. First, we load the original dataset available in OpenML as a pandas dataframe. This will be replaced with Polars once fetch_openml adds native support for it.

from sklearn.datasets import fetch_openml

co2 = fetch_openml(data_id=41187, as_frame=True)
co2.frame.head()

	year	month	day	weight	station	co2
0	1958	3	29	4	MLO	316.1
1	1958	4	5	6	MLO	317.3
2	1958	4	12	4	MLO	317.6
3	1958	4	19	6	MLO	317.5
4	1958	4	26	2	MLO	316.4

Next, we process the original dataframe to create a date column and select it along with the CO2 column.

import polars as pl

co2_data = pl.DataFrame(co2.frame[["year", "month", "day", "co2"]]).select(
    pl.date("year", "month", "day"), "co2"
)
co2_data.head()

shape: (5, 2)

date	co2
date	f64
1958-03-29	316.1
1958-04-05	317.3
1958-04-12	317.6
1958-04-19	317.5
1958-04-26	316.4

co2_data["date"].min(), co2_data["date"].max()

(datetime.date(1958, 3, 29), datetime.date(2001, 12, 29))

We see that we have CO2 concentrations for some days from March 1958 to December 2001. We can plot the raw data to get a better understanding of it.

import matplotlib.pyplot as plt

plt.plot(co2_data["date"], co2_data["co2"])
plt.xlabel("date")
plt.ylabel("CO$_2$ concentration (ppm)")
_ = plt.title("Raw air samples measurements from the Mauna Loa Observatory")

Raw air samples measurements from the Mauna Loa Observatory

We will preprocess the dataset by taking a monthly average and dropping the months for which no measurements were collected. This processing has a smoothing effect on the data.

co2_data = (
    co2_data.sort(by="date")
    .group_by_dynamic("date", every="1mo")
    .agg(pl.col("co2").mean())
    .drop_nulls()
)
plt.plot(co2_data["date"], co2_data["co2"])
plt.xlabel("date")
plt.ylabel("Monthly average of CO$_2$ concentration (ppm)")
_ = plt.title(
    "Monthly average of air samples measurements\nfrom the Mauna Loa Observatory"
)

Monthly average of air samples measurements from the Mauna Loa Observatory
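
As a rough illustration of what the monthly aggregation does, the sketch below averages the five raw samples shown earlier by calendar month using only the standard library. The helper name monthly_mean is hypothetical and is not part of the example; the Polars pipeline above remains the actual implementation.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# The first five samples of the raw dataset shown earlier.
samples = [
    (date(1958, 3, 29), 316.1),
    (date(1958, 4, 5), 317.3),
    (date(1958, 4, 12), 317.6),
    (date(1958, 4, 19), 317.5),
    (date(1958, 4, 26), 316.4),
]


def monthly_mean(rows):
    """Average the values per (year, month) bucket.

    Hypothetical helper: months without any sample never appear in the result.
    """
    buckets = defaultdict(list)
    for day, value in rows:
        buckets[(day.year, day.month)].append(value)
    return {month: mean(values) for month, values in sorted(buckets.items())}


averages = monthly_mean(samples)
for (year, month), value in averages.items():
    print(f"{year}-{month:02d}: {value:.2f} ppm")
```

The four April samples collapse into a single value, which is where the smoothing effect comes from.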

The idea in this example is to predict the CO2 concentration as a function of the date. We are also interested in extrapolating to the years after 2001.

As a first step, we separate the data from the target to estimate. Since the data is a date, we convert it into a numeric value.

X = co2_data.select(
    pl.col("date").dt.year() + pl.col("date").dt.month() / 12
).to_numpy()
y = co2_data["co2"].to_numpy()
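
To make the encoding concrete, here is a minimal standard-library sketch of the same arithmetic applied to single dates. The function name to_decimal_year is an assumption introduced for illustration only.

```python
from datetime import date


def to_decimal_year(day):
    """Map a date to year + month / 12, the same arithmetic used for X above."""
    return day.year + day.month / 12


march_1958 = to_decimal_year(date(1958, 3, 1))
december_2001 = to_decimal_year(date(2001, 12, 1))
print(march_1958, december_2001)
```

Note that, with this convention, December maps to the following integer year. The offset is the same for every sample, so it does not change the spacing between consecutive months.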

Design the proper kernel#

To design the kernel to use with our Gaussian process, we can make some assumptions regarding the data at hand. We observe several characteristics: a long-term rising trend, a pronounced seasonal variation and some smaller irregularities. We can use a different kernel to capture each of these features.

First, the long-term rising trend could be fitted using a radial basis function (RBF) kernel with a large length-scale parameter. The RBF kernel with a large length-scale enforces this component to be smooth. An increasing trend is not enforced, so as to leave the model some freedom. The specific length-scale and the amplitude are free hyperparameters.

from sklearn.gaussian_process.kernels import RBF

long_term_trend_kernel = 50.0**2 * RBF(length_scale=50.0)
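
The effect of a large length-scale can be checked by hand. The sketch below evaluates the standard RBF formula, amplitude² · exp(-d² / (2 · length_scale²)), in plain Python with the initial values used above. The function rbf is a hypothetical stand-in written for illustration, not the scikit-learn implementation.

```python
import math


def rbf(x1, x2, amplitude=50.0, length_scale=50.0):
    """Scaled RBF kernel: amplitude**2 * exp(-(x1 - x2)**2 / (2 * length_scale**2))."""
    return amplitude**2 * math.exp(-((x1 - x2) ** 2) / (2 * length_scale**2))


variance = rbf(2000.0, 2000.0)
correlation_10_years = rbf(1990.0, 2000.0) / variance
correlation_100_years = rbf(1900.0, 2000.0) / variance
print(variance, correlation_10_years, correlation_100_years)
```

With a length-scale of 50 years, two points a decade apart remain almost perfectly correlated, which is what forces this component to vary slowly.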

The seasonal variation is explained by the periodic exponential sine squared kernel with a fixed periodicity of 1 year. The length-scale of this periodic component, which controls its smoothness, is a free parameter. To allow the component to decay away from exact periodicity, the product with an RBF kernel is taken. The length-scale of this RBF component controls the decay time and is a further free parameter. This type of kernel is also known as a locally periodic kernel.

from sklearn.gaussian_process.kernels import ExpSineSquared

seasonal_kernel = (
    2.0**2
    * RBF(length_scale=100.0)
    * ExpSineSquared(length_scale=1.0, periodicity=1.0, periodicity_bounds="fixed")
)
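
A small plain-Python sketch may help to see how the product behaves. It assumes the usual closed forms, exp(-2 · sin²(π · d / periodicity) / length_scale²) for the periodic part and the RBF formula for the decay. All function names here are hypothetical and only mirror the initial values chosen above.

```python
import math


def exp_sine_squared(x1, x2, length_scale=1.0, periodicity=1.0):
    """Periodic kernel: exp(-2 * sin(pi * d / periodicity)**2 / length_scale**2)."""
    d = abs(x1 - x2)
    return math.exp(-2 * math.sin(math.pi * d / periodicity) ** 2 / length_scale**2)


def rbf_decay(x1, x2, length_scale=100.0):
    """RBF factor that lets the periodic pattern drift over long time spans."""
    return math.exp(-((x1 - x2) ** 2) / (2 * length_scale**2))


def locally_periodic(x1, x2, amplitude=2.0):
    """Product of the amplitude, the RBF decay and the periodic kernel."""
    return amplitude**2 * rbf_decay(x1, x2) * exp_sine_squared(x1, x2)


same_phase_next_year = locally_periodic(2000.0, 2001.0)
opposite_phase = locally_periodic(2000.0, 2000.5)
same_phase_50_years = locally_periodic(1950.0, 2000.0)
print(same_phase_next_year, opposite_phase, same_phase_50_years)
```

Points one year apart are nearly as correlated as a point with itself, points half a year apart are much less so, and the correlation between equal phases fades only slowly over decades.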

The small irregularities are to be explained by a rational quadratic kernel component, whose length-scale and alpha parameter, which quantifies the diffuseness of the length-scales, are to be determined. A rational quadratic kernel is equivalent to a scale mixture (an infinite sum) of RBF kernels with different length-scales, so it can better accommodate irregularities that occur on different time scales.

from sklearn.gaussian_process.kernels import RationalQuadratic

irregularities_kernel = 0.5**2 * RationalQuadratic(length_scale=1.0, alpha=1.0)
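
The relation to the RBF kernel can be illustrated numerically. The sketch below assumes the standard rational quadratic form (1 + d² / (2 · alpha · length_scale²))^(-alpha); the function names are hypothetical and written only for this comparison.

```python
import math


def rational_quadratic(x1, x2, length_scale=1.0, alpha=1.0):
    """Rational quadratic kernel: (1 + d**2 / (2 * alpha * l**2)) ** (-alpha)."""
    d2 = (x1 - x2) ** 2
    return (1 + d2 / (2 * alpha * length_scale**2)) ** (-alpha)


def rbf(x1, x2, length_scale=1.0):
    """Unit-amplitude RBF kernel used as the reference."""
    return math.exp(-((x1 - x2) ** 2) / (2 * length_scale**2))


# A small alpha mixes many length-scales and gives a heavier tail than the RBF.
rq_tail = rational_quadratic(0.0, 3.0, alpha=1.0)
rbf_tail = rbf(0.0, 3.0)

# A very large alpha concentrates the mixture on a single length-scale.
rq_large_alpha = rational_quadratic(0.0, 1.0, alpha=1e6)
rbf_reference = rbf(0.0, 1.0)
print(rq_tail, rbf_tail, rq_large_alpha, rbf_reference)
```

With alpha=1 the correlation decays far more slowly at large distances than a single RBF would, while for very large alpha the two kernels become numerically indistinguishable.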

Finally, the noise in the dataset can be accounted for by a kernel consisting of an RBF kernel contribution, which explains the correlated noise components such as local weather phenomena, and a white kernel contribution for the white noise. The relative amplitudes and the RBF’s length-scale are further free parameters.

from sklearn.gaussian_process.kernels import WhiteKernel

noise_kernel = 0.1**2 * RBF(length_scale=0.1) + WhiteKernel(
    noise_level=0.1**2, noise_level_bounds=(1e-5, 1e5)
)
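
The two noise contributions behave differently as a function of the distance between two dates. The sketch below uses the mathematical definitions only (a white kernel is non-zero solely when both inputs are identical) and a hypothetical function name; it is not a description of how scikit-learn evaluates these kernels internally.

```python
import math


def noise_kernel(x1, x2, amplitude=0.1, length_scale=0.1, noise_level=0.1**2):
    """Short-range RBF term plus a white-noise term on identical inputs."""
    correlated = amplitude**2 * math.exp(-((x1 - x2) ** 2) / (2 * length_scale**2))
    white = noise_level if x1 == x2 else 0.0
    return correlated + white


same_point = noise_kernel(2000.0, 2000.0)
one_month_apart = noise_kernel(2000.0, 2000.0 + 1 / 12)
one_year_apart = noise_kernel(2000.0, 2001.0)
print(same_point, one_month_apart, one_year_apart)
```

The correlated term still links neighbouring months but is negligible after a year, whereas the white term only adds variance to each individual observation.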

Thus, our final kernel is the sum of all the previous kernels.

co2_kernel = (
    long_term_trend_kernel + seasonal_kernel + irregularities_kernel + noise_kernel
)
co2_kernel

50**2 * RBF(length_scale=50) + 2**2 * RBF(length_scale=100) * ExpSineSquared(length_scale=1, periodicity=1) + 0.5**2 * RationalQuadratic(alpha=1, length_scale=1) + 0.1**2 * RBF(length_scale=0.1) + WhiteKernel(noise_level=0.01)

Model fitting and extrapolation#

Now, we are ready to use a Gaussian process regressor and fit the available data. To follow the example from the literature, we will subtract the mean from the target. We could have used normalize_y=True. However, doing so would have also scaled the target (dividing y by its standard deviation). Thus, the hyperparameters of the different kernels would have had a different meaning, since they would no longer have been expressed in ppm.

from sklearn.gaussian_process import GaussianProcessRegressor

y_mean = y.mean()
gaussian_process = GaussianProcessRegressor(kernel=co2_kernel, normalize_y=False)
gaussian_process.fit(X, y - y_mean)

GaussianProcessRegressor(kernel=50**2 * RBF(length_scale=50) + 2**2 * RBF(length_scale=100) * ExpSineSquared(length_scale=1, periodicity=1) + 0.5**2 * RationalQuadratic(alpha=1, length_scale=1) + 0.1**2 * RBF(length_scale=0.1) + WhiteKernel(noise_level=0.01))
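
The difference between centering and full normalization can be seen on a toy target. The values below are made up for illustration and are not taken from the dataset; the sketch only contrasts subtracting the mean with additionally dividing by the standard deviation.

```python
from statistics import mean, pstdev

# Hypothetical CO2 values in ppm, for illustration only.
y_toy = [316.1, 317.2, 330.0, 350.0, 371.0]

y_toy_mean = mean(y_toy)
y_toy_std = pstdev(y_toy)

# Centering keeps the unit: differences are still expressed in ppm.
centered = [value - y_toy_mean for value in y_toy]

# Standardizing also rescales: the result is unitless with unit variance.
standardized = [(value - y_toy_mean) / y_toy_std for value in y_toy]

spread_centered = max(centered) - min(centered)
spread_standardized = max(standardized) - min(standardized)
print(spread_centered, spread_standardized)
```

After centering, the spread is still the 54.9 ppm of the raw values, so a kernel amplitude can be read directly in ppm. After standardizing, amplitudes would be expressed in multiples of the target's standard deviation instead.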

Fitted attributes

Name	Type	Value	Description
L_	ndarray[float64](521, 521)	[[44.86, 0. , 0. ,..., 0. , 0. , 0. ], [44.85, 0.96, 0. ,..., 0. , 0. , 0. ], [44.83, 1.55, 0.75,..., 0. , 0. , 0. ], ..., [31.35, 2.17, 2. ,..., 0.29, 0. , 0. ], [31.31, 1.83, 1.85,..., 0.19, 0.29, 0. ], [31.29, 1.5 , 1.68,..., 0.16, 0.19, 0.29]]	Lower-triangular Cholesky decomposition of the kernel in X_train_.
X_train_	ndarray[float64](521, 1)	[[1958.25], [1958.33], [1958.42], ..., [2001.83], [2001.92], [2002. ]]	Feature vectors or other representations of training data (also required for prediction).
alpha_	ndarray[float64](521,)	[-0.86, 2.37,-3.4 ,..., 1.17,-2.45, 2.96]	Dual coefficients of training data points in kernel space.
kernel_	Sum	44.8*2 RBF..._level=0.0367)	The kernel used for prediction. The structure of the kernel is the same as the one passed as parameter but with optimized hyperparameters.
log_marginal_likelihood_value_	float64	-115.1	The log-marginal-likelihood of self.kernel_.theta.
n_features_in_	int	1	Number of features seen during fit.
y_train_	ndarray[float64](521,)	[-23.72,-22.62,-22.39,..., 28.23, 29.55, 31.2 ]	Target values in training data (also required for prediction).

Now, we will use the Gaussian process to predict on:

training data to inspect the goodness of fit;
future data to see the extrapolation done by the model.

Thus, we create synthetic data from 1958 to the current month. In addition, we need to add back the mean that was subtracted during training.

import datetime

import numpy as np

today = datetime.datetime.now()
current_month = today.year + today.month / 12
X_test = np.linspace(start=1958, stop=current_month, num=1_000).reshape(-1, 1)
mean_y_pred, std_y_pred = gaussian_process.predict(X_test, return_std=True)
mean_y_pred += y_mean

plt.plot(X, y, color="black", linestyle="dashed", label="Measurements")
plt.plot(X_test, mean_y_pred, color="tab:blue", alpha=0.4, label="Gaussian process")
plt.fill_between(
    X_test.ravel(),
    mean_y_pred - std_y_pred,
    mean_y_pred + std_y_pred,
    color="tab:blue",
    alpha=0.2,
)
plt.legend()
plt.xlabel("Year")
plt.ylabel("Monthly average of CO$_2$ concentration (ppm)")
_ = plt.title(
    "Monthly average of air samples measurements\nfrom the Mauna Loa Observatory"
)

Our fitted model is able to fit the previous data properly and to extrapolate to future years with confidence.

Interpretation of kernel hyperparameters#

Now, we can have a look at the hyperparameters of the kernel.

gaussian_process.kernel_

44.8**2 * RBF(length_scale=51.6) + 2.64**2 * RBF(length_scale=91.5) * ExpSineSquared(length_scale=1.48, periodicity=1) + 0.536**2 * RationalQuadratic(alpha=2.89, length_scale=0.968) + 0.188**2 * RBF(length_scale=0.122) + WhiteKernel(noise_level=0.0367)

Thus, most of the target signal, with the mean subtracted, is explained by a long-term rising trend with an amplitude of ~45 ppm and a length-scale of ~52 years. The periodic component has an amplitude of ~2.6 ppm, a decay time of ~90 years and a length-scale of ~1.5. The long decay time indicates that this component is very close to an exact seasonal periodicity. The correlated noise has an amplitude of ~0.2 ppm with a length-scale of ~0.12 years, and the white-noise contribution has a variance of ~0.04 ppm² (a standard deviation of ~0.19 ppm). Thus, the overall noise level is very small, indicating that the data can be very well explained by the model.
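
When reading these numbers, it helps to keep in mind that the constant factors are printed as squared amplitudes while the noise_level of the white kernel is a variance. The short sketch below, with values copied from the fitted kernel above, converts both noise terms to standard deviations in ppm. Combining them into a single figure assumes the two noise sources are independent.

```python
import math

# Values read from the fitted kernel printed above.
correlated_noise_amplitude = 0.188  # ppm, printed as 0.188**2
white_noise_level = 0.0367  # variance, in ppm**2

white_noise_std = math.sqrt(white_noise_level)

# Variances of independent components add up, standard deviations do not.
total_noise_std = math.sqrt(correlated_noise_amplitude**2 + white_noise_level)

print(f"white noise std: {white_noise_std:.3f} ppm")
print(f"total noise std: {total_noise_std:.3f} ppm")
```

Both figures are small compared with the ~45 ppm trend and the ~2.6 ppm seasonal component.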
