{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Forecasting of CO2 level on Mona Loa dataset using Gaussian process regression (GPR)\n\nThis example is based on Section 5.4.3 of \"Gaussian Processes for Machine\nLearning\" [1]_. It illustrates an example of complex kernel engineering\nand hyperparameter optimization using gradient ascent on the\nlog-marginal-likelihood. The data consists of the monthly average atmospheric\nCO2 concentrations (in parts per million by volume (ppm)) collected at the\nMauna Loa Observatory in Hawaii, between 1958 and 2001. The objective is to\nmodel the CO2 concentration as a function of the time $t$ and extrapolate\nfor years after 2001.\n\n.. rubric:: References\n\n.. [1] [Rasmussen, Carl Edward. \"Gaussian processes in machine learning.\"\n    Summer school on machine learning. Springer, Berlin, Heidelberg, 2003](http://www.gaussianprocess.org/gpml/chapters/RW.pdf).\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(__doc__)\n\n# Authors: Jan Hendrik Metzen <jhm@informatik.uni-bremen.de>\n#          Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: BSD 3 clause"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Build the dataset\n\nWe will derive a dataset from the Mauna Loa Observatory that collected air\nsamples. We are interested in estimating the concentration of CO2 and\nextrapolate it for further year. First, we load the original dataset available\nin OpenML as a pandas dataframe. This will be replaced with Polars\nonce `fetch_openml` adds a native support for it.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.datasets import fetch_openml\n\nco2 = fetch_openml(data_id=41187, as_frame=True)\nco2.frame.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "First, we process the original dataframe to create a date column and select\nit along with the CO2 column.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import polars as pl\n\nco2_data = pl.DataFrame(co2.frame[[\"year\", \"month\", \"day\", \"co2\"]]).select(\n    pl.date(\"year\", \"month\", \"day\"), \"co2\"\n)\nco2_data.head()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "co2_data[\"date\"].min(), co2_data[\"date\"].max()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We see that we get CO2 concentration for some days from March, 1958 to\nDecember, 2001. We can plot these raw information to have a better\nunderstanding.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\n\nplt.plot(co2_data[\"date\"], co2_data[\"co2\"])\nplt.xlabel(\"date\")\nplt.ylabel(\"CO$_2$ concentration (ppm)\")\n_ = plt.title(\"Raw air samples measurements from the Mauna Loa Observatory\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We will preprocess the dataset by taking a monthly average and drop month\nfor which no measurements were collected. Such a processing will have an\nsmoothing effect on the data.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "co2_data = (\n    co2_data.sort(by=\"date\")\n    .group_by_dynamic(\"date\", every=\"1mo\")\n    .agg(pl.col(\"co2\").mean())\n    .drop_nulls()\n)\nplt.plot(co2_data[\"date\"], co2_data[\"co2\"])\nplt.xlabel(\"date\")\nplt.ylabel(\"Monthly average of CO$_2$ concentration (ppm)\")\n_ = plt.title(\n    \"Monthly average of air samples measurements\\nfrom the Mauna Loa Observatory\"\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The idea in this example will be to predict the CO2 concentration in function\nof the date. We are as well interested in extrapolating for upcoming year\nafter 2001.\n\nAs a first step, we will divide the data and the target to estimate. The data\nbeing a date, we will convert it into a numeric.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "X = co2_data.select(\n    pl.col(\"date\").dt.year() + pl.col(\"date\").dt.month() / 12\n).to_numpy()\ny = co2_data[\"co2\"].to_numpy()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Design the proper kernel\n\nTo design the kernel to use with our Gaussian process, we can make some\nassumption regarding the data at hand. We observe that they have several\ncharacteristics: we see a long term rising trend, a pronounced seasonal\nvariation and some smaller irregularities. We can use different appropriate\nkernel that would capture these features.\n\nFirst, the long term rising trend could be fitted using a radial basis\nfunction (RBF) kernel with a large length-scale parameter. The RBF kernel\nwith a large length-scale enforces this component to be smooth. An trending\nincrease is not enforced as to give a degree of freedom to our model. The\nspecific length-scale and the amplitude are free hyperparameters.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.gaussian_process.kernels import RBF\n\nlong_term_trend_kernel = 50.0**2 * RBF(length_scale=50.0)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The seasonal variation is explained by the periodic exponential sine squared\nkernel with a fixed periodicity of 1 year. The length-scale of this periodic\ncomponent, controlling its smoothness, is a free parameter. In order to allow\ndecaying away from exact periodicity, the product with an RBF kernel is\ntaken. The length-scale of this RBF component controls the decay time and is\na further free parameter. This type of kernel is also known as locally\nperiodic kernel.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.gaussian_process.kernels import ExpSineSquared\n\nseasonal_kernel = (\n    2.0**2\n    * RBF(length_scale=100.0)\n    * ExpSineSquared(length_scale=1.0, periodicity=1.0, periodicity_bounds=\"fixed\")\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The small irregularities are to be explained by a rational quadratic kernel\ncomponent, whose length-scale and alpha parameter, which quantifies the\ndiffuseness of the length-scales, are to be determined. A rational quadratic\nkernel is equivalent to an RBF kernel with several length-scale and will\nbetter accommodate the different irregularities.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.gaussian_process.kernels import RationalQuadratic\n\nirregularities_kernel = 0.5**2 * RationalQuadratic(length_scale=1.0, alpha=1.0)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Finally, the noise in the dataset can be accounted with a kernel consisting\nof an RBF kernel contribution, which shall explain the correlated noise\ncomponents such as local weather phenomena, and a white kernel contribution\nfor the white noise. The relative amplitudes and the RBF's length scale are\nfurther free parameters.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.gaussian_process.kernels import WhiteKernel\n\nnoise_kernel = 0.1**2 * RBF(length_scale=0.1) + WhiteKernel(\n    noise_level=0.1**2, noise_level_bounds=(1e-5, 1e5)\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Thus, our final kernel is an addition of all previous kernel.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "co2_kernel = (\n    long_term_trend_kernel + seasonal_kernel + irregularities_kernel + noise_kernel\n)\nco2_kernel"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Model fitting and extrapolation\n\nNow, we are ready to use a Gaussian process regressor and fit the available\ndata. To follow the example from the literature, we will subtract the mean\nfrom the target. We could have used `normalize_y=True`. However, doing so\nwould have also scaled the target (dividing `y` by its standard deviation).\nThus, the hyperparameters of the different kernel would have had different\nmeaning since they would not have been expressed in ppm.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.gaussian_process import GaussianProcessRegressor\n\ny_mean = y.mean()\ngaussian_process = GaussianProcessRegressor(kernel=co2_kernel, normalize_y=False)\ngaussian_process.fit(X, y - y_mean)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, we will use the Gaussian process to predict on:\n\n- training data to inspect the goodness of fit;\n- future data to see the extrapolation done by the model.\n\nThus, we create synthetic data from 1958 to the current month. In addition,\nwe need to add the subtracted mean computed during training.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import datetime\n\nimport numpy as np\n\ntoday = datetime.datetime.now()\ncurrent_month = today.year + today.month / 12\nX_test = np.linspace(start=1958, stop=current_month, num=1_000).reshape(-1, 1)\nmean_y_pred, std_y_pred = gaussian_process.predict(X_test, return_std=True)\nmean_y_pred += y_mean"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "plt.plot(X, y, color=\"black\", linestyle=\"dashed\", label=\"Measurements\")\nplt.plot(X_test, mean_y_pred, color=\"tab:blue\", alpha=0.4, label=\"Gaussian process\")\nplt.fill_between(\n    X_test.ravel(),\n    mean_y_pred - std_y_pred,\n    mean_y_pred + std_y_pred,\n    color=\"tab:blue\",\n    alpha=0.2,\n)\nplt.legend()\nplt.xlabel(\"Year\")\nplt.ylabel(\"Monthly average of CO$_2$ concentration (ppm)\")\n_ = plt.title(\n    \"Monthly average of air samples measurements\\nfrom the Mauna Loa Observatory\"\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Our fitted model is capable to fit previous data properly and extrapolate to\nfuture year with confidence.\n\n## Interpretation of kernel hyperparameters\n\nNow, we can have a look at the hyperparameters of the kernel.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "gaussian_process.kernel_"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Thus, most of the target signal, with the mean subtracted, is explained by a\nlong-term rising trend for ~45 ppm and a length-scale of ~52 years. The\nperiodic component has an amplitude of ~2.6ppm, a decay time of ~90 years and\na length-scale of ~1.5. The long decay time indicates that we have a\ncomponent very close to a seasonal periodicity. The correlated noise has an\namplitude of ~0.2 ppm with a length scale of ~0.12 years and a white-noise\ncontribution of ~0.04 ppm. Thus, the overall noise level is very small,\nindicating that the data can be very well explained by the model.\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.19"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}