.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/applications/plot_cyclical_feature_engineering.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_applications_plot_cyclical_feature_engineering.py>`
        to download the full example code or to run this example in your browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_applications_plot_cyclical_feature_engineering.py:


================================
Time-related feature engineering
================================

This notebook introduces different strategies to leverage time-related features
for a bike sharing demand regression task that is highly dependent on business
cycles (days, weeks, months) and yearly season cycles.

In the process, we introduce how to perform periodic feature engineering using
the :class:`sklearn.preprocessing.SplineTransformer` class and its
`extrapolation="periodic"` option.

.. GENERATED FROM PYTHON SOURCE LINES 17-21

Data exploration on the Bike Sharing Demand dataset
---------------------------------------------------

We start by loading the data from the OpenML repository.

.. GENERATED FROM PYTHON SOURCE LINES 21-26

.. code-block:: default

    from sklearn.datasets import fetch_openml

    bike_sharing = fetch_openml("Bike_Sharing_Demand", version=2, as_frame=True)
    df = bike_sharing.frame


.. GENERATED FROM PYTHON SOURCE LINES 27-34

To get a quick understanding of the periodic patterns of the data, let us
have a look at the average demand per hour during a week.

Note that the week starts on a Sunday, during the weekend. We can clearly
distinguish the commute patterns in the morning and evenings of the work days
and the leisure use of the bikes on the weekends with a more spread peak
demand around the middle of the days:

.. GENERATED FROM PYTHON SOURCE LINES 34-48

.. code-block:: default

    import matplotlib.pyplot as plt


    fig, ax = plt.subplots(figsize=(12, 4))
    average_week_demand = df.groupby(["weekday", "hour"]).mean()["count"]
    average_week_demand.plot(ax=ax)
    _ = ax.set(
        title="Average hourly bike demand during the week",
        xticks=[i * 24 for i in range(7)],
        xticklabels=["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"],
        xlabel="Time of the week",
        ylabel="Number of bike rentals",
    )


.. image-sg:: /auto_examples/applications/images/sphx_glr_plot_cyclical_feature_engineering_001.png
   :alt: Average hourly bike demand during the week
   :srcset: /auto_examples/applications/images/sphx_glr_plot_cyclical_feature_engineering_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 49-51

The target of the prediction problem is the absolute count of bike rentals on
a hourly basis:

.. GENERATED FROM PYTHON SOURCE LINES 52-54

.. code-block:: default

    df["count"].max()


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    977.0


.. GENERATED FROM PYTHON SOURCE LINES 55-70

Let us rescale the target variable (number of hourly bike rentals) to predict
a relative demand so that the mean absolute error is more easily interpreted
as a fraction of the maximum demand.

.. note::

    The fit method of the models used in this notebook all minimize the
    mean squared error to estimate the conditional mean instead of the mean
    absolute error that would fit an estimator of the conditional median.

    When reporting performance measure on the test set in the discussion, we
    instead choose to focus on the mean absolute error that is more
    intuitive than the (root) mean squared error. Note, however, that the
    best models for one metric are also the best for the other in this
    study.

.. GENERATED FROM PYTHON SOURCE LINES 71-73

.. code-block:: default

    y = df["count"] / df["count"].max()


.. GENERATED FROM PYTHON SOURCE LINES 74-81

.. code-block:: default

    fig, ax = plt.subplots(figsize=(12, 4))
    y.hist(bins=30, ax=ax)
    _ = ax.set(
        xlabel="Fraction of rented fleet demand",
        ylabel="Number of hours",
    )


.. image-sg:: /auto_examples/applications/images/sphx_glr_plot_cyclical_feature_engineering_002.png
   :alt: plot cyclical feature engineering
   :srcset: /auto_examples/applications/images/sphx_glr_plot_cyclical_feature_engineering_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 82-87

The input feature data frame is a time annotated hourly log of variables
describing the weather conditions. It includes both numerical and categorical
variables. Note that the time information has already been expanded into
several complementary columns.


.. GENERATED FROM PYTHON SOURCE LINES 87-90

.. code-block:: default

    X = df.drop("count", axis="columns")
    X


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>season</th>
          <th>year</th>
          <th>month</th>
          <th>hour</th>
          <th>holiday</th>
          <th>weekday</th>
          <th>workingday</th>
          <th>weather</th>
          <th>temp</th>
          <th>feel_temp</th>
          <th>humidity</th>
          <th>windspeed</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>spring</td>
          <td>0.0</td>
          <td>1.0</td>
          <td>0.0</td>
          <td>False</td>
          <td>6.0</td>
          <td>False</td>
          <td>clear</td>
          <td>9.84</td>
          <td>14.395</td>
          <td>0.81</td>
          <td>0.0000</td>
        </tr>
        <tr>
          <th>1</th>
          <td>spring</td>
          <td>0.0</td>
          <td>1.0</td>
          <td>1.0</td>
          <td>False</td>
          <td>6.0</td>
          <td>False</td>
          <td>clear</td>
          <td>9.02</td>
          <td>13.635</td>
          <td>0.80</td>
          <td>0.0000</td>
        </tr>
        <tr>
          <th>2</th>
          <td>spring</td>
          <td>0.0</td>
          <td>1.0</td>
          <td>2.0</td>
          <td>False</td>
          <td>6.0</td>
          <td>False</td>
          <td>clear</td>
          <td>9.02</td>
          <td>13.635</td>
          <td>0.80</td>
          <td>0.0000</td>
        </tr>
        <tr>
          <th>3</th>
          <td>spring</td>
          <td>0.0</td>
          <td>1.0</td>
          <td>3.0</td>
          <td>False</td>
          <td>6.0</td>
          <td>False</td>
          <td>clear</td>
          <td>9.84</td>
          <td>14.395</td>
          <td>0.75</td>
          <td>0.0000</td>
        </tr>
        <tr>
          <th>4</th>
          <td>spring</td>
          <td>0.0</td>
          <td>1.0</td>
          <td>4.0</td>
          <td>False</td>
          <td>6.0</td>
          <td>False</td>
          <td>clear</td>
          <td>9.84</td>
          <td>14.395</td>
          <td>0.75</td>
          <td>0.0000</td>
        </tr>
        <tr>
          <th>...</th>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
        </tr>
        <tr>
          <th>17374</th>
          <td>spring</td>
          <td>1.0</td>
          <td>12.0</td>
          <td>19.0</td>
          <td>False</td>
          <td>1.0</td>
          <td>True</td>
          <td>misty</td>
          <td>10.66</td>
          <td>12.880</td>
          <td>0.60</td>
          <td>11.0014</td>
        </tr>
        <tr>
          <th>17375</th>
          <td>spring</td>
          <td>1.0</td>
          <td>12.0</td>
          <td>20.0</td>
          <td>False</td>
          <td>1.0</td>
          <td>True</td>
          <td>misty</td>
          <td>10.66</td>
          <td>12.880</td>
          <td>0.60</td>
          <td>11.0014</td>
        </tr>
        <tr>
          <th>17376</th>
          <td>spring</td>
          <td>1.0</td>
          <td>12.0</td>
          <td>21.0</td>
          <td>False</td>
          <td>1.0</td>
          <td>True</td>
          <td>clear</td>
          <td>10.66</td>
          <td>12.880</td>
          <td>0.60</td>
          <td>11.0014</td>
        </tr>
        <tr>
          <th>17377</th>
          <td>spring</td>
          <td>1.0</td>
          <td>12.0</td>
          <td>22.0</td>
          <td>False</td>
          <td>1.0</td>
          <td>True</td>
          <td>clear</td>
          <td>10.66</td>
          <td>13.635</td>
          <td>0.56</td>
          <td>8.9981</td>
        </tr>
        <tr>
          <th>17378</th>
          <td>spring</td>
          <td>1.0</td>
          <td>12.0</td>
          <td>23.0</td>
          <td>False</td>
          <td>1.0</td>
          <td>True</td>
          <td>clear</td>
          <td>10.66</td>
          <td>13.635</td>
          <td>0.65</td>
          <td>8.9981</td>
        </tr>
      </tbody>
    </table>
    <p>17379 rows × 12 columns</p>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 91-101

.. note::

   If the time information was only present as a date or datetime column, we
   could have expanded it into hour-in-the-day, day-in-the-week,
   day-in-the-month, month-in-the-year using pandas:
   https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components

We now introspect the distribution of the categorical variables, starting
with `"weather"`:


.. GENERATED FROM PYTHON SOURCE LINES 101-103

.. code-block:: default

    X["weather"].value_counts()


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    clear         11413
    misty          4544
    rain           1419
    heavy_rain        3
    Name: weather, dtype: int64


.. GENERATED FROM PYTHON SOURCE LINES 104-108

Since there are only 3 `"heavy_rain"` events, we cannot use this category to
train machine learning models with cross validation. Instead, we simplify the
representation by collapsing those into the `"rain"` category.


.. GENERATED FROM PYTHON SOURCE LINES 108-109

.. code-block:: default

    X["weather"].replace(to_replace="heavy_rain", value="rain", inplace=True)


.. GENERATED FROM PYTHON SOURCE LINES 110-112

.. code-block:: default

    X["weather"].value_counts()


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    clear    11413
    misty     4544
    rain      1422
    Name: weather, dtype: int64


.. GENERATED FROM PYTHON SOURCE LINES 113-115

As expected, the `"season"` variable is well balanced:


.. GENERATED FROM PYTHON SOURCE LINES 115-117

.. code-block:: default

    X["season"].value_counts()


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    fall      4496
    summer    4409
    spring    4242
    winter    4232
    Name: season, dtype: int64


.. GENERATED FROM PYTHON SOURCE LINES 118-130

Time-based cross-validation
---------------------------

Since the dataset is a time-ordered event log (hourly demand), we will use a
time-sensitive cross-validation splitter to evaluate our demand forecasting
model as realistically as possible. We use a gap of 2 days between the train
and test side of the splits. We also limit the training set size to make the
performance of the CV folds more stable.

1000 test datapoints should be enough to quantify the performance of the
model. This represents a bit less than a month and a half of contiguous test
data:

.. GENERATED FROM PYTHON SOURCE LINES 130-140

.. code-block:: default


    from sklearn.model_selection import TimeSeriesSplit

    ts_cv = TimeSeriesSplit(
        n_splits=5,
        gap=48,
        max_train_size=10000,
        test_size=1000,
    )


.. GENERATED FROM PYTHON SOURCE LINES 141-143

Let us manually inspect the various splits to check that the
`TimeSeriesSplit` works as we expect, starting with the first split:

.. GENERATED FROM PYTHON SOURCE LINES 143-146

.. code-block:: default

    all_splits = list(ts_cv.split(X, y))
    train_0, test_0 = all_splits[0]


.. GENERATED FROM PYTHON SOURCE LINES 147-149

.. code-block:: default

    X.iloc[test_0]


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>season</th>
          <th>year</th>
          <th>month</th>
          <th>hour</th>
          <th>holiday</th>
          <th>weekday</th>
          <th>workingday</th>
          <th>weather</th>
          <th>temp</th>
          <th>feel_temp</th>
          <th>humidity</th>
          <th>windspeed</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>12379</th>
          <td>summer</td>
          <td>1.0</td>
          <td>6.0</td>
          <td>0.0</td>
          <td>False</td>
          <td>2.0</td>
          <td>True</td>
          <td>clear</td>
          <td>22.14</td>
          <td>25.760</td>
          <td>0.68</td>
          <td>27.9993</td>
        </tr>
        <tr>
          <th>12380</th>
          <td>summer</td>
          <td>1.0</td>
          <td>6.0</td>
          <td>1.0</td>
          <td>False</td>
          <td>2.0</td>
          <td>True</td>
          <td>misty</td>
          <td>21.32</td>
          <td>25.000</td>
          <td>0.77</td>
          <td>22.0028</td>
        </tr>
        <tr>
          <th>12381</th>
          <td>summer</td>
          <td>1.0</td>
          <td>6.0</td>
          <td>2.0</td>
          <td>False</td>
          <td>2.0</td>
          <td>True</td>
          <td>rain</td>
          <td>21.32</td>
          <td>25.000</td>
          <td>0.72</td>
          <td>19.9995</td>
        </tr>
        <tr>
          <th>12382</th>
          <td>summer</td>
          <td>1.0</td>
          <td>6.0</td>
          <td>3.0</td>
          <td>False</td>
          <td>2.0</td>
          <td>True</td>
          <td>rain</td>
          <td>20.50</td>
          <td>24.240</td>
          <td>0.82</td>
          <td>12.9980</td>
        </tr>
        <tr>
          <th>12383</th>
          <td>summer</td>
          <td>1.0</td>
          <td>6.0</td>
          <td>4.0</td>
          <td>False</td>
          <td>2.0</td>
          <td>True</td>
          <td>rain</td>
          <td>20.50</td>
          <td>24.240</td>
          <td>0.82</td>
          <td>12.9980</td>
        </tr>
        <tr>
          <th>...</th>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
        </tr>
        <tr>
          <th>13374</th>
          <td>fall</td>
          <td>1.0</td>
          <td>7.0</td>
          <td>11.0</td>
          <td>False</td>
          <td>1.0</td>
          <td>True</td>
          <td>clear</td>
          <td>34.44</td>
          <td>40.150</td>
          <td>0.53</td>
          <td>15.0013</td>
        </tr>
        <tr>
          <th>13375</th>
          <td>fall</td>
          <td>1.0</td>
          <td>7.0</td>
          <td>12.0</td>
          <td>False</td>
          <td>1.0</td>
          <td>True</td>
          <td>clear</td>
          <td>34.44</td>
          <td>39.395</td>
          <td>0.49</td>
          <td>8.9981</td>
        </tr>
        <tr>
          <th>13376</th>
          <td>fall</td>
          <td>1.0</td>
          <td>7.0</td>
          <td>13.0</td>
          <td>False</td>
          <td>1.0</td>
          <td>True</td>
          <td>clear</td>
          <td>34.44</td>
          <td>39.395</td>
          <td>0.49</td>
          <td>19.0012</td>
        </tr>
        <tr>
          <th>13377</th>
          <td>fall</td>
          <td>1.0</td>
          <td>7.0</td>
          <td>14.0</td>
          <td>False</td>
          <td>1.0</td>
          <td>True</td>
          <td>clear</td>
          <td>36.08</td>
          <td>40.910</td>
          <td>0.42</td>
          <td>7.0015</td>
        </tr>
        <tr>
          <th>13378</th>
          <td>fall</td>
          <td>1.0</td>
          <td>7.0</td>
          <td>15.0</td>
          <td>False</td>
          <td>1.0</td>
          <td>True</td>
          <td>clear</td>
          <td>35.26</td>
          <td>40.150</td>
          <td>0.47</td>
          <td>16.9979</td>
        </tr>
      </tbody>
    </table>
    <p>1000 rows × 12 columns</p>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 150-152

.. code-block:: default

    X.iloc[train_0]


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>season</th>
          <th>year</th>
          <th>month</th>
          <th>hour</th>
          <th>holiday</th>
          <th>weekday</th>
          <th>workingday</th>
          <th>weather</th>
          <th>temp</th>
          <th>feel_temp</th>
          <th>humidity</th>
          <th>windspeed</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>2331</th>
          <td>summer</td>
          <td>0.0</td>
          <td>4.0</td>
          <td>1.0</td>
          <td>False</td>
          <td>2.0</td>
          <td>True</td>
          <td>misty</td>
          <td>25.42</td>
          <td>31.060</td>
          <td>0.50</td>
          <td>6.0032</td>
        </tr>
        <tr>
          <th>2332</th>
          <td>summer</td>
          <td>0.0</td>
          <td>4.0</td>
          <td>2.0</td>
          <td>False</td>
          <td>2.0</td>
          <td>True</td>
          <td>misty</td>
          <td>24.60</td>
          <td>31.060</td>
          <td>0.53</td>
          <td>8.9981</td>
        </tr>
        <tr>
          <th>2333</th>
          <td>summer</td>
          <td>0.0</td>
          <td>4.0</td>
          <td>3.0</td>
          <td>False</td>
          <td>2.0</td>
          <td>True</td>
          <td>misty</td>
          <td>23.78</td>
          <td>27.275</td>
          <td>0.56</td>
          <td>8.9981</td>
        </tr>
        <tr>
          <th>2334</th>
          <td>summer</td>
          <td>0.0</td>
          <td>4.0</td>
          <td>4.0</td>
          <td>False</td>
          <td>2.0</td>
          <td>True</td>
          <td>misty</td>
          <td>22.96</td>
          <td>26.515</td>
          <td>0.64</td>
          <td>8.9981</td>
        </tr>
        <tr>
          <th>2335</th>
          <td>summer</td>
          <td>0.0</td>
          <td>4.0</td>
          <td>5.0</td>
          <td>False</td>
          <td>2.0</td>
          <td>True</td>
          <td>misty</td>
          <td>22.14</td>
          <td>25.760</td>
          <td>0.68</td>
          <td>8.9981</td>
        </tr>
        <tr>
          <th>...</th>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
        </tr>
        <tr>
          <th>12326</th>
          <td>summer</td>
          <td>1.0</td>
          <td>6.0</td>
          <td>19.0</td>
          <td>False</td>
          <td>6.0</td>
          <td>False</td>
          <td>clear</td>
          <td>26.24</td>
          <td>31.060</td>
          <td>0.36</td>
          <td>11.0014</td>
        </tr>
        <tr>
          <th>12327</th>
          <td>summer</td>
          <td>1.0</td>
          <td>6.0</td>
          <td>20.0</td>
          <td>False</td>
          <td>6.0</td>
          <td>False</td>
          <td>clear</td>
          <td>25.42</td>
          <td>31.060</td>
          <td>0.35</td>
          <td>19.0012</td>
        </tr>
        <tr>
          <th>12328</th>
          <td>summer</td>
          <td>1.0</td>
          <td>6.0</td>
          <td>21.0</td>
          <td>False</td>
          <td>6.0</td>
          <td>False</td>
          <td>clear</td>
          <td>24.60</td>
          <td>31.060</td>
          <td>0.40</td>
          <td>7.0015</td>
        </tr>
        <tr>
          <th>12329</th>
          <td>summer</td>
          <td>1.0</td>
          <td>6.0</td>
          <td>22.0</td>
          <td>False</td>
          <td>6.0</td>
          <td>False</td>
          <td>clear</td>
          <td>23.78</td>
          <td>27.275</td>
          <td>0.46</td>
          <td>8.9981</td>
        </tr>
        <tr>
          <th>12330</th>
          <td>summer</td>
          <td>1.0</td>
          <td>6.0</td>
          <td>23.0</td>
          <td>False</td>
          <td>6.0</td>
          <td>False</td>
          <td>clear</td>
          <td>22.96</td>
          <td>26.515</td>
          <td>0.52</td>
          <td>7.0015</td>
        </tr>
      </tbody>
    </table>
    <p>10000 rows × 12 columns</p>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 153-154

We now inspect the last split:

.. GENERATED FROM PYTHON SOURCE LINES 154-156

.. code-block:: default

    train_4, test_4 = all_splits[4]


.. GENERATED FROM PYTHON SOURCE LINES 157-159

.. code-block:: default

    X.iloc[test_4]


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>season</th>
          <th>year</th>
          <th>month</th>
          <th>hour</th>
          <th>holiday</th>
          <th>weekday</th>
          <th>workingday</th>
          <th>weather</th>
          <th>temp</th>
          <th>feel_temp</th>
          <th>humidity</th>
          <th>windspeed</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>16379</th>
          <td>winter</td>
          <td>1.0</td>
          <td>11.0</td>
          <td>5.0</td>
          <td>False</td>
          <td>2.0</td>
          <td>True</td>
          <td>misty</td>
          <td>13.94</td>
          <td>16.665</td>
          <td>0.66</td>
          <td>8.9981</td>
        </tr>
        <tr>
          <th>16380</th>
          <td>winter</td>
          <td>1.0</td>
          <td>11.0</td>
          <td>6.0</td>
          <td>False</td>
          <td>2.0</td>
          <td>True</td>
          <td>misty</td>
          <td>13.94</td>
          <td>16.665</td>
          <td>0.71</td>
          <td>11.0014</td>
        </tr>
        <tr>
          <th>16381</th>
          <td>winter</td>
          <td>1.0</td>
          <td>11.0</td>
          <td>7.0</td>
          <td>False</td>
          <td>2.0</td>
          <td>True</td>
          <td>clear</td>
          <td>13.12</td>
          <td>16.665</td>
          <td>0.76</td>
          <td>6.0032</td>
        </tr>
        <tr>
          <th>16382</th>
          <td>winter</td>
          <td>1.0</td>
          <td>11.0</td>
          <td>8.0</td>
          <td>False</td>
          <td>2.0</td>
          <td>True</td>
          <td>clear</td>
          <td>13.94</td>
          <td>16.665</td>
          <td>0.71</td>
          <td>8.9981</td>
        </tr>
        <tr>
          <th>16383</th>
          <td>winter</td>
          <td>1.0</td>
          <td>11.0</td>
          <td>9.0</td>
          <td>False</td>
          <td>2.0</td>
          <td>True</td>
          <td>misty</td>
          <td>14.76</td>
          <td>18.940</td>
          <td>0.71</td>
          <td>0.0000</td>
        </tr>
        <tr>
          <th>...</th>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
        </tr>
        <tr>
          <th>17374</th>
          <td>spring</td>
          <td>1.0</td>
          <td>12.0</td>
          <td>19.0</td>
          <td>False</td>
          <td>1.0</td>
          <td>True</td>
          <td>misty</td>
          <td>10.66</td>
          <td>12.880</td>
          <td>0.60</td>
          <td>11.0014</td>
        </tr>
        <tr>
          <th>17375</th>
          <td>spring</td>
          <td>1.0</td>
          <td>12.0</td>
          <td>20.0</td>
          <td>False</td>
          <td>1.0</td>
          <td>True</td>
          <td>misty</td>
          <td>10.66</td>
          <td>12.880</td>
          <td>0.60</td>
          <td>11.0014</td>
        </tr>
        <tr>
          <th>17376</th>
          <td>spring</td>
          <td>1.0</td>
          <td>12.0</td>
          <td>21.0</td>
          <td>False</td>
          <td>1.0</td>
          <td>True</td>
          <td>clear</td>
          <td>10.66</td>
          <td>12.880</td>
          <td>0.60</td>
          <td>11.0014</td>
        </tr>
        <tr>
          <th>17377</th>
          <td>spring</td>
          <td>1.0</td>
          <td>12.0</td>
          <td>22.0</td>
          <td>False</td>
          <td>1.0</td>
          <td>True</td>
          <td>clear</td>
          <td>10.66</td>
          <td>13.635</td>
          <td>0.56</td>
          <td>8.9981</td>
        </tr>
        <tr>
          <th>17378</th>
          <td>spring</td>
          <td>1.0</td>
          <td>12.0</td>
          <td>23.0</td>
          <td>False</td>
          <td>1.0</td>
          <td>True</td>
          <td>clear</td>
          <td>10.66</td>
          <td>13.635</td>
          <td>0.65</td>
          <td>8.9981</td>
        </tr>
      </tbody>
    </table>
    <p>1000 rows × 12 columns</p>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 160-162

.. code-block:: default

    X.iloc[train_4]


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>season</th>
          <th>year</th>
          <th>month</th>
          <th>hour</th>
          <th>holiday</th>
          <th>weekday</th>
          <th>workingday</th>
          <th>weather</th>
          <th>temp</th>
          <th>feel_temp</th>
          <th>humidity</th>
          <th>windspeed</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>6331</th>
          <td>winter</td>
          <td>0.0</td>
          <td>9.0</td>
          <td>9.0</td>
          <td>False</td>
          <td>1.0</td>
          <td>True</td>
          <td>misty</td>
          <td>26.24</td>
          <td>28.790</td>
          <td>0.89</td>
          <td>12.9980</td>
        </tr>
        <tr>
          <th>6332</th>
          <td>winter</td>
          <td>0.0</td>
          <td>9.0</td>
          <td>10.0</td>
          <td>False</td>
          <td>1.0</td>
          <td>True</td>
          <td>misty</td>
          <td>26.24</td>
          <td>28.790</td>
          <td>0.89</td>
          <td>12.9980</td>
        </tr>
        <tr>
          <th>6333</th>
          <td>winter</td>
          <td>0.0</td>
          <td>9.0</td>
          <td>11.0</td>
          <td>False</td>
          <td>1.0</td>
          <td>True</td>
          <td>clear</td>
          <td>27.88</td>
          <td>31.820</td>
          <td>0.79</td>
          <td>15.0013</td>
        </tr>
        <tr>
          <th>6334</th>
          <td>winter</td>
          <td>0.0</td>
          <td>9.0</td>
          <td>12.0</td>
          <td>False</td>
          <td>1.0</td>
          <td>True</td>
          <td>misty</td>
          <td>27.88</td>
          <td>31.820</td>
          <td>0.79</td>
          <td>11.0014</td>
        </tr>
        <tr>
          <th>6335</th>
          <td>winter</td>
          <td>0.0</td>
          <td>9.0</td>
          <td>13.0</td>
          <td>False</td>
          <td>1.0</td>
          <td>True</td>
          <td>misty</td>
          <td>28.70</td>
          <td>33.335</td>
          <td>0.74</td>
          <td>11.0014</td>
        </tr>
        <tr>
          <th>...</th>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
        </tr>
        <tr>
          <th>16326</th>
          <td>winter</td>
          <td>1.0</td>
          <td>11.0</td>
          <td>0.0</td>
          <td>False</td>
          <td>0.0</td>
          <td>False</td>
          <td>misty</td>
          <td>12.30</td>
          <td>15.150</td>
          <td>0.70</td>
          <td>11.0014</td>
        </tr>
        <tr>
          <th>16327</th>
          <td>winter</td>
          <td>1.0</td>
          <td>11.0</td>
          <td>1.0</td>
          <td>False</td>
          <td>0.0</td>
          <td>False</td>
          <td>clear</td>
          <td>12.30</td>
          <td>14.395</td>
          <td>0.70</td>
          <td>12.9980</td>
        </tr>
        <tr>
          <th>16328</th>
          <td>winter</td>
          <td>1.0</td>
          <td>11.0</td>
          <td>2.0</td>
          <td>False</td>
          <td>0.0</td>
          <td>False</td>
          <td>clear</td>
          <td>11.48</td>
          <td>14.395</td>
          <td>0.81</td>
          <td>7.0015</td>
        </tr>
        <tr>
          <th>16329</th>
          <td>winter</td>
          <td>1.0</td>
          <td>11.0</td>
          <td>3.0</td>
          <td>False</td>
          <td>0.0</td>
          <td>False</td>
          <td>misty</td>
          <td>12.30</td>
          <td>15.150</td>
          <td>0.81</td>
          <td>11.0014</td>
        </tr>
        <tr>
          <th>16330</th>
          <td>winter</td>
          <td>1.0</td>
          <td>11.0</td>
          <td>4.0</td>
          <td>False</td>
          <td>0.0</td>
          <td>False</td>
          <td>misty</td>
          <td>12.30</td>
          <td>14.395</td>
          <td>0.81</td>
          <td>12.9980</td>
        </tr>
      </tbody>
    </table>
    <p>10000 rows × 12 columns</p>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 163-182

All is well. We are now ready to do some predictive modeling!

Gradient Boosting
-----------------

Gradient Boosting Regression with decision trees is often flexible enough to
efficiently handle heteorogenous tabular data with a mix of categorical and
numerical features as long as the number of samples is large enough.

Here, we do minimal ordinal encoding for the categorical variables and then
let the model know that it should treat those as categorical variables by
using a dedicated tree splitting rule. Since we use an ordinal encoder, we
pass the list of categorical values explicitly to use a logical order when
encoding the categories as integers instead of the lexicographical order.
This also has the added benefit of preventing any issue with unknown
categories when using cross-validation.

The numerical variables need no preprocessing and, for the sake of simplicity,
we only try the default hyper-parameters for this model:

.. GENERATED FROM PYTHON SOURCE LINES 182-216

.. code-block:: default

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.model_selection import cross_validate


    categorical_columns = [
        "weather",
        "season",
        "holiday",
        "workingday",
    ]
    categories = [
        ["clear", "misty", "rain"],
        ["spring", "summer", "fall", "winter"],
        ["False", "True"],
        ["False", "True"],
    ]
    ordinal_encoder = OrdinalEncoder(categories=categories)


    gbrt_pipeline = make_pipeline(
        ColumnTransformer(
            transformers=[
                ("categorical", ordinal_encoder, categorical_columns),
            ],
            remainder="passthrough",
        ),
        HistGradientBoostingRegressor(
            categorical_features=range(4),
        ),
    )


.. GENERATED FROM PYTHON SOURCE LINES 217-219

Lets evaluate our gradient boosting model with the mean absolute error of the
relative demand averaged across our 5 time-based cross-validation splits:

.. GENERATED FROM PYTHON SOURCE LINES 220-240

.. code-block:: default


    def evaluate(model, X, y, cv):
        cv_results = cross_validate(
            model,
            X,
            y,
            cv=cv,
            scoring=["neg_mean_absolute_error", "neg_root_mean_squared_error"],
        )
        mae = -cv_results["test_neg_mean_absolute_error"]
        rmse = -cv_results["test_neg_root_mean_squared_error"]
        print(
            f"Mean Absolute Error:     {mae.mean():.3f} +/- {mae.std():.3f}\n"
            f"Root Mean Squared Error: {rmse.mean():.3f} +/- {rmse.std():.3f}"
        )


    evaluate(gbrt_pipeline, X, y, cv=ts_cv)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Mean Absolute Error:     0.044 +/- 0.003
    Root Mean Squared Error: 0.068 +/- 0.005


.. GENERATED FROM PYTHON SOURCE LINES 241-258

This model has an average error around 4 to 5% of the maximum demand. This is
quite good for a first trial without any hyper-parameter tuning! We just had
to make the categorical variables explicit. Note that the time related
features are passed as is, i.e. without processing them. But this is not much
of a problem for tree-based models as they can learn a non-monotonic
relationship between ordinal input features and the target.

This is not the case for linear regression models as we will see in the
following.

Naive linear regression
-----------------------

As usual for linear models, categorical variables need to be one-hot encoded.
For consistency, we scale the numerical features to the same 0-1 range using
class:`sklearn.preprocessing.MinMaxScaler`, although in this case it does not
impact the results much because they are already on comparable scales:

.. GENERATED FROM PYTHON SOURCE LINES 258-280

.. code-block:: default

    from sklearn.preprocessing import OneHotEncoder
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.linear_model import RidgeCV
    import numpy as np


    one_hot_encoder = OneHotEncoder(handle_unknown="ignore", sparse=False)
    alphas = np.logspace(-6, 6, 25)
    naive_linear_pipeline = make_pipeline(
        ColumnTransformer(
            transformers=[
                ("categorical", one_hot_encoder, categorical_columns),
            ],
            remainder=MinMaxScaler(),
        ),
        RidgeCV(alphas=alphas),
    )


    evaluate(naive_linear_pipeline, X, y, cv=ts_cv)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Mean Absolute Error:     0.142 +/- 0.014
    Root Mean Squared Error: 0.184 +/- 0.020


.. GENERATED FROM PYTHON SOURCE LINES 281-307

The performance is not good: the average error is around 14% of the maximum
demand. This is more than three times higher than the average error of the
gradient boosting model. We can suspect that the naive original encoding
(merely min-max scaled) of the periodic time-related features might prevent
the linear regression model to properly leverage the time information: linear
regression does not automatically model non-monotonic relationships between
the input features and the target. Non-linear terms have to be engineered in
the input.

For example, the raw numerical encoding of the `"hour"` feature prevents the
linear model from recognizing that an increase of hour in the morning from 6
to 8 should have a strong positive impact on the number of bike rentals while
an increase of similar magnitude in the evening from 18 to 20 should have a
strong negative impact on the predicted number of bike rentals.

Time-steps as categories
------------------------

Since the time features are encoded in a discrete manner using integers (24
unique values in the "hours" feature), we could decide to treat those as
categorical variables using a one-hot encoding and thereby ignore any
assumption implied by the ordering of the hour values.

Using one-hot encoding for the time features gives the linear model a lot
more flexibility as we introduce one additional feature per discrete time
level.

.. GENERATED FROM PYTHON SOURCE LINES 308-321

.. code-block:: default

    one_hot_linear_pipeline = make_pipeline(
        ColumnTransformer(
            transformers=[
                ("categorical", one_hot_encoder, categorical_columns),
                ("one_hot_time", one_hot_encoder, ["hour", "weekday", "month"]),
            ],
            remainder=MinMaxScaler(),
        ),
        RidgeCV(alphas=alphas),
    )

    evaluate(one_hot_linear_pipeline, X, y, cv=ts_cv)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Mean Absolute Error:     0.099 +/- 0.011
    Root Mean Squared Error: 0.131 +/- 0.011


.. GENERATED FROM PYTHON SOURCE LINES 322-351

The average error rate of this model is 10% which is much better than using
the original (ordinal) encoding of the time feature, confirming our intuition
that the linear regression model benefits from the added flexibility to not
treat time progression in a monotonic manner.

However, this introduces a very large number of new features. If the time of
the day was represented in minutes since the start of the day instead of
hours, one-hot encoding would have introduced 1440 features instead of 24.
This could cause some significant overfitting. To avoid this we could use
:func:`sklearn.preprocessing.KBinsDiscretizer` instead to re-bin the number
of levels of fine-grained ordinal or numerical variables while still
benefitting from the non-monotonic expressivity advantages of one-hot
encoding.

Finally, we also observe that one-hot encoding completely ignores the
ordering of the hour levels while this could be an interesting inductive bias
to preserve to some level. In the following we try to explore smooth,
non-monotonic encoding that locally preserves the relative ordering of time
features.

Trigonometric features
----------------------

As a first attempt, we can try to encode each of those periodic features
using a sine and cosine transformation with the matching period.

Each ordinal time feature is transformed into 2 features that together encode
equivalent information in a non-monotonic way, and more importantly without
any jump between the first and the last value of the periodic range.

.. GENERATED FROM PYTHON SOURCE LINES 351-362

.. code-block:: default

    from sklearn.preprocessing import FunctionTransformer


    def sin_transformer(period):
        return FunctionTransformer(lambda x: np.sin(x / period * 2 * np.pi))


    def cos_transformer(period):
        return FunctionTransformer(lambda x: np.cos(x / period * 2 * np.pi))


.. GENERATED FROM PYTHON SOURCE LINES 363-365

Let us visualize the effect of this feature expansion on some synthetic hour
data with a bit of extrapolation beyond hour=23:

.. GENERATED FROM PYTHON SOURCE LINES 366-377

.. code-block:: default

    import pandas as pd

    hour_df = pd.DataFrame(
        np.arange(26).reshape(-1, 1),
        columns=["hour"],
    )
    hour_df["hour_sin"] = sin_transformer(24).fit_transform(hour_df)["hour"]
    hour_df["hour_cos"] = cos_transformer(24).fit_transform(hour_df)["hour"]
    hour_df.plot(x="hour")
    _ = plt.title("Trigonometric encoding for the 'hour' feature")


.. image-sg:: /auto_examples/applications/images/sphx_glr_plot_cyclical_feature_engineering_003.png
   :alt: Trigonometric encoding for the 'hour' feature
   :srcset: /auto_examples/applications/images/sphx_glr_plot_cyclical_feature_engineering_003.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 378-383

Let's use a 2D scatter plot with the hours encoded as colors to better see
how this representation maps the 24 hours of the day to a 2D space, akin to
some sort of a 24 hour version of an analog clock. Note that the "25th" hour
is mapped back to the 1st hour because of the periodic nature of the
sine/cosine representation.

.. GENERATED FROM PYTHON SOURCE LINES 384-392

.. code-block:: default

    fig, ax = plt.subplots(figsize=(7, 5))
    sp = ax.scatter(hour_df["hour_sin"], hour_df["hour_cos"], c=hour_df["hour"])
    ax.set(
        xlabel="sin(hour)",
        ylabel="cos(hour)",
    )
    _ = fig.colorbar(sp)


.. image-sg:: /auto_examples/applications/images/sphx_glr_plot_cyclical_feature_engineering_004.png
   :alt: plot cyclical feature engineering
   :srcset: /auto_examples/applications/images/sphx_glr_plot_cyclical_feature_engineering_004.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 393-394

We can now build a feature extraction pipeline using this strategy:

.. GENERATED FROM PYTHON SOURCE LINES 395-414

.. code-block:: default

    cyclic_cossin_transformer = ColumnTransformer(
        transformers=[
            ("categorical", one_hot_encoder, categorical_columns),
            ("month_sin", sin_transformer(12), ["month"]),
            ("month_cos", cos_transformer(12), ["month"]),
            ("weekday_sin", sin_transformer(7), ["weekday"]),
            ("weekday_cos", cos_transformer(7), ["weekday"]),
            ("hour_sin", sin_transformer(24), ["hour"]),
            ("hour_cos", cos_transformer(24), ["hour"]),
        ],
        remainder=MinMaxScaler(),
    )
    cyclic_cossin_linear_pipeline = make_pipeline(
        cyclic_cossin_transformer,
        RidgeCV(alphas=alphas),
    )
    evaluate(cyclic_cossin_linear_pipeline, X, y, cv=ts_cv)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Mean Absolute Error:     0.125 +/- 0.014
    Root Mean Squared Error: 0.166 +/- 0.020


.. GENERATED FROM PYTHON SOURCE LINES 415-427

The performance of our linear regression model with this simple feature
engineering is a bit better than using the original ordinal time features but
worse than using the one-hot encoded time features. We will further analyze
possible reasons for this disappointing outcome at the end of this notebook.

Periodic spline features
------------------------

We can try an alternative encoding of the periodic time-related features
using spline transformations with a large enough number of splines, and as a
result a larger number of expanded features compared to the sine/cosine
transformation:

.. GENERATED FROM PYTHON SOURCE LINES 428-444

.. code-block:: default

    from sklearn.preprocessing import SplineTransformer


    def periodic_spline_transformer(period, n_splines=None, degree=3):
        if n_splines is None:
            n_splines = period
        n_knots = n_splines + 1  # periodic and include_bias is True
        return SplineTransformer(
            degree=degree,
            n_knots=n_knots,
            knots=np.linspace(0, period, n_knots).reshape(n_knots, 1),
            extrapolation="periodic",
            include_bias=True,
        )


.. GENERATED FROM PYTHON SOURCE LINES 445-447

Again, let us visualize the effect of this feature expansion on some
synthetic hour data with a bit of extrapolation beyond hour=23:

.. GENERATED FROM PYTHON SOURCE LINES 448-461

.. code-block:: default

    hour_df = pd.DataFrame(
        np.linspace(0, 26, 1000).reshape(-1, 1),
        columns=["hour"],
    )
    splines = periodic_spline_transformer(24, n_splines=12).fit_transform(hour_df)
    splines_df = pd.DataFrame(
        splines,
        columns=[f"spline_{i}" for i in range(splines.shape[1])],
    )
    pd.concat([hour_df, splines_df], axis="columns").plot(x="hour", cmap=plt.cm.tab20b)
    _ = plt.title("Periodic spline-based encoding for the 'hour' feature")


.. image-sg:: /auto_examples/applications/images/sphx_glr_plot_cyclical_feature_engineering_005.png
   :alt: Periodic spline-based encoding for the 'hour' feature
   :srcset: /auto_examples/applications/images/sphx_glr_plot_cyclical_feature_engineering_005.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 462-471

Thanks to the use of the `extrapolation="periodic"` parameter, we observe
that the feature encoding stays smooth when extrapolating beyond midnight.

We can now build a predictive pipeline using this alternative periodic
feature engineering strategy.

It is possible to use fewer splines than discrete levels for those ordinal
values. This makes spline-based encoding more efficient than one-hot encoding
while preserving most of the expressivity:

.. GENERATED FROM PYTHON SOURCE LINES 471-486

.. code-block:: default

    cyclic_spline_transformer = ColumnTransformer(
        transformers=[
            ("categorical", one_hot_encoder, categorical_columns),
            ("cyclic_month", periodic_spline_transformer(12, n_splines=6), ["month"]),
            ("cyclic_weekday", periodic_spline_transformer(7, n_splines=3), ["weekday"]),
            ("cyclic_hour", periodic_spline_transformer(24, n_splines=12), ["hour"]),
        ],
        remainder=MinMaxScaler(),
    )
    cyclic_spline_linear_pipeline = make_pipeline(
        cyclic_spline_transformer,
        RidgeCV(alphas=alphas),
    )
    evaluate(cyclic_spline_linear_pipeline, X, y, cv=ts_cv)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Mean Absolute Error:     0.097 +/- 0.011
    Root Mean Squared Error: 0.132 +/- 0.013


.. GENERATED FROM PYTHON SOURCE LINES 487-500

Spline features make it possible for the linear model to successfully
leverage the periodic time-related features and reduce the error from ~14% to
~10% of the maximum demand, which is similar to what we observed with the
one-hot encoded features.

Qualitative analysis of the impact of features on linear model predictions
--------------------------------------------------------------------------

Here, we want to visualize the impact of the feature engineering choices on
the time related shape of the predictions.

To do so we consider an arbitrary time-based split to compare the predictions
on a range of held out data points.

.. GENERATED FROM PYTHON SOURCE LINES 500-512

.. code-block:: default

    naive_linear_pipeline.fit(X.iloc[train_0], y.iloc[train_0])
    naive_linear_predictions = naive_linear_pipeline.predict(X.iloc[test_0])

    one_hot_linear_pipeline.fit(X.iloc[train_0], y.iloc[train_0])
    one_hot_linear_predictions = one_hot_linear_pipeline.predict(X.iloc[test_0])

    cyclic_cossin_linear_pipeline.fit(X.iloc[train_0], y.iloc[train_0])
    cyclic_cossin_linear_predictions = cyclic_cossin_linear_pipeline.predict(X.iloc[test_0])

    cyclic_spline_linear_pipeline.fit(X.iloc[train_0], y.iloc[train_0])
    cyclic_spline_linear_predictions = cyclic_spline_linear_pipeline.predict(X.iloc[test_0])


.. GENERATED FROM PYTHON SOURCE LINES 513-515

We visualize those predictions by zooming on the last 96 hours (4 days) of
the test set to get some qualitative insights:

.. GENERATED FROM PYTHON SOURCE LINES 515-543

.. code-block:: default

    last_hours = slice(-96, None)
    fig, ax = plt.subplots(figsize=(12, 4))
    fig.suptitle("Predictions by linear models")
    ax.plot(
        y.iloc[test_0].values[last_hours],
        "x-",
        alpha=0.2,
        label="Actual demand",
        color="black",
    )
    ax.plot(naive_linear_predictions[last_hours], "x-", label="Ordinal time features")
    ax.plot(
        cyclic_cossin_linear_predictions[last_hours],
        "x-",
        label="Trigonometric time features",
    )
    ax.plot(
        cyclic_spline_linear_predictions[last_hours],
        "x-",
        label="Spline-based time features",
    )
    ax.plot(
        one_hot_linear_predictions[last_hours],
        "x-",
        label="One-hot time features",
    )
    _ = ax.legend()


.. image-sg:: /auto_examples/applications/images/sphx_glr_plot_cyclical_feature_engineering_006.png
   :alt: Predictions by linear models
   :srcset: /auto_examples/applications/images/sphx_glr_plot_cyclical_feature_engineering_006.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 544-570

We can draw the following conclusions from the above plot:

- The **raw ordinal time-related features** are problematic because they do
  not capture the natural periodicity: we observe a big jump in the
  predictions at the end of each day when the hour features goes from 23 back
  to 0. We can expect similar artifacts at the end of each week or each year.

- As expected, the **trigonometric features** (sine and cosine) do not have
  these discontinuities at midnight, but the linear regression model fails to
  leverage those features to properly model intra-day variations.
  Using trigonometric features for higher harmonics or additional
  trigonometric features for the natural period with different phases could
  potentially fix this problem.

- the **periodic spline-based features** fix those two problems at once: they
  give more expressivity to the linear model by making it possible to focus
  on specific hours thanks to the use of 12 splines. Furthermore the
  `extrapolation="periodic"` option enforces a smooth representation between
  `hour=23` and `hour=0`.

- The **one-hot encoded features** behave similarly to the periodic
  spline-based features but are more spiky: for instance they can better
  model the morning peak during the week days since this peak lasts shorter
  than an hour. However, we will see in the following that what can be an
  advantage for linear models is not necessarily one for more expressive
  models.

.. GENERATED FROM PYTHON SOURCE LINES 572-574

We can also compare the number of features extracted by each feature
engineering pipeline:

.. GENERATED FROM PYTHON SOURCE LINES 574-576

.. code-block:: default

    naive_linear_pipeline[:-1].transform(X).shape


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    (17379, 19)


.. GENERATED FROM PYTHON SOURCE LINES 577-579

.. code-block:: default

    one_hot_linear_pipeline[:-1].transform(X).shape


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    (17379, 59)


.. GENERATED FROM PYTHON SOURCE LINES 580-582

.. code-block:: default

    cyclic_cossin_linear_pipeline[:-1].transform(X).shape


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    (17379, 22)


.. GENERATED FROM PYTHON SOURCE LINES 583-585

.. code-block:: default

    cyclic_spline_linear_pipeline[:-1].transform(X).shape


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    (17379, 37)


.. GENERATED FROM PYTHON SOURCE LINES 586-602

This confirms that the one-hot encoding and the spline encoding strategies
create a lot more features for the time representation than the alternatives,
which in turn gives the downstream linear model more flexibility (degrees of
freedom) to avoid underfitting.

Finally, we observe that none of the linear models can approximate the true
bike rentals demand, especially for the peaks that can be very sharp at rush
hours during the working days but much flatter during the week-ends: the most
accurate linear models based on splines or one-hot encoding tend to forecast
peaks of commuting-related bike rentals even on the week-ends and
under-estimate the commuting-related events during the working days.

These systematic prediction errors reveal a form of under-fitting and can be
explained by the lack of interactions terms between features, e.g.
"workingday" and features derived from "hours". This issue will be addressed
in the following section.

.. GENERATED FROM PYTHON SOURCE LINES 604-615

Modeling pairwise interactions with splines and polynomial features
-------------------------------------------------------------------

Linear models do not automatically capture interaction effects between input
features. It does not help that some features are marginally non-linear as is
the case with features constructed by `SplineTransformer` (or one-hot
encoding or binning).

However, it is possible to use the `PolynomialFeatures` class on coarse
grained spline encoded hours to model the "workingday"/"hours" interaction
explicitly without introducing too many new variables:

.. GENERATED FROM PYTHON SOURCE LINES 615-629

.. code-block:: default

    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.pipeline import FeatureUnion


    hour_workday_interaction = make_pipeline(
        ColumnTransformer(
            [
                ("cyclic_hour", periodic_spline_transformer(24, n_splines=8), ["hour"]),
                ("workingday", FunctionTransformer(lambda x: x == "True"), ["workingday"]),
            ]
        ),
        PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    )


.. GENERATED FROM PYTHON SOURCE LINES 630-633

Those features are then combined with the ones already computed in the
previous spline-base pipeline. We can observe a nice performance improvemnt
by modeling this pairwise interaction explicitly:

.. GENERATED FROM PYTHON SOURCE LINES 633-645

.. code-block:: default


    cyclic_spline_interactions_pipeline = make_pipeline(
        FeatureUnion(
            [
                ("marginal", cyclic_spline_transformer),
                ("interactions", hour_workday_interaction),
            ]
        ),
        RidgeCV(alphas=alphas),
    )
    evaluate(cyclic_spline_interactions_pipeline, X, y, cv=ts_cv)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Mean Absolute Error:     0.078 +/- 0.009
    Root Mean Squared Error: 0.104 +/- 0.009


.. GENERATED FROM PYTHON SOURCE LINES 646-662

Modeling non-linear feature interactions with kernels
-----------------------------------------------------

The previous analysis highlighted the need to model the interactions between
`"workingday"` and `"hours"`. Another example of a such a non-linear
interaction that we would like to model could be the impact of the rain that
might not be the same during the working days and the week-ends and holidays
for instance.

To model all such interactions, we could either use a polynomial expansion on
all marginal features at once, after their spline-based expansion. However,
this would create a quadratic number of features which can cause overfitting
and computational tractability issues.

Alternatively, we can use the Nyström method to compute an approximate
polynomial kernel expansion. Let us try the latter:

.. GENERATED FROM PYTHON SOURCE LINES 662-672

.. code-block:: default

    from sklearn.kernel_approximation import Nystroem


    cyclic_spline_poly_pipeline = make_pipeline(
        cyclic_spline_transformer,
        Nystroem(kernel="poly", degree=2, n_components=300, random_state=0),
        RidgeCV(alphas=alphas),
    )
    evaluate(cyclic_spline_poly_pipeline, X, y, cv=ts_cv)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Mean Absolute Error:     0.053 +/- 0.002
    Root Mean Squared Error: 0.076 +/- 0.004


.. GENERATED FROM PYTHON SOURCE LINES 673-683

We observe that this model can almost rival the performance of the gradient
boosted trees with an average error around 5% of the maximum demand.

Note that while the final step of this pipeline is a linear regression model,
the intermediate steps such as the spline feature extraction and the Nyström
kernel approximation are highly non-linear. As a result the compound pipeline
is much more expressive than a simple linear regression model with raw features.

For the sake of completeness, we also evaluate the combination of one-hot
encoding and kernel approximation:

.. GENERATED FROM PYTHON SOURCE LINES 684-699

.. code-block:: default


    one_hot_poly_pipeline = make_pipeline(
        ColumnTransformer(
            transformers=[
                ("categorical", one_hot_encoder, categorical_columns),
                ("one_hot_time", one_hot_encoder, ["hour", "weekday", "month"]),
            ],
            remainder="passthrough",
        ),
        Nystroem(kernel="poly", degree=2, n_components=300, random_state=0),
        RidgeCV(alphas=alphas),
    )
    evaluate(one_hot_poly_pipeline, X, y, cv=ts_cv)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Mean Absolute Error:     0.082 +/- 0.006
    Root Mean Squared Error: 0.111 +/- 0.011


.. GENERATED FROM PYTHON SOURCE LINES 700-709

While one-hot encoded features were competitive with spline-based features
when using linear models, this is no longer the case when using a low-rank
approximation of a non-linear kernel: this can be explained by the fact that
spline features are smoother and allow the kernel approximation to find a
more expressive decision function.

Let us now have a qualitative look at the predictions of the kernel models
and of the gradient boosted trees that should be able to better model
non-linear interactions between features:

.. GENERATED FROM PYTHON SOURCE LINES 709-718

.. code-block:: default

    gbrt_pipeline.fit(X.iloc[train_0], y.iloc[train_0])
    gbrt_predictions = gbrt_pipeline.predict(X.iloc[test_0])

    one_hot_poly_pipeline.fit(X.iloc[train_0], y.iloc[train_0])
    one_hot_poly_predictions = one_hot_poly_pipeline.predict(X.iloc[test_0])

    cyclic_spline_poly_pipeline.fit(X.iloc[train_0], y.iloc[train_0])
    cyclic_spline_poly_predictions = cyclic_spline_poly_pipeline.predict(X.iloc[test_0])


.. GENERATED FROM PYTHON SOURCE LINES 719-720

Again we zoom on the last 4 days of the test set:

.. GENERATED FROM PYTHON SOURCE LINES 720-749

.. code-block:: default


    last_hours = slice(-96, None)
    fig, ax = plt.subplots(figsize=(12, 4))
    fig.suptitle("Predictions by non-linear regression models")
    ax.plot(
        y.iloc[test_0].values[last_hours],
        "x-",
        alpha=0.2,
        label="Actual demand",
        color="black",
    )
    ax.plot(
        gbrt_predictions[last_hours],
        "x-",
        label="Gradient Boosted Trees",
    )
    ax.plot(
        one_hot_poly_predictions[last_hours],
        "x-",
        label="One-hot + polynomial kernel",
    )
    ax.plot(
        cyclic_spline_poly_predictions[last_hours],
        "x-",
        label="Splines + polynomial kernel",
    )
    _ = ax.legend()


.. image-sg:: /auto_examples/applications/images/sphx_glr_plot_cyclical_feature_engineering_007.png
   :alt: Predictions by non-linear regression models
   :srcset: /auto_examples/applications/images/sphx_glr_plot_cyclical_feature_engineering_007.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 750-771

First, note that trees can naturally model non-linear feature interactions
since, by default, decision trees are allowed to grow beyond a depth of 2
levels.

Here, we can observe that the combinations of spline features and non-linear
kernels works quite well and can almost rival the accuracy of the gradient
boosting regression trees.

On the contrary, one-hot encoded time features do not perform that well with
the low rank kernel model. In particular, they significantly over-estimate
the low demand hours more than the competing models.

We also observe that none of the models can successfully predict some of the
peak rentals at the rush hours during the working days. It is possible that
access to additional features would be required to further improve the
accuracy of the predictions. For instance, it could be useful to have access
to the geographical repartition of the fleet at any point in time or the
fraction of bikes that are immobilized because they need servicing.

Let us finally get a more quantative look at the prediction errors of those
three models using the true vs predicted demand scatter plots:

.. GENERATED FROM PYTHON SOURCE LINES 771-795

.. code-block:: default

    fig, axes = plt.subplots(ncols=3, figsize=(12, 4), sharey=True)
    fig.suptitle("Non-linear regression models")
    predictions = [
        one_hot_poly_predictions,
        cyclic_spline_poly_predictions,
        gbrt_predictions,
    ]
    labels = [
        "One hot + polynomial kernel",
        "Splines + polynomial kernel",
        "Gradient Boosted Trees",
    ]
    for ax, pred, label in zip(axes, predictions, labels):
        ax.scatter(y.iloc[test_0].values, pred, alpha=0.3, label=label)
        ax.plot([0, 1], [0, 1], "--", label="Perfect model")
        ax.set(
            xlim=(0, 1),
            ylim=(0, 1),
            xlabel="True demand",
            ylabel="Predicted demand",
        )
        ax.legend()

    plt.show()


.. image-sg:: /auto_examples/applications/images/sphx_glr_plot_cyclical_feature_engineering_008.png
   :alt: Non-linear regression models
   :srcset: /auto_examples/applications/images/sphx_glr_plot_cyclical_feature_engineering_008.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 796-832

This visualization confirms the conclusions we draw on the previous plot.

All models under-estimate the high demand events (working day rush hours),
but gradient boosting a bit less so. The low demand events are well predicted
on average by gradient boosting while the one-hot polynomial regression
pipeline seems to systematically over-estimate demand in that regime. Overall
the predictions of the gradient boosted trees are closer to the diagonal than
for the kernel models.

Concluding remarks
------------------

We note that we could have obtained slightly better results for kernel models
by using more components (higher rank kernel approximation) at the cost of
longer fit and prediction durations. For large values of `n_components`, the
performance of the one-hot encoded features would even match the spline
features.

The `Nystroem` + `RidgeCV` regressor could also have been replaced by
:class:`~sklearn.neural_network.MLPRegressor` with one or two hidden layers
and we would have obtained quite similar results.

The dataset we used in this case study is sampled on a hourly basis. However
cyclic spline-based features could model time-within-day or time-within-week
very efficiently with finer-grained time resolutions (for instance with
measurements taken every minute instead of every hours) without introducing
more features. One-hot encoding time representations would not offer this
flexibility.

Finally, in this notebook we used `RidgeCV` because it is very efficient from
a computational point of view. However, it models the target variable as a
Gaussian random variable with constant variance. For positive regression
problems, it is likely that using a Poisson or Gamma distribution would make
more sense. This could be achieved by using
`GridSearchCV(TweedieRegressor(power=2), param_grid({"alpha": alphas}))`
instead of `RidgeCV`.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  21.211 seconds)


.. _sphx_glr_download_auto_examples_applications_plot_cyclical_feature_engineering.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/1.1.X?urlpath=lab/tree/notebooks/auto_examples/applications/plot_cyclical_feature_engineering.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_cyclical_feature_engineering.py <plot_cyclical_feature_engineering.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_cyclical_feature_engineering.ipynb <plot_cyclical_feature_engineering.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_