#### Note

The inner validation done during early stopping is not optimal for time series, because the held-out samples are drawn at random rather than from the most recent period.

\n\n## Support for missing values\n\nHGBT models have native support for missing values. During training, the tree\ngrower decides where samples with missing values should go (left or right\nchild) at each split, based on the potential gain. When predicting, these\nsamples are sent to the learnt child accordingly. If a feature had no missing\nvalues during training, then for prediction, samples with missing values for that\nfeature are sent to the child with the most samples (as seen during fit).\n\nThe present example shows how HGBT regressions deal with values missing\ncompletely at random (MCAR), i.e. the missingness does not depend on the\nobserved data or the unobserved data. We can simulate such a scenario by\nrandomly replacing values from randomly selected features with `nan` values.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n\nfrom sklearn.metrics import root_mean_squared_error\n\nrng = np.random.RandomState(42)\nfirst_week = slice(0, 336) # first week in the test set as 7 * 48 = 336\nmissing_fraction_list = [0, 0.01, 0.03]\n\n\ndef generate_missing_values(X, missing_fraction):\n total_cells = X.shape[0] * X.shape[1]\n num_missing_cells = int(total_cells * missing_fraction)\n row_indices = rng.choice(X.shape[0], num_missing_cells, replace=True)\n col_indices = rng.choice(X.shape[1], num_missing_cells, replace=True)\n X_missing = X.copy()\n X_missing.iloc[row_indices, col_indices] = np.nan\n return X_missing\n\n\nfig, ax = plt.subplots(figsize=(12, 6))\nax.plot(y_test.values[first_week], label=\"Actual transfer\")\n\nfor missing_fraction in missing_fraction_list:\n X_train_missing = generate_missing_values(X_train, missing_fraction)\n X_test_missing = generate_missing_values(X_test, missing_fraction)\n hgbt.fit(X_train_missing, y_train)\n y_pred = hgbt.predict(X_test_missing[first_week])\n rmse = root_mean_squared_error(y_test[first_week], y_pred)\n ax.plot(\n y_pred[first_week],\n label=f\"missing_fraction={missing_fraction}, RMSE={rmse:.3f}\",\n alpha=0.5,\n )\nax.set(\n title=\"Daily energy transfer predictions on data with MCAR values\",\n xticks=[(i + 0.2) * 48 for i in range(7)],\n xticklabels=[\"Mon\", \"Tue\", \"Wed\", \"Thu\", \"Fri\", \"Sat\", \"Sun\"],\n xlabel=\"Time of the week\",\n ylabel=\"Normalized energy transfer\",\n)\n_ = ax.legend(loc=\"lower right\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As expected, the model degrades as the proportion of missing values increases.\n\n## Support for quantile loss\n\nThe quantile loss in regression enables a view of the variability or\nuncertainty of the target variable. For instance, predicting the 5th and 95th\npercentiles can provide a 90% prediction interval, i.e. the range within which\nwe expect a new observed value to fall with 90% probability.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.metrics import mean_pinball_loss\n\nquantiles = [0.95, 0.05]\npredictions = []\n\nfig, ax = plt.subplots(figsize=(12, 6))\nax.plot(y_test.values[first_week], label=\"Actual transfer\")\n\nfor quantile in quantiles:\n hgbt_quantile = HistGradientBoostingRegressor(\n loss=\"quantile\", quantile=quantile, **common_params\n )\n hgbt_quantile.fit(X_train, y_train)\n y_pred = hgbt_quantile.predict(X_test[first_week])\n\n predictions.append(y_pred)\n score = mean_pinball_loss(y_test[first_week], y_pred)\n ax.plot(\n y_pred[first_week],\n label=f\"quantile={quantile}, pinball loss={score:.2f}\",\n alpha=0.5,\n )\n\nax.fill_between(\n range(len(predictions[0][first_week])),\n predictions[0][first_week],\n predictions[1][first_week],\n color=colors[0],\n alpha=0.1,\n)\nax.set(\n title=\"Daily energy transfer predictions with quantile loss\",\n xticks=[(i + 0.2) * 48 for i in range(7)],\n xticklabels=[\"Mon\", \"Tue\", \"Wed\", \"Thu\", \"Fri\", \"Sat\", \"Sun\"],\n xlabel=\"Time of the week\",\n ylabel=\"Normalized energy transfer\",\n)\n_ = ax.legend(loc=\"lower right\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We observe a tendence to over-estimate the energy transfer. This could be be\nquantitatively confirmed by computing empirical coverage numbers as done in\nthe `calibration of confidence intervals section