{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Ridge coefficients as a function of the L2 Regularization\n\nA model that overfits learns the training data too well, capturing both the\nunderlying patterns and the noise in the data. However, when applied to unseen\ndata, the learned associations may not hold. We normally detect this when we\napply our trained predictions to the test data and see the statistical\nperformance drop significantly compared to the training data.\n\nOne way to overcome overfitting is through regularization, which can be done by\npenalizing large weights (coefficients) in linear models, forcing the model to\nshrink all coefficients. Regularization reduces a model's reliance on specific\ninformation obtained from the training samples.\n\nThis example illustrates how L2 regularization in a\n:class:`~sklearn.linear_model.Ridge` regression affects a model's performance by\nadding a penalty term to the loss that increases with the coefficients\n$\\beta$.\n\nThe regularized loss function is given by: $\\mathcal{L}(X, y, \\beta) =\n\\| y - X \\beta \\|^{2}_{2} + \\alpha \\| \\beta \\|^{2}_{2}$\n\nwhere $X$ is the input data, $y$ is the target variable,\n$\\beta$ is the vector of coefficients associated with the features, and\n$\\alpha$ is the regularization strength.\n\nThe regularized loss function aims to balance the trade-off between accurately\npredicting the training set and to prevent overfitting.\n\nIn this regularized loss, the left-hand side (e.g. $\\|y -\nX\\beta\\|^{2}_{2}$) measures the squared difference between the actual target\nvariable, $y$, and the predicted values. Minimizing this term alone could\nlead to overfitting, as the model may become too complex and sensitive to noise\nin the training data.\n\nTo address overfitting, Ridge regularization adds a constraint, called a penalty\nterm, ($\\alpha \\| \\beta\\|^{2}_{2}$) to the loss function. This penalty\nterm is the sum of the squares of the model's coefficients, multiplied by the\nregularization strength $\\alpha$. By introducing this constraint, Ridge\nregularization discourages any single coefficient $\\beta_{i}$ from taking\nan excessively large value and encourages smaller and more evenly distributed\ncoefficients. Higher values of $\\alpha$ force the coefficients towards\nzero. However, an excessively high $\\alpha$ can result in an underfit\nmodel that fails to capture important patterns in the data.\n\nTherefore, the regularized loss function combines the prediction accuracy term\nand the penalty term. By adjusting the regularization strength, practitioners\ncan fine-tune the degree of constraint imposed on the weights, training a model\ncapable of generalizing well to unseen data while avoiding overfitting.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Author: Kornel Kielczewski -- <kornel.k@plusnet.pl>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Purpose of this example\nFor the purpose of showing how Ridge regularization works, we will create a\nnon-noisy data set. Then we will train a regularized model on a range of\nregularization strengths ($\\alpha$) and plot how the trained\ncoefficients and the mean squared error between those and the original values\nbehave as functions of the regularization strength.\n\n### Creating a non-noisy data set\nWe make a toy data set with 100 samples and 10 features, that's suitable to\ndetect regression. Out of the 10 features, 8 are informative and contribute to\nthe regression, while the remaining 2 features do not have any effect on the\ntarget variable (their true coefficients are 0). Please note that in this\nexample the data is non-noisy, hence we can expect our regression model to\nrecover exactly the true coefficients w.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.datasets import make_regression\n\nX, y, w = make_regression(\n    n_samples=100, n_features=10, n_informative=8, coef=True, random_state=1\n)\n\n# Obtain the true coefficients\nprint(f\"The true coefficient of this regression problem are:\\n{w}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Training the Ridge Regressor\nWe use :class:`~sklearn.linear_model.Ridge`, a linear model with L2\nregularization. We train several models, each with a different value for the\nmodel parameter `alpha`, which is a positive constant that multiplies the\npenalty term, controlling the regularization strength. For each trained model\nwe then compute the error between the true coefficients `w` and the\ncoefficients found by the model `clf`. We store the identified coefficients\nand the calculated errors for the corresponding coefficients in lists, which\nmakes it convenient for us to plot them.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\n\nfrom sklearn.linear_model import Ridge\nfrom sklearn.metrics import mean_squared_error\n\nclf = Ridge()\n\n# Generate values for `alpha` that are evenly distributed on a logarithmic scale\nalphas = np.logspace(-3, 4, 200)\ncoefs = []\nerrors_coefs = []\n\n# Train the model with different regularisation strengths\nfor a in alphas:\n    clf.set_params(alpha=a).fit(X, y)\n    coefs.append(clf.coef_)\n    errors_coefs.append(mean_squared_error(clf.coef_, w))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Plotting trained Coefficients and Mean Squared Errors\nWe now plot the 10 different regularized coefficients as a function of the\nregularization parameter `alpha` where each color represents a different\ncoefficient.\n\nOn the right-hand-side, we plot how the errors of the coefficients from the\nestimator change as a function of regularization.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\nimport pandas as pd\n\nalphas = pd.Index(alphas, name=\"alpha\")\ncoefs = pd.DataFrame(coefs, index=alphas, columns=[f\"Feature {i}\" for i in range(10)])\nerrors = pd.Series(errors_coefs, index=alphas, name=\"Mean squared error\")\n\nfig, axs = plt.subplots(1, 2, figsize=(20, 6))\n\ncoefs.plot(\n    ax=axs[0],\n    logx=True,\n    title=\"Ridge coefficients as a function of the regularization strength\",\n)\naxs[0].set_ylabel(\"Ridge coefficient values\")\nerrors.plot(\n    ax=axs[1],\n    logx=True,\n    title=\"Coefficient error as a function of the regularization strength\",\n)\n_ = axs[1].set_ylabel(\"Mean squared error\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Interpreting the plots\nThe plot on the left-hand side shows how the regularization strength (`alpha`)\naffects the Ridge regression coefficients. Smaller values of `alpha` (weak\nregularization), allow the coefficients to closely resemble the true\ncoefficients (`w`) used to generate the data set. This is because no\nadditional noise was added to our artificial data set. As `alpha` increases,\nthe coefficients shrink towards zero, gradually reducing the impact of the\nfeatures that were formerly more significant.\n\nThe right-hand side plot shows the mean squared error (MSE) between the\ncoefficients found by the model and the true coefficients (`w`). It provides a\nmeasure that relates to how exact our ridge model is in comparison to the true\ngenerative model. A low error means that it found coefficients closer to the\nones of the true generative model. In this case, since our toy data set was\nnon-noisy, we can see that the least regularized model retrieves coefficients\nclosest to the true coefficients (`w`) (error is close to 0).\n\nWhen `alpha` is small, the model captures the intricate details of the\ntraining data, whether those were caused by noise or by actual information. As\n`alpha` increases, the highest coefficients shrink more rapidly, rendering\ntheir corresponding features less influential in the training process. This\ncan enhance a model's ability to generalize to unseen data (if there was a lot\nof noise to capture), but it also poses the risk of losing performance if the\nregularization becomes too strong compared to the amount of noise the data\ncontained (as in this example).\n\nIn real-world scenarios where data typically includes noise, selecting an\nappropriate `alpha` value becomes crucial in striking a balance between an\noverfitting and an underfitting model.\n\nHere, we saw that :class:`~sklearn.linear_model.Ridge` adds a penalty to the\ncoefficients to fight overfitting. Another problem that occurs is linked to\nthe presence of outliers in the training dataset. An outlier is a data point\nthat differs significantly from other observations. Concretely, these outliers\nimpact the left-hand side term of the loss function that we showed earlier.\nSome other linear models are formulated to be robust to outliers such as the\n:class:`~sklearn.linear_model.HuberRegressor`. You can learn more about it in\nthe `sphx_glr_auto_examples_linear_model_plot_huber_vs_ridge.py` example.\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.19"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}