.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/linear_model/plot_lasso_model_selection.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_linear_model_plot_lasso_model_selection.py>`
        to download the full example code or to run this example in your browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_linear_model_plot_lasso_model_selection.py:

=================================================
Lasso model selection: AIC-BIC / cross-validation
=================================================

This example focuses on model selection for Lasso models, that is, linear
models with an L1 penalty for regression problems. Several strategies can be
used to select the value of the regularization parameter: cross-validation or
an information criterion, namely the AIC or the BIC. In what follows, we
discuss the different strategies in detail.
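For reference (the formula is not spelled out in the original example), all
estimators used below minimize, in scikit-learn's parametrization, the Lasso
objective

.. math::

    \min_{w} \; \frac{1}{2 n_{\text{samples}}} \|y - X w\|_2^2 + \alpha \|w\|_1,

and the question addressed here is how to choose the regularization strength
:math:`\alpha`.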
.. GENERATED FROM PYTHON SOURCE LINES 15-22

.. code-block:: default


    # Author: Olivier Grisel
    #         Gael Varoquaux
    #         Alexandre Gramfort
    #         Guillaume Lemaitre
    # License: BSD 3 clause

.. GENERATED FROM PYTHON SOURCE LINES 23-27

.. code-block:: default


    import sklearn

    # Display estimators as HTML diagrams when rendered in a notebook.
    sklearn.set_config(display="diagram")

.. GENERATED FROM PYTHON SOURCE LINES 28-31

Dataset
-------

In this example, we will use the diabetes dataset.

.. GENERATED FROM PYTHON SOURCE LINES 31-36

.. code-block:: default


    from sklearn.datasets import load_diabetes

    X, y = load_diabetes(return_X_y=True, as_frame=True)
    X.head()

*(Output: the first five rows of the ten original features:* ``age``, ``sex``,
``bmi``, ``bp``, ``s1`` *to* ``s6``.)*

.. GENERATED FROM PYTHON SOURCE LINES 37-39

In addition, we add some random features to the original data to better
illustrate the feature selection performed by the Lasso model.

.. GENERATED FROM PYTHON SOURCE LINES 39-52

.. code-block:: default


    import numpy as np
    import pandas as pd

    rng = np.random.RandomState(42)
    n_random_features = 14
    X_random = pd.DataFrame(
        rng.randn(X.shape[0], n_random_features),
        columns=[f"random_{i:02d}" for i in range(n_random_features)],
    )
    X = pd.concat([X, X_random], axis=1)
    # Show only a subset of the columns
    X[X.columns[::3]].head()

*(Output: the first five rows of every third column:* ``age``, ``bp``, ``s3``,
``s6``, ``random_02``, ``random_05``, ``random_08``, ``random_11``.)*

.. GENERATED FROM PYTHON SOURCE LINES 53-66

Selecting Lasso via an information criterion
--------------------------------------------

:class:`~sklearn.linear_model.LassoLarsIC` provides a Lasso estimator that
uses the Akaike information criterion (AIC) or the Bayesian information
criterion (BIC) to select the optimal value of the regularization parameter
alpha.

Before fitting the model, we will standardize the data with a
:class:`~sklearn.preprocessing.StandardScaler`. In addition, we will measure
the time needed to fit and tune the hyperparameter alpha, in order to compare
with the cross-validation strategy.

We will first fit a Lasso model with the AIC criterion.
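As a reminder (these definitions are not spelled out in the original example;
the exact degrees-of-freedom and noise-variance handling is described in the
:class:`~sklearn.linear_model.LassoLarsIC` documentation), the two criteria
trade off in-sample fit against model complexity roughly as

.. math::

    \mathrm{AIC} = -2 \log(\hat{L}) + 2 d, \qquad
    \mathrm{BIC} = -2 \log(\hat{L}) + d \log(N),

where :math:`\hat{L}` is the maximum likelihood of the model, :math:`d` its
number of effective parameters (here, the number of non-zero coefficients),
and :math:`N` the number of samples. BIC penalizes complexity more strongly
than AIC once :math:`\log(N) > 2` (roughly eight samples or more), and
therefore tends to select sparser models.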
.. GENERATED FROM PYTHON SOURCE LINES 66-77

.. code-block:: default


    import time

    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LassoLarsIC
    from sklearn.pipeline import make_pipeline

    start_time = time.time()
    lasso_lars_ic = make_pipeline(
        StandardScaler(), LassoLarsIC(criterion="aic", normalize=False)
    ).fit(X, y)
    fit_time = time.time() - start_time

.. GENERATED FROM PYTHON SOURCE LINES 78-79

We store the AIC metric for each value of alpha used during `fit`.

.. GENERATED FROM PYTHON SOURCE LINES 79-87

.. code-block:: default


    results = pd.DataFrame(
        {
            "alphas": lasso_lars_ic[-1].alphas_,
            "AIC criterion": lasso_lars_ic[-1].criterion_,
        }
    ).set_index("alphas")
    alpha_aic = lasso_lars_ic[-1].alpha_

.. GENERATED FROM PYTHON SOURCE LINES 88-89

Now, we perform the same analysis using the BIC criterion.

.. GENERATED FROM PYTHON SOURCE LINES 89-94

.. code-block:: default


    lasso_lars_ic.set_params(lassolarsic__criterion="bic").fit(X, y)
    results["BIC criterion"] = lasso_lars_ic[-1].criterion_
    alpha_bic = lasso_lars_ic[-1].alpha_

.. GENERATED FROM PYTHON SOURCE LINES 95-96

We can check which value of `alpha` leads to the minimum AIC and BIC.

.. GENERATED FROM PYTHON SOURCE LINES 96-103

.. code-block:: default


    def highlight_min(x):
        x_min = x.min()
        return ["font-weight: bold" if v == x_min else "" for v in x]


    results.style.apply(highlight_min)

The resulting table of criterion values, with the minimum of each column
shown in bold:

.. csv-table::
   :header: "alphas", "AIC criterion", "BIC criterion"

   45.160030, 5244.765686, 5244.765686
   42.300448, 5208.252839, 5212.344149
   21.542302, 4928.021378, 4936.203998
   15.034110, 4869.678327, 4881.952257
   6.189693, 4815.437203, 4831.802443
   5.329590, 4810.422702, 4830.879251
   4.305930, 4803.572132, **4828.119992**
   4.124325, 4804.126582, 4832.765751
   3.820656, 4803.620474, 4836.350953
   3.750361, 4805.011535, 4841.833324
   3.570653, 4805.289336, 4846.202435
   3.550197, 4807.075010, 4852.079419
   3.358374, 4806.878236, 4855.973955
   3.259335, 4807.705694, 4860.892722
   3.237724, 4809.439868, 4866.718206
   2.850023, 4805.988366, 4867.358015
   2.384495, 4801.703302, 4867.164260
   2.296438, 4802.592070, 4872.144338
   2.031600, 4801.236396, 4874.879974
   1.618202, 4798.482569, 4876.217457
   1.526599, 4799.542941, 4881.369138
   0.586815, **4794.237899**, 4880.155406
   0.446051, 4795.589066, 4885.597883
   0.259021, 4796.966000, 4891.066127
   0.032175, 4794.661549, 4888.761677
   0.019066, 4794.651882, 4888.752009
   0.000000, 4796.625435, 4894.816872
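The selected alphas can also be read off programmatically. A minimal sketch
(not part of the original example) using the ``results`` dataframe defined
above:

.. code-block:: default


    # For each column, ``idxmin`` returns the index label (here: the alpha)
    # at which the criterion is minimal.
    print(results.idxmin())

    # These values should match the alphas selected by the fitted estimators.
    print(f"alpha selected by AIC: {alpha_aic:.6f}")
    print(f"alpha selected by BIC: {alpha_bic:.6f}")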
.. GENERATED FROM PYTHON SOURCE LINES 104-108

Finally, we can plot the AIC and BIC values for the different alpha values.
The vertical lines in the plot correspond to the alpha chosen for each
criterion. The selected alpha corresponds to the minimum of the AIC or BIC
criterion.

.. GENERATED FROM PYTHON SOURCE LINES 108-133

.. code-block:: default


    ax = results.plot()
    ax.vlines(
        alpha_aic,
        results["AIC criterion"].min(),
        results["AIC criterion"].max(),
        label="alpha: AIC estimate",
        linestyles="--",
        color="tab:blue",
    )
    ax.vlines(
        alpha_bic,
        results["BIC criterion"].min(),
        results["BIC criterion"].max(),
        label="alpha: BIC estimate",
        linestyles="--",
        color="tab:orange",
    )
    ax.set_xlabel(r"$\alpha$")
    ax.set_ylabel("criterion")
    ax.set_xscale("log")
    ax.legend()
    _ = ax.set_title(
        f"Information-criterion for model selection (training time {fit_time:.2f}s)"
    )

.. image-sg:: /auto_examples/linear_model/images/sphx_glr_plot_lasso_model_selection_001.png
   :alt: Information-criterion for model selection (training time 0.01s)
   :srcset: /auto_examples/linear_model/images/sphx_glr_plot_lasso_model_selection_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 134-164

Model selection with an information criterion is very fast: it only requires
computing the criterion on the in-sample data passed to `fit`. Both criteria
estimate the model generalization error from the training-set error and
penalize this overly optimistic error. The penalty, however, relies on a
proper estimate of the degrees of freedom and of the noise variance. In
addition, both criteria are derived for large samples (asymptotic results)
and assume that the model is correct, i.e. that the data are actually
generated by this model. They also tend to break down when the problem is
badly conditioned (more features than samples); in that case, an estimate of
the noise variance has to be provided.

Selecting Lasso via cross-validation
------------------------------------

The Lasso estimator can be implemented with different solvers: coordinate
descent and least angle regression. They differ with regard to their
execution speed and their sources of numerical errors.

In scikit-learn, two different estimators are available with integrated
cross-validation: :class:`~sklearn.linear_model.LassoCV` and
:class:`~sklearn.linear_model.LassoLarsCV`, which solve the problem with
coordinate descent and least angle regression, respectively.

In the remainder of this section, we will present both approaches. For both
algorithms, we will use a 20-fold cross-validation strategy.

Lasso via coordinate descent
............................

Let's start by performing the hyperparameter tuning using
:class:`~sklearn.linear_model.LassoCV`.

.. GENERATED FROM PYTHON SOURCE LINES 164-170

.. code-block:: default


    from sklearn.linear_model import LassoCV

    start_time = time.time()
    model = make_pipeline(StandardScaler(), LassoCV(cv=20)).fit(X, y)
    fit_time = time.time() - start_time

.. GENERATED FROM PYTHON SOURCE LINES 171-193

.. code-block:: default


    import matplotlib.pyplot as plt

    ymin, ymax = 2300, 3800
    lasso = model[-1]
    plt.semilogx(lasso.alphas_, lasso.mse_path_, linestyle=":")
    plt.plot(
        lasso.alphas_,
        lasso.mse_path_.mean(axis=-1),
        color="black",
        label="Average across the folds",
        linewidth=2,
    )
    plt.axvline(lasso.alpha_, linestyle="--", color="black", label="alpha: CV estimate")

    plt.ylim(ymin, ymax)
    plt.xlabel(r"$\alpha$")
    plt.ylabel("Mean square error")
    plt.legend()
    _ = plt.title(
        f"Mean square error on each fold: coordinate descent (train time: {fit_time:.2f}s)"
    )

.. image-sg:: /auto_examples/linear_model/images/sphx_glr_plot_lasso_model_selection_002.png
   :alt: Mean square error on each fold: coordinate descent (train time: 0.19s)
   :srcset: /auto_examples/linear_model/images/sphx_glr_plot_lasso_model_selection_002.png
   :class: sphx-glr-single-img
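Although not part of the original example, we can also inspect which
coefficients survive at the alpha selected by cross-validation, to confirm
that the Lasso indeed drops most of the purely random features. A minimal
sketch, assuming the ``model`` pipeline fitted just above:

.. code-block:: default


    # Map the fitted coefficients back to the feature names.  Note that the
    # coefficients are expressed in the standardized feature space because of
    # the StandardScaler step in the pipeline.
    coefs = pd.Series(model[-1].coef_, index=X.columns)
    print("Features kept by the Lasso (non-zero coefficients):")
    print(coefs[coefs != 0].sort_values(key=abs, ascending=False))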
.. GENERATED FROM PYTHON SOURCE LINES 194-198

Lasso via least angle regression
................................

We now repeat the hyperparameter tuning, this time using
:class:`~sklearn.linear_model.LassoLarsCV`.

.. GENERATED FROM PYTHON SOURCE LINES 198-204

.. code-block:: default


    from sklearn.linear_model import LassoLarsCV

    start_time = time.time()
    model = make_pipeline(StandardScaler(), LassoLarsCV(cv=20, normalize=False)).fit(X, y)
    fit_time = time.time() - start_time

.. GENERATED FROM PYTHON SOURCE LINES 205-222

.. code-block:: default


    lasso = model[-1]
    plt.semilogx(lasso.cv_alphas_, lasso.mse_path_, ":")
    plt.semilogx(
        lasso.cv_alphas_,
        lasso.mse_path_.mean(axis=-1),
        color="black",
        label="Average across the folds",
        linewidth=2,
    )
    plt.axvline(lasso.alpha_, linestyle="--", color="black", label="alpha CV")

    plt.ylim(ymin, ymax)
    plt.xlabel(r"$\alpha$")
    plt.ylabel("Mean square error")
    plt.legend()
    _ = plt.title(f"Mean square error on each fold: Lars (train time: {fit_time:.2f}s)")

.. image-sg:: /auto_examples/linear_model/images/sphx_glr_plot_lasso_model_selection_003.png
   :alt: Mean square error on each fold: Lars (train time: 0.07s)
   :srcset: /auto_examples/linear_model/images/sphx_glr_plot_lasso_model_selection_003.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 223-259

Summary of cross-validation approach
....................................

Both algorithms give roughly the same results.

Lars computes the solution path only at each kink of the path. As a result,
it is very efficient when there are only a few kinks, which is the case when
there are few features or samples. It is also able to compute the full path
without setting any hyperparameter. By contrast, coordinate descent computes
the path points on a pre-specified grid (here we use the default). It is thus
more efficient when the number of grid points is smaller than the number of
kinks in the path. Such a strategy can be interesting when the number of
features is really large and there are enough samples in each of the
cross-validation folds. In terms of numerical errors, Lars accumulates more
errors for heavily correlated variables, while coordinate descent only
samples the path on a grid.

Note how the optimal value of alpha varies from fold to fold. This
illustrates why nested cross-validation is a good strategy when evaluating
the performance of a method whose hyperparameter is chosen by
cross-validation: that choice of hyperparameter may not be optimal for a
final evaluation on an unseen test set.
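As a rough illustration of this point (not part of the original example and
reusing the imports from the earlier cells), a nested cross-validation can be
sketched by wrapping the tuning pipeline in an outer
:func:`~sklearn.model_selection.cross_val_score` loop: the inner 20-fold CV
inside :class:`~sklearn.linear_model.LassoCV` selects alpha, while the outer
5-fold CV scores the whole procedure on held-out data.

.. code-block:: default


    from sklearn.model_selection import cross_val_score

    # Each outer training fold re-runs the inner 20-fold search for alpha, so
    # the reported scores account for the tuning step as well.
    nested_scores = cross_val_score(
        make_pipeline(StandardScaler(), LassoCV(cv=20)), X, y, cv=5
    )
    print(f"Nested CV R^2: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")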
Conclusion
----------

In this tutorial, we presented two approaches for selecting the best
hyperparameter `alpha`: one strategy finds the optimal value of `alpha` using
only the training set and some information criterion, and the other strategy
is based on cross-validation.

In this example, both approaches work similarly. The in-sample
hyperparameter selection even shows its efficacy in terms of computational
performance. However, it can only be used when the number of samples is
large enough compared to the number of features. That is why hyperparameter
optimization via cross-validation is a safe strategy: it works in many
different settings.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  0.787 seconds)

.. _sphx_glr_download_auto_examples_linear_model_plot_lasso_model_selection.py:

.. only:: html

  .. container:: sphx-glr-footer
     :class: sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/1.0.X?urlpath=lab/tree/notebooks/auto_examples/linear_model/plot_lasso_model_selection.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_lasso_model_selection.py <plot_lasso_model_selection.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_lasso_model_selection.ipynb <plot_lasso_model_selection.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_