.. rst-class:: sphx-glr-example-title
.. _sphx_glr_auto_examples_linear_model_plot_lasso_model_selection.py:
===================================================
Lasso model selection: Cross-Validation / AIC / BIC
===================================================
Use the Akaike information criterion (AIC), the Bayes Information
criterion (BIC) and cross-validation to select an optimal value
of the regularization parameter alpha of the :ref:`lasso` estimator.
Results obtained with LassoLarsIC are based on AIC/BIC criteria.
Information-criterion based model selection is very fast, but it
relies on a proper estimation of the degrees of freedom, is derived
for large samples (asymptotic results), and assumes the model is
correct, i.e. that the data are actually generated by this model.
It also tends to break when the problem is badly conditioned
(more features than samples).
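For reference, the textbook definitions of the two criteria are given below.
Note that scikit-learn adapts them to the Lasso setting (plugging in an
estimate of the noise variance and of the degrees of freedom), so the exact
expressions used by LassoLarsIC differ in detail.

.. math::

    \mathrm{AIC} = 2k - 2\ln(\hat{L}), \qquad
    \mathrm{BIC} = k \ln(n) - 2\ln(\hat{L}),

where :math:`k` is the number of effective parameters (degrees of freedom),
:math:`n` the number of samples, and :math:`\hat{L}` the maximized likelihood.
In both cases, lower values indicate a better trade-off between goodness of
fit and model complexity.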
For cross-validation, we use 20 folds with two algorithms to compute the
Lasso path: coordinate descent, as implemented by the LassoCV class, and
Lars (least angle regression), as implemented by the LassoLarsCV class.
Both algorithms give roughly the same results; they differ with regard
to their execution speed and sources of numerical errors.
Lars computes a path solution only at each kink in the path. As a
result, it is very efficient when there are only a few kinks, which is
the case if there are few features or samples. Also, it is able to
compute the full path without setting any meta-parameter. In contrast,
coordinate descent computes the path points on a pre-specified grid
(here we use the default). It is thus more efficient if the number of
grid points is smaller than the number of kinks in the path. Such a
strategy can be interesting if the number of features is very large and
there are enough samples for many of them to be selected. In terms of
numerical errors, Lars will accumulate more errors for heavily
correlated variables, while the coordinate descent algorithm only
samples the path on a grid.
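This difference can be seen directly with the lower-level helpers
:func:`~sklearn.linear_model.lars_path` and
:func:`~sklearn.linear_model.lasso_path`, which underlie the two
cross-validated estimators. The following is only a small illustrative
sketch, not part of the example script:

.. code-block:: python

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import lars_path, lasso_path

    X, y = load_diabetes(return_X_y=True)

    # Lars: one path point per kink of the piecewise-linear coefficient path
    alphas_lars, _, coefs_lars = lars_path(X, y, method='lasso')

    # Coordinate descent: path evaluated on a grid of alphas (100 by default)
    alphas_cd, coefs_cd, _ = lasso_path(X, y)

    print(len(alphas_lars), 'kinks vs', len(alphas_cd), 'grid points')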
Note how the optimal value of alpha varies for each fold. This
illustrates why nested cross-validation is necessary when trying to
evaluate the performance of a method for which a parameter is chosen by
cross-validation: this choice of parameter may not be optimal for unseen
data.
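A minimal sketch of such a nested scheme (not part of the example script):
the inner LassoCV chooses alpha on the training folds, while an outer
:func:`~sklearn.model_selection.cross_val_score` evaluates the whole
selection procedure on held-out folds.

.. code-block:: python

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LassoCV
    from sklearn.model_selection import cross_val_score

    X, y = load_diabetes(return_X_y=True)

    # Inner loop: LassoCV picks alpha by 20-fold cross-validation.
    inner_model = LassoCV(cv=20)

    # Outer loop: score the full "fit and select alpha" procedure on 5 folds.
    outer_scores = cross_val_score(inner_model, X, y, cv=5)
    print(outer_scores.mean())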
.. rst-class:: sphx-glr-horizontal

    *

      .. image:: /auto_examples/linear_model/images/sphx_glr_plot_lasso_model_selection_001.png
          :alt: Information-criterion for model selection (training time 0.005s)
          :class: sphx-glr-multi-img

    *

      .. image:: /auto_examples/linear_model/images/sphx_glr_plot_lasso_model_selection_002.png
          :alt: Mean square error on each fold: coordinate descent (train time: 0.26s)
          :class: sphx-glr-multi-img

    *

      .. image:: /auto_examples/linear_model/images/sphx_glr_plot_lasso_model_selection_003.png
          :alt: Mean square error on each fold: Lars (train time: 0.09s)
          :class: sphx-glr-multi-img
.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Computing regularization path using the coordinate descent lasso...
    Computing regularization path using the Lars lasso...
.. code-block:: default
print(__doc__)
# Author: Olivier Grisel, Gael Varoquaux, Alexandre Gramfort
# License: BSD 3 clause
import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LassoCV, LassoLarsCV, LassoLarsIC
from sklearn import datasets
# Small constant added to the alphas to avoid taking log10(0) on the
# logarithmic x-axis of the plots below
EPSILON = 1e-4
X, y = datasets.load_diabetes(return_X_y=True)
rng = np.random.RandomState(42)
X = np.c_[X, rng.randn(X.shape[0], 14)] # add some bad features
# normalize data as done by Lars to allow for comparison
X /= np.sqrt(np.sum(X ** 2, axis=0))
# #############################################################################
# LassoLarsIC: least angle regression with BIC/AIC criterion
model_bic = LassoLarsIC(criterion='bic')
t1 = time.time()
model_bic.fit(X, y)
t_bic = time.time() - t1
alpha_bic_ = model_bic.alpha_
model_aic = LassoLarsIC(criterion='aic')
model_aic.fit(X, y)
alpha_aic_ = model_aic.alpha_
def plot_ic_criterion(model, name, color):
    criterion_ = model.criterion_
    plt.semilogx(model.alphas_ + EPSILON, criterion_, '--', color=color,
                 linewidth=3, label='%s criterion' % name)
    plt.axvline(model.alpha_ + EPSILON, color=color, linewidth=3,
                label='alpha: %s estimate' % name)
    plt.xlabel(r'$\alpha$')
    plt.ylabel('criterion')
plt.figure()
plot_ic_criterion(model_aic, 'AIC', 'b')
plot_ic_criterion(model_bic, 'BIC', 'r')
plt.legend()
plt.title('Information-criterion for model selection (training time %.3fs)'
          % t_bic)
# #############################################################################
# LassoCV: coordinate descent
# Compute paths
print("Computing regularization path using the coordinate descent lasso...")
t1 = time.time()
model = LassoCV(cv=20).fit(X, y)
t_lasso_cv = time.time() - t1
# Display results
plt.figure()
ymin, ymax = 2300, 3800
plt.semilogx(model.alphas_ + EPSILON, model.mse_path_, ':')
plt.plot(model.alphas_ + EPSILON, model.mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(model.alpha_ + EPSILON, linestyle='--', color='k',
            label='alpha: CV estimate')
plt.legend()
plt.xlabel(r'$\alpha$')
plt.ylabel('Mean square error')
plt.title('Mean square error on each fold: coordinate descent '
          '(train time: %.2fs)' % t_lasso_cv)
plt.axis('tight')
plt.ylim(ymin, ymax)
# #############################################################################
# LassoLarsCV: least angle regression
# Compute paths
print("Computing regularization path using the Lars lasso...")
t1 = time.time()
model = LassoLarsCV(cv=20).fit(X, y)
t_lasso_lars_cv = time.time() - t1
# Display results
plt.figure()
plt.semilogx(model.cv_alphas_ + EPSILON, model.mse_path_, ':')
plt.semilogx(model.cv_alphas_ + EPSILON, model.mse_path_.mean(axis=-1), 'k',
             label='Average across the folds', linewidth=2)
plt.axvline(model.alpha_, linestyle='--', color='k',
            label='alpha CV')
plt.legend()
plt.xlabel(r'$\alpha$')
plt.ylabel('Mean square error')
plt.title('Mean square error on each fold: Lars (train time: %.2fs)'
          % t_lasso_lars_cv)
plt.axis('tight')
plt.ylim(ymin, ymax)
plt.show()
.. rst-class:: sphx-glr-timing
**Total running time of the script:** ( 0 minutes 0.898 seconds)