.. _sphx_glr_auto_examples_exercises_plot_cv_diabetes.py:


===============================================
Cross-validation on Diabetes Dataset Exercise
===============================================

A tutorial exercise that uses cross-validation with linear models to choose
the regularization parameter ``alpha`` of a Lasso.

This exercise is used in the :ref:`cv_estimators_tut` part of the
:ref:`model_selection_tut` section of the :ref:`stat_learn_tut_index`.


.. code-block:: python


    print(__doc__)

    import numpy as np
    import matplotlib.pyplot as plt

    from sklearn import datasets
    from sklearn.linear_model import LassoCV
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score

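    # Restrict the exercise to the first 150 samples of the diabetes dataset
    diabetes = datasets.load_diabetes()
    X = diabetes.data[:150]
    y = diabetes.target[:150]

    # Candidate regularization strengths, log-spaced from 1e-4 to 10**-0.5
    lasso = Lasso(random_state=0)
    alphas = np.logspace(-4, -0.5, 30)

    scores = list()
    scores_std = list()

    n_folds = 3

    # For each alpha, record the mean and standard deviation of the
    # cross-validated scores over the folds
    for alpha in alphas:
        lasso.alpha = alpha
        this_scores = cross_val_score(lasso, X, y, cv=n_folds, n_jobs=1)
        scores.append(np.mean(this_scores))
        scores_std.append(np.std(this_scores))

    scores, scores_std = np.array(scores), np.array(scores_std)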

    plt.figure().set_size_inches(8, 6)
    plt.semilogx(alphas, scores)

    # plot error lines showing +/- std. errors of the scores
    std_error = scores_std / np.sqrt(n_folds)

    plt.semilogx(alphas, scores + std_error, 'b--')
    plt.semilogx(alphas, scores - std_error, 'b--')

    # alpha=0.2 controls the translucency of the fill color
    plt.fill_between(alphas, scores + std_error, scores - std_error, alpha=0.2)

    plt.ylabel('CV score +/- std error')
    plt.xlabel('alpha')
    plt.axhline(np.max(scores), linestyle='--', color='.5')
    plt.xlim([alphas[0], alphas[-1]])


.. image:: /auto_examples/exercises/images/sphx_glr_plot_cv_diabetes_001.png
    :align: center
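
The dashed horizontal line in the plot marks the highest mean CV score. As a
minimal sketch, the alpha that reaches it can be read off with ``np.argmax``.
The helper ``cv_curve`` and the variable names below are hypothetical and not
part of the original exercise; the scores are recomputed so that the snippet
runs on its own.

.. code-block:: python

    import numpy as np

    from sklearn import datasets
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import cross_val_score

    diabetes = datasets.load_diabetes()
    X = diabetes.data[:150]
    y = diabetes.target[:150]

    alphas = np.logspace(-4, -0.5, 30)
    n_folds = 3


    def cv_curve(X, y, alphas, n_folds):
        """Return the mean and standard error of the CV score for each alpha."""
        means, std_errors = [], []
        for alpha in alphas:
            fold_scores = cross_val_score(
                Lasso(alpha=alpha, random_state=0), X, y, cv=n_folds)
            means.append(np.mean(fold_scores))
            std_errors.append(np.std(fold_scores) / np.sqrt(n_folds))
        return np.array(means), np.array(std_errors)


    means, std_errors = cv_curve(X, y, alphas, n_folds)

    # Index of the alpha with the highest mean cross-validated score
    best = np.argmax(means)
    best_alpha = alphas[best]

    print("best alpha: {0:.5f}, mean score: {1:.5f} +/- {2:.5f}".format(
        best_alpha, means[best], std_errors[best]))

With only three folds the standard error is a rough estimate, so the reported
alpha is best read as one plausible choice rather than a precise optimum.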


Bonus: how much can you trust the selection of alpha?


.. code-block:: python


    # To answer this question we use the LassoCV object that sets its alpha
    # parameter automatically from the data by internal cross-validation (i.e. it
    # performs cross-validation on the training data it receives).
    # We use external cross-validation to see how much the automatically obtained
    # alphas differ across different cross-validation folds.
    lasso_cv = LassoCV(alphas=alphas, random_state=0)
    k_fold = KFold(3)

    print("Answer to the bonus question:",
          "how much can you trust the selection of alpha?")
    print()
    print("Alpha parameters maximising the generalization score on different")
    print("subsets of the data:")
    for k, (train, test) in enumerate(k_fold.split(X, y)):
        lasso_cv.fit(X[train], y[train])
        print("[fold {0}] alpha: {1:.5f}, score: {2:.5f}".
              format(k, lasso_cv.alpha_, lasso_cv.score(X[test], y[test])))
    print()
    print("Answer: Not very much since we obtained different alphas for different")
    print("subsets of the data and moreover, the scores for these alphas differ")
    print("quite substantially.")

    plt.show()


.. rst-class:: sphx-glr-script-out

 Out::

    Answer to the bonus question: how much can you trust the selection of alpha?

    Alpha parameters maximising the generalization score on different
    subsets of the data:
    [fold 0] alpha: 0.10405, score: 0.53573
    [fold 1] alpha: 0.05968, score: 0.16278
    [fold 2] alpha: 0.10405, score: 0.44437

    Answer: Not very much since we obtained different alphas for different
    subsets of the data and moreover, the scores for these alphas differ
    quite substantially.
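
Three outer folds give only three selected alphas. One way to look at the
variation more closely, sketched here under the assumption that repeating the
outer split is acceptable for this exercise, is to use ``RepeatedKFold`` and
summarize the alphas that ``LassoCV`` picks. The choice of five repeats and
the names ``outer_cv``, ``selected_alphas`` and ``test_scores`` are
illustrative and not part of the original script.

.. code-block:: python

    import numpy as np

    from sklearn import datasets
    from sklearn.linear_model import LassoCV
    from sklearn.model_selection import RepeatedKFold

    diabetes = datasets.load_diabetes()
    X = diabetes.data[:150]
    y = diabetes.target[:150]
    alphas = np.logspace(-4, -0.5, 30)

    # Assumption: 3 outer folds repeated 5 times with different shuffles
    outer_cv = RepeatedKFold(n_splits=3, n_repeats=5, random_state=0)

    selected_alphas = []
    test_scores = []
    for train, test in outer_cv.split(X, y):
        # The inner cross-validation of LassoCV picks alpha on the training part
        lasso_cv = LassoCV(alphas=alphas, random_state=0)
        lasso_cv.fit(X[train], y[train])
        selected_alphas.append(lasso_cv.alpha_)
        # Score on the held-out part, which the inner CV never saw
        test_scores.append(lasso_cv.score(X[test], y[test]))

    selected_alphas = np.array(selected_alphas)
    test_scores = np.array(test_scores)

    print("selected alphas: min {0:.5f}, max {1:.5f}".format(
        selected_alphas.min(), selected_alphas.max()))
    print("held-out scores: mean {0:.5f}, std {1:.5f}".format(
        test_scores.mean(), test_scores.std()))

A wide range of selected alphas or a large spread in held-out scores supports
the same conclusion as the three-fold run: the selected alpha depends strongly
on which samples end up in the training set.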


**Total running time of the script:**
(0 minutes 0.457 seconds)


.. container:: sphx-glr-download

    **Download Python source code:** :download:`plot_cv_diabetes.py <plot_cv_diabetes.py>`


.. container:: sphx-glr-download

    **Download IPython notebook:** :download:`plot_cv_diabetes.ipynb <plot_cv_diabetes.ipynb>`