.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/ensemble/plot_adaboost_hastie_10_2.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_ensemble_plot_adaboost_hastie_10_2.py>`
        to download the full example code or to run this example in your browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_ensemble_plot_adaboost_hastie_10_2.py:

=============================
Discrete versus Real AdaBoost
=============================

This notebook is based on Figure 10.2 from Hastie et al. 2009 [1]_ and
illustrates the difference in performance between the discrete SAMME [2]_
boosting algorithm and the real SAMME.R boosting algorithm. Both algorithms
are evaluated on a binary classification task where the target Y is a
non-linear function of 10 input features. Discrete SAMME AdaBoost adapts
based on errors in the predicted class labels, whereas real SAMME.R uses the
predicted class probabilities.

.. [1] T. Hastie, R. Tibshirani and J. Friedman, "Elements of Statistical
    Learning Ed. 2", Springer, 2009.

.. [2] J. Zhu, H. Zou, S. Rosset, T. Hastie, "Multi-class AdaBoost",
    Statistics and Its Interface, 2009.

.. GENERATED FROM PYTHON SOURCE LINES 24-28

Preparing the data and baseline models
--------------------------------------

We start by generating the binary classification dataset used in Hastie et
al. 2009, Example 10.2: the ten input features are independent standard
Gaussians, and the target is :math:`y = 1` if :math:`\sum_j x_j^2` exceeds
9.34 (the median of a chi-squared variable with 10 degrees of freedom) and
:math:`y = -1` otherwise.

.. GENERATED FROM PYTHON SOURCE LINES 28-38

.. code-block:: default


    # Authors: Peter Prettenhofer,
    #          Noel Dawe
    #
    # License: BSD 3 clause

    from sklearn import datasets

    X, y = datasets.make_hastie_10_2(n_samples=12_000, random_state=1)

.. GENERATED FROM PYTHON SOURCE LINES 39-41

Now, we set the hyperparameters for our AdaBoost classifiers. Be aware that a
learning rate of 1.0 may not be optimal for both SAMME and SAMME.R.

.. GENERATED FROM PYTHON SOURCE LINES 41-45

.. code-block:: default


    n_estimators = 400
    learning_rate = 1.0

.. GENERATED FROM PYTHON SOURCE LINES 46-49

We split the data into a training and a test set. Then, we train our baseline
classifiers, a `DecisionTreeClassifier` with `max_depth=9` and a "stump"
`DecisionTreeClassifier` with `max_depth=1`, and compute the test error.

.. GENERATED FROM PYTHON SOURCE LINES 49-65

.. code-block:: default


    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=2_000, shuffle=False
    )

    dt_stump = DecisionTreeClassifier(max_depth=1, min_samples_leaf=1)
    dt_stump.fit(X_train, y_train)
    dt_stump_err = 1.0 - dt_stump.score(X_test, y_test)

    dt = DecisionTreeClassifier(max_depth=9, min_samples_leaf=1)
    dt.fit(X_train, y_train)
    dt_err = 1.0 - dt.score(X_test, y_test)

.. GENERATED FROM PYTHON SOURCE LINES 66-70

AdaBoost with discrete SAMME and real SAMME.R
---------------------------------------------

We now define the discrete and real AdaBoost classifiers and fit them to the
training set.

.. GENERATED FROM PYTHON SOURCE LINES 70-81

.. code-block:: default


    from sklearn.ensemble import AdaBoostClassifier

    ada_discrete = AdaBoostClassifier(
        base_estimator=dt_stump,
        learning_rate=learning_rate,
        n_estimators=n_estimators,
        algorithm="SAMME",
    )
    ada_discrete.fit(X_train, y_train)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    AdaBoostClassifier(algorithm='SAMME',
                       base_estimator=DecisionTreeClassifier(max_depth=1),
                       n_estimators=400)

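For intuition, discrete SAMME needs only the hard class labels predicted by
each stump: at every boosting round it computes the stump's weighted error,
converts it into a stage weight, and up-weights the misclassified samples.
The following minimal sketch of one such round is ours, not code from this
example; the helper name ``samme_round`` is hypothetical, and the real
implementation adds early termination and numerical safeguards.

.. code-block:: default


    import numpy as np

    from sklearn.tree import DecisionTreeClassifier


    def samme_round(X, y, sample_weight, learning_rate=1.0, n_classes=2):
        """One discrete SAMME boosting round (illustrative sketch only)."""
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=sample_weight)
        incorrect = stump.predict(X) != y
        # Weighted error rate of the stump under the current sample weights;
        # the sketch assumes 0 < err < 1.
        err = np.average(incorrect, weights=sample_weight)
        # Stage weight; the log(n_classes - 1) term vanishes when n_classes=2.
        alpha = learning_rate * (np.log((1.0 - err) / err) + np.log(n_classes - 1))
        # Up-weight the samples the stump misclassified, then renormalize.
        sample_weight = sample_weight * np.exp(alpha * incorrect)
        return stump, alpha, sample_weight / sample_weight.sum()

Starting from uniform weights and repeating this for `n_estimators` rounds,
the ensemble predicts the class maximizing :math:`\sum_m \alpha_m \,
[h_m(x) = k]`, which for the binary labels used here reduces to the sign of
:math:`\sum_m \alpha_m h_m(x)`.
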
.. GENERATED FROM PYTHON SOURCE LINES 82-91

.. code-block:: default


    ada_real = AdaBoostClassifier(
        base_estimator=dt_stump,
        learning_rate=learning_rate,
        n_estimators=n_estimators,
        algorithm="SAMME.R",
    )
    ada_real.fit(X_train, y_train)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                       n_estimators=400)

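Real SAMME.R, by contrast, consumes the stump's class-probability estimates
rather than its hard predictions. As a hedged sketch (again our own
illustration, with the hypothetical helper name ``samme_r_contribution``),
the per-stage additive contribution for a two-class problem is half the
log-odds of the predicted probabilities:

.. code-block:: default


    import numpy as np


    def samme_r_contribution(stump, X, eps=1e-12):
        """Additive contribution of one fitted stump under SAMME.R (binary sketch)."""
        # Clip to avoid log(0) for pure leaves.
        proba = np.clip(stump.predict_proba(X), eps, 1.0)
        log_proba = np.log(proba)
        # Symmetrized log-probabilities; with two classes, the positive-class
        # column equals 0.5 * log(p(y=1 | x) / p(y=-1 | x)).
        return (log_proba - log_proba.mean(axis=1, keepdims=True))[:, 1]

Summing these real-valued contributions over all stumps and taking the sign
gives the ensemble prediction. Because each stage contributes a
confidence-rated score instead of a single weighted vote, SAMME.R typically
converges faster, which is what the error curves below illustrate.
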
.. GENERATED FROM PYTHON SOURCE LINES 92-95

Now, let's compute the train and test errors of the discrete and real
AdaBoost classifiers after each of the `n_estimators` stumps is added to the
ensemble.

.. GENERATED FROM PYTHON SOURCE LINES 95-115

.. code-block:: default


    import numpy as np

    from sklearn.metrics import zero_one_loss

    ada_discrete_err = np.zeros((n_estimators,))
    for i, y_pred in enumerate(ada_discrete.staged_predict(X_test)):
        ada_discrete_err[i] = zero_one_loss(y_test, y_pred)

    ada_discrete_err_train = np.zeros((n_estimators,))
    for i, y_pred in enumerate(ada_discrete.staged_predict(X_train)):
        ada_discrete_err_train[i] = zero_one_loss(y_train, y_pred)

    ada_real_err = np.zeros((n_estimators,))
    for i, y_pred in enumerate(ada_real.staged_predict(X_test)):
        ada_real_err[i] = zero_one_loss(y_test, y_pred)

    ada_real_err_train = np.zeros((n_estimators,))
    for i, y_pred in enumerate(ada_real.staged_predict(X_train)):
        ada_real_err_train[i] = zero_one_loss(y_train, y_pred)

.. GENERATED FROM PYTHON SOURCE LINES 116-120

Plotting the results
--------------------

Finally, we plot the train and test errors of our baselines and of the
discrete and real AdaBoost classifiers.

.. GENERATED FROM PYTHON SOURCE LINES 120-165

.. code-block:: default


    import matplotlib.pyplot as plt
    import seaborn as sns

    fig = plt.figure()
    ax = fig.add_subplot(111)

    ax.plot([1, n_estimators], [dt_stump_err] * 2, "k-", label="Decision Stump Error")
    ax.plot([1, n_estimators], [dt_err] * 2, "k--", label="Decision Tree Error")

    colors = sns.color_palette("colorblind")

    ax.plot(
        np.arange(n_estimators) + 1,
        ada_discrete_err,
        label="Discrete AdaBoost Test Error",
        color=colors[0],
    )
    ax.plot(
        np.arange(n_estimators) + 1,
        ada_discrete_err_train,
        label="Discrete AdaBoost Train Error",
        color=colors[1],
    )
    ax.plot(
        np.arange(n_estimators) + 1,
        ada_real_err,
        label="Real AdaBoost Test Error",
        color=colors[2],
    )
    ax.plot(
        np.arange(n_estimators) + 1,
        ada_real_err_train,
        label="Real AdaBoost Train Error",
        color=colors[4],
    )

    ax.set_ylim((0.0, 0.5))
    ax.set_xlabel("Number of weak learners")
    ax.set_ylabel("error rate")

    leg = ax.legend(loc="upper right", fancybox=True)
    leg.get_frame().set_alpha(0.7)

    plt.show()

.. image-sg:: /auto_examples/ensemble/images/sphx_glr_plot_adaboost_hastie_10_2_001.png
   :alt: plot adaboost hastie 10 2
   :srcset: /auto_examples/ensemble/images/sphx_glr_plot_adaboost_hastie_10_2_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 166-171

Concluding remarks
------------------

We observe that both the train and test error rates of real AdaBoost are
lower than those of discrete AdaBoost.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 12.721 seconds)


.. _sphx_glr_download_auto_examples_ensemble_plot_adaboost_hastie_10_2.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/1.1.X?urlpath=lab/tree/notebooks/auto_examples/ensemble/plot_adaboost_hastie_10_2.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_adaboost_hastie_10_2.py <plot_adaboost_hastie_10_2.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_adaboost_hastie_10_2.ipynb <plot_adaboost_hastie_10_2.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_