.. note:: :class: sphx-glr-download-link-note Click :ref:`here ` to download the full example code .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_applications_plot_model_complexity_influence.py: ========================== Model Complexity Influence ========================== Demonstrate how model complexity influences both prediction accuracy and computational performance. The dataset is the Boston Housing dataset (resp. 20 Newsgroups) for regression (resp. classification). For each class of models we make the model complexity vary through the choice of relevant model parameters and measure the influence on both computational performance (latency) and predictive power (MSE or Hamming Loss). .. rst-class:: sphx-glr-horizontal * .. image:: /auto_examples/applications/images/sphx_glr_plot_model_complexity_influence_001.png :class: sphx-glr-multi-img * .. image:: /auto_examples/applications/images/sphx_glr_plot_model_complexity_influence_002.png :class: sphx-glr-multi-img * .. image:: /auto_examples/applications/images/sphx_glr_plot_model_complexity_influence_003.png :class: sphx-glr-multi-img .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Benchmarking SGDClassifier(alpha=0.001, average=False, class_weight=None, early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.25, learning_rate='optimal', loss='modified_huber', max_iter=None, n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='elasticnet', power_t=0.5, random_state=None, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False) Complexity: 4495 | Hamming Loss (Misclassification Ratio): 0.2536 | Pred. Time: 0.023817s Benchmarking SGDClassifier(alpha=0.001, average=False, class_weight=None, early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.5, learning_rate='optimal', loss='modified_huber', max_iter=None, n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='elasticnet', power_t=0.5, random_state=None, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False) Complexity: 1644 | Hamming Loss (Misclassification Ratio): 0.3032 | Pred. Time: 0.017283s Benchmarking SGDClassifier(alpha=0.001, average=False, class_weight=None, early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.75, learning_rate='optimal', loss='modified_huber', max_iter=None, n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='elasticnet', power_t=0.5, random_state=None, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False) Complexity: 866 | Hamming Loss (Misclassification Ratio): 0.3252 | Pred. Time: 0.016456s Benchmarking SGDClassifier(alpha=0.001, average=False, class_weight=None, early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.9, learning_rate='optimal', loss='modified_huber', max_iter=None, n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='elasticnet', power_t=0.5, random_state=None, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False) Complexity: 647 | Hamming Loss (Misclassification Ratio): 0.3302 | Pred. Time: 0.014696s Benchmarking NuSVR(C=1000.0, cache_size=200, coef0=0.0, degree=3, gamma=3.0517578125e-05, kernel='rbf', max_iter=-1, nu=0.1, shrinking=True, tol=0.001, verbose=False) Complexity: 69 | MSE: 31.8139 | Pred. Time: 0.000271s Benchmarking NuSVR(C=1000.0, cache_size=200, coef0=0.0, degree=3, gamma=3.0517578125e-05, kernel='rbf', max_iter=-1, nu=0.25, shrinking=True, tol=0.001, verbose=False) Complexity: 136 | MSE: 25.6140 | Pred. Time: 0.000726s Benchmarking NuSVR(C=1000.0, cache_size=200, coef0=0.0, degree=3, gamma=3.0517578125e-05, kernel='rbf', max_iter=-1, nu=0.5, shrinking=True, tol=0.001, verbose=False) Complexity: 244 | MSE: 22.3375 | Pred. Time: 0.000849s Benchmarking NuSVR(C=1000.0, cache_size=200, coef0=0.0, degree=3, gamma=3.0517578125e-05, kernel='rbf', max_iter=-1, nu=0.75, shrinking=True, tol=0.001, verbose=False) Complexity: 351 | MSE: 21.3688 | Pred. Time: 0.001210s Benchmarking NuSVR(C=1000.0, cache_size=200, coef0=0.0, degree=3, gamma=3.0517578125e-05, kernel='rbf', max_iter=-1, nu=0.9, shrinking=True, tol=0.001, verbose=False) Complexity: 404 | MSE: 21.1033 | Pred. Time: 0.001261s Benchmarking GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None, learning_rate=0.1, loss='ls', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_iter_no_change=None, presort='auto', random_state=None, subsample=1.0, tol=0.0001, validation_fraction=0.1, verbose=0, warm_start=False) Complexity: 10 | MSE: 29.0148 | Pred. Time: 0.000072s Benchmarking GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None, learning_rate=0.1, loss='ls', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=50, n_iter_no_change=None, presort='auto', random_state=None, subsample=1.0, tol=0.0001, validation_fraction=0.1, verbose=0, warm_start=False) Complexity: 50 | MSE: 8.7631 | Pred. Time: 0.000136s Benchmarking GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None, learning_rate=0.1, loss='ls', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_iter_no_change=None, presort='auto', random_state=None, subsample=1.0, tol=0.0001, validation_fraction=0.1, verbose=0, warm_start=False) Complexity: 100 | MSE: 7.4527 | Pred. Time: 0.000217s Benchmarking GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None, learning_rate=0.1, loss='ls', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=200, n_iter_no_change=None, presort='auto', random_state=None, subsample=1.0, tol=0.0001, validation_fraction=0.1, verbose=0, warm_start=False) Complexity: 200 | MSE: 6.7607 | Pred. Time: 0.000341s Benchmarking GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None, learning_rate=0.1, loss='ls', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=500, n_iter_no_change=None, presort='auto', random_state=None, subsample=1.0, tol=0.0001, validation_fraction=0.1, verbose=0, warm_start=False) Complexity: 500 | MSE: 7.3029 | Pred. Time: 0.000761s | .. code-block:: default print(__doc__) # Author: Eustache Diemert # License: BSD 3 clause import time import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.axes_grid1.parasite_axes import host_subplot from mpl_toolkits.axisartist.axislines import Axes from scipy.sparse.csr import csr_matrix from sklearn import datasets from sklearn.utils import shuffle from sklearn.metrics import mean_squared_error from sklearn.svm.classes import NuSVR from sklearn.ensemble.gradient_boosting import GradientBoostingRegressor from sklearn.linear_model.stochastic_gradient import SGDClassifier from sklearn.metrics import hamming_loss # ############################################################################# # Routines # Initialize random generator np.random.seed(0) def generate_data(case, sparse=False): """Generate regression/classification data.""" bunch = None if case == 'regression': bunch = datasets.load_boston() elif case == 'classification': bunch = datasets.fetch_20newsgroups_vectorized(subset='all') X, y = shuffle(bunch.data, bunch.target) offset = int(X.shape[0] * 0.8) X_train, y_train = X[:offset], y[:offset] X_test, y_test = X[offset:], y[offset:] if sparse: X_train = csr_matrix(X_train) X_test = csr_matrix(X_test) else: X_train = np.array(X_train) X_test = np.array(X_test) y_test = np.array(y_test) y_train = np.array(y_train) data = {'X_train': X_train, 'X_test': X_test, 'y_train': y_train, 'y_test': y_test} return data def benchmark_influence(conf): """ Benchmark influence of :changing_param: on both MSE and latency. """ prediction_times = [] prediction_powers = [] complexities = [] for param_value in conf['changing_param_values']: conf['tuned_params'][conf['changing_param']] = param_value estimator = conf['estimator'](**conf['tuned_params']) print("Benchmarking %s" % estimator) estimator.fit(conf['data']['X_train'], conf['data']['y_train']) conf['postfit_hook'](estimator) complexity = conf['complexity_computer'](estimator) complexities.append(complexity) start_time = time.time() for _ in range(conf['n_samples']): y_pred = estimator.predict(conf['data']['X_test']) elapsed_time = (time.time() - start_time) / float(conf['n_samples']) prediction_times.append(elapsed_time) pred_score = conf['prediction_performance_computer']( conf['data']['y_test'], y_pred) prediction_powers.append(pred_score) print("Complexity: %d | %s: %.4f | Pred. Time: %fs\n" % ( complexity, conf['prediction_performance_label'], pred_score, elapsed_time)) return prediction_powers, prediction_times, complexities def plot_influence(conf, mse_values, prediction_times, complexities): """ Plot influence of model complexity on both accuracy and latency. """ plt.figure(figsize=(12, 6)) host = host_subplot(111, axes_class=Axes) plt.subplots_adjust(right=0.75) par1 = host.twinx() host.set_xlabel('Model Complexity (%s)' % conf['complexity_label']) y1_label = conf['prediction_performance_label'] y2_label = "Time (s)" host.set_ylabel(y1_label) par1.set_ylabel(y2_label) p1, = host.plot(complexities, mse_values, 'b-', label="prediction error") p2, = par1.plot(complexities, prediction_times, 'r-', label="latency") host.legend(loc='upper right') host.axis["left"].label.set_color(p1.get_color()) par1.axis["right"].label.set_color(p2.get_color()) plt.title('Influence of Model Complexity - %s' % conf['estimator'].__name__) plt.show() def _count_nonzero_coefficients(estimator): a = estimator.coef_.toarray() return np.count_nonzero(a) # ############################################################################# # Main code regression_data = generate_data('regression') classification_data = generate_data('classification', sparse=True) configurations = [ {'estimator': SGDClassifier, 'tuned_params': {'penalty': 'elasticnet', 'alpha': 0.001, 'loss': 'modified_huber', 'fit_intercept': True, 'tol': 1e-3}, 'changing_param': 'l1_ratio', 'changing_param_values': [0.25, 0.5, 0.75, 0.9], 'complexity_label': 'non_zero coefficients', 'complexity_computer': _count_nonzero_coefficients, 'prediction_performance_computer': hamming_loss, 'prediction_performance_label': 'Hamming Loss (Misclassification Ratio)', 'postfit_hook': lambda x: x.sparsify(), 'data': classification_data, 'n_samples': 30}, {'estimator': NuSVR, 'tuned_params': {'C': 1e3, 'gamma': 2 ** -15}, 'changing_param': 'nu', 'changing_param_values': [0.1, 0.25, 0.5, 0.75, 0.9], 'complexity_label': 'n_support_vectors', 'complexity_computer': lambda x: len(x.support_vectors_), 'data': regression_data, 'postfit_hook': lambda x: x, 'prediction_performance_computer': mean_squared_error, 'prediction_performance_label': 'MSE', 'n_samples': 30}, {'estimator': GradientBoostingRegressor, 'tuned_params': {'loss': 'ls'}, 'changing_param': 'n_estimators', 'changing_param_values': [10, 50, 100, 200, 500], 'complexity_label': 'n_trees', 'complexity_computer': lambda x: x.n_estimators, 'data': regression_data, 'postfit_hook': lambda x: x, 'prediction_performance_computer': mean_squared_error, 'prediction_performance_label': 'MSE', 'n_samples': 30}, ] for conf in configurations: prediction_performances, prediction_times, complexities = \ benchmark_influence(conf) plot_influence(conf, prediction_performances, prediction_times, complexities) .. rst-class:: sphx-glr-timing **Total running time of the script:** ( 0 minutes 44.169 seconds) .. _sphx_glr_download_auto_examples_applications_plot_model_complexity_influence.py: .. only :: html .. container:: sphx-glr-footer :class: sphx-glr-footer-example .. container:: sphx-glr-download :download:`Download Python source code: plot_model_complexity_influence.py ` .. container:: sphx-glr-download :download:`Download Jupyter notebook: plot_model_complexity_influence.ipynb ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_