.. only:: html .. note:: :class: sphx-glr-download-link-note Click :ref:`here ` to download the full example code or to run this example in your browser via Binder .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_compose_plot_compare_reduction.py: ================================================================= Selecting dimensionality reduction with Pipeline and GridSearchCV ================================================================= This example constructs a pipeline that does dimensionality reduction followed by prediction with a support vector classifier. It demonstrates the use of ``GridSearchCV`` and ``Pipeline`` to optimize over different classes of estimators in a single CV run -- unsupervised ``PCA`` and ``NMF`` dimensionality reductions are compared to univariate feature selection during the grid search. Additionally, ``Pipeline`` can be instantiated with the ``memory`` argument to memoize the transformers within the pipeline, avoiding to fit again the same transformers over and over. Note that the use of ``memory`` to enable caching becomes interesting when the fitting of a transformer is costly. ############################################################################### Illustration of ``Pipeline`` and ``GridSearchCV`` ############################################################################### This section illustrates the use of a ``Pipeline`` with ``GridSearchCV`` .. code-block:: default # Authors: Robert McGibbon, Joel Nothman, Guillaume Lemaitre import numpy as np import matplotlib.pyplot as plt from sklearn.datasets import load_digits from sklearn.model_selection import GridSearchCV from sklearn.pipeline import Pipeline from sklearn.svm import LinearSVC from sklearn.decomposition import PCA, NMF from sklearn.feature_selection import SelectKBest, chi2 print(__doc__) pipe = Pipeline([ # the reduce_dim stage is populated by the param_grid ('reduce_dim', 'passthrough'), ('classify', LinearSVC(dual=False, max_iter=10000)) ]) N_FEATURES_OPTIONS = [2, 4, 8] C_OPTIONS = [1, 10, 100, 1000] param_grid = [ { 'reduce_dim': [PCA(iterated_power=7), NMF()], 'reduce_dim__n_components': N_FEATURES_OPTIONS, 'classify__C': C_OPTIONS }, { 'reduce_dim': [SelectKBest(chi2)], 'reduce_dim__k': N_FEATURES_OPTIONS, 'classify__C': C_OPTIONS }, ] reducer_labels = ['PCA', 'NMF', 'KBest(chi2)'] grid = GridSearchCV(pipe, n_jobs=1, param_grid=param_grid) X, y = load_digits(return_X_y=True) grid.fit(X, y) mean_scores = np.array(grid.cv_results_['mean_test_score']) # scores are in the order of param_grid iteration, which is alphabetical mean_scores = mean_scores.reshape(len(C_OPTIONS), -1, len(N_FEATURES_OPTIONS)) # select score for best C mean_scores = mean_scores.max(axis=0) bar_offsets = (np.arange(len(N_FEATURES_OPTIONS)) * (len(reducer_labels) + 1) + .5) plt.figure() COLORS = 'bgrcmyk' for i, (label, reducer_scores) in enumerate(zip(reducer_labels, mean_scores)): plt.bar(bar_offsets + i, reducer_scores, label=label, color=COLORS[i]) plt.title("Comparing feature reduction techniques") plt.xlabel('Reduced number of features') plt.xticks(bar_offsets + len(reducer_labels) / 2, N_FEATURES_OPTIONS) plt.ylabel('Digit classification accuracy') plt.ylim((0, 1)) plt.legend(loc='upper left') plt.show() .. image:: /auto_examples/compose/images/sphx_glr_plot_compare_reduction_001.png :alt: Comparing feature reduction techniques :class: sphx-glr-single-img Caching transformers within a ``Pipeline`` ############################################################################## It is sometimes worthwhile storing the state of a specific transformer since it could be used again. Using a pipeline in ``GridSearchCV`` triggers such situations. Therefore, we use the argument ``memory`` to enable caching. .. warning:: Note that this example is, however, only an illustration since for this specific case fitting PCA is not necessarily slower than loading the cache. Hence, use the ``memory`` constructor parameter when the fitting of a transformer is costly. .. code-block:: default from joblib import Memory from shutil import rmtree # Create a temporary folder to store the transformers of the pipeline location = 'cachedir' memory = Memory(location=location, verbose=10) cached_pipe = Pipeline([('reduce_dim', PCA()), ('classify', LinearSVC(dual=False, max_iter=10000))], memory=memory) # This time, a cached pipeline will be used within the grid search # Delete the temporary cache before exiting memory.clear(warn=False) rmtree(location) The ``PCA`` fitting is only computed at the evaluation of the first configuration of the ``C`` parameter of the ``LinearSVC`` classifier. The other configurations of ``C`` will trigger the loading of the cached ``PCA`` estimator data, leading to save processing time. Therefore, the use of caching the pipeline using ``memory`` is highly beneficial when fitting a transformer is costly. .. rst-class:: sphx-glr-timing **Total running time of the script:** ( 0 minutes 4.905 seconds) .. _sphx_glr_download_auto_examples_compose_plot_compare_reduction.py: .. only :: html .. container:: sphx-glr-footer :class: sphx-glr-footer-example .. container:: binder-badge .. image:: https://mybinder.org/badge_logo.svg :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/0.23.X?urlpath=lab/tree/notebooks/auto_examples/compose/plot_compare_reduction.ipynb :width: 150 px .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: plot_compare_reduction.py ` .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: plot_compare_reduction.ipynb ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_