.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/compose/plot_compare_reduction.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. or to run this example in your browser via JupyterLite or Binder .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_compose_plot_compare_reduction.py: ================================================================= Selecting dimensionality reduction with Pipeline and GridSearchCV ================================================================= This example constructs a pipeline that does dimensionality reduction followed by prediction with a support vector classifier. It demonstrates the use of ``GridSearchCV`` and ``Pipeline`` to optimize over different classes of estimators in a single CV run -- unsupervised ``PCA`` and ``NMF`` dimensionality reductions are compared to univariate feature selection during the grid search. Additionally, ``Pipeline`` can be instantiated with the ``memory`` argument to memoize the transformers within the pipeline, avoiding to fit again the same transformers over and over. Note that the use of ``memory`` to enable caching becomes interesting when the fitting of a transformer is costly. .. GENERATED FROM PYTHON SOURCE LINES 22-26 .. code-block:: Python # Authors: The scikit-learn developers # SPDX-License-Identifier: BSD-3-Clause .. GENERATED FROM PYTHON SOURCE LINES 27-29 Illustration of ``Pipeline`` and ``GridSearchCV`` ############################################################################## .. GENERATED FROM PYTHON SOURCE LINES 29-71 .. code-block:: Python import matplotlib.pyplot as plt import numpy as np from sklearn.datasets import load_digits from sklearn.decomposition import NMF, PCA from sklearn.feature_selection import SelectKBest, mutual_info_classif from sklearn.model_selection import GridSearchCV from sklearn.pipeline import Pipeline from sklearn.preprocessing import MinMaxScaler from sklearn.svm import LinearSVC X, y = load_digits(return_X_y=True) pipe = Pipeline( [ ("scaling", MinMaxScaler()), # the reduce_dim stage is populated by the param_grid ("reduce_dim", "passthrough"), ("classify", LinearSVC(dual=False, max_iter=10000)), ] ) N_FEATURES_OPTIONS = [2, 4, 8] C_OPTIONS = [1, 10, 100, 1000] param_grid = [ { "reduce_dim": [PCA(iterated_power=7), NMF(max_iter=1_000)], "reduce_dim__n_components": N_FEATURES_OPTIONS, "classify__C": C_OPTIONS, }, { "reduce_dim": [SelectKBest(mutual_info_classif)], "reduce_dim__k": N_FEATURES_OPTIONS, "classify__C": C_OPTIONS, }, ] reducer_labels = ["PCA", "NMF", "KBest(mutual_info_classif)"] grid = GridSearchCV(pipe, n_jobs=1, param_grid=param_grid) grid.fit(X, y) .. raw:: html

GridSearchCV(estimator=Pipeline(steps=[('scaling', MinMaxScaler()),
                                           ('reduce_dim', 'passthrough'),
                                           ('classify',
                                            LinearSVC(dual=False,
                                                      max_iter=10000))]),
                 n_jobs=1,
                 param_grid=[{'classify__C': [1, 10, 100, 1000],
                              'reduce_dim': [PCA(iterated_power=7),
                                             NMF(max_iter=1000)],
                              'reduce_dim__n_components': [2, 4, 8]},
                             {'classify__C': [1, 10, 100, 1000],
                              'reduce_dim': [SelectKBest(score_func=<function mutual_info_classif at 0x7fded371c8b0>)],
                              'reduce_dim__k': [2, 4, 8]}])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

.. GENERATED FROM PYTHON SOURCE LINES 72-93 .. code-block:: Python import pandas as pd mean_scores = np.array(grid.cv_results_["mean_test_score"]) # scores are in the order of param_grid iteration, which is alphabetical mean_scores = mean_scores.reshape(len(C_OPTIONS), -1, len(N_FEATURES_OPTIONS)) # select score for best C mean_scores = mean_scores.max(axis=0) # create a dataframe to ease plotting mean_scores = pd.DataFrame( mean_scores.T, index=N_FEATURES_OPTIONS, columns=reducer_labels ) ax = mean_scores.plot.bar() ax.set_title("Comparing feature reduction techniques") ax.set_xlabel("Reduced number of features") ax.set_ylabel("Digit classification accuracy") ax.set_ylim((0, 1)) ax.legend(loc="upper left") plt.show() .. image-sg:: /auto_examples/compose/images/sphx_glr_plot_compare_reduction_001.png :alt: Comparing feature reduction techniques :srcset: /auto_examples/compose/images/sphx_glr_plot_compare_reduction_001.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 94-106 Caching transformers within a ``Pipeline`` ########################################## It is sometimes worthwhile storing the state of a specific transformer since it could be used again. Using a pipeline in ``GridSearchCV`` triggers such situations. Therefore, we use the argument ``memory`` to enable caching. .. warning:: Note that this example is, however, only an illustration since for this specific case fitting PCA is not necessarily slower than loading the cache. Hence, use the ``memory`` constructor parameter when the fitting of a transformer is costly. .. GENERATED FROM PYTHON SOURCE LINES 106-126 .. code-block:: Python from shutil import rmtree from joblib import Memory # Create a temporary folder to store the transformers of the pipeline location = "cachedir" memory = Memory(location=location, verbose=10) cached_pipe = Pipeline( [("reduce_dim", PCA()), ("classify", LinearSVC(dual=False, max_iter=10000))], memory=memory, ) # This time, a cached pipeline will be used within the grid search # Delete the temporary cache before exiting memory.clear(warn=False) rmtree(location) .. GENERATED FROM PYTHON SOURCE LINES 127-133 The ``PCA`` fitting is only computed at the evaluation of the first configuration of the ``C`` parameter of the ``LinearSVC`` classifier. The other configurations of ``C`` will trigger the loading of the cached ``PCA`` estimator data, leading to save processing time. Therefore, the use of caching the pipeline using ``memory`` is highly beneficial when fitting a transformer is costly. .. rst-class:: sphx-glr-timing **Total running time of the script:** (0 minutes 44.280 seconds) .. _sphx_glr_download_auto_examples_compose_plot_compare_reduction.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: binder-badge .. image:: images/binder_badge_logo.svg :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/1.6.X?urlpath=lab/tree/notebooks/auto_examples/compose/plot_compare_reduction.ipynb :alt: Launch binder :width: 150 px .. container:: lite-badge .. image:: images/jupyterlite_badge_logo.svg :target: ../../lite/lab/index.html?path=auto_examples/compose/plot_compare_reduction.ipynb :alt: Launch JupyterLite :width: 150 px .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: plot_compare_reduction.ipynb ` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: plot_compare_reduction.py ` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: plot_compare_reduction.zip ` .. include:: plot_compare_reduction.recommendations .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_