.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/compose/plot_compare_reduction.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_compose_plot_compare_reduction.py>`
        to download the full example code or to run this example in your browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_compose_plot_compare_reduction.py:


=================================================================
Selecting dimensionality reduction with Pipeline and GridSearchCV
=================================================================

This example constructs a pipeline that does dimensionality
reduction followed by prediction with a support vector
classifier. It demonstrates the use of ``GridSearchCV`` and
``Pipeline`` to optimize over different classes of estimators in a
single CV run -- unsupervised ``PCA`` and ``NMF`` dimensionality
reductions are compared to univariate feature selection during
the grid search.

Additionally, ``Pipeline`` can be instantiated with the ``memory``
argument to memoize the transformers within the pipeline, which avoids
refitting the same transformers over and over.

Note that caching with ``memory`` only pays off when fitting a transformer
is costly.

.. GENERATED FROM PYTHON SOURCE LINES 22-27

.. code-block:: default


    # Authors: Robert McGibbon
    #          Joel Nothman
    #          Guillaume Lemaitre








.. GENERATED FROM PYTHON SOURCE LINES 28-30

Illustration of ``Pipeline`` and ``GridSearchCV``
##############################################################################

.. GENERATED FROM PYTHON SOURCE LINES 30-71

.. code-block:: default


    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC
    from sklearn.decomposition import PCA, NMF
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.preprocessing import MinMaxScaler

    X, y = load_digits(return_X_y=True)

    pipe = Pipeline(
        [
            ("scaling", MinMaxScaler()),
            # the reduce_dim stage is populated by the param_grid
            ("reduce_dim", "passthrough"),
            ("classify", LinearSVC(dual=False, max_iter=10000)),
        ]
    )

    N_FEATURES_OPTIONS = [2, 4, 8]
    C_OPTIONS = [1, 10, 100, 1000]
    param_grid = [
        {
            "reduce_dim": [PCA(iterated_power=7), NMF(max_iter=1_000)],
            "reduce_dim__n_components": N_FEATURES_OPTIONS,
            "classify__C": C_OPTIONS,
        },
        {
            "reduce_dim": [SelectKBest(mutual_info_classif)],
            "reduce_dim__k": N_FEATURES_OPTIONS,
            "classify__C": C_OPTIONS,
        },
    ]
    reducer_labels = ["PCA", "NMF", "KBest(mutual_info_classif)"]

    grid = GridSearchCV(pipe, n_jobs=1, param_grid=param_grid)
    grid.fit(X, y)






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <style>#sk-container-id-59 {color: black;background-color: white;}#sk-container-id-59 pre{padding: 0;}#sk-container-id-59 div.sk-toggleable {background-color: white;}#sk-container-id-59 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-59 label.sk-toggleable__label-arrow:before {content: "▸";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-59 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-59 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-59 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-59 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-59 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-59 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: "▾";}#sk-container-id-59 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-59 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-59 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-59 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-59 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-59 div.sk-parallel-item::after {content: "";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-59 
div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-59 div.sk-serial::before {content: "";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-59 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-59 div.sk-item {position: relative;z-index: 1;}#sk-container-id-59 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-59 div.sk-item::before, #sk-container-id-59 div.sk-parallel-item::before {content: "";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-59 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-59 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-59 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-59 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-59 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-59 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-59 div.sk-label-container {text-align: center;}#sk-container-id-59 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. 
See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-59 div.sk-text-repr-fallback {display: none;}</style><div id="sk-container-id-59" class="sk-top-container"><div class="sk-text-repr-fallback"><pre>GridSearchCV(estimator=Pipeline(steps=[(&#x27;scaling&#x27;, MinMaxScaler()),
                                           (&#x27;reduce_dim&#x27;, &#x27;passthrough&#x27;),
                                           (&#x27;classify&#x27;,
                                            LinearSVC(dual=False,
                                                      max_iter=10000))]),
                 n_jobs=1,
                 param_grid=[{&#x27;classify__C&#x27;: [1, 10, 100, 1000],
                              &#x27;reduce_dim&#x27;: [PCA(iterated_power=7, n_components=8),
                                             NMF(max_iter=1000)],
                              &#x27;reduce_dim__n_components&#x27;: [2, 4, 8]},
                             {&#x27;classify__C&#x27;: [1, 10, 100, 1000],
                              &#x27;reduce_dim&#x27;: [SelectKBest(score_func=&lt;function mutual_info_classif at 0x7fd08312f280&gt;)],
                              &#x27;reduce_dim__k&#x27;: [2, 4, 8]}])</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class="sk-container" hidden><div class="sk-item sk-dashed-wrapped"><div class="sk-label-container"><div class="sk-label sk-toggleable"><input class="sk-toggleable__control sk-hidden--visually" id="sk-estimator-id-264" type="checkbox" ><label for="sk-estimator-id-264" class="sk-toggleable__label sk-toggleable__label-arrow">GridSearchCV</label><div class="sk-toggleable__content"><pre>GridSearchCV(estimator=Pipeline(steps=[(&#x27;scaling&#x27;, MinMaxScaler()),
                                           (&#x27;reduce_dim&#x27;, &#x27;passthrough&#x27;),
                                           (&#x27;classify&#x27;,
                                            LinearSVC(dual=False,
                                                      max_iter=10000))]),
                 n_jobs=1,
                 param_grid=[{&#x27;classify__C&#x27;: [1, 10, 100, 1000],
                              &#x27;reduce_dim&#x27;: [PCA(iterated_power=7, n_components=8),
                                             NMF(max_iter=1000)],
                              &#x27;reduce_dim__n_components&#x27;: [2, 4, 8]},
                             {&#x27;classify__C&#x27;: [1, 10, 100, 1000],
                              &#x27;reduce_dim&#x27;: [SelectKBest(score_func=&lt;function mutual_info_classif at 0x7fd08312f280&gt;)],
                              &#x27;reduce_dim__k&#x27;: [2, 4, 8]}])</pre></div></div></div><div class="sk-parallel"><div class="sk-parallel-item"><div class="sk-item"><div class="sk-label-container"><div class="sk-label sk-toggleable"><input class="sk-toggleable__control sk-hidden--visually" id="sk-estimator-id-265" type="checkbox" ><label for="sk-estimator-id-265" class="sk-toggleable__label sk-toggleable__label-arrow">estimator: Pipeline</label><div class="sk-toggleable__content"><pre>Pipeline(steps=[(&#x27;scaling&#x27;, MinMaxScaler()), (&#x27;reduce_dim&#x27;, &#x27;passthrough&#x27;),
                    (&#x27;classify&#x27;, LinearSVC(dual=False, max_iter=10000))])</pre></div></div></div><div class="sk-serial"><div class="sk-item"><div class="sk-serial"><div class="sk-item"><div class="sk-estimator sk-toggleable"><input class="sk-toggleable__control sk-hidden--visually" id="sk-estimator-id-266" type="checkbox" ><label for="sk-estimator-id-266" class="sk-toggleable__label sk-toggleable__label-arrow">MinMaxScaler</label><div class="sk-toggleable__content"><pre>MinMaxScaler()</pre></div></div></div><div class="sk-item"><div class="sk-estimator sk-toggleable"><input class="sk-toggleable__control sk-hidden--visually" id="sk-estimator-id-267" type="checkbox" ><label for="sk-estimator-id-267" class="sk-toggleable__label sk-toggleable__label-arrow">passthrough</label><div class="sk-toggleable__content"><pre>passthrough</pre></div></div></div><div class="sk-item"><div class="sk-estimator sk-toggleable"><input class="sk-toggleable__control sk-hidden--visually" id="sk-estimator-id-268" type="checkbox" ><label for="sk-estimator-id-268" class="sk-toggleable__label sk-toggleable__label-arrow">LinearSVC</label><div class="sk-toggleable__content"><pre>LinearSVC(dual=False, max_iter=10000)</pre></div></div></div></div></div></div></div></div></div></div></div></div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 72-93

.. code-block:: default

    import pandas as pd

    mean_scores = np.array(grid.cv_results_["mean_test_score"])
    # scores are in the order of param_grid iteration, which is alphabetical
    mean_scores = mean_scores.reshape(len(C_OPTIONS), -1, len(N_FEATURES_OPTIONS))
    # select score for best C
    mean_scores = mean_scores.max(axis=0)
    # create a dataframe to ease plotting
    mean_scores = pd.DataFrame(
        mean_scores.T, index=N_FEATURES_OPTIONS, columns=reducer_labels
    )

    ax = mean_scores.plot.bar()
    ax.set_title("Comparing feature reduction techniques")
    ax.set_xlabel("Reduced number of features")
    ax.set_ylabel("Digit classification accuracy")
    ax.set_ylim((0, 1))
    ax.legend(loc="upper left")

    plt.show()




.. image-sg:: /auto_examples/compose/images/sphx_glr_plot_compare_reduction_001.png
   :alt: Comparing feature reduction techniques
   :srcset: /auto_examples/compose/images/sphx_glr_plot_compare_reduction_001.png
   :class: sphx-glr-single-img
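
The reshape above relies on the order in which ``GridSearchCV`` iterates over
``param_grid``. As an alternative, the same table can be built from the
``params`` entry of ``cv_results_``, which does not depend on that order. The
following is a minimal sketch under stated assumptions: the grid is
deliberately reduced (two reducers, two feature counts, two values of ``C``,
``cv=3``) to keep it fast, and the names ``small_grid``, ``rows`` and
``best_per_reducer`` are illustrative only.

.. code-block:: python

    import pandas as pd
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.svm import LinearSVC

    X, y = load_digits(return_X_y=True)

    pipe = Pipeline(
        [
            ("scaling", MinMaxScaler()),
            ("reduce_dim", "passthrough"),
            ("classify", LinearSVC(dual=False, max_iter=10000)),
        ]
    )

    # Reduced grid: an assumption made only to keep this sketch quick to run
    n_features_options = [2, 4]
    c_options = [1, 10]
    param_grid = [
        {
            "reduce_dim": [PCA()],
            "reduce_dim__n_components": n_features_options,
            "classify__C": c_options,
        },
        {
            "reduce_dim": [SelectKBest(mutual_info_classif)],
            "reduce_dim__k": n_features_options,
            "classify__C": c_options,
        },
    ]
    small_grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1)
    small_grid.fit(X, y)

    # One row per candidate, read from the parameter dicts themselves
    rows = []
    for params, score in zip(
        small_grid.cv_results_["params"],
        small_grid.cv_results_["mean_test_score"],
    ):
        # PCA candidates carry n_components, SelectKBest candidates carry k
        n_features = params.get(
            "reduce_dim__n_components", params.get("reduce_dim__k")
        )
        rows.append(
            {
                "reducer": type(params["reduce_dim"]).__name__,
                "n_features": n_features,
                "C": params["classify__C"],
                "mean_test_score": score,
            }
        )

    # Keep the score of the best C for each (n_features, reducer) pair
    best_per_reducer = (
        pd.DataFrame(rows)
        .groupby(["n_features", "reducer"])["mean_test_score"]
        .max()
        .unstack("reducer")
    )
    print(best_per_reducer)

Calling ``best_per_reducer.plot.bar()`` would draw a grouped bar chart with the
same layout as the figure above, with columns labelled by estimator class
name.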





.. GENERATED FROM PYTHON SOURCE LINES 94-105

Caching transformers within a ``Pipeline``
##############################################################################
It is sometimes worthwhile to store the state of a fitted transformer so that
it can be reused. Using a pipeline in ``GridSearchCV`` triggers such
situations, because the same transformer is refit for every candidate that
only changes a later step. The ``memory`` argument enables caching for this
case.

.. warning::
    This example is only an illustration: in this specific case, fitting
    ``PCA`` is not necessarily slower than loading the cache. Use the
    ``memory`` constructor parameter when fitting a transformer is costly.

.. GENERATED FROM PYTHON SOURCE LINES 105-124

.. code-block:: default


    from joblib import Memory
    from shutil import rmtree

    # Create a temporary folder to store the transformers of the pipeline
    location = "cachedir"
    memory = Memory(location=location, verbose=10)
    cached_pipe = Pipeline(
        [("reduce_dim", PCA()), ("classify", LinearSVC(dual=False, max_iter=10000))],
        memory=memory,
    )

    # cached_pipe could now be passed to GridSearchCV in place of pipe;
    # that grid search is not run in this snippet


    # Delete the temporary cache before exiting
    memory.clear(warn=False)
    rmtree(location)








.. GENERATED FROM PYTHON SOURCE LINES 125-131

When such a cached pipeline is used in a grid search, the ``PCA`` fit is
computed only when the first value of the ``C`` parameter of the ``LinearSVC``
classifier is evaluated on a given training split. The other values of ``C``
trigger loading of the cached ``PCA`` estimator data, which saves processing
time. Caching the pipeline with ``memory`` is therefore most beneficial when
fitting a transformer is costly.
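
The behaviour described above can be sketched as follows. This is a minimal
illustration, not the original example's code: the temporary directory from
``tempfile.mkdtemp``, the fixed ``n_components=8``, the three values of ``C``,
``cv=3`` and the names ``cache_dir`` and ``cached_grid`` are all assumptions
chosen to keep the run short. Only ``C`` varies, so each training split needs
a single ``PCA`` fit and the remaining candidates can reuse the cached result.

.. code-block:: python

    from shutil import rmtree
    from tempfile import mkdtemp

    from joblib import Memory
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    X, y = load_digits(return_X_y=True)

    # Temporary folder that holds the cached transformer fits
    cache_dir = mkdtemp()
    memory = Memory(location=cache_dir, verbose=0)

    cached_pipe = Pipeline(
        [
            ("reduce_dim", PCA(n_components=8)),
            ("classify", LinearSVC(dual=False, max_iter=10000)),
        ],
        memory=memory,
    )

    # Only the classifier's C changes between candidates, so the PCA step
    # sees identical inputs and parameters within each training split
    cached_grid = GridSearchCV(
        cached_pipe,
        param_grid={"classify__C": [1, 10, 100]},
        cv=3,
        n_jobs=1,
    )
    cached_grid.fit(X, y)
    best_C = cached_grid.best_params_["classify__C"]
    print(best_C, cached_grid.best_score_)

    # Delete the temporary cache before exiting
    memory.clear(warn=False)
    rmtree(cache_dir)

Setting ``verbose=10`` on ``Memory``, as in the snippet above, would log each
cache call and make the reuse visible in the output.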


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  42.950 seconds)


.. _sphx_glr_download_auto_examples_compose_plot_compare_reduction.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/1.2.X?urlpath=lab/tree/notebooks/auto_examples/compose/plot_compare_reduction.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_compare_reduction.py <plot_compare_reduction.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_compare_reduction.ipynb <plot_compare_reduction.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_