.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/feature_selection/plot_feature_selection.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_feature_selection_plot_feature_selection.py>`
        to download the full example code or to run this example in your browser via JupyterLite or Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_feature_selection_plot_feature_selection.py:


============================
Univariate Feature Selection
============================

This notebook is an example of using univariate feature selection
to improve classification accuracy on a noisy dataset.

In this example, noisy (non-informative) features are added to
the iris dataset. A support vector machine (SVM) is used to classify the
dataset both before and after applying univariate feature selection.
For each feature, we plot the p-values from the univariate feature
selection and the corresponding SVM weights. This lets us compare model
accuracy and examine the impact of univariate feature selection on the
model weights.

.. GENERATED FROM PYTHON SOURCE LINES 20-23

Generate sample data
--------------------


.. GENERATED FROM PYTHON SOURCE LINES 23-40

.. code-block:: default

    import numpy as np

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    # Load the iris dataset
    X, y = load_iris(return_X_y=True)

    # Generate 20 noisy features that are uncorrelated with the target
    E = np.random.RandomState(42).uniform(0, 0.1, size=(X.shape[0], 20))

    # Append the noisy features to the informative ones
    X = np.hstack((X, E))

    # Split the dataset to select features and evaluate the classifier
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
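
As a quick sanity check (a minimal sketch, assuming the code above has
been run), the assembled matrix should contain the 4 original iris
features followed by the 20 noisy columns:

.. code-block:: default

    # 4 informative iris features + 20 noisy features = 24 columns
    print(X.shape)  # expected: (150, 24)
    # train_test_split keeps 75% of the samples for training by default
    print(X_train.shape)  # expected: (112, 24)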








.. GENERATED FROM PYTHON SOURCE LINES 41-47

Univariate feature selection
----------------------------

Univariate feature selection with an F-test for feature scoring.
We use ``SelectKBest`` with the default scoring function, the ANOVA
F-test (``f_classif``), to select the four most significant features.

.. GENERATED FROM PYTHON SOURCE LINES 47-54

.. code-block:: default

    from sklearn.feature_selection import SelectKBest, f_classif

    selector = SelectKBest(f_classif, k=4)
    selector.fit(X_train, y_train)
    # Transform the p-values so that smaller p-values yield larger scores,
    # then normalize to [0, 1] for plotting
    scores = -np.log10(selector.pvalues_)
    scores /= scores.max()
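
The raw ANOVA F-statistics tell the same story. As a minimal sketch
(reusing the fitted ``selector`` from above), the original iris features
should score orders of magnitude higher than the noisy ones:

.. code-block:: default

    # F-statistics computed by f_classif: larger values indicate
    # stronger class separation
    print("F-scores of the original features:", selector.scores_[:4])
    print("Mean F-score of the noisy features:", selector.scores_[4:].mean())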








.. GENERATED FROM PYTHON SOURCE LINES 55-66

.. code-block:: default

    import matplotlib.pyplot as plt

    X_indices = np.arange(X.shape[-1])
    plt.figure(1)
    plt.clf()
    plt.bar(X_indices - 0.05, scores, width=0.2)
    plt.title("Feature univariate score")
    plt.xlabel("Feature number")
    plt.ylabel(r"Univariate score ($-Log(p_{value})$)")
    plt.show()




.. image-sg:: /auto_examples/feature_selection/images/sphx_glr_plot_feature_selection_001.png
   :alt: Feature univariate score
   :srcset: /auto_examples/feature_selection/images/sphx_glr_plot_feature_selection_001.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 67-70

In the total set of features, only the 4 original features are significant.
We can see that they receive the highest scores from univariate feature
selection.
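
We can confirm this programmatically: ``get_support`` returns the indices
of the features retained by ``SelectKBest``. A minimal sketch, reusing the
fitted ``selector``; with this random seed the retained columns should be
the four original iris features (0 through 3):

.. code-block:: default

    # Indices of the features kept by SelectKBest
    print(selector.get_support(indices=True))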

.. GENERATED FROM PYTHON SOURCE LINES 72-76

Compare with SVMs
-----------------

Without univariate feature selection

.. GENERATED FROM PYTHON SOURCE LINES 76-91

.. code-block:: default

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.svm import LinearSVC

    clf = make_pipeline(MinMaxScaler(), LinearSVC(dual="auto"))
    clf.fit(X_train, y_train)
    print(
        "Classification accuracy without selecting features: {:.3f}".format(
            clf.score(X_test, y_test)
        )
    )

    # Sum the absolute SVM coefficients over the one-vs-rest classifiers,
    # then normalize so the weights sum to 1
    svm_weights = np.abs(clf[-1].coef_).sum(axis=0)
    svm_weights /= svm_weights.sum()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Classification accuracy without selecting features: 0.789




.. GENERATED FROM PYTHON SOURCE LINES 92-93

After univariate feature selection

.. GENERATED FROM PYTHON SOURCE LINES 93-106

.. code-block:: default

    clf_selected = make_pipeline(
        SelectKBest(f_classif, k=4), MinMaxScaler(), LinearSVC(dual="auto")
    )
    clf_selected.fit(X_train, y_train)
    print(
        "Classification accuracy after univariate feature selection: {:.3f}".format(
            clf_selected.score(X_test, y_test)
        )
    )

    # Aggregate and normalize the weights of the SVM trained on the
    # selected features, as above
    svm_weights_selected = np.abs(clf_selected[-1].coef_).sum(axis=0)
    svm_weights_selected /= svm_weights_selected.sum()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Classification accuracy after univariate feature selection: 0.868




.. GENERATED FROM PYTHON SOURCE LINES 107-127

.. code-block:: default

    plt.bar(
        X_indices - 0.45, scores, width=0.2, label=r"Univariate score ($-Log(p_{value})$)"
    )

    plt.bar(X_indices - 0.25, svm_weights, width=0.2, label="SVM weight")

    plt.bar(
        X_indices[selector.get_support()] - 0.05,
        svm_weights_selected,
        width=0.2,
        label="SVM weights after selection",
    )

    plt.title("Comparing feature selection")
    plt.xlabel("Feature number")
    plt.yticks(())
    plt.axis("tight")
    plt.legend(loc="upper right")
    plt.show()




.. image-sg:: /auto_examples/feature_selection/images/sphx_glr_plot_feature_selection_002.png
   :alt: Comparing feature selection
   :srcset: /auto_examples/feature_selection/images/sphx_glr_plot_feature_selection_002.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 128-133

Without univariate feature selection, the SVM assigns large weights
to the 4 original significant features, but also places non-negligible
weight on many of the non-informative features. Applying univariate
feature selection before the SVM increases the weight attributed to the
significant features, and thus improves classification.
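
To check that this improvement is not an artifact of a single train/test
split, one could cross-validate both pipelines on the training set. This
is a minimal sketch (not part of the original example) that reuses the
``clf`` and ``clf_selected`` pipelines defined above:

.. code-block:: default

    from sklearn.model_selection import cross_val_score

    # 5-fold cross-validation of both pipelines; SelectKBest is refit
    # inside each fold, so there is no feature-selection leakage
    for name, model in [("without selection", clf), ("with selection", clf_selected)]:
        cv_scores = cross_val_score(model, X_train, y_train, cv=5)
        print(f"Accuracy {name}: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")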


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.203 seconds)


.. _sphx_glr_download_auto_examples_feature_selection_plot_feature_selection.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/1.3.X?urlpath=lab/tree/notebooks/auto_examples/feature_selection/plot_feature_selection.ipynb
        :alt: Launch binder
        :width: 150 px



    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../../lite/lab/?path=auto_examples/feature_selection/plot_feature_selection.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_feature_selection.py <plot_feature_selection.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_feature_selection.ipynb <plot_feature_selection.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_