.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/semi_supervised/plot_semi_supervised_newsgroups.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_semi_supervised_plot_semi_supervised_newsgroups.py>`
        to download the full example code or to run this example in your browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_semi_supervised_plot_semi_supervised_newsgroups.py:


================================================
Semi-supervised Classification on a Text Dataset
================================================

In this example, semi-supervised classifiers are trained on the 20 newsgroups
dataset (which will be automatically downloaded).

You can adjust the number of categories by giving their names to the dataset
loader or setting them to `None` to get all 20 of them.

.. GENERATED FROM PYTHON SOURCE LINES 13-112


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    2823 documents
    5 categories

    Supervised SGDClassifier on 100% of the data:
    Number of training samples: 2117
    Unlabeled samples in training set: 0
    Micro-averaged F1 score on test set: 0.898
    ----------

    Supervised SGDClassifier on 20% of the training data:
    Number of training samples: 433
    Unlabeled samples in training set: 0
    Micro-averaged F1 score on test set: 0.790
    ----------

    SelfTrainingClassifier on 20% of the training data (rest is unlabeled):
    Number of training samples: 2117
    Unlabeled samples in training set: 1684
    End of iteration 1, added 1099 new labels.
    End of iteration 2, added 212 new labels.
    End of iteration 3, added 62 new labels.
    End of iteration 4, added 21 new labels.
    End of iteration 5, added 4 new labels.
    End of iteration 6, added 3 new labels.
    End of iteration 7, added 7 new labels.
    End of iteration 8, added 2 new labels.
    End of iteration 9, added 5 new labels.
    End of iteration 10, added 2 new labels.
    Micro-averaged F1 score on test set: 0.851
    ----------

    LabelSpreading on 20% of the data (rest is unlabeled):
    Number of training samples: 2117
    Unlabeled samples in training set: 1684
    /home/circleci/project/sklearn/utils/validation.py:593: FutureWarning: np.matrix usage is deprecated in 1.0 and will raise a TypeError in 1.2. Please convert to a numpy array with np.asarray. For more information see: https://numpy.org/doc/stable/reference/generated/numpy.matrix.html
      warnings.warn(
    /home/circleci/project/sklearn/utils/validation.py:593: FutureWarning: np.matrix usage is deprecated in 1.0 and will raise a TypeError in 1.2. Please convert to a numpy array with np.asarray. For more information see: https://numpy.org/doc/stable/reference/generated/numpy.matrix.html
      warnings.warn(
    Micro-averaged F1 score on test set: 0.678
    ----------


|

.. code-block:: default


    import numpy as np

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.semi_supervised import SelfTrainingClassifier
    from sklearn.semi_supervised import LabelSpreading
    from sklearn.metrics import f1_score

    # Loading dataset containing first five categories
    data = fetch_20newsgroups(
        subset="train",
        categories=[
            "alt.atheism",
            "comp.graphics",
            "comp.os.ms-windows.misc",
            "comp.sys.ibm.pc.hardware",
            "comp.sys.mac.hardware",
        ],
    )
    print("%d documents" % len(data.filenames))
    print("%d categories" % len(data.target_names))
    print()

    # Parameters
    sdg_params = dict(alpha=1e-5, penalty="l2", loss="log")
    vectorizer_params = dict(ngram_range=(1, 2), min_df=5, max_df=0.8)

    # Supervised Pipeline
    pipeline = Pipeline(
        [
            ("vect", CountVectorizer(**vectorizer_params)),
            ("tfidf", TfidfTransformer()),
            ("clf", SGDClassifier(**sdg_params)),
        ]
    )
    # SelfTraining Pipeline
    st_pipeline = Pipeline(
        [
            ("vect", CountVectorizer(**vectorizer_params)),
            ("tfidf", TfidfTransformer()),
            ("clf", SelfTrainingClassifier(SGDClassifier(**sdg_params), verbose=True)),
        ]
    )
    # LabelSpreading Pipeline
    ls_pipeline = Pipeline(
        [
            ("vect", CountVectorizer(**vectorizer_params)),
            ("tfidf", TfidfTransformer()),
            # LabelSpreading does not support dense matrices
            ("todense", FunctionTransformer(lambda x: x.todense())),
            ("clf", LabelSpreading()),
        ]
    )


    def eval_and_print_metrics(clf, X_train, y_train, X_test, y_test):
        print("Number of training samples:", len(X_train))
        print("Unlabeled samples in training set:", sum(1 for x in y_train if x == -1))
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        print(
            "Micro-averaged F1 score on test set: %0.3f"
            % f1_score(y_test, y_pred, average="micro")
        )
        print("-" * 10)
        print()


    if __name__ == "__main__":
        X, y = data.data, data.target
        X_train, X_test, y_train, y_test = train_test_split(X, y)

        print("Supervised SGDClassifier on 100% of the data:")
        eval_and_print_metrics(pipeline, X_train, y_train, X_test, y_test)

        # select a mask of 20% of the train dataset
        y_mask = np.random.rand(len(y_train)) < 0.2

        # X_20 and y_20 are the subset of the train dataset indicated by the mask
        X_20, y_20 = map(
            list, zip(*((x, y) for x, y, m in zip(X_train, y_train, y_mask) if m))
        )
        print("Supervised SGDClassifier on 20% of the training data:")
        eval_and_print_metrics(pipeline, X_20, y_20, X_test, y_test)

        # set the non-masked subset to be unlabeled
        y_train[~y_mask] = -1
        print("SelfTrainingClassifier on 20% of the training data (rest is unlabeled):")
        eval_and_print_metrics(st_pipeline, X_train, y_train, X_test, y_test)

        print("LabelSpreading on 20% of the data (rest is unlabeled):")
        eval_and_print_metrics(ls_pipeline, X_train, y_train, X_test, y_test)


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  7.567 seconds)


.. _sphx_glr_download_auto_examples_semi_supervised_plot_semi_supervised_newsgroups.py:


.. only :: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example


  .. container:: binder-badge

    .. image:: images/binder_badge_logo.svg
      :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/1.0.X?urlpath=lab/tree/notebooks/auto_examples/semi_supervised/plot_semi_supervised_newsgroups.ipynb
      :alt: Launch binder
      :width: 150 px


  .. container:: sphx-glr-download sphx-glr-download-python

     :download:`Download Python source code: plot_semi_supervised_newsgroups.py <plot_semi_supervised_newsgroups.py>`


  .. container:: sphx-glr-download sphx-glr-download-jupyter

     :download:`Download Jupyter notebook: plot_semi_supervised_newsgroups.ipynb <plot_semi_supervised_newsgroups.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_