.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINXGALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/semi_supervised/plot_semi_supervised_newsgroups.py"
.. LINE NUMBERS ARE GIVEN BELOW.
.. only:: html
.. note::
:class: sphxglrdownloadlinknote
Click :ref:`here `
to download the full example code or to run this example in your browser via Binder
.. rstclass:: sphxglrexampletitle
.. _sphx_glr_auto_examples_semi_supervised_plot_semi_supervised_newsgroups.py:
================================================
Semisupervised Classification on a Text Dataset
================================================
In this example, semisupervised classifiers are trained on the 20 newsgroups
dataset (which will be automatically downloaded).
You can adjust the number of categories by giving their names to the dataset
loader or setting them to `None` to get all 20 of them.
.. GENERATED FROM PYTHON SOURCE LINES 13112
.. rstclass:: sphxglrscriptout
.. codeblock:: none
2823 documents
5 categories
Supervised SGDClassifier on 100% of the data:
Number of training samples: 2117
Unlabeled samples in training set: 0
/home/runner/work/scikitlearn/scikitlearn/sklearn/linear_model/_stochastic_gradient.py:173: FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use `loss='log_loss'` which is equivalent.
warnings.warn(
Microaveraged F1 score on test set: 0.902

Supervised SGDClassifier on 20% of the training data:
Number of training samples: 460
Unlabeled samples in training set: 0
/home/runner/work/scikitlearn/scikitlearn/sklearn/linear_model/_stochastic_gradient.py:173: FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use `loss='log_loss'` which is equivalent.
warnings.warn(
Microaveraged F1 score on test set: 0.773

SelfTrainingClassifier on 20% of the training data (rest is unlabeled):
Number of training samples: 2117
Unlabeled samples in training set: 1657
/home/runner/work/scikitlearn/scikitlearn/sklearn/linear_model/_stochastic_gradient.py:173: FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use `loss='log_loss'` which is equivalent.
warnings.warn(
End of iteration 1, added 1088 new labels.
/home/runner/work/scikitlearn/scikitlearn/sklearn/linear_model/_stochastic_gradient.py:173: FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use `loss='log_loss'` which is equivalent.
warnings.warn(
End of iteration 2, added 185 new labels.
/home/runner/work/scikitlearn/scikitlearn/sklearn/linear_model/_stochastic_gradient.py:173: FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use `loss='log_loss'` which is equivalent.
warnings.warn(
End of iteration 3, added 53 new labels.
/home/runner/work/scikitlearn/scikitlearn/sklearn/linear_model/_stochastic_gradient.py:173: FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use `loss='log_loss'` which is equivalent.
warnings.warn(
End of iteration 4, added 23 new labels.
/home/runner/work/scikitlearn/scikitlearn/sklearn/linear_model/_stochastic_gradient.py:173: FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use `loss='log_loss'` which is equivalent.
warnings.warn(
End of iteration 5, added 11 new labels.
/home/runner/work/scikitlearn/scikitlearn/sklearn/linear_model/_stochastic_gradient.py:173: FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use `loss='log_loss'` which is equivalent.
warnings.warn(
End of iteration 6, added 11 new labels.
/home/runner/work/scikitlearn/scikitlearn/sklearn/linear_model/_stochastic_gradient.py:173: FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use `loss='log_loss'` which is equivalent.
warnings.warn(
End of iteration 7, added 3 new labels.
/home/runner/work/scikitlearn/scikitlearn/sklearn/linear_model/_stochastic_gradient.py:173: FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use `loss='log_loss'` which is equivalent.
warnings.warn(
End of iteration 8, added 6 new labels.
/home/runner/work/scikitlearn/scikitlearn/sklearn/linear_model/_stochastic_gradient.py:173: FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use `loss='log_loss'` which is equivalent.
warnings.warn(
End of iteration 9, added 4 new labels.
/home/runner/work/scikitlearn/scikitlearn/sklearn/linear_model/_stochastic_gradient.py:173: FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use `loss='log_loss'` which is equivalent.
warnings.warn(
End of iteration 10, added 2 new labels.
/home/runner/work/scikitlearn/scikitlearn/sklearn/linear_model/_stochastic_gradient.py:173: FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use `loss='log_loss'` which is equivalent.
warnings.warn(
Microaveraged F1 score on test set: 0.843

LabelSpreading on 20% of the data (rest is unlabeled):
Number of training samples: 2117
Unlabeled samples in training set: 1657
/home/runner/work/scikitlearn/scikitlearn/sklearn/utils/validation.py:727: FutureWarning: np.matrix usage is deprecated in 1.0 and will raise a TypeError in 1.2. Please convert to a numpy array with np.asarray. For more information see: https://numpy.org/doc/stable/reference/generated/numpy.matrix.html
warnings.warn(
/home/runner/work/scikitlearn/scikitlearn/sklearn/utils/validation.py:727: FutureWarning: np.matrix usage is deprecated in 1.0 and will raise a TypeError in 1.2. Please convert to a numpy array with np.asarray. For more information see: https://numpy.org/doc/stable/reference/generated/numpy.matrix.html
warnings.warn(
Microaveraged F1 score on test set: 0.671


.. codeblock:: default
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.semi_supervised import LabelSpreading
from sklearn.metrics import f1_score
# Loading dataset containing first five categories
data = fetch_20newsgroups(
subset="train",
categories=[
"alt.atheism",
"comp.graphics",
"comp.os.mswindows.misc",
"comp.sys.ibm.pc.hardware",
"comp.sys.mac.hardware",
],
)
print("%d documents" % len(data.filenames))
print("%d categories" % len(data.target_names))
print()
# Parameters
sdg_params = dict(alpha=1e5, penalty="l2", loss="log")
vectorizer_params = dict(ngram_range=(1, 2), min_df=5, max_df=0.8)
# Supervised Pipeline
pipeline = Pipeline(
[
("vect", CountVectorizer(**vectorizer_params)),
("tfidf", TfidfTransformer()),
("clf", SGDClassifier(**sdg_params)),
]
)
# SelfTraining Pipeline
st_pipeline = Pipeline(
[
("vect", CountVectorizer(**vectorizer_params)),
("tfidf", TfidfTransformer()),
("clf", SelfTrainingClassifier(SGDClassifier(**sdg_params), verbose=True)),
]
)
# LabelSpreading Pipeline
ls_pipeline = Pipeline(
[
("vect", CountVectorizer(**vectorizer_params)),
("tfidf", TfidfTransformer()),
# LabelSpreading does not support dense matrices
("todense", FunctionTransformer(lambda x: x.todense())),
("clf", LabelSpreading()),
]
)
def eval_and_print_metrics(clf, X_train, y_train, X_test, y_test):
print("Number of training samples:", len(X_train))
print("Unlabeled samples in training set:", sum(1 for x in y_train if x == 1))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(
"Microaveraged F1 score on test set: %0.3f"
% f1_score(y_test, y_pred, average="micro")
)
print("" * 10)
print()
if __name__ == "__main__":
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y)
print("Supervised SGDClassifier on 100% of the data:")
eval_and_print_metrics(pipeline, X_train, y_train, X_test, y_test)
# select a mask of 20% of the train dataset
y_mask = np.random.rand(len(y_train)) < 0.2
# X_20 and y_20 are the subset of the train dataset indicated by the mask
X_20, y_20 = map(
list, zip(*((x, y) for x, y, m in zip(X_train, y_train, y_mask) if m))
)
print("Supervised SGDClassifier on 20% of the training data:")
eval_and_print_metrics(pipeline, X_20, y_20, X_test, y_test)
# set the nonmasked subset to be unlabeled
y_train[~y_mask] = 1
print("SelfTrainingClassifier on 20% of the training data (rest is unlabeled):")
eval_and_print_metrics(st_pipeline, X_train, y_train, X_test, y_test)
print("LabelSpreading on 20% of the data (rest is unlabeled):")
eval_and_print_metrics(ls_pipeline, X_train, y_train, X_test, y_test)
.. rstclass:: sphxglrtiming
**Total running time of the script:** ( 0 minutes 7.354 seconds)
.. _sphx_glr_download_auto_examples_semi_supervised_plot_semi_supervised_newsgroups.py:
.. only:: html
.. container:: sphxglrfooter sphxglrfooterexample
.. container:: binderbadge
.. image:: images/binder_badge_logo.svg
:target: https://mybinder.org/v2/gh/scikitlearn/scikitlearn/1.1.X?urlpath=lab/tree/notebooks/auto_examples/semi_supervised/plot_semi_supervised_newsgroups.ipynb
:alt: Launch binder
:width: 150 px
.. container:: sphxglrdownload sphxglrdownloadpython
:download:`Download Python source code: plot_semi_supervised_newsgroups.py `
.. container:: sphxglrdownload sphxglrdownloadjupyter
:download:`Download Jupyter notebook: plot_semi_supervised_newsgroups.ipynb `
.. only:: html
.. rstclass:: sphxglrsignature
`Gallery generated by SphinxGallery `_