sklearn.datasets.fetch_20newsgroups

sklearn.datasets.fetch_20newsgroups(*, data_home=None, subset='train', categories=None, shuffle=True, random_state=42, remove=(), download_if_missing=True, return_X_y=False, n_retries=3, delay=1.0)

Load the filenames and data from the 20 newsgroups dataset (classification).

Download it if necessary.

Classes: 20
Samples total: 18846
Dimensionality: 1
Features: text

Read more in the User Guide.

Parameters:
data_home: str or path-like, default=None

Specify a download and cache folder for the datasets. If None, all scikit-learn data is stored in ‘~/scikit_learn_data’ subfolders.

subset: {‘train’, ‘test’, ‘all’}, default=‘train’

Select the dataset to load: ‘train’ for the training set, ‘test’ for the test set, ‘all’ for both, with shuffled ordering.

categories: array-like, dtype=str, default=None

If None (default), load all categories. Otherwise, a list of the category names to load; all other categories are ignored.

shuffle: bool, default=True

Whether to shuffle the data. This can matter for models that assume the samples are independent and identically distributed (i.i.d.), such as those trained with stochastic gradient descent.

random_state: int, RandomState instance or None, default=42

Determines random number generation for dataset shuffling. Pass an int for reproducible output across multiple function calls. See Glossary.

remove: tuple, default=()

May contain any subset of (‘headers’, ‘footers’, ‘quotes’). Each of these is a kind of text that will be detected and removed from the newsgroup posts, which prevents classifiers from overfitting on metadata.

‘headers’ removes newsgroup headers, ‘footers’ removes blocks at the ends of posts that look like signatures, and ‘quotes’ removes lines that appear to be quoting another post.

‘headers’ follows an exact standard; the other filters are not always correct.
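To make the three filters concrete, the sketch below approximates them with simplified rules. It is an illustration only, not scikit-learn's implementation: the helper names (`strip_header`, `strip_quotes`, `strip_footer`) and the quote and signature heuristics are assumptions chosen for brevity.

```python
import re

# Hypothetical, simplified stand-ins for the three filters; the real
# heuristics in scikit-learn are more elaborate.


def strip_header(text):
    """Drop everything up to the first blank line (the header block)."""
    _header, _blank, body = text.partition("\n\n")
    return body


# Assumption: a quoted line starts with '>' or '|', or is an attribution
# line such as "Bob writes:".
_QUOTE_RE = re.compile(r"^\s*[>|]|\b(writes|wrote):\s*$")


def strip_quotes(text):
    """Drop lines that look like they quote another post."""
    kept = [line for line in text.split("\n") if not _QUOTE_RE.search(line)]
    return "\n".join(kept)


def strip_footer(text):
    """Drop a trailing signature introduced by the conventional '-- ' line."""
    lines = text.rstrip().split("\n")
    # Search from the end so only the last signature delimiter counts.
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].rstrip() == "--":
            return "\n".join(lines[:i]).rstrip()
    return text


post = (
    "From: alice@example.com\n"
    "Subject: Re: orbits\n"
    "\n"
    "Bob writes:\n"
    "> How high is low Earth orbit?\n"
    "Roughly 160 to 2000 km.\n"
    "-- \n"
    "Alice\n"
)

cleaned = strip_footer(strip_quotes(strip_header(post)))
print(cleaned)  # Roughly 160 to 2000 km.
```

Only the header rule rests on a fixed format (headers end at the first blank line); the quote and signature rules are guesses about style, which is why such filters can misfire on real posts.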

download_if_missing: bool, default=True

If False, raise an OSError if the data is not locally available instead of trying to download the data from the source site.

return_X_y: bool, default=False

If True, returns (data.data, data.target) instead of a Bunch object.

New in version 0.22.

n_retries: int, default=3

Number of retries when HTTP errors are encountered.

New in version 1.5.

delay: float, default=1.0

Number of seconds between retries.

New in version 1.5.
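The download-related parameters can be combined into an offline-first loading pattern. The sketch below is one possible arrangement, not a prescribed workflow: the helper name `load_newsgroups` and its `allow_download` flag are hypothetical, and passing `n_retries` and `delay` assumes scikit-learn 1.5 or later.

```python
from sklearn.datasets import fetch_20newsgroups


def load_newsgroups(data_home=None, allow_download=True, **kwargs):
    """Load from the local cache first; optionally fall back to downloading.

    Returns None when the data is not cached and downloading is disabled.
    """
    try:
        # Never touches the network: raises OSError if nothing is cached.
        return fetch_20newsgroups(
            data_home=data_home, download_if_missing=False, **kwargs
        )
    except OSError:
        if not allow_download:
            return None
    # Cache miss: download, retrying more patiently than the defaults
    # (n_retries and delay require scikit-learn >= 1.5).
    return fetch_20newsgroups(
        data_home=data_home,
        download_if_missing=True,
        n_retries=5,
        delay=2.0,
        **kwargs,
    )
```

Calling `load_newsgroups(allow_download=False)` on a machine with no cached copy returns `None` instead of starting a download, which can be convenient in test suites or sandboxed environments.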

Returns:
bunch: Bunch

Dictionary-like object, with the following attributes.

data: list of shape (n_samples,)

The raw text of each post, used as the data to learn from.

target: ndarray of shape (n_samples,)

The target labels.

filenames: ndarray of shape (n_samples,)

The path to the file that holds each sample.

DESCR: str

The full description of the dataset.

target_names: list of shape (n_classes,)

The names of target classes.

(data, target): tuple if return_X_y=True

A tuple of two objects. The first is a list of shape (n_samples,) that contains the raw text of each post. The second is an ndarray of shape (n_samples,) that contains the target labels.

New in version 0.22.
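Because the data consists of raw text rather than a numeric matrix, it generally has to be vectorized before it can be passed to an estimator. As a minimal sketch, assuming a TF-IDF representation and a multinomial naive Bayes classifier (both choices are illustrative, and the helper names are hypothetical), the `(data, target)` tuple can be fed into a pipeline:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def build_text_classifier():
    """Return an unfitted pipeline that maps raw strings to class labels."""
    return make_pipeline(TfidfVectorizer(), MultinomialNB())


def train_on_20newsgroups(categories=None):
    """Fetch the training split as (texts, labels) and fit the pipeline.

    The dataset is downloaded on first use, so nothing is fetched until
    this function is actually called.
    """
    texts, labels = fetch_20newsgroups(
        subset="train",
        categories=categories,
        remove=("headers", "footers", "quotes"),
        return_X_y=True,
    )
    clf = build_text_classifier()
    clf.fit(texts, labels)
    return clf
```

For example, `train_on_20newsgroups(['alt.atheism', 'sci.space'])` would return a fitted pipeline whose `predict` method accepts a list of strings.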

Examples

>>> from sklearn.datasets import fetch_20newsgroups
>>> cats = ['alt.atheism', 'sci.space']
>>> newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
>>> list(newsgroups_train.target_names)
['alt.atheism', 'sci.space']
>>> newsgroups_train.filenames.shape
(1073,)
>>> newsgroups_train.target.shape
(1073,)
>>> newsgroups_train.target[:10]
array([0, 1, 1, 1, 0, 1, 1, 0, 0, 0])
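
Continuing from the example above, the integer labels in `target` index into `target_names`, which makes it straightforward to count posts per category. The helper below is a hypothetical convenience, not part of scikit-learn; it only assumes an object that exposes `target` and `target_names`, such as the returned Bunch.

```python
from collections import Counter


def posts_per_category(bunch):
    """Map each category name to the number of posts carrying that label."""
    # Convert labels to plain ints so NumPy integers and Python ints
    # are counted under the same keys.
    counts = Counter(int(label) for label in bunch.target)
    return {
        name: counts.get(index, 0)
        for index, name in enumerate(bunch.target_names)
    }
```

For instance, `posts_per_category(newsgroups_train)` returns a dict keyed by `'alt.atheism'` and `'sci.space'`.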

Examples using sklearn.datasets.fetch_20newsgroups

Biclustering documents with the Spectral Co-clustering algorithm

Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation

Sample pipeline for text feature extraction and evaluation

Column Transformer with Heterogeneous Data Sources

Semi-supervised Classification on a Text Dataset

Classification of text documents using sparse features

Clustering text documents using k-means

FeatureHasher and DictVectorizer Comparison
