sklearn.datasets.fetch_20newsgroups_vectorized

sklearn.datasets.fetch_20newsgroups_vectorized(*, subset='train', remove=(), data_home=None, download_if_missing=True, return_X_y=False, normalize=True)[source]

Load the 20 newsgroups dataset and vectorize it into token counts (classification).

Download it if necessary.

This is a convenience function; the transformation is done using the default settings for sklearn.feature_extraction.text.CountVectorizer. For more advanced usage (stopword filtering, n-gram extraction, etc.), combine fetch_20newsgroups with a custom sklearn.feature_extraction.text.CountVectorizer, sklearn.feature_extraction.text.HashingVectorizer, sklearn.feature_extraction.text.TfidfTransformer or sklearn.feature_extraction.text.TfidfVectorizer.

The resulting counts are normalized using sklearn.preprocessing.normalize unless normalize is set to False.
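
A minimal usage sketch of both routes just described: the pre-vectorized convenience loader versus fetching the raw posts and vectorizing them yourself (the stop-word and n-gram settings below are illustrative choices, not defaults):

from sklearn.datasets import fetch_20newsgroups, fetch_20newsgroups_vectorized
from sklearn.feature_extraction.text import TfidfVectorizer

# Convenience route: token counts, normalized to unit norm by default.
bunch = fetch_20newsgroups_vectorized(subset='train')
print(bunch.data.shape)  # sparse (n_samples, n_features) matrix

# Custom route: fetch the raw posts, then vectorize with your own settings,
# e.g. English stop-word filtering and unigram+bigram features.
raw = fetch_20newsgroups(subset='train')
X = TfidfVectorizer(stop_words='english', ngram_range=(1, 2)).fit_transform(raw.data)
print(X.shape)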

Classes: 20
Samples total: 18846
Dimensionality: 130107
Features: real

Read more in the User Guide.

Parameters
subset : ‘train’, ‘test’ or ‘all’, optional

Select the dataset to load: ‘train’ for the training set, ‘test’ for the test set, ‘all’ for both, with shuffled ordering.

remove : tuple, default=()

May contain any subset of (‘headers’, ‘footers’, ‘quotes’). Each of these is a kind of text that will be detected and removed from the newsgroup posts, preventing classifiers from overfitting on metadata.

‘headers’ removes newsgroup headers, ‘footers’ removes blocks at the ends of posts that look like signatures, and ‘quotes’ removes lines that appear to be quoting another post (see the sketch after this parameter list).

data_home : optional, default=None

Specify a download and cache folder for the datasets. If None, all scikit-learn data is stored in ‘~/scikit_learn_data’ subfolders.

download_if_missing : optional, default=True

If False, raise an IOError if the data is not locally available instead of trying to download the data from the source site.

return_X_y : bool, default=False

If True, returns (data.data, data.target) instead of a Bunch object.

New in version 0.20.

normalize : bool, default=True

If True, normalizes each document’s feature vector to unit norm using sklearn.preprocessing.normalize.

New in version 0.22.
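
A sketch combining several of the parameters above: metadata stripping, raw (unnormalized) counts, and a direct (X, y) return. The subset choice is arbitrary, and output shapes depend on it:

from sklearn.datasets import fetch_20newsgroups_vectorized

X, y = fetch_20newsgroups_vectorized(
    subset='test',
    remove=('headers', 'footers', 'quotes'),
    normalize=False,   # keep raw token counts instead of unit-norm rows
    return_X_y=True,   # (data, target) tuple instead of a Bunch
)
print(X.shape, y.shape)  # sparse counts matrix and integer class labels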

Returns
bunch : Bunch

Dictionary-like object, with the following attributes.

data: sparse matrix, shape [n_samples, n_features]

The data matrix to learn.

target: array, shape [n_samples]

The target labels.

target_names: list, length [n_classes]

The names of target classes.

DESCR: str

The full description of the dataset.

(data, target) : tuple if return_X_y is True

A tuple of the data matrix and the target labels, returned only when return_X_y=True.

New in version 0.20.
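
A short sketch of accessing the two return forms documented above, assuming the full corpus has been downloaded (the shapes match the summary table at the top of this page):

from sklearn.datasets import fetch_20newsgroups_vectorized

bunch = fetch_20newsgroups_vectorized(subset='all')
print(bunch.data.shape)          # (18846, 130107) sparse matrix
print(bunch.target.shape)        # (18846,) integer labels
print(len(bunch.target_names))   # 20 class names
print(bunch.DESCR[:60])          # start of the dataset description

# With return_X_y=True, a (data, target) tuple is returned instead.
X, y = fetch_20newsgroups_vectorized(subset='all', return_X_y=True)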