sklearn.datasets.fetch_20newsgroups_vectorized¶

sklearn.datasets.fetch_20newsgroups_vectorized(subset='train', remove=(), data_home=None, download_if_missing=True, return_X_y=False, normalize=True)[source]¶

Load the 20 newsgroups dataset and vectorize it into token counts (classification).
Download it if necessary.
This is a convenience function; the transformation is done using the default settings for sklearn.feature_extraction.text.CountVectorizer. For more advanced usage (stopword filtering, n-gram extraction, etc.), combine fetch_20newsgroups with a custom sklearn.feature_extraction.text.CountVectorizer, sklearn.feature_extraction.text.HashingVectorizer, sklearn.feature_extraction.text.TfidfTransformer or sklearn.feature_extraction.text.TfidfVectorizer.

The resulting counts are normalized using sklearn.preprocessing.normalize unless normalize is set to False.

Classes          20
Samples total    18846
Dimensionality   130107
Features         real
Read more in the User Guide.
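As a sketch of what this convenience function does under the hood (using toy documents in place of the actual newsgroup posts), the default CountVectorizer settings followed by normalization look roughly like this:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# Toy documents standing in for newsgroup posts (illustrative only).
docs = [
    "The quick brown fox jumps over the lazy dog",
    "Never jump over the lazy dog quickly",
]

# Default CountVectorizer settings, as used by fetch_20newsgroups_vectorized.
X = CountVectorizer().fit_transform(docs)

# fetch_20newsgroups_vectorized additionally applies
# sklearn.preprocessing.normalize (unit L2 norm per row)
# unless normalize=False is passed.
X_normalized = normalize(X)
```

Each row of X_normalized is a document's token-count vector scaled to unit Euclidean norm.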
- Parameters
- subset : ‘train’, ‘test’ or ‘all’, optional
Select the dataset to load: ‘train’ for the training set, ‘test’ for the test set, ‘all’ for both, with shuffled ordering.
- remove : tuple
May contain any subset of (‘headers’, ‘footers’, ‘quotes’). Each of these is a kind of text that will be detected and removed from the newsgroup posts, preventing classifiers from overfitting on metadata.
‘headers’ removes newsgroup headers, ‘footers’ removes blocks at the ends of posts that look like signatures, and ‘quotes’ removes lines that appear to be quoting another post.
- data_home : optional, default: None
Specify a download and cache folder for the datasets. If None, all scikit-learn data is stored in ‘~/scikit_learn_data’ subfolders.
- download_if_missing : optional, default: True
If False, raise an IOError if the data is not locally available instead of trying to download the data from the source site.
- return_X_y : bool, default=False
If True, returns (data.data, data.target) instead of a Bunch object.
New in version 0.20.
- normalize : bool, default=True
If True, normalizes each document’s feature vector to unit norm using sklearn.preprocessing.normalize.
New in version 0.22.
- Returns
- bunch : Bunch object with the following attributes:
bunch.data: sparse matrix, shape [n_samples, n_features]
bunch.target: array, shape [n_samples]
bunch.target_names: a list of categories of the returned data, length [n_classes]
bunch.DESCR: a description of the dataset
- (data, target) : tuple if return_X_y is True
New in version 0.20.
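A minimal usage sketch of the return_X_y form (the first call downloads the dataset, so it requires a network connection or a populated data_home cache):

```python
from sklearn.datasets import fetch_20newsgroups_vectorized

# Fetch the training split as a sparse token-count matrix plus labels.
# The vocabulary is fit on the full corpus, hence 130107 features
# regardless of the subset chosen.
X, y = fetch_20newsgroups_vectorized(subset='train', return_X_y=True)

print(X.shape)       # (n_samples, 130107) for the training split
print(len(set(y)))   # 20 newsgroup classes
```

Passing subset='all' instead yields the combined 18846 samples in shuffled order.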