sklearn.datasets.fetch_20newsgroups_vectorized(subset='train', remove=(), data_home=None)

Load the 20 newsgroups dataset and transform it into tf-idf vectors.

This is a convenience function; the tf-idf transformation is done using the default settings for sklearn.feature_extraction.text.Vectorizer. For more advanced usage (stopword filtering, n-gram extraction, etc.), combine fetch_20newsgroups with a custom Vectorizer or CountVectorizer.
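A minimal sketch of the "custom vectorizer" route described above. It assumes a recent scikit-learn release, where the tf-idf vectorizer is exposed as TfidfVectorizer (assumed here to be the successor of the Vectorizer class named above). The helper names and the specific settings (English stop words, unigrams plus bigrams) are hypothetical choices for illustration only.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer


def vectorize(texts):
    """Fit a custom tf-idf vectorizer on raw texts.

    Returns the sparse document-term matrix and the fitted vectorizer.
    The settings below are illustrative assumptions, not library defaults.
    """
    vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
    matrix = vectorizer.fit_transform(texts)
    return matrix, vectorizer


def vectorize_newsgroups(subset="train"):
    """Load the raw posts and vectorize them with the custom settings.

    Calling this downloads the dataset on first use.
    """
    posts = fetch_20newsgroups(subset=subset)
    return vectorize(posts.data)
```

Calling vectorize_newsgroups("train") would return a sparse matrix with one row per post. The vectorize helper can also be tried on any small list of strings without downloading anything.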


subset: ‘train’, ‘test’, or ‘all’, optional :

Select the dataset to load: ‘train’ for the training set, ‘test’ for the test set, ‘all’ for both, with shuffled ordering.

data_home: optional, default: None :

Specify a download and cache folder for the datasets. If None, all scikit-learn data is stored in ‘~/scikit_learn_data’ subfolders.

remove: tuple :

May contain any subset of (‘headers’, ‘footers’, ‘quotes’). Each of these is a kind of text that will be detected and removed from the newsgroup posts, preventing classifiers from overfitting on metadata.

‘headers’ removes newsgroup headers, ‘footers’ removes blocks at the ends of posts that look like signatures, and ‘quotes’ removes lines that appear to be quoting another post.
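The remove options above could be used along these lines. This is a sketch only: ALLOWED_REMOVE, unknown_remove_items, and load_stripped are hypothetical helper names, and the up-front check is an added convenience rather than something the library is known to perform.

```python
from sklearn.datasets import fetch_20newsgroups_vectorized

# The three kinds of text documented for the `remove` parameter.
ALLOWED_REMOVE = ("headers", "footers", "quotes")


def unknown_remove_items(remove):
    """Return the entries of `remove` that are not documented options."""
    return [item for item in remove if item not in ALLOWED_REMOVE]


def load_stripped(subset="train", remove=ALLOWED_REMOVE):
    """Load tf-idf vectors with the selected kinds of text stripped.

    Calling this downloads the dataset on first use.
    """
    unknown = unknown_remove_items(remove)
    if unknown:
        raise ValueError("unsupported remove entries: %r" % (unknown,))
    return fetch_20newsgroups_vectorized(subset=subset, remove=tuple(remove))
```

For example, load_stripped("test", remove=("headers", "quotes")) would keep signature-like footers but drop headers and quoted lines.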


bunch : Bunch object

bunch.data: sparse matrix, shape [n_samples, n_features]

bunch.target: array, shape [n_samples]

bunch.target_names: list, length [n_classes]
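The layout of the returned object can be checked with a small helper such as the one below. It assumes the fields are exposed as data, target, and target_names, the attribute names scikit-learn's Bunch uses for this loader. summarize and toy_bunch are hypothetical names, and toy_bunch builds a tiny stand-in with made-up values so the layout can be inspected without downloading the dataset.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.utils import Bunch


def summarize(bunch):
    """Report the sizes implied by the documented shapes."""
    n_samples, n_features = bunch.data.shape
    # The target array is expected to hold one label per sample.
    if bunch.target.shape != (n_samples,):
        raise ValueError("target length does not match number of samples")
    return {
        "n_samples": n_samples,
        "n_features": n_features,
        "n_classes": len(bunch.target_names),
    }


def toy_bunch():
    """Build a tiny stand-in Bunch (illustrative values, not real data)."""
    data = csr_matrix(np.array([[0.0, 0.5, 0.0], [1.0, 0.0, 0.0]]))
    target = np.array([0, 1])
    names = ["alt.atheism", "comp.graphics"]
    return Bunch(data=data, target=target, target_names=names)
```

The same summarize helper should work on the object returned by fetch_20newsgroups_vectorized, where bunch.target_names[bunch.target[i]] gives the newsgroup name of the i-th post.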

Examples using sklearn.datasets.fetch_20newsgroups_vectorized