sklearn.datasets
.fetch_20newsgroups¶
-
sklearn.datasets.
fetch_20newsgroups
(data_home=None, subset=’train’, categories=None, shuffle=True, random_state=42, remove=(), download_if_missing=True)[source]¶ Load the filenames and data from the 20 newsgroups dataset (classification).
Download it if necessary.
Classes 20 Samples total 18846 Dimensionality 1 Features text Read more in the User Guide.
Parameters: - data_home : optional, default: None
Specify a download and cache folder for the datasets. If None, all scikit-learn data is stored in ‘~/scikit_learn_data’ subfolders.
- subset : ‘train’ or ‘test’, ‘all’, optional
Select the dataset to load: ‘train’ for the training set, ‘test’ for the test set, ‘all’ for both, with shuffled ordering.
- categories : None or collection of string or unicode
If None (default), load all the categories. If not None, list of category names to load (other categories ignored).
- shuffle : bool, optional
Whether or not to shuffle the data: might be important for models that make the assumption that the samples are independent and identically distributed (i.i.d.), such as stochastic gradient descent.
- random_state : int, RandomState instance or None (default)
Determines random number generation for dataset shuffling. Pass an int for reproducible output across multiple function calls. See Glossary.
- remove : tuple
May contain any subset of (‘headers’, ‘footers’, ‘quotes’). Each of these are kinds of text that will be detected and removed from the newsgroup posts, preventing classifiers from overfitting on metadata.
‘headers’ removes newsgroup headers, ‘footers’ removes blocks at the ends of posts that look like signatures, and ‘quotes’ removes lines that appear to be quoting another post.
‘headers’ follows an exact standard; the other filters are not always correct.
- download_if_missing : optional, True by default
If False, raise an IOError if the data is not locally available instead of trying to download the data from the source site.
Returns: - bunch : Bunch object with the following attribute:
- bunch.data: list, length [n_samples]
- bunch.target: array, shape [n_samples]
- bunch.filenames: list, length [n_samples]
- bunch.DESCR: a description of the dataset.
- bunch.target_names: a list of categories of the returned data,
length [n_classes]. This depends on the
categories
parameter.