sklearn.datasets.fetch_mldata
- sklearn.datasets.fetch_mldata(dataname, target_name='label', data_name='data', transpose_data=True, data_home=None)
Fetch an mldata.org data set.
If the file does not exist locally yet, it is downloaded from mldata.org.
mldata.org does not have an enforced convention for storing data or naming the columns in a data set. The default behavior of this function works well with the most common cases:
- data values are stored in the column ‘data’, and target values in the column ‘label’
- alternatively, the first column stores target values, and the second data values
- the data array is stored as n_features x n_samples, and thus needs to be transposed to match the sklearn standard
Keyword arguments allow adapting these defaults to specific data sets (see the parameters target_name, data_name, and transpose_data, and the examples below).
mldata.org data sets may have multiple columns, which are stored in the Bunch object under their original names.
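The "name or index" column selection described above can be sketched without any download. The helper `pick_columns` below is hypothetical (it is not part of scikit-learn), and the column values are made up; only the column names 'double0' and 'class' are taken from the examples further down.

```python
def pick_columns(columns, target_name="label", data_name="data"):
    """Hypothetical helper: return (data, target) from ordered (name, values) pairs.

    A column can be addressed by its name (str) or by its position (int).
    """
    names = [name for name, _ in columns]

    def lookup(key):
        if isinstance(key, int):
            return columns[key][1]
        # names.index raises ValueError if the column name is unknown
        return columns[names.index(key)][1]

    return lookup(data_name), lookup(target_name)


# Illustrative data set with non-default column names
columns = [
    ("double0", [[5.1, 3.5], [4.9, 3.0]]),  # data values
    ("class", [0, 1]),                      # target values
]

by_index = pick_columns(columns, target_name=1, data_name=0)
by_name = pick_columns(columns, target_name="class", data_name="double0")
print(by_index == by_name)
```

Both calls select the same pair of columns, which mirrors how target_name and data_name accept either form.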
Parameters: dataname:
Name of the data set on mldata.org, e.g. “leukemia” or “Whistler Daily Snowfall”. The raw name is automatically converted to an mldata.org URL.
target_name: optional, default: ‘label’
Name or index of the column containing the target values.
data_name: optional, default: ‘data’
Name or index of the column containing the data.
transpose_data: optional, default: True
If True, transpose the downloaded data array.
data_home: optional, default: None
Specify another download and cache folder for the data sets. By default all scikit-learn data is stored in ‘~/scikit_learn_data’ subfolders.
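As a rough illustration of the data_home default, the sketch below resolves the cache folder using only the standard library. `resolve_data_home` is a hypothetical name introduced here; scikit-learn provides `sklearn.datasets.get_data_home` for the real lookup, and its behavior may differ in details such as environment variables or folder creation.

```python
import os


def resolve_data_home(data_home=None):
    """Hypothetical helper: return the folder used for downloads and caching."""
    if data_home is None:
        # Default location stated in the parameter description
        data_home = os.path.join("~", "scikit_learn_data")
    # Expand a leading '~' to the current user's home directory
    return os.path.expanduser(data_home)


print(resolve_data_home())
print(resolve_data_home("/tmp/my_cache"))
```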
Returns: data : Bunch
Dictionary-like object. The interesting attributes are: ‘data’, the data to learn; ‘target’, the classification labels; ‘DESCR’, the full description of the dataset; and ‘COL_NAMES’, the original names of the dataset columns.
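To make "dictionary-like object" concrete, here is a minimal stand-in that supports both key and attribute access. `SimpleBunch` is a hypothetical class written for this illustration, not scikit-learn's own Bunch, and the field values are made up.

```python
class SimpleBunch(dict):
    """Hypothetical stand-in: a dict whose keys are also readable as attributes."""

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)


dataset = SimpleBunch(
    data=[[5.1, 3.5], [4.9, 3.0]],   # illustrative samples
    target=[0, 1],                   # illustrative labels
    DESCR="illustrative description",
    COL_NAMES=["label", "data"],
)

# The same field is reachable as a key or as an attribute
print(dataset["target"] is dataset.target)
print(sorted(dataset.keys()))
```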
Examples
Load the ‘iris’ dataset from mldata.org:
>>> from sklearn.datasets.mldata import fetch_mldata
>>> import tempfile
>>> test_data_home = tempfile.mkdtemp()

>>> iris = fetch_mldata('iris', data_home=test_data_home)
>>> iris.target.shape
(150,)
>>> iris.data.shape
(150, 4)
Load the ‘leukemia’ dataset from mldata.org, which needs to be transposed to respect the sklearn axes convention:
>>> leuk = fetch_mldata('leukemia', transpose_data=True,
...                     data_home=test_data_home)
>>> leuk.data.shape
(72, 7129)
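The effect of transposition on the array layout can be seen on a tiny, made-up example. The sketch below uses plain nested lists as a stand-in for the downloaded array; with a NumPy array the same step would be `raw.T`.

```python
# Hypothetical raw array laid out as n_features x n_samples
# (3 features, 2 samples); the values are illustrative only.
raw = [
    [5.1, 4.9],  # feature 0 for sample 0 and sample 1
    [3.5, 3.0],  # feature 1
    [1.4, 1.4],  # feature 2
]

# Transpose to the n_samples x n_features layout that sklearn expects
data = [list(row) for row in zip(*raw)]

n_samples, n_features = len(data), len(data[0])
print(n_samples, n_features)
```

After the transposition each row holds one sample, so the first dimension counts samples, as in the (72, 7129) shape above.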
Load an alternative ‘iris’ dataset, which has different names for the columns:
>>> iris2 = fetch_mldata('datasets-UCI iris', target_name=1,
...                      data_name=0, data_home=test_data_home)
>>> iris3 = fetch_mldata('datasets-UCI iris',
...                      target_name='class', data_name='double0',
...                      data_home=test_data_home)

>>> import shutil
>>> shutil.rmtree(test_data_home)