.. _datasets:

=========================
Dataset loading utilities
=========================

.. currentmodule:: sklearn.datasets

The ``sklearn.datasets`` package embeds some small toy datasets
as introduced in the :ref:`Getting Started` section.

This package also features helpers to fetch larger datasets commonly
used by the machine learning community to benchmark algorithms on data
that comes from the 'real world'.

To evaluate the impact of the scale of the dataset (``n_samples`` and
``n_features``) while controlling the statistical properties of the data
(typically the correlation and informativeness of the features), it is
also possible to generate synthetic data.

General dataset API
===================

There are three main kinds of dataset interfaces that can be used to get
datasets depending on the desired type of dataset.

**The dataset loaders.** They can be used to load small standard datasets,
described in the :ref:`toy_datasets` section.

**The dataset fetchers.** They can be used to download and load larger
datasets, described in the :ref:`real_world_datasets` section.

Both loader and fetcher functions return a :class:`sklearn.utils.Bunch`
object holding at least two items:
an array of shape ``n_samples`` * ``n_features`` with
key ``data`` (except for 20newsgroups) and a numpy array of
length ``n_samples``, containing the target values, with key ``target``.

The Bunch object is a dictionary that exposes its keys as attributes.
For more information about Bunch objects, see :class:`sklearn.utils.Bunch`.

It's also possible for almost all of these functions to constrain the output
to be a tuple containing only the data and the target, by setting the
``return_X_y`` parameter to ``True``.

The datasets also contain a full description in their ``DESCR`` attribute and
some contain ``feature_names`` and ``target_names``. See the dataset
descriptions below for details.

**The dataset generation functions.** They can be used to generate controlled
synthetic datasets, described in the :ref:`sample_generators` section.

These functions return a tuple ``(X, y)`` consisting of a ``n_samples`` *
``n_features`` numpy array ``X`` and an array of length ``n_samples``
containing the targets ``y``.

In addition, there are also miscellaneous tools to load datasets of other
formats or from other locations, described in the :ref:`loading_other_datasets`
section.

.. _toy_datasets:

Toy datasets
============

scikit-learn comes with a few small standard datasets that do not require
downloading any file from an external website.

They can be loaded using the following functions:

.. autosummary::
   :toctree: ../modules/generated/
   :template: function.rst

   load_boston
   load_iris
   load_diabetes
   load_digits
   load_linnerud
   load_wine
   load_breast_cancer

These datasets are useful to quickly illustrate the behavior of the
various algorithms implemented in scikit-learn. They are, however, often too
small to be representative of real world machine learning tasks.
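
As a quick illustration of the loader interface described above, here is a
minimal sketch using :func:`load_iris` (the other loaders work the same way;
the shapes noted in comments are those of the bundled iris data)::

    from sklearn.datasets import load_iris

    # The loader returns a Bunch whose keys are also available as attributes.
    iris = load_iris()
    print(iris.data.shape)     # (150, 4): feature matrix under the ``data`` key
    print(iris.target.shape)   # (150,): target vector under the ``target`` key
    print(iris.DESCR[:100])    # start of the full dataset description

    # Alternatively, request only the (data, target) tuple.
    X, y = load_iris(return_X_y=True)

Setting ``return_X_y=True`` works in the same way for the fetchers and most
of the other loaders.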

.. include:: ../../sklearn/datasets/descr/boston_house_prices.rst

.. include:: ../../sklearn/datasets/descr/iris.rst

.. include:: ../../sklearn/datasets/descr/diabetes.rst

.. include:: ../../sklearn/datasets/descr/digits.rst

.. include:: ../../sklearn/datasets/descr/linnerud.rst

.. include:: ../../sklearn/datasets/descr/wine_data.rst

.. include:: ../../sklearn/datasets/descr/breast_cancer.rst

.. _real_world_datasets:

Real world datasets
===================

scikit-learn provides tools to load larger datasets, downloading them if
necessary.

They can be loaded using the following functions:

.. autosummary::
   :toctree: ../modules/generated/
   :template: function.rst

   fetch_olivetti_faces
   fetch_20newsgroups
   fetch_20newsgroups_vectorized
   fetch_lfw_people
   fetch_lfw_pairs
   fetch_covtype
   fetch_rcv1
   fetch_kddcup99
   fetch_california_housing

.. include:: ../../sklearn/datasets/descr/olivetti_faces.rst

.. include:: ../../sklearn/datasets/descr/twenty_newsgroups.rst

.. include:: ../../sklearn/datasets/descr/lfw.rst

.. include:: ../../sklearn/datasets/descr/covtype.rst

.. include:: ../../sklearn/datasets/descr/rcv1.rst

.. include:: ../../sklearn/datasets/descr/kddcup99.rst

.. include:: ../../sklearn/datasets/descr/california_housing.rst

.. _sample_generators:

Generated datasets
==================

In addition, scikit-learn includes various random sample generators that
can be used to build artificial datasets of controlled size and complexity.

Generators for classification and clustering
--------------------------------------------

These generators produce a matrix of features and corresponding discrete
targets.

Single label
~~~~~~~~~~~~

Both :func:`make_blobs` and :func:`make_classification` create multiclass
datasets by allocating each class one or more normally-distributed clusters of
points. :func:`make_blobs` provides greater control regarding the centers and
standard deviations of each cluster, and is used to demonstrate clustering.
:func:`make_classification` specialises in introducing noise by way of:
correlated, redundant and uninformative features; multiple Gaussian clusters
per class; and linear transformations of the feature space.

:func:`make_gaussian_quantiles` divides a single Gaussian cluster into
near-equal-size classes separated by concentric hyperspheres.
:func:`make_hastie_10_2` generates a similar binary, 10-dimensional problem.

.. image:: ../auto_examples/datasets/images/sphx_glr_plot_random_dataset_001.png
   :target: ../auto_examples/datasets/plot_random_dataset.html
   :scale: 50
   :align: center

:func:`make_circles` and :func:`make_moons` generate 2d binary classification
datasets that are challenging to certain algorithms (e.g. centroid-based
clustering or linear classification), including optional Gaussian noise. They
are useful for visualisation. :func:`make_circles` produces Gaussian data with
a spherical decision boundary for binary classification, while
:func:`make_moons` produces two interleaving half circles.

Multilabel
~~~~~~~~~~

:func:`make_multilabel_classification` generates random samples with multiple
labels, reflecting a bag of words drawn from a mixture of topics. The number of
topics for each document is drawn from a Poisson distribution, and the topics
themselves are drawn from a fixed random distribution. Similarly, the number of
words is drawn from a Poisson distribution, with words drawn from a
multinomial, where each topic defines a probability distribution over words.
Simplifications with respect to true bag-of-words mixtures include:

* Per-topic word distributions are independently drawn, where in reality all
  would be affected by a sparse base distribution, and would be correlated.
* For a document generated from multiple topics, all topics are weighted
  equally in generating its bag of words.
* Documents without labels have words drawn at random, rather than from a base
  distribution.

A short sketch of both generators follows the figure below.

.. image:: ../auto_examples/datasets/images/sphx_glr_plot_random_multilabel_dataset_001.png
   :target: ../auto_examples/datasets/plot_random_multilabel_dataset.html
   :scale: 50
   :align: center
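
To make these generators concrete, here is a minimal sketch using
:func:`make_classification` and :func:`make_multilabel_classification`; the
parameter values below are arbitrary illustrations, not recommendations::

    from sklearn.datasets import (make_classification,
                                  make_multilabel_classification)

    # Single label: 2 informative and 2 redundant features out of 5,
    # two classes, fixed random seed for reproducibility.
    X, y = make_classification(n_samples=100, n_features=5, n_informative=2,
                               n_redundant=2, n_classes=2, random_state=0)
    print(X.shape, y.shape)    # (100, 5) (100,)

    # Multilabel: each sample gets on average ``n_labels`` of the
    # ``n_classes`` topics; Y is a dense label indicator matrix.
    X, Y = make_multilabel_classification(n_samples=100, n_features=20,
                                          n_classes=5, n_labels=2,
                                          random_state=0)
    print(Y.shape)             # (100, 5)

Both calls return plain numpy arrays that can be fed directly to any
estimator.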

Biclustering
~~~~~~~~~~~~

.. autosummary::
   :toctree: ../modules/generated/
   :template: function.rst

   make_biclusters
   make_checkerboard

Generators for regression
-------------------------

:func:`make_regression` produces regression targets as an optionally-sparse
random linear combination of random features, with noise. Its informative
features may be uncorrelated, or low rank (few features account for most of
the variance).

Other regression generators generate functions deterministically from
randomized features. :func:`make_sparse_uncorrelated` produces a target as a
linear combination of four features with fixed coefficients.
Others encode explicitly non-linear relations:
:func:`make_friedman1` is related by polynomial and sine transforms;
:func:`make_friedman2` includes feature multiplication and reciprocation; and
:func:`make_friedman3` is similar with an arctan transformation on the target.

Generators for manifold learning
--------------------------------

.. autosummary::
   :toctree: ../modules/generated/
   :template: function.rst

   make_s_curve
   make_swiss_roll

Generators for decomposition
----------------------------

.. autosummary::
   :toctree: ../modules/generated/
   :template: function.rst

   make_low_rank_matrix
   make_sparse_coded_signal
   make_spd_matrix
   make_sparse_spd_matrix

.. _loading_other_datasets:

Loading other datasets
======================

.. _sample_images:

Sample images
-------------

Scikit-learn also embeds a couple of sample JPEG images published under
Creative Commons license by their authors. Those images can be useful to test
algorithms and pipelines on 2D data.

.. autosummary::
   :toctree: ../modules/generated/
   :template: function.rst

   load_sample_images
   load_sample_image

.. image:: ../auto_examples/cluster/images/sphx_glr_plot_color_quantization_001.png
   :target: ../auto_examples/cluster/plot_color_quantization.html
   :scale: 30
   :align: right

.. warning::

   The default coding of images is based on the ``uint8`` dtype to
   spare memory. Often machine learning algorithms work best if the
   input is converted to a floating point representation first. Also,
   if you plan to use ``matplotlib.pyplot.imshow``, don't forget to scale
   to the range 0 - 1 as done in the following example.

.. topic:: Examples:

   * :ref:`sphx_glr_auto_examples_cluster_plot_color_quantization.py`

.. _libsvm_loader:

Datasets in svmlight / libsvm format
------------------------------------

scikit-learn includes utility functions for loading datasets in the svmlight /
libsvm format. In this format, each line takes the form
``<label> <feature-id>:<feature-value> <feature-id>:<feature-value> ...``.
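
A minimal sketch of loading such a file with :func:`load_svmlight_file`; the
file path below is only a placeholder, not a file shipped with scikit-learn::

    from sklearn.datasets import load_svmlight_file

    # Returns the features as a scipy.sparse CSR matrix and the labels
    # as a dense numpy array.
    X_train, y_train = load_svmlight_file("/path/to/train_dataset.txt")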