.. _neural_networks_supervised:

==================================
Neural network models (supervised)
==================================

.. currentmodule:: sklearn.neural_network


.. warning::

    This implementation is not intended for large-scale applications. In particular,
    scikit-learn offers no GPU support. For much faster, GPU-based implementations,
    as well as frameworks offering much more flexibility to build deep learning
    architectures, see  :ref:`related_projects`.

.. _multilayer_perceptron:

Multi-layer Perceptron
======================

**Multi-layer Perceptron (MLP)** is a supervised learning algorithm that learns
a function :math:`f(\cdot): R^m \rightarrow R^o` by training on a dataset,
where :math:`m` is the number of dimensions for input and :math:`o` is the
number of dimensions for output. Given a set of features :math:`X = {x_1, x_2, ..., x_m}`
and a target :math:`y`, it can learn a non-linear function approximator for either
classification or regression. It is different from logistic regression, in that
between the input and the output layer, there can be one or more non-linear
layers, called hidden layers. Figure 1 shows a one hidden layer MLP with scalar
output.

.. figure:: ../images/multilayerperceptron_network.png
   :align: center
   :scale: 60%

   **Figure 1 : One hidden layer MLP.**

The leftmost layer, known as the input layer, consists of a set of neurons
:math:`\{x_i | x_1, x_2, ..., x_m\}` representing the input features. Each
neuron in the hidden layer transforms the values from the previous layer with
a weighted linear summation :math:`w_1x_1 + w_2x_2 + ... + w_mx_m`, followed
by a non-linear activation function :math:`g(\cdot):R \rightarrow R` - like
the hyperbolic tan function. The output layer receives the values from the
last hidden layer and transforms them into output values.

The module contains the public attributes ``coefs_`` and ``intercepts_``.
``coefs_`` is a list of weight matrices, where weight matrix at index
:math:`i` represents the weights between layer :math:`i` and layer
:math:`i+1`. ``intercepts_`` is a list of bias vectors, where the vector
at index :math:`i` represents the bias values added to layer :math:`i+1`.

The advantages of Multi-layer Perceptron are:

    + Capability to learn non-linear models.

    + Capability to learn models in real-time (on-line learning)
      using ``partial_fit``.


The disadvantages of Multi-layer Perceptron (MLP) include:

    + MLP with hidden layers have a non-convex loss function where there exists
      more than one local minimum. Therefore different random weight
      initializations can lead to different validation accuracy.

    + MLP requires tuning a number of hyperparameters such as the number of
      hidden neurons, layers, and iterations.

    + MLP is sensitive to feature scaling.

Please see :ref:`Tips on Practical Use <mlp_tips>` section that addresses
some of these disadvantages.


Classification
==============

Class :class:`MLPClassifier` implements a multi-layer perceptron (MLP) algorithm
that trains using `Backpropagation <http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm>`_.

MLP trains on two arrays: array X of size (n_samples, n_features), which holds
the training samples represented as floating point feature vectors; and array
y of size (n_samples,), which holds the target values (class labels) for the
training samples::

    >>> from sklearn.neural_network import MLPClassifier
    >>> X = [[0., 0.], [1., 1.]]
    >>> y = [0, 1]
    >>> clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
    ...                     hidden_layer_sizes=(5, 2), random_state=1)
    ...
    >>> clf.fit(X, y)                         # doctest: +NORMALIZE_WHITESPACE
    MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto',
           beta_1=0.9, beta_2=0.999, early_stopping=False,
           epsilon=1e-08, hidden_layer_sizes=(5, 2), learning_rate='constant',
           learning_rate_init=0.001, max_iter=200, momentum=0.9,
           nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
           solver='lbfgs', tol=0.0001, validation_fraction=0.1, verbose=False,
           warm_start=False)

After fitting (training), the model can predict labels for new samples::

    >>> clf.predict([[2., 2.], [-1., -2.]])
    array([1, 0])

MLP can fit a non-linear model to the training data. ``clf.coefs_``
contains the weight matrices that constitute the model parameters::

    >>> [coef.shape for coef in clf.coefs_]
    [(2, 5), (5, 2), (2, 1)]

Currently, :class:`MLPClassifier` supports only the
Cross-Entropy loss function, which allows probability estimates by running the
``predict_proba`` method.

MLP trains using Backpropagation. More precisely, it trains using some form of
gradient descent and the gradients are calculated using Backpropagation. For
classification, it minimizes the Cross-Entropy loss function, giving a vector
of probability estimates :math:`P(y|x)` per sample :math:`x`::

    >>> clf.predict_proba([[2., 2.], [1., 2.]])  # doctest: +ELLIPSIS
    array([[  1.967...e-04,   9.998...-01],
           [  1.967...e-04,   9.998...-01]])

:class:`MLPClassifier` supports multi-class classification by
applying `Softmax <https://en.wikipedia.org/wiki/Softmax_activation_function>`_
as the output function.

Further, the model supports :ref:`multi-label classification <multiclass>`
in which a sample can belong to more than one class. For each class, the raw
output passes through the logistic function. Values larger or equal to `0.5`
are rounded to `1`, otherwise to `0`. For a predicted output of a sample, the
indices where the value is `1` represents the assigned classes of that sample::

    >>> X = [[0., 0.], [1., 1.]]
    >>> y = [[0, 1], [1, 1]]
    >>> clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
    ...                     hidden_layer_sizes=(15,), random_state=1)
    ...
    >>> clf.fit(X, y)                         # doctest: +NORMALIZE_WHITESPACE
    MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto',
           beta_1=0.9, beta_2=0.999, early_stopping=False,
           epsilon=1e-08, hidden_layer_sizes=(15,), learning_rate='constant',
           learning_rate_init=0.001, max_iter=200, momentum=0.9,
           nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
           solver='lbfgs', tol=0.0001, validation_fraction=0.1, verbose=False,
           warm_start=False)
    >>> clf.predict([[1., 2.]])
    array([[1, 1]])
    >>> clf.predict([[0., 0.]])
    array([[0, 1]])

See the examples below and the doc string of
:meth:`MLPClassifier.fit` for further information.

.. topic:: Examples:

 * :ref:`sphx_glr_auto_examples_neural_networks_plot_mlp_training_curves.py`
 * :ref:`sphx_glr_auto_examples_neural_networks_plot_mnist_filters.py`

Regression
==========

Class :class:`MLPRegressor` implements a multi-layer perceptron (MLP) that
trains using backpropagation with no activation function in the output layer,
which can also be seen as using the identity function as activation function.
Therefore, it uses the square error as the loss function, and the output is a
set of continuous values.

:class:`MLPRegressor` also supports multi-output regression, in
which a sample can have more than one target.

Regularization
==============

Both :class:`MLPRegressor` and class:`MLPClassifier` use parameter ``alpha``
for regularization (L2 regularization) term which helps in avoiding overfitting
by penalizing weights with large magnitudes. Following plot displays varying
decision function with value of alpha.

.. figure:: ../auto_examples/neural_networks/images/sphx_glr_plot_mlp_alpha_001.png
   :target: ../auto_examples/neural_networks/plot_mlp_alpha.html
   :align: center
   :scale: 75

See the examples below for further information.

.. topic:: Examples:

 * :ref:`sphx_glr_auto_examples_neural_networks_plot_mlp_alpha.py`

Algorithms
==========

MLP trains using `Stochastic Gradient Descent
<https://en.wikipedia.org/wiki/Stochastic_gradient_descent>`_,
`Adam <http://arxiv.org/abs/1412.6980>`_, or
`L-BFGS <https://en.wikipedia.org/wiki/Limited-memory_BFGS>`__.
Stochastic Gradient Descent (SGD) updates parameters using the gradient of the
loss function with respect to a parameter that needs adaptation, i.e.

.. math::

    w \leftarrow w - \eta (\alpha \frac{\partial R(w)}{\partial w}
    + \frac{\partial Loss}{\partial w})

where :math:`\eta` is the learning rate which controls the step-size in
the parameter space search.  :math:`Loss` is the loss function used
for the network.

More details can be found in the documentation of
`SGD <http://scikit-learn.org/stable/modules/sgd.html>`_

Adam is similar to SGD in a sense that it is a stochastic optimizer, but it can
automatically adjust the amount to update parameters based on adaptive estimates
of lower-order moments.

With SGD or Adam, training supports online and mini-batch learning.

L-BFGS is a solver that approximates the Hessian matrix which represents the
second-order partial derivative of a function. Further it approximates the
inverse of the Hessian matrix to perform parameter updates. The implementation
uses the Scipy version of `L-BFGS
<http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin_l_bfgs_b.html>`_.

If the selected solver is 'L-BFGS', training does not support online nor
mini-batch learning.


Complexity
==========

Suppose there are :math:`n` training samples, :math:`m` features, :math:`k`
hidden layers, each containing :math:`h` neurons - for simplicity, and :math:`o`
output neurons.  The time complexity of backpropagation is
:math:`O(n\cdot m \cdot h^k \cdot o \cdot i)`, where :math:`i` is the number
of iterations. Since backpropagation has a high time complexity, it is advisable
to start with smaller number of hidden neurons and few hidden layers for
training.


Mathematical formulation
========================

Given a set of training examples :math:`(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)`
where :math:`x_i \in \mathbf{R}^n` and :math:`y_i \in \{0, 1\}`, a one hidden
layer one hidden neuron MLP learns the function :math:`f(x) = W_2 g(W_1^T x + b_1) + b_2`
where :math:`W_1 \in \mathbf{R}^m` and :math:`W_2, b_1, b_2 \in \mathbf{R}` are
model parameters. :math:`W_1, W_2` represent the weights of the input layer and
hidden layer, resepctively; and :math:`b_1, b_2` represent the bias added to
the hidden layer and the output layer, respectively.
:math:`g(\cdot) : R \rightarrow R` is the activation function, set by default as
the hyperbolic tan. It is given as,

.. math::
      g(z)= \frac{e^z-e^{-z}}{e^z+e^{-z}}

For binary classification, :math:`f(x)` passes through the logistic function
:math:`g(z)=1/(1+e^{-z})` to obtain output values between zero and one. A
threshold, set to 0.5, would assign samples of outputs larger or equal 0.5
to the positive class, and the rest to the negative class.

If there are more than two classes, :math:`f(x)` itself would be a vector of
size (n_classes,). Instead of passing through logistic function, it passes
through the softmax function, which is written as,

.. math::
      \text{softmax}(z)_i = \frac{\exp(z_i)}{\sum_{l=1}^k\exp(z_l)}

where :math:`z_i` represents the :math:`i` th element of the input to softmax,
which corresponds to class :math:`i`, and :math:`K` is the number of classes.
The result is a vector containing the probabilities that sample :math:`x`
belong to each class. The output is the class with the highest probability.

In regression, the output remains as :math:`f(x)`; therefore, output activation
function is just the identity function.

MLP uses different loss functions depending on the problem type. The loss
function for classification is Cross-Entropy, which in binary case is given as,

.. math::

    Loss(\hat{y},y,W) = -y \ln {\hat{y}} - (1-y) \ln{(1-\hat{y})} + \alpha ||W||_2^2

where :math:`\alpha ||W||_2^2` is an L2-regularization term (aka penalty)
that penalizes complex models; and :math:`\alpha > 0` is a non-negative
hyperparameter that controls the magnitude of the penalty.

For regression, MLP uses the Square Error loss function; written as,

.. math::

    Loss(\hat{y},y,W) = \frac{1}{2}||\hat{y} - y ||_2^2 + \alpha ||W||_2^2


Starting from initial random weights, multi-layer perceptron (MLP) minimizes
the loss function by repeatedly updating these weights. After computing the
loss, a backward pass propagates it from the output layer to the previous
layers, providing each weight parameter with an update value meant to decrease
the loss.

In gradient descent, the gradient :math:`\nabla Loss_{W}` of the loss with respect
to the weights is computed and deducted from :math:`W`.
More formally, this is expressed as,

.. math::
    W^{i+1} = W^i - \epsilon \nabla {Loss}_{W}^{i}


where :math:`i` is the iteration step, and :math:`\epsilon` is the learning rate
with a value larger than 0.

The algorithm stops when it reaches a preset maximum number of iterations; or
when the improvement in loss is below a certain, small number.


.. _mlp_tips:

Tips on Practical Use
=====================

  * Multi-layer Perceptron is sensitive to feature scaling, so it
    is highly recommended to scale your data. For example, scale each
    attribute on the input vector X to [0, 1] or [-1, +1], or standardize
    it to have mean 0 and variance 1. Note that you must apply the *same*
    scaling to the test set for meaningful results.
    You can use :class:`StandardScaler` for standardization.

      >>> from sklearn.preprocessing import StandardScaler  # doctest: +SKIP
      >>> scaler = StandardScaler()  # doctest: +SKIP
      >>> # Don't cheat - fit only on training data
      >>> scaler.fit(X_train)  # doctest: +SKIP
      >>> X_train = scaler.transform(X_train)  # doctest: +SKIP
      >>> # apply same transformation to test data
      >>> X_test = scaler.transform(X_test)  # doctest: +SKIP

    An alternative and recommended approach is to use :class:`StandardScaler`
    in a :class:`Pipeline`

  * Finding a reasonable regularization parameter :math:`\alpha` is
    best done using :class:`GridSearchCV`, usually in the
    range ``10.0 ** -np.arange(1, 7)``.

  * Empirically, we observed that `L-BFGS` converges faster and
    with better solutions on small datasets. For relatively large
    datasets, however, `Adam` is very robust. It usually converges
    quickly and gives pretty good performance. `SGD` with momentum or
    nesterov's momentum, on the other hand, can perform better than
    those two algorithms if learning rate is correctly tuned.

More control with warm_start
============================
If you want more control over stopping criteria or learning rate in SGD,
or want to do additional monitoring, using ``warm_start=True`` and
``max_iter=1`` and iterating yourself can be helpful::

    >>> X = [[0., 0.], [1., 1.]]
    >>> y = [0, 1]
    >>> clf = MLPClassifier(hidden_layer_sizes=(15,), random_state=1, max_iter=1, warm_start=True)
    >>> for i in range(10):
    ...     clf.fit(X, y)
    ...     # additional monitoring / inspection # doctest: +ELLIPSIS
    MLPClassifier(...

.. topic:: References:

    * `"Learning representations by back-propagating errors."
      <http://www.iro.umontreal.ca/~pift6266/A06/refs/backprop_old.pdf>`_
      Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams.

    * `"Stochastic Gradient Descent" <http://leon.bottou.org/projects/sgd>`_ L. Bottou - Website, 2010.

    * `"Backpropagation" <http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm>`_
      Andrew Ng, Jiquan Ngiam, Chuan Yu Foo, Yifan Mai, Caroline Suen - Website, 2011.

    * `"Efficient BackProp" <http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf>`_
      Y. LeCun, L. Bottou, G. Orr, K. Müller - In Neural Networks: Tricks
      of the Trade 1998.

    *  `"Adam: A method for stochastic optimization."
       <http://arxiv.org/pdf/1412.6980v8.pdf>`_
       Kingma, Diederik, and Jimmy Ba. arXiv preprint arXiv:1412.6980 (2014).