1.17. Neural network models (supervised)#
Warning
This implementation is not intended for large-scale applications. In particular, scikit-learn offers no GPU support. For much faster, GPU-based implementations, as well as frameworks offering much more flexibility to build deep learning architectures, see Related Projects.
1.17.1. Multi-layer Perceptron#
Multi-layer Perceptron (MLP) is a supervised learning algorithm that learns a function $f: R^m \rightarrow R^o$ by training on a dataset, where $m$ is the number of dimensions for input and $o$ is the number of dimensions for output. Given a set of features $X = \{x_1, x_2, ..., x_m\}$ and a target $y$, it can learn a non-linear function approximator for either classification or regression. It is different from logistic regression in that between the input and the output layer there can be one or more non-linear layers, called hidden layers. Figure 1 shows a one hidden layer MLP with scalar output.
![../_images/multilayerperceptron_network.png](../_images/multilayerperceptron_network.png)
Figure 1: One hidden layer MLP.#
The leftmost layer, known as the input layer, consists of a set of neurons $\{x_i \mid x_1, x_2, ..., x_m\}$ representing the input features. Each neuron in the hidden layer transforms the values from the previous layer with a weighted linear summation $w_1 x_1 + w_2 x_2 + ... + w_m x_m$, followed by a non-linear activation function $g(\cdot): R \rightarrow R$, such as the hyperbolic tan function. The output layer receives the values from the last hidden layer and transforms them into output values.
The module contains the public attributes coefs_ and intercepts_. coefs_ is a list of weight matrices, where the weight matrix at index $i$ represents the weights between layer $i$ and layer $i+1$. intercepts_ is a list of bias vectors, where the vector at index $i$ represents the bias values added to layer $i+1$.
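For instance, after fitting a small classifier on a toy dataset (a minimal sketch; the data and settings below are only illustrative), the shapes of these attributes follow the layer sizes:
>>> from sklearn.neural_network import MLPClassifier
>>> X = [[0., 0.], [1., 1.]]
>>> y = [0, 1]
>>> clf = MLPClassifier(solver='lbfgs', hidden_layer_sizes=(5,), random_state=1).fit(X, y)
>>> [coef.shape for coef in clf.coefs_]        # (n_features, n_hidden), (n_hidden, n_outputs)
[(2, 5), (5, 1)]
>>> [bias.shape for bias in clf.intercepts_]   # one bias vector per non-input layer
[(5,), (1,)]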
Advantages and disadvantages of Multi-layer Perceptron#
The advantages of Multi-layer Perceptron are:
Capability to learn non-linear models.
Capability to learn models in real-time (on-line learning) using
partial_fit
.
The disadvantages of Multi-layer Perceptron (MLP) include:
MLP with hidden layers has a non-convex loss function, where there exists more than one local minimum. Therefore, different random weight initializations can lead to different validation accuracy.
MLP requires tuning a number of hyperparameters such as the number of hidden neurons, layers, and iterations.
MLP is sensitive to feature scaling.
Please see the Tips on Practical Use section, which addresses some of these disadvantages.
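As a small illustration of the sensitivity to weight initialization noted above, the hedged sketch below fits the same model with two different seeds; the learned weights, and hence the validation scores, may differ between the two fits (the dataset here is only an illustrative toy):
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.neural_network import MLPClassifier
>>> X, y = make_classification(n_samples=200, random_state=0)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> for seed in (0, 1):
...     clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500,
...                         random_state=seed).fit(X_train, y_train)
...     score = clf.score(X_test, y_test)   # may vary with the seed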
1.17.2. Classification#
Class MLPClassifier
implements a multi-layer perceptron (MLP) algorithm
that trains using Backpropagation.
MLP trains on two arrays: array X of size (n_samples, n_features), which holds the training samples represented as floating point feature vectors; and array y of size (n_samples,), which holds the target values (class labels) for the training samples:
>>> from sklearn.neural_network import MLPClassifier
>>> X = [[0., 0.], [1., 1.]]
>>> y = [0, 1]
>>> clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
...                     hidden_layer_sizes=(5, 2), random_state=1)
...
>>> clf.fit(X, y)
MLPClassifier(alpha=1e-05, hidden_layer_sizes=(5, 2), random_state=1,
              solver='lbfgs')
After fitting (training), the model can predict labels for new samples:
>>> clf.predict([[2., 2.], [-1., -2.]])
array([1, 0])
MLP can fit a non-linear model to the training data. clf.coefs_
contains the weight matrices that constitute the model parameters:
>>> [coef.shape for coef in clf.coefs_]
[(2, 5), (5, 2), (2, 1)]
Currently, MLPClassifier
supports only the
Cross-Entropy loss function, which allows probability estimates by running the
predict_proba
method.
MLP trains using Backpropagation. More precisely, it trains using some form of
gradient descent and the gradients are calculated using Backpropagation. For
classification, it minimizes the Cross-Entropy loss function, giving a vector
of probability estimates $P(y \mid x)$ per sample $x$:
>>> clf.predict_proba([[2., 2.], [1., 2.]])
array([[1.967...e-04, 9.998...e-01],
       [1.967...e-04, 9.998...e-01]])
MLPClassifier
supports multi-class classification by
applying Softmax
as the output function.
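For instance, a minimal sketch with a three-class toy problem (the data below is only illustrative): predict_proba then returns one softmax probability per class, and the probabilities of each sample sum to 1.
>>> from sklearn.neural_network import MLPClassifier
>>> X = [[0., 0.], [1., 1.], [2., 2.]]
>>> y = [0, 1, 2]
>>> clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
...                     hidden_layer_sizes=(15,), random_state=1).fit(X, y)
>>> clf.predict_proba([[1., 1.]]).shape   # one probability per class
(1, 3)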
Further, the model supports multi-label classification, in which a sample can belong to more than one class. For each class, the raw output passes through the logistic function. Values larger than or equal to 0.5 are rounded to 1, otherwise to 0. For a predicted output of a sample, the indices where the value is 1 represent the assigned classes of that sample:
>>> X = [[0., 0.], [1., 1.]]
>>> y = [[0, 1], [1, 1]]
>>> clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
...                     hidden_layer_sizes=(15,), random_state=1)
...
>>> clf.fit(X, y)
MLPClassifier(alpha=1e-05, hidden_layer_sizes=(15,), random_state=1,
              solver='lbfgs')
>>> clf.predict([[1., 2.]])
array([[1, 1]])
>>> clf.predict([[0., 0.]])
array([[0, 1]])
See the examples below and the docstring of
MLPClassifier.fit
for further information.
Examples
See Visualization of MLP weights on MNIST for a visualized representation of trained weights.
1.17.3. Regression#
Class MLPRegressor
implements a multi-layer perceptron (MLP) that
trains using backpropagation with no activation function in the output layer,
which can also be seen as using the identity function as activation function.
Therefore, it uses the square error as the loss function, and the output is a
set of continuous values.
MLPRegressor
also supports multi-output regression, in
which a sample can have more than one target.
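For example, a minimal sketch of multi-output regression (the toy data and settings here are only illustrative); a 2d y simply yields one predicted value per target:
>>> from sklearn.neural_network import MLPRegressor
>>> X = [[0., 0.], [1., 1.], [2., 2.]]
>>> y = [[0., 1.], [1., 2.], [2., 3.]]    # two targets per sample
>>> reg = MLPRegressor(solver='lbfgs', hidden_layer_sizes=(10,),
...                    random_state=1, max_iter=1000).fit(X, y)
>>> reg.predict([[1.5, 1.5]]).shape       # one prediction per target
(1, 2)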
1.17.4. Regularization#
Both MLPRegressor and MLPClassifier use the parameter alpha for the regularization (L2 regularization) term, which helps avoid overfitting by penalizing weights with large magnitudes. The following plot displays how the decision function varies with the value of alpha.
![../_images/sphx_glr_plot_mlp_alpha_001.png](../_images/sphx_glr_plot_mlp_alpha_001.png)
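In practice, alpha is simply passed to the estimator's constructor (the values below are only illustrative); larger values penalize large weights more strongly:
>>> from sklearn.neural_network import MLPClassifier
>>> weakly_regularized = MLPClassifier(alpha=1e-5)   # nearly unpenalized weights
>>> strongly_regularized = MLPClassifier(alpha=1.0)  # large weights are heavily penalized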
See the examples below for further information.
Examples
See Varying regularization in Multi-layer Perceptron for a comparison of different alpha values.
1.17.5. Algorithms#
MLP trains using Stochastic Gradient Descent, Adam, or L-BFGS. Stochastic Gradient Descent (SGD) updates parameters using the gradient of the loss function with respect to a parameter that needs adaptation, i.e.

$$w \leftarrow w - \eta \left(\alpha \frac{\partial R(w)}{\partial w} + \frac{\partial Loss}{\partial w}\right)$$

where $\eta$ is the learning rate, which controls the step-size in the parameter space search, and $Loss$ is the loss function used for the network.
More details can be found in the documentation of SGD.
Adam is similar to SGD in the sense that it is a stochastic optimizer, but it can automatically adjust the amount to update parameters based on adaptive estimates of lower-order moments.
With SGD or Adam, training supports online and mini-batch learning.
L-BFGS is a solver that approximates the Hessian matrix, which represents the second-order partial derivatives of a function. Further, it approximates the inverse of the Hessian matrix to perform parameter updates. The implementation uses the SciPy version of L-BFGS.
If the selected solver is ‘L-BFGS’, training supports neither online nor mini-batch learning.
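As noted above, SGD and Adam support incremental learning; a minimal sketch of mini-batch training with partial_fit (the data and batch split below are only illustrative) could look like:
>>> import numpy as np
>>> from sklearn.neural_network import MLPClassifier
>>> X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
>>> y = np.array([0, 0, 1, 1])
>>> clf = MLPClassifier(solver='adam', hidden_layer_sizes=(10,), random_state=1)
>>> for X_batch, y_batch in zip(np.array_split(X, 2), np.array_split(y, 2)):
...     # classes must be passed so the first batch does not need to contain all labels
...     _ = clf.partial_fit(X_batch, y_batch, classes=[0, 1])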
1.17.6. Complexity#
Suppose there are $n$ training samples, $m$ features, $k$ hidden layers, each containing $h$ neurons (for simplicity), and $o$ output neurons. The time complexity of backpropagation is $O(i \cdot n \cdot (m \cdot h + (k-1) \cdot h \cdot h + h \cdot o))$, where $i$ is the number of iterations. Since backpropagation has a high time complexity, it is advisable to start with a smaller number of hidden neurons and few hidden layers for training.
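As a rough worked example (assuming the expression above): with $n = 1000$ samples, $m = 20$ features, $k = 2$ hidden layers of $h = 10$ neurons each, $o = 1$ output neuron and $i = 100$ iterations, the cost is on the order of $100 \cdot 1000 \cdot (20 \cdot 10 + 1 \cdot 10 \cdot 10 + 10 \cdot 1) \approx 3 \times 10^7$ elementary operations.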
Mathematical formulation#
Given a set of training examples $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ where $x_i \in \mathbf{R}^m$ and $y_i \in \{0, 1\}$, a one hidden layer, one hidden neuron MLP learns the function $f(x) = W_2 \, g(W_1^T x + b_1) + b_2$, where $W_1 \in \mathbf{R}^m$ and $W_2, b_1, b_2 \in \mathbf{R}$ are model parameters. $W_1, W_2$ represent the weights of the input layer and hidden layer, respectively; and $b_1, b_2$ represent the bias added to the hidden layer and the output layer, respectively. $g(\cdot): R \rightarrow R$ is the activation function, set by default to the hyperbolic tan. It is given as,

$$g(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$
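A hedged NumPy sketch of this forward pass for the one hidden layer, one hidden neuron model (the weight and bias values below are arbitrary placeholders, not learned parameters):
>>> import numpy as np
>>> x = np.array([0.5, -1.0])        # one sample with m = 2 features
>>> W1 = np.array([0.1, 0.2])        # input-to-hidden weights, W1 in R^m
>>> b1, W2, b2 = 0.3, 0.4, 0.5       # hidden bias, hidden-to-output weight, output bias
>>> z = np.tanh(W1 @ x + b1)         # g(W1^T x + b1), hyperbolic tan activation
>>> f_x = W2 * z + b2                # f(x) = W2 * g(W1^T x + b1) + b2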
For binary classification, $f(x)$ passes through the logistic function $g(z) = 1/(1 + e^{-z})$ to obtain output values between zero and one. A threshold, set to 0.5, assigns samples with outputs larger than or equal to 0.5 to the positive class, and the rest to the negative class.
If there are more than two classes, $f(x)$ itself is a vector of size (n_classes,). Instead of passing through the logistic function, it passes through the softmax function, which is written as,

$$\text{softmax}(z)_i = \frac{\exp(z_i)}{\sum_{l=1}^{k} \exp(z_l)}$$

where $z_i$ represents the $i$-th element of the input to softmax, which corresponds to class $i$, and $k$ is the number of classes. The result is a vector containing the probabilities that sample $x$ belongs to each class. The output is the class with the highest probability.
In regression, the output remains as $f(x)$; therefore, the output activation function is just the identity function.
MLP uses different loss functions depending on the problem type. The loss function for classification is Average Cross-Entropy, which in the binary case is given as,

$$Loss(\hat{y}, y, W) = -\frac{1}{n}\sum_{i=0}^{n}\left(y_i \ln \hat{y}_i + (1 - y_i) \ln (1 - \hat{y}_i)\right) + \frac{\alpha}{2n} \|W\|_2^2$$

where $\alpha \|W\|_2^2$ is an L2-regularization term (aka penalty) that penalizes complex models, and $\alpha > 0$ is a non-negative hyperparameter that controls the magnitude of the penalty.
For regression, MLP uses the Mean Square Error loss function, written as,

$$Loss(\hat{y}, y, W) = \frac{1}{2n}\sum_{i=0}^{n}\|\hat{y}_i - y_i\|_2^2 + \frac{\alpha}{2n} \|W\|_2^2$$
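A hedged NumPy sketch of the two loss expressions above (the labels, predictions, and weights are arbitrary illustrative values, not outputs of a trained model):
>>> import numpy as np
>>> alpha, n = 1e-4, 3
>>> W = np.array([0.1, -0.3, 0.5])                  # flattened model weights
>>> penalty = alpha / (2 * n) * np.sum(W ** 2)      # the shared L2 term
>>> # average cross-entropy for binary classification
>>> y, y_hat = np.array([1., 0., 1.]), np.array([0.9, 0.2, 0.8])
>>> ce_loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)) + penalty
>>> # mean square error for regression
>>> y, y_hat = np.array([1.5, 0.2, 2.0]), np.array([1.4, 0.3, 1.8])
>>> mse_loss = np.sum((y_hat - y) ** 2) / (2 * n) + penalty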
Starting from initial random weights, multi-layer perceptron (MLP) minimizes the loss function by repeatedly updating these weights. After computing the loss, a backward pass propagates it from the output layer to the previous layers, providing each weight parameter with an update value meant to decrease the loss.
In gradient descent, the gradient $\nabla Loss_{W}$ of the loss with respect to the weights is computed and deducted from $W$. More formally, this is expressed as,

$$W^{i+1} = W^i - \epsilon \nabla Loss_{W}^{i}$$

where $i$ is the iteration step, and $\epsilon$ is the learning rate with a value larger than 0.
The algorithm stops when it reaches a preset maximum number of iterations; or when the improvement in loss is below a certain, small number.
1.17.7. Tips on Practical Use#
Multi-layer Perceptron is sensitive to feature scaling, so it is highly recommended to scale your data. For example, scale each attribute on the input vector X to [0, 1] or [-1, +1], or standardize it to have mean 0 and variance 1. Note that you must apply the same scaling to the test set for meaningful results. You can use StandardScaler for standardization.
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
>>> # Don't cheat - fit only on training data
>>> scaler.fit(X_train)
>>> X_train = scaler.transform(X_train)
>>> # apply same transformation to test data
>>> X_test = scaler.transform(X_test)
An alternative and recommended approach is to use StandardScaler in a Pipeline.
Finding a reasonable regularization parameter alpha is best done using GridSearchCV, usually in the range 10.0 ** -np.arange(1, 7); see the sketch after this list.
Empirically, we observed that L-BFGS converges faster and with better solutions on small datasets. For relatively large datasets, however, Adam is very robust. It usually converges quickly and gives pretty good performance. SGD with momentum or Nesterov's momentum, on the other hand, can perform better than those two algorithms if the learning rate is correctly tuned.
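A hedged sketch of the alpha search mentioned above, combined with scaling in a Pipeline (the dataset, grid, and settings are only illustrative):
>>> import numpy as np
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.neural_network import MLPClassifier
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> X, y = make_classification(n_samples=100, random_state=0)
>>> pipe = make_pipeline(StandardScaler(),
...                      MLPClassifier(solver='lbfgs', random_state=1, max_iter=1000))
>>> param_grid = {'mlpclassifier__alpha': 10.0 ** -np.arange(1, 7)}
>>> search = GridSearchCV(pipe, param_grid, cv=3).fit(X, y)
>>> best_alpha = search.best_params_['mlpclassifier__alpha']   # depends on the data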
1.17.8. More control with warm_start#
If you want more control over stopping criteria or learning rate in SGD,
or want to do additional monitoring, using warm_start=True
and
max_iter=1
and iterating yourself can be helpful:
>>> X = [[0., 0.], [1., 1.]]
>>> y = [0, 1]
>>> clf = MLPClassifier(hidden_layer_sizes=(15,), random_state=1, max_iter=1, warm_start=True)
>>> for i in range(10):
...     clf.fit(X, y)
...     # additional monitoring / inspection
MLPClassifier(...)
References#
“Learning representations by back-propagating errors.” Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams - Nature, 1986.
“Stochastic Gradient Descent” L. Bottou - Website, 2010.
“Backpropagation” Andrew Ng, Jiquan Ngiam, Chuan Yu Foo, Yifan Mai, Caroline Suen - Website, 2011.
“Efficient BackProp” Y. LeCun, L. Bottou, G. Orr, K. Müller - In Neural Networks: Tricks of the Trade 1998.
“Adam: A method for stochastic optimization.” Kingma, Diederik, and Jimmy Ba (2014)