Face recognition example using eigenfaces and kernel approximation

This example builds a classical face recognition pipeline on the “Labeled Faces in the Wild” (LFW) dataset, a preprocessed excerpt of which is available here: https://www.kaggle.com/datasets/jessicali9530/lfw-dataset

We reduce the dimensionality of the face images with PCA (the eigenfaces), then approximate the RBF kernel feature map with Nystroem and train a LogisticRegression on the resulting features. The full chain is wrapped in a Pipeline so that every preprocessing step is fit only on the training folds during cross-validation, which avoids leaking information from the validation folds. The hyperparameters are tuned with a successive halving search (HalvingRandomSearchCV) that minimizes the log loss. Finally, we evaluate the model both quantitatively, with a classification report and one-vs-rest ROC and precision-recall curves, and qualitatively, by displaying the predictions and the eigenfaces.

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

Loading the dataset

We download the Labeled Faces in the Wild (LFW) dataset and load it as NumPy arrays. Each sample is a flattened grayscale image; the target is the identity of the person pictured.

from sklearn.datasets import fetch_lfw_people

lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

# Introspect the image arrays to find the shapes (for plotting).
n_samples, height, width = lfw_people.images.shape

# For machine learning we use the data directly (relative pixel positions are
# ignored by this model).
X = lfw_people.data
n_features = X.shape[1]

# The label to predict is the id of the person.
y = lfw_people.target
target_names = lfw_people.target_names
n_classes = target_names.shape[0]

print("Total dataset size:")
print(f"n_samples: {n_samples}")
print(f"n_features: {n_features}")
print(f"n_classes: {n_classes}")

Total dataset size:
n_samples: 1288
n_features: 1850
n_classes: 7
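
As a small illustration of what "flattened" means here, the sketch below maps a pixel's (row, column) position to its index in the flat feature vector and back. The 50 by 37 image shape is an assumption: it is what `resize=0.4` is expected to produce, and it is consistent with the 1850 features printed above, but in the example itself the shape is read from `lfw_people.images.shape`. The helper names `flat_index` and `pixel_position` are hypothetical.

```python
# Assumed image shape; in the example it comes from lfw_people.images.shape.
height, width = 50, 37


def flat_index(row, col, width):
    """Index of pixel (row, col) in a row-major flattened image."""
    return row * width + col


def pixel_position(index, width):
    """Inverse mapping: flat feature index back to (row, col)."""
    return divmod(index, width)


n_features = height * width
print(n_features)                   # 1850
print(flat_index(1, 0, width))      # 37
print(pixel_position(1849, width))  # (49, 36)
```

Vertically adjacent pixels end up `width` positions apart in the feature vector, and nothing in the model uses that adjacency, which is what the code comment about ignored pixel positions refers to.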

Splitting the dataset

We hold out 25% of the data for testing. Preprocessing and model fitting are chained in a pipeline below so that scaling and feature extraction are learned only from the training folds during cross-validation.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

Building the model pipeline

We chain preprocessing and classification in a Pipeline. PCA extracts eigenfaces as a compact representation; Nystroem approximates the RBF feature map so that a linear LogisticRegression can model non-linear decision boundaries while scaling better with the number of samples than a kernel SVM. Because logistic regression outputs class probability estimates, we can tune the model by minimizing the log loss.

from sklearn.decomposition import PCA
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

n_components = 150

model = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=n_components, svd_solver="randomized", whiten=True)),
        ("nystroem", Nystroem(random_state=42)),
        ("logreg", LogisticRegression(max_iter=5_000)),
    ]
)
model

Pipeline(steps=[('scaler', StandardScaler()),
                ('pca',
                 PCA(n_components=150, svd_solver='randomized', whiten=True)),
                ('nystroem', Nystroem(random_state=42)),
                ('logreg', LogisticRegression(max_iter=5000))])
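
For reference, the feature map that Nystroem approximates can be written out as follows. This is the standard Nyström construction, not a transcription of scikit-learn's code. The symbols are notation introduced here: $z_1, \dots, z_m$ are the $m$ training points sampled as landmarks (with $m$ equal to `n_components`), $k_m(x)$ is the vector of kernel values between $x$ and the landmarks, and $K_{mm}$ is the kernel matrix of the landmarks.

```latex
k(x, z) = \exp\left(-\gamma \lVert x - z \rVert^2\right), \qquad
(K_{mm})_{ij} = k(z_i, z_j), \qquad
k_m(x) = \begin{bmatrix} k(x, z_1) & \cdots & k(x, z_m) \end{bmatrix}^{\top}

\phi(x) = K_{mm}^{-1/2} \, k_m(x)

\phi(x)^{\top} \phi(x') = k_m(x)^{\top} K_{mm}^{-1} \, k_m(x') \approx k(x, x')
```

A linear model trained on $\phi(x)$ therefore behaves approximately like a kernel model with the RBF kernel, while each sample is compared with only $m$ landmarks instead of with every training sample.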

Tuning the pipeline with successive halving

We tune the gamma and n_components of the Nystroem approximation and the C regularization of the logistic regression with a successive halving search (HalvingRandomSearchCV). The search minimizes the log loss (neg_log_loss) and screens many candidates on small training subsets before investing compute in the most promising ones. We set min_resources high enough that PCA can always extract 150 eigenfaces, even in the first halving iteration: with 300 samples and the default 5-fold cross-validation, each training fold still holds about 240 samples, more than the 150 components requested.

from time import time

from scipy.stats import loguniform, randint

from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV

print("Fitting the classifier to the training set")
t0 = time()
param_distributions = {
    "nystroem__gamma": loguniform(1e-4, 1e-1),
    "nystroem__n_components": randint(50, 200),
    "logreg__C": loguniform(1e-2, 1e2),
}
clf = HalvingRandomSearchCV(
    model,
    param_distributions,
    n_candidates=30,
    factor=3,
    min_resources=300,
    scoring="neg_log_loss",
    random_state=42,
)
clf = clf.fit(X_train, y_train)
print(f"done in {time() - t0:.3f}s")

Fitting the classifier to the training set
done in 21.472s
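
To see how the search budget is spent, the sketch below reproduces the successive halving schedule with plain arithmetic. It assumes the documented behaviour of HalvingRandomSearchCV when the resource is the number of samples and `aggressive_elimination` is left at its default of False. The value `max_resources=966` is the size of the training set (1288 samples minus the 322 held out). `halving_schedule` is a hypothetical helper, not a scikit-learn function.

```python
from math import ceil, floor, log


def halving_schedule(n_candidates, factor, min_resources, max_resources):
    """Return [(n_candidates, n_resources), ...] for each halving iteration."""
    # Iterations needed to narrow the candidates down to at most `factor`.
    n_required = 1 + floor(log(n_candidates, factor))
    # Iterations that fit between min_resources and max_resources.
    n_possible = 1 + floor(log(max_resources // min_resources, factor))
    n_iterations = min(n_required, n_possible)

    schedule = []
    for i in range(n_iterations):
        n_resources = min(factor**i * min_resources, max_resources)
        schedule.append((ceil(n_candidates / factor**i), n_resources))
    return schedule


schedule = halving_schedule(
    n_candidates=30, factor=3, min_resources=300, max_resources=966
)
print(schedule)  # [(30, 300), (10, 900)]
```

Under these assumptions, 30 candidates are screened on 300 samples and the 10 best are re-evaluated on 900 samples. The schedule that the fitted search actually used can be checked with `clf.n_candidates_` and `clf.n_resources_`.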

print("Best estimator found by successive halving search:")
clf.best_estimator_

Best estimator found by successive halving search:

Pipeline(steps=[('scaler', StandardScaler()),
                ('pca',
                 PCA(n_components=150, svd_solver='randomized', whiten=True)),
                ('nystroem',
                 Nystroem(gamma=np.float64(0.0011756010900231862),
                          n_components=173, random_state=42)),
                ('logreg',
                 LogisticRegression(C=np.float64(20.651425578959262),
                                    max_iter=5000))])
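
The candidates are ranked by neg_log_loss, the negated multiclass log loss, so that a higher score is better. As a reminder of what that metric measures, here is a minimal stdlib sketch on made-up probabilities. The numbers are illustrative and are not outputs of the model above; `multiclass_log_loss` is a hypothetical helper.

```python
from math import log


def multiclass_log_loss(y_true, y_proba):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, y_proba):
        total -= log(row[label])
    return total / len(y_true)


y_true = [0, 2, 1]
# Both sets of predictions rank the true class first for every sample.
confident = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8], [0.1, 0.8, 0.1]]
hesitant = [[0.4, 0.3, 0.3], [0.3, 0.3, 0.4], [0.3, 0.4, 0.3]]

loss_confident = multiclass_log_loss(y_true, confident)
loss_hesitant = multiclass_log_loss(y_true, hesitant)
print(f"{loss_confident:.3f}")  # 0.223
print(f"{loss_hesitant:.3f}")   # 0.916
```

Both toy predictors have the same accuracy, yet the log loss separates them. This is why it can be a finer-grained tuning criterion than accuracy when the model outputs probabilities.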

Quantitative evaluation#

We measure the model quality on the held-out test set with a classification report and, since the classifier outputs class probabilities, with one-vs-rest ROC and precision-recall curves. The pipeline handles preprocessing internally, so the raw test images can be passed directly to predict and predict_proba.

import matplotlib.pyplot as plt

from sklearn.metrics import classification_report
from sklearn.preprocessing import label_binarize

print("Predicting people's names on the test set")
t0 = time()
y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)
print(f"done in {time() - t0:.3f}s")

print(classification_report(y_test, y_pred, target_names=target_names))

Predicting people's names on the test set
done in 0.013s
                   precision    recall  f1-score   support

     Ariel Sharon       0.64      0.54      0.58        13
     Colin Powell       0.79      0.87      0.83        60
  Donald Rumsfeld       0.77      0.63      0.69        27
    George W Bush       0.87      0.95      0.91       146
Gerhard Schroeder       0.68      0.68      0.68        25
      Hugo Chavez       0.80      0.53      0.64        15
       Tony Blair       0.90      0.72      0.80        36

         accuracy                           0.83       322
        macro avg       0.78      0.70      0.73       322
     weighted avg       0.82      0.83      0.82       322
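The last two rows of the report aggregate the per-class scores in different ways: the macro average gives every identity the same weight, while the weighted average weights each identity by its support. The following is a minimal sketch on made-up toy labels (not the LFW data) that recomputes both aggregates for recall by hand. It also illustrates why the weighted-average recall coincides with the accuracy.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical toy labels: class 0 is frequent, classes 1 and 2 are rare.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 0, 2, 2])

# With the default average=None we get one value per class.
_, recall, _, support = precision_recall_fscore_support(y_true, y_hat)

# Macro average: plain mean over classes, regardless of class size.
macro_recall = recall.mean()

# Weighted average: mean over classes weighted by the number of true samples.
weighted_recall = np.average(recall, weights=support)

# Accuracy: fraction of correctly classified samples.
accuracy = (y_true == y_hat).mean()

print(f"macro recall:    {macro_recall:.3f}")
print(f"weighted recall: {weighted_recall:.3f}")
print(f"accuracy:        {accuracy:.3f}")
```

When the macro average is lower than the weighted average, as in the report above, the rare classes are recognized less reliably than the frequent ones.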

Because the problem is multiclass, we summarize the ranking quality of the predicted probabilities with one-vs-rest curves: each identity is in turn treated as the positive class against all the others. The ROC curve relates the true positive rate to the false positive rate, while the precision-recall curve is more informative when the positive class is rare, as is the case here for most identities, which each account for a small fraction of the test set.

from sklearn.metrics import PrecisionRecallDisplay, RocCurveDisplay

classes = list(range(n_classes))
y_onehot_test = label_binarize(y_test, classes=classes)

fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(13, 6))

for class_id, name in enumerate(target_names):
    RocCurveDisplay.from_predictions(
        y_onehot_test[:, class_id],
        y_score[:, class_id],
        name=name,
        ax=ax_roc,
        plot_chance_level=(class_id == n_classes - 1),
    )
    PrecisionRecallDisplay.from_predictions(
        y_onehot_test[:, class_id],
        y_score[:, class_id],
        name=name,
        ax=ax_pr,
    )

ax_roc.set_title("One-vs-rest ROC curves")
ax_pr.set_title("One-vs-rest precision-recall curves")
plt.tight_layout()
plt.show()

[Figure: one-vs-rest ROC curves (left panel) and one-vs-rest precision-recall curves (right panel)]
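The one-vs-rest curves can also be condensed into scalar summaries, namely the area under each ROC curve and the average precision of each precision-recall curve. The sketch below is only an illustration: it uses a synthetic 3-class dataset and a plain LogisticRegression as stand-ins for the face data and the tuned pipeline, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic 3-class problem standing in for the face data (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=6, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)

# One indicator column per class: 1 for the class, 0 for "the rest".
y_bin = label_binarize(y_te, classes=model.classes_)

n_demo_classes = len(model.classes_)
per_class_auc = [
    roc_auc_score(y_bin[:, k], proba[:, k]) for k in range(n_demo_classes)
]
per_class_ap = [
    average_precision_score(y_bin[:, k], proba[:, k])
    for k in range(n_demo_classes)
]

# The macro-averaged one-vs-rest ROC AUC is the mean of the per-class values.
macro_auc = roc_auc_score(y_te, proba, multi_class="ovr", average="macro")

for k in range(n_demo_classes):
    print(f"class {k}: ROC AUC={per_class_auc[k]:.3f}, AP={per_class_ap[k]:.3f}")
print(f"macro one-vs-rest ROC AUC: {macro_auc:.3f}")
```

Like the curves themselves, these summaries depend only on how the scores rank the samples, not on any particular decision threshold.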

Qualitative evaluation#

We visualize a gallery of test portraits with their predicted and true labels to inspect the model’s mistakes at a glance.

def plot_gallery(images, titles, height, width, n_row=3, n_col=4):
    """Plot a gallery of portraits."""
    fig, axs = plt.subplots(n_row, n_col, figsize=(1.8 * n_col, 2.4 * n_row))
    fig.subplots_adjust(bottom=0, left=0.01, right=0.99, top=0.90, hspace=0.35)
    for ax, image, title in zip(axs.ravel(), images, titles):
        ax.imshow(image.reshape((height, width)), cmap=plt.cm.gray)
        ax.set_title(title, size=12)
        ax.set_xticks(())
        ax.set_yticks(())
    return fig


def make_title(y_pred, y_test, target_names, i):
    pred_name = target_names[y_pred[i]].rsplit(" ", 1)[-1]
    true_name = target_names[y_test[i]].rsplit(" ", 1)[-1]
    return f"predicted: {pred_name}\ntrue:      {true_name}"


prediction_titles = [
    make_title(y_pred, y_test, target_names, i) for i in range(y_pred.shape[0])
]

plot_gallery(X_test, prediction_titles, height, width)

[Figure: gallery of 12 test portraits, each titled with its predicted and true name (Bush 9 times, Blair, Schroeder and Powell once each); all 12 predictions shown match the true label]

<Figure size 720x720 with 12 Axes>

Eigenfaces gallery#

We display the most significant eigenfaces, i.e. the principal components that form the basis of the face representation learned by the fitted pipeline.

pca = clf.best_estimator_.named_steps["pca"]
eigenfaces = pca.components_.reshape((pca.n_components_, height, width))

eigenface_titles = [f"eigenface {i}" for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, height, width)

plt.show()

[Figure: gallery of the first 12 eigenfaces, titled eigenface 0 to eigenface 11]
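To make the reshaping step above more concrete, here is a small sketch on random "images" (an assumption made purely so that the snippet runs without downloading data; the image size and variable names are hypothetical). Each row of components_ is a direction in pixel space, so it can be reshaped back to the image grid; the rows are orthonormal and ordered by decreasing explained variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
h, w = 8, 6  # hypothetical image height and width
images = rng.normal(size=(100, h * w))  # 100 flattened fake images

pca_demo = PCA(n_components=5).fit(images)

# One row per component, each of length h * w, reshaped to the image grid.
components_as_images = pca_demo.components_.reshape(
    (pca_demo.n_components_, h, w)
)

# The components form an orthonormal set: their Gram matrix is the identity.
gram = pca_demo.components_ @ pca_demo.components_.T

print(components_as_images.shape)
print(np.round(pca_demo.explained_variance_ratio_, 3))
```

On real face images, the leading components tend to look like blurry faces because they capture the dominant variations shared across portraits; on random noise, as here, they carry no visual structure.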

Conclusion#

This example walks through a classical face recognition pipeline in scikit-learn:

- Eigenfaces (PCA) reduce the high-dimensional pixel space to a compact set of uncorrelated features that capture the main variations across faces.
- Nystroem followed by LogisticRegression approximates an RBF-kernel classifier with a linear model fitted on approximate kernel features. This scales better than a kernel SVM, and the hyperparameters are tuned to minimize the log loss.
- Pipeline chains preprocessing and classification so that cross-validation does not leak information from the test set.
- Quantitative and qualitative evaluation on a held-out test set show whether the pipeline generalizes. The one-vs-rest ROC and precision-recall curves show how well the predicted probabilities rank each identity against the others, independently of any single decision threshold.
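As a compact recap of the points above, the following sketch chains the same kinds of steps on the small digits dataset bundled with scikit-learn. It is not the pipeline tuned in this example: the scaler step, the step names other than "pca", and all hyperparameter values are assumptions chosen only to keep the snippet short and self-contained.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Bundled 8x8 digit images stand in for the face images (assumption).
X_digits, y_digits = load_digits(return_X_y=True)

sketch = Pipeline(
    [
        ("scaler", StandardScaler()),  # assumed preprocessing step
        ("pca", PCA(n_components=30, random_state=0)),
        ("nystroem", Nystroem(kernel="rbf", n_components=100, random_state=0)),
        ("logreg", LogisticRegression(max_iter=2000)),
    ]
)

# Every step is refitted on the training folds only, so the held-out fold
# never influences the scaler, the PCA basis or the Nystroem landmarks.
scores = cross_val_score(
    sketch, X_digits, y_digits, cv=3, scoring="neg_log_loss"
)
print(f"mean log loss: {-scores.mean():.3f}")
```

Passing the whole Pipeline to the cross-validation helper, rather than pre-transforming the data once, is what prevents the leakage mentioned in the list above.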

In practice, face recognition is often better addressed with convolutional neural networks, but this family of models is outside the scope of the scikit-learn library. Interested readers should instead try PyTorch or TensorFlow to implement such models.

Total running time of the script: (0 minutes 22.607 seconds)