Precision-Recall#

Example of Precision-Recall metric to evaluate classifier output quality.

Precision-Recall is a useful measure of success of prediction when the classes are very imbalanced. In information retrieval, precision is a measure of the fraction of relevant items among actually returned items while recall is a measure of the fraction of items that were returned among all items that should have been returned. ‘Relevancy’ here refers to items that are positively labeled, i.e., true positives and false negatives.

Precision (\(P\)) is defined as the number of true positives (\(T_p\)) over the number of true positives plus the number of false positives (\(F_p\)).

\[P = \frac{T_p}{T_p+F_p}\]

Recall (\(R\)) is defined as the number of true positives (\(T_p\)) over the number of true positives plus the number of false negatives (\(F_n\)).

\[R = \frac{T_p}{T_p + F_n}\]
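As a small illustration of the two definitions, the sketch below counts \(T_p\), \(F_p\) and \(F_n\) by hand on made-up labels and compares the ratios with `precision_score` and `recall_score`. The arrays `y_true` and `y_pred` are toy values chosen for this illustration and are not part of the example's dataset.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy ground truth and hard predictions (hypothetical values for illustration).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# Count the confusion-matrix entries used in the formulas.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

precision = tp / (tp + fp)  # P = T_p / (T_p + F_p)
recall = tp / (tp + fn)  # R = T_p / (T_p + F_n)

print(f"precision={precision:.2f} (sklearn: {precision_score(y_true, y_pred):.2f})")
print(f"recall={recall:.2f} (sklearn: {recall_score(y_true, y_pred):.2f})")
```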

The precision-recall curve shows the tradeoff between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision. High precision is achieved by having few false positives in the returned results, and high recall is achieved by having few false negatives in the relevant results. High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all relevant results (high recall).

A system with high recall but low precision returns most of the relevant items, but the proportion of returned results that are incorrectly labeled is high. A system with high precision but low recall is just the opposite, returning very few of the relevant items, but most of its predicted labels are correct when compared to the actual labels. An ideal system with high precision and high recall will return most of the relevant items, with most results labeled correctly.

The definition of precision (\(\frac{T_p}{T_p + F_p}\)) shows that lowering the threshold of a classifier may increase the denominator, by increasing the number of results returned. If the threshold was previously set too high, the new results may all be true positives, which will increase precision. If the previous threshold was about right or too low, further lowering the threshold will introduce false positives, decreasing precision.

Recall is defined as \(\frac{T_p}{T_p+F_n}\), where \(T_p+F_n\) does not depend on the classifier threshold. Changing the classifier threshold can only change the numerator, \(T_p\). Lowering the classifier threshold may increase recall, by increasing the number of true positive results. It is also possible that lowering the threshold may leave recall unchanged, while the precision fluctuates. Thus, precision does not necessarily decrease with recall.
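The effect of lowering the threshold can be traced by hand on a tiny example. The sketch below uses four made-up samples and a hypothetical helper, `precision_recall_at`, that applies a threshold to the scores and recomputes both metrics; none of these names come from the example itself.

```python
import numpy as np

# Toy labels and scores (illustrative values, not taken from the iris data).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])


def precision_recall_at(threshold):
    """Return (precision, recall) when predicting positive for score >= threshold."""
    y_pred = y_score >= threshold
    tp = int(np.sum(y_pred & (y_true == 1)))
    fp = int(np.sum(y_pred & (y_true == 0)))
    fn = int(np.sum(~y_pred & (y_true == 1)))
    return tp / (tp + fp), tp / (tp + fn)


# Lower the threshold step by step and watch both metrics.
for threshold in [0.8, 0.4, 0.35, 0.1]:
    precision, recall = precision_recall_at(threshold)
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

On this toy data, moving the threshold from 0.8 to 0.4 leaves recall at 0.5 while precision drops from 1.0 to 0.5. Moving it from 0.4 to 0.35 raises recall to 1.0 and also raises precision to about 0.67, so precision does not always fall as recall grows.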

The relationship between recall and precision can be observed in the stairstep area of the plot: at the edges of these steps, a small change in the threshold considerably reduces precision, with only a minor gain in recall.

Average precision (AP) summarizes such a plot as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight:

\[\text{AP} = \sum_n (R_n - R_{n-1}) P_n\]

where \(P_n\) and \(R_n\) are the precision and recall at the nth threshold. A pair \((R_k, P_k)\) is referred to as an operating point.
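The sum can be reproduced from the operating points returned by `precision_recall_curve`. The following is a minimal sketch on made-up data. It assumes, as the function's documentation describes, that recall is returned in decreasing order, so the recall increments appear with a negative sign in `np.diff`.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Toy labels and scores (illustrative values only).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, _ = precision_recall_curve(y_true, y_score)

# recall is sorted in decreasing order, so -diff(recall) gives R_n - R_{n-1};
# the last precision value (the appended end point) carries no weight.
ap_manual = -np.sum(np.diff(recall) * precision[:-1])
ap_sklearn = average_precision_score(y_true, y_score)

print(f"manual AP={ap_manual:.4f}, average_precision_score={ap_sklearn:.4f}")
```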

AP and the trapezoidal area under the operating points (sklearn.metrics.auc) are common ways to summarize a precision-recall curve that lead to different results. Read more in the User Guide.
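To make the difference concrete, the sketch below computes both summaries on the same made-up scores. The data are toy values chosen for illustration; on real data the size of the gap will vary.

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Toy labels and scores (illustrative values only).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, _ = precision_recall_curve(y_true, y_score)

# Step-wise summary: each precision is weighted by the recall increment.
ap = average_precision_score(y_true, y_score)
# Trapezoidal summary: linear interpolation between operating points.
trapezoid_area = auc(recall, precision)

print(f"AP={ap:.4f}, trapezoidal area={trapezoid_area:.4f}")
```

Here AP is about 0.83 while the trapezoidal area is about 0.79, because linear interpolation between operating points assigns a different precision to each recall interval than the step-wise sum does.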

Precision-recall curves are typically used in binary classification to study the output of a classifier. In order to extend the precision-recall curve and average precision to multi-class or multi-label classification, it is necessary to binarize the output. One curve can be drawn per label, but one can also draw a precision-recall curve by considering each element of the label indicator matrix as a binary prediction (micro-averaging).

Note

See also sklearn.metrics.average_precision_score, sklearn.metrics.recall_score, sklearn.metrics.precision_score, sklearn.metrics.f1_score

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

In binary classification settings#

Dataset and model#

We will use a Linear SVC classifier to differentiate two types of irises.

import numpy as np

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Add noisy features
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.concatenate([X, random_state.randn(n_samples, 200 * n_features)], axis=1)

# Limit to the first two classes, and split into training and test
X_train, X_test, y_train, y_test = train_test_split(
    X[y < 2], y[y < 2], test_size=0.5, random_state=random_state
)

Linear SVC will expect each feature to have a similar range of values. Thus, we will first scale the data using a StandardScaler.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

classifier = make_pipeline(StandardScaler(), LinearSVC(random_state=random_state))
classifier.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('linearsvc',
                 LinearSVC(random_state=RandomState(MT19937) at 0x7644A197ED40))])
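To see what the scaling step does in isolation, here is a minimal sketch on a made-up two-column array whose features differ in range by a factor of 100. The array `X_demo` is hypothetical and unrelated to the iris data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (toy values for illustration).
X_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])

X_scaled = StandardScaler().fit_transform(X_demo)

# After scaling, each column has zero mean and unit standard deviation.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

After the transform both columns contain the same values, so a linear model no longer sees one feature as numerically dominant.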


Plot the Precision-Recall curve#

To plot the precision-recall curve, you should use PrecisionRecallDisplay. There are three methods available:

- for plotting a single curve:
  - from_estimator for when you have not computed the predictions
  - from_predictions for when you already have the predictions
- for plotting multiple curves using cross-validation results: from_cv_results

Let’s first plot the precision-recall curve without precomputing the classifier predictions. We use from_estimator, which computes the predictions for us before plotting the curve.

from sklearn.metrics import PrecisionRecallDisplay

display = PrecisionRecallDisplay.from_estimator(
    classifier, X_test, y_test, name="LinearSVC", plot_chance_level=True, despine=True
)
_ = display.ax_.set_title("2-class Precision-Recall curve")

If we have already computed the estimated probabilities or scores for our model, we can use from_predictions.

y_score = classifier.decision_function(X_test)

display = PrecisionRecallDisplay.from_predictions(
    y_test, y_score, name="LinearSVC", plot_chance_level=True, despine=True
)
_ = display.ax_.set_title("2-class Precision-Recall curve")

The from_cv_results method takes the cross-validation results returned by cross_validate and plots a precision-recall curve for each fold.

from sklearn.model_selection import cross_validate

classifier = make_pipeline(StandardScaler(), LinearSVC(random_state=random_state))
cv_results = cross_validate(
    classifier, X_train, y_train, return_estimator=True, return_indices=True
)
display = PrecisionRecallDisplay.from_cv_results(cv_results, X_train, y_train)
_ = display.ax_.set_title("Cross-validation Precision-Recall curves")


In multi-label settings#

The precision-recall curve does not support the multilabel setting. However, one can decide how to handle this case. We show such an example below.

Create multi-label data, fit, and predict#

We create a multi-label dataset to illustrate precision-recall in multi-label settings.

from sklearn.preprocessing import label_binarize

# Use label_binarize to build a multi-label-like indicator matrix
Y = label_binarize(y, classes=[0, 1, 2])
n_classes = Y.shape[1]

# Split into training and test
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.5, random_state=random_state
)

We use OneVsRestClassifier for multi-label prediction.

from sklearn.multiclass import OneVsRestClassifier

classifier = OneVsRestClassifier(
    make_pipeline(StandardScaler(), LinearSVC(random_state=random_state))
)
classifier.fit(X_train, Y_train)
y_score = classifier.decision_function(X_test)

The average precision score in multi-label settings#

from sklearn.metrics import average_precision_score, precision_recall_curve

# For each class
precision = dict()
recall = dict()
average_precision = dict()
for i in range(n_classes):
    precision[i], recall[i], _ = precision_recall_curve(Y_test[:, i], y_score[:, i])
    average_precision[i] = average_precision_score(Y_test[:, i], y_score[:, i])

# A "micro-average": quantifying score on all classes jointly
precision["micro"], recall["micro"], _ = precision_recall_curve(
    Y_test.ravel(), y_score.ravel()
)
average_precision["micro"] = average_precision_score(Y_test, y_score, average="micro")
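The micro-average treats every entry of the label indicator matrix as one binary prediction. As a sanity check, the sketch below compares `average="micro"` with the score obtained by flattening both matrices first. The matrices `Y_demo` and `S_demo` are made-up toy values, not the outputs of the classifier above.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy label indicator matrix (4 samples, 3 labels) and matching scores.
Y_demo = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]])
S_demo = np.array(
    [[0.9, 0.2, 0.1], [0.3, 0.6, 0.4], [0.2, 0.5, 0.7], [0.4, 0.3, 0.6]]
)

# Micro-averaging pools all (sample, label) pairs into one binary problem.
ap_micro = average_precision_score(Y_demo, S_demo, average="micro")
ap_flat = average_precision_score(Y_demo.ravel(), S_demo.ravel())

print(f"micro AP={ap_micro:.4f}, AP on flattened arrays={ap_flat:.4f}")
```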

Plot the micro-averaged Precision-Recall curve#

from collections import Counter

display = PrecisionRecallDisplay(
    recall=recall["micro"],
    precision=precision["micro"],
    average_precision=average_precision["micro"],
    prevalence_pos_label=Counter(Y_test.ravel())[1] / Y_test.size,
)
display.plot(plot_chance_level=True, despine=True)
_ = display.ax_.set_title("Micro-averaged over all classes")
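The `prevalence_pos_label` value passed above is the fraction of positive entries in the flattened indicator matrix, which sets the chance level. A minimal sketch on made-up labels shows that, for a 0/1 matrix, the `Counter` expression matches the plain mean. The array `y_demo` is hypothetical.

```python
from collections import Counter

import numpy as np
from sklearn.preprocessing import label_binarize

# Toy single-label targets with three classes (illustrative values only).
y_demo = np.array([0, 1, 2, 2, 1, 0, 1])
Y_demo = label_binarize(y_demo, classes=[0, 1, 2])

# Fraction of positive entries, computed two equivalent ways.
prevalence_counter = Counter(Y_demo.ravel())[1] / Y_demo.size
prevalence_mean = Y_demo.mean()

print(prevalence_counter, prevalence_mean)
```

Because each sample carries exactly one of the three labels here, one entry in three is positive and the pooled prevalence is 1/3.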

Plot Precision-Recall curve for each class and iso-f1 curves#

from itertools import cycle

import matplotlib.pyplot as plt

# setup plot details
colors = cycle(["navy", "turquoise", "darkorange", "cornflowerblue", "teal"])

_, ax = plt.subplots(figsize=(7, 8))

f_scores = np.linspace(0.2, 0.8, num=4)
for f_score in f_scores:
    x = np.linspace(0.01, 1)
    y = f_score * x / (2 * x - f_score)
    (l,) = plt.plot(x[y >= 0], y[y >= 0], color="gray", alpha=0.2)
    plt.annotate("f1={0:0.1f}".format(f_score), xy=(0.9, y[45] + 0.02))

display = PrecisionRecallDisplay(
    recall=recall["micro"],
    precision=precision["micro"],
    average_precision=average_precision["micro"],
)
display.plot(
    ax=ax, name="Micro-average precision-recall", curve_kwargs={"color": "gold"}
)

for i, color in zip(range(n_classes), colors):
    display = PrecisionRecallDisplay(
        recall=recall[i],
        precision=precision[i],
        average_precision=average_precision[i],
    )
    display.plot(
        ax=ax,
        name=f"Precision-recall for class {i}",
        curve_kwargs={"color": color},
        despine=True,
    )

# add the legend for the iso-f1 curves
handles, labels = display.ax_.get_legend_handles_labels()
handles.extend([l])
labels.extend(["iso-f1 curves"])
# set the legend and the axes
ax.legend(handles=handles, labels=labels, loc="best")
ax.set_title("Extension of Precision-Recall curve to multi-class")

plt.show()
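The iso-f1 curves come from solving \(F_1 = \frac{2PR}{P + R}\) for precision, which gives \(P = \frac{F_1 R}{2R - F_1}\), valid where \(2R > F_1\). The sketch below checks this numerically for one F1 value; the recall grid is an arbitrary choice made for illustration.

```python
import numpy as np

f_score = 0.4
# Keep recall above f_score / 2 so the denominator stays positive.
recall = np.linspace(0.3, 1.0, 8)

# Precision along the iso-f1 curve, same expression as in the plot code.
precision = f_score * recall / (2 * recall - f_score)

# Recompute F1 from each (precision, recall) pair on the curve.
f1 = 2 * precision * recall / (precision + recall)

print(np.round(f1, 6))
```

Every point on the curve recovers the same F1 value, which is why each gray line in the figure marks a constant F1 level.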