Precision-Recall#

Example of Precision-Recall metric to evaluate classifier output quality.

Precision-Recall is a useful measure of success of prediction when the classes are very imbalanced. In information retrieval, precision is a measure of the fraction of relevant items among actually returned items while recall is a measure of the fraction of items that were returned among all items that should have been returned. ‘Relevancy’ here refers to items that are positively labeled, i.e., true positives and false negatives.

Precision (\(P\)) is defined as the number of true positives (\(T_p\)) over the number of true positives plus the number of false positives (\(F_p\)).

\[P = \frac{T_p}{T_p+F_p}\]

Recall (\(R\)) is defined as the number of true positives (\(T_p\)) over the number of true positives plus the number of false negatives (\(F_n\)).

\[R = \frac{T_p}{T_p + F_n}\]
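As a quick worked example of the two definitions above, the following minimal sketch computes precision and recall from raw counts. The counts are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here for illustration only.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true positive, false positive
    and false negative counts."""
    # Convention assumed for this sketch: an empty denominator yields 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
p, r = precision_recall(tp=8, fp=2, fn=4)
print(f"precision={p:.3f} recall={r:.3f}")
```

Here precision is 8 / (8 + 2) and recall is 8 / (8 + 4), so the same number of true positives gives different values depending on which kind of error is counted in the denominator.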

The precision-recall curve shows the tradeoff between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision. High precision is achieved by having few false positives in the returned results, and high recall is achieved by having few false negatives in the relevant results. High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all relevant results (high recall).

A system with high recall but low precision returns most of the relevant items, but the proportion of returned results that are incorrectly labeled is high. A system with high precision but low recall is just the opposite, returning very few of the relevant items, but most of its predicted labels are correct when compared to the actual labels. An ideal system with high precision and high recall will return most of the relevant items, with most results labeled correctly.

The definition of precision (\(\frac{T_p}{T_p + F_p}\)) shows that lowering the threshold of a classifier may increase the denominator, by increasing the number of results returned. If the threshold was previously set too high, the new results may all be true positives, which will increase precision. If the previous threshold was about right or too low, further lowering the threshold will introduce false positives, decreasing precision.

Recall is defined as \(\frac{T_p}{T_p+F_n}\), where \(T_p+F_n\) does not depend on the classifier threshold. Changing the classifier threshold can only change the numerator, \(T_p\). Lowering the classifier threshold may increase recall, by increasing the number of true positive results. It is also possible that lowering the threshold may leave recall unchanged, while the precision fluctuates. Thus, precision does not necessarily decrease with recall.
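The two paragraphs above can be checked on a small toy example. The sketch below uses hypothetical labels and scores (not the iris data used later) and lowers the threshold one score at a time; `operating_point` is a helper name introduced only for this illustration.

```python
# Toy data (hypothetical): true labels and classifier scores, sorted by score.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]


def operating_point(threshold):
    """Return (precision, recall) when predicting positive for score >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    n_positive = sum(y_true)  # T_p + F_n, which does not depend on the threshold
    return tp / (tp + fp), tp / n_positive


# Lower the threshold step by step, from the highest score to the lowest.
thresholds = sorted(scores, reverse=True)
points = [operating_point(t) for t in thresholds]
for t, (p, r) in zip(thresholds, points):
    print(f"threshold={t:.1f}  precision={p:.3f}  recall={r:.3f}")

precisions = [p for p, _ in points]
recalls = [r for _, r in points]
```

In this toy run recall never decreases as the threshold is lowered, while precision goes up and down depending on whether each newly returned item is a true or a false positive.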

The relationship between recall and precision can be observed in the stairstep area of the plot: at the edges of these steps, a small change in the threshold considerably reduces precision, with only a minor gain in recall.

Average precision (AP) summarizes such a plot as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight:

\[\text{AP} = \sum_n (R_n - R_{n-1}) P_n\]

where \(P_n\) and \(R_n\) are the precision and recall at the nth threshold. A pair \((R_k, P_k)\) is referred to as an operating point.

AP and the trapezoidal area under the operating points (sklearn.metrics.auc) are common ways to summarize a precision-recall curve that lead to different results. Read more in the User Guide.
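To see why the two summaries differ, here is a minimal sketch that applies both to the same handful of operating points. The points are hypothetical, the function names are introduced only for this illustration, and the starting point at recall 0 and precision 1 used for the trapezoidal area is an assumption of the sketch. It is not the scikit-learn implementation.

```python
# Hypothetical operating points (recall, precision), ordered by decreasing threshold.
points = [(0.25, 1.0), (0.5, 1.0), (0.5, 2 / 3), (0.75, 0.75), (1.0, 4 / 7), (1.0, 0.5)]


def average_precision(points):
    """Step-wise sum: each precision is weighted by the gain in recall."""
    ap, prev_recall = 0.0, 0.0  # recall is taken to be 0 before the first threshold
    for recall, precision in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def trapezoidal_area(points):
    """Area under the piecewise-linear curve through the operating points."""
    # Assumption of this sketch: the curve starts at recall 0, precision 1.
    curve = [(0.0, 1.0)] + list(points)
    area = 0.0
    for (r0, p0), (r1, p1) in zip(curve, curve[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area


ap = average_precision(points)
area = trapezoidal_area(points)
print(f"AP={ap:.4f}  trapezoidal area={area:.4f}")
```

The step-wise sum uses only the precision reached at the end of each recall increase, whereas the trapezoidal rule interpolates linearly between neighbouring points, so the two numbers generally do not agree.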

Precision-recall curves are typically used in binary classification to study the output of a classifier. In order to extend the precision-recall curve and average precision to multi-class or multi-label classification, it is necessary to binarize the output. One curve can be drawn per label, but one can also draw a precision-recall curve by considering each element of the label indicator matrix as a binary prediction (micro-averaging).
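Micro-averaging can be sketched without any library: every cell of the label indicator matrix is treated as one binary prediction. The matrix, the scores and the 0.5 threshold below are hypothetical values chosen for illustration.

```python
# Hypothetical label indicator matrix (rows: samples, columns: labels)
# and matching score matrix.
Y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
Y_score = [[0.8, 0.1, 0.3], [0.4, 0.6, 0.2], [0.2, 0.7, 0.5], [0.1, 0.3, 0.9]]

# Flatten both matrices so that each (sample, label) cell is one binary decision.
flat_true = [t for row in Y_true for t in row]
flat_score = [s for row in Y_score for s in row]

threshold = 0.5  # arbitrary threshold for this sketch
tp = sum(1 for t, s in zip(flat_true, flat_score) if t == 1 and s >= threshold)
fp = sum(1 for t, s in zip(flat_true, flat_score) if t == 0 and s >= threshold)
fn = sum(1 for t, s in zip(flat_true, flat_score) if t == 1 and s < threshold)

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
print(f"micro precision={micro_precision:.2f}  micro recall={micro_recall:.2f}")
```

Repeating this for every threshold would trace out the micro-averaged curve, which pools all labels rather than drawing one curve per label.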

Note

See also: sklearn.metrics.average_precision_score, sklearn.metrics.recall_score, sklearn.metrics.precision_score, sklearn.metrics.f1_score

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

In binary classification settings#

Dataset and model#

We will use a Linear SVC classifier to differentiate two types of irises.

import numpy as np

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Add noisy features
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.concatenate([X, random_state.randn(n_samples, 200 * n_features)], axis=1)

# Limit to the two first classes, and split into training and test
X_train, X_test, y_train, y_test = train_test_split(
    X[y < 2], y[y < 2], test_size=0.5, random_state=random_state
)

Linear SVC will expect each feature to have a similar range of values. Thus, we will first scale the data using a StandardScaler.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

classifier = make_pipeline(StandardScaler(), LinearSVC(random_state=random_state))
classifier.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('linearsvc',
                 LinearSVC(random_state=RandomState(MT19937) at 0x78414C6BB440))])
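
To make the scaling step concrete, here is a minimal sketch of standardizing a single feature column by centering it and dividing by its standard deviation. The column values are hypothetical, and the sketch uses only the standard library rather than StandardScaler itself.

```python
from statistics import fmean, pstdev

# Hypothetical values of one feature column.
column = [5.1, 4.9, 6.3, 5.8]

mu = fmean(column)      # mean of the feature
sigma = pstdev(column)  # population standard deviation of the feature

# Center, then scale to unit variance.
scaled = [(value - mu) / sigma for value in column]
print([round(value, 3) for value in scaled])
```

After this transformation the column has zero mean and unit variance, so features measured on very different scales contribute comparably to a linear model.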


Plot the Precision-Recall curve#

To plot the precision-recall curve, you should use PrecisionRecallDisplay. There are three methods available:

- for plotting a single curve:
  - from_estimator, when you have not yet computed the predictions
  - from_predictions, when you already have the predictions
- for plotting multiple curves using cross-validation results: from_cv_results

Let’s first plot the precision-recall curve without computing the classifier predictions ourselves. We use from_estimator, which computes the predictions for us before plotting the curve.

from sklearn.metrics import PrecisionRecallDisplay

display = PrecisionRecallDisplay.from_estimator(
    classifier, X_test, y_test, name="LinearSVC", plot_chance_level=True, despine=True
)
_ = display.ax_.set_title("2-class Precision-Recall curve")

If we have already computed the estimated probabilities or scores for our model, then we can use from_predictions.

y_score = classifier.decision_function(X_test)

display = PrecisionRecallDisplay.from_predictions(
    y_test, y_score, name="LinearSVC", plot_chance_level=True, despine=True
)
_ = display.ax_.set_title("2-class Precision-Recall curve")

The from_cv_results method takes the cross-validation results from cross_validate and plots a precision-recall curve for each fold.

from sklearn.model_selection import cross_validate

classifier = make_pipeline(StandardScaler(), LinearSVC(random_state=random_state))
cv_results = cross_validate(
    classifier, X_train, y_train, return_estimator=True, return_indices=True
)
display = PrecisionRecallDisplay.from_cv_results(cv_results, X_train, y_train)
_ = display.ax_.set_title("Cross-validation Precision-Recall curves")


In multi-label settings#

The precision-recall curve does not support the multilabel setting. However, one can decide how to handle this case. We show such an example below.

Create multi-label data, fit, and predict#

We create a multi-label dataset to illustrate the precision-recall curve in multi-label settings.

from sklearn.preprocessing import label_binarize

# Binarize the labels with label_binarize to mimic a multi-label setting
Y = label_binarize(y, classes=[0, 1, 2])
n_classes = Y.shape[1]

# Split into training and test
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.5, random_state=random_state
)

We use OneVsRestClassifier for multi-label prediction.

from sklearn.multiclass import OneVsRestClassifier

classifier = OneVsRestClassifier(
    make_pipeline(StandardScaler(), LinearSVC(random_state=random_state))
)
classifier.fit(X_train, Y_train)
y_score = classifier.decision_function(X_test)

The average precision score in multi-label settings#

from sklearn.metrics import average_precision_score, precision_recall_curve

# For each class
precision = dict()
recall = dict()
average_precision = dict()
for i in range(n_classes):
    precision[i], recall[i], _ = precision_recall_curve(Y_test[:, i], y_score[:, i])
    average_precision[i] = average_precision_score(Y_test[:, i], y_score[:, i])

# A "micro-average": quantifying score on all classes jointly
precision["micro"], recall["micro"], _ = precision_recall_curve(
    Y_test.ravel(), y_score.ravel()
)
average_precision["micro"] = average_precision_score(Y_test, y_score, average="micro")

Plot the micro-averaged Precision-Recall curve#

from collections import Counter

display = PrecisionRecallDisplay(
    recall=recall["micro"],
    precision=precision["micro"],
    average_precision=average_precision["micro"],
    prevalence_pos_label=Counter(Y_test.ravel())[1] / Y_test.size,
)
display.plot(plot_chance_level=True, despine=True)
_ = display.ax_.set_title("Micro-averaged over all classes")
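
The `prevalence_pos_label` value passed above is the fraction of positive entries in the flattened label matrix, which is what the chance level line is drawn from. A minimal sketch of that computation on a hypothetical one-hot matrix (not the iris labels):

```python
from collections import Counter

# Hypothetical label indicator matrix: one positive entry per row, three labels.
Y = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]

# Flatten the matrix and count how often the positive label (1) occurs.
flat = [label for row in Y for label in row]
prevalence = Counter(flat)[1] / len(flat)
print(f"prevalence of the positive label: {prevalence:.3f}")
```

With one positive entry per row and three labels, one cell in three is positive, so a classifier that scores cells at random would be expected to reach a precision of roughly that value.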

Plot Precision-Recall curve for each class and iso-f1 curves#

from itertools import cycle

import matplotlib.pyplot as plt

# setup plot details
colors = cycle(["navy", "turquoise", "darkorange", "cornflowerblue", "teal"])

_, ax = plt.subplots(figsize=(7, 8))

f_scores = np.linspace(0.2, 0.8, num=4)
lines, labels = [], []
for f_score in f_scores:
    x = np.linspace(0.01, 1)
    y = f_score * x / (2 * x - f_score)
    (l,) = plt.plot(x[y >= 0], y[y >= 0], color="gray", alpha=0.2)
    plt.annotate("f1={0:0.1f}".format(f_score), xy=(0.9, y[45] + 0.02))

display = PrecisionRecallDisplay(
    recall=recall["micro"],
    precision=precision["micro"],
    average_precision=average_precision["micro"],
)
display.plot(
    ax=ax, name="Micro-average precision-recall", curve_kwargs={"color": "gold"}
)

for i, color in zip(range(n_classes), colors):
    display = PrecisionRecallDisplay(
        recall=recall[i],
        precision=precision[i],
        average_precision=average_precision[i],
    )
    display.plot(
        ax=ax,
        name=f"Precision-recall for class {i}",
        curve_kwargs={"color": color},
        despine=True,
    )

# add the legend for the iso-f1 curves
handles, labels = display.ax_.get_legend_handles_labels()
handles.extend([l])
labels.extend(["iso-f1 curves"])
# set the legend and the axes
ax.legend(handles=handles, labels=labels, loc="best")
ax.set_title("Extension of Precision-Recall curve to multi-class")

plt.show()
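
The gray iso-f1 curves in the code above follow from solving the F1 definition, F1 = 2PR / (P + R), for precision at a fixed F1 value, which gives P = F1 * R / (2R - F1). The sketch below checks this numerically; the helper names and sample recall values are chosen only for illustration.

```python
def iso_f1_precision(recall, f_score):
    """Precision that yields the given F1 score at the given recall."""
    return f_score * recall / (2 * recall - f_score)


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Recalls chosen large enough that the resulting precision stays within (0, 1].
checks = []
for f_score in (0.2, 0.4, 0.6, 0.8):
    for recall in (0.7, 0.9, 1.0):
        precision = iso_f1_precision(recall, f_score)
        checks.append((f_score, recall, precision, f1(precision, recall)))

for f_score, recall, precision, recovered in checks:
    print(f"f1={f_score:.1f} recall={recall:.1f} precision={precision:.3f} -> {recovered:.3f}")
```

Every point on one of these curves has the same F1 score, which is why the curves are useful as reference lines when comparing operating points.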