{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Multiclass Receiver Operating Characteristic (ROC)\n\nThis example describes the use of the Receiver Operating Characteristic (ROC)\nmetric to evaluate the quality of multiclass classifiers.\n\nROC curves typically feature true positive rate (TPR) on the Y axis, and false\npositive rate (FPR) on the X axis. This means that the top left corner of the\nplot is the \"ideal\" point - a FPR of zero, and a TPR of one. This is not very\nrealistic, but it does mean that a larger area under the curve (AUC) is usually\nbetter. The \"steepness\" of ROC curves is also important, since it is ideal to\nmaximize the TPR while minimizing the FPR.\n\nROC curves are typically used in binary classification, where the TPR and FPR\ncan be defined unambiguously. In the case of multiclass classification, a notion\nof TPR or FPR is obtained only after binarizing the output. This can be done in\n2 different ways:\n\n- the One-vs-Rest scheme compares each class against all the others (assumed as\n one);\n- the One-vs-One scheme compares every unique pairwise combination of classes.\n\nIn this example we explore both schemes and demo the concepts of micro and macro\naveraging as different ways of summarizing the information of the multiclass ROC\ncurves.\n\n

#### Note

See sphx_glr_auto_examples_model_selection_plot_roc_crossval.py for\n an extension of the present example estimating the variance of the ROC\n curves and their respective AUC.

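As a quick, concrete illustration of the two schemes, here is a minimal sketch on a toy 3-class target (the array `y_toy` below is purely illustrative and is not used elsewhere in this example):

```python
import numpy as np
from itertools import combinations

# toy target, for illustration only
y_toy = np.array(["setosa", "versicolor", "virginica", "setosa"])
classes = np.unique(y_toy)

# One-vs-Rest: one binary indicator column per class, shape (n_samples, n_classes)
ovr_indicator = np.stack([(y_toy == c).astype(int) for c in classes], axis=1)
print(ovr_indicator)

# One-vs-One: every unique pairwise combination of classes
print(list(combinations(classes, 2)))
```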
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load and prepare data\n\nWe import the iris_dataset which contains 3 classes, each one\ncorresponding to a type of iris plant. One class is linearly separable from\nthe other 2; the latter are **not** linearly separable from each other.\n\nHere we binarize the output and add noisy features to make the problem harder.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\nfrom sklearn.datasets import load_iris\nfrom sklearn.model_selection import train_test_split\n\niris = load_iris()\ntarget_names = iris.target_names\nX, y = iris.data, iris.target\ny = iris.target_names[y]\n\nrandom_state = np.random.RandomState(0)\nn_samples, n_features = X.shape\nn_classes = len(np.unique(y))\nX = np.concatenate([X, random_state.randn(n_samples, 200 * n_features)], axis=1)\n(\n X_train,\n X_test,\n y_train,\n y_test,\n) = train_test_split(X, y, test_size=0.5, stratify=y, random_state=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We train a :class:~sklearn.linear_model.LogisticRegression model which can\nnaturally handle multiclass problems, thanks to the use of the multinomial\nformulation.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n\nclassifier = LogisticRegression()\ny_score = classifier.fit(X_train, y_train).predict_proba(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## One-vs-Rest multiclass ROC\n\nThe One-vs-the-Rest (OvR) multiclass strategy, also known as one-vs-all,\nconsists in computing a ROC curve per each of the n_classes. In each step, a\ngiven class is regarded as the positive class and the remaining classes are\nregarded as the negative class as a bulk.\n\n

#### Note

One should not confuse the OvR strategy used for the **evaluation**\n of multiclass classifiers with the OvR strategy used to **train** a\n multiclass classifier by fitting a set of binary classifiers (for instance\n via the :class:~sklearn.multiclass.OneVsRestClassifier meta-estimator).\n The OvR ROC evaluation can be used to scrutinize any kind of classification\n model, irrespective of how it was trained (see the multiclass section of\n the User Guide).

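For instance, the same OvR ROC evaluation applies to a model that was explicitly trained in an OvR fashion. A minimal sketch, assuming the `X_train`, `X_test`, `y_train` and `y_test` splits defined above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer

# Train with the OvR meta-estimator, then evaluate with an OvR ROC AUC;
# the evaluation step is the same regardless of how the model was trained.
ovr_clf = OneVsRestClassifier(LogisticRegression()).fit(X_train, y_train)
ovr_proba = ovr_clf.predict_proba(X_test)  # shape (n_samples, n_classes)
y_bin = LabelBinarizer().fit(y_train).transform(y_test)
print(f"Macro-averaged OvR ROC AUC: {roc_auc_score(y_bin, ovr_proba, average='macro'):.2f}")
```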
\n\nIn this section we use a :class:~sklearn.preprocessing.LabelBinarizer to\nbinarize the target by one-hot-encoding in a OvR fashion. This means that the\ntarget of shape (n_samples,) is mapped to a target of shape (n_samples,\nn_classes).\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.preprocessing import LabelBinarizer\n\nlabel_binarizer = LabelBinarizer().fit(y_train)\ny_onehot_test = label_binarizer.transform(y_test)\ny_onehot_test.shape # (n_samples, n_classes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can as well easily check the encoding of a specific class:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "label_binarizer.transform([\"virginica\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ROC curve showing a specific class\n\nIn the following plot we show the resulting ROC curve when regarding the iris\nflowers as either \"virginica\" (class_id=2) or \"non-virginica\" (the rest).\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class_of_interest = \"virginica\"\nclass_id = np.flatnonzero(label_binarizer.classes_ == class_of_interest)\nclass_id" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\nfrom sklearn.metrics import RocCurveDisplay\n\nRocCurveDisplay.from_predictions(\n y_onehot_test[:, class_id],\n y_score[:, class_id],\n name=f\"{class_of_interest} vs the rest\",\n color=\"darkorange\",\n plot_chance_level=True,\n)\nplt.axis(\"square\")\nplt.xlabel(\"False Positive Rate\")\nplt.ylabel(\"True Positive Rate\")\nplt.title(\"One-vs-Rest ROC curves:\\nVirginica vs (Setosa & Versicolor)\")\nplt.legend()\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ROC curve using micro-averaged OvR\n\nMicro-averaging aggregates the contributions from all the classes (using\n:func:np.ravel) to compute the average metrics as follows:\n\n$TPR=\\frac{\\sum_{c}TP_c}{\\sum_{c}(TP_c + FN_c)}$ ;\n\n$FPR=\\frac{\\sum_{c}FP_c}{\\sum_{c}(FP_c + TN_c)}$ .\n\nWe can briefly demo the effect of :func:np.ravel:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(f\"y_score:\\n{y_score[0:2,:]}\")\nprint()\nprint(f\"y_score.ravel():\\n{y_score[0:2,:].ravel()}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a multi-class classification setup with highly imbalanced classes,\nmicro-averaging is preferable over macro-averaging. 
In such cases, one can\nalternatively use a weighted macro-averaging, not demoed here.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "RocCurveDisplay.from_predictions(\n y_onehot_test.ravel(),\n y_score.ravel(),\n name=\"micro-average OvR\",\n color=\"darkorange\",\n plot_chance_level=True,\n)\nplt.axis(\"square\")\nplt.xlabel(\"False Positive Rate\")\nplt.ylabel(\"True Positive Rate\")\nplt.title(\"Micro-averaged One-vs-Rest\\nReceiver Operating Characteristic\")\nplt.legend()\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the case where the main interest is not the plot but the ROC-AUC score\nitself, we can reproduce the value shown in the plot using\n:class:~sklearn.metrics.roc_auc_score.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.metrics import roc_auc_score\n\nmicro_roc_auc_ovr = roc_auc_score(\n y_test,\n y_score,\n multi_class=\"ovr\",\n average=\"micro\",\n)\n\nprint(f\"Micro-averaged One-vs-Rest ROC AUC score:\\n{micro_roc_auc_ovr:.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is equivalent to computing the ROC curve with\n:class:~sklearn.metrics.roc_curve and then the area under the curve with\n:class:~sklearn.metrics.auc for the raveled true and predicted classes.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.metrics import roc_curve, auc\n\n# store the fpr, tpr, and roc_auc for all averaging strategies\nfpr, tpr, roc_auc = dict(), dict(), dict()\n# Compute micro-average ROC curve and ROC area\nfpr[\"micro\"], tpr[\"micro\"], _ = roc_curve(y_onehot_test.ravel(), y_score.ravel())\nroc_auc[\"micro\"] = auc(fpr[\"micro\"], tpr[\"micro\"])\n\nprint(f\"Micro-averaged One-vs-Rest ROC AUC score:\\n{roc_auc['micro']:.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

#### Note

By default, the computation of the ROC curve adds a single point at\n the maximal false positive rate by using linear interpolation and the\n McClish correction [Analyzing a portion of the ROC curve. Med Decis\n Making. 1989 Jul-Sep;9(3):190-5. doi:10.1177/0272989x8900900307].

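As a brief aside, the weighted macro-averaging mentioned earlier (which weights each class by its support, i.e. the number of true instances per class) is available directly from roc_auc_score. A minimal sketch, reusing `y_test` and `y_score` from above:

```python
from sklearn.metrics import roc_auc_score

weighted_roc_auc_ovr = roc_auc_score(
    y_test,
    y_score,
    multi_class="ovr",
    average="weighted",
)
print(f"Weighted-averaged One-vs-Rest ROC AUC score:\n{weighted_roc_auc_ovr:.2f}")
```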
\n\n### ROC curve using the OvR macro-average\n\nObtaining the macro-average requires computing the metric independently for\neach class and then taking the average over them, hence treating all classes\nequally a priori. We first aggregate the true/false positive rates per class:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for i in range(n_classes):\n fpr[i], tpr[i], _ = roc_curve(y_onehot_test[:, i], y_score[:, i])\n roc_auc[i] = auc(fpr[i], tpr[i])\n\nfpr_grid = np.linspace(0.0, 1.0, 1000)\n\n# Interpolate all ROC curves at these points\nmean_tpr = np.zeros_like(fpr_grid)\n\nfor i in range(n_classes):\n mean_tpr += np.interp(fpr_grid, fpr[i], tpr[i]) # linear interpolation\n\n# Average it and compute AUC\nmean_tpr /= n_classes\n\nfpr[\"macro\"] = fpr_grid\ntpr[\"macro\"] = mean_tpr\nroc_auc[\"macro\"] = auc(fpr[\"macro\"], tpr[\"macro\"])\n\nprint(f\"Macro-averaged One-vs-Rest ROC AUC score:\\n{roc_auc['macro']:.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This computation is equivalent to simply calling\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "macro_roc_auc_ovr = roc_auc_score(\n y_test,\n y_score,\n multi_class=\"ovr\",\n average=\"macro\",\n)\n\nprint(f\"Macro-averaged One-vs-Rest ROC AUC score:\\n{macro_roc_auc_ovr:.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plot all OvR ROC curves together\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from itertools import cycle\n\nfig, ax = plt.subplots(figsize=(6, 6))\n\nplt.plot(\n fpr[\"micro\"],\n tpr[\"micro\"],\n label=f\"micro-average ROC curve (AUC = {roc_auc['micro']:.2f})\",\n color=\"deeppink\",\n linestyle=\":\",\n linewidth=4,\n)\n\nplt.plot(\n fpr[\"macro\"],\n tpr[\"macro\"],\n label=f\"macro-average ROC curve (AUC = {roc_auc['macro']:.2f})\",\n color=\"navy\",\n linestyle=\":\",\n linewidth=4,\n)\n\ncolors = cycle([\"aqua\", \"darkorange\", \"cornflowerblue\"])\nfor class_id, color in zip(range(n_classes), colors):\n RocCurveDisplay.from_predictions(\n y_onehot_test[:, class_id],\n y_score[:, class_id],\n name=f\"ROC curve for {target_names[class_id]}\",\n color=color,\n ax=ax,\n plot_chance_level=(class_id == 2),\n )\n\nplt.axis(\"square\")\nplt.xlabel(\"False Positive Rate\")\nplt.ylabel(\"True Positive Rate\")\nplt.title(\"Extension of Receiver Operating Characteristic\\nto One-vs-Rest multiclass\")\nplt.legend()\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## One-vs-One multiclass ROC\n\nThe One-vs-One (OvO) multiclass strategy consists in fitting one classifier\nper class pair. Since it requires to train n_classes * (n_classes - 1) / 2\nclassifiers, this method is usually slower than One-vs-Rest due to its\nO(n_classes ^2) complexity.\n\nIn this section, we demonstrate the macro-averaged AUC using the OvO scheme\nfor the 3 possible combinations in the iris_dataset: \"setosa\" vs\n\"versicolor\", \"versicolor\" vs \"virginica\" and \"virginica\" vs \"setosa\". Notice\nthat micro-averaging is not defined for the OvO scheme.\n\n### ROC curve using the OvO macro-average\n\nIn the OvO scheme, the first step is to identify all possible unique\ncombinations of pairs. 
The computation of scores is done by treating one of\nthe elements in a given pair as the positive class and the other element as\nthe negative class, then re-computing the score by inversing the roles and\ntaking the mean of both scores.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from itertools import combinations\n\npair_list = list(combinations(np.unique(y), 2))\nprint(pair_list)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pair_scores = []\nmean_tpr = dict()\n\nfor ix, (label_a, label_b) in enumerate(pair_list):\n a_mask = y_test == label_a\n b_mask = y_test == label_b\n ab_mask = np.logical_or(a_mask, b_mask)\n\n a_true = a_mask[ab_mask]\n b_true = b_mask[ab_mask]\n\n idx_a = np.flatnonzero(label_binarizer.classes_ == label_a)\n idx_b = np.flatnonzero(label_binarizer.classes_ == label_b)\n\n fpr_a, tpr_a, _ = roc_curve(a_true, y_score[ab_mask, idx_a])\n fpr_b, tpr_b, _ = roc_curve(b_true, y_score[ab_mask, idx_b])\n\n mean_tpr[ix] = np.zeros_like(fpr_grid)\n mean_tpr[ix] += np.interp(fpr_grid, fpr_a, tpr_a)\n mean_tpr[ix] += np.interp(fpr_grid, fpr_b, tpr_b)\n mean_tpr[ix] /= 2\n mean_score = auc(fpr_grid, mean_tpr[ix])\n pair_scores.append(mean_score)\n\n fig, ax = plt.subplots(figsize=(6, 6))\n plt.plot(\n fpr_grid,\n mean_tpr[ix],\n label=f\"Mean {label_a} vs {label_b} (AUC = {mean_score :.2f})\",\n linestyle=\":\",\n linewidth=4,\n )\n RocCurveDisplay.from_predictions(\n a_true,\n y_score[ab_mask, idx_a],\n ax=ax,\n name=f\"{label_a} as positive class\",\n )\n RocCurveDisplay.from_predictions(\n b_true,\n y_score[ab_mask, idx_b],\n ax=ax,\n name=f\"{label_b} as positive class\",\n plot_chance_level=True,\n )\n plt.axis(\"square\")\n plt.xlabel(\"False Positive Rate\")\n plt.ylabel(\"True Positive Rate\")\n plt.title(f\"{target_names[idx_a]} vs {label_b} ROC curves\")\n plt.legend()\n plt.show()\n\nprint(f\"Macro-averaged One-vs-One ROC AUC score:\\n{np.average(pair_scores):.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One can also assert that the macro-average we computed \"by hand\" is equivalent\nto the implemented average=\"macro\" option of the\n:class:~sklearn.metrics.roc_auc_score function.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "macro_roc_auc_ovo = roc_auc_score(\n y_test,\n y_score,\n multi_class=\"ovo\",\n average=\"macro\",\n)\n\nprint(f\"Macro-averaged One-vs-One ROC AUC score:\\n{macro_roc_auc_ovo:.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plot all OvO ROC curves together\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ovo_tpr = np.zeros_like(fpr_grid)\n\nfig, ax = plt.subplots(figsize=(6, 6))\nfor ix, (label_a, label_b) in enumerate(pair_list):\n ovo_tpr += mean_tpr[ix]\n plt.plot(\n fpr_grid,\n mean_tpr[ix],\n label=f\"Mean {label_a} vs {label_b} (AUC = {pair_scores[ix]:.2f})\",\n )\n\novo_tpr /= sum(1 for pair in enumerate(pair_list))\n\nplt.plot(\n fpr_grid,\n ovo_tpr,\n label=f\"One-vs-One macro-average (AUC = {macro_roc_auc_ovo:.2f})\",\n linestyle=\":\",\n linewidth=4,\n)\nplt.plot([0, 1], [0, 1], \"k--\", label=\"Chance level (AUC = 0.5)\")\nplt.axis(\"square\")\nplt.xlabel(\"False Positive Rate\")\nplt.ylabel(\"True Positive Rate\")\nplt.title(\"Extension of Receiver Operating Characteristic\\nto 
One-vs-One multiclass\")\nplt.legend()\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We confirm that the classes \"versicolor\" and \"virginica\" are not well\nidentified by a linear classifier. Notice that the \"virginica\"-vs-the-rest\nROC-AUC score (0.77) is between the OvO ROC-AUC scores for \"versicolor\" vs\n\"virginica\" (0.64) and \"setosa\" vs \"virginica\" (0.90). Indeed, the OvO\nstrategy gives additional information on the confusion between a pair of\nclasses, at the expense of computational cost when the number of classes\nis large.\n\nThe OvO strategy is recommended if the user is mainly interested in correctly\nidentifying a particular class or subset of classes, whereas evaluating the\nglobal performance of a classifier can still be summarized via a given\naveraging strategy.\n\nMicro-averaged OvR ROC is dominated by the more frequent class, since the\ncounts are pooled. The macro-averaged alternative better reflects the\nstatistics of the less frequent classes, and then is more appropriate when\nperformance on all the classes is deemed equally important.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" } }, "nbformat": 4, "nbformat_minor": 0 }