.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/miscellaneous/plot_outlier_detection_bench.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_miscellaneous_plot_outlier_detection_bench.py>`
        to download the full example code or to run this example in your browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_miscellaneous_plot_outlier_detection_bench.py:


==========================================
Evaluation of outlier detection estimators
==========================================

This example benchmarks outlier detection algorithms,
:ref:`local_outlier_factor` (LOF) and :ref:`isolation_forest` (IForest),
using ROC curves on classical anomaly detection datasets. The performance of
the algorithms is assessed in an outlier detection context:

1. The algorithms are trained on the whole dataset, which is assumed to
   contain outliers.

2. The ROC curve from :class:`~sklearn.metrics.RocCurveDisplay` is computed
   on the same dataset using the knowledge of the labels.

.. GENERATED FROM PYTHON SOURCE LINES 18-24

.. code-block:: default


    # Author: Pharuj Rajborirug
    # License: BSD 3 clause

    print(__doc__)

.. GENERATED FROM PYTHON SOURCE LINES 25-33

Define a data preprocessing function
------------------------------------

The example uses real-world datasets available in :mod:`sklearn.datasets`, and
the sample size of some datasets is reduced to speed up computation. After the
data preprocessing, the datasets' targets will have two classes, 0
representing inliers and 1 representing outliers. The `preprocess_dataset`
function returns data and target.

.. GENERATED FROM PYTHON SOURCE LINES 33-111

.. code-block:: default


    import numpy as np
    from sklearn.datasets import fetch_kddcup99, fetch_covtype, fetch_openml
    from sklearn.preprocessing import LabelBinarizer
    import pandas as pd

    rng = np.random.RandomState(42)


    def preprocess_dataset(dataset_name):
        # loading and vectorization
        print(f"Loading {dataset_name} data")
        if dataset_name in ["http", "smtp", "SA", "SF"]:
            dataset = fetch_kddcup99(subset=dataset_name, percent10=True, random_state=rng)
            X = dataset.data
            y = dataset.target
            lb = LabelBinarizer()

            if dataset_name == "SF":
                idx = rng.choice(X.shape[0], int(X.shape[0] * 0.1), replace=False)
                X = X[idx]  # reduce the sample size
                y = y[idx]
                x1 = lb.fit_transform(X[:, 1].astype(str))
                X = np.c_[X[:, :1], x1, X[:, 2:]]
            elif dataset_name == "SA":
                idx = rng.choice(X.shape[0], int(X.shape[0] * 0.1), replace=False)
                X = X[idx]  # reduce the sample size
                y = y[idx]
                x1 = lb.fit_transform(X[:, 1].astype(str))
                x2 = lb.fit_transform(X[:, 2].astype(str))
                x3 = lb.fit_transform(X[:, 3].astype(str))
                X = np.c_[X[:, :1], x1, x2, x3, X[:, 4:]]
            y = (y != b"normal.").astype(int)
        if dataset_name == "forestcover":
            dataset = fetch_covtype()
            X = dataset.data
            y = dataset.target
            idx = rng.choice(X.shape[0], int(X.shape[0] * 0.1), replace=False)
            X = X[idx]  # reduce the sample size
            y = y[idx]

            # inliers are those with attribute 2
            # outliers are those with attribute 4
            s = (y == 2) + (y == 4)
            X = X[s, :]
            y = y[s]
            y = (y != 2).astype(int)
        if dataset_name in ["glass", "wdbc", "cardiotocography"]:
            dataset = fetch_openml(
                name=dataset_name, version=1, as_frame=False, parser="pandas"
            )
            X = dataset.data
            y = dataset.target

            if dataset_name == "glass":
                s = y == "tableware"
                y = s.astype(int)

            if dataset_name == "wdbc":
                s = y == "2"
                y = s.astype(int)
                X_mal, y_mal = X[s], y[s]
                X_ben, y_ben = X[~s], y[~s]

                # downsampled to 39 points (9.8% outliers)
                idx = rng.choice(y_mal.shape[0], 39, replace=False)
                X_mal2 = X_mal[idx]
                y_mal2 = y_mal[idx]
                X = np.concatenate((X_ben, X_mal2), axis=0)
                y = np.concatenate((y_ben, y_mal2), axis=0)

            if dataset_name == "cardiotocography":
                s = y == "3"
                y = s.astype(int)
        # 0 represents inliers, and 1 represents outliers
        y = pd.Series(y, dtype="category")
        return (X, y)
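
As a quick sanity check of the preprocessing step, the class balance produced
by `preprocess_dataset` can be inspected directly. The snippet below is a
small illustrative sketch and is not part of the generated example; it assumes
the function defined above is in scope and uses the "glass" dataset because it
is one of the smaller datasets in the benchmark.

.. code-block:: default


    # Illustrative sketch (not part of the original benchmark): report the
    # sample size and the outlier fraction of one preprocessed dataset.
    X_glass, y_glass = preprocess_dataset("glass")
    outlier_fraction = (y_glass == 1).mean()
    print(f"glass: {X_glass.shape[0]} samples, {outlier_fraction:.1%} outliers")
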
.. GENERATED FROM PYTHON SOURCE LINES 112-119

Define an outlier prediction function
-------------------------------------

There is no particular reason to choose the
:class:`~sklearn.neighbors.LocalOutlierFactor` and
:class:`~sklearn.ensemble.IsolationForest` algorithms. The goal is to show
that different algorithms perform well on different datasets. The following
`compute_prediction` function returns the outlier score of each sample of X.

.. GENERATED FROM PYTHON SOURCE LINES 119-138

.. code-block:: default


    from sklearn.neighbors import LocalOutlierFactor
    from sklearn.ensemble import IsolationForest


    def compute_prediction(X, model_name):
        print(f"Computing {model_name} prediction...")

        if model_name == "LOF":
            clf = LocalOutlierFactor(n_neighbors=20, contamination="auto")
            clf.fit(X)
            y_pred = clf.negative_outlier_factor_

        if model_name == "IForest":
            clf = IsolationForest(random_state=rng, contamination="auto")
            y_pred = clf.fit(X).decision_function(X)
        return y_pred

.. GENERATED FROM PYTHON SOURCE LINES 139-147

Plot and interpret results
--------------------------

The algorithm performance relates to how good the true positive rate (TPR) is
at a low value of the false positive rate (FPR). The best algorithms have the
curve on the top-left of the plot and an area under the curve (AUC) close to
1. The diagonal dashed line represents a random classification of outliers
and inliers.

.. GENERATED FROM PYTHON SOURCE LINES 147-196

.. code-block:: default


    import math

    import matplotlib.pyplot as plt

    from sklearn.metrics import RocCurveDisplay

    datasets_name = [
        "http",
        "smtp",
        "SA",
        "SF",
        "forestcover",
        "glass",
        "wdbc",
        "cardiotocography",
    ]

    models_name = [
        "LOF",
        "IForest",
    ]

    # plotting parameters
    cols = 2
    linewidth = 1
    pos_label = 0  # means 0 belongs to the positive class
    rows = math.ceil(len(datasets_name) / cols)

    fig, axs = plt.subplots(rows, cols, figsize=(10, rows * 3))

    for i, dataset_name in enumerate(datasets_name):
        (X, y) = preprocess_dataset(dataset_name=dataset_name)

        for model_name in models_name:
            y_pred = compute_prediction(X, model_name=model_name)
            display = RocCurveDisplay.from_predictions(
                y,
                y_pred,
                pos_label=pos_label,
                name=model_name,
                linewidth=linewidth,
                ax=axs[i // cols, i % cols],
            )
        axs[i // cols, i % cols].plot([0, 1], [0, 1], linewidth=linewidth, linestyle=":")
        axs[i // cols, i % cols].set_title(dataset_name)
        axs[i // cols, i % cols].set_xlabel("False Positive Rate")
        axs[i // cols, i % cols].set_ylabel("True Positive Rate")
    plt.tight_layout(pad=2.0)  # spacing between subplots
    plt.show()


.. image-sg:: /auto_examples/miscellaneous/images/sphx_glr_plot_outlier_detection_bench_001.png
   :alt: http, smtp, SA, SF, forestcover, glass, wdbc, cardiotocography
   :srcset: /auto_examples/miscellaneous/images/sphx_glr_plot_outlier_detection_bench_001.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Loading http data
    Computing LOF prediction...
    Computing IForest prediction...
    Loading smtp data
    Computing LOF prediction...
    Computing IForest prediction...
    Loading SA data
    Computing LOF prediction...
    Computing IForest prediction...
    Loading SF data
    Computing LOF prediction...
    Computing IForest prediction...
    Loading forestcover data
    Computing LOF prediction...
    Computing IForest prediction...
    Loading glass data
    Computing LOF prediction...
    Computing IForest prediction...
    Loading wdbc data
    Computing LOF prediction...
    Computing IForest prediction...
    Loading cardiotocography data
    Computing LOF prediction...
    Computing IForest prediction...
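
Beyond the visual comparison, the ROC curves can also be summarized
numerically. The following sketch is not part of the generated example; it
assumes the `preprocess_dataset`, `compute_prediction`, and `models_name`
objects defined above are in scope and recomputes the area under the curve for
the "wdbc" dataset with :func:`~sklearn.metrics.roc_auc_score`. Because the
plots use `pos_label=0`, inliers are treated as the positive class, and both
models assign higher scores to more "normal" samples.

.. code-block:: default


    from sklearn.metrics import roc_auc_score

    # Illustrative sketch (not part of the original benchmark): numeric AUC
    # for a single dataset, treating inliers (label 0) as the positive class
    # to match pos_label=0 used in the ROC plots above.
    X_wdbc, y_wdbc = preprocess_dataset("wdbc")
    y_true = (y_wdbc == 0).astype(int)  # 1 for inliers, 0 for outliers
    for model_name in models_name:
        scores = compute_prediction(X_wdbc, model_name=model_name)
        print(f"wdbc {model_name}: ROC AUC = {roc_auc_score(y_true, scores):.3f}")
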

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 46.233 seconds)


.. _sphx_glr_download_auto_examples_miscellaneous_plot_outlier_detection_bench.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/1.2.X?urlpath=lab/tree/notebooks/auto_examples/miscellaneous/plot_outlier_detection_bench.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_outlier_detection_bench.py <plot_outlier_detection_bench.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_outlier_detection_bench.ipynb <plot_outlier_detection_bench.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_