Note

Go to the end to download the full example code or to run this example in your browser via JupyterLite or Binder.

IsolationForest example#

An example using IsolationForest for anomaly detection.

The Isolation Forest is an ensemble of “Isolation Trees” that “isolate” observations by recursive random partitioning, which can be represented by a tree structure. The number of splittings required to isolate a sample is lower for outliers and higher for inliers.

In the present example we demo two ways to visualize the decision boundary of an Isolation Forest trained on a toy dataset.

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

Data generation#

We generate two clusters (each one containing n_samples) by randomly sampling the standard normal distribution as returned by numpy.random.randn. One of them is spherical and the other one is slightly deformed.

For consistency with the IsolationForest notation, the inliers (i.e. the gaussian clusters) are assigned a ground truth label 1 whereas the outliers (created with numpy.random.uniform) are assigned the label -1.

import numpy as np

from sklearn.model_selection import train_test_split

n_samples, n_outliers = 120, 40
rng = np.random.RandomState(0)
covariance = np.array([[0.5, -0.1], [0.7, 0.4]])
cluster_1 = 0.4 * rng.randn(n_samples, 2) @ covariance + np.array([2, 2])  # general
cluster_2 = 0.3 * rng.randn(n_samples, 2) + np.array([-2, -2])  # spherical
outliers = rng.uniform(low=-4, high=4, size=(n_outliers, 2))

X = np.concatenate([cluster_1, cluster_2, outliers])
y = np.concatenate(
    [np.ones((2 * n_samples), dtype=int), -np.ones((n_outliers), dtype=int)]
)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

We can visualize the resulting clusters:

import matplotlib.pyplot as plt

scatter = plt.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor="k")
handles, labels = scatter.legend_elements()
plt.axis("square")
plt.legend(handles=handles, labels=["outliers", "inliers"], title="true class")
plt.title("Gaussian inliers with \nuniformly distributed outliers")
plt.show()

Gaussian inliers with uniformly distributed outliers

Training of the model#

from sklearn.ensemble import IsolationForest

clf = IsolationForest(max_samples=100, random_state=0)
clf.fit(X_train)

IsolationForest(max_samples=100, random_state=0)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

IsolationForest

?Documentation for IsolationForestiFitted

Parameters

	max_samples max_samples: "auto", int or float, default="auto" The number of samples to draw from X to train each base estimator. - If int, then draw `max_samples` samples. - If float, then draw `max_samples * X.shape[0]` samples. - If "auto", then `max_samples=min(256, n_samples)`. If max_samples is larger than the number of samples provided, all samples will be used for all trees (no sampling).	100
	random_state random_state: int, RandomState instance or None, default=None Controls the pseudo-randomness of the selection of the feature and split values for each branching step and each tree in the forest. Pass an int for reproducible results across multiple function calls. See :term:`Glossary <random_state>`.	0
	n_estimators n_estimators: int, default=100 The number of base estimators in the ensemble.	100
	contamination contamination: 'auto' or float, default='auto' The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the scores of the samples. - If 'auto', the threshold is determined as in the original paper. - If float, the contamination should be in the range (0, 0.5]. .. versionchanged:: 0.22 The default value of ``contamination`` changed from 0.1 to ``'auto'``.	'auto'
	max_features max_features: int or float, default=1.0 The number of features to draw from X to train each base estimator. - If int, then draw `max_features` features. - If float, then draw `max(1, int(max_features * n_features_in_))` features. Note: using a float number less than 1.0 or integer less than number of features will enable feature subsampling and leads to a longer runtime.	1.0
	bootstrap bootstrap: bool, default=False If True, individual trees are fit on random subsets of the training data sampled with replacement. If False, sampling without replacement is performed.	False
	n_jobs n_jobs: int, default=None The number of jobs to run in parallel for :meth:`fit`. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details.	None
	verbose verbose: int, default=0 Controls the verbosity of the tree building process.	0
	warm_start warm_start: bool, default=False When set to ``True``, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest. See :term:`the Glossary <warm_start>`. .. versionadded:: 0.21	False

Fitted attributes

Name	Type	Value
estimator_ estimator_: :class:`~sklearn.tree.ExtraTreeRegressor` instance The child estimator template used to create the collection of fitted sub-estimators. .. versionadded:: 1.2 `base_estimator_` was renamed to `estimator_`.	ExtraTreeRegressor	ExtraTreeRegr...andom_state=0)
estimators_ estimators_: list of ExtraTreeRegressor instances The collection of fitted sub-estimators.	list	[ExtraTreeRegr...te=2087557356), ExtraTreeRegr...ate=132990059), ExtraTreeRegr...te=1109697837), ExtraTreeRegr...ate=123230084), ...]
estimators_features_ estimators_features_: list of ndarray The subset of drawn features for each base estimator.	list	[array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), ...]
estimators_samples_ estimators_samples_: list of ndarray The subset of drawn samples (i.e., the in-bag samples) for each base estimator.	list	[array([131, ...57, 12, 61]), array([ 3, 1...36, 124, 16]), array([146, ...16, 124, 78]), array([183, 1...99, 61, 78]), ...]
max_samples_ max_samples_: int The actual number of samples.	int	100
n_features_in_ n_features_in_: int Number of features seen during :term:`fit`. .. versionadded:: 0.24	int	2
offset_ offset_: float Offset used to define the decision function from the raw scores. We have the relation: ``decision_function = score_samples - offset_``. ``offset_`` is defined as follows. When the contamination parameter is set to "auto", the offset is equal to -0.5 as the scores of inliers are close to 0 and the scores of outliers are close to -1. When a contamination parameter different than "auto" is provided, the offset is defined in such a way we obtain the expected number of outliers (samples with decision function < 0) in training. .. versionadded:: 0.20	float	-0.5

Plot discrete decision boundary#

We use the class DecisionBoundaryDisplay to visualize a discrete decision boundary. The background color represents whether a sample in that given area is predicted to be an outlier or not. The scatter plot displays the true labels.

import matplotlib.pyplot as plt

from sklearn.inspection import DecisionBoundaryDisplay

disp = DecisionBoundaryDisplay.from_estimator(
    clf,
    X,
    response_method="predict",
    alpha=0.5,
)
disp.ax_.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor="k")
disp.ax_.set_title("Binary decision boundary \nof IsolationForest")
plt.axis("square")
plt.legend(handles=handles, labels=["outliers", "inliers"], title="true class")
plt.show()

Binary decision boundary of IsolationForest

Plot path length decision boundary#

By setting the response_method="decision_function", the background of the DecisionBoundaryDisplay represents the measure of normality of an observation. Such score is given by the path length averaged over a forest of random trees, which itself is given by the depth of the leaf (or equivalently the number of splits) required to isolate a given sample.

When a forest of random trees collectively produce short path lengths for isolating some particular samples, they are highly likely to be anomalies and the measure of normality is close to 0. Similarly, large paths correspond to values close to 1 and are more likely to be inliers.

disp = DecisionBoundaryDisplay.from_estimator(
    clf,
    X,
    response_method="decision_function",
    alpha=0.5,
)
disp.ax_.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor="k")
disp.ax_.set_title("Path length decision boundary \nof IsolationForest")
plt.axis("square")
plt.legend(handles=handles, labels=["outliers", "inliers"], title="true class")
plt.colorbar(disp.ax_.collections[1])
plt.show()