{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Adjustment for chance in clustering performance evaluation\nThis notebook explores the impact of uniformly-distributed random labeling on\nthe behavior of some clustering evaluation metrics. For such purpose, the\nmetrics are computed with a fixed number of samples and as a function of the number\nof clusters assigned by the estimator. The example is divided into two\nexperiments:\n\n- a first experiment with fixed \"ground truth labels\" (and therefore fixed\n  number of classes) and randomly \"predicted labels\";\n- a second experiment with varying \"ground truth labels\", randomly \"predicted\n  labels\". The \"predicted labels\" have the same number of classes and clusters\n  as the \"ground truth labels\".\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Defining the list of metrics to evaluate\n\nClustering algorithms are fundamentally unsupervised learning methods.\nHowever, since we assign class labels for the synthetic clusters in this\nexample, it is possible to use evaluation metrics that leverage this\n\"supervised\" ground truth information to quantify the quality of the resulting\nclusters. Examples of such metrics are the following:\n\n- V-measure, the harmonic mean of completeness and homogeneity;\n\n- Rand index, which measures how frequently pairs of data points are grouped\n  consistently according to the result of the clustering algorithm and the\n  ground truth class assignment;\n\n- Adjusted Rand index (ARI), a chance-adjusted Rand index such that a random\n  cluster assignment has an ARI of 0.0 in expectation;\n\n- Mutual Information (MI) is an information theoretic measure that quantifies\n  how dependent are the two labelings. Note that the maximum value of MI for\n  perfect labelings depends on the number of clusters and samples;\n\n- Normalized Mutual Information (NMI), a Mutual Information defined between 0\n  (no mutual information) in the limit of large number of data points and 1\n  (perfectly matching label assignments, up to a permutation of the labels).\n  It is not adjusted for chance: then the number of clustered data points is\n  not large enough, the expected values of MI or NMI for random labelings can\n  be significantly non-zero;\n\n- Adjusted Mutual Information (AMI), a chance-adjusted Mutual Information.\n  Similarly to ARI, random cluster assignment has an AMI of 0.0 in\n  expectation.\n\nFor more information, see the `clustering_evaluation` module.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn import metrics\n\nscore_funcs = [\n    (\"V-measure\", metrics.v_measure_score),\n    (\"Rand index\", metrics.rand_score),\n    (\"ARI\", metrics.adjusted_rand_score),\n    (\"MI\", metrics.mutual_info_score),\n    (\"NMI\", metrics.normalized_mutual_info_score),\n    (\"AMI\", metrics.adjusted_mutual_info_score),\n]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## First experiment: fixed ground truth labels and growing number of clusters\n\nWe first define a function that creates uniformly-distributed random labeling.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\n\nrng = np.random.RandomState(0)\n\n\ndef random_labels(n_samples, n_classes):\n    return rng.randint(low=0, high=n_classes, size=n_samples)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Another function will use the `random_labels` function to create a fixed set\nof ground truth labels (`labels_a`) distributed in `n_classes` and then score\nseveral sets of randomly \"predicted\" labels (`labels_b`) to assess the\nvariability of a given metric at a given `n_clusters`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def fixed_classes_uniform_labelings_scores(\n    score_func, n_samples, n_clusters_range, n_classes, n_runs=5\n):\n    scores = np.zeros((len(n_clusters_range), n_runs))\n    labels_a = random_labels(n_samples=n_samples, n_classes=n_classes)\n\n    for i, n_clusters in enumerate(n_clusters_range):\n        for j in range(n_runs):\n            labels_b = random_labels(n_samples=n_samples, n_classes=n_clusters)\n            scores[i, j] = score_func(labels_a, labels_b)\n    return scores"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In this first example we set the number of classes (true number of clusters) to\n`n_classes=10`. The number of clusters varies over the values provided by\n`n_clusters_range`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\nimport seaborn as sns\n\nn_samples = 1000\nn_classes = 10\nn_clusters_range = np.linspace(2, 100, 10).astype(int)\nplots = []\nnames = []\n\nsns.color_palette(\"colorblind\")\nplt.figure(1)\n\nfor marker, (score_name, score_func) in zip(\"d^vx.,\", score_funcs):\n    scores = fixed_classes_uniform_labelings_scores(\n        score_func, n_samples, n_clusters_range, n_classes=n_classes\n    )\n    plots.append(\n        plt.errorbar(\n            n_clusters_range,\n            scores.mean(axis=1),\n            scores.std(axis=1),\n            alpha=0.8,\n            linewidth=1,\n            marker=marker,\n        )[0]\n    )\n    names.append(score_name)\n\nplt.title(\n    \"Clustering measures for random uniform labeling\\n\"\n    f\"against reference assignment with {n_classes} classes\"\n)\nplt.xlabel(f\"Number of clusters (Number of samples is fixed to {n_samples})\")\nplt.ylabel(\"Score value\")\nplt.ylim(bottom=-0.05, top=1.05)\nplt.legend(plots, names, bbox_to_anchor=(0.5, 0.5))\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The Rand index saturates for `n_clusters` > `n_classes`. Other non-adjusted\nmeasures such as the V-Measure show a linear dependency between the number of\nclusters and the number of samples.\n\nAdjusted for chance measure, such as ARI and AMI, display some random\nvariations centered around a mean score of 0.0, independently of the number of\nsamples and clusters.\n\n## Second experiment: varying number of classes and clusters\n\nIn this section we define a similar function that uses several metrics to\nscore 2 uniformly-distributed random labelings. In this case, the number of\nclasses and assigned number of clusters are matched for each possible value in\n`n_clusters_range`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def uniform_labelings_scores(score_func, n_samples, n_clusters_range, n_runs=5):\n    scores = np.zeros((len(n_clusters_range), n_runs))\n\n    for i, n_clusters in enumerate(n_clusters_range):\n        for j in range(n_runs):\n            labels_a = random_labels(n_samples=n_samples, n_classes=n_clusters)\n            labels_b = random_labels(n_samples=n_samples, n_classes=n_clusters)\n            scores[i, j] = score_func(labels_a, labels_b)\n    return scores"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In this case, we use `n_samples=100` to show the effect of having a number of\nclusters similar or equal to the number of samples.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "n_samples = 100\nn_clusters_range = np.linspace(2, n_samples, 10).astype(int)\n\nplt.figure(2)\n\nplots = []\nnames = []\n\nfor marker, (score_name, score_func) in zip(\"d^vx.,\", score_funcs):\n    scores = uniform_labelings_scores(score_func, n_samples, n_clusters_range)\n    plots.append(\n        plt.errorbar(\n            n_clusters_range,\n            np.median(scores, axis=1),\n            scores.std(axis=1),\n            alpha=0.8,\n            linewidth=2,\n            marker=marker,\n        )[0]\n    )\n    names.append(score_name)\n\nplt.title(\n    \"Clustering measures for 2 random uniform labelings\\nwith equal number of clusters\"\n)\nplt.xlabel(f\"Number of clusters (Number of samples is fixed to {n_samples})\")\nplt.ylabel(\"Score value\")\nplt.legend(plots, names)\nplt.ylim(bottom=-0.05, top=1.05)\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We observe similar results as for the first experiment: adjusted for chance\nmetrics stay constantly near zero while other metrics tend to get larger with\nfiner-grained labelings. The mean V-measure of random labeling increases\nsignificantly as the number of clusters is closer to the total number of\nsamples used to compute the measure. Furthermore, raw Mutual Information is\nunbounded from above and its scale depends on the dimensions of the clustering\nproblem and the cardinality of the ground truth classes. This is why the\ncurve goes off the chart.\n\nOnly adjusted measures can hence be safely used as a consensus index to\nevaluate the average stability of clustering algorithms for a given value of k\non various overlapping sub-samples of the dataset.\n\nNon-adjusted clustering evaluation metric can therefore be misleading as they\noutput large values for fine-grained labelings, one could be lead to think\nthat the labeling has captured meaningful groups while they can be totally\nrandom. In particular, such non-adjusted metrics should not be used to compare\nthe results of different clustering algorithms that output a different number\nof clusters.\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.19"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}