{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Statistical comparison of models using grid search\n\nThis example illustrates how to statistically compare the performance of models\ntrained and evaluated using :class:`~sklearn.model_selection.GridSearchCV`.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We will start by simulating moon shaped data (where the ideal separation\nbetween classes is non-linear), adding to it a moderate degree of noise.\nDatapoints will belong to one of two possible classes to be predicted by two\nfeatures. We will simulate 50 samples for each class:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(__doc__)\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nfrom sklearn.datasets import make_moons\n\nX, y = make_moons(noise=0.352, random_state=1, n_samples=100)\n\nsns.scatterplot(\n    x=X[:, 0], y=X[:, 1], hue=y,\n    marker='o', s=25, edgecolor='k', legend=False\n).set_title(\"Data\")\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We will compare the performance of :class:`~sklearn.svm.SVC` estimators that\nvary on their `kernel` parameter, to decide which choice of this\nhyper-parameter predicts our simulated data best.\nWe will evaluate the performance of the models using\n:class:`~sklearn.model_selection.RepeatedStratifiedKFold`, repeating 10 times\na 10-fold stratified cross validation using a different randomization of the\ndata in each repetition. The performance will be evaluated using\n:class:`~sklearn.metrics.roc_auc_score`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold\nfrom sklearn.svm import SVC\n\nparam_grid = [\n    {'kernel': ['linear']},\n    {'kernel': ['poly'], 'degree': [2, 3]},\n    {'kernel': ['rbf']}\n]\n\nsvc = SVC(random_state=0)\n\ncv = RepeatedStratifiedKFold(\n    n_splits=10, n_repeats=10, random_state=0\n)\n\nsearch = GridSearchCV(\n    estimator=svc, param_grid=param_grid,\n    scoring='roc_auc', cv=cv\n)\nsearch.fit(X, y)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can now inspect the results of our search, sorted by their\n`mean_test_score`:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n\nresults_df = pd.DataFrame(search.cv_results_)\nresults_df = results_df.sort_values(by=['rank_test_score'])\nresults_df = (\n    results_df\n    .set_index(results_df[\"params\"].apply(\n        lambda x: \"_\".join(str(val) for val in x.values()))\n    )\n    .rename_axis('kernel')\n)\nresults_df[\n    ['params', 'rank_test_score', 'mean_test_score', 'std_test_score']\n]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can see that the estimator using the `'rbf'` kernel performed best,\nclosely followed by `'linear'`. Both estimators with a `'poly'` kernel\nperformed worse, with the one using a two-degree polynomial achieving a much\nlower perfomance than all other models.\n\nUsually, the analysis just ends here, but half the story is missing. The\noutput of :class:`~sklearn.model_selection.GridSearchCV` does not provide\ninformation on the certainty of the differences between the models.\nWe don't know if these are **statistically** significant.\nTo evaluate this, we need to conduct a statistical test.\nSpecifically, to contrast the performance of two models we should\nstatistically compare their AUC scores. There are 100 samples (AUC\nscores) for each model as we repreated 10 times a 10-fold cross-validation.\n\nHowever, the scores of the models are not independent: all models are\nevaluated on the **same** 100 partitions, increasing the correlation\nbetween the performance of the models.\nSince some partitions of the data can make the distinction of the classes\nparticularly easy or hard to find for all models, the models scores will\nco-vary.\n\nLet's inspect this partition effect by plotting the performance of all models\nin each fold, and calculating the correlation between models across folds:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# create df of model scores ordered by perfomance\nmodel_scores = results_df.filter(regex=r'split\\d*_test_score')\n\n# plot 30 examples of dependency between cv fold and AUC scores\nfig, ax = plt.subplots()\nsns.lineplot(\n    data=model_scores.transpose().iloc[:30],\n    dashes=False, palette='Set1', marker='o', alpha=.5, ax=ax\n)\nax.set_xlabel(\"CV test fold\", size=12, labelpad=10)\nax.set_ylabel(\"Model AUC\", size=12)\nax.tick_params(bottom=True, labelbottom=False)\nplt.show()\n\n# print correlation of AUC scores across folds\nprint(f\"Correlation of models:\\n {model_scores.transpose().corr()}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can observe that the performance of the models highly depends on the fold.\n\nAs a consequence, if we assume independence between samples we will be\nunderestimating the variance computed in our statistical tests, increasing\nthe number of false positive errors (i.e. detecting a significant difference\nbetween models when such does not exist) [1]_.\n\nSeveral variance-corrected statistical tests have been developed for these\ncases. In this example we will show how to implement one of them (the so\ncalled Nadeau and Bengio's corrected t-test) under two different statistical\nframeworks: frequentist and Bayesian.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Comparing two models: frequentist approach\n\nWe can start by asking: \"Is the first model significantly better than the\nsecond model (when ranked by `mean_test_score`)?\"\n\nTo answer this question using a frequentist approach we could\nrun a paired t-test and compute the p-value. This is also known as\nDiebold-Mariano test in the forecast literature [5]_.\nMany variants of such a t-test have been developed to account for the\n'non-independence of samples problem'\ndescribed in the previous section. We will use the one proven to obtain the\nhighest replicability scores (which rate how similar the performance of a\nmodel is when evaluating it on different random partitions of the same\ndataset) while mantaining a low rate of false postitives and false negatives:\nthe Nadeau and Bengio's corrected t-test [2]_ that uses a 10 times repeated\n10-fold cross validation [3]_.\n\nThis corrected paired t-test is computed as:\n\n\\begin{align}t=\\frac{\\frac{1}{k \\cdot r}\\sum_{i=1}^{k}\\sum_{j=1}^{r}x_{ij}}\n   {\\sqrt{(\\frac{1}{k \\cdot r}+\\frac{n_{test}}{n_{train}})\\hat{\\sigma}^2}}\\end{align}\n\nwhere $k$ is the number of folds,\n$r$ the number of repetitions in the cross-validation,\n$x$ is the difference in performance of the models,\n$n_{test}$ is the number of samples used for testing,\n$n_{train}$ is the number of samples used for training,\nand $\\hat{\\sigma}^2$ represents the variance of the observed\ndifferences.\n\nLet's implement a corrected right-tailed paired t-test to evaluate if the\nperformance of the first model is significantly better than that of the\nsecond model. Our null hypothesis is that the second model performs at least\nas good as the first model.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nfrom scipy.stats import t\n\n\ndef corrected_std(differences, n_train, n_test):\n    \"\"\"Corrects standard deviation using Nadeau and Bengio's approach.\n\n    Parameters\n    ----------\n    differences : ndarray of shape (n_samples, 1)\n        Vector containing the differences in the score metrics of two models.\n    n_train : int\n        Number of samples in the training set.\n    n_test : int\n        Number of samples in the testing set.\n\n    Returns\n    -------\n    corrected_std : int\n        Variance-corrected standard deviation of the set of differences.\n    \"\"\"\n    n = n_train + n_test\n    corrected_var = (\n        np.var(differences, ddof=1) * ((1 / n) + (n_test / n_train))\n    )\n    corrected_std = np.sqrt(corrected_var)\n    return corrected_std\n\n\ndef compute_corrected_ttest(differences, df, n_train, n_test):\n    \"\"\"Computes right-tailed paired t-test with corrected variance.\n\n    Parameters\n    ----------\n    differences : array-like of shape (n_samples, 1)\n        Vector containing the differences in the score metrics of two models.\n    df : int\n        Degrees of freedom.\n    n_train : int\n        Number of samples in the training set.\n    n_test : int\n        Number of samples in the testing set.\n\n    Returns\n    -------\n    t_stat : float\n        Variance-corrected t-statistic.\n    p_val : float\n        Variance-corrected p-value.\n    \"\"\"\n    mean = np.mean(differences)\n    std = corrected_std(differences, n_train, n_test)\n    t_stat = mean / std\n    p_val = t.sf(np.abs(t_stat), df)  # right-tailed t-test\n    return t_stat, p_val"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "model_1_scores = model_scores.iloc[0].values  # scores of the best model\nmodel_2_scores = model_scores.iloc[1].values  # scores of the second-best model\n\ndifferences = model_1_scores - model_2_scores\n\nn = differences.shape[0]  # number of test sets\ndf = n - 1\nn_train = len(list(cv.split(X, y))[0][0])\nn_test = len(list(cv.split(X, y))[0][1])\n\nt_stat, p_val = compute_corrected_ttest(differences, df, n_train, n_test)\nprint(f\"Corrected t-value: {t_stat:.3f}\\n\"\n      f\"Corrected p-value: {p_val:.3f}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can compare the corrected t- and p-values with the uncorrected ones:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "t_stat_uncorrected = (\n    np.mean(differences) / np.sqrt(np.var(differences, ddof=1) / n)\n)\np_val_uncorrected = t.sf(np.abs(t_stat_uncorrected), df)\n\nprint(f\"Uncorrected t-value: {t_stat_uncorrected:.3f}\\n\"\n      f\"Uncorrected p-value: {p_val_uncorrected:.3f}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Using the conventional significance alpha level at `p=0.05`, we observe that\nthe uncorrected t-test concludes that the first model is significantly better\nthan the second.\n\nWith the corrected approach, in contrast, we fail to detect this difference.\n\nIn the latter case, however, the frequentist approach does not let us\nconclude that the first and second model have an equivalent performance. If\nwe wanted to make this assertion we need to use a Bayesian approach.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Comparing two models: Bayesian approach\nWe can use Bayesian estimation to calculate the probability that the first\nmodel is better than the second. Bayesian estimation will output a\ndistribution followed by the mean $\\mu$ of the differences in the\nperformance of two models.\n\nTo obtain the posterior distribution we need to define a prior that models\nour beliefs of how the mean is distributed before looking at the data,\nand multiply it by a likelihood function that computes how likely our\nobserved differences are, given the values that the mean of differences\ncould take.\n\nBayesian estimation can be carried out in many forms to answer our question,\nbut in this example we will implement the approach suggested by Benavoli and\ncollegues [4]_.\n\nOne way of defining our posterior using a closed-form expression is to select\na prior conjugate to the likelihood function. Benavoli and collegues [4]_\nshow that when comparing the performance of two classifiers we can model the\nprior as a Normal-Gamma distribution (with both mean and variance unknown)\nconjugate to a normal likelihood, to thus express the posterior as a normal\ndistribution.\nMarginalizing out the variance from this normal posterior, we can define the\nposterior of the mean parameter as a Student's t-distribution. Specifically:\n\n\\begin{align}St(\\mu;n-1,\\overline{x},(\\frac{1}{n}+\\frac{n_{test}}{n_{train}})\n   \\hat{\\sigma}^2)\\end{align}\n\nwhere $n$ is the total number of samples,\n$\\overline{x}$ represents the mean difference in the scores,\n$n_{test}$ is the number of samples used for testing,\n$n_{train}$ is the number of samples used for training,\nand $\\hat{\\sigma}^2$ represents the variance of the observed\ndifferences.\n\nNotice that we are using Nadeau and Bengio's corrected variance in our\nBayesian approach as well.\n\nLet's compute and plot the posterior:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# intitialize random variable\nt_post = t(\n    df, loc=np.mean(differences),\n    scale=corrected_std(differences, n_train, n_test)\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's plot the posterior distribution:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "x = np.linspace(t_post.ppf(0.001), t_post.ppf(0.999), 100)\n\nplt.plot(x, t_post.pdf(x))\nplt.xticks(np.arange(-0.04, 0.06, 0.01))\nplt.fill_between(x, t_post.pdf(x), 0, facecolor='blue', alpha=.2)\nplt.ylabel(\"Probability density\")\nplt.xlabel(r\"Mean difference ($\\mu$)\")\nplt.title(\"Posterior distribution\")\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can calculate the probability that the first model is better than the\nsecond by computing the area under the curve of the posterior distirbution\nfrom zero to infinity. And also the reverse: we can calculate the probability\nthat the second model is better than the first by computing the area under\nthe curve from minus infinity to zero.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "better_prob = 1 - t_post.cdf(0)\n\nprint(f\"Probability of {model_scores.index[0]} being more accurate than \"\n      f\"{model_scores.index[1]}: {better_prob:.3f}\")\nprint(f\"Probability of {model_scores.index[1]} being more accurate than \"\n      f\"{model_scores.index[0]}: {1 - better_prob:.3f}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In contrast with the frequentist approach, we can compute the probability\nthat one model is better than the other.\n\nNote that we obtained similar results as those in the frequentist approach.\nGiven our choice of priors, we are essentially performing the same\ncomputations, but we are allowed to make different assertions.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Region of Practical Equivalence\nSometimes we are interested in determining the probabilities that our models\nhave an equivalent performance, where \"equivalent\" is defined in a practical\nway. A naive approach [4]_ would be to define estimators as practically\nequivalent when they differ by less than 1% in their accuracy. But we could\nalso define this practical equivalence taking into account the problem we are\ntrying to solve. For example, a difference of 5% in accuracy would mean an\nincrease of $1000 in sales, and we consider any quantity above that as\nrelevant for our business.\n\nIn this example we are going to define the\nRegion of Practical Equivalence (ROPE) to be $[-0.01, 0.01]$. That is,\nwe will consider two models as practically equivalent if they differ by less\nthan 1% in their performance.\n\nTo compute the probabilities of the classifiers being practically equivalent,\nwe calculate the area under the curve of the posterior over the ROPE\ninterval:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "rope_interval = [-0.01, 0.01]\nrope_prob = t_post.cdf(rope_interval[1]) - t_post.cdf(rope_interval[0])\n\nprint(f\"Probability of {model_scores.index[0]} and {model_scores.index[1]} \"\n      f\"being practically equivalent: {rope_prob:.3f}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can plot how the posterior is distributed over the ROPE interval:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "x_rope = np.linspace(rope_interval[0], rope_interval[1], 100)\n\nplt.plot(x, t_post.pdf(x))\nplt.xticks(np.arange(-0.04, 0.06, 0.01))\nplt.vlines([-0.01, 0.01], ymin=0, ymax=(np.max(t_post.pdf(x)) + 1))\nplt.fill_between(x_rope, t_post.pdf(x_rope), 0, facecolor='blue', alpha=.2)\nplt.ylabel(\"Probability density\")\nplt.xlabel(r\"Mean difference ($\\mu$)\")\nplt.title(\"Posterior distribution under the ROPE\")\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As suggested in [4]_, we can further interpret these probabilities using the\nsame criteria as the frequentist approach: is the probability of falling\ninside the ROPE bigger than 95% (alpha value of 5%)?  In that case we can\nconclude that both models are practically equivalent.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The Bayesian estimation approach also allows us to compute how uncertain we\nare about our estimation of the difference. This can be calculated using\ncredible intervals. For a given probability, they show the range of values\nthat the estimated quantity, in our case the mean difference in\nperformance, can take.\nFor example, a 50% credible interval [x, y] tells us that there is a 50%\nprobability that the true (mean) difference of performance between models is\nbetween x and y.\n\nLet's determine the credible intervals of our data using 50%, 75% and 95%:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "cred_intervals = []\nintervals = [0.5, 0.75, 0.95]\n\nfor interval in intervals:\n    cred_interval = list(t_post.interval(interval))\n    cred_intervals.append([interval, cred_interval[0], cred_interval[1]])\n\ncred_int_df = pd.DataFrame(\n    cred_intervals,\n    columns=['interval', 'lower value', 'upper value']\n).set_index('interval')\ncred_int_df"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As shown in the table, there is a 50% probability that the true mean\ndifference between models will be between 0.000977 and 0.019023, 70%\nprobability that it will be between -0.005422 and 0.025422, and 95%\nprobability that it will be between -0.016445\tand 0.036445.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Pairwise comparison of all models: frequentist approach\n\nWe could also be interested in comparing the performance of all our models\nevaluated with :class:`~sklearn.model_selection.GridSearchCV`. In this case\nwe would be running our statistical test multiple times, which leads us to\nthe `multiple comparisons problem\n<https://en.wikipedia.org/wiki/Multiple_comparisons_problem>`_.\n\nThere are many possible ways to tackle this problem, but a standard approach\nis to apply a `Bonferroni correction\n<https://en.wikipedia.org/wiki/Bonferroni_correction>`_. Bonferroni can be\ncomputed by multiplying the p-value by the number of comparisons we are\ntesting.\n\nLet's compare the performance of the models using the corrected t-test:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from itertools import combinations\nfrom math import factorial\n\nn_comparisons = (\n    factorial(len(model_scores))\n    / (factorial(2) * factorial(len(model_scores) - 2))\n)\npairwise_t_test = []\n\nfor model_i, model_k in combinations(range(len(model_scores)), 2):\n    model_i_scores = model_scores.iloc[model_i].values\n    model_k_scores = model_scores.iloc[model_k].values\n    differences = model_i_scores - model_k_scores\n    t_stat, p_val = compute_corrected_ttest(\n        differences, df, n_train, n_test\n    )\n    p_val *= n_comparisons  # implement Bonferroni correction\n    # Bonferroni can output p-values higher than 1\n    p_val = 1 if p_val > 1 else p_val\n    pairwise_t_test.append(\n        [model_scores.index[model_i], model_scores.index[model_k],\n         t_stat, p_val]\n    )\n\npairwise_comp_df = pd.DataFrame(\n    pairwise_t_test,\n    columns=['model_1', 'model_2', 't_stat', 'p_val']\n).round(3)\npairwise_comp_df"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We observe that after correcting for multiple comparisons, the only model\nthat significantly differs from the others is `'2_poly'`.\n`'rbf'`, the model ranked first by\n:class:`~sklearn.model_selection.GridSearchCV`, does not significantly\ndiffer from `'linear'` or `'3_poly'`.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Pairwise comparison of all models: Bayesian approach\n\nWhen using Bayesian estimation to compare multiple models, we don't need to\ncorrect for multiple comparisons (for reasons why see [4]_).\n\nWe can carry out our pairwise comparisons the same way as in the first\nsection:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "pairwise_bayesian = []\n\nfor model_i, model_k in combinations(range(len(model_scores)), 2):\n    model_i_scores = model_scores.iloc[model_i].values\n    model_k_scores = model_scores.iloc[model_k].values\n    differences = model_i_scores - model_k_scores\n    t_post = t(\n        df, loc=np.mean(differences),\n        scale=corrected_std(differences, n_train, n_test)\n    )\n    worse_prob = t_post.cdf(rope_interval[0])\n    better_prob = 1 - t_post.cdf(rope_interval[1])\n    rope_prob = t_post.cdf(rope_interval[1]) - t_post.cdf(rope_interval[0])\n\n    pairwise_bayesian.append([worse_prob, better_prob, rope_prob])\n\npairwise_bayesian_df = (pd.DataFrame(\n    pairwise_bayesian,\n    columns=['worse_prob', 'better_prob', 'rope_prob']\n).round(3))\n\npairwise_comp_df = pairwise_comp_df.join(pairwise_bayesian_df)\npairwise_comp_df"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Using the Bayesian approach we can compute the probability that a model\nperforms better, worse or practically equivalent to another.\n\nResults show that the model ranked first by\n:class:`~sklearn.model_selection.GridSearchCV` `'rbf'`, has approximately a\n6.8% chance of being worse than `'linear'`, and a 1.8% chance of being worse\nthan `'3_poly'`.\n`'rbf'` and `'linear'` have a 43% probability of being practically\nequivalent, while `'rbf'` and `'3_poly'` have a 10% chance of being so.\n\nSimilarly to the conclusions obtained using the frequentist approach, all\nmodels have a 100% probability of being better than `'2_poly'`, and none have\na practically equivalent performance with the latter.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Take-home messages\n- Small differences in performance measures might easily turn out to be\n  merely by chance, but not because one model predicts systematically better\n  than the other. As shown in this example, statistics can tell you how\n  likely that is.\n- When statistically comparing the performance of two models evaluated in\n  GridSearchCV, it is necessary to correct the calculated variance which\n  could be underestimated since the scores of the models are not independent\n  from each other.\n- A frequentist approach that uses a (variance-corrected) paired t-test can\n  tell us if the performance of one model is better than another with a\n  degree of certainty above chance.\n- A Bayesian approach can provide the probabilities of one model being\n  better, worse or practically equivalent than another. It can also tell us\n  how confident we are of knowing that the true differences of our models\n  fall under a certain range of values.\n- If multiple models are statistically compared, a multiple comparisons\n  correction is needed when using the frequentist approach.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        ".. topic:: References\n\n   .. [1] Dietterich, T. G. (1998). `Approximate statistical tests for\n          comparing supervised classification learning algorithms\n          <http://web.cs.iastate.edu/~jtian/cs573/Papers/Dietterich-98.pdf>`_.\n          Neural computation, 10(7).\n   .. [2] Nadeau, C., & Bengio, Y. (2000). `Inference for the generalization\n          error\n          <https://papers.nips.cc/paper/1661-inference-for-the-generalization-error.pdf>`_.\n          In Advances in neural information processing systems.\n   .. [3] Bouckaert, R. R., & Frank, E. (2004). `Evaluating the replicability\n          of significance tests for comparing learning algorithms\n          <https://www.cms.waikato.ac.nz/~ml/publications/2004/bouckaert-frank.pdf>`_.\n          In Pacific-Asia Conference on Knowledge Discovery and Data Mining.\n   .. [4] Benavoli, A., Corani, G., Dem\u0161ar, J., & Zaffalon, M. (2017). `Time\n          for a change: a tutorial for comparing multiple classifiers through\n          Bayesian analysis\n          <http://www.jmlr.org/papers/volume18/16-305/16-305.pdf>`_.\n          The Journal of Machine Learning Research, 18(1). See the Python\n          library that accompanies this paper `here\n          <https://github.com/janezd/baycomp>`_.\n   .. [5] Diebold, F.X. & Mariano R.S. (1995). `Comparing predictive accuracy\n          <http://www.est.uc3m.es/esp/nueva_docencia/comp_col_get/lade/tecnicas_prediccion/Practicas0708/Comparing%20Predictive%20Accuracy%20(Dielbold).pdf>`_\n          Journal of Business & economic statistics, 20(1), 134-144.\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.5"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}