.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/covariance/plot_robust_vs_empirical_covariance.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_covariance_plot_robust_vs_empirical_covariance.py>`
        to download the full example code or to run this example in your browser via JupyterLite or Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_covariance_plot_robust_vs_empirical_covariance.py:


=======================================
Robust vs Empirical covariance estimate
=======================================

The usual covariance maximum likelihood estimate is very sensitive to the
presence of outliers in the data set. In such a case, it would be better to
use a robust estimator of covariance to guarantee that the estimation is
resistant to "erroneous" observations in the data set. [1]_, [2]_

Minimum Covariance Determinant Estimator
----------------------------------------
The Minimum Covariance Determinant estimator is a robust, high-breakdown point
(i.e. it can be used to estimate the covariance matrix of highly contaminated
datasets, up to
:math:`\frac{n_\text{samples} - n_\text{features}-1}{2}` outliers) estimator of
covariance. The idea is to find
:math:`\frac{n_\text{samples} + n_\text{features}+1}{2}`
observations whose empirical covariance has the smallest determinant, yielding
a "pure" subset of observations from which to compute standards estimates of
location and covariance. After a correction step aiming at compensating the
fact that the estimates were learned from only a portion of the initial data,
we end up with robust estimates of the data set location and covariance.

The Minimum Covariance Determinant estimator (MCD) has been introduced by
P.J.Rousseuw in [3]_.

Evaluation
----------
In this example, we compare the estimation errors that are made when using
various types of location and covariance estimates on contaminated Gaussian
distributed data sets:

- The mean and the empirical covariance of the full dataset, which break
  down as soon as there are outliers in the data set
- The robust MCD, that has a low error provided
  :math:`n_\text{samples} > 5n_\text{features}`
- The mean and the empirical covariance of the observations that are known
  to be good ones. This can be considered as a "perfect" MCD estimation,
  so one can trust our implementation by comparing to this case.


References
----------
.. [1] Johanna Hardin, David M Rocke. The distribution of robust distances.
    Journal of Computational and Graphical Statistics. December 1, 2005,
    14(4): 928-946.
.. [2] Zoubir A., Koivunen V., Chakhchoukh Y. and Muma M. (2012). Robust
    estimation in signal processing: A tutorial-style treatment of
    fundamental concepts. IEEE Signal Processing Magazine 29(4), 61-80.
.. [3] P. J. Rousseeuw. Least median of squares regression. Journal of American
    Statistical Ass., 79:871, 1984.

.. GENERATED FROM PYTHON SOURCE LINES 55-186



.. image-sg:: /auto_examples/covariance/images/sphx_glr_plot_robust_vs_empirical_covariance_001.png
   :alt: Influence of outliers on the location estimation, Influence of outliers on the covariance estimation
   :srcset: /auto_examples/covariance/images/sphx_glr_plot_robust_vs_empirical_covariance_001.png
   :class: sphx-glr-single-img





.. code-block:: Python


    import matplotlib.font_manager
    import matplotlib.pyplot as plt
    import numpy as np

    from sklearn.covariance import EmpiricalCovariance, MinCovDet

    # example settings
    n_samples = 80
    n_features = 5
    repeat = 10

    range_n_outliers = np.concatenate(
        (
            np.linspace(0, n_samples / 8, 5),
            np.linspace(n_samples / 8, n_samples / 2, 5)[1:-1],
        )
    ).astype(int)

    # definition of arrays to store results
    err_loc_mcd = np.zeros((range_n_outliers.size, repeat))
    err_cov_mcd = np.zeros((range_n_outliers.size, repeat))
    err_loc_emp_full = np.zeros((range_n_outliers.size, repeat))
    err_cov_emp_full = np.zeros((range_n_outliers.size, repeat))
    err_loc_emp_pure = np.zeros((range_n_outliers.size, repeat))
    err_cov_emp_pure = np.zeros((range_n_outliers.size, repeat))

    # computation
    for i, n_outliers in enumerate(range_n_outliers):
        for j in range(repeat):
            rng = np.random.RandomState(i * j)

            # generate data
            X = rng.randn(n_samples, n_features)
            # add some outliers
            outliers_index = rng.permutation(n_samples)[:n_outliers]
            outliers_offset = 10.0 * (
                np.random.randint(2, size=(n_outliers, n_features)) - 0.5
            )
            X[outliers_index] += outliers_offset
            inliers_mask = np.ones(n_samples).astype(bool)
            inliers_mask[outliers_index] = False

            # fit a Minimum Covariance Determinant (MCD) robust estimator to data
            mcd = MinCovDet().fit(X)
            # compare raw robust estimates with the true location and covariance
            err_loc_mcd[i, j] = np.sum(mcd.location_**2)
            err_cov_mcd[i, j] = mcd.error_norm(np.eye(n_features))

            # compare estimators learned from the full data set with true
            # parameters
            err_loc_emp_full[i, j] = np.sum(X.mean(0) ** 2)
            err_cov_emp_full[i, j] = (
                EmpiricalCovariance().fit(X).error_norm(np.eye(n_features))
            )

            # compare with an empirical covariance learned from a pure data set
            # (i.e. "perfect" mcd)
            pure_X = X[inliers_mask]
            pure_location = pure_X.mean(0)
            pure_emp_cov = EmpiricalCovariance().fit(pure_X)
            err_loc_emp_pure[i, j] = np.sum(pure_location**2)
            err_cov_emp_pure[i, j] = pure_emp_cov.error_norm(np.eye(n_features))

    # Display results
    font_prop = matplotlib.font_manager.FontProperties(size=11)
    plt.subplot(2, 1, 1)
    lw = 2
    plt.errorbar(
        range_n_outliers,
        err_loc_mcd.mean(1),
        yerr=err_loc_mcd.std(1) / np.sqrt(repeat),
        label="Robust location",
        lw=lw,
        color="m",
    )
    plt.errorbar(
        range_n_outliers,
        err_loc_emp_full.mean(1),
        yerr=err_loc_emp_full.std(1) / np.sqrt(repeat),
        label="Full data set mean",
        lw=lw,
        color="green",
    )
    plt.errorbar(
        range_n_outliers,
        err_loc_emp_pure.mean(1),
        yerr=err_loc_emp_pure.std(1) / np.sqrt(repeat),
        label="Pure data set mean",
        lw=lw,
        color="black",
    )
    plt.title("Influence of outliers on the location estimation")
    plt.ylabel(r"Error ($||\mu - \hat{\mu}||_2^2$)")
    plt.legend(loc="upper left", prop=font_prop)

    plt.subplot(2, 1, 2)
    x_size = range_n_outliers.size
    plt.errorbar(
        range_n_outliers,
        err_cov_mcd.mean(1),
        yerr=err_cov_mcd.std(1),
        label="Robust covariance (mcd)",
        color="m",
    )
    plt.errorbar(
        range_n_outliers[: (x_size // 5 + 1)],
        err_cov_emp_full.mean(1)[: (x_size // 5 + 1)],
        yerr=err_cov_emp_full.std(1)[: (x_size // 5 + 1)],
        label="Full data set empirical covariance",
        color="green",
    )
    plt.plot(
        range_n_outliers[(x_size // 5) : (x_size // 2 - 1)],
        err_cov_emp_full.mean(1)[(x_size // 5) : (x_size // 2 - 1)],
        color="green",
        ls="--",
    )
    plt.errorbar(
        range_n_outliers,
        err_cov_emp_pure.mean(1),
        yerr=err_cov_emp_pure.std(1),
        label="Pure data set empirical covariance",
        color="black",
    )
    plt.title("Influence of outliers on the covariance estimation")
    plt.xlabel("Amount of contamination (%)")
    plt.ylabel("RMSE")
    plt.legend(loc="upper center", prop=font_prop)

    plt.show()


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 2.952 seconds)


.. _sphx_glr_download_auto_examples_covariance_plot_robust_vs_empirical_covariance.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/1.4.X?urlpath=lab/tree/notebooks/auto_examples/covariance/plot_robust_vs_empirical_covariance.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../../lite/lab/?path=auto_examples/covariance/plot_robust_vs_empirical_covariance.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_robust_vs_empirical_covariance.ipynb <plot_robust_vs_empirical_covariance.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_robust_vs_empirical_covariance.py <plot_robust_vs_empirical_covariance.py>`


.. include:: plot_robust_vs_empirical_covariance.recommendations


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_