Note

Go to the end to download the full example code or to run this example in your browser via JupyterLite or Binder.

Gaussian Mixture Model Selection#

This example shows that model selection can be performed with Gaussian Mixture Models (GMM) using information-theory criteria. Model selection concerns both the covariance type and the number of components in the model.

In this case, both the Akaike Information Criterion (AIC) and the Bayes Information Criterion (BIC) provide the right result, but we only demo the latter as BIC is better suited to identify the true model among a set of candidates. Unlike Bayesian procedures, such inferences are prior-free.

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

Data generation#

We generate two components (each one containing n_samples) by randomly sampling the standard normal distribution as returned by numpy.random.randn. One component is kept spherical yet shifted and re-scaled. The other one is deformed to have a more general covariance matrix.

import numpy as np

n_samples = 500
np.random.seed(0)
C = np.array([[0.0, -0.1], [1.7, 0.4]])
component_1 = np.dot(np.random.randn(n_samples, 2), C)  # general
component_2 = 0.7 * np.random.randn(n_samples, 2) + np.array([-4, 1])  # spherical

X = np.concatenate([component_1, component_2])

We can visualize the different components:

import matplotlib.pyplot as plt

plt.scatter(component_1[:, 0], component_1[:, 1], s=0.8)
plt.scatter(component_2[:, 0], component_2[:, 1], s=0.8)
plt.title("Gaussian Mixture components")
plt.axis("equal")
plt.show()

Model training and selection#

We vary the number of components from 1 to 6 and the type of covariance parameters to use:

"full": each component has its own general covariance matrix.
"tied": all components share the same general covariance matrix.
"diag": each component has its own diagonal covariance matrix.
"spherical": each component has its own single variance.

We score the different models and keep the best model (the lowest BIC). This is done by using GridSearchCV and a user-defined score function which returns the negative BIC score, as GridSearchCV is designed to maximize a score (maximizing the negative BIC is equivalent to minimizing the BIC).

The best set of parameters and estimator are stored in best_parameters_ and best_estimator_, respectively.

from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV


def gmm_bic_score(estimator, X):
    """Callable to pass to GridSearchCV that will use the BIC score."""
    # Make it negative since GridSearchCV expects a score to maximize
    return -estimator.bic(X)


param_grid = {
    "n_components": range(1, 7),
    "covariance_type": ["spherical", "tied", "diag", "full"],
}
grid_search = GridSearchCV(
    GaussianMixture(), param_grid=param_grid, scoring=gmm_bic_score
)
grid_search.fit(X)

GridSearchCV(estimator=GaussianMixture(),
             param_grid={'covariance_type': ['spherical', 'tied', 'diag',
                                             'full'],
                         'n_components': range(1, 7)},
             scoring=<function gmm_bic_score at 0x784129f52610>)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

GridSearchCV

?Documentation for GridSearchCViFitted

Parameters

	estimator estimator: estimator object This is assumed to implement the scikit-learn estimator interface. Either estimator needs to provide a ``score`` function, or ``scoring`` must be passed.	GaussianMixture()
	param_grid param_grid: dict or list of dictionaries Dictionary with parameters names (`str`) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.	{'covariance_type': ['spherical', 'tied', ...], 'n_components': range(1, 7)}
	scoring scoring: str, callable, list, tuple or dict, default=None Strategy to evaluate the performance of the cross-validated model on the test set. If `scoring` represents a single score, one can use: - a single string (see :ref:`scoring_string_names`); - a callable (see :ref:`scoring_callable`) that returns a single value; - `None`, the `estimator`'s :ref:`default evaluation criterion <scoring_api_overview>` is used. If `scoring` represents multiple scores, one can use: - a list or tuple of unique strings; - a callable returning a dictionary where the keys are the metric names and the values are the metric scores; - a dictionary with metric names as keys and callables as values. See :ref:`multimetric_grid_search` for an example.	<function gmm...x784129f52610>
	n_jobs n_jobs: int, default=None Number of jobs to run in parallel. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details. .. versionchanged:: v0.20 `n_jobs` default changed from 1 to None	None
	refit refit: bool, str, or callable, default=True Refit an estimator using the best found parameters on the whole dataset. For multiple metric evaluation, this needs to be a `str` denoting the scorer that would be used to find the best parameters for refitting the estimator at the end. Where there are considerations other than maximum score in choosing a best estimator, ``refit`` can be set to a function which returns the selected ``best_index_`` given ``cv_results_``. In that case, the ``best_estimator_`` and ``best_params_`` will be set according to the returned ``best_index_`` while the ``best_score_`` attribute will not be available. The refitted estimator is made available at the ``best_estimator_`` attribute and permits using ``predict`` directly on this ``GridSearchCV`` instance. Also for multiple metric evaluation, the attributes ``best_index_``, ``best_score_`` and ``best_params_`` will only be available if ``refit`` is set and all of them will be determined w.r.t this specific scorer. See ``scoring`` parameter to know more about multiple metric evaluation. See :ref:`sphx_glr_auto_examples_model_selection_plot_grid_search_digits.py` to see how to design a custom selection strategy using a callable via `refit`. See :ref:`this example <sphx_glr_auto_examples_model_selection_plot_grid_search_refit_callable.py>` for an example of how to use ``refit=callable`` to balance model complexity and cross-validated score. .. versionchanged:: 0.20 Support for callable added.	True
	cv cv: int, cross-validation generator or an iterable, default=None Determines the cross-validation splitting strategy. Possible inputs for cv are: - None, to use the default 5-fold cross validation, - integer, to specify the number of folds in a `(Stratified)KFold`, - :term:`CV splitter`, - an iterable yielding (train, test) splits as arrays of indices. For integer/None inputs, if the estimator is a classifier and ``y`` is either binary or multiclass, :class:`StratifiedKFold` is used. In all other cases, :class:`KFold` is used. These splitters are instantiated with `shuffle=False` so the splits will be the same across calls. Refer :ref:`User Guide <cross_validation>` for the various cross-validation strategies that can be used here. .. versionchanged:: 0.22 ``cv`` default value if None changed from 3-fold to 5-fold.	None
	verbose verbose: int, default=0 Controls the verbosity of information printed during fitting, with higher values yielding more detailed logging. - 0 : no messages are printed; - >=1 : summary of the total number of fits; - >=2 : computation time for each fold and parameter candidate; - >=3 : fold indices and scores; - >=10 : parameter candidate indices and START messages before each fit.	0
	pre_dispatch pre_dispatch: int, or str, default='2n_jobs' Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be: - None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs - An int, giving the exact number of total jobs that are spawned - A str, giving an expression as a function of n_jobs, as in '2n_jobs'	'2*n_jobs'
	error_score error_score: 'raise' or numeric, default=np.nan Value to assign to the score if an error occurs in estimator fitting. If set to 'raise', the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.	nan
	return_train_score return_train_score: bool, default=False If ``False``, the ``cv_results_`` attribute will not include training scores. Computing training scores is used to get insights on how different parameter settings impact the overfitting/underfitting trade-off. However computing the scores on the training set can be computationally expensive and is not strictly required to select the parameters that yield the best generalization performance. .. versionadded:: 0.19 .. versionchanged:: 0.21 Default value was changed from ``True`` to ``False``	False

Fitted attributes

Name	Type	Value
best_estimator_ best_estimator_: estimator Estimator that was chosen by the search, i.e. estimator which gave highest score (or smallest loss if specified) on the left out data. Not available if ``refit=False``. See ``refit`` parameter for more information on allowed values.	GaussianMixture	GaussianMixtu..._components=2)
best_index_ best_index_: int The index (of the ``cv_results_`` arrays) which corresponds to the best candidate parameter setting. The dict at ``search.cv_results_['params'][search.best_index_]`` gives the parameter setting for the best model, that gives the highest mean score (``search.best_score_``). For multi-metric evaluation, this is present only if ``refit`` is specified.	int64	np.int64(19)
best_params_ best_params_: dict Parameter setting that gave the best results on the hold out data. For multi-metric evaluation, this is present only if ``refit`` is specified.	dict	{'co...pe': 'full', 'n_...ts': 2}
best_score_ best_score_: float Mean cross-validated score of the best_estimator For multi-metric evaluation, this is present only if ``refit`` is specified. This attribute is not available if ``refit`` is a function.	float64	-1047
cv_results_ cv_results_: dict of numpy (masked) ndarrays A dict with keys as column headers and values as columns, that can be imported into a pandas ``DataFrame``. For instance the below given table +------------+-----------+------------+-----------------+---+---------+ \|param_kernel\|param_gamma\|param_degree\|split0_test_score\|...\|rank_t...\| +============+===========+============+=================+===+=========+ \| 'poly' \| -- \| 2 \| 0.80 \|...\| 2 \| +------------+-----------+------------+-----------------+---+---------+ \| 'poly' \| -- \| 3 \| 0.70 \|...\| 4 \| +------------+-----------+------------+-----------------+---+---------+ \| 'rbf' \| 0.1 \| -- \| 0.80 \|...\| 3 \| +------------+-----------+------------+-----------------+---+---------+ \| 'rbf' \| 0.2 \| -- \| 0.93 \|...\| 1 \| +------------+-----------+------------+-----------------+---+---------+ will be represented by a ``cv_results_`` dict of:: { 'param_kernel': masked_array(data = ['poly', 'poly', 'rbf', 'rbf'], mask = [False False False False]...) 'param_gamma': masked_array(data = [-- -- 0.1 0.2], mask = [ True True False False]...), 'param_degree': masked_array(data = [2.0 3.0 -- --], mask = [False False True True]...), 'split0_test_score' : [0.80, 0.70, 0.80, 0.93], 'split1_test_score' : [0.82, 0.50, 0.70, 0.78], 'mean_test_score' : [0.81, 0.60, 0.75, 0.85], 'std_test_score' : [0.01, 0.10, 0.05, 0.08], 'rank_test_score' : [2, 4, 3, 1], 'split0_train_score' : [0.80, 0.92, 0.70, 0.93], 'split1_train_score' : [0.82, 0.55, 0.70, 0.87], 'mean_train_score' : [0.81, 0.74, 0.70, 0.90], 'std_train_score' : [0.01, 0.19, 0.00, 0.03], 'mean_fit_time' : [0.73, 0.63, 0.43, 0.49], 'std_fit_time' : [0.01, 0.02, 0.01, 0.01], 'mean_score_time' : [0.01, 0.06, 0.04, 0.04], 'std_score_time' : [0.00, 0.00, 0.00, 0.01], 'params' : [{'kernel': 'poly', 'degree': 2}, ...], } For an example of visualization and interpretation of GridSearch results, see :ref:`sphx_glr_auto_examples_model_selection_plot_grid_search_stats.py`. NOTE The key ``'params'`` is used to store a list of parameter settings dicts for all the parameter candidates. The ``mean_fit_time``, ``std_fit_time``, ``mean_score_time`` and ``std_score_time`` are all in seconds. For multi-metric evaluation, the scores for all the scorers are available in the ``cv_results_`` dict at the keys ending with that scorer's name (``'_<scorer_name>'``) instead of ``'_score'`` shown above. ('split0_test_precision', 'mean_train_precision' etc.)	dict	{'me...me': array([0.0022..., 0.01910238]), 'me...me': array([0.0003..., 0.00057511]), 'me...re': array([-1728....179.97789043]), 'pa...pe': masked_array(... dtype=object), ...}
multimetric_ multimetric_: bool Whether or not the scorers compute several metrics.	bool	False
n_features_in_ n_features_in_: int Number of features seen during :term:`fit`. Only defined if `best_estimator_` is defined (see the documentation for the `refit` parameter for more details) and that `best_estimator_` exposes `n_features_in_` when fit. .. versionadded:: 0.24	int	2
n_splits_ n_splits_: int The number of cross-validation splits (folds/iterations).	int	5
refit_time_ refit_time_: float Seconds used for refitting the best model on the whole dataset. This is present only if ``refit`` is not False. .. versionadded:: 0.20	float	0.006094
scorer_ scorer_: function or a dict Scorer function used on the held out data to choose the best parameters for the model. For multi-metric evaluation, this attribute holds the validated ``scoring`` dict which maps the scorer key to the scorer callable.	function	<function gmm...x784129f52610>

best_estimator_: GaussianMixture

GaussianMixture(n_components=2)

GaussianMixture

?Documentation for GaussianMixture

Parameters

	n_components n_components: int, default=1 The number of mixture components.	2
	covariance_type covariance_type: {'full', 'tied', 'diag', 'spherical'}, default='full' String describing the type of covariance parameters to use. Must be one of: - 'full': each component has its own general covariance matrix. - 'tied': all components share the same general covariance matrix. - 'diag': each component has its own diagonal covariance matrix. - 'spherical': each component has its own single variance. For an example of using `covariance_type`, refer to :ref:`sphx_glr_auto_examples_mixture_plot_gmm_selection.py`.	'full'
	tol tol: float, default=1e-3 The convergence threshold. EM iterations will stop when the lower bound average gain is below this threshold.	0.001
	reg_covar reg_covar: float, default=1e-6 Non-negative regularization added to the diagonal of covariance. Allows to assure that the covariance matrices are all positive.	1e-06
	max_iter max_iter: int, default=100 The number of EM iterations to perform.	100
	n_init n_init: int, default=1 The number of initializations to perform. The best results are kept.	1
	init_params init_params: {'kmeans', 'k-means++', 'random', 'random_from_data'}, default='kmeans' The method used to initialize the weights, the means and the precisions. String must be one of: - 'kmeans' : responsibilities are initialized using kmeans. - 'k-means++' : use the k-means++ method to initialize. - 'random' : responsibilities are initialized randomly. - 'random_from_data' : initial means are randomly selected data points. .. versionchanged:: v1.1 `init_params` now accepts 'random_from_data' and 'k-means++' as initialization methods.	'kmeans'
	weights_init weights_init: array-like of shape (n_components, ), default=None The user-provided initial weights. If it is None, weights are initialized using the `init_params` method.	None
	means_init means_init: array-like of shape (n_components, n_features), default=None The user-provided initial means, If it is None, means are initialized using the `init_params` method.	None
	precisions_init precisions_init: array-like, default=None The user-provided initial precisions (inverse of the covariance matrices). If it is None, precisions are initialized using the 'init_params' method. The shape depends on 'covariance_type':: (n_components,) if 'spherical', (n_features, n_features) if 'tied', (n_components, n_features) if 'diag', (n_components, n_features, n_features) if 'full'	None
	random_state random_state: int, RandomState instance or None, default=None Controls the random seed given to the method chosen to initialize the parameters (see `init_params`). In addition, it controls the generation of random samples from the fitted distribution (see the method `sample`). Pass an int for reproducible output across multiple function calls. See :term:`Glossary <random_state>`.	None
	warm_start warm_start: bool, default=False If 'warm_start' is True, the solution of the last fitting is used as initialization for the next call of fit(). This can speed up convergence when fit is called several times on similar problems. In that case, 'n_init' is ignored and only a single initialization occurs upon the first call. See :term:`the Glossary <warm_start>`.	False
	verbose verbose: int, default=0 Enable verbose output. If 1 then it prints the current initialization and each iteration step. If greater than 1 then it prints also the log probability and the time needed for each step.	0
	verbose_interval verbose_interval: int, default=10 Number of iteration done before the next print.	10

Fitted attributes

Name	Type	Value
converged_ converged_: bool True when convergence of the best fit of EM was reached, False otherwise.	bool	True
covariances_ covariances_: array-like The covariance of each mixture component. The shape depends on `covariance_type`:: (n_components,) if 'spherical', (n_features, n_features) if 'tied', (n_components, n_features) if 'diag', (n_components, n_features, n_features) if 'full' For an example of using covariances, refer to :ref:`sphx_glr_auto_examples_mixture_plot_gmm_covariances.py`.	ndarray[float64](2, 2, 2)	[[[ 2.89, 0.68], [ 0.68, 0.17]], [[ 0.47,-0.01], [-0.01, 0.46]]]
lower_bound_ lower_bound_: float Lower bound value on the log-likelihood (of the training data with respect to the model) of the best fit of EM.	float64	-2.233
lower_bounds_ lower_bounds_: array-like of shape (`n_iter_`,) The list of lower bound values on the log-likelihood from each iteration of the best fit of EM.	list	[np.float64(-2...0495624935283), np.float64(-2...4448492771387), np.float64(-2...5645944775564), np.float64(-2...8322149565912), ...]
means_ means_: array-like of shape (n_components, n_features) The mean of each mixture component.	ndarray[float64](2, 2)	[[-0.04,-0. ], [-3.99, 1. ]]
n_features_in_ n_features_in_: int Number of features seen during :term:`fit`. .. versionadded:: 0.24	int	2
n_iter_ n_iter_: int Number of step used by the best fit of EM to reach the convergence.	int	5
precisions_ precisions_: array-like The precision matrices for each component in the mixture. A precision matrix is the inverse of a covariance matrix. A covariance matrix is symmetric positive definite so the mixture of Gaussian can be equivalently parameterized by the precision matrices. Storing the precision matrices instead of the covariance matrices makes it more efficient to compute the log-likelihood of new samples at test time. The shape depends on `covariance_type`:: (n_components,) if 'spherical', (n_features, n_features) if 'tied', (n_components, n_features) if 'diag', (n_components, n_features, n_features) if 'full'	ndarray[float64](2, 2, 2)	[[[ 6.18,-24.84], [-24.84,105.76]], [[ 2.15, 0.04], [ 0.04, 2.19]]]
precisions_cholesky_ precisions_cholesky_: array-like The Cholesky decomposition of the precision matrices of each mixture component. A precision matrix is the inverse of a covariance matrix. A covariance matrix is symmetric positive definite so the mixture of Gaussian can be equivalently parameterized by the precision matrices. Storing the precision matrices instead of the covariance matrices makes it more efficient to compute the log-likelihood of new samples at test time. The shape depends on `covariance_type`:: (n_components,) if 'spherical', (n_features, n_features) if 'tied', (n_components, n_features) if 'diag', (n_components, n_features, n_features) if 'full'	ndarray[float64](2, 2, 2)	[[[ 0.59,-2.42], [ 0. ,10.28]], [[ 1.46, 0.02], [ 0. , 1.48]]]
weights_ weights_: array-like of shape (n_components,) The weights of each mixture components.	ndarray[float64](2,)	[0.5,0.5]

Plot the BIC scores#

To ease the plotting we can create a pandas.DataFrame from the results of the cross-validation done by the grid search. We re-inverse the sign of the BIC score to show the effect of minimizing it.

import pandas as pd

df = pd.DataFrame(grid_search.cv_results_)[
    ["param_n_components", "param_covariance_type", "mean_test_score"]
]
df["mean_test_score"] = -df["mean_test_score"]
df = df.rename(
    columns={
        "param_n_components": "Number of components",
        "param_covariance_type": "Type of covariance",
        "mean_test_score": "BIC score",
    }
)
df.sort_values(by="BIC score").head()

	Number of components	Type of covariance	BIC score
19	2	full	1046.829429
20	3	full	1084.038689
21	4	full	1114.517272
22	5	full	1148.512281
23	6	full	1179.977890

import seaborn as sns

sns.catplot(
    data=df,
    kind="bar",
    x="Number of components",
    y="BIC score",
    hue="Type of covariance",
)
plt.show()

In the present case, the model with 2 components and full covariance (which corresponds to the true generative model) has the lowest BIC score and is therefore selected by the grid search.

Plot the best model#

We plot an ellipse to show each Gaussian component of the selected model. For such purpose, one needs to find the eigenvalues of the covariance matrices as returned by the covariances_ attribute. The shape of such matrices depends on the covariance_type:

"full": (n_components, n_features, n_features)
"tied": (n_features, n_features)
"diag": (n_components, n_features)
"spherical": (n_components,)

from matplotlib.patches import Ellipse
from scipy import linalg

color_iter = sns.color_palette("tab10", 2)[::-1]
Y_ = grid_search.predict(X)

fig, ax = plt.subplots()

for i, (mean, cov, color) in enumerate(
    zip(
        grid_search.best_estimator_.means_,
        grid_search.best_estimator_.covariances_,
        color_iter,
    )
):
    v, w = linalg.eigh(cov)
    if not np.any(Y_ == i):
        continue
    plt.scatter(X[Y_ == i, 0], X[Y_ == i, 1], 0.8, color=color)

    angle = np.arctan2(w[0][1], w[0][0])
    angle = 180.0 * angle / np.pi  # convert to degrees
    v = 2.0 * np.sqrt(2.0) * np.sqrt(v)
    ellipse = Ellipse(mean, v[0], v[1], angle=180.0 + angle, color=color)
    ellipse.set_clip_box(fig.bbox)
    ellipse.set_alpha(0.5)
    ax.add_artist(ellipse)

plt.title(
    f"Selected GMM: {grid_search.best_params_['covariance_type']} model, "
    f"{grid_search.best_params_['n_components']} components"
)
plt.axis("equal")
plt.show()