MiniBatchKMeans

class sklearn.cluster.MiniBatchKMeans(n_clusters=8, *, init='k-means++', max_iter=100, batch_size=1024, verbose=0, compute_labels=True, random_state=None, tol=0.0, max_no_improvement=10, init_size=None, n_init='auto', reassignment_ratio=0.01)

Mini-Batch K-Means clustering.

See also

KMeans: The classic implementation of the clustering method, based on Lloyd’s algorithm. It consumes the whole set of input data at each iteration.

Notes

See https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf

When there are too few points in the dataset, some centers may be duplicated, so the number of requested clusters and the number of returned clusters will not always match. One solution is to set reassignment_ratio=0, which prevents reassignments of clusters that are too small.

See Compare BIRCH and MiniBatchKMeans for a comparison with BIRCH.

Examples

>>> from sklearn.cluster import MiniBatchKMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 0], [4, 4],
...               [4, 5], [0, 1], [2, 2],
...               [3, 2], [5, 5], [1, -1]])
>>> # manually fit on batches
>>> kmeans = MiniBatchKMeans(n_clusters=2,
...                          random_state=0,
...                          batch_size=6,
...                          n_init="auto")
>>> kmeans = kmeans.partial_fit(X[0:6,:])
>>> kmeans = kmeans.partial_fit(X[6:12,:])
>>> kmeans.cluster_centers_
array([[3.375, 3.  ],
       [0.75 , 0.5 ]])
>>> kmeans.predict([[0, 0], [4, 4]])
array([1, 0], dtype=int32)
>>> # fit on the whole data
>>> kmeans = MiniBatchKMeans(n_clusters=2,
...                          random_state=0,
...                          batch_size=6,
...                          max_iter=10,
...                          n_init="auto").fit(X)
>>> kmeans.cluster_centers_
array([[3.55102041, 2.48979592],
       [1.06896552, 1.        ]])
>>> kmeans.predict([[0, 0], [4, 4]])
array([1, 0], dtype=int32)

For a comparison of Mini-Batch K-Means clustering with other clustering algorithms, see Comparing different clustering algorithms on toy datasets.

fit(X, y=None, sample_weight=None)

Compute the centroids on X by chunking it into mini-batches.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)): Training instances to cluster. Note that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it’s not in CSR format.
y (Ignored): Not used, present here for API consistency by convention.
sample_weight (array-like of shape (n_samples,), default=None): The weights for each observation in X. If None, all observations are assigned equal weight. sample_weight is not used during initialization if init is a callable or a user-provided array.

Added in version 0.20.

Returns:

self (object): Fitted estimator.

fit_predict(X, y=None, sample_weight=None)

Compute cluster centers and predict cluster index for each sample.

Convenience method; equivalent to calling fit(X) followed by predict(X).

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)): Data to cluster and to predict labels for.
y (Ignored): Not used, present here for API consistency by convention.
sample_weight (array-like of shape (n_samples,), default=None): The weights for each observation in X. If None, all observations are assigned equal weight.

Returns:

labels (ndarray of shape (n_samples,)): Index of the cluster each sample belongs to.

fit_transform(X, y=None, sample_weight=None)

Compute clustering and transform X to cluster-distance space.

Equivalent to fit(X).transform(X), but more efficiently implemented.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)): Data to cluster and to transform.
y (Ignored): Not used, present here for API consistency by convention.
sample_weight (array-like of shape (n_samples,), default=None): The weights for each observation in X. If None, all observations are assigned equal weight.

Returns:

X_new (ndarray of shape (n_samples, n_clusters)): X transformed in the new space.

get_feature_names_out(input_features=None)

Get output feature names for transformation.

The output feature names will be prefixed by the lowercased class name. For example, if the transformer outputs 3 features, then the output feature names are: ["class_name0", "class_name1", "class_name2"].

Parameters:

input_features (array-like of str or None, default=None): Only used to validate feature names with the names seen in fit.

Returns:

feature_names_out (ndarray of str objects): Transformed feature names.

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing (MetadataRequest): A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True): If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict): Parameter names mapped to their values.

partial_fit(X, y=None, sample_weight=None)

Update the k-means estimate on a single mini-batch X.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)): Training instances to cluster. Note that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it’s not in CSR format.
y (Ignored): Not used, present here for API consistency by convention.
sample_weight (array-like of shape (n_samples,), default=None): The weights for each observation in X. If None, all observations are assigned equal weight. sample_weight is not used during initialization if init is a callable or a user-provided array.

Returns:

self (object): The updated estimator.

predict(X)

Predict the closest cluster each sample in X belongs to.

In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)): New data to predict.

Returns:

labels (ndarray of shape (n_samples,)): Index of the cluster each sample belongs to.

score(X, y=None, sample_weight=None)

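Return the negative of the K-means objective evaluated on X, that is, minus the (weighted) sum of squared distances from each sample to its closest cluster center.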

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)): New data.
y (Ignored): Not used, present here for API consistency by convention.
sample_weight (array-like of shape (n_samples,), default=None): The weights for each observation in X. If None, all observations are assigned equal weight.

Returns:

score (float): Negative of the K-means objective evaluated on X.

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → MiniBatchKMeans

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED): Metadata routing for sample_weight parameter in fit.

Returns:

self (object): The updated object.

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None): Configure output of transform and fit_transform.

"default": Default output format of a transformer
"pandas": DataFrame output
"polars": Polars output
None: Transform configuration is unchanged

Added in version 1.4: "polars" option was added.

Returns:

self (estimator instance): Estimator instance.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict): Estimator parameters.

Returns:

self (estimator instance): Estimator instance.
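Both forms can be sketched briefly. The step names "scale" and "cluster" are arbitrary labels chosen for this illustration, and StandardScaler is just one possible preceding step:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simple estimator: parameters are set by name.
km = MiniBatchKMeans(n_init="auto")
km.set_params(n_clusters=5, batch_size=256)

# Nested object: <component>__<parameter> reaches into a pipeline step.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("cluster", MiniBatchKMeans(n_init="auto")),
])
pipe.set_params(cluster__n_clusters=3)

print(km.get_params()["n_clusters"], pipe.get_params()["cluster__n_clusters"])
```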

set_partial_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → MiniBatchKMeans

Configure whether metadata should be requested to be passed to the partial_fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to partial_fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to partial_fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED): Metadata routing for sample_weight parameter in partial_fit.

Returns:

self (object): The updated object.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → MiniBatchKMeans

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED): Metadata routing for sample_weight parameter in score.

Returns:

self (object): The updated object.

transform(X)

Transform X to a cluster-distance space.

In the new space, each dimension is the distance to one of the cluster centers. Note that even if X is sparse, the array returned by transform will typically be dense.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)): New data to transform.

Returns:

X_new (ndarray of shape (n_samples, n_clusters)): X transformed in the new space.

Gallery examples

Biclustering documents with the Spectral Co-clustering algorithm

Compare BIRCH and MiniBatchKMeans

Comparing different clustering algorithms on toy datasets

Online learning of a dictionary of parts of faces

Empirical evaluation of the impact of k-means initialization

Comparison of the K-Means and MiniBatchKMeans clustering algorithms

Faces dataset decompositions

Clustering text documents using k-means
