LatentDirichletAllocation

class sklearn.decomposition.LatentDirichletAllocation(n_components=10, *, doc_topic_prior=None, topic_word_prior=None, learning_method='batch', learning_decay=0.7, learning_offset=10.0, max_iter=10, batch_size=128, evaluate_every=-1, total_samples=1000000.0, perp_tol=0.1, mean_change_tol=0.001, max_doc_update_iter=100, n_jobs=None, verbose=0, random_state=None)

Latent Dirichlet Allocation with online variational Bayes algorithm.

The implementation is based on [1] and [2].

Added in version 0.17.

See also

sklearn.discriminant_analysis.LinearDiscriminantAnalysis: A classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using Bayes’ rule.

References

[1]

“Online Learning for Latent Dirichlet Allocation”, Matthew D. Hoffman, David M. Blei, Francis Bach, 2010. blei-lab/onlineldavb

[2]

“Stochastic Variational Inference”, Matthew D. Hoffman, David M. Blei, Chong Wang, John Paisley, 2013. https://jmlr.org/papers/volume14/hoffman13a/hoffman13a.pdf

Examples

>>> from sklearn.decomposition import LatentDirichletAllocation
>>> from sklearn.datasets import make_multilabel_classification
>>> # This produces a feature matrix of token counts, similar to what
>>> # CountVectorizer would produce on text.
>>> X, _ = make_multilabel_classification(random_state=0)
>>> lda = LatentDirichletAllocation(n_components=5,
...     random_state=0)
>>> lda.fit(X)
LatentDirichletAllocation(...)
>>> # get topics for some given samples:
>>> lda.transform(X[-2:])
array([[0.00360392, 0.25499205, 0.0036211 , 0.64236448, 0.09541846],
       [0.15297572, 0.00362644, 0.44412786, 0.39568399, 0.003586  ]])

fit(X, y=None)

Learn model for the data X with variational Bayes method.

When learning_method is ‘online’, use mini-batch update. Otherwise, use batch update.

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features)
    Document word matrix.
y : Ignored
    Not used, present here for API consistency by convention.

Returns:

self: Fitted estimator.
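
As a minimal sketch of the two update modes described above, the snippet below fits the same synthetic count matrix once with learning_method='batch' and once with learning_method='online'. The data set, batch_size=32, and the variable names are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic non-negative count matrix standing in for a document word matrix.
X, _ = make_multilabel_classification(random_state=0)

# Batch variational Bayes: every iteration uses all documents.
lda_batch = LatentDirichletAllocation(
    n_components=5, learning_method="batch", random_state=0
).fit(X)

# Online variational Bayes: every iteration updates from mini-batches.
lda_online = LatentDirichletAllocation(
    n_components=5, learning_method="online", batch_size=32, random_state=0
).fit(X)

# Both fitted models expose a (n_components, n_features) topic word array.
print(lda_batch.components_.shape, lda_online.components_.shape)
```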

fit_transform(X, y=None, *, normalize=True)

Fit to data, then transform it.

Fits transformer to X and y and returns a transformed version of X.

Parameters:

X : array-like of shape (n_samples, n_features)
    Input samples.
y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
    Target values (None for unsupervised transformations).
normalize : bool, default=True
    Whether to normalize the document topic distribution in transform.

Returns:

X_new : ndarray of shape (n_samples, n_components)
    Transformed array.
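
The following sketch shows fit_transform on a synthetic count matrix, using the default normalize=True so that each row can be read as a distribution over topics. The data set and variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic count matrix standing in for a document word matrix.
X, _ = make_multilabel_classification(random_state=0)

lda = LatentDirichletAllocation(n_components=5, random_state=0)

# Fit the model and return the document topic distribution in one call.
doc_topic = lda.fit_transform(X)

# One row per document, one column per topic.
print(doc_topic.shape)

# With the default normalization, each row sums to 1.
print(np.allclose(doc_topic.sum(axis=1), 1.0))
```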

get_feature_names_out(input_features=None)

Get output feature names for transformation.

The output feature names are prefixed by the lowercased class name. For example, if the transformer outputs 3 features, the output feature names are ["class_name0", "class_name1", "class_name2"].

Parameters:

input_features : array-like of str or None, default=None
    Only used to validate feature names with the names seen in fit.

Returns:

feature_names_out : ndarray of str objects
    Transformed feature names.

get_metadata_routing()

Get metadata routing of this object.

See the User Guide for how the routing mechanism works.

Returns:

routing : MetadataRequest
    A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep : bool, default=True
    If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : dict
    Parameter names mapped to their values.
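
A brief sketch of reading constructor parameters back from an estimator. It does not require a fitted model; the parameter values chosen here are arbitrary.

```python
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=5, random_state=0)

# get_params returns a plain dict keyed by constructor argument name.
params = lda.get_params()
print(params["n_components"], params["learning_method"])
```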

partial_fit(X, y=None)

Online VB with Mini-Batch update.

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features)
    Document word matrix.
y : Ignored
    Not used, present here for API consistency by convention.

Returns:

self: Partially fitted estimator.
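
The sketch below illustrates one way to feed mini-batches to partial_fit, for example when documents arrive in chunks. Splitting into four chunks and setting total_samples to the size of this toy data set are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

X, _ = make_multilabel_classification(random_state=0)

# total_samples is set to the corpus size of this toy example (an assumption).
lda = LatentDirichletAllocation(
    n_components=5, total_samples=X.shape[0], random_state=0
)

# Each call performs an online update using only the rows in that chunk.
for batch in np.array_split(X, 4):
    lda.partial_fit(batch)

# The topic word array is available after the first partial_fit call.
print(lda.components_.shape)
```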

perplexity(X, sub_sampling=False)

Calculate approximate perplexity for data X.

Perplexity is defined as exp(-1. * log-likelihood per word).

Changed in version 0.19: The doc_topic_distr argument has been deprecated and is ignored because the user no longer has access to the unnormalized distribution.

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features)
    Document word matrix.
sub_sampling : bool, default=False
    Whether to perform sub-sampling.

Returns:

score : float
    Perplexity score.
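
As a sketch of the definition above, the snippet compares perplexity(X) with a value computed by hand from score(X). It assumes that "per word" means dividing by the total token count X.sum() and that sub_sampling is left at False; treat the manual formula as an illustration of the definition rather than a specification.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

X, _ = make_multilabel_classification(random_state=0)
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X)

# Approximate log-likelihood bound for the whole matrix.
log_likelihood = lda.score(X)

# Perplexity as reported by the estimator.
perplexity = lda.perplexity(X)

# Manual version: exp(-1 * log-likelihood per word), with the word count
# taken as the sum of all entries of X (an assumption of this sketch).
manual = np.exp(-log_likelihood / X.sum())

print(perplexity, manual)
```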

score(X, y=None)

Calculate approximate log-likelihood as score.

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features)
    Document word matrix.
y : Ignored
    Not used, present here for API consistency by convention.

Returns:

score : float
    Approximate log-likelihood bound, used as the score.

set_output(*, transform=None)

Set output container.

See the user guide for more details, and the example "Introducing the set_output API" for how to use the API.

Parameters:

transform : {"default", "pandas", "polars"}, default=None
    Configure output of transform and fit_transform.

    "default": Default output format of a transformer
    "pandas": DataFrame output
    "polars": Polars output
    None: Transform configuration is unchanged

Added in version 1.4: "polars" option was added.

Returns:

self : estimator instance
    Estimator instance.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params : dict
    Estimator parameters.

Returns:

self : estimator instance
    Estimator instance.
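
The <component>__<parameter> form can be sketched with a two-step pipeline. The step names "counts" and "lda" are hypothetical labels chosen for this example; nothing is fitted here.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Simple estimator: parameters are set directly by name.
lda = LatentDirichletAllocation()
lda.set_params(n_components=3, learning_method="online")

# Nested object: prefix the parameter with the step name and two underscores.
pipe = Pipeline([("counts", CountVectorizer()), ("lda", LatentDirichletAllocation())])
pipe.set_params(lda__n_components=7)

print(lda.n_components, pipe.named_steps["lda"].n_components)
```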

set_transform_request(*, normalize: bool | None | str = '$UNCHANGED$') → LatentDirichletAllocation

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

normalize : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
    Metadata routing for normalize parameter in transform.

Returns:

self : object
    The updated object.

transform(X, *, normalize=True)

Transform data X according to the fitted model.

Changed in version 0.18: doc_topic_distr is now normalized.

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features)
    Document word matrix.
normalize : bool, default=True
    Whether to normalize the document topic distribution.

Returns:

doc_topic_distr : ndarray of shape (n_samples, n_components)
    Document topic distribution for X.
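
A minimal sketch of applying a fitted model to documents it has not seen, then picking the highest-weight topic for each. The 80/20 split and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

X, _ = make_multilabel_classification(random_state=0)

# Hold out the last 20 rows to play the role of unseen documents.
X_train, X_new = X[:80], X[80:]

lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X_train)

# Document topic distribution for the held-out rows (normalized by default).
doc_topic_new = lda.transform(X_new)

# Index of the topic with the largest weight in each document.
dominant_topic = np.argmax(doc_topic_new, axis=1)
print(doc_topic_new.shape, dominant_topic[:5])
```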

Gallery examples

Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation