Version 1.5

Legend for changelogs

  • Major Feature: something big that you couldn’t do before.

  • Feature: something that you couldn’t do before.

  • Efficiency: an existing feature now may not require as much computation or memory.

  • Enhancement: a miscellaneous minor improvement.

  • Fix: something that previously didn’t work as documented – or according to reasonable expectations – should now work.

  • API Change: you will need to change your code to have the same effect in the future; or a feature will be removed in the future.

Version 1.5.0

In Development

Security

  • Fix feature_extraction.text.CountVectorizer and feature_extraction.text.TfidfVectorizer no longer store discarded tokens from the training set in their stop_words_ attribute. This attribute previously held the tokens that were too frequent (above max_df) as well as those that were too rare (below min_df). This fixes a potential security issue (data leak) if the discarded rare tokens hold sensitive information from the training set without the model developer’s knowledge.

    Note: users of those classes are encouraged to either retrain their pipelines with the new scikit-learn version or to manually clear the stop_words_ attribute from previously trained instances of those transformers. This attribute was designed only for model inspection purposes and has no impact on the behavior of the transformers. #28823 by Olivier Grisel.
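The manual clean-up described in the note can be sketched as follows. This is a minimal illustration, not an official migration script: the toy documents are made up, and the hasattr guard is an assumption added so the same snippet runs both on older releases (where stop_words_ is set after fitting) and on newer ones (where it is not).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical); several tokens occur in a single document only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the secret token appears once",
]

# min_df=2 discards every token that occurs in fewer than two documents.
vectorizer = CountVectorizer(min_df=2).fit(docs)
before = vectorizer.transform(docs).toarray()

# Older releases keep the discarded tokens in stop_words_; newer releases
# do not set the attribute at all, hence the guard.
if hasattr(vectorizer, "stop_words_"):
    del vectorizer.stop_words_

# The attribute is only used for inspection, so transform is unaffected.
after = vectorizer.transform(docs).toarray()
print(sorted(vectorizer.vocabulary_), (before == after).all())
```

For a previously pickled pipeline, the same deletion would be applied to each fitted vectorizer after loading and before saving the pipeline again.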

Changed models

  • Efficiency The subsampling in preprocessing.QuantileTransformer is now more efficient for dense arrays but the fitted quantiles and the results of transform may be slightly different than before (keeping the same statistical properties). #27344 by Xuefeng Xu.

  • Enhancement decomposition.PCA, decomposition.SparsePCA and decomposition.TruncatedSVD now set the sign of the components_ attribute based on the component values instead of using the transformed data as reference. This change is needed to be able to offer consistent component signs across all PCA solvers, including the new svd_solver="covariance_eigh" option introduced in this release.

Support for Array API

Additional estimators and functions have been updated to include support for all Array API compliant inputs.

See Array API support (experimental) for more details.

Functions:

Classes:

Support for building with Meson

Meson is now supported as a build backend, see Building from source for more details.

#28040 by Loïc Estève

TODO Fill more details before the 1.5 release, when the Meson story has settled down.

Metadata Routing

The following models now support metadata routing in one or more of their methods. Refer to the Metadata Routing User Guide for more details.

Changelog

sklearn.calibration

sklearn.cluster

sklearn.compose

sklearn.cross_decomposition

sklearn.datasets

sklearn.decomposition

  • Efficiency decomposition.PCA with svd_solver="full" now assigns a contiguous components_ attribute instead of a non-contiguous slice of the singular vectors. When n_components << n_features, this can save some memory and, more importantly, help speed up subsequent calls to the transform method by more than an order of magnitude by leveraging cache locality of BLAS GEMM on contiguous arrays. #27491 by Olivier Grisel.

  • Enhancement decomposition.PCA now automatically selects the ARPACK solver for sparse inputs when svd_solver="auto" instead of raising an error. #28498 by Thanh Lam Dang.

  • Enhancement decomposition.PCA now supports a new solver option named svd_solver="covariance_eigh" which offers an order of magnitude speed-up and reduced memory usage for datasets with a large number of data points and a small number of features (say, n_samples >> 1000 > n_features). The svd_solver="auto" option has been updated to use the new solver automatically for such datasets. This solver also accepts sparse input data. #27491 by Olivier Grisel.

  • Fix decomposition.PCA fitted with svd_solver="arpack", whiten=True and a value for n_components that is larger than the rank of the training set no longer returns infinite values when transforming hold-out data. #27491 by Olivier Grisel.

sklearn.dummy

sklearn.ensemble

sklearn.feature_extraction

sklearn.feature_selection

sklearn.impute

sklearn.inspection

sklearn.linear_model

sklearn.manifold

sklearn.metrics

sklearn.mixture

sklearn.model_selection

sklearn.multioutput

sklearn.neighbors

sklearn.pipeline

  • Feature pipeline.FeatureUnion now accepts the verbose_feature_names_out parameter. If True, get_feature_names_out will prefix all feature names with the name of the transformer that generated that feature. If False, get_feature_names_out will not prefix any feature names and will raise an error if feature names are not unique. #25991 by Jiawei Zhang.

sklearn.preprocessing

sklearn.tree

  • Enhancement Plotting trees in matplotlib via tree.plot_tree now shows a “True/False” label to indicate the direction the samples take at each split condition. #28552 by Adam Li.
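A minimal way to see the plot is sketched below, assuming matplotlib is installed. The iris dataset, the tree depth and the in-memory PNG buffer are arbitrary choices for the example; the “True/False” labels only appear with a scikit-learn release that includes this change.

```python
import io

import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

fig, ax = plt.subplots(figsize=(8, 5))
# plot_tree draws the tree on the given axes and returns the text annotations.
annotations = plot_tree(clf, feature_names=iris.feature_names, filled=True, ax=ax)

# Render to an in-memory PNG; pass a file path instead to write it to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```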

sklearn.utils

Code and documentation contributors

Thanks to everyone who has contributed to the maintenance and improvement of the project since version 1.4, including:

TODO: update at the time of the release.