Version 1.5¶

Legend for changelogs

Major Feature: something big that you couldn’t do before.
Feature: something that you couldn’t do before.
Efficiency: an existing feature now may not require as much computation or memory.
Enhancement: a miscellaneous minor improvement.
Fix: something that previously didn’t work as documented (or according to reasonable expectations) should now work.
API Change: you will need to change your code to have the same effect in the future; or a feature will be removed in the future.

Version 1.5.0¶

In Development

Security¶

Fix feature_extraction.text.CountVectorizer and feature_extraction.text.TfidfVectorizer no longer store discarded tokens from the training set in their stop_words_ attribute. This attribute held the tokens that were too frequent (above max_df) as well as the tokens that were too rare (below min_df). This fixes a potential security issue (data leak) if the discarded rare tokens hold sensitive information from the training set without the model developer’s knowledge.

Note: users of those classes are encouraged to either retrain their pipelines with the new scikit-learn version or to manually clear the stop_words_ attribute from previously trained instances of those transformers. This attribute was designed only for model inspection purposes and has no impact on the behavior of the transformers. #28823 by Olivier Grisel.
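The manual clean-up mentioned in the note can be sketched as follows. This is a minimal illustration, not an official migration script: the toy corpus and the token "secret123" are made up, and dropping the attribute through the instance __dict__ is just one way to clear it that works whether or not the attribute exists.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical); "secret123" stands in for a rare, sensitive token.
docs = ["apple banana", "apple cherry", "banana apple secret123"]

# min_df=2 discards tokens that appear in fewer than two documents.
vectorizer = CountVectorizer(min_df=2).fit(docs)

# Releases before 1.5 stored the discarded tokens in stop_words_.
# pop() removes the attribute if present and does nothing otherwise,
# so this line is safe on old and new versions alike.
vectorizer.__dict__.pop("stop_words_", None)

# The attribute is inspection-only, so the transformer keeps working.
X = vectorizer.transform(docs)
vocabulary = sorted(vectorizer.vocabulary_)
```

For pipelines loaded from disk, the same clean-up would have to be applied to each fitted vectorizer before re-serializing the pipeline.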

Changed models¶

Efficiency The subsampling in preprocessing.QuantileTransformer is now more efficient for dense arrays but the fitted quantiles and the results of transform may be slightly different than before (keeping the same statistical properties). #27344 by Xuefeng Xu.
Enhancement decomposition.PCA, decomposition.SparsePCA and decomposition.TruncatedSVD now set the sign of the components_ attribute based on the component values instead of using the transformed data as reference. This change is needed to be able to offer consistent component signs across all PCA solvers, including the new svd_solver="covariance_eigh" option introduced in this release.

Support for Array API¶

Additional estimators and functions have been updated to include support for all Array API compliant inputs.

See Array API support (experimental) for more details.

Functions:

sklearn.metrics.r2_score now supports Array API compliant inputs. #27904 by Eric Lindgren, Franck Charras, Olivier Grisel and Tim Head.

Classes:

linear_model.Ridge now supports the Array API for the svd solver. See Array API support (experimental) for more details. #27800 by Franck Charras, Olivier Grisel and Tim Head.

Support for building with Meson¶

Meson is now supported as a build backend, see Building from source for more details.

#28040 by Loïc Estève

TODO Fill more details before the 1.5 release, when the Meson story has settled down.

Metadata Routing¶

The following models now support metadata routing in one or more of their methods. Refer to the Metadata Routing User Guide for more details.

Feature impute.IterativeImputer now supports metadata routing in its fit method. #28187 by Stefanie Senger.
Feature ensemble.BaggingClassifier and ensemble.BaggingRegressor now support metadata routing. The fit methods now accept **fit_params which are passed to the underlying estimators via their fit methods. #28432 by Adam Li and Benjamin Bossan.
Feature linear_model.RidgeCV and linear_model.RidgeClassifierCV now support metadata routing in their fit method and route metadata to the underlying model_selection.GridSearchCV object or the underlying scorer. #27560 by Omar Salman.
Feature covariance.GraphicalLassoCV now supports metadata routing in its fit method and routes metadata to the CV splitter. #27566 by Omar Salman.
Feature linear_model.RANSACRegressor now supports metadata routing in its fit, score and predict methods and routes metadata to its underlying estimator’s fit, score and predict methods. #28261 by Stefanie Senger.
Feature ensemble.VotingClassifier and ensemble.VotingRegressor now support metadata routing and pass **fit_params to the underlying estimators via their fit methods. #27584 by Stefanie Senger.
Feature pipeline.FeatureUnion now supports metadata routing in its fit and fit_transform methods and routes metadata to the underlying transformers’ fit and fit_transform methods. #28205 by Stefanie Senger.
Fix Fix an issue when resolving default routing requests set via class attributes. #28435 by Adrin Jalali.
Fix Fix an issue when set_{method}_request methods are used as unbound methods, which can happen if one tries to decorate them. #28651 by Adrin Jalali.
Fix Prevent a RecursionError when estimators with the default scoring param (None) route metadata. #28712 by Stefanie Senger.

Changelog¶

`sklearn.calibration`¶

Fix Fixed a regression in calibration.CalibratedClassifierCV where an error was wrongly raised with string targets. #28843 by Jérémie du Boisberranger.

`sklearn.cluster`¶

Fix Create copy of precomputed sparse matrix within the fit method of OPTICS to avoid in-place modification of the sparse matrix. #28491 by Thanh Lam Dang.
Fix cluster.HDBSCAN now supports all metrics supported by sklearn.metrics.pairwise_distances when algorithm="brute" or "auto". #28664 by Manideep Yenugula.

`sklearn.compose`¶

Feature A fitted compose.ColumnTransformer now implements __getitem__ which returns the fitted transformers by name. #27990 by Thomas Fan.
Enhancement compose.TransformedTargetRegressor now raises an error in fit if only inverse_func is provided without func (that would default to identity) being explicitly set as well. #28483 by Stefanie Senger.
Enhancement compose.ColumnTransformer can now expose the “remainder” columns in the fitted transformers_ attribute as column names or boolean masks, rather than column indices. #27657 by Jérôme Dockès.
Fix Fixed a bug in compose.ColumnTransformer with n_jobs > 1, where the intermediate selected columns were passed to the transformers as read-only arrays. #28822 by Jérémie du Boisberranger.

`sklearn.cross_decomposition`¶

API Change Deprecates Y in favor of y in the methods fit, transform and inverse_transform of: cross_decomposition.PLSRegression, cross_decomposition.PLSCanonical, cross_decomposition.CCA, and cross_decomposition.PLSSVD. Y will be removed in version 1.7. #28604 by David Leon.

`sklearn.datasets`¶

Enhancement Adds optional arguments n_retries and delay to functions datasets.fetch_20newsgroups, datasets.fetch_20newsgroups_vectorized, datasets.fetch_california_housing, datasets.fetch_covtype, datasets.fetch_kddcup99, datasets.fetch_lfw_pairs, datasets.fetch_lfw_people, datasets.fetch_olivetti_faces, datasets.fetch_rcv1, and datasets.fetch_species_distributions. By default, the functions will retry up to 3 times in case of network failures. #28160 by Zhehao Liu and Filip Karlo Došilović.

`sklearn.decomposition`¶

Efficiency decomposition.PCA with svd_solver="full" now assigns a contiguous components_ attribute instead of a non-contiguous slice of the singular vectors. When n_components << n_features, this can save some memory and, more importantly, help speed up subsequent calls to the transform method by more than an order of magnitude by leveraging cache locality of BLAS GEMM on contiguous arrays. #27491 by Olivier Grisel.
Enhancement PCA now automatically selects the ARPACK solver for sparse inputs when svd_solver="auto" instead of raising an error. #28498 by Thanh Lam Dang.
Enhancement decomposition.PCA now supports a new solver option named svd_solver="covariance_eigh" which offers an order of magnitude speed-up and reduced memory usage for datasets with a large number of data points and a small number of features (say, n_samples >> 1000 > n_features). The svd_solver="auto" option has been updated to use the new solver automatically for such datasets. This solver also accepts sparse input data. #27491 by Olivier Grisel.
Fix decomposition.PCA fit with svd_solver="arpack", whiten=True and a value for n_components that is larger than the rank of the training set, no longer returns infinite values when transforming hold-out data. #27491 by Olivier Grisel.

`sklearn.dummy`¶

Enhancement dummy.DummyClassifier and dummy.DummyRegressor now have the n_features_in_ and feature_names_in_ attributes after fit. #27937 by Marco vd Boom.

`sklearn.ensemble`¶

Efficiency Improves runtime of predict of ensemble.HistGradientBoostingClassifier by avoiding the call to predict_proba. #27844 by Christian Lorentzen.
Efficiency ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor are now a tiny bit faster by pre-sorting the data before finding the thresholds for binning. #28102 by Christian Lorentzen.

`sklearn.feature_extraction`¶

Efficiency feature_extraction.text.TfidfTransformer is now faster and more memory-efficient by using a NumPy vector instead of a sparse matrix for storing the inverse document frequency. #18843 by Paolo Montesel.
Enhancement feature_extraction.text.TfidfTransformer now preserves the data type of the input matrix if it is np.float64 or np.float32. #28136 by Guillaume Lemaitre.
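The dtype preservation can be checked with a small count matrix. This is a sketch assuming scikit-learn 1.5 or later; the counts are made up.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfTransformer

# Term counts stored as float32 (4 documents, 3 terms).
counts = sparse.csr_matrix(
    np.array(
        [[3, 0, 1], [2, 0, 0], [3, 0, 0], [4, 0, 0]],
        dtype=np.float32,
    )
)

tfidf = TfidfTransformer().fit(counts)
weighted = tfidf.transform(counts)

# With 1.5, the output keeps the float32 dtype of the input.
output_dtype = weighted.dtype
```

Keeping float32 end to end roughly halves the memory needed for the weighted matrix compared with an upcast to float64.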

`sklearn.feature_selection`¶

Enhancement feature_selection.mutual_info_regression and feature_selection.mutual_info_classif now support n_jobs parameter. #28085 by Neto Menoci and Florin Andrei.
Enhancement The cv_results_ attribute of feature_selection.RFECV has a new key, n_features, containing an array with the number of features selected at each step. #28670 by Miguel Silva.

`sklearn.impute`¶

Enhancement impute.SimpleImputer now supports custom strategies by passing a function in place of a strategy name. #28053 by Mark Elliot.

`sklearn.inspection`¶

Fix inspection.DecisionBoundaryDisplay.from_estimator no longer warns about missing feature names when provided a polars.DataFrame. #28718 by Patrick Wang.

`sklearn.linear_model`¶

Enhancement Solver "newton-cg" in linear_model.LogisticRegression and linear_model.LogisticRegressionCV now emits information when verbose is set to positive values. #27526 by Christian Lorentzen.
Fix linear_model.ElasticNet, linear_model.ElasticNetCV, linear_model.Lasso and linear_model.LassoCV now explicitly don’t accept large sparse data formats. #27576 by Stefanie Senger.
API Change linear_model.RidgeCV and linear_model.RidgeClassifierCV will now allow alpha=0 when cv != None, which is consistent with linear_model.Ridge and linear_model.RidgeClassifier. #28425 by Lucy Liu.
Fix linear_model.RidgeCV and RidgeClassifierCV correctly pass sample_weight to the underlying scorer when cv is None. #27560 by Omar Salman.
Fix n_nonzero_coefs_ attribute in linear_model.OrthogonalMatchingPursuit will now always be None when tol is set, as n_nonzero_coefs is ignored in this case. #28557 by Lucy Liu.
API Change Passing average=0 to disable averaging is deprecated in linear_model.PassiveAggressiveClassifier, linear_model.PassiveAggressiveRegressor, linear_model.SGDClassifier, linear_model.SGDRegressor and linear_model.SGDOneClassSVM. Pass average=False instead. #28582 by Jérémie du Boisberranger.

`sklearn.manifold`¶

API Change Deprecates n_iter in favor of max_iter in manifold.TSNE. n_iter will be removed in version 1.7. This makes manifold.TSNE consistent with the rest of the estimators. #28471 by Lucy Liu

`sklearn.metrics`¶

Feature metrics.pairwise_distances accepts calculating pairwise distances for non-numeric arrays as well. This is supported through custom metrics only. #27456 by Venkatachalam N, Kshitij Mathur and Julian Libiseller-Egger.
Efficiency Improve efficiency of functions brier_score_loss, calibration_curve, det_curve, precision_recall_curve, roc_curve when pos_label argument is specified. Also improve efficiency of methods from_estimator and from_predictions in RocCurveDisplay, PrecisionRecallDisplay, DetCurveDisplay, CalibrationDisplay. #28051 by Pierre de Fréminville.
Feature sklearn.metrics.check_scoring now returns a multi-metric scorer when scoring is a dict, set, tuple, or list. #28360 by Thomas Fan.
Fix metrics.classification_report now shows only accuracy and not micro-average when input is a subset of labels. #28399 by Vineet Joshi.
Fix Fix OpenBLAS 0.3.26 dead-lock on Windows in pairwise distances computation. This is likely to affect neighbor-based algorithms. #28692 by Loïc Estève.
API Change metrics.precision_recall_curve deprecated the keyword argument probas_pred in favor of y_score. probas_pred will be removed in version 1.7. #28092 by Adam Li.
API Change metrics.brier_score_loss deprecated the keyword argument y_prob in favor of y_proba. y_prob will be removed in version 1.7. #28092 by Adam Li.
API Change For classifiers and classification metrics, encoding labels as bytes is deprecated and will raise an error in v1.7. #18555 by Kaushik Amar Das.

`sklearn.mixture`¶

Fix The converged_ attribute of mixture.GaussianMixture and mixture.BayesianGaussianMixture now reflects the convergence status of the best fit whereas it was previously True if any of the fits converged. #26837 by Krsto Proroković.

`sklearn.model_selection`¶

Enhancement CV splitters that ignore the groups parameter now raise a warning when groups are passed in to split. #28210 by Thomas Fan.
Fix The cv_results_ attribute of model_selection.GridSearchCV now returns masked arrays of the appropriate NumPy dtype, as opposed to always returning dtype object. #28352 by Marco Gorelli.
Fix sklearn.model_selection.train_test_split now works with Array API inputs. Previously, indexing was not handled correctly, leading to exceptions when using strict implementations of the Array API like CuPy. #28407 by Tim Head.
Enhancement The HTML diagram representation of GridSearchCV, RandomizedSearchCV, HalvingGridSearchCV, and HalvingRandomSearchCV will show the best estimator when refit=True. #28722 by Yao Xiao and Thomas Fan.

`sklearn.multioutput`¶

Enhancement chain_method parameter added to multioutput.ClassifierChain. #27700 by Lucy Liu.

`sklearn.neighbors`¶

Fix Fixes neighbors.NeighborhoodComponentsAnalysis such that get_feature_names_out returns the correct number of feature names. #28306 by Brendan Lu.

`sklearn.pipeline`¶

Feature pipeline.FeatureUnion can now use the verbose_feature_names_out attribute. If True, get_feature_names_out will prefix all feature names with the name of the transformer that generated that feature. If False, get_feature_names_out will not prefix any feature names and will error if feature names are not unique. #25991 by Jiawei Zhang.

`sklearn.preprocessing`¶

Enhancement preprocessing.QuantileTransformer and preprocessing.quantile_transform now support disabling subsampling explicitly. #27636 by Ralph Urlus.

`sklearn.tree`¶

Enhancement Plotting trees in matplotlib via tree.plot_tree now shows a “True/False” label to indicate the direction the samples take at each split condition. #28552 by Adam Li.

`sklearn.utils`¶

API Change utils.IS_PYPY is deprecated and will be removed in version 1.7. #28768 by Jérémie du Boisberranger.
API Change utils.tosequence is deprecated and will be removed in version 1.7. #28763 by Jérémie du Boisberranger.
API Change utils.parallel_backend and utils.register_parallel_backend are deprecated and will be removed in version 1.7. Use joblib.parallel_backend and joblib.register_parallel_backend instead. #28847 by Jérémie du Boisberranger.
API Change type_of_target now raises an informative warning when the target is represented as bytes. For classifiers and classification metrics, encoding labels as bytes is deprecated and will raise an error in v1.7. #18555 by Kaushik Amar Das.
Fix _safe_indexing now works correctly for polars DataFrame when axis=0 and supports indexing polars Series. #28521 by Yao Xiao.
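Migrating away from the deprecated utils.parallel_backend amounts to importing the context manager from joblib instead. A minimal sketch, with an arbitrary estimator, dataset and backend:

```python
from joblib import parallel_backend  # instead of sklearn.utils.parallel_backend
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=100, random_state=0)
forest = RandomForestClassifier(n_estimators=10, random_state=0)

# Code running inside the block picks up the selected joblib backend.
with parallel_backend("threading", n_jobs=2):
    scores = cross_val_score(forest, X, y, cv=3)
```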

Code and documentation contributors

Thanks to everyone who has contributed to the maintenance and improvement of the project since version 1.4, including:

TODO: update at the time of the release.