Version 1.5#

For a short description of the main highlights of the release, please refer to Release Highlights for scikit-learn 1.5.

Legend for changelogs

  • Major Feature something big that you couldn’t do before.

  • Feature something that you couldn’t do before.

  • Efficiency an existing feature now may not require as much computation or memory.

  • Enhancement a miscellaneous minor improvement.

  • Fix something that previously didn’t work as documented – or according to reasonable expectations – should now work.

  • API Change you will need to change your code to have the same effect in the future; or a feature will be removed in the future.

Version 1.5.0#

May 2024

Security#

  • Fix feature_extraction.text.CountVectorizer and feature_extraction.text.TfidfVectorizer no longer store discarded tokens from the training set in their stop_words_ attribute. This attribute held both tokens that were too frequent (above max_df) and tokens that were too rare (below min_df). This fixes a potential security issue (data leak) if the discarded rare tokens hold sensitive information from the training set without the model developer’s knowledge.

    Note: users of those classes are encouraged to either retrain their pipelines with the new scikit-learn version or to manually clear the stop_words_ attribute from previously trained instances of those transformers. This attribute was designed only for model inspection purposes and has no impact on the behavior of the transformers. #28823 by Olivier Grisel.
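
For previously fitted instances that cannot be retrained right away, a minimal sketch of the manual cleanup mentioned above (shown for TfidfVectorizer; the same applies to CountVectorizer):

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ["the quick brown fox", "the lazy dog", "the quick dog"]
    vectorizer = TfidfVectorizer(min_df=2, max_df=0.9).fit(corpus)

    # stop_words_ exists only for model inspection and has no effect on
    # transform(); clearing it removes the cached discarded tokens.
    vectorizer.stop_words_ = None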

Changed models#

  • Efficiency The subsampling in preprocessing.QuantileTransformer is now more efficient for dense arrays, but the fitted quantiles and the results of transform may be slightly different from before (while keeping the same statistical properties). #27344 by Xuefeng Xu.

  • Enhancement decomposition.PCA, decomposition.SparsePCA and decomposition.TruncatedSVD now set the sign of the components_ attribute based on the component values instead of using the transformed data as reference. This change makes it possible to offer consistent component signs across all PCA solvers, including the new svd_solver="covariance_eigh" option introduced in this release (see the sketch below).
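
As a quick sanity check of the new sign convention, a hedged sketch on synthetic data (exact agreement depends on numerical precision):

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.RandomState(0).normal(size=(200, 5))

    # Both solvers should now agree on the components, including their
    # signs, since signs are derived from the component values themselves.
    pca_full = PCA(n_components=3, svd_solver="full").fit(X)
    pca_cov = PCA(n_components=3, svd_solver="covariance_eigh").fit(X)
    print(np.allclose(pca_full.components_, pca_cov.components_, atol=1e-8))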

Changes impacting many modules#

Support for Array API#

Additional estimators and functions have been updated to include support for all Array API compliant inputs.

See Array API support (experimental) for more details.
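
As a hedged sketch of how the experimental dispatch is enabled (assuming the optional array-api-compat dependency is installed; any Array API compliant namespace such as torch or cupy can stand in for numpy here):

    import numpy as np
    import sklearn
    from sklearn.preprocessing import MinMaxScaler

    # Array API dispatch is experimental and off by default.
    with sklearn.config_context(array_api_dispatch=True):
        X = np.asarray([[0.0, 1.0], [2.0, 3.0]])
        # The estimator now dispatches to the input's own array namespace.
        X_scaled = MinMaxScaler().fit_transform(X)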

Functions:

Classes:

Support for building with Meson#

From scikit-learn 1.5 onwards, Meson is the main supported way to build scikit-learn; see Building from source for more details.

Unless we discover a major blocker, setuptools support will be dropped in scikit-learn 1.6. The 1.5.x releases will support building scikit-learn with setuptools.

Meson support for building scikit-learn was added in #28040 by Loïc Estève.

Metadata Routing#

The following models now support metadata routing in one or more of their methods. Refer to the Metadata Routing User Guide for more details.
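
A minimal sketch of the routing mechanism itself (hedged; the estimator shown is only an illustration, and routing remains opt-in via a config flag):

    import numpy as np
    import sklearn
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_validate

    rng = np.random.RandomState(0)
    X = rng.normal(size=(100, 3))
    y = (X[:, 0] > 0).astype(int)
    sample_weight = rng.uniform(0.5, 1.5, size=100)

    with sklearn.config_context(enable_metadata_routing=True):
        # Declare that fit() wants sample_weight, then let cross_validate
        # route it through to the estimator.
        est = LogisticRegression().set_fit_request(sample_weight=True)
        cross_validate(est, X, y, params={"sample_weight": sample_weight})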

Changelog#

sklearn.calibration#

sklearn.cluster#

sklearn.compose#

sklearn.cross_decomposition#

sklearn.datasets#

sklearn.decomposition#

  • Efficiency decomposition.PCA with svd_solver="full" now assigns a contiguous components_ attribute instead of a non-contiguous slice of the singular vectors. When n_components << n_features, this can save some memory and, more importantly, help speed up subsequent calls to the transform method by more than an order of magnitude by leveraging cache locality of BLAS GEMM on contiguous arrays. #27491 by Olivier Grisel. (A combined sketch of the decomposition changes follows this list.)

  • Enhancement decomposition.PCA now automatically selects the ARPACK solver for sparse inputs when svd_solver="auto", instead of raising an error. #28498 by Thanh Lam Dang.

  • Enhancement decomposition.PCA now supports a new solver option named svd_solver="covariance_eigh" which offers an order of magnitude speed-up and reduced memory usage for datasets with a large number of data points and a small number of features (say, n_samples >> 1000 > n_features). The svd_solver="auto" option has been updated to use the new solver automatically for such datasets. This solver also accepts sparse input data. #27491 by Olivier Grisel.

  • Fix decomposition.PCA fit with svd_solver="arpack", whiten=True and a value for n_components that is larger than the rank of the training set, no longer returns infinite values when transforming hold-out data. #27491 by Olivier Grisel.
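
A combined sketch of the decomposition changes above, on synthetic data (solver selection and exact speed-ups will vary):

    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.decomposition import PCA

    rng = np.random.RandomState(0)

    # Tall-and-skinny data (n_samples >> n_features), the regime where the
    # new "covariance_eigh" solver offers the largest speed-up.
    X = rng.normal(size=(100_000, 50))
    pca = PCA(n_components=10, svd_solver="covariance_eigh").fit(X)

    # components_ is now stored as a contiguous array (also with
    # svd_solver="full"), which speeds up subsequent transform() calls.
    print(pca.components_.flags["C_CONTIGUOUS"])

    # Sparse input no longer raises with svd_solver="auto"; a suitable
    # solver (e.g. ARPACK) is picked automatically.
    X_sparse = sparse_random(1_000, 50, density=0.01, random_state=rng)
    pca_sparse = PCA(n_components=5, svd_solver="auto").fit(X_sparse)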

sklearn.dummy#

sklearn.ensemble#

sklearn.feature_extraction#

sklearn.feature_selection#

sklearn.impute#

sklearn.inspection#

sklearn.linear_model#

sklearn.manifold#

sklearn.metrics#

sklearn.mixture#

sklearn.model_selection#

sklearn.multioutput#

sklearn.neighbors#

sklearn.pipeline#

  • Feature pipeline.FeatureUnion now accepts a verbose_feature_names_out parameter. If True, get_feature_names_out will prefix all feature names with the name of the transformer that generated that feature. If False, get_feature_names_out will not prefix any feature names and will raise an error if feature names are not unique. #25991 by Jiawei Zhang.
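
For illustration, a small sketch of the new parameter (the feature names in the comment are indicative):

    import numpy as np
    from sklearn.pipeline import FeatureUnion
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    X = np.array([[0.0, 1.0], [2.0, 3.0]])
    union = FeatureUnion(
        [("scaler", StandardScaler()), ("minmax", MinMaxScaler())],
        verbose_feature_names_out=True,
    ).fit(X)

    # Each output feature is prefixed with the transformer that produced it,
    # e.g. ['scaler__x0', 'scaler__x1', 'minmax__x0', 'minmax__x1'].
    print(union.get_feature_names_out())

    # With verbose_feature_names_out=False, this union would raise instead,
    # since both transformers emit the duplicate names ['x0', 'x1'].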

sklearn.preprocessing#

sklearn.tree#

  • Enhancement Plotting trees in Matplotlib via tree.plot_tree now shows “True”/“False” labels on the branches to indicate which path samples take given the split condition. #28552 by Adam Li.
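
A minimal sketch (any small fitted tree will do):

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, plot_tree

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

    # Edges out of each split are now labeled "True"/"False" to show which
    # branch samples follow when the split condition holds.
    plot_tree(clf)
    plt.show()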

sklearn.utils#

Code and documentation contributors

Thanks to everyone who has contributed to the maintenance and improvement of the project since version 1.4, including:

101AlexMartin, Abdulaziz Aloqeely, Adam J. Stewart, Adam Li, Adarsh Wase, Adrin Jalali, Advik Sinha, Akash Srivastava, Akihiro Kuno, Alan Guedes, Alexis IMBERT, Ana Paula Gomes, Anderson Nelson, Andrei Dzis, Arnaud Capitaine, Arturo Amor, Aswathavicky, Bharat Raghunathan, Brendan Lu, Bruno, Cemlyn, Christian Lorentzen, Christian Veenhuis, Cindy Liang, Claudio Salvatore Arcidiacono, Connor Boyle, Conrad Stevens, crispinlogan, davidleon123, DerWeh, Dipan Banik, Duarte São José, DUONG, Eddie Bergman, Edoardo Abati, Egehan Gunduz, Emad Izadifar, Erich Schubert, Filip Karlo Došilović, Franck Charras, Gael Varoquaux, Gönül Aycı, Guillaume Lemaitre, Gyeongjae Choi, Harmanan Kohli, Hong Xiang Yue, Ian Faust, itsaphel, Ivan Wiryadi, Jack Bowyer, Javier Marin Tur, Jérémie du Boisberranger, Jérôme Dockès, Jiawei Zhang, Joel Nothman, Johanna Bayer, John Cant, John Hopfensperger, jpcars, jpienaar-tuks, Julian Libiseller-Egger, Julien Jerphanion, KanchiMoe, Kaushik Amar Das, keyber, Koustav Ghosh, kraktus, Krsto Proroković, ldwy4, LeoGrin, lihaitao, Linus Sommer, Loic Esteve, Lucy Liu, Lukas Geiger, manasimj, Manuel Labbé, Manuel Morales, Marco Edward Gorelli, Maren Westermann, Marija Vlajic, Mark Elliot, Mateusz Sokół, Mavs, Michael Higgins, Michael Mayer, miguelcsilva, Miki Watanabe, Mohammed Hamdy, myenugula, Nathan Goldbaum, Naziya Mahimkar, Neto, Olivier Grisel, Omar Salman, Patrick Wang, Pierre de Fréminville, Priyash Shah, Puneeth K, Rahil Parikh, raisadz, Raj Pulapakura, Ralf Gommers, Ralph Urlus, Randolf Scholz, Reshama Shaikh, Richard Barnes, Rodrigo Romero, Saad Mahmood, Salim Dohri, Sandip Dutta, SarahRemus, scikit-learn-bot, Shaharyar Choudhry, Shubham, sperret6, Stefanie Senger, Suha Siddiqui, Thanh Lam DANG, thebabush, Thomas J. Fan, Thomas Lazarus, Thomas Li, Tialo, Tim Head, Tuhin Sharma, VarunChaduvula, Vineet Joshi, virchan, Waël Boukhobza, Weyb, Will Dean, Xavier Beltran, Xiao Yuan, Xuefeng Xu, Yao Xiao