Version 1.5#

For a short description of the main highlights of the release, please refer to Release Highlights for scikit-learn 1.5.

Legend for changelogs

  • Major Feature something big that you couldn’t do before.

  • Feature something that you couldn’t do before.

  • Efficiency an existing feature now may not require as much computation or memory.

  • Enhancement a miscellaneous minor improvement.

  • Fix something that previously didn’t work as documented – or according to reasonable expectations – should now work.

  • API Change you will need to change your code to have the same effect in the future; or a feature will be removed in the future.

Version 1.5.0#

May 2024

Security#

  • Fix feature_extraction.text.CountVectorizer and feature_extraction.text.TfidfVectorizer no longer store discarded tokens from the training set in their stop_words_ attribute. This attribute held both tokens that were too frequent (above max_df) and tokens that were too rare (below min_df). This fixes a potential security issue (data leak) if the discarded rare tokens hold sensitive information from the training set without the model developer’s knowledge.

    Note: users of those classes are encouraged to either retrain their pipelines with the new scikit-learn version or to manually clear the stop_words_ attribute from previously trained instances of those transformers. This attribute was designed only for model inspection purposes and has no impact on the behavior of the transformers. #28823 by Olivier Grisel.
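
For previously fitted instances that cannot be retrained right away, a minimal sketch of the manual cleanup mentioned above (shown for TfidfVectorizer; the same applies to CountVectorizer):

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ["the quick brown fox", "the lazy dog", "the quick dog"]
    vectorizer = TfidfVectorizer(min_df=2, max_df=0.9).fit(corpus)

    # stop_words_ exists only for model inspection and has no effect on
    # transform(); clearing it removes the cached discarded tokens.
    vectorizer.stop_words_ = None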

Changed models#

  • Efficiency The subsampling in preprocessing.QuantileTransformer is now more efficient for dense arrays, but the fitted quantiles and the results of transform may be slightly different from before (while keeping the same statistical properties). #27344 by Xuefeng Xu.

  • Enhancement decomposition.PCA, decomposition.SparsePCA and decomposition.TruncatedSVD now set the sign of the components_ attribute based on the component values instead of using the transformed data as reference. This change makes it possible to offer consistent component signs across all PCA solvers, including the new svd_solver="covariance_eigh" option introduced in this release (see the sketch below).
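
As a quick sanity check of the new sign convention, a hedged sketch on synthetic data (exact agreement depends on numerical precision):

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.RandomState(0).normal(size=(200, 5))

    # Both solvers should now agree on the components, including their
    # signs, since signs are derived from the component values themselves.
    pca_full = PCA(n_components=3, svd_solver="full").fit(X)
    pca_cov = PCA(n_components=3, svd_solver="covariance_eigh").fit(X)
    print(np.allclose(pca_full.components_, pca_cov.components_, atol=1e-8))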

Changes impacting many modules#

Support for Array API#

Additional estimators and functions have been updated to include support for all Array API compliant inputs.

See Array API support (experimental) for more details.
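
As a hedged sketch of how the experimental dispatch is enabled (assuming the optional array-api-compat dependency is installed; any Array API compliant namespace such as torch or cupy can stand in for numpy here):

    import numpy as np
    import sklearn
    from sklearn.preprocessing import MinMaxScaler

    # Array API dispatch is experimental and off by default.
    with sklearn.config_context(array_api_dispatch=True):
        X = np.asarray([[0.0, 1.0], [2.0, 3.0]])
        # The estimator now dispatches to the input's own array namespace.
        X_scaled = MinMaxScaler().fit_transform(X)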

Functions:

Classes:

Support for building with Meson#

From scikit-learn 1.5 onwards, Meson is the main supported way to build scikit-learn; see Building from source for more details.

Unless we discover a major blocker, setuptools support will be dropped in scikit-learn 1.6. The 1.5.x releases will support building scikit-learn with setuptools.

Meson support for building scikit-learn was added in #28040 by Loïc Estève.

Metadata Routing#

The following models now support metadata routing in one or more of their methods. Refer to the Metadata Routing User Guide for more details.
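
A minimal sketch of the routing mechanism itself (hedged; the estimator shown is only an illustration, and routing remains opt-in via a config flag):

    import numpy as np
    import sklearn
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_validate

    rng = np.random.RandomState(0)
    X = rng.normal(size=(100, 3))
    y = (X[:, 0] > 0).astype(int)
    sample_weight = rng.uniform(0.5, 1.5, size=100)

    with sklearn.config_context(enable_metadata_routing=True):
        # Declare that fit() wants sample_weight, then let cross_validate
        # route it through to the estimator.
        est = LogisticRegression().set_fit_request(sample_weight=True)
        cross_validate(est, X, y, params={"sample_weight": sample_weight})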

Changelog#

sklearn.calibration#

sklearn.cluster#

sklearn.compose#

sklearn.cross_decomposition#

sklearn.datasets#

sklearn.decomposition#

  • Efficiency decomposition.PCA with svd_solver="full" now assigns a contiguous components_ attribute instead of a non-contiguous slice of the singular vectors. When n_components << n_features, this can save some memory and, more importantly, help speed up subsequent calls to the transform method by more than an order of magnitude by leveraging cache locality of BLAS GEMM on contiguous arrays. #27491 by Olivier Grisel. (A combined sketch of the decomposition changes follows this list.)

  • Enhancement decomposition.PCA now automatically selects the ARPACK solver for sparse inputs when svd_solver="auto", instead of raising an error. #28498 by Thanh Lam Dang.

  • Enhancement decomposition.PCA now supports a new solver option named svd_solver="covariance_eigh" which offers an order of magnitude speed-up and reduced memory usage for datasets with a large number of data points and a small number of features (say, n_samples >> 1000 > n_features). The svd_solver="auto" option has been updated to use the new solver automatically for such datasets. This solver also accepts sparse input data. #27491 by Olivier Grisel.

  • Fix decomposition.PCA fit with svd_solver="arpack", whiten=True and a value for n_components that is larger than the rank of the training set, no longer returns infinite values when transforming hold-out data. #27491 by Olivier Grisel.
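
A combined sketch of the decomposition changes above, on synthetic data (solver selection and exact speed-ups will vary):

    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.decomposition import PCA

    rng = np.random.RandomState(0)

    # Tall-and-skinny data (n_samples >> n_features), the regime where the
    # new "covariance_eigh" solver offers the largest speed-up.
    X = rng.normal(size=(100_000, 50))
    pca = PCA(n_components=10, svd_solver="covariance_eigh").fit(X)

    # components_ is now stored as a contiguous array (also with
    # svd_solver="full"), which speeds up subsequent transform() calls.
    print(pca.components_.flags["C_CONTIGUOUS"])

    # Sparse input no longer raises with svd_solver="auto"; a suitable
    # solver (e.g. ARPACK) is picked automatically.
    X_sparse = sparse_random(1_000, 50, density=0.01, random_state=rng)
    pca_sparse = PCA(n_components=5, svd_solver="auto").fit(X_sparse)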

sklearn.dummy#

sklearn.ensemble#

sklearn.feature_extraction#

sklearn.feature_selection#

sklearn.impute#

sklearn.inspection#

sklearn.linear_model#

sklearn.manifold#

sklearn.metrics#

sklearn.mixture#

sklearn.model_selection#

sklearn.multioutput#

sklearn.neighbors#

sklearn.pipeline#

  • Feature pipeline.FeatureUnion now accepts a verbose_feature_names_out parameter. If True, get_feature_names_out will prefix all feature names with the name of the transformer that generated that feature. If False, get_feature_names_out will not prefix any feature names and will raise an error if feature names are not unique. #25991 by Jiawei Zhang.
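
For illustration, a small sketch of the new parameter (the feature names in the comment are indicative):

    import numpy as np
    from sklearn.pipeline import FeatureUnion
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    X = np.array([[0.0, 1.0], [2.0, 3.0]])
    union = FeatureUnion(
        [("scaler", StandardScaler()), ("minmax", MinMaxScaler())],
        verbose_feature_names_out=True,
    ).fit(X)

    # Each output feature is prefixed with the transformer that produced it,
    # e.g. ['scaler__x0', 'scaler__x1', 'minmax__x0', 'minmax__x1'].
    print(union.get_feature_names_out())

    # With verbose_feature_names_out=False, this union would raise instead,
    # since both transformers emit the duplicate names ['x0', 'x1'].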

sklearn.preprocessing#

sklearn.tree#

  • Enhancement Plotting trees in Matplotlib via tree.plot_tree now shows “True”/“False” labels on the branches to indicate which path samples take given the split condition. #28552 by Adam Li.
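
A minimal sketch (any small fitted tree will do):

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, plot_tree

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

    # Edges out of each split are now labeled "True"/"False" to show which
    # branch samples follow when the split condition holds.
    plot_tree(clf)
    plt.show()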

sklearn.utils#

Code and documentation contributors

Thanks to everyone who has contributed to the maintenance and improvement of the project since version 1.4, including:

101AlexMartin, Abdulaziz Aloqeely, Adam J. Stewart, Adam Li, Adarsh Wase, Adrin Jalali, Advik Sinha, Akash Srivastava, Akihiro Kuno, Alan Guedes, Alexis IMBERT, Ana Paula Gomes, Anderson Nelson, Andrei Dzis, Arnaud Capitaine, Arturo Amor, Aswathavicky, Bharat Raghunathan, Brendan Lu, Bruno, Cemlyn, Christian Lorentzen, Christian Veenhuis, Cindy Liang, Claudio Salvatore Arcidiacono, Connor Boyle, Conrad Stevens, crispinlogan, davidleon123, DerWeh, Dipan Banik, Duarte São José, DUONG, Eddie Bergman, Edoardo Abati, Egehan Gunduz, Emad Izadifar, Erich Schubert, Filip Karlo Došilović, Franck Charras, Gael Varoquaux, Gönül Aycı, Guillaume Lemaitre, Gyeongjae Choi, Harmanan Kohli, Hong Xiang Yue, Ian Faust, itsaphel, Ivan Wiryadi, Jack Bowyer, Javier Marin Tur, Jérémie du Boisberranger, Jérôme Dockès, Jiawei Zhang, Joel Nothman, Johanna Bayer, John Cant, John Hopfensperger, jpcars, jpienaar-tuks, Julian Libiseller-Egger, Julien Jerphanion, KanchiMoe, Kaushik Amar Das, keyber, Koustav Ghosh, kraktus, Krsto Proroković, ldwy4, LeoGrin, lihaitao, Linus Sommer, Loic Esteve, Lucy Liu, Lukas Geiger, manasimj, Manuel Labbé, Manuel Morales, Marco Edward Gorelli, Maren Westermann, Marija Vlajic, Mark Elliot, Mateusz Sokół, Mavs, Michael Higgins, Michael Mayer, miguelcsilva, Miki Watanabe, Mohammed Hamdy, myenugula, Nathan Goldbaum, Naziya Mahimkar, Neto, Olivier Grisel, Omar Salman, Patrick Wang, Pierre de Fréminville, Priyash Shah, Puneeth K, Rahil Parikh, raisadz, Raj Pulapakura, Ralf Gommers, Ralph Urlus, Randolf Scholz, Reshama Shaikh, Richard Barnes, Rodrigo Romero, Saad Mahmood, Salim Dohri, Sandip Dutta, SarahRemus, scikit-learn-bot, Shaharyar Choudhry, Shubham, sperret6, Stefanie Senger, Suha Siddiqui, Thanh Lam DANG, thebabush, Thomas J. Fan, Thomas Lazarus, Thomas Li, Tialo, Tim Head, Tuhin Sharma, VarunChaduvula, Vineet Joshi, virchan, Waël Boukhobza, Weyb, Will Dean, Xavier Beltran, Xiao Yuan, Xuefeng Xu, Yao Xiao