Version 1.3.0
In Development
Legend for changelogs
Major Feature : something big that you couldn’t do before.
Feature : something that you couldn’t do before.
Efficiency : an existing feature now may not require as much computation or memory.
Enhancement : a miscellaneous minor improvement.
Fix : something that previously didn’t work as documented (or according to reasonable expectations) should now work.
API Change : you will need to change your code to have the same effect in the future; or a feature will be removed in the future.
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in random sampling procedures.
Enhancement multiclass.OutputCodeClassifier.predict now uses a more efficient pairwise distance reduction. As a consequence, the tie-breaking strategy is different and thus the predicted labels may be different. #25196 by Guillaume Lemaitre.
Enhancement The fit_transform method of decomposition.DictionaryLearning is more efficient but may produce different results than in previous versions when transform_algorithm is not the same as fit_algorithm and the number of iterations is small. #24871 by Omar Salman.
Fix Small values in the W and H matrices are now treated more consistently during the fit and transform steps of decomposition.NMF and decomposition.MiniBatchNMF, which can produce different results than previous versions. #25438 by Yotam Avidar-Constantini.
Enhancement The sample_weight parameter is now used in centroid initialization for cluster.KMeans, cluster.BisectingKMeans and cluster.MiniBatchKMeans. This change breaks backward compatibility, since numbers generated from the same random seeds will be different. #25752 by Gleb Levitski, Jérémie du Boisberranger, Guillaume Lemaitre.
Fix decomposition.KernelPCA may produce different results through inverse_transform if gamma is None. Now it will be chosen correctly as 1/n_features of the data that it is fitted on, while previously it might be incorrectly chosen as 1/n_features of the data passed to inverse_transform. A new attribute gamma_ is provided for revealing the actual value of gamma used each time the kernel is called. #26337 by Yao Xiao.
Changes impacting all modules
Enhancement The get_feature_names_out method of the following classes now raises a NotFittedError if the instance is not fitted. This ensures the error is consistent in all estimators with the get_feature_names_out method.
The NotFittedError displays an informative message asking to fit the instance with the appropriate arguments.
#25294, #25308, #25291, #25367, #25402, by John Pangas, Rahil Parikh, and Alex Buzenet.
Enhancement Added a multi-threaded Cython routine to compute squared Euclidean distances (sometimes followed by a fused reduction operation) for a pair of datasets consisting of a sparse CSR matrix and a dense NumPy array.
This can improve the performance of the following functions and estimators:
A typical example of this performance improvement happens when passing a sparse CSR matrix to the predict or transform method of estimators that rely on a dense NumPy representation to store their fitted parameters (or the reverse).
For instance, sklearn.neighbors.NearestNeighbors.kneighbors is now up to 2 times faster for this case on commonly available laptops.
Enhancement All estimators that internally rely on OpenMP multi-threading (via Cython) now use a number of threads equal to the number of physical (instead of logical) cores by default. In the past, we observed that using as many threads as logical cores on SMT hosts could sometimes cause severe performance problems depending on the algorithms and the shape of the data. Note that it is still possible to manually adjust the number of threads used by OpenMP as documented in Parallelism.
Experimental / Under Development
Major Feature The base methods related to metadata routing are included in this release. This feature is only available via the enable_metadata_routing feature flag, which can be enabled using sklearn.set_config and sklearn.config_context. For now this feature is mostly useful for third party developers to prepare their code base for metadata routing, and we strongly recommend that they also hide it behind the same feature flag, rather than having it enabled by default. #24027 by Adrin Jalali, Benjamin Bossan, and Omar Salman.
Changelog
sklearn
Feature Added a new option skip_parameter_validation to the function sklearn.set_config and the context manager sklearn.config_context, which allows skipping the validation of the parameters passed to the estimators and public functions. This can be useful to speed up the code but should be used with care because it can lead to unexpected behaviors or raise obscure error messages when setting invalid parameters. #25815 by Jérémie du Boisberranger.
sklearn.base
Feature A __sklearn_clone__ protocol is now available to override the default behavior of base.clone. #24568 by Thomas Fan.
Fix base.TransformerMixin now keeps a namedtuple’s class if transform returns a namedtuple. #26121 by Thomas Fan.
sklearn.calibration
Fix calibration.CalibratedClassifierCV now does not enforce sample alignment on fit_params. #25805 by Adrin Jalali.
sklearn.cluster
API Change The sample_weight parameter of cluster.KMeans.predict and cluster.MiniBatchKMeans.predict is now deprecated and will be removed in v1.5. #25251 by Gleb Levitski.
Enhancement The sample_weight parameter is now used in centroid initialization for cluster.KMeans, cluster.BisectingKMeans and cluster.MiniBatchKMeans. This change breaks backward compatibility, since numbers generated from the same random seeds will be different. #25752 by Gleb Levitski, Jérémie du Boisberranger, Guillaume Lemaitre.
Major Feature Added cluster.HDBSCAN, a modern hierarchical density-based clustering algorithm. Similarly to cluster.OPTICS, it can be seen as a generalization of DBSCAN by allowing for hierarchical instead of flat clustering; however, it varies in its approach from cluster.OPTICS. This algorithm is very robust with respect to its hyperparameters’ values and can be used on a wide variety of data without much, if any, tuning.
This implementation is an adaptation from the original implementation of HDBSCAN in scikit-learn-contrib/hdbscan, by Leland McInnes et al.
sklearn.compose
Fix compose.ColumnTransformer raises an informative error when the individual transformers of ColumnTransformer output pandas dataframes with indexes that are not consistent with each other and the output is configured to be pandas. #26286 by Thomas Fan.
Fix compose.ColumnTransformer correctly sets the output of the remainder when set_output is called. #26323 by Thomas Fan.
sklearn.covariance
API Change Deprecates cov_init in covariance.graphical_lasso in 1.3 since the parameter has no effect. It will be removed in 1.5. #26033 by Genesis Valencia.
API Change Adds costs_ fitted attribute in covariance.GraphicalLasso and covariance.GraphicalLassoCV. #26033 by Genesis Valencia.
API Change Adds covariance parameter in covariance.GraphicalLasso. #26033 by Genesis Valencia.
API Change Adds eps parameter in covariance.GraphicalLasso, covariance.graphical_lasso_path, and covariance.GraphicalLassoCV. #26033 by Genesis Valencia.
Fix Allows alpha=0 in covariance.GraphicalLasso to be consistent with covariance.graphical_lasso. #26033 by Genesis Valencia.
Fix covariance.empirical_covariance now gives an informative error message when input is not appropriate. #26108 by Quentin Barthélemy.
sklearn.datasets
API Change The data_transposed argument of datasets.make_sparse_coded_signal is deprecated and will be removed in v1.5. #25784 by Jérémie du Boisberranger.
Fix datasets.fetch_openml returns improved data types when as_frame=True and parser="liac-arff". #26386 by Thomas Fan.
Fix Following the ARFF specs, only the marker "?" is now considered as a missing value when opening ARFF files fetched using datasets.fetch_openml with the pandas parser. The parameter read_csv_kwargs allows overriding this behaviour. #26551 by Guillaume Lemaitre.
Enhancement The parameters used to open the ARFF file can now be overridden using the parameter read_csv_kwargs in datasets.fetch_openml when using the pandas parser. #26433 by Guillaume Lemaitre.
sklearn.decomposition
Enhancement decomposition.DictionaryLearning now accepts the parameter callback for consistency with the function decomposition.dict_learning. #24871 by Omar Salman.
Efficiency decomposition.MiniBatchDictionaryLearning and decomposition.MiniBatchSparsePCA are now faster for small batch sizes by avoiding duplicate validations. #25490 by Jérémie du Boisberranger.
Fix Small values in the W and H matrices are now treated more consistently during the fit and transform steps of decomposition.NMF and decomposition.MiniBatchNMF, which can produce different results than previous versions. #25438 by Yotam Avidar-Constantini.
sklearn.discriminant_analysis
Enhancement discriminant_analysis.LinearDiscriminantAnalysis now supports PyTorch. See Array API support (experimental) for more details. #25956 by Thomas Fan.
sklearn.ensemble
Feature ensemble.HistGradientBoostingRegressor now supports the Gamma deviance loss via loss="gamma". Using the Gamma deviance as loss function comes in handy for modelling skewed, strictly positive targets. #22409 by Christian Lorentzen.
Feature Compute a custom out-of-bag score by passing a callable to ensemble.RandomForestClassifier, ensemble.RandomForestRegressor, ensemble.ExtraTreesClassifier and ensemble.ExtraTreesRegressor. #25177 by Tim Head.
Feature ensemble.GradientBoostingClassifier now exposes out-of-bag scores via the oob_scores_ or oob_score_ attributes. #24882 by Ashwin Mathur.
Efficiency ensemble.IsolationForest predict time is now faster (typically by a factor of 8 or more). Internally, the estimator now precomputes decision path lengths per tree at fit time. It is therefore not possible to load an estimator trained with scikit-learn 1.2 to make it predict with scikit-learn 1.3: retraining with scikit-learn 1.3 is required. #25186 by Felipe Breve Siola.
Efficiency ensemble.RandomForestClassifier and ensemble.RandomForestRegressor with warm_start=True now only recompute out-of-bag scores when there are actually more n_estimators in subsequent fit calls. #26318 by Joshua Choo Yun Keat.
Enhancement ensemble.BaggingClassifier and ensemble.BaggingRegressor expose the allow_nan tag from the underlying estimator. #25506 by Thomas Fan.
Fix ensemble.RandomForestClassifier.fit sets max_samples = 1 when max_samples is a float and round(n_samples * max_samples) < 1. #25601 by Jan Fidor.
Fix ensemble.IsolationForest.fit no longer warns about missing feature names when called with contamination not "auto" on a pandas dataframe. #25931 by Yao Xiao.
Fix ensemble.HistGradientBoostingRegressor and ensemble.HistGradientBoostingClassifier treat negative values for categorical features consistently as missing values, following LightGBM’s and pandas’ conventions. #25629 by Thomas Fan.
Fix Fix deprecation of base_estimator in ensemble.AdaBoostClassifier and ensemble.AdaBoostRegressor that was introduced in #23819. #26242 by Marko Toplak.
sklearn.exceptions
Feature Added exceptions.InconsistentVersionWarning, which is raised when a scikit-learn estimator is unpickled with a scikit-learn version that is inconsistent with the scikit-learn version the estimator was pickled with. #25297 by Thomas Fan.
sklearn.feature_extraction
API Change feature_extraction.image.PatchExtractor now follows the transformer API of scikit-learn. This class is defined as a stateless transformer, meaning that it is not required to call fit before calling transform. Parameter validation only happens at fit time. #24230 by Guillaume Lemaitre.
sklearn.feature_selection
Enhancement All selectors in sklearn.feature_selection will preserve a DataFrame’s dtype when transformed. #25102 by Thomas Fan.
Fix feature_selection.SequentialFeatureSelector’s cv parameter now supports generators. #25973 by Yao Xiao.
sklearn.impute
Enhancement Added the parameter fill_value to impute.IterativeImputer. #25232 by Thijs van Weezel.
sklearn.inspection
Enhancement Added support for sample_weight in inspection.partial_dependence. This allows for weighted averaging when aggregating for each value of the grid we are making the inspection on. The option is only available when method is set to brute. #25209 by Carlo Lemos.
API Change inspection.partial_dependence returns a utils.Bunch with new key: grid_values. The values key is deprecated in favor of grid_values and the values key will be removed in 1.5. #21809 and #25732 by Thomas Fan.
sklearn.linear_model
Efficiency Avoid data scaling when sample_weight=None and other unnecessary data copies and unexpected dense to sparse data conversion in linear_model.LinearRegression. #26207 by Olivier Grisel.
Enhancement linear_model.SGDClassifier, linear_model.SGDRegressor and linear_model.SGDOneClassSVM now preserve dtype for numpy.float32. #25587 by Omar Salman.
API Change Deprecates n_iter in favor of max_iter in linear_model.BayesianRidge and linear_model.ARDRegression. n_iter will be removed in scikit-learn 1.5. This change makes those estimators consistent with the rest of estimators. #25697 by John Pangas.
Enhancement The n_iter_ attribute has been included in linear_model.ARDRegression to expose the actual number of iterations required to reach the stopping criterion. #25697 by John Pangas.
Fix Use a more robust criterion to detect convergence of linear_model.LogisticRegression(penalty="l1", solver="liblinear") on linearly separable problems. #25214 by Tom Dupre la Tour.
sklearn.metrics
Efficiency The computation of the expected mutual information in metrics.adjusted_mutual_info_score is now faster when the number of unique labels is large and its memory usage is reduced in general. #25713 by Kshitij Mathur, Guillaume Lemaitre, Omar Salman and Jérémie du Boisberranger.
Feature Adds zero_division=np.nan to multiple classification metrics: precision_score, recall_score, f1_score, fbeta_score, precision_recall_fscore_support, classification_report. When zero_division=np.nan and there is a zero division, the metric is undefined and is excluded from averaging. When not used for averages, the value returned is np.nan. #25531 by Marc Torrellas Socastro.
Fix metrics.manhattan_distances now supports readonly sparse datasets. #25432 by Julien Jerphanion.
Fix Fixed classification_report so that empty input will return np.nan. Previously, “macro avg” and “weighted avg” would return e.g. f1-score=np.nan and f1-score=0.0, being inconsistent. Now, they both return np.nan. #25531 by Marc Torrellas Socastro.
Fix metrics.ndcg_score now gives a meaningful error message for input of length 1. #25672 by Lene Preuss and Wei-Chun Chu.
Enhancement metrics.silhouette_samples now accepts a sparse matrix of pairwise distances between samples, or a feature array. #18723 by Sahil Gupta and #24677 by Ashwin Mathur.
Enhancement A new parameter drop_intermediate was added to metrics.precision_recall_curve, metrics.PrecisionRecallDisplay.from_estimator, metrics.PrecisionRecallDisplay.from_predictions, which drops some suboptimal thresholds to create lighter precision-recall curves. #24668 by @dberenbaum.
Enhancement metrics.RocCurveDisplay.from_estimator and metrics.RocCurveDisplay.from_predictions now accept two new keywords, plot_chance_level and chance_level_kw, to plot the baseline chance level. This line is exposed in the chance_level_ attribute. #25987 by Yao Xiao.
Enhancement metrics.PrecisionRecallDisplay.from_estimator and metrics.PrecisionRecallDisplay.from_predictions now accept two new keywords, plot_chance_level and chance_level_kw, to plot the baseline chance level. This line is exposed in the chance_level_ attribute. #26019 by Yao Xiao.
Fix log_loss raises a warning if the values of the parameter y_pred are not normalized, instead of actually normalizing them in the metric. Starting from 1.5 this will raise an error. #25299 by Omar Salman.
API Change The eps parameter of log_loss has been deprecated and will be removed in 1.5. #25299 by Omar Salman.
Feature metrics.average_precision_score now supports the multiclass case. #17388 by Geoffrey Bolmier and #24769 by Ashwin Mathur.
Fix In metrics.roc_curve, use the threshold value np.inf instead of arbitrary max(y_score) + 1. This threshold is associated with the ROC curve point tpr=0 and fpr=0. #26194 by Guillaume Lemaitre.
Fix The 'matching' metric has been removed when using SciPy>=1.9 to be consistent with scipy.spatial.distance, which does not support 'matching' anymore. #26264 by Barata T. Onggo.
sklearn.gaussian_process
Fix gaussian_process.GaussianProcessRegressor has a new argument n_targets, which is used to decide the number of outputs when sampling from the prior distributions. #23099 by Zhehao Liu.
sklearn.model_selection
Enhancement model_selection.cross_validate accepts a new parameter return_indices to return the train-test indices of each cv split. #25659 by Guillaume Lemaitre.
sklearn.multioutput
Fix getattr on multioutput.MultiOutputRegressor.partial_fit and multioutput.MultiOutputClassifier.partial_fit now correctly raise an AttributeError if done before calling fit. #26333 by Adrin Jalali.
sklearn.naive_bayes
Fix naive_bayes.GaussianNB no longer raises a ZeroDivisionError when the provided sample_weight reduces the problem to a single class in fit. #24140 by Jonathan Ohayon and Chiara Marmo.
sklearn.neighbors
Fix Remove support for KulsinskiDistance in neighbors.BallTree. This dissimilarity is not a metric and cannot be supported by the BallTree. #25417 by Guillaume Lemaitre.
Enhancement The performance of neighbors.KNeighborsClassifier.predict and of neighbors.KNeighborsClassifier.predict_proba has been improved when n_neighbors is large and algorithm="brute" with non Euclidean metrics. #24076 by Meekail Zain, Julien Jerphanion.
API Change The support for metrics other than euclidean and manhattan and for callables in neighbors.NearestNeighbors is deprecated and will be removed in version 1.5. #24083 by Valentin Laurent.
sklearn.neural_network
Fix neural_network.MLPRegressor and neural_network.MLPClassifier report the right n_iter_ when warm_start=True. It corresponds to the number of iterations performed on the current call to fit instead of the total number of iterations performed since the initialization of the estimator. #25443 by Marvin Krawutschke.
sklearn.pipeline
Feature pipeline.FeatureUnion can now use indexing notation (e.g. feature_union["scalar"]) to access transformers by name. #25093 by Thomas Fan.
Feature pipeline.FeatureUnion can now access the feature_names_in_ attribute if the X value seen during .fit has a columns attribute and all columns are strings, e.g. when X is a pandas.DataFrame. #25220 by Ian Thompson.
Fix pipeline.Pipeline.fit_transform now raises an AttributeError if the last step of the pipeline does not support fit_transform. #26325 by Adrin Jalali.
sklearn.preprocessing
Major Feature Introduces preprocessing.TargetEncoder, which is a categorical encoding based on the target mean conditioned on the value of the category. #25334 by Thomas Fan.
Enhancement A new parameter sparse_output was added to SplineTransformer, available as of SciPy 1.8. If sparse_output=True, SplineTransformer returns a sparse CSR matrix. #24145 by Christian Lorentzen.
Enhancement Adds a feature_name_combiner parameter to preprocessing.OneHotEncoder. This specifies a custom callable to create feature names to be returned by get_feature_names_out. The callable combines input arguments (input_feature, category) to a string. #22506 by Mario Kostelac.
Enhancement Added support for sample_weight in preprocessing.KBinsDiscretizer. This allows specifying the parameter sample_weight for each sample to be used while fitting. The option is only available when strategy is set to quantile or kmeans. #24935 by Seladus, Guillaume Lemaitre, and Dea María Léon, #25257 by Gleb Levitski.
Feature preprocessing.OrdinalEncoder now supports grouping infrequent categories into a single feature. Grouping infrequent categories is enabled by specifying how to select infrequent categories with min_frequency or max_categories. #25677 by Thomas Fan.
Enhancement Subsampling through the subsample parameter can now be used in preprocessing.KBinsDiscretizer regardless of the strategy used. #26424 by Jérémie du Boisberranger.
API Change The default value of the subsample parameter of preprocessing.KBinsDiscretizer will change from None to 200_000 in version 1.5 when strategy="kmeans" or strategy="uniform". #26424 by Jérémie du Boisberranger.
Fix AdditiveChi2Sampler is now stateless. The sample_interval_ attribute is deprecated and will be removed in 1.5. #25190 by Vincent Maladière.
sklearn.svm
API Change The dual parameter now accepts the auto option for svm.LinearSVC and svm.LinearSVR. #26093 by Gleb Levitski.
sklearn.tree
Major Feature tree.DecisionTreeRegressor and tree.DecisionTreeClassifier support missing values when splitter='best' and criterion is gini, entropy, or log_loss for classification, or squared_error, friedman_mse, or poisson for regression. #23595, #26376 by Thomas Fan.
Enhancement Adds a class_names parameter to tree.export_text. This allows specifying the parameter class_names for each target class in ascending numerical order. #25387 by William M and crispinlogan.
Fix tree.export_graphviz and tree.export_text now accept feature_names and class_names as array-like rather than lists. #26289 by Yao Xiao.
sklearn.utils
API Change estimator_checks.check_transformers_unfitted_stateless has been introduced to ensure stateless transformers don’t raise NotFittedError during transform with no prior call to fit or fit_transform. #25190 by Vincent Maladière.
Enhancement preprocessing.PolynomialFeatures now calculates the number of expanded terms a priori when dealing with sparse csr matrices in order to optimize the choice of dtype for indices and indptr. It can now output csr matrices with np.int32 indices/indptr components when there are few enough elements, and will automatically use np.int64 for sufficiently large matrices. #20524 by niuk-a and #23731 by Meekail Zain.
API Change A FutureWarning is now raised when instantiating a class which inherits from a deprecated base class (i.e. decorated by utils.deprecated) and which overrides the __init__ method. #25733 by Brigitta Sipőcz and Jérémie du Boisberranger.
Fix Fixes utils.validation.check_array to properly convert pandas extension arrays. #25813 and #26106 by Thomas Fan.
Fix utils.validation.check_array now supports pandas DataFrames with extension arrays and object dtypes by returning an ndarray with object dtype. #25814 by Thomas Fan.
sklearn.semi_supervised
Enhancement LabelSpreading.fit and LabelPropagation.fit now accept sparse metrics. #19664 by Kaushik Amar Das.
Miscellaneous
Enhancement Replace obsolete exceptions EnvironmentError, IOError and WindowsError. #26466 by Dimitri Papadopoulos ORfanos.
Code and Documentation Contributors
Thanks to everyone who has contributed to the maintenance and improvement of the project since version 1.2, including:
TODO: update at the time of the release.