Combine predictors using stacking

Stacking is an ensemble method. In this strategy, the out-of-fold predictions from several base estimators are used to train a meta-model that combines their outputs at inference time. Unlike VotingRegressor, which averages predictions with fixed (optionally user-specified) weights, StackingRegressor learns the combination through its final_estimator.

In this example, we illustrate the use case in which different regressors are stacked together and a final regularized linear regressor is used to output the prediction. We compare the performance of each individual regressor with the stacking strategy. Here, stacking slightly improves the overall performance.

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

Generate data

We use synthetic data generated from a sinusoid plus a linear trend, with heteroscedastic Gaussian noise whose variance grows with X**2. We also introduce a sudden drop for X > 2: a linear model cannot describe such a discontinuity, whereas a tree-based model handles it naturally.

import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=500)
trend = 2.4 * X
seasonal = 3.1 * np.sin(3.2 * X)
drop = 10.0 * (X > 2).astype(float)
sigma = 0.75 + 0.75 * X**2
y = trend + seasonal - drop + rng.normal(loc=0.0, scale=np.sqrt(sigma))

df = pd.DataFrame({"X": X, "y": y})
_ = df.plot.scatter(x="X", y="y")

Stack of predictors on a single data set

It is sometimes not evident which model is more suited for a given task, as different model families can achieve similar performance while exhibiting different strengths and weaknesses. Stacking combines their outputs to exploit these complementary behaviors and can correct systematic errors that no single model can fix on its own. With appropriate regularization in the final_estimator, the StackingRegressor often matches the strongest base model, and can outperform it when base learners’ errors are only partially correlated, allowing the combination to reduce individual bias/variance.

Here, we combine 3 learners (linear and non-linear) and use the default RidgeCV regressor to combine their outputs together.

Note

Although some base learners include preprocessing (such as the StandardScaler), the final_estimator does not need additional preprocessing when using the default passthrough=False, as it receives only the base learners’ predictions. If passthrough=True, final_estimator should be a pipeline with proper preprocessing.

from sklearn.ensemble import HistGradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, SplineTransformer, StandardScaler

linear_ridge = make_pipeline(StandardScaler(), RidgeCV())

spline_ridge = make_pipeline(
    SplineTransformer(n_knots=6, degree=3),
    PolynomialFeatures(interaction_only=True),
    RidgeCV(),
)

hgbt = HistGradientBoostingRegressor(random_state=0)

estimators = [
    ("Linear Ridge", linear_ridge),
    ("Spline Ridge", spline_ridge),
    ("HGBT", hgbt),
]

stacking_regressor = StackingRegressor(estimators=estimators, final_estimator=RidgeCV())
stacking_regressor

StackingRegressor(estimators=[('Linear Ridge',
                               Pipeline(steps=[('standardscaler',
                                                StandardScaler()),
                                               ('ridgecv', RidgeCV())])),
                              ('Spline Ridge',
                               Pipeline(steps=[('splinetransformer',
                                                SplineTransformer(n_knots=6)),
                                               ('polynomialfeatures',
                                                PolynomialFeatures(interaction_only=True)),
                                               ('ridgecv', RidgeCV())])),
                              ('HGBT',
                               HistGradientBoostingRegressor(random_state=0))],
                  final_estimator=RidgeCV())


Measure and plot the results

We can directly plot the predictions. As expected, the sudden drop is correctly captured by the HistGradientBoostingRegressor model (HGBT), while the spline model is smoother and overfits less. The stacked regressor then turns out to be a smoother version of the HGBT.

import matplotlib.pyplot as plt

X = X.reshape(-1, 1)
linear_ridge.fit(X, y)
spline_ridge.fit(X, y)
hgbt.fit(X, y)
stacking_regressor.fit(X, y)

x_plot = np.linspace(X.min() - 0.1, X.max() + 0.1, 500).reshape(-1, 1)
preds = {
    "Linear Ridge": linear_ridge.predict(x_plot),
    "Spline Ridge": spline_ridge.predict(x_plot),
    "HGBT": hgbt.predict(x_plot),
    "Stacking (Ridge final estimator)": stacking_regressor.predict(x_plot),
}

fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex=True, sharey=True)
axes = axes.ravel()
for ax, (name, y_pred) in zip(axes, preds.items()):
    ax.scatter(
        X[:, 0],
        y,
        s=6,
        alpha=0.35,
        linewidths=0,
        label="observed (sample)",
    )

    ax.plot(x_plot.ravel(), y_pred, linewidth=2, alpha=0.9, label=name)
    ax.set_title(name)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.legend(loc="lower right")

plt.suptitle("Base Models Predictions versus Stacked Predictions", y=1)
plt.tight_layout()
plt.show()

[Figure: Base Models Predictions versus Stacked Predictions. Panels: Linear Ridge, Spline Ridge, HGBT, Stacking (Ridge final estimator).]

We can plot the prediction errors as well and evaluate the performance of the individual predictors and the stack of the regressors.

import time

from sklearn.metrics import PredictionErrorDisplay
from sklearn.model_selection import cross_val_predict, cross_validate

fig, axs = plt.subplots(2, 2, figsize=(9, 7))
axs = np.ravel(axs)

for ax, (name, est) in zip(
    axs, estimators + [("Stacking Regressor", stacking_regressor)]
):
    scorers = {r"$R^2$": "r2", "MAE": "neg_mean_absolute_error"}

    start_time = time.time()
    scores = cross_validate(est, X, y, scoring=list(scorers.values()), n_jobs=-1)
    elapsed_time = time.time() - start_time

    y_pred = cross_val_predict(est, X, y, n_jobs=-1)
    scores = {
        key: (
            f"{np.abs(np.mean(scores[f'test_{value}'])):.2f}"
            r" $\pm$ "
            f"{np.std(scores[f'test_{value}']):.2f}"
        )
        for key, value in scorers.items()
    }

    display = PredictionErrorDisplay.from_predictions(
        y_true=y,
        y_pred=y_pred,
        kind="actual_vs_predicted",
        ax=ax,
        scatter_kwargs={"alpha": 0.2, "color": "tab:blue"},
        line_kwargs={"color": "tab:red"},
    )
    ax.set_title(f"{name}\nEvaluation in {elapsed_time:.2f} seconds")

    # Use a distinct variable name so the outer loop's `name` is not shadowed.
    for metric_name, score in scores.items():
        ax.plot([], [], " ", label=f"{metric_name}: {score}")
    ax.legend(loc="upper left")

plt.suptitle("Prediction Errors of Base versus Stacked Predictors", y=1)
plt.tight_layout()
plt.subplots_adjust(top=0.9)
plt.show()

[Figure: Prediction Errors of Base versus Stacked Predictors. Panels and evaluation times: Linear Ridge (0.02 seconds), Spline Ridge (0.03 seconds), HGBT (0.32 seconds), Stacking Regressor (1.48 seconds).]

Even if the scores overlap considerably after cross-validation, the predictions from the stacked regressor are slightly better.

Once fitted, we can inspect the coefficients (or meta-weights) of the trained final_estimator_ (as long as it is a linear model). They reveal how much the individual estimators contribute to the stacked regressor:

stacking_regressor.fit(X, y)
stacking_regressor.final_estimator_.coef_

array([-0.00446216,  0.44878552,  0.54762418])

We see that in this case, the HGBT model dominates, with the spline ridge also contributing meaningfully. The plain linear model does not add useful signal once those two are included; with RidgeCV as the final_estimator, it is not dropped, but receives a small negative weight to correct its residual bias.

If we use LassoCV as the final_estimator, that small, unhelpful contribution is set exactly to zero, yielding a simpler blend of the spline ridge and HGBT models.

from sklearn.linear_model import LassoCV

stacking_regressor = StackingRegressor(estimators=estimators, final_estimator=LassoCV())
stacking_regressor.fit(X, y)
stacking_regressor.final_estimator_.coef_

array([0.        , 0.41148006, 0.56187293])

How to mimic SuperLearner with scikit-learn

The SuperLearner [Polley2010] is a stacking strategy implemented as an R package, but not available off-the-shelf in Python. It is closely related to the StackingRegressor, as both train the meta-model on out-of-fold predictions from the base estimators.

The key difference is that SuperLearner estimates a convex combination of the base predictions (meta-weights that are non-negative and sum to 1) and omits an intercept; by contrast, StackingRegressor uses an unconstrained meta-learner with an intercept by default (and can optionally include raw features via passthrough).

Without an intercept, the meta-weights are directly interpretable as fractional contributions to the final prediction.

from sklearn.linear_model import LinearRegression

linear_reg = LinearRegression(fit_intercept=False, positive=True)
super_learner_like = StackingRegressor(
    estimators=estimators, final_estimator=linear_reg
)
super_learner_like.fit(X, y)
super_learner_like.final_estimator_.coef_

array([2.41599723e-04, 4.48129539e-01, 5.49327451e-01])

The sum of meta-weights in the stacked regressor is close to 1.0, but not exactly one:

super_learner_like.final_estimator_.coef_.sum()

np.float64(0.9976985896404182)

Beyond interpretability, constraining the meta-weights to sum to 1.0, as the SuperLearner does, has the following advantages:

Consensus-preserving: if all base models output the same value at a point, the ensemble returns that same value (no artificial amplification or attenuation).
Translation-equivariant: adding a constant to every base prediction shifts the ensemble by the same constant.
Removes one degree of freedom: this avoids redundancy with a constant term and modestly stabilizes the weights under collinearity.
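The first two properties can be checked numerically. The sketch below uses made-up weights and predictions (not values from this example) to compare weights that sum to one with weights that sum to 0.95.

```python
import numpy as np

convex_weights = np.array([0.0, 0.45, 0.55])  # non-negative, sum to 1
loose_weights = np.array([0.0, 0.45, 0.50])  # non-negative, sum to 0.95

# Consensus: all three base models predict 10.0 at some point.
consensus = np.full(3, 10.0)
consensus_convex = consensus @ convex_weights  # stays at about 10.0
consensus_loose = consensus @ loose_weights  # attenuated to about 9.5

# Translation: add the same constant to every base prediction.
base_predictions = np.array([1.0, 2.0, 4.0])
shift = 3.0
delta_convex = (
    (base_predictions + shift) @ convex_weights
    - base_predictions @ convex_weights
)  # about 3.0, the full shift
delta_loose = (
    (base_predictions + shift) @ loose_weights
    - base_predictions @ loose_weights
)  # about 2.85, only 95% of the shift

print(consensus_convex, consensus_loose)
print(delta_convex, delta_loose)
```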

The cleanest way to enforce the coefficient normalization with scikit-learn is by defining a custom estimator, but doing so is beyond the scope of this tutorial.

Conclusions

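The stacked regressor combines the strengths of the different regressors. However, training the stacked regressor is much more computationally expensive than selecting the best-performing single model: in the run above, its cross-validated evaluation took 1.48 seconds, versus 0.32 seconds or less for each base model.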
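Part of this extra cost comes from how often each base estimator is fitted. The sketch below uses a hypothetical CountingLinearRegression subclass and toy data, both assumptions for illustration, to record every call to fit made while a stack with cv=5 is trained in a single process (the default n_jobs=None).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, RidgeCV

FIT_SIZES = []  # training-set size seen by each call to fit


class CountingLinearRegression(LinearRegression):
    """Hypothetical helper that records every call to ``fit``."""

    def fit(self, X, y, sample_weight=None):
        FIT_SIZES.append(len(y))
        return super().fit(X, y, sample_weight=sample_weight)


# Toy data, unrelated to the dataset used in this example.
X_toy, y_toy = make_regression(
    n_samples=100, n_features=3, noise=1.0, random_state=0
)

counting_stack = StackingRegressor(
    estimators=[("counting", CountingLinearRegression())],
    final_estimator=RidgeCV(),
    cv=5,
)
counting_stack.fit(X_toy, y_toy)

# One fit on the full data plus one fit per cross-validation fold.
print(sorted(FIT_SIZES))
```

Each base estimator is fitted once on the full training set and once per fold to produce the out-of-fold predictions, so the cost of fitting the stack scales roughly with (cv + 1) times the cost of fitting the base estimators, plus the final estimator.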

References

[Polley2010] Polley, E. C. and van der Laan, M. J., Super Learner In Prediction, 2010.

Total running time of the script: (0 minutes 6.766 seconds)
