Analysis of the convergence of penalized logistic regression models#

The purpose of this example is three-fold:

  1. Demonstrate registering a ScoringMonitor on the logistic regression step of a pipeline nested inside GridSearchCV.

  2. Show how to plot the metric values collected at each iteration of each fit of the logistic regression model during the grid search and analyze the convergence of the model for each hyperparameter combination.

  3. Show how the monitoring of diverse scoring metrics can inform us about the quality of the model and the trade-off between refinement and calibration.

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

Setup#

Let’s first define the pipeline and the grid search. Here we register a ScoringMonitor callback on the logistic regression model to monitor the scores at each iteration of the L-BFGS solver.

We reuse the same scoring metrics for the grid search itself and use the D² log-loss as the primary metric to select the best hyperparameter combination.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.callback import ProgressBar, ScoringMonitor
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=1000, n_features=100, n_classes=10, n_informative=30, random_state=42
)

scoring_metrics = ["d2_log_loss_score", "accuracy", "average_precision"]
scoring_monitor = ScoringMonitor(scoring=scoring_metrics)
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", max_iter=1000).set_callbacks(scoring_monitor),
)

param_grid = {
    "standardscaler__with_std": [True, False],
    "logisticregression__C": np.geomspace(0.01, 100, 3),
}

grid_search = GridSearchCV(
    model,
    param_grid,
    cv=5,
    scoring=scoring_metrics,
    n_jobs=2,
    error_score="raise",
    refit=scoring_metrics[0],
)

Let’s fit the grid search with the auto-propagating progress bar callback. Feel free to set max_propagation_depth=3 in the ProgressBar constructor to get a more detailed output by displaying the progress bars for the pipeline, the standard scaler and the logistic regression.

grid_search.set_callbacks(ProgressBar()).fit(X, y)
GridSearchCV - fit                                                 ━━━ 100% 0:0…
  GridSearchCV - search #0                                         ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #0  ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #1  ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #2  ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #3  ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #4  ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #5  ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #6  ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #7  ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #8  ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #9  ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #10 ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #11 ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #12 ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #13 ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #14 ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #15 ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #16 ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #17 ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #18 ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #19 ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #20 ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #21 ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #22 ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #23 ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #24 ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #25 ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #26 ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #27 ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #28 ━━━ 100% 0:0…
    GridSearchCV - candidate-split-evaluation | Pipeline - fit #29 ━━━ 100% 0:0…
  GridSearchCV - refit-with-best-params | Pipeline - fit #1        ━━━ 100% 0:0…
GridSearchCV(cv=5, error_score='raise',
             estimator=Pipeline(steps=[('standardscaler', StandardScaler()),
                                       ('logisticregression',
                                        LogisticRegression(max_iter=1000))]),
             n_jobs=2,
             param_grid={'logisticregression__C': array([1.e-02, 1.e+00, 1.e+02]),
                         'standardscaler__with_std': [True, False]},
             refit='d2_log_loss_score',
             scoring=['d2_log_loss_score', 'accuracy', 'average_precision'])


We use a grid search with 3 values for the regularization parameter C and 2 values for the standardization of the features resulting in 6 parameter combinations.

Since we use 5-fold cross-validation (cv=5), we will have 5 fits of the logistic regression model for each parameter combination resulting in 30 fits as subtasks of the “search” fit task.

In addition, the grid search performs a final refit on the full dataset with the best hyperparameter combination found during the grid search. This is visible as the “refit-with-best-params” task in the output above.
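The fit counts quoted above follow from simple arithmetic on the grid. The following is a minimal sketch using only the standard library; the variable names are illustrative, and the three C values are written out by hand instead of being generated with np.geomspace.

```python
from itertools import product

# Same grid as above, with the three C values written out explicitly.
param_grid = {
    "standardscaler__with_std": [True, False],
    "logisticregression__C": [0.01, 1.0, 100.0],
}
n_splits = 5  # corresponds to cv=5

# One candidate per element of the Cartesian product of the parameter values.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_cv_fits = len(candidates) * n_splits
n_total_fits = n_cv_fits + 1  # plus the final refit on the full dataset

print(len(candidates), n_cv_fits, n_total_fits)
```

The 30 cross-validation fits correspond to the "Pipeline - fit #0" to "#29" lines of the progress output, and the extra fit is the refit task.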

Consolidation of the grid search results#

Let’s look at the results of the grid search.

cv_results = pd.DataFrame(grid_search.cv_results_)
cv_results.sort_values(by="rank_test_d2_log_loss_score", ascending=True)
mean_fit_time std_fit_time mean_score_time std_score_time param_logisticregression__C param_standardscaler__with_std params split0_test_d2_log_loss_score split1_test_d2_log_loss_score split2_test_d2_log_loss_score split3_test_d2_log_loss_score split4_test_d2_log_loss_score mean_test_d2_log_loss_score std_test_d2_log_loss_score rank_test_d2_log_loss_score split0_test_accuracy split1_test_accuracy split2_test_accuracy split3_test_accuracy split4_test_accuracy mean_test_accuracy std_test_accuracy rank_test_accuracy split0_test_average_precision split1_test_average_precision split2_test_average_precision split3_test_average_precision split4_test_average_precision mean_test_average_precision std_test_average_precision rank_test_average_precision
1 1.383262 0.133837 0.014983 0.008207 0.01 False {'logisticregression__C': 0.01, 'standardscale... 0.135960 0.167780 0.189062 0.173567 0.186634 0.170601 0.019051 1 0.285 0.330 0.330 0.355 0.305 0.321 0.023958 1 0.319819 0.309708 0.360697 0.323126 0.340503 0.330771 0.017957 1
0 0.236534 0.029069 0.017947 0.007621 0.01 True {'logisticregression__C': 0.01, 'standardscale... 0.125785 0.112817 0.129538 0.127723 0.140616 0.127296 0.008883 2 0.290 0.275 0.315 0.320 0.280 0.296 0.018276 2 0.293645 0.258271 0.320959 0.290856 0.326432 0.298033 0.024429 2
2 0.492171 0.051622 0.024628 0.007723 1.00 True {'logisticregression__C': 1.0, 'standardscaler... -0.262989 -0.348023 -0.293989 -0.300126 -0.229700 -0.286965 0.039509 3 0.270 0.245 0.260 0.230 0.290 0.259 0.020591 5 0.299362 0.258130 0.268917 0.268336 0.276840 0.274317 0.013860 3
3 3.735553 0.036760 0.020381 0.006601 1.00 False {'logisticregression__C': 1.0, 'standardscaler... -0.344269 -0.424780 -0.381219 -0.380261 -0.299665 -0.366039 0.041863 4 0.280 0.250 0.260 0.230 0.280 0.260 0.018974 3 0.296932 0.259109 0.266720 0.268014 0.275371 0.273229 0.012926 4
4 0.627907 0.080965 0.016476 0.002917 100.00 True {'logisticregression__C': 100.0, 'standardscal... -0.455882 -0.566974 -0.524439 -0.507518 -0.417598 -0.494482 0.052390 5 0.285 0.235 0.265 0.230 0.285 0.260 0.023664 3 0.293392 0.253425 0.258071 0.263983 0.271392 0.268052 0.014024 5
5 4.113023 0.580409 0.021889 0.008974 100.00 False {'logisticregression__C': 100.0, 'standardscal... -0.458082 -0.571424 -0.526726 -0.510858 -0.419702 -0.497358 0.053110 6 0.285 0.235 0.255 0.230 0.285 0.258 0.023580 6 0.293282 0.253446 0.257693 0.263782 0.271130 0.267867 0.014034 6


We observe that the best models use strong regularization (small C). Feature standardization does not seem to matter much for the scores but helps reduce the fit times. We also notice that many models have similar accuracy scores but different D² log-loss and average precision scores. D² log-loss and average precision are more sensitive to the quality of the model than accuracy because they depend on the predicted probabilities themselves (their values for the log-loss, their ranking for average precision) rather than only on whether the top predicted class matches the true class.
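To make the difference between accuracy and D² log-loss concrete, here is a small binary sketch with hand-written helpers. It is not scikit-learn's implementation. It assumes that D² log-loss is one minus the ratio of the model's log-loss to the log-loss of a null model that always predicts the class frequencies, and all probabilities are toy values.

```python
import math


def log_loss(y_true, p_class1):
    """Mean negative log-likelihood for binary labels and predicted P(y=1)."""
    return -sum(
        math.log(p if y == 1 else 1.0 - p) for y, p in zip(y_true, p_class1)
    ) / len(y_true)


def d2_log_loss(y_true, p_class1):
    """1 - log_loss(model) / log_loss(null model predicting the class frequency)."""
    base_rate = sum(y_true) / len(y_true)
    null_loss = log_loss(y_true, [base_rate] * len(y_true))
    return 1.0 - log_loss(y_true, p_class1) / null_loss


def accuracy(y_true, p_class1):
    """Fraction of samples whose thresholded prediction matches the label."""
    return sum((p > 0.5) == (y == 1) for y, p in zip(y_true, p_class1)) / len(y_true)


y_true = [0, 1, 0, 1]
p_sharp = [0.10, 0.90, 0.20, 0.80]  # confident and correct
p_flat = [0.45, 0.55, 0.40, 0.60]  # correct but barely better than guessing
p_overconfident = [0.01, 0.99, 0.99, 0.99]  # one very confident mistake

for name, p in [
    ("sharp", p_sharp),
    ("flat", p_flat),
    ("overconfident", p_overconfident),
]:
    print(f"{name:>13}: accuracy={accuracy(y_true, p):.2f}, D2={d2_log_loss(y_true, p):.3f}")
```

The first two sets of predictions have the same accuracy but very different D² values, and the third one has a negative D², meaning it is worse than the null model. This is consistent with the negative test scores of the weakly regularized candidates in the table above.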

Let’s now refine this analysis by looking at the same metrics computed on the training set at each iteration of the L-BFGS solver and for each parameter combination. Note that these are training-set scores recorded during L-BFGS iterations, not the held-out CV scores from cv_results_.

These values are stored in the scoring_monitor callback object:

all_tasks_log = scoring_monitor.get_logs().data_as_pandas
all_tasks_log
task_id_path parent_task_id_path estimator_name task_name task_id sequential_subtasks d2_log_loss_score accuracy average_precision
0 (0, 1, 1) (0, 1) LogisticRegression fit 1 True 0.399415 0.53100 0.506463
1 (0, 0, 0, 1) (0, 0, 0) LogisticRegression fit 1 True 0.306252 0.56000 0.565995
2 (0, 0, 1, 1) (0, 0, 1) LogisticRegression fit 1 True 0.311447 0.57875 0.587007
3 (0, 0, 2, 1) (0, 0, 2) LogisticRegression fit 1 True 0.305960 0.56250 0.564362
4 (0, 0, 3, 1) (0, 0, 3) LogisticRegression fit 1 True 0.309061 0.58875 0.574045
... ... ... ... ... ... ... ... ... ...
2765 (0, 0, 29, 1, 192) (0, 0, 29, 1) LogisticRegression lbfgs-iter 192 True 0.552078 0.66750 0.557115
2766 (0, 0, 29, 1, 193) (0, 0, 29, 1) LogisticRegression lbfgs-iter 193 True 0.552078 0.66750 0.557045
2767 (0, 0, 29, 1, 194) (0, 0, 29, 1) LogisticRegression lbfgs-iter 194 True 0.552078 0.66750 0.557063
2768 (0, 0, 29, 1, 195) (0, 0, 29, 1) LogisticRegression lbfgs-iter 195 True 0.552078 0.66750 0.556962
2769 (0, 0, 29, 1, 196) (0, 0, 29, 1) LogisticRegression lbfgs-iter 196 True 0.552078 0.66750 0.556983

2770 rows × 9 columns



Let’s enrich this log with the candidate parameters and the split index so we can plot the scores for each parameter combination for a particular CV split of interest.

candidate_params = pd.DataFrame(grid_search.cv_results_["params"]).add_prefix("param_")

n_splits = grid_search.n_splits_
lbfgs_log = all_tasks_log.query(
    "estimator_name == 'LogisticRegression' and task_name == 'lbfgs-iter'"
).copy()
# Index 2 in ``task_id_path`` is the ``candidate-split-evaluation`` task id.
# Future versions of scikit-learn will provide a more convenient way to
# retrieve this task id.
lbfgs_log["eval_task_id"] = lbfgs_log["task_id_path"].map(lambda path: path[2])
lbfgs_log["candidate_idx"] = lbfgs_log["eval_task_id"] // n_splits
lbfgs_log["split_idx"] = lbfgs_log["eval_task_id"] % n_splits
lbfgs_log = lbfgs_log.query("split_idx == 0").join(candidate_params, on="candidate_idx")
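The integer division and modulo above decode the evaluation task id into a candidate index and a split index. The sketch below spells this out with a hypothetical helper, under the same assumption as the code above: task ids enumerate all splits of candidate 0 first, then all splits of candidate 1, and so on.

```python
n_splits = 5


def split_eval_task_id(eval_task_id, n_splits):
    """Return (candidate_idx, split_idx), assuming candidate-major task ordering."""
    return divmod(eval_task_id, n_splits)


for eval_task_id in (0, 4, 5, 17, 29):
    candidate_idx, split_idx = split_eval_task_id(eval_task_id, n_splits)
    print(f"task {eval_task_id:>2} -> candidate {candidate_idx}, split {split_idx}")
```

With 6 candidates and 5 splits, the last task id (29) maps to candidate 5 and split 4.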

We then exclude the final refit on the full dataset: its parent_task_id_path starts with (0, 1), whereas the paths of the cross-validation fits start with (0, 0). Note that it is possible to call scoring_monitor.get_logs(include_lineage=True) to retrieve the task name of the ancestor tasks if needed.

cv_lbfgs_log = lbfgs_log[
    lbfgs_log["parent_task_id_path"].map(lambda path: path[1]) == 0
].copy()  # explicit copy: a column is added to this frame below
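The same filtering pattern can be tried on a tiny hand-made frame. The rows below are hypothetical; only the layout of the tuple paths mimics the log shown earlier.

```python
import pandas as pd

# Hypothetical log rows: three iterations from cross-validation fits and one
# iteration from the final refit.
toy_log = pd.DataFrame(
    {
        "parent_task_id_path": [(0, 0, 0, 1), (0, 0, 0, 1), (0, 0, 5, 1), (0, 1, 1)],
        "task_id": [1, 2, 1, 1],
    }
)

# Element 1 of the path is 0 for cross-validation fits and 1 for the refit.
is_cv_fit = toy_log["parent_task_id_path"].map(lambda path: path[1]) == 0
cv_only = toy_log[is_cv_fit].copy()
refit_only = toy_log[~is_cv_fit].copy()

print(len(cv_only), len(refit_only))
```

Taking an explicit copy of each selection lets new columns be added to it later without pandas warning about writing to a slice of the original frame.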

We define labels for plotting purposes and plot each metric separately.

cv_lbfgs_log["param_label"] = cv_lbfgs_log.apply(
    lambda row: (
        f"with_std={row['param_standardscaler__with_std']}, "
        f"C={row['param_logisticregression__C']:.2g}"
    ),
    axis=1,
)

metrics = {
    "d2_log_loss_score": "D² log-loss (train)",
    "accuracy": "Accuracy (train)",
    "average_precision": "Average precision (train)",
}
_, axes = plt.subplots(
    len(metrics),
    1,
    figsize=(8, 2.5 * len(metrics)),
    sharex=True,
    constrained_layout=True,
)
for idx, (metric, ylabel) in enumerate(metrics.items()):
    ax = axes[idx]
    for param_label, group in cv_lbfgs_log.groupby("param_label", sort=False):
        ax.plot(group["task_id"], group[metric], label=param_label)
    ax.set_ylabel(ylabel)
    if idx == 0:
        ax.set_title("CV split 0")
        ax.legend(title="Hyperparameters", fontsize="small")

_ = axes[-1].set_xlabel("L-BFGS iteration")
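[Figure "CV split 0": D² log-loss (train), accuracy (train) and average precision (train) as a function of the L-BFGS iteration, one curve per hyperparameter combination.]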

Analysis of the convergence of the logistic regression models#

D² log-loss convergence#

The D² log-loss scores generally improve monotonically for all models. This is expected because the logistic regression model is fitted by minimizing the (regularized) log-loss computed on the training set.

Accuracy fluctuations#

The accuracy score improves with the number of iterations, albeit with some local fluctuations. This is expected because accuracy is a discontinuous function of the predicted probabilities and is not directly optimized by the model. Instead, the model minimizes the log-loss, which is a smooth surrogate for the zero-one loss: accuracy is therefore related to the training objective but can move in the opposite direction from one iteration to the next.
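A toy binary example shows how the log-loss can decrease between two iterates while accuracy drops. The numbers are made up for illustration and are not taken from the run above.

```python
import math


def mean_log_loss(p_true_class):
    """Mean negative log of the probability assigned to the true class."""
    return -sum(math.log(p) for p in p_true_class) / len(p_true_class)


def accuracy(p_true_class):
    # In a binary problem, a prediction is correct when the true class gets p > 0.5.
    return sum(p > 0.5 for p in p_true_class) / len(p_true_class)


# Probability assigned to the true class of four samples at two successive
# iterates of a hypothetical solver.
iterate_a = [0.51, 0.51, 0.51, 0.10]
iterate_b = [0.49, 0.90, 0.90, 0.40]

for name, probs in [("a", iterate_a), ("b", iterate_b)]:
    print(f"iterate {name}: log-loss={mean_log_loss(probs):.3f}, accuracy={accuracy(probs):.2f}")
```

From iterate a to iterate b, one sample crosses the 0.5 threshold in the wrong direction, which costs a full unit of zero-one loss, while the large probability gains on the other samples dominate the log-loss.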

Regularization and scaling#

We also observe that the least regularized models (larger C values) tend to reach higher D² log-loss scores on the training set, and that models trained on scaled features converge in far fewer iterations.

Furthermore, models trained with high regularization (lower C values) converge to a final D² log-loss value that depends on whether the features are standardized, while this is not the case for models trained with low regularization: there is a strong coupling between the optimal regularization strength and the feature scaling.
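One way to see where this coupling comes from is that an L2 penalty is not invariant to feature rescaling. The sketch below uses a single feature and a hypothetical coefficient, and it assumes a penalty of the form 0.5 * w**2 purely for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


x, w = 2.0, 0.5  # one feature value and its coefficient (toy values)
scale = 10.0  # e.g. the same feature expressed in a different unit

x_rescaled = x * scale
w_rescaled = w / scale  # coefficient that yields the same linear predictor

# The predicted probability is unchanged by the joint rescaling...
same_prediction = math.isclose(sigmoid(w * x), sigmoid(w_rescaled * x_rescaled))

# ...but the penalty paid for that prediction changes by a factor scale**2.
penalty = 0.5 * w**2
penalty_rescaled = 0.5 * w_rescaled**2

print(same_prediction, penalty, penalty_rescaled)
```

When C is large, the penalty term is negligible, so the rescaling barely affects the solution. When C is small, the penalty dominates and the same C corresponds to a different effective regularization for each feature scale.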

Average precision vs log-loss, refinement vs calibration#

Finally, we observe that the average precision measured on the training set can improve quickly in the first iterations and then worsen, even though the D² log-loss continues to improve on the same training data. This is especially noticeable for models trained with low regularization and feature standardization.

This counter-intuitive behavior can be explained as follows. Recall that average precision is a pure ranking metric: it measures the ability of the model to output predicted probabilities that rank the samples of a given class higher than the samples of the other classes, but it does not take into account the calibration of the predicted probabilities. In other words, average precision only evaluates whether the predicted probabilities are well ordered relative to one another and is insensitive to any rank-preserving transformation of their absolute values. The log-loss, on the other hand, is a strictly proper scoring rule that accounts for both the refinement (ranking power) of the model and the calibration of the predicted probabilities.
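The invariance of average precision to rank-preserving transformations can be checked on a toy binary problem. The helpers below are hand-written for illustration, assume that there are no tied scores, and are not scikit-learn's implementations.

```python
import math


def average_precision(y_true, scores):
    """Mean of the precision at the rank of each positive (assumes no tied scores)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos_seen, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            n_pos_seen += 1
            precisions.append(n_pos_seen / rank)
    return sum(precisions) / len(precisions)


def log_loss(y_true, p_class1):
    """Mean negative log-likelihood for binary labels and predicted P(y=1)."""
    return -sum(
        math.log(p if y == 1 else 1.0 - p) for y, p in zip(y_true, p_class1)
    ) / len(y_true)


y_true = [1, 0, 1, 0, 1, 0]
p = [0.9, 0.8, 0.7, 0.3, 0.6, 0.1]
# Cubing is strictly increasing on [0, 1], so the ranking of the samples is kept.
p_distorted = [v**3 for v in p]

ap_before = average_precision(y_true, p)
ap_after = average_precision(y_true, p_distorted)
ll_before = log_loss(y_true, p)
ll_after = log_loss(y_true, p_distorted)

print(f"average precision: {ap_before:.4f} -> {ap_after:.4f}")
print(f"log-loss:          {ll_before:.4f} -> {ll_after:.4f}")
```

The distortion leaves average precision untouched because the ordering of the samples is the same, while the log-loss changes because the absolute probability values are different.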

Therefore, the average precision curves of the weakly regularized models trained on scaled features suggest that the first iterations mostly improve the refinement of the models, temporarily leaving calibration behind. In later iterations, the log-loss score continues to improve but the average precision values worsen, which suggests that the logistic regression model progressively trades off refinement for calibration over the course of the final iterations. This phenomenon has been studied in [1].

It would be interesting to see if this also happens when evaluating the model on a validation set so we could implement early stopping on average precision to explicitly select a model with high refinement on a validation set. This is not yet possible at the time of writing. Giving callbacks access to the validation set is planned for a future version of scikit-learn. Note that the callbacks API is still experimental and may change without the usual deprecation cycle.
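As a purely hypothetical sketch of what such a selection could look like once per-iteration validation scores are available, the function below applies a patience rule to a list of scores. It is not part of the scikit-learn callbacks API, and the scores are made up.

```python
def early_stopping_iteration(scores, patience=2):
    """Return the index of the best score seen before patience runs out.

    Scanning stops once `patience` consecutive iterations have passed
    without improving on the best score (higher is better).
    """
    best_idx = 0
    for idx, score in enumerate(scores):
        if score > scores[best_idx]:
            best_idx = idx
        elif idx - best_idx >= patience:
            break
    return best_idx


# Hypothetical validation average precision recorded at each solver iteration.
validation_ap = [0.30, 0.45, 0.52, 0.51, 0.49, 0.60]

print(early_stopping_iteration(validation_ap, patience=2))
print(early_stopping_iteration(validation_ap, patience=5))
```

A small patience stops at the first local maximum and misses the later improvement, which illustrates the usual trade-off between saving iterations and risking a premature stop.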

References#
