Comparing randomized search and grid search for hyperparameter estimation#

Compare several strategies to optimize the hyperparameters of a linear SVM trained with stochastic gradient descent. We score the candidates with the one-vs-rest ROC AUC (roc_auc_ovr), which evaluates the ranking quality of the predicted class probabilities rather than the raw accuracy of the hard labels.

We start with an exhaustive grid search (GridSearchCV) that evaluates every combination of a discretized parameter grid. Because the grid is fixed, we are forced to fit every configuration and we may even miss a good combination that falls between two grid points.

We then run a randomized search (RandomizedSearchCV) that samples the parameter space instead of discretizing it. With a smaller budget it explores the space more efficiently and reaches an equivalent solution with fewer iterations.

Finally, we run a successive halving search (HalvingRandomSearchCV) that samples even more candidates but keeps the cost low by evaluating them on a small number of training samples first and discarding the unpromising ones early. It combines a large number of candidates with a reduced fit and predict time in the early iterations.

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

Loading the dataset#

We use the digits dataset and restrict ourselves to the first three classes to keep the problem small and the search fast.

from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True, n_class=3)

Defining the estimator#

We optimize a linear SVM trained with stochastic gradient descent (SGDClassifier). We use the "modified_huber" loss, a smoothed variant of the hinge loss: it keeps the large-margin behavior of a linear SVM while also exposing predict_proba. The probability estimates are required to score the candidates with the one-vs-rest ROC AUC, which the plain hinge loss could not provide.

from sklearn.linear_model import SGDClassifier

linear_svm = SGDClassifier(
    loss="modified_huber",
    penalty="elasticnet",
    fit_intercept=True,
    max_iter=5_000,
)

The report helper below prints the best parameter settings found by a given search so that we can compare the different strategies.

Successive halving stores in cv_results_ the candidates evaluated at every iteration, on increasing amounts of resources. The early iterations are scored on a small subset of the data, where perfect but unreliable scores are common. To keep the comparison fair, when an "iter" column is present we only keep the last iteration, i.e. the surviving candidates trained on the full set of resources, and we rank candidates by their mean validation score.

import pandas as pd


def report(results, n_top=3):
    """Report the top parameters for each search strategy."""
    results = pd.DataFrame(results)
    if "iter" in results:
        results = results[results["iter"] == results["iter"].max()]

    for rank, (_, candidate) in enumerate(
        results.nlargest(n_top, "mean_test_score").iterrows(), start=1
    ):
        print(
            f"Model with rank: {rank}\n"
            f"Mean validation score: "
            f"{candidate['mean_test_score']:.3f} "
            f"(std: {candidate['std_test_score']:.3f})\n"
            f"Parameters: {candidate['params']}\n"
        )

Conclusion#

Running the three searches on the same problem highlights their trade-offs:

  • Grid search evaluates all 60 combinations of the grid and reaches a best mean validation ROC AUC of essentially 1.0. It is exhaustive, but its cost grows with the resolution of the grid and a finer grid would be needed to refine the continuous parameters, making it the slowest of the three.

  • Randomized search reaches an essentially equivalent score while sampling only 30 candidates, i.e. half the budget, and is therefore markedly faster. Drawing the continuous parameters from distributions is usually a better use of a limited budget than refining a grid.

  • Successive halving screens the 60 candidates for a run time comparable to the randomized search by spending most of its resources only on the most promising candidates. It explores more candidates than the randomized search without paying the full cost of the grid search.

A word of caution when reading the halving output: the cv_results_ of HalvingRandomSearchCV aggregates every iteration, including the first ones evaluated on very few samples where perfect but unreliable scores are common. This is why the report helper above keeps only the last iteration for the halving search, so that the reported scores are computed on the full set of resources and remain comparable to the grid and randomized searches. More generally, rely on best_params_ – which the halving search selects among the last-iteration candidates – and confirm the chosen model on a held-out test set.

Total running time of the script: (4 minutes 9.877 seconds)

Related examples

Successive Halving Iterations

Successive Halving Iterations

Analysis of the convergence of penalized logistic regression models

Analysis of the convergence of penalized logistic regression models

Custom refit strategy of a grid search with cross-validation

Custom refit strategy of a grid search with cross-validation

Comparison between grid search and successive halving

Comparison between grid search and successive halving

Gallery generated by Sphinx-Gallery