{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Scaling the regularization parameter for SVCs\n\nThe following example illustrates the effect of scaling the\nregularization parameter when using support vector machines (`svm`) for\nclassification.\nFor SVC classification, we are interested in a risk minimization for the\nequation:\n\n\n\\begin{align}C \\sum_{i=1}^{n} \\mathcal{L} (f(x_i), y_i) + \\Omega (w)\\end{align}\n\nwhere\n\n - $C$ is used to set the amount of regularization (a larger $C$ weights\n   the data-fit term more heavily, i.e. regularizes less)\n - $\\mathcal{L}$ is a `loss` function of our samples\n   and our model parameters.\n - $\\Omega$ is a `penalty` function of our model parameters\n\nIf we consider the loss function to be the individual error per\nsample, then the data-fit term, or the sum of the error for each sample, will\nincrease as we add more samples. The penalization term, however, will not\nincrease.\n\nWhen using, for example, cross-validation to\nset the amount of regularization with `C`, there will be a\ndifferent number of samples between the main problem and the smaller problems\nwithin the folds of the cross-validation.\n\nSince our loss function is dependent on the number of samples, the latter\nwill influence the selected value of `C`.\nThe question that arises is: how do we optimally adjust `C` to\naccount for the different number of training samples?\n\nThe figures below are used to illustrate the effect of scaling our\n`C` to compensate for the change in the number of samples, in the\ncase of using an `l1` penalty, as well as the `l2` penalty.\n\n## l1-penalty case\nIn the `l1` case, theory says that prediction consistency\n(i.e. that under a given hypothesis, the estimator\nlearned predicts as well as a model knowing the true distribution)\nis not possible because of the bias of the `l1` penalty. It does say, however,\nthat model consistency, in terms of finding the right set of non-zero\nparameters as well as their signs, can be achieved by scaling\n`C`.\n\n## l2-penalty case\nThe theory says that in order to achieve prediction consistency, the\npenalty parameter should be kept constant\nas the number of samples grows.\n\n## Simulations\n\nThe two figures below plot the values of `C` on the x-axis and the\ncorresponding cross-validation scores on the y-axis, for several different\nfractions of a generated dataset.\n\nIn the `l1` penalty case, the cross-validation error correlates best with\nthe test error when scaling our `C` with the number of samples, `n`,\nwhich can be seen in the first figure.\n\nFor the `l2` penalty case, the best result comes from the case where `C`\nis not scaled.\n\n**Note:** Two separate datasets are used for the two different plots. The reason\nbehind this is that the `l1` case works better on sparse data, while `l2`\nis better suited to the non-sparse case.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Author: Andreas Mueller\n#         Jaques Grobler\n# License: BSD 3 clause\n\n\nimport numpy as np\nimport matplotlib.pyplot as plt\n\nfrom sklearn.svm import LinearSVC\nfrom sklearn.model_selection import ShuffleSplit\nfrom sklearn.model_selection import GridSearchCV\nfrom sklearn.utils import check_random_state\nfrom sklearn import datasets\n\nrnd = check_random_state(1)\n\n# set up dataset\nn_samples = 100\nn_features = 300\n\n# l1 data (only 5 informative features)\nX_1, y_1 = datasets.make_classification(\n    n_samples=n_samples, n_features=n_features, n_informative=5, random_state=1\n)\n\n# l2 data: non-sparse, but fewer features\ny_2 = np.sign(0.5 - rnd.rand(n_samples))\nX_2 = rnd.randn(n_samples, n_features // 5) + y_2[:, np.newaxis]\nX_2 += 5 * rnd.randn(n_samples, n_features // 5)\n\n# each entry: (classifier, grid of C values, features, labels)\nclf_sets = [\n    (\n        LinearSVC(penalty=\"l1\", loss=\"squared_hinge\", dual=False, tol=1e-3),\n        np.logspace(-2.3, -1.3, 10),\n        X_1,\n        y_1,\n    ),\n    (\n        LinearSVC(penalty=\"l2\", loss=\"squared_hinge\", dual=True),\n        np.logspace(-4.5, -2, 10),\n        X_2,\n        y_2,\n    ),\n]\n\ncolors = [\"navy\", \"cyan\", \"darkorange\"]\nlw = 2\n\nfor clf, cs, X, y in clf_sets:\n    # set up the plot for each classifier\n    fig, axes = plt.subplots(nrows=2, sharey=True, figsize=(9, 10))\n\n    for k, train_size in enumerate(np.linspace(0.3, 0.7, 3)[::-1]):\n        param_grid = dict(C=cs)\n        # To get a nice curve, we need a large number of iterations to\n        # reduce the variance\n        grid = GridSearchCV(\n            clf,\n            refit=False,\n            param_grid=param_grid,\n            cv=ShuffleSplit(\n                train_size=train_size, test_size=0.3, n_splits=250, random_state=1\n            ),\n        )\n        grid.fit(X, y)\n        scores = grid.cv_results_[\"mean_test_score\"]\n\n        # multipliers applied to C before plotting, with their plot labels\n        scales = [\n            (1, \"No scaling\"),\n            ((n_samples * train_size), \"1/n_samples\"),\n        ]\n\n        for ax, (scaler, name) in zip(axes, scales):\n            ax.set_xlabel(\"C\")\n            ax.set_ylabel(\"CV Score\")\n            grid_cs = cs * float(scaler)  # scale the C's\n            ax.semilogx(\n                grid_cs,\n                scores,\n                label=\"fraction %.2f\" % train_size,\n                color=colors[k],\n                lw=lw,\n            )\n            ax.set_title(\n                \"scaling=%s, penalty=%s, loss=%s\" % (name, clf.penalty, clf.loss)\n            )\n\n    plt.legend(loc=\"best\")\nplt.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 0
}