{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Importance of Feature Scaling\n\nFeature scaling through standardization, also called Z-score normalization, is\nan important preprocessing step for many machine learning algorithms. It\ninvolves rescaling each feature such that it has a standard deviation of 1 and a\nmean of 0.\n\nEven if tree based models are (almost) not affected by scaling, many other\nalgorithms require features to be normalized, often for different reasons: to\nease the convergence (such as a non-penalized logistic regression), to create a\ncompletely different model fit compared to the fit with unscaled data (such as\nKNeighbors models). The latter is demoed on the first part of the present\nexample.\n\nOn the second part of the example we show how Principal Component Analysis (PCA)\nis impacted by normalization of features. To illustrate this, we compare the\nprincipal components found using :class:`~sklearn.decomposition.PCA` on unscaled\ndata with those obatined when using a\n:class:`~sklearn.preprocessing.StandardScaler` to scale data first.\n\nIn the last part of the example we show the effect of the normalization on the\naccuracy of a model trained on PCA-reduced data.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause"
]
},
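{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal illustration (not part of the original example), the snippet below\napplies the z-score formula by hand to a toy feature and checks that the result\nhas a mean of 0 and a standard deviation of 1; this is the transformation that\n:class:`~sklearn.preprocessing.StandardScaler` applies to each column, using\nstatistics learned from the training data.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Minimal sketch of z-score standardization on a toy feature.\nimport numpy as np\n\nx = np.array([1.0, 5.0, 9.0, 13.0])\nz = (x - x.mean()) / x.std()\nprint(f\"mean: {z.mean():.2f}, std: {z.std():.2f}\")"
]
},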
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load and prepare data\n\nThe dataset used is the `wine_dataset` available at UCI. This dataset has\ncontinuous features that are heterogeneous in scale due to differing\nproperties that they measure (e.g. alcohol content and malic acid).\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.datasets import load_wine\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import StandardScaler\n\nX, y = load_wine(return_X_y=True, as_frame=True)\nscaler = StandardScaler().set_output(transform=\"pandas\")\n\nX_train, X_test, y_train, y_test = train_test_split(\n X, y, test_size=0.30, random_state=42\n)\nscaled_X_train = scaler.fit_transform(X_train)"
]
},
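{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the heterogeneity of scales concrete (this check is illustrative and\nnot part of the original example), we can look at a few summary statistics of\nthe raw training features; notice for instance how \"proline\" and \"hue\" differ\nby roughly three orders of magnitude.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative check: the raw features live on very different scales, e.g.\n# \"proline\" is in the hundreds while \"hue\" is around 1.\nX_train[[\"alcohol\", \"malic_acid\", \"hue\", \"proline\"]].describe().loc[\n    [\"mean\", \"std\", \"min\", \"max\"]\n]"
]
},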
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n## Effect of rescaling on a k-neighbors models\n\nFor the sake of visualizing the decision boundary of a\n:class:`~sklearn.neighbors.KNeighborsClassifier`, in this section we select a\nsubset of 2 features that have values with different orders of magnitude.\n\nKeep in mind that using a subset of the features to train the model may likely\nleave out feature with high predictive impact, resulting in a decision\nboundary that is much worse in comparison to a model trained on the full set\nof features.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n\nfrom sklearn.inspection import DecisionBoundaryDisplay\nfrom sklearn.neighbors import KNeighborsClassifier\n\nX_plot = X[[\"proline\", \"hue\"]]\nX_plot_scaled = scaler.fit_transform(X_plot)\nclf = KNeighborsClassifier(n_neighbors=20)\n\n\ndef fit_and_plot_model(X_plot, y, clf, ax):\n clf.fit(X_plot, y)\n disp = DecisionBoundaryDisplay.from_estimator(\n clf,\n X_plot,\n response_method=\"predict\",\n alpha=0.5,\n ax=ax,\n )\n disp.ax_.scatter(X_plot[\"proline\"], X_plot[\"hue\"], c=y, s=20, edgecolor=\"k\")\n disp.ax_.set_xlim((X_plot[\"proline\"].min(), X_plot[\"proline\"].max()))\n disp.ax_.set_ylim((X_plot[\"hue\"].min(), X_plot[\"hue\"].max()))\n return disp.ax_\n\n\nfig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 6))\n\nfit_and_plot_model(X_plot, y, clf, ax1)\nax1.set_title(\"KNN without scaling\")\n\nfit_and_plot_model(X_plot_scaled, y, clf, ax2)\nax2.set_xlabel(\"scaled proline\")\nax2.set_ylabel(\"scaled hue\")\n_ = ax2.set_title(\"KNN with scaling\")"
]
},
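{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before interpreting the decision boundaries, a quick numeric check (illustrative\nand not part of the original example) shows why distances computed on the raw\nfeatures are dominated by \"proline\": for the first two samples, the difference\nalong that axis is orders of magnitude larger than the difference in \"hue\".\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative check: on the raw features, the Euclidean distance between two\n# samples is essentially the \"proline\" difference; \"hue\" barely contributes.\nimport numpy as np\n\na, b = X_plot.iloc[0], X_plot.iloc[1]\nprint(f\"proline difference: {abs(a['proline'] - b['proline']):.2f}\")\nprint(f\"hue difference: {abs(a['hue'] - b['hue']):.2f}\")\nprint(f\"Euclidean distance: {np.linalg.norm(a - b):.2f}\")"
]
},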
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here the decision boundary shows that fitting scaled or non-scaled data lead\nto completely different models. The reason is that the variable \"proline\" has\nvalues which vary between 0 and 1,000; whereas the variable \"hue\" varies\nbetween 1 and 10. Because of this, distances between samples are mostly\nimpacted by differences in values of \"proline\", while values of the \"hue\" will\nbe comparatively ignored. If one uses\n:class:`~sklearn.preprocessing.StandardScaler` to normalize this database,\nboth scaled values lay approximately between -3 and 3 and the neighbors\nstructure will be impacted more or less equivalently by both variables.\n\n## Effect of rescaling on a PCA dimensional reduction\n\nDimensional reduction using :class:`~sklearn.decomposition.PCA` consists of\nfinding the features that maximize the variance. If one feature varies more\nthan the others only because of their respective scales,\n:class:`~sklearn.decomposition.PCA` would determine that such feature\ndominates the direction of the principal components.\n\nWe can inspect the first principal components using all the original features:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n\nfrom sklearn.decomposition import PCA\n\npca = PCA(n_components=2).fit(X_train)\nscaled_pca = PCA(n_components=2).fit(scaled_X_train)\nX_train_transformed = pca.transform(X_train)\nX_train_std_transformed = scaled_pca.transform(scaled_X_train)\n\nfirst_pca_component = pd.DataFrame(\n pca.components_[0], index=X.columns, columns=[\"without scaling\"]\n)\nfirst_pca_component[\"with scaling\"] = scaled_pca.components_[0]\nfirst_pca_component.plot.bar(\n title=\"Weights of the first principal component\", figsize=(6, 8)\n)\n\n_ = plt.tight_layout()"
]
},
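{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a complementary check (illustrative and not part of the original example),\nthe raw feature variances explain the bar plot above: PCA on unscaled data is\nessentially driven by the single feature with the largest variance.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative check: \"proline\" has by far the largest raw variance, so it\n# dominates the first principal component when the data is not scaled.\nX_train.var().sort_values(ascending=False).head()"
]
},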
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Indeed we find that the \"proline\" feature dominates the direction of the first\nprincipal component without scaling, being about two orders of magnitude above\nthe other features. This is contrasted when observing the first principal\ncomponent for the scaled version of the data, where the orders of magnitude\nare roughly the same across all the features.\n\nWe can visualize the distribution of the principal components in both cases:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))\n\ntarget_classes = range(0, 3)\ncolors = (\"blue\", \"red\", \"green\")\nmarkers = (\"^\", \"s\", \"o\")\n\nfor target_class, color, marker in zip(target_classes, colors, markers):\n ax1.scatter(\n x=X_train_transformed[y_train == target_class, 0],\n y=X_train_transformed[y_train == target_class, 1],\n color=color,\n label=f\"class {target_class}\",\n alpha=0.5,\n marker=marker,\n )\n\n ax2.scatter(\n x=X_train_std_transformed[y_train == target_class, 0],\n y=X_train_std_transformed[y_train == target_class, 1],\n color=color,\n label=f\"class {target_class}\",\n alpha=0.5,\n marker=marker,\n )\n\nax1.set_title(\"Unscaled training dataset after PCA\")\nax2.set_title(\"Standardized training dataset after PCA\")\n\nfor ax in (ax1, ax2):\n ax.set_xlabel(\"1st principal component\")\n ax.set_ylabel(\"2nd principal component\")\n ax.legend(loc=\"upper right\")\n ax.grid()\n\n_ = plt.tight_layout()"
]
},
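{
"cell_type": "markdown",
"metadata": {},
"source": [
"A complementary way to quantify the difference (illustrative and not part of\nthe original example) is to compare the explained variance ratios of the two\nfitted PCA models: without scaling, the first component captures almost all of\nthe variance simply because it tracks \"proline\".\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative check: fraction of the variance explained by each of the two\n# retained components, with and without scaling.\nprint(f\"Unscaled PCA explained variance ratio: {pca.explained_variance_ratio_}\")\nprint(f\"Scaled PCA explained variance ratio: {scaled_pca.explained_variance_ratio_}\")"
]
},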
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the plot above we observe that scaling the features before reducing the\ndimensionality results in components with the same order of magnitude. In this\ncase it also improves the separability of the classes. Indeed, in the next\nsection we confirm that a better separability has a good repercussion on the\noverall model's performance.\n\n## Effect of rescaling on model's performance\n\nFirst we show how the optimal regularization of a\n:class:`~sklearn.linear_model.LogisticRegressionCV` depends on the scaling or\nnon-scaling of the data:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n\nfrom sklearn.linear_model import LogisticRegressionCV\nfrom sklearn.pipeline import make_pipeline\n\nCs = np.logspace(-5, 5, 20)\n\nunscaled_clf = make_pipeline(pca, LogisticRegressionCV(Cs=Cs))\nunscaled_clf.fit(X_train, y_train)\n\nscaled_clf = make_pipeline(scaler, pca, LogisticRegressionCV(Cs=Cs))\nscaled_clf.fit(X_train, y_train)\n\nprint(f\"Optimal C for the unscaled PCA: {unscaled_clf[-1].C_[0]:.4f}\\n\")\nprint(f\"Optimal C for the standardized data with PCA: {scaled_clf[-1].C_[0]:.2f}\")"
]
},
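{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see where these optimal values come from (this inspection is illustrative\nand not part of the original example), we can look at the cross-validated\naccuracy that :class:`~sklearn.linear_model.LogisticRegressionCV` stores for\neach candidate `C`; the reported `C_` corresponds to the maximum of this curve.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative check: mean cross-validated accuracy over the grid of C values\n# for the pipeline with scaling. `scores_` maps each class label to an array of\n# shape (n_folds, n_Cs); with the multinomial loss the same grid is repeated\n# for every class, so we take the first entry.\nscores_grid = list(scaled_clf[-1].scores_.values())[0]\nfor C, score in zip(Cs, scores_grid.mean(axis=0)):\n    print(f\"C={C:.2e}  mean CV accuracy={score:.3f}\")"
]
},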
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The need for regularization is higher (lower values of `C`) for the data that\nwas not scaled before applying PCA. We now evaluate the effect of scaling on\nthe accuracy and the mean log-loss of the optimal models:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.metrics import accuracy_score, log_loss\n\ny_pred = unscaled_clf.predict(X_test)\ny_pred_scaled = scaled_clf.predict(X_test)\ny_proba = unscaled_clf.predict_proba(X_test)\ny_proba_scaled = scaled_clf.predict_proba(X_test)\n\nprint(\"Test accuracy for the unscaled PCA\")\nprint(f\"{accuracy_score(y_test, y_pred):.2%}\\n\")\nprint(\"Test accuracy for the standardized data with PCA\")\nprint(f\"{accuracy_score(y_test, y_pred_scaled):.2%}\\n\")\nprint(\"Log-loss for the unscaled PCA\")\nprint(f\"{log_loss(y_test, y_proba):.3}\\n\")\nprint(\"Log-loss for the standardized data with PCA\")\nprint(f\"{log_loss(y_test, y_proba_scaled):.3}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A clear difference in prediction accuracies is observed when the data is\nscaled before :class:`~sklearn.decomposition.PCA`, as it vastly outperforms\nthe unscaled version. This corresponds to the intuition obtained from the plot\nin the previous section, where the components become linearly separable when\nscaling before using :class:`~sklearn.decomposition.PCA`.\n\nNotice that in this case the models with scaled features perform better than\nthe models with non-scaled features because all the variables are expected to\nbe predictive and we rather avoid some of them being comparatively ignored.\n\nIf the variables in lower scales were not predictive, one may experience a\ndecrease of the performance after scaling the features: noisy features would\ncontribute more to the prediction after scaling and therefore scaling would\nincrease overfitting.\n\nLast but not least, we observe that one achieves a lower log-loss by means of\nthe scaling step.\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.19"
}
},
"nbformat": 4,
"nbformat_minor": 0
}