.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/compose/plot_column_transformer_mixed_types.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code or to run this example in your browser via JupyterLite or Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py:


===================================
Column Transformer with Mixed Types
===================================

.. currentmodule:: sklearn

This example illustrates how to apply different preprocessing and feature
extraction pipelines to different subsets of features, using
:class:`~compose.ColumnTransformer`. This is particularly handy for
datasets that contain heterogeneous data types, since we may want to
scale the numeric features and one-hot encode the categorical ones.

In this example, the numeric data is standard-scaled after median imputation.
The categorical data is one-hot encoded via ``OneHotEncoder``, which
creates a new category for missing values. We further reduce the dimensionality
by selecting categories using a chi-squared test.

In addition, we show two different ways to dispatch the columns to the
particular pre-processor: by column names and by column data types.

Finally, the preprocessing pipeline is integrated in a full prediction pipeline
using :class:`~pipeline.Pipeline`, together with a simple classification
model.

.. GENERATED FROM PYTHON SOURCE LINES 27-32

.. code-block:: Python


    # Author: Pedro Morales
    #
    # License: BSD 3 clause

.. GENERATED FROM PYTHON SOURCE LINES 33-46

.. code-block:: Python


    import numpy as np

    from sklearn.compose import ColumnTransformer
    from sklearn.datasets import fetch_openml
    from sklearn.feature_selection import SelectPercentile, chi2
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import RandomizedSearchCV, train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    np.random.seed(0)

.. GENERATED FROM PYTHON SOURCE LINES 47-48

Load data from https://www.openml.org/d/40945

.. GENERATED FROM PYTHON SOURCE LINES 48-54

.. code-block:: Python


    X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

    # Alternatively, fetch the full dataset object and build X and y from its
    # frame attribute:
    # titanic = fetch_openml("titanic", version=1, as_frame=True)
    # X = titanic.frame.drop('survived', axis=1)
    # y = titanic.frame['survived']

.. GENERATED FROM PYTHON SOURCE LINES 55-73

Use ``ColumnTransformer`` by selecting columns by name

We will train our classifier with the following features:

Numeric Features:

* ``age``: float;
* ``fare``: float.

Categorical Features:

* ``embarked``: categories encoded as strings ``{'C', 'S', 'Q'}``;
* ``sex``: categories encoded as strings ``{'female', 'male'}``;
* ``pclass``: ordinal integers ``{1, 2, 3}``.

We create the preprocessing pipelines for both numeric and categorical data.
Note that ``pclass`` could be treated as either a categorical or a numeric
feature.

.. GENERATED FROM PYTHON SOURCE LINES 73-93

.. code-block:: Python


    numeric_features = ["age", "fare"]
    numeric_transformer = Pipeline(
        steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
    )

    categorical_features = ["embarked", "sex", "pclass"]
    categorical_transformer = Pipeline(
        steps=[
            ("encoder", OneHotEncoder(handle_unknown="ignore")),
            ("selector", SelectPercentile(chi2, percentile=50)),
        ]
    )
    preprocessor = ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, numeric_features),
            ("cat", categorical_transformer, categorical_features),
        ]
    )

.. GENERATED FROM PYTHON SOURCE LINES 94-96

Append classifier to preprocessing pipeline.
Now we have a full prediction pipeline.

.. GENERATED FROM PYTHON SOURCE LINES 96-105

.. code-block:: Python


    clf = Pipeline(
        steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
    )

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    clf.fit(X_train, y_train)
    print("model score: %.3f" % clf.score(X_test, y_test))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    model score: 0.798

.. GENERATED FROM PYTHON SOURCE LINES 106-110

HTML representation of ``Pipeline`` (display diagram)

When the ``Pipeline`` is printed out in a Jupyter notebook, an HTML
representation of the estimator is displayed:

.. GENERATED FROM PYTHON SOURCE LINES 110-112

.. code-block:: Python


    clf

.. raw:: html

Pipeline(steps=[('preprocessor',
                     ColumnTransformer(transformers=[('num',
                                                      Pipeline(steps=[('imputer',
                                                                       SimpleImputer(strategy='median')),
                                                                      ('scaler',
                                                                       StandardScaler())]),
                                                      ['age', 'fare']),
                                                     ('cat',
                                                      Pipeline(steps=[('encoder',
                                                                       OneHotEncoder(handle_unknown='ignore')),
                                                                      ('selector',
                                                                       SelectPercentile(percentile=50,
                                                                                        score_func=<function chi2 at 0x7f9515821430>))]),
                                                      ['embarked', 'sex',
                                                       'pclass'])])),
                    ('classifier', LogisticRegression())])


.. GENERATED FROM PYTHON SOURCE LINES 113-121

Use ``ColumnTransformer`` by selecting columns by data type

When dealing with a cleaned dataset, the preprocessing can be automated by
using the data type of each column to decide whether to treat it as a
numerical or categorical feature.
:func:`sklearn.compose.make_column_selector` gives this possibility.
First, let's only select a subset of columns to simplify our
example.

.. GENERATED FROM PYTHON SOURCE LINES 121-125

.. code-block:: Python


    subset_feature = ["embarked", "sex", "pclass", "age", "fare"]
    X_train, X_test = X_train[subset_feature], X_test[subset_feature]

.. GENERATED FROM PYTHON SOURCE LINES 126-127

Then, we introspect the information regarding each column data type.

.. GENERATED FROM PYTHON SOURCE LINES 127-130

.. code-block:: Python


    X_train.info()

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Index: 1047 entries, 1118 to 684
    Data columns (total 5 columns):
     #   Column    Non-Null Count  Dtype
    ---  ------    --------------  -----
     0   embarked  1045 non-null   category
     1   sex       1047 non-null   category
     2   pclass    1047 non-null   int64
     3   age       841 non-null    float64
     4   fare      1046 non-null   float64
    dtypes: category(2), float64(2), int64(1)
    memory usage: 35.0 KB

.. GENERATED FROM PYTHON SOURCE LINES 131-136

We can observe that the ``embarked`` and ``sex`` columns were tagged as
``category`` columns when loading the data with ``fetch_openml``. Therefore, we
can use this information to dispatch the categorical columns to the
``categorical_transformer`` and the remaining columns to the
``numeric_transformer``.

.. GENERATED FROM PYTHON SOURCE LINES 138-143

.. note:: In practice, you will have to handle the column data types yourself.
   If you want some columns to be considered as ``category``, you will have to
   convert them into categorical columns. If you are using pandas, you
   can refer to their documentation regarding `Categorical data `_.

.. GENERATED FROM PYTHON SOURCE LINES 143-161

.. code-block:: Python


    from sklearn.compose import make_column_selector as selector

    preprocessor = ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, selector(dtype_exclude="category")),
            ("cat", categorical_transformer, selector(dtype_include="category")),
        ]
    )
    clf = Pipeline(
        steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
    )

    clf.fit(X_train, y_train)
    print("model score: %.3f" % clf.score(X_test, y_test))
    clf

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    model score: 0.798

.. raw:: html

Pipeline(steps=[('preprocessor',
                     ColumnTransformer(transformers=[('num',
                                                      Pipeline(steps=[('imputer',
                                                                       SimpleImputer(strategy='median')),
                                                                      ('scaler',
                                                                       StandardScaler())]),
                                                      <sklearn.compose._column_transformer.make_column_selector object at 0x7f94edd92490>),
                                                     ('cat',
                                                      Pipeline(steps=[('encoder',
                                                                       OneHotEncoder(handle_unknown='ignore')),
                                                                      ('selector',
                                                                       SelectPercentile(percentile=50,
                                                                                        score_func=<function chi2 at 0x7f9515821430>))]),
                                                      <sklearn.compose._column_transformer.make_column_selector object at 0x7f94edd923d0>)])),
                    ('classifier', LogisticRegression())])


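Dtype-based dispatch relies on each column carrying the intended dtype. The
sketch below uses a made-up frame (``toy_types`` and its values are
hypothetical) to show how casting an integer column with
``astype("category")`` changes what ``make_column_selector`` returns.

```python
import pandas as pd

from sklearn.compose import make_column_selector

# Hypothetical frame: one integer column, one float column, one categorical column.
toy_types = pd.DataFrame(
    {
        "pclass": [1, 3, 2],
        "fare": [71.28, 7.25, 13.0],
        "sex": pd.Categorical(["female", "male", "male"]),
    }
)

categorical_selector = make_column_selector(dtype_include="category")

# With an integer dtype, ``pclass`` is not picked up as categorical.
selected_before = categorical_selector(toy_types)

# After an explicit cast, the same selector also returns ``pclass``.
toy_types["pclass"] = toy_types["pclass"].astype("category")
selected_after = categorical_selector(toy_types)

print(selected_before)
print(selected_after)
```

The selector is evaluated on the frame it is called with, so the cast has to
happen before the data reaches the ``ColumnTransformer``.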
.. GENERATED FROM PYTHON SOURCE LINES 162-165

The resulting score can differ from the one obtained with the previous
pipeline (although both round to 0.798 here) because the dtype-based selector
treats the ``pclass`` column as a numeric feature instead of a categorical
feature as previously:

.. GENERATED FROM PYTHON SOURCE LINES 165-168

.. code-block:: Python


    selector(dtype_exclude="category")(X_train)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ['pclass', 'age', 'fare']

.. GENERATED FROM PYTHON SOURCE LINES 169-172

.. code-block:: Python


    selector(dtype_include="category")(X_train)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ['embarked', 'sex']

.. GENERATED FROM PYTHON SOURCE LINES 173-185

Using the prediction pipeline in a grid search

Grid search can also be performed on the different preprocessing steps
defined in the ``ColumnTransformer`` object, together with the classifier's
hyperparameters as part of the ``Pipeline``.
We will search for both the imputer strategy of the numeric preprocessing
and the regularization parameter of the logistic regression using
:class:`~sklearn.model_selection.RandomizedSearchCV`. This
hyperparameter search randomly samples a fixed number of parameter
settings, set by ``n_iter``. Alternatively, one can use
:class:`~sklearn.model_selection.GridSearchCV`, which evaluates the full
cartesian product of the parameter space.

.. GENERATED FROM PYTHON SOURCE LINES 185-195

.. code-block:: Python


    param_grid = {
        "preprocessor__num__imputer__strategy": ["mean", "median"],
        "preprocessor__cat__selector__percentile": [10, 30, 50, 70],
        "classifier__C": [0.1, 1.0, 10, 100],
    }

    search_cv = RandomizedSearchCV(clf, param_grid, n_iter=10, random_state=0)
    search_cv

.. raw:: html

RandomizedSearchCV(estimator=Pipeline(steps=[('preprocessor',
                                                  ColumnTransformer(transformers=[('num',
                                                                                   Pipeline(steps=[('imputer',
                                                                                                    SimpleImputer(strategy='median')),
                                                                                                   ('scaler',
                                                                                                    StandardScaler())]),
                                                                                   <sklearn.compose._column_transformer.make_column_selector object at 0x7f94edd92490>),
                                                                                  ('cat',
                                                                                   Pipeline(steps=[('encoder',
                                                                                                    OneHotEncoder(handle_unknown='ignore')),
                                                                                                   ('s...
                                                                                                                     score_func=<function chi2 at 0x7f9515821430>))]),
                                                                                   <sklearn.compose._column_transformer.make_column_selector object at 0x7f94edd923d0>)])),
                                                 ('classifier',
                                                  LogisticRegression())]),
                       param_distributions={'classifier__C': [0.1, 1.0, 10, 100],
                                            'preprocessor__cat__selector__percentile': [10,
                                                                                        30,
                                                                                        50,
                                                                                        70],
                                            'preprocessor__num__imputer__strategy': ['mean',
                                                                                     'median']},
                       random_state=0)


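The keys of ``param_grid`` address nested parameters by joining step names and
the parameter name with double underscores. The sketch below, which assumes a
reduced pipeline (the ``sketch_*`` names are hypothetical and only the numeric
branch is kept), checks such keys against ``get_params()`` and counts the
candidates an exhaustive grid would contain.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Reduced, unfitted pipeline with the same step names as in the example.
sketch_numeric = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)
sketch_model = Pipeline(
    steps=[
        (
            "preprocessor",
            ColumnTransformer(transformers=[("num", sketch_numeric, ["age", "fare"])]),
        ),
        ("classifier", LogisticRegression()),
    ]
)

sketch_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__C": [0.1, 1.0, 10, 100],
}

# Every key of the grid must be a parameter name known to the pipeline.
valid_names = sketch_model.get_params()
unknown = [name for name in sketch_grid if name not in valid_names]

# The same names work with set_params, which is what a search does internally
# for each candidate.
sketch_model.set_params(preprocessor__num__imputer__strategy="mean")
current_strategy = sketch_model.get_params()["preprocessor__num__imputer__strategy"]

# Size of the exhaustive grid: 2 strategies x 4 values of C.
n_candidates = len(ParameterGrid(sketch_grid))
print(unknown, current_strategy, n_candidates)
```

With the three lists of the ``param_grid`` defined in this example, the same
count gives 2 x 4 x 4 = 32 candidates for an exhaustive search, against the 10
settings sampled with ``n_iter=10``.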
.. GENERATED FROM PYTHON SOURCE LINES 196-199

Calling ``fit`` triggers the cross-validated search for the best
hyper-parameter combination:

.. GENERATED FROM PYTHON SOURCE LINES 199-204

.. code-block:: Python


    search_cv.fit(X_train, y_train)

    print("Best params:")
    print(search_cv.best_params_)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Best params:
    {'preprocessor__num__imputer__strategy': 'mean', 'preprocessor__cat__selector__percentile': 30, 'classifier__C': 100}

.. GENERATED FROM PYTHON SOURCE LINES 205-206

The internal cross-validation score obtained with those parameters is:

.. GENERATED FROM PYTHON SOURCE LINES 206-208

.. code-block:: Python


    print(f"Internal CV score: {search_cv.best_score_:.3f}")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Internal CV score: 0.786

.. GENERATED FROM PYTHON SOURCE LINES 209-210

We can also introspect the top search results as a pandas DataFrame:

.. GENERATED FROM PYTHON SOURCE LINES 210-224

.. code-block:: Python


    import pandas as pd

    cv_results = pd.DataFrame(search_cv.cv_results_)
    cv_results = cv_results.sort_values("mean_test_score", ascending=False)
    cv_results[
        [
            "mean_test_score",
            "std_test_score",
            "param_preprocessor__num__imputer__strategy",
            "param_preprocessor__cat__selector__percentile",
            "param_classifier__C",
        ]
    ].head(5)

.. raw:: html

       mean_test_score  std_test_score  param_preprocessor__num__imputer__strategy  param_preprocessor__cat__selector__percentile  param_classifier__C
    7         0.786015        0.031020                                        mean                                             30                  100
    0         0.785063        0.030498                                      median                                             30                  1.0
    4         0.785063        0.030498                                        mean                                             10                   10
    2         0.785063        0.030498                                        mean                                             30                  1.0
    3         0.783149        0.030462                                        mean                                             30                  0.1


.. GENERATED FROM PYTHON SOURCE LINES 225-229

The best hyper-parameters have been used to re-fit a final model on the full
training set. We can evaluate that final model on held-out test data that was
not used for hyperparameter tuning.

.. GENERATED FROM PYTHON SOURCE LINES 229-233

.. code-block:: Python


    print(
        "accuracy of the best model from randomized search: "
        f"{search_cv.score(X_test, y_test):.3f}"
    )

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    accuracy of the best model from randomized search: 0.798

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 1.278 seconds)


.. _sphx_glr_download_auto_examples_compose_plot_column_transformer_mixed_types.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/1.4.X?urlpath=lab/tree/notebooks/auto_examples/compose/plot_column_transformer_mixed_types.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../../lite/lab/?path=auto_examples/compose/plot_column_transformer_mixed_types.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_column_transformer_mixed_types.ipynb `

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_column_transformer_mixed_types.py `


.. include:: plot_column_transformer_mixed_types.recommendations


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_