.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py:


===================================
Column Transformer with Mixed Types
===================================

This example illustrates how to apply different preprocessing and feature
extraction pipelines to different subsets of features, using
:class:`sklearn.compose.ColumnTransformer`. This is particularly handy for
datasets that contain heterogeneous data types, since we may want to scale
the numeric features and one-hot encode the categorical ones.

In this example, the numeric data is standard-scaled after median-imputation,
while the categorical data is one-hot encoded after imputing missing values
with a new category (``'missing'``).

Finally, the preprocessing pipeline is integrated in a full prediction
pipeline using :class:`sklearn.pipeline.Pipeline`, together with a simple
classification model.


.. code-block:: default


    # Author: Pedro Morales
    #
    # License: BSD 3 clause

    import numpy as np

    from sklearn.compose import ColumnTransformer
    from sklearn.datasets import fetch_openml
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split, GridSearchCV

    np.random.seed(0)

    # Load data from https://www.openml.org/d/40945
    X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

    # Alternatively, X and y can be obtained directly from the frame attribute
    # of the Bunch returned by fetch_openml("titanic", version=1,
    # as_frame=True) (i.e. without return_X_y=True):
    # X = titanic.frame.drop('survived', axis=1)
    # y = titanic.frame['survived']

    # We will train our classifier with the following features:
    # Numeric Features:
    # - age: float.
    # - fare: float.
    # Categorical Features:
    # - embarked: categories encoded as strings {'C', 'S', 'Q'}.
    # - sex: categories encoded as strings {'female', 'male'}.
    # - pclass: ordinal integers {1, 2, 3}.

    # We create the preprocessing pipelines for both numeric and categorical
    # data.
    numeric_features = ['age', 'fare']
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())])

    categorical_features = ['embarked', 'sex', 'pclass']
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)])

    # Append classifier to preprocessing pipeline.
    # Now we have a full prediction pipeline.
    clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', LogisticRegression())])

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    clf.fit(X_train, y_train)
    print("model score: %.3f" % clf.score(X_test, y_test))


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    model score: 0.790

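
The fitted pipeline can be inspected after training, for instance to recover
the column names that the one-hot encoder generated for the categorical
features. The short sketch below is not part of the original example script;
it only assumes the ``clf`` pipeline fitted above and uses
``OneHotEncoder.get_feature_names`` (available in the scikit-learn version
this example targets; newer releases provide ``get_feature_names_out``
instead).


.. code-block:: default


    # Illustration only (not in the original script): drill down into the
    # fitted ColumnTransformer to list the one-hot encoded column names,
    # e.g. 'embarked_C' or 'sex_female'.
    onehot_columns = (clf.named_steps['preprocessor']
                         .named_transformers_['cat']
                         .named_steps['onehot']
                         .get_feature_names(categorical_features))
    print(onehot_columns)
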

Using the prediction pipeline in a grid search
##############################################################################

Grid search can also be performed on the different preprocessing steps
defined in the ``ColumnTransformer`` object, together with the classifier's
hyperparameters as part of the ``Pipeline``. We will search for both the
imputer strategy of the numeric preprocessing and the regularization
parameter of the logistic regression using
:class:`sklearn.model_selection.GridSearchCV`.


.. code-block:: default


    param_grid = {
        'preprocessor__num__imputer__strategy': ['mean', 'median'],
        'classifier__C': [0.1, 1.0, 10, 100],
    }

    grid_search = GridSearchCV(clf, param_grid, cv=10)
    grid_search.fit(X_train, y_train)

    print("best logistic regression from grid search: %.3f"
          % grid_search.score(X_test, y_test))


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    best logistic regression from grid search: 0.798


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 2.795 seconds)

**Estimated memory usage:** 8 MB
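
The hyperparameter combination selected by the search can be inspected once
``grid_search`` has been fitted. The sketch below is not part of the example
script above; it only assumes the fitted ``grid_search`` object and pandas,
which is already required by ``fetch_openml(..., as_frame=True)``.


.. code-block:: default


    # Sketch only: report the winning imputer strategy and regularization
    # strength, then tabulate the mean cross-validated score of every
    # candidate in the grid.
    import pandas as pd

    print(grid_search.best_params_)

    cv_results = pd.DataFrame(grid_search.cv_results_)
    print(cv_results[['param_preprocessor__num__imputer__strategy',
                      'param_classifier__C',
                      'mean_test_score']])
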