.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/preprocessing/plot_target_encoder_cross_val.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_preprocessing_plot_target_encoder_cross_val.py>`
        to download the full example code or to run this example in your browser via JupyterLite or Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_preprocessing_plot_target_encoder_cross_val.py:


=======================================
Target Encoder's Internal Cross fitting
=======================================

.. currentmodule:: sklearn.preprocessing

The :class:`TargetEncoder` replaces each category of a categorical feature with
the shrunk mean of the target variable for that category. This method is useful
in cases where there is a strong relationship between the categorical feature
and the target. To prevent overfitting, :meth:`TargetEncoder.fit_transform` uses
an internal :term:`cross fitting` scheme to encode the training data to be used
by a downstream model. This scheme involves splitting the data into *k* folds
and encoding each fold using the encodings learnt using the other *k-1* folds.
In this example, we demonstrate the importance of the cross
fitting procedure to prevent overfitting.
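
The scheme described above can be sketched by hand. The following is a minimal
illustration on hypothetical toy data (the names ``categories`` and ``target``
and the choice of 5 folds are assumptions made for this sketch). It uses plain
per-category means rather than the shrunk means computed by
:class:`TargetEncoder`, so it is not the estimator's actual implementation:

.. code-block:: Python

    import numpy as np

    from sklearn.model_selection import KFold

    # Hypothetical toy data: one categorical feature with 5 levels and a
    # continuous target that is independent of it.
    rng = np.random.RandomState(0)
    n_toy = 200
    categories = rng.randint(0, 5, size=n_toy)
    target = rng.randn(n_toy)

    encoded = np.empty(n_toy)
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    for fit_idx, encode_idx in cv.split(categories):
        cats_fit, y_fit = categories[fit_idx], target[fit_idx]
        # Fallback for a category that is absent from the fitting folds.
        fallback = y_fit.mean()
        # Per-category target means learnt on the other k-1 folds only.
        means = {c: y_fit[cats_fit == c].mean() for c in np.unique(cats_fit)}
        # Encode the held-out fold with those means.
        encoded[encode_idx] = [
            means.get(c, fallback) for c in categories[encode_idx]
        ]

Each sample is encoded with statistics computed on folds that exclude it, so its
own target value never contributes to its encoding.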

.. GENERATED FROM PYTHON SOURCE LINES 20-29

Create Synthetic Dataset
========================
For this example, we build a dataset with three categorical features:

* an informative feature with medium cardinality ("informative")
* an uninformative feature with medium cardinality ("shuffled")
* an uninformative feature with high cardinality ("near_unique")

First, we generate the informative feature:

.. GENERATED FROM PYTHON SOURCE LINES 29-54

.. code-block:: Python

    import numpy as np

    from sklearn.preprocessing import KBinsDiscretizer

    n_samples = 50_000

    rng = np.random.RandomState(42)
    y = rng.randn(n_samples)
    noise = 0.5 * rng.randn(n_samples)
    n_categories = 100

    kbins = KBinsDiscretizer(
        n_bins=n_categories,
        encode="ordinal",
        strategy="uniform",
        random_state=rng,
        subsample=None,
    )
    X_informative = kbins.fit_transform((y + noise).reshape(-1, 1))

    # Remove the linear relationship between y and the bin index by permuting the
    # values of X_informative:
    permuted_categories = rng.permutation(n_categories)
    X_informative = permuted_categories[X_informative.astype(np.int32)]


.. GENERATED FROM PYTHON SOURCE LINES 55-57

The uninformative feature with medium cardinality is generated by permuting the
informative feature, which removes its relationship with the target:

.. GENERATED FROM PYTHON SOURCE LINES 57-59

.. code-block:: Python

    X_shuffled = rng.permutation(X_informative)


.. GENERATED FROM PYTHON SOURCE LINES 60-67

The uninformative feature with high cardinality is generated independently of
the target variable. We will show that target encoding without
:term:`cross fitting` causes catastrophic overfitting for the downstream
regressor. Such high cardinality features are essentially unique identifiers
for samples and should generally be removed from machine learning datasets.
In this example, we generate one to show how :class:`TargetEncoder`'s default
:term:`cross fitting` behavior mitigates the overfitting issue automatically.

.. GENERATED FROM PYTHON SOURCE LINES 67-71

.. code-block:: Python

    X_near_unique_categories = rng.choice(
        int(0.9 * n_samples), size=n_samples, replace=True
    ).reshape(-1, 1)
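
As a rough, standalone check of why such a feature behaves like an identifier,
the sketch below redraws a feature in the same way with its own random
generator (the seed and the name ``ids`` are assumptions for illustration) and
counts how many samples fall into each category:

.. code-block:: Python

    import numpy as np

    rng_check = np.random.RandomState(0)
    n_check = 50_000
    ids = rng_check.choice(int(0.9 * n_check), size=n_check, replace=True)

    # Number of samples observed for each distinct category.
    _, counts = np.unique(ids, return_counts=True)
    print("distinct categories:", counts.size)
    print("mean samples per category:", counts.mean())
    print("share of categories seen only once:", (counts == 1).mean())

With so few samples per category, a per-category target mean computed on the
full training set largely reflects the target values of individual samples.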


.. GENERATED FROM PYTHON SOURCE LINES 72-73

Finally, we assemble the dataset and perform a train-test split:

.. GENERATED FROM PYTHON SOURCE LINES 73-86

.. code-block:: Python

    import pandas as pd

    from sklearn.model_selection import train_test_split

    X = pd.DataFrame(
        np.concatenate(
            [X_informative, X_shuffled, X_near_unique_categories],
            axis=1,
        ),
        columns=["informative", "shuffled", "near_unique"],
    )
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


.. GENERATED FROM PYTHON SOURCE LINES 87-95

Training a Ridge Regressor
==========================
In this section, we train a ridge regressor on the dataset with and without
encoding and explore the influence of the target encoder with and without the
internal :term:`cross fitting`. First, we see that the Ridge model trained on
the raw features performs poorly. This is because we permuted the category
labels of the informative feature, so ``X_informative`` has no linear
relationship with the target when used raw:

.. GENERATED FROM PYTHON SOURCE LINES 95-107

.. code-block:: Python

    import sklearn
    from sklearn.linear_model import Ridge

    # Configure transformers to always output DataFrames
    sklearn.set_config(transform_output="pandas")

    ridge = Ridge(alpha=1e-6, solver="lsqr", fit_intercept=False)

    raw_model = ridge.fit(X_train, y_train)
    print("Raw Model score on training set: ", raw_model.score(X_train, y_train))
    print("Raw Model score on test set: ", raw_model.score(X_test, y_test))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Raw Model score on training set:  0.0049896314219659565
    Raw Model score on test set:  0.004577621581492997


.. GENERATED FROM PYTHON SOURCE LINES 108-111

Next, we create a pipeline with the target encoder and ridge model. When the
pipeline is fitted, it calls :meth:`TargetEncoder.fit_transform`, which uses
:term:`cross fitting`. We see that the model fits the data well and generalizes
to the test set:

.. GENERATED FROM PYTHON SOURCE LINES 111-119

.. code-block:: Python

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import TargetEncoder

    model_with_cf = make_pipeline(TargetEncoder(random_state=0), ridge)
    model_with_cf.fit(X_train, y_train)
    print("Model with CF on train set: ", model_with_cf.score(X_train, y_train))
    print("Model with CF on test set: ", model_with_cf.score(X_test, y_test))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Model with CF on train set:  0.8000184677460305
    Model with CF on test set:  0.7927845601690917


.. GENERATED FROM PYTHON SOURCE LINES 120-122

The coefficients of the linear model show that most of the weight is on the
feature at column index 0, which is the informative feature:

.. GENERATED FROM PYTHON SOURCE LINES 122-137

.. code-block:: Python

    import matplotlib.pyplot as plt
    import pandas as pd

    plt.rcParams["figure.constrained_layout.use"] = True

    coefs_cf = pd.Series(
        model_with_cf[-1].coef_, index=model_with_cf[-1].feature_names_in_
    ).sort_values()
    ax = coefs_cf.plot(kind="barh")
    _ = ax.set(
        title="Target encoded with cross fitting",
        xlabel="Ridge coefficient",
        ylabel="Feature",
    )


.. image-sg:: /auto_examples/preprocessing/images/sphx_glr_plot_target_encoder_cross_val_001.png
   :alt: Target encoded with cross fitting
   :srcset: /auto_examples/preprocessing/images/sphx_glr_plot_target_encoder_cross_val_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 138-145

While :meth:`TargetEncoder.fit_transform` uses an internal
:term:`cross fitting` scheme to learn encodings for the training set,
:meth:`TargetEncoder.transform` itself does not.
It applies the encodings that :meth:`TargetEncoder.fit` learned from the
complete training set. Thus, we can use :meth:`TargetEncoder.fit` followed by
:meth:`TargetEncoder.transform` to disable the :term:`cross fitting`. This
encoding is then passed to the ridge model.

.. GENERATED FROM PYTHON SOURCE LINES 145-152

.. code-block:: Python

    target_encoder = TargetEncoder(random_state=0)
    target_encoder.fit(X_train, y_train)
    X_train_no_cf_encoding = target_encoder.transform(X_train)
    X_test_no_cf_encoding = target_encoder.transform(X_test)

    model_no_cf = ridge.fit(X_train_no_cf_encoding, y_train)
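
The difference between the two code paths can also be seen on a small
standalone example. This is only a sketch on hypothetical toy data (``X_toy``
and ``y_toy`` are not part of the dataset above):

.. code-block:: Python

    import numpy as np

    from sklearn.preprocessing import TargetEncoder

    # Hypothetical toy data: one feature with 50 categories, random target.
    rng_toy = np.random.RandomState(0)
    X_toy = rng_toy.randint(0, 50, size=(500, 1))
    y_toy = rng_toy.randn(500)

    encoder = TargetEncoder(random_state=0)
    # Cross fitted encoding of the training data.
    with_cf = np.asarray(encoder.fit_transform(X_toy, y_toy))
    # Encoding learned on, and applied to, the same complete training data.
    without_cf = np.asarray(encoder.fit(X_toy, y_toy).transform(X_toy))

    print("same values:", np.allclose(with_cf, without_cf))
    print("distinct encodings without CF:", np.unique(without_cf).size)
    print("distinct encodings with CF:", np.unique(with_cf).size)

Without cross fitting, every category maps to a single value. With cross
fitting, the value assigned to a category depends on the fold the sample
belongs to, so the two outputs differ.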


.. GENERATED FROM PYTHON SOURCE LINES 153-155

We evaluate the model that did not use :term:`cross fitting` when encoding and
see that it overfits:

.. GENERATED FROM PYTHON SOURCE LINES 155-167

.. code-block:: Python

    print(
        "Model without CF on training set: ",
        model_no_cf.score(X_train_no_cf_encoding, y_train),
    )
    print(
        "Model without CF on test set: ",
        model_no_cf.score(
            X_test_no_cf_encoding,
            y_test,
        ),
    )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Model without CF on training set:  0.858486250088675
    Model without CF on test set:  0.6338211367102258


.. GENERATED FROM PYTHON SOURCE LINES 168-172

The ridge model overfits because it assigns much more weight to the
uninformative features, both the extremely high cardinality one
("near_unique") and the medium cardinality one ("shuffled"), than it did when
the features were encoded with :term:`cross fitting`:

.. GENERATED FROM PYTHON SOURCE LINES 172-182

.. code-block:: Python

    coefs_no_cf = pd.Series(
        model_no_cf.coef_, index=model_no_cf.feature_names_in_
    ).sort_values()
    ax = coefs_no_cf.plot(kind="barh")
    _ = ax.set(
        title="Target encoded without cross fitting",
        xlabel="Ridge coefficient",
        ylabel="Feature",
    )


.. image-sg:: /auto_examples/preprocessing/images/sphx_glr_plot_target_encoder_cross_val_002.png
   :alt: Target encoded without cross fitting
   :srcset: /auto_examples/preprocessing/images/sphx_glr_plot_target_encoder_cross_val_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 183-192

Conclusion
==========
This example demonstrates the importance of :class:`TargetEncoder`'s internal
:term:`cross fitting`. It is important to use
:meth:`TargetEncoder.fit_transform` to encode training data before passing it
to a machine learning model. When a :class:`TargetEncoder` is a part of a
:class:`~sklearn.pipeline.Pipeline` and the pipeline is fitted, the pipeline
will correctly call :meth:`TargetEncoder.fit_transform` and use
:term:`cross fitting` when encoding the training data.
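
One way to check which methods a pipeline calls is to record them in a small
subclass. The sketch below is illustrative only: ``RecordingTargetEncoder``,
``calls``, ``X_toy`` and ``y_toy`` are hypothetical names introduced here, not
part of scikit-learn or of the example above:

.. code-block:: Python

    import numpy as np

    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import TargetEncoder

    calls = []


    class RecordingTargetEncoder(TargetEncoder):
        """Hypothetical helper that records which methods are called."""

        def fit_transform(self, X, y):
            calls.append("fit_transform")
            return super().fit_transform(X, y)

        def transform(self, X):
            calls.append("transform")
            return super().transform(X)


    rng_toy = np.random.RandomState(0)
    X_toy = rng_toy.randint(0, 10, size=(200, 1))
    y_toy = rng_toy.randn(200)

    pipe = make_pipeline(RecordingTargetEncoder(random_state=0), Ridge())
    pipe.fit(X_toy, y_toy)  # training data goes through fit_transform
    pipe.predict(X_toy)  # data seen at prediction time goes through transform
    print(calls)

The recorded calls should show ``fit_transform`` during fitting and
``transform`` during prediction.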


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.285 seconds)


.. _sphx_glr_download_auto_examples_preprocessing_plot_target_encoder_cross_val.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/main?urlpath=lab/tree/notebooks/auto_examples/preprocessing/plot_target_encoder_cross_val.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../../lite/lab/?path=auto_examples/preprocessing/plot_target_encoder_cross_val.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_target_encoder_cross_val.ipynb <plot_target_encoder_cross_val.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_target_encoder_cross_val.py <plot_target_encoder_cross_val.py>`


.. include:: plot_target_encoder_cross_val.recommendations


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_