12.1. Pandas/Polars Output for Transformers with set_output API

This part of the user guide explains how scikit-learn supports tabular data.

12.1.1. Propagation of Feature Names

By default, scikit-learn transformers (estimators with a transform method) return numpy arrays (or, in some cases, sparse arrays). Because numpy arrays do not carry names for their axes/dimensions, prior to version 1.0 pipeline.Pipeline did not know how to propagate feature names:

  • The individual step estimators did not know how to handle incoming feature names.

  • The pipeline did not know how to pass feature names from step to step.

In practice, many use cases start with tabular data, such as a pandas or polars dataframe, which has column/feature names.

A first step to support this important use case was the addition of compose.ColumnTransformer in version 0.20. It acts as a gateway for applying different estimators to different features. Most notably, it understands incoming feature names.

The problem was then properly solved by SLEP007: Feature names, their generation and the API, which was fully implemented in version 1.1; see the release highlights 1.0 and release highlights 1.1. When an estimator is passed a dataframe during fit, it sets a feature_names_in_ attribute containing the feature names. This works for pandas dataframes as well as for dataframes that support the Python dataframe interchange protocol __dataframe__. Furthermore, fitted estimators have the method get_feature_names_out. The get_feature_names_out of a transformer returns, you guessed it, the feature names of what transform returns.

12.1.2. Introducing the set_output API

A further major step to support dataframes in a “dataframe in, dataframe out” fashion was SLEP018, implemented for pandas dataframes in version 1.2 and for polars dataframes in version 1.4. It introduced the set_output API to configure transformers to output pandas or polars DataFrames. The output of transformers can be configured per estimator by calling the set_output method, or globally by calling set_config(transform_output="pandas"). Use "polars" instead of "pandas" to get polars DataFrames.

Basic usage looks as follows, starting with the default numpy output:

>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import OneHotEncoder
>>> from sklearn.linear_model import LinearRegression

>>> X = pd.DataFrame(
...     {"animals": ["cat", "cat", "dog", "dog"], "numeric": np.linspace(-1, 1, 4)}
... )
>>> y = np.array([-1.5, 0, 0.1, 1.0])
>>> ct = ColumnTransformer(
...     [("categorical", OneHotEncoder(sparse_output=False), ["animals"])],
...     remainder="passthrough",
... )
>>> model = make_pipeline(ct, LinearRegression()).fit(X, y)
>>> model.feature_names_in_
array(['animals', 'numeric'], dtype=object)
>>> model[0].get_feature_names_out()
array(['categorical__animals_cat', 'categorical__animals_dog',
       'remainder__numeric'], dtype=object)
>>> model[0].transform(X)
array([[ 1.        ,  0.        , -1.        ],
       [ 1.        ,  0.        , -0.33333333],
       [ 0.        ,  1.        ,  0.33333333],
       [ 0.        ,  1.        ,  1.        ]])

Now the same, but with pandas set as output:

>>> from sklearn import set_config
>>> set_config(transform_output="pandas")
>>> model[0].transform(X)
   categorical__animals_cat  categorical__animals_dog  remainder__numeric
0                       1.0                       0.0           -1.000000
1                       1.0                       0.0           -0.333333
2                       0.0                       1.0            0.333333
3                       0.0                       1.0            1.000000

To return to the default, simply run:

>>> set_config(transform_output="default")

A more detailed example can be found in Introducing the set_output API.