.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_compose_plot_column_transformer.py>` to download the full example code or to run this example in your browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_compose_plot_column_transformer.py:


==================================================
Column Transformer with Heterogeneous Data Sources
==================================================

Datasets can often contain components that require different feature
extraction and processing pipelines. This scenario might occur when:

1. your dataset consists of heterogeneous data types (e.g. raster images and
   text captions),

2. your dataset is stored in a :class:`pandas.DataFrame` and different columns
   require different processing pipelines.

This example demonstrates how to use
:class:`~sklearn.compose.ColumnTransformer` on a dataset containing
different types of features. The choice of features is not particularly
helpful, but serves to illustrate the technique.
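Before turning to the newsgroups data, the second scenario above can be
sketched in miniature (the DataFrame, its column names, and the chosen
transformers are invented for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

# a toy frame with one numeric and one text column
df = pd.DataFrame({
    'age': [25, 32, 47],
    'bio': ['likes space', 'studies medicine', 'writes about space'],
})

ct = ColumnTransformer([
    # a list of columns selects a 2-D slice, as StandardScaler expects
    ('num', StandardScaler(), ['age']),
    # a single column name selects a 1-D slice, as CountVectorizer expects
    ('txt', CountVectorizer(), 'bio'),
])
X = ct.fit_transform(df)
print(X.shape)  # one scaled column plus one column per vocabulary term
```

Note the selection convention: vectorizers that consume raw text need a
scalar column selector (1-D), while most other transformers need a list
of columns (2-D).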
.. code-block:: default


    # Author: Matt Terry
    #
    # License: BSD 3 clause
    import numpy as np

    from sklearn.preprocessing import FunctionTransformer
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import classification_report
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.svm import LinearSVC

20 newsgroups dataset
---------------------

We will use the :ref:`20 newsgroups dataset <20newsgroups_dataset>`, which
comprises posts from newsgroups on 20 topics. This dataset is split
into train and test subsets based on messages posted before and after
a specific date. We will only use posts from 2 categories to speed up running
time.
.. code-block:: default


    categories = ['sci.med', 'sci.space']
    X_train, y_train = fetch_20newsgroups(random_state=1,
                                          subset='train',
                                          categories=categories,
                                          remove=('footers', 'quotes'),
                                          return_X_y=True)
    X_test, y_test = fetch_20newsgroups(random_state=1,
                                        subset='test',
                                        categories=categories,
                                        remove=('footers', 'quotes'),
                                        return_X_y=True)

Each sample is a single string containing both the meta information about
the post (such as the subject line) and the body of the post.
.. code-block:: default


    print(X_train[0])

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    From: mccall@mksol.dseg.ti.com (fred j mccall 575-3539)
    Subject: Re: Metric vs English
    Article-I.D.: mksol.1993Apr6.131900.8407
    Organization: Texas Instruments Inc
    Lines: 31

    American, perhaps, but nothing military about it.  I learned (mostly)
    slugs when we talked English units in high school physics and while
    the teacher was an ex-Navy fighter jock the book certainly wasn't
    produced by the military.

    [Poundals were just too flinking small and made the math come out
    funny; sort of the same reason proponents of SI give for using that.]

    --
    "Insisting on perfect safety is for people who don't have the balls to live
    in the real world." -- Mary Shafer, NASA Ames Dryden

Creating transformers
---------------------

First, we would like a transformer that extracts the subject and
body of each post. Since this is a stateless transformation (does not
require state information from training data), we can define a function that
performs the data transformation then use
:class:`~sklearn.preprocessing.FunctionTransformer` to create a scikit-learn
transformer.
.. code-block:: default


    def subject_body_extractor(posts):
        # construct object dtype array with two columns
        # first column = 'subject' and second column = 'body'
        features = np.empty(shape=(len(posts), 2), dtype=object)
        for i, text in enumerate(posts):
            # temporary variable `_` stores '\n\n'
            headers, _, body = text.partition('\n\n')
            # store body text in second column
            features[i, 1] = body

            prefix = 'Subject:'
            sub = ''
            # save text after 'Subject:' in first column
            for line in headers.split('\n'):
                if line.startswith(prefix):
                    sub = line[len(prefix):]
                    break
            features[i, 0] = sub
        return features


    subject_body_transformer = FunctionTransformer(subject_body_extractor)

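The header/body split above relies on the convention that headers are
separated from the body by the first blank line, which is exactly what
``str.partition('\n\n')`` captures. A minimal sketch with an invented post:

```python
# a toy post in the header/blank-line/body format used by the dataset
text = 'Subject: Re: Metric vs English\nLines: 31\n\nNothing military about it.'

# partition splits on the FIRST occurrence of the separator only
headers, _, body = text.partition('\n\n')
print(headers)  # the raw header block, one header per line
print(body)     # everything after the first blank line
```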
We will also create a transformer that extracts the
length of the text and the number of sentences.
.. code-block:: default


    def text_stats(posts):
        return [{'length': len(text),
                 'num_sentences': text.count('.')}
                for text in posts]


    text_stats_transformer = FunctionTransformer(text_stats)

Classification pipeline
-----------------------

The pipeline below extracts the subject and body from each post using
``subject_body_transformer``, producing an ``(n_samples, 2)`` array. This
array is then used to compute standard bag-of-words features for the subject
and body, as well as text length and number of sentences on the body, using
``ColumnTransformer``. We combine them, with weights, then train a
classifier on the combined set of features.
.. code-block:: default


    pipeline = Pipeline([
        # Extract subject & body
        ('subjectbody', subject_body_transformer),
        # Use ColumnTransformer to combine the subject and body features
        ('union', ColumnTransformer(
            [
                # bag-of-words for subject (col 0)
                ('subject', TfidfVectorizer(min_df=50), 0),
                # bag-of-words with decomposition for body (col 1)
                ('body_bow', Pipeline([
                    ('tfidf', TfidfVectorizer()),
                    ('best', TruncatedSVD(n_components=50)),
                ]), 1),
                # Pipeline for pulling text stats from post's body
                ('body_stats', Pipeline([
                    ('stats', text_stats_transformer),  # returns a list of dicts
                    ('vect', DictVectorizer()),  # list of dicts -> feature matrix
                ]), 1),
            ],
            # weight above ColumnTransformer features
            transformer_weights={
                'subject': 0.8,
                'body_bow': 0.5,
                'body_stats': 1.0,
            }
        )),
        # Use a SVC classifier on the combined features
        ('svc', LinearSVC(dual=False)),
    ], verbose=True)

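The ``transformer_weights`` mapping simply scales each transformer's output
before the outputs are concatenated. A minimal sketch with invented data and
names (``FunctionTransformer()`` with no function is the identity transform):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1.0, 10.0],
              [2.0, 20.0]])

# two identity transformers, one per column, with different weights
ct = ColumnTransformer(
    [('a', FunctionTransformer(), [0]),
     ('b', FunctionTransformer(), [1])],
    transformer_weights={'a': 2.0, 'b': 0.5},
)
Xt = ct.fit_transform(X)
print(Xt)  # column 0 doubled, column 1 halved
```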
Finally, we fit our pipeline on the training data and use it to predict
topics for ``X_test``. Performance metrics of our pipeline are then printed.
.. code-block:: default


    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    print('Classification report:\n\n{}'.format(
        classification_report(y_test, y_pred))
    )

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    [Pipeline] ....... (step 1 of 3) Processing subjectbody, total=   0.0s
    [Pipeline] ............. (step 2 of 3) Processing union, total=   0.5s
    [Pipeline] ............... (step 3 of 3) Processing svc, total=   0.0s
    Classification report:

                  precision    recall  f1-score   support

               0       0.84      0.88      0.86       396
               1       0.87      0.83      0.85       394

        accuracy                           0.85       790
       macro avg       0.85      0.85      0.85       790
    weighted avg       0.85      0.85      0.85       790

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 2.772 seconds)

.. _sphx_glr_download_auto_examples_compose_plot_column_transformer.py:


.. only:: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example


  .. container:: binder-badge

    .. image:: https://mybinder.org/badge_logo.svg
      :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/0.23.X?urlpath=lab/tree/notebooks/auto_examples/compose/plot_column_transformer.ipynb
      :width: 150 px


  .. container:: sphx-glr-download sphx-glr-download-python

     :download:`Download Python source code: plot_column_transformer.py <plot_column_transformer.py>`


  .. container:: sphx-glr-download sphx-glr-download-jupyter

     :download:`Download Jupyter notebook: plot_column_transformer.ipynb <plot_column_transformer.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_