.. note:: :class: sphx-glr-download-link-note Click :ref:`here ` to download the full example code or to run this example in your browser via Binder .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_compose_plot_column_transformer.py: ================================================== Column Transformer with Heterogeneous Data Sources ================================================== Datasets can often contain components of that require different feature extraction and processing pipelines. This scenario might occur when: 1. Your dataset consists of heterogeneous data types (e.g. raster images and text captions) 2. Your dataset is stored in a Pandas DataFrame and different columns require different processing pipelines. This example demonstrates how to use :class:`sklearn.compose.ColumnTransformer` on a dataset containing different types of features. We use the 20-newsgroups dataset and compute standard bag-of-words features for the subject line and body in separate pipelines as well as ad hoc features on the body. We combine them (with weights) using a ColumnTransformer and finally train a classifier on the combined set of features. The choice of features is not particularly helpful, but serves to illustrate the technique. .. rst-class:: sphx-glr-script-out Out: .. code-block:: none [Pipeline] ....... (step 1 of 3) Processing subjectbody, total= 0.0s [Pipeline] ............. (step 2 of 3) Processing union, total= 0.3s [Pipeline] ............... (step 3 of 3) Processing svc, total= 0.0s precision recall f1-score support 0 0.78 0.71 0.74 319 1 0.67 0.75 0.71 251 accuracy 0.73 570 macro avg 0.73 0.73 0.72 570 weighted avg 0.73 0.73 0.73 570 | .. code-block:: default # Author: Matt Terry # # License: BSD 3 clause import numpy as np from sklearn.base import BaseEstimator, TransformerMixin from sklearn.datasets import fetch_20newsgroups from sklearn.decomposition import TruncatedSVD from sklearn.feature_extraction import DictVectorizer from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.metrics import classification_report from sklearn.pipeline import Pipeline from sklearn.compose import ColumnTransformer from sklearn.svm import LinearSVC class TextStats(TransformerMixin, BaseEstimator): """Extract features from each document for DictVectorizer""" def fit(self, x, y=None): return self def transform(self, posts): return [{'length': len(text), 'num_sentences': text.count('.')} for text in posts] class SubjectBodyExtractor(TransformerMixin, BaseEstimator): """Extract the subject & body from a usenet post in a single pass. Takes a sequence of strings and produces a dict of sequences. Keys are `subject` and `body`. """ def fit(self, x, y=None): return self def transform(self, posts): # construct object dtype array with two columns # first column = 'subject' and second column = 'body' features = np.empty(shape=(len(posts), 2), dtype=object) for i, text in enumerate(posts): headers, _, bod = text.partition('\n\n') features[i, 1] = bod prefix = 'Subject:' sub = '' for line in headers.split('\n'): if line.startswith(prefix): sub = line[len(prefix):] break features[i, 0] = sub return features pipeline = Pipeline([ # Extract the subject & body ('subjectbody', SubjectBodyExtractor()), # Use ColumnTransformer to combine the features from subject and body ('union', ColumnTransformer( [ # Pulling features from the post's subject line (first column) ('subject', TfidfVectorizer(min_df=50), 0), # Pipeline for standard bag-of-words model for body (second column) ('body_bow', Pipeline([ ('tfidf', TfidfVectorizer()), ('best', TruncatedSVD(n_components=50)), ]), 1), # Pipeline for pulling ad hoc features from post's body ('body_stats', Pipeline([ ('stats', TextStats()), # returns a list of dicts ('vect', DictVectorizer()), # list of dicts -> feature matrix ]), 1), ], # weight components in ColumnTransformer transformer_weights={ 'subject': 0.8, 'body_bow': 0.5, 'body_stats': 1.0, } )), # Use a SVC classifier on the combined features ('svc', LinearSVC(dual=False)), ], verbose=True) # limit the list of categories to make running this example faster. categories = ['alt.atheism', 'talk.religion.misc'] X_train, y_train = fetch_20newsgroups(random_state=1, subset='train', categories=categories, remove=('footers', 'quotes'), return_X_y=True) X_test, y_test = fetch_20newsgroups(random_state=1, subset='test', categories=categories, remove=('footers', 'quotes'), return_X_y=True) pipeline.fit(X_train, y_train) y_pred = pipeline.predict(X_test) print(classification_report(y_test, y_pred)) .. rst-class:: sphx-glr-timing **Total running time of the script:** ( 0 minutes 2.683 seconds) **Estimated memory usage:** 50 MB .. _sphx_glr_download_auto_examples_compose_plot_column_transformer.py: .. only :: html .. container:: sphx-glr-footer :class: sphx-glr-footer-example .. container:: binder-badge .. image:: https://mybinder.org/badge_logo.svg :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/0.22.X?urlpath=lab/tree/notebooks/auto_examples/compose/plot_column_transformer.ipynb :width: 150 px .. container:: sphx-glr-download :download:`Download Python source code: plot_column_transformer.py ` .. container:: sphx-glr-download :download:`Download Jupyter notebook: plot_column_transformer.ipynb ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_