.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_compose_plot_column_transformer.py>` to download the full example code or to run this example in your browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_compose_plot_column_transformer.py:


==================================================
Column Transformer with Heterogeneous Data Sources
==================================================

Datasets can often contain components that require different feature
extraction and processing pipelines. This scenario might occur when:

1. your dataset consists of heterogeneous data types (e.g. raster images and
   text captions),

2. your dataset is stored in a :class:`pandas.DataFrame` and different columns
   require different processing pipelines.

This example demonstrates how to use
:class:`~sklearn.compose.ColumnTransformer` on a dataset containing
different types of features. The choice of features is not particularly
helpful, but serves to illustrate the technique.


.. code-block:: default


    # Author: Matt Terry
    #
    # License: BSD 3 clause

    import numpy as np

    from sklearn.preprocessing import FunctionTransformer
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import classification_report
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.svm import LinearSVC


20 newsgroups dataset
---------------------

We will use the :ref:`20 newsgroups dataset <20newsgroups_dataset>`, which
comprises posts from newsgroups on 20 topics. The dataset is split into
train and test subsets based on messages posted before or after a specific
date. We will only use posts from 2 categories to speed up running time.
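As an aside, the second scenario from the introduction, a
:class:`pandas.DataFrame` whose columns need different preprocessing, can be
sketched with toy data. The frame, column names, and choice of transformers
below are invented purely for illustration:

```python
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# hypothetical frame: one free-text column, one numeric column
df = pd.DataFrame({
    'caption': ['a red rocket', 'a blue pill', 'a red pill'],
    'height': [10.0, 2.0, 3.0],
})

# route each column to a suitable transformer:
# a scalar column name selects a 1d column (as TfidfVectorizer expects),
# a list of names selects a 2d sub-frame (as StandardScaler expects)
column_transformer = ColumnTransformer([
    ('caption_bow', TfidfVectorizer(), 'caption'),
    ('height_scaled', StandardScaler(), ['height']),
])

X = column_transformer.fit_transform(df)
print(X.shape)
```

With the default ``token_pattern`` the single-character token ``'a'`` is
dropped, so this yields four tf-idf columns plus the scaled height, i.e. a
``(3, 5)`` matrix.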
.. code-block:: default


    categories = ['sci.med', 'sci.space']
    X_train, y_train = fetch_20newsgroups(random_state=1,
                                          subset='train',
                                          categories=categories,
                                          remove=('footers', 'quotes'),
                                          return_X_y=True)
    X_test, y_test = fetch_20newsgroups(random_state=1,
                                        subset='test',
                                        categories=categories,
                                        remove=('footers', 'quotes'),
                                        return_X_y=True)


Each sample holds meta information about the post, such as the subject,
together with the body of the news post.


.. code-block:: default


    print(X_train[0])


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    From: mccall@mksol.dseg.ti.com (fred j mccall 575-3539)
    Subject: Re: Metric vs English
    Article-I.D.: mksol.1993Apr6.131900.8407
    Organization: Texas Instruments Inc
    Lines: 31

    American, perhaps, but nothing military about it.  I learned (mostly)
    slugs when we talked English units in high school physics and while the
    teacher was an ex-Navy fighter jock the book certainly wasn't produced
    by the military.

    [Poundals were just too flinking small and made the math come out funny;
    sort of the same reason proponents of SI give for using that.]

    --
    "Insisting on perfect safety is for people who don't have the balls to
    live in the real world."  -- Mary Shafer, NASA Ames Dryden


Creating transformers
---------------------

First, we would like a transformer that extracts the subject and body of
each post. Since this is a stateless transformation (it does not require
state information from the training data), we can define a function that
performs the data transformation and then use
:class:`~sklearn.preprocessing.FunctionTransformer` to create a
scikit-learn transformer.
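Because ``fit`` on a :class:`~sklearn.preprocessing.FunctionTransformer`
learns nothing, the wrapped function is simply applied at transform time. A
minimal, self-contained sketch, using ``np.log1p`` as the stateless
function:

```python
import numpy as np

from sklearn.preprocessing import FunctionTransformer

# wrap a plain function; fit() is a no-op for stateless transforms
log_transformer = FunctionTransformer(np.log1p)

X = np.array([[0.0, 1.0],
              [2.0, 3.0]])
# fit_transform just applies np.log1p element-wise
X_log = log_transformer.fit_transform(X)
print(X_log)
```

The same pattern is used below, with the function operating on raw text
instead of numbers.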
.. code-block:: default


    def subject_body_extractor(posts):
        # construct object dtype array with two columns
        # first column = 'subject' and second column = 'body'
        features = np.empty(shape=(len(posts), 2), dtype=object)
        for i, text in enumerate(posts):
            # temporary variable `_` stores '\n\n'
            headers, _, body = text.partition('\n\n')
            # store body text in second column
            features[i, 1] = body

            prefix = 'Subject:'
            sub = ''
            # save text after 'Subject:' in first column
            for line in headers.split('\n'):
                if line.startswith(prefix):
                    sub = line[len(prefix):]
                    break
            features[i, 0] = sub

        return features


    subject_body_transformer = FunctionTransformer(subject_body_extractor)


We will also create a transformer that extracts the length of the text and
the number of sentences.


.. code-block:: default


    def text_stats(posts):
        return [{'length': len(text),
                 'num_sentences': text.count('.')}
                for text in posts]


    text_stats_transformer = FunctionTransformer(text_stats)


Classification pipeline
-----------------------

The pipeline below extracts the subject and body from each post using
``subject_body_transformer``, producing an ``(n_samples, 2)`` array. This
array is then used to compute standard bag-of-words features for the subject
and body, as well as text length and number of sentences on the body, using
``ColumnTransformer``. We combine the features, with weights, then train a
classifier on the combined set of features.
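Because ``text_stats`` returns a list of dicts rather than a numeric array,
the pipeline pairs it with a
:class:`~sklearn.feature_extraction.DictVectorizer`. That conversion can be
previewed in isolation (the ``text_stats`` definition is repeated and the
example documents are invented so the snippet is self-contained):

```python
from sklearn.feature_extraction import DictVectorizer


def text_stats(posts):
    return [{'length': len(text),
             'num_sentences': text.count('.')}
            for text in posts]


docs = ['Short post.', 'A longer post. With two sentences.']
vect = DictVectorizer()
# each dict key becomes one column of the output matrix
X_stats = vect.fit_transform(text_stats(docs))
print(vect.feature_names_)   # ['length', 'num_sentences']
print(X_stats.toarray())
```

Each post thus contributes one row with its character length and an
(approximate) sentence count.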
.. code-block:: default


    pipeline = Pipeline([
        # Extract subject & body
        ('subjectbody', subject_body_transformer),
        # Use ColumnTransformer to combine the subject and body features
        ('union', ColumnTransformer(
            [
                # bag-of-words for subject (col 0)
                ('subject', TfidfVectorizer(min_df=50), 0),
                # bag-of-words with decomposition for body (col 1)
                ('body_bow', Pipeline([
                    ('tfidf', TfidfVectorizer()),
                    ('best', TruncatedSVD(n_components=50)),
                ]), 1),
                # Pipeline for pulling text stats from post's body
                ('body_stats', Pipeline([
                    ('stats', text_stats_transformer),  # returns a list of dicts
                    ('vect', DictVectorizer()),  # list of dicts -> feature matrix
                ]), 1),
            ],
            # weight above ColumnTransformer features
            transformer_weights={
                'subject': 0.8,
                'body_bow': 0.5,
                'body_stats': 1.0,
            }
        )),
        # Use a SVC classifier on the combined features
        ('svc', LinearSVC(dual=False)),
    ], verbose=True)


Finally, we fit our pipeline on the training data and use it to predict
topics for ``X_test``. Performance metrics of our pipeline are then printed.


.. code-block:: default


    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    print('Classification report:\n\n{}'.format(
        classification_report(y_test, y_pred))
    )


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    [Pipeline] ....... (step 1 of 3) Processing subjectbody, total=   0.0s
    [Pipeline] ............. (step 2 of 3) Processing union, total=   0.5s
    [Pipeline] ............... (step 3 of 3) Processing svc, total=   0.0s
    Classification report:

                  precision    recall  f1-score   support

               0       0.84      0.88      0.86       396
               1       0.87      0.83      0.85       394

        accuracy                           0.85       790
       macro avg       0.85      0.85      0.85       790
    weighted avg       0.85      0.85      0.85       790



.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  2.772 seconds)


.. _sphx_glr_download_auto_examples_compose_plot_column_transformer.py:


.. only:: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example


  .. container:: binder-badge
    .. image:: https://mybinder.org/badge_logo.svg
      :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/0.23.X?urlpath=lab/tree/notebooks/auto_examples/compose/plot_column_transformer.ipynb
      :width: 150 px


  .. container:: sphx-glr-download sphx-glr-download-python

     :download:`Download Python source code: plot_column_transformer.py <plot_column_transformer.py>`


  .. container:: sphx-glr-download sphx-glr-download-jupyter

     :download:`Download Jupyter notebook: plot_column_transformer.ipynb <plot_column_transformer.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_