.. only:: html .. note:: :class: sphx-glr-download-link-note Click :ref:`here ` to download the full example code or to run this example in your browser via Binder .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_text_plot_hashing_vs_dict_vectorizer.py: =========================================== FeatureHasher and DictVectorizer Comparison =========================================== Compares FeatureHasher and DictVectorizer by using both to vectorize text documents. The example demonstrates syntax and speed only; it doesn't actually do anything useful with the extracted vectors. See the example scripts {document_classification_20newsgroups,clustering}.py for actual learning on text documents. A discrepancy between the number of terms reported for DictVectorizer and for FeatureHasher is to be expected due to hash collisions. .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Usage: /home/circleci/project/examples/text/plot_hashing_vs_dict_vectorizer.py [n_features_for_hashing] The default number of features is 2**18. Loading 20 newsgroups training data 3803 documents - 6.245MB DictVectorizer done in 1.034657s at 6.036MB/s Found 47928 unique terms FeatureHasher on frequency dicts done in 0.639051s at 9.772MB/s Found 43873 unique terms FeatureHasher on raw tokens done in 0.603110s at 10.354MB/s Found 43873 unique terms | .. code-block:: default # Author: Lars Buitinck # License: BSD 3 clause from collections import defaultdict import re import sys from time import time import numpy as np from sklearn.datasets import fetch_20newsgroups from sklearn.feature_extraction import DictVectorizer, FeatureHasher def n_nonzero_columns(X): """Returns the number of non-zero columns in a CSR matrix X.""" return len(np.unique(X.nonzero()[1])) def tokens(doc): """Extract tokens from doc. This uses a simple regex to break strings into tokens. For a more principled approach, see CountVectorizer or TfidfVectorizer. """ return (tok.lower() for tok in re.findall(r"\w+", doc)) def token_freqs(doc): """Extract a dict mapping tokens from doc to their frequencies.""" freq = defaultdict(int) for tok in tokens(doc): freq[tok] += 1 return freq categories = [ 'alt.atheism', 'comp.graphics', 'comp.sys.ibm.pc.hardware', 'misc.forsale', 'rec.autos', 'sci.space', 'talk.religion.misc', ] # Uncomment the following line to use a larger set (11k+ documents) # categories = None print(__doc__) print("Usage: %s [n_features_for_hashing]" % sys.argv[0]) print(" The default number of features is 2**18.") print() try: n_features = int(sys.argv[1]) except IndexError: n_features = 2 ** 18 except ValueError: print("not a valid number of features: %r" % sys.argv[1]) sys.exit(1) print("Loading 20 newsgroups training data") raw_data, _ = fetch_20newsgroups(subset='train', categories=categories, return_X_y=True) data_size_mb = sum(len(s.encode('utf-8')) for s in raw_data) / 1e6 print("%d documents - %0.3fMB" % (len(raw_data), data_size_mb)) print() print("DictVectorizer") t0 = time() vectorizer = DictVectorizer() vectorizer.fit_transform(token_freqs(d) for d in raw_data) duration = time() - t0 print("done in %fs at %0.3fMB/s" % (duration, data_size_mb / duration)) print("Found %d unique terms" % len(vectorizer.get_feature_names())) print() print("FeatureHasher on frequency dicts") t0 = time() hasher = FeatureHasher(n_features=n_features) X = hasher.transform(token_freqs(d) for d in raw_data) duration = time() - t0 print("done in %fs at %0.3fMB/s" % (duration, data_size_mb / duration)) print("Found %d unique terms" % n_nonzero_columns(X)) print() print("FeatureHasher on raw tokens") t0 = time() hasher = FeatureHasher(n_features=n_features, input_type="string") X = hasher.transform(tokens(d) for d in raw_data) duration = time() - t0 print("done in %fs at %0.3fMB/s" % (duration, data_size_mb / duration)) print("Found %d unique terms" % n_nonzero_columns(X)) .. rst-class:: sphx-glr-timing **Total running time of the script:** ( 0 minutes 2.546 seconds) .. _sphx_glr_download_auto_examples_text_plot_hashing_vs_dict_vectorizer.py: .. only :: html .. container:: sphx-glr-footer :class: sphx-glr-footer-example .. container:: binder-badge .. image:: https://mybinder.org/badge_logo.svg :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/0.23.X?urlpath=lab/tree/notebooks/auto_examples/text/plot_hashing_vs_dict_vectorizer.ipynb :width: 150 px .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: plot_hashing_vs_dict_vectorizer.py ` .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: plot_hashing_vs_dict_vectorizer.ipynb ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_