sklearn.feature_extraction.DictVectorizer
class sklearn.feature_extraction.DictVectorizer(*, dtype=<class 'numpy.float64'>, separator='=', sparse=True, sort=True)
Transforms lists of feature-value mappings to vectors.
This transformer turns lists of mappings (dict-like objects) of feature names to feature values into NumPy arrays or scipy.sparse matrices for use with scikit-learn estimators.
When feature values are strings, this transformer will do a binary one-hot (aka one-of-K) coding: one boolean-valued feature is constructed for each of the possible string values that the feature can take on. For instance, a feature “f” that can take on the values “ham” and “spam” will become two features in the output, one signifying “f=ham”, the other “f=spam”.
Note, however, that this transformer only performs binary one-hot encoding when feature values are strings. If categorical features are represented as numeric values such as int, the DictVectorizer can be followed by sklearn.preprocessing.OneHotEncoder to complete binary one-hot encoding.
Features that do not occur in a sample (mapping) will have a zero value in the resulting array/matrix.
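The behaviour described above can be sketched as follows. This is a minimal illustration, not part of the original reference; the sample data and the names `samples`, `coded`, and `X_onehot` are made up for the example.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import OneHotEncoder

# String values are one-hot coded; numeric values pass through unchanged.
samples = [{"f": "ham", "count": 2}, {"f": "spam"}]
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(samples)
names = vec.feature_names_  # ['count', 'f=ham', 'f=spam']
# The second sample has no "count" entry, so that column is 0 in its row.

# Integer-coded categories stay in a single numeric column ...
coded = [{"color": 1}, {"color": 3}, {"color": 1}]
X_num = DictVectorizer(sparse=False).fit_transform(coded)  # shape (3, 1)

# ... which OneHotEncoder can then expand into one column per category.
X_onehot = OneHotEncoder().fit_transform(X_num).toarray()
```

Here `X_onehot` has one column for category 1 and one for category 3, whereas `X_num` keeps the raw integer codes.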
Read more in the User Guide.
- Parameters
- dtype : dtype, default=np.float64
The type of feature values. Passed to NumPy array/scipy.sparse matrix constructors as the dtype argument.
- separator : str, default="="
Separator string used when constructing new features for one-hot coding.
- sparse : bool, default=True
Whether transform should produce scipy.sparse matrices.
- sort : bool, default=True
Whether feature_names_ and vocabulary_ should be sorted when fitting.
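As a rough sketch of how these constructor parameters affect the result (the data `D` is invented for illustration, and the ordering shown for `sort=False` assumes features are recorded in order of first appearance):

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction import DictVectorizer

D = [{"n": 1, "f": "ham"}, {"n": 2, "f": "spam"}]

# Custom dtype and separator; sparse=True and sort=True are the defaults.
vec = DictVectorizer(dtype=np.float32, separator=":")
X = vec.fit_transform(D)
is_sparse = sparse.issparse(X)     # a scipy.sparse matrix is returned
sorted_names = vec.feature_names_  # ['f:ham', 'f:spam', 'n']

# With sort=False the names are not sorted after fitting.
unsorted = DictVectorizer(sort=False).fit(D)
unsorted_names = unsorted.feature_names_
```

Passing `sparse=False` instead would return a dense NumPy array with the same values.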
- Attributes
- vocabulary_ : dict
A dictionary mapping feature names to feature indices.
- feature_names_ : list
A list of length n_features containing the feature names (e.g., “f=ham” and “f=spam”).
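The two attributes describe the same mapping from opposite directions. A small sketch, using made-up input data:

```python
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer().fit([{"f": "ham", "n": 1}, {"f": "spam"}])

names = vec.feature_names_  # ['f=ham', 'f=spam', 'n']
vocab = vec.vocabulary_     # {'f=ham': 0, 'f=spam': 1, 'n': 2}

# Each name's index in vocabulary_ is its position in feature_names_.
consistent = all(vocab[name] == i for i, name in enumerate(names))
```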
See also
FeatureHasher performs vectorization using only a hash function.
sklearn.preprocessing.OrdinalEncoder handles nominal/categorical features encoded as columns of arbitrary data types.
Examples
>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer(sparse=False)
>>> D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
>>> X = v.fit_transform(D)
>>> X
array([[2., 0., 1.],
       [0., 1., 3.]])
>>> v.inverse_transform(X) == [{'bar': 2.0, 'foo': 1.0},
...                            {'baz': 1.0, 'foo': 3.0}]
True
>>> v.transform({'foo': 4, 'unseen_feature': 3})
array([[0., 0., 4.]])
Methods
fit(X[, y]): Learn a list of feature name -> indices mappings.
fit_transform(X[, y]): Learn a list of feature name -> indices mappings and transform X.
get_feature_names(): Returns a list of feature names, ordered by their indices.
get_params([deep]): Get parameters for this estimator.
inverse_transform(X[, dict_type]): Transform array or sparse matrix X back to feature mappings.
restrict(support[, indices]): Restrict the features to those in support using feature selection.
set_params(**params): Set the parameters of this estimator.
transform(X): Transform feature->value dicts to array or sparse matrix.
__init__(*, dtype=<class 'numpy.float64'>, separator='=', sparse=True, sort=True)
Initialize self. See help(type(self)) for accurate signature.
fit(X, y=None)
Learn a list of feature name -> indices mappings.
- Parameters
- X : Mapping or iterable over Mappings
Dict(s) or Mapping(s) from feature names (arbitrary Python objects) to feature values (strings or convertible to dtype).
- y : (ignored)
- Returns
- self
fit_transform(X, y=None)
Learn a list of feature name -> indices mappings and transform X.
Like fit(X) followed by transform(X), but does not require materializing X in memory.
- Parameters
- X : Mapping or iterable over Mappings
Dict(s) or Mapping(s) from feature names (arbitrary Python objects) to feature values (strings or convertible to dtype).
- y : (ignored)
- Returns
- Xa : {array, sparse matrix}
Feature vectors; always 2-d.
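Because fit_transform makes a single pass over X, the samples can come from a lazy iterable. The sketch below assumes a hypothetical generator `read_samples` standing in for a real data source:

```python
from sklearn.feature_extraction import DictVectorizer


def read_samples():
    """Hypothetical data source that yields one mapping at a time."""
    for i in range(4):
        yield {"index": i, "parity": "even" if i % 2 == 0 else "odd"}


vec = DictVectorizer(sparse=False)
# The generator is consumed once; no list of all samples is built first.
X = vec.fit_transform(read_samples())
names = vec.feature_names_  # ['index', 'parity=even', 'parity=odd']
```

Calling fit(X) and then transform(X) separately would need two passes, which a generator cannot provide.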
get_feature_names()
Returns a list of feature names, ordered by their indices.
If one-of-K coding is applied to categorical features, this will include the constructed feature names but not the original ones.
get_params(deep=True)
Get parameters for this estimator.
- Parameters
- deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- params : mapping of string to any
Parameter names mapped to their values.
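For illustration, the returned mapping for an instance built with one non-default argument looks like this (a sketch; only the constructor parameters listed in the signature above are assumed):

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer

# Only separator differs from the defaults here.
params = DictVectorizer(separator="|").get_params()
# params maps each constructor argument name to its current value.
```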
inverse_transform(X, dict_type=<class 'dict'>)
Transform array or sparse matrix X back to feature mappings.
X must have been produced by this DictVectorizer’s transform or fit_transform method; it may only have passed through transformers that preserve the number of features and their order.
In the case of one-hot/one-of-K coding, the constructed feature names and values are returned rather than the original ones.
- Parameters
- X : {array-like, sparse matrix} of shape (n_samples, n_features)
Sample matrix.
- dict_type : type, default=dict
Constructor for feature mappings. Must conform to the collections.abc.Mapping API.
- Returns
- D : list of dict_type objects of shape (n_samples,)
Feature mappings for the samples in X.
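A short sketch of the round trip with one-hot coded features and a custom dict_type (the input data is made up; OrderedDict is just one possible mapping constructor):

```python
from collections import OrderedDict

from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)
X = vec.fit_transform([{"f": "ham", "n": 2}, {"f": "spam"}])

# Constructed names such as "f=ham" come back, not the original {"f": "ham"}.
# Zero entries are omitted from the returned mappings.
back = vec.inverse_transform(X, dict_type=OrderedDict)
```

The first returned mapping holds `'f=ham': 1.0` and `'n': 2.0`; the second holds only `'f=spam': 1.0`.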
restrict(support, indices=False)
Restrict the features to those in support using feature selection.
This function modifies the estimator in-place.
- Parameters
- support : array-like
Boolean mask or list of indices (as returned by the get_support member of feature selectors).
- indices : bool, default=False
Whether support is a list of indices.
- Returns
- self
Examples
>>> from sklearn.feature_extraction import DictVectorizer
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> v = DictVectorizer()
>>> D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
>>> X = v.fit_transform(D)
>>> support = SelectKBest(chi2, k=2).fit(X, [0, 1])
>>> v.get_feature_names()
['bar', 'baz', 'foo']
>>> v.restrict(support.get_support())
DictVectorizer()
>>> v.get_feature_names()
['bar', 'foo']
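The indices=True form can be sketched in the same way. Here the index list is written by hand rather than taken from a feature selector, purely for illustration:

```python
from sklearn.feature_extraction import DictVectorizer

D = [{"foo": 1, "bar": 2}, {"foo": 3, "baz": 1}]
vec = DictVectorizer(sparse=False).fit(D)
before = list(vec.feature_names_)  # ['bar', 'baz', 'foo']

# Keep columns 0 and 2; the vectorizer is modified in place and returned.
returned = vec.restrict([0, 2], indices=True)
after = vec.feature_names_         # ['bar', 'foo']

# Later transforms only produce the retained columns.
X = vec.transform(D)
```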
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **params : dict
Estimator parameters.
- Returns
- self : object
Estimator instance.
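A minimal sketch of both forms, assuming a pipeline whose step names `vect` and `clf` are chosen here only for illustration:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Simple estimator: parameters are set by name; the instance is returned.
vec = DictVectorizer()
same = vec.set_params(sort=False)

# Nested object: <component>__<parameter> reaches into a pipeline step.
pipe = Pipeline([("vect", DictVectorizer()), ("clf", LogisticRegression())])
pipe.set_params(vect__sparse=False, vect__separator=":")
vect = pipe.named_steps["vect"]
```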
transform(X)
Transform feature->value dicts to array or sparse matrix.
Named features not encountered during fit or fit_transform will be silently ignored.
- Parameters
- X : Mapping or iterable over Mappings of shape (n_samples,)
Dict(s) or Mapping(s) from feature names (arbitrary Python objects) to feature values (strings or convertible to dtype).
- Returns
- Xa : {array, sparse matrix}
Feature vectors; always 2-d.
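As a brief sketch of these points with the default sparse output (the input mappings are illustrative):

```python
from scipy import sparse
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer().fit([{"foo": 1, "bar": 2}, {"foo": 3, "baz": 1}])

# A single mapping is accepted and still yields a 2-d result.
# "unseen_feature" was not present during fit, so it is dropped silently.
row = vec.transform({"foo": 4, "unseen_feature": 3})
is_sparse = sparse.issparse(row)
dense = row.toarray()  # columns follow ['bar', 'baz', 'foo']
```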