TfidfVectorizer

class sklearn.feature_extraction.text.TfidfVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)

Convert a collection of raw documents to a matrix of TF-IDF features.

Equivalent to CountVectorizer followed by TfidfTransformer.
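The equivalence can be checked directly. The following is a minimal sketch, assuming default parameters for all three classes and reusing the small corpus from the Examples section; the variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# One-step route: raw text straight to TF-IDF weights.
X_direct = TfidfVectorizer().fit_transform(corpus)

# Two-step route: raw term counts first, then TF-IDF reweighting.
counts = CountVectorizer().fit_transform(corpus)
X_two_step = TfidfTransformer().fit_transform(counts)

# Both routes should agree up to floating-point error.
same = np.allclose(X_direct.toarray(), X_two_step.toarray())
print(same)
```

The two-step route is mainly useful when the raw counts are needed for something else as well; otherwise the single class is more convenient.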

See also

CountVectorizer: Transforms text into a sparse matrix of n-gram counts.
TfidfTransformer: Performs the TF-IDF transformation from a provided matrix of counts.

Examples

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> vectorizer.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], ...)
>>> print(X.shape)
(4, 9)

build_analyzer()

Return a callable to process input data.

The callable handles preprocessing, tokenization, and n-gram generation.

Returns:

analyzer: callable: A function to handle preprocessing, tokenization, and n-gram generation.
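As a rough illustration, the analyzer can be called on a single string without fitting the vectorizer first. This sketch assumes ngram_range=(1, 2) and otherwise default settings; the sample sentence is made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Unigrams and bigrams, with default lowercasing and token pattern.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
analyze = vectorizer.build_analyzer()

# The analyzer lowercases, tokenizes, then builds the n-grams.
tokens = analyze("The quick brown fox")
print(tokens)
# ['the', 'quick', 'brown', 'fox', 'the quick', 'quick brown', 'brown fox']
```

Calling the analyzer like this is a convenient way to inspect what a given parameter combination will actually feed into the vocabulary.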

build_preprocessor()

Return a function to preprocess the text before tokenization.

Returns:

preprocessor: callable: A function to preprocess the text before tokenization.
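A small sketch of what the preprocessor does, assuming strip_accents='unicode' together with the default lowercase=True; the input string is only an example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Lowercasing is on by default; accent stripping is enabled explicitly here.
vectorizer = TfidfVectorizer(strip_accents="unicode")
preprocess = vectorizer.build_preprocessor()

# The preprocessor works on the whole string, before any tokenization.
cleaned = preprocess("Café Déjà Vu")
print(cleaned)  # cafe deja vu
```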

build_tokenizer()

Return a function that splits a string into a sequence of tokens.

Returns:

tokenizer: callable: A function to split a string into a sequence of tokens.
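The sketch below shows the tokenizer built from the default token_pattern, which keeps runs of two or more word characters. It assumes no custom tokenizer is supplied; the sample sentence is illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tokenize = TfidfVectorizer().build_tokenizer()

# Single-character tokens such as "I" are dropped by the default pattern,
# and punctuation acts as a separator. Case is left untouched, because
# lowercasing belongs to the preprocessor, not the tokenizer.
tokens = tokenize("Hello, World! I am here.")
print(tokens)  # ['Hello', 'World', 'am', 'here']
```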

decode(doc)

Decode the input into a string of unicode symbols.

The decoding strategy depends on the vectorizer parameters.

Parameters:

doc: bytes or str: The string to decode.

Returns:

doc: str: A string of unicode symbols.
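A minimal sketch of decoding, assuming the default input='content' and a vectorizer configured with encoding='latin-1'; the byte string is constructed only for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(encoding="latin-1")

# Bytes are decoded using the vectorizer's encoding and decode_error settings.
raw = "café".encode("latin-1")
text = vectorizer.decode(raw)

# Input that is already a str is returned unchanged.
unchanged = vectorizer.decode("already text")
print(text, "|", unchanged)
```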

fit(raw_documents, y=None)

Learn vocabulary and idf from training set.

Parameters:

raw_documents: iterable: An iterable which generates either str, unicode or file objects.
y: None: This parameter is not needed to compute tfidf.

Returns:

self: object: Fitted vectorizer.
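After fitting, the learned state can be inspected through the vocabulary_ and idf_ attributes. The sketch below assumes default parameters; the expected idf values in it rely on the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 described in the scikit-learn user guide for smooth_idf=True.

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vectorizer = TfidfVectorizer().fit(corpus)

vocab = vectorizer.vocabulary_  # term -> column index
idf = vectorizer.idf_           # one idf weight per column

# "the" occurs in all 4 documents, "and" in only 1 of them.
idf_the = idf[vocab["the"]]
idf_and = idf[vocab["and"]]

# Values expected under the smoothed idf formula (n = 4 documents).
expected_the = math.log((1 + 4) / (1 + 4)) + 1
expected_and = math.log((1 + 4) / (1 + 1)) + 1
print(idf_the, idf_and)
```

Terms that appear in every document receive the lowest weight, while rarer terms are weighted up.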

fit_transform(raw_documents, y=None)

Learn vocabulary and idf, return document-term matrix.

This is equivalent to fit followed by transform, but more efficiently implemented.

Parameters:

raw_documents: iterable: An iterable which generates either str, unicode or file objects.
y: None: This parameter is ignored.

Returns:

X: sparse matrix of shape (n_samples, n_features): Tf-idf-weighted document-term matrix.
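The stated equivalence with fit followed by transform can be sketched as follows, assuming default parameters and a toy corpus chosen for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Single call on the training documents.
X_a = TfidfVectorizer().fit_transform(corpus)

# Separate fit and transform calls on the same documents.
X_b = TfidfVectorizer().fit(corpus).transform(corpus)

is_sparse = sparse.issparse(X_a)
same = np.allclose(X_a.toarray(), X_b.toarray())
print(is_sparse, same)
```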

get_feature_names_out(input_features=None)

Get output feature names for transformation.

Parameters:

input_features: array-like of str or None, default=None: Not used, present here for API consistency by convention.

Returns:

feature_names_out: ndarray of str objects: Transformed feature names.

get_metadata_routing()

Get metadata routing of this object.

See the User Guide for how the routing mechanism works.

Returns:

routing: MetadataRequest: A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep: bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params: dict: Parameter names mapped to their values.
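A short sketch of reading parameters back, assuming two constructor arguments are overridden and the rest are left at their defaults.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
params = vectorizer.get_params()

# Overridden values and untouched defaults appear in the same dict.
print(params["ngram_range"], params["min_df"], params["norm"])
```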

get_stop_words()

Build or fetch the effective stop words list.

Returns:

stop_words: list or None: A list of stop words.
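The effective stop words depend on the stop_words constructor argument. The sketch below compares three configurations; the custom words "foo" and "bar" are placeholders chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Default: no stop words at all.
none_case = TfidfVectorizer().get_stop_words()

# Built-in English list, selected by name.
english = TfidfVectorizer(stop_words="english").get_stop_words()

# Custom collection supplied by the caller.
custom = TfidfVectorizer(stop_words=["foo", "bar"]).get_stop_words()

print(none_case, "the" in english, sorted(custom))
```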

inverse_transform(X)

Return terms per document with nonzero entries in X.

Parameters:

X: {array-like, sparse matrix} of shape (n_samples, n_features): Document-term matrix.

Returns:

X_original: list of arrays of shape (n_samples,): List of arrays of terms.
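The following sketch recovers the terms present in each document from a fitted matrix, assuming default parameters. Only the set of terms is recovered: word order and counts are not.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# One array of terms per row of X.
terms_per_doc = vectorizer.inverse_transform(X)

# Compare as sets, since the order within each array is not the focus here.
first_doc_terms = set(terms_per_doc[0])
print(sorted(first_doc_terms))
```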

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params: dict: Estimator parameters.

Returns:

self: estimator instance: Estimator instance.
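The <component>__<parameter> form can be sketched with a small pipeline. The step names "tfidf" and "clf", and the choice of LogisticRegression as the final step, are assumptions made for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Direct use on a single estimator.
vec = TfidfVectorizer().set_params(min_df=2)

# Nested use: the step name, two underscores, then the parameter name.
pipe = Pipeline(
    [
        ("tfidf", TfidfVectorizer()),   # hypothetical step name
        ("clf", LogisticRegression()),  # hypothetical step name
    ]
)
pipe.set_params(tfidf__ngram_range=(1, 2), tfidf__sublinear_tf=True)

tfidf_step = pipe.named_steps["tfidf"]
print(vec.min_df, tfidf_step.ngram_range, tfidf_step.sublinear_tf)
```

The same naming scheme is what parameter grids for model selection use to address a step inside a pipeline.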

transform(raw_documents)

Transform documents to document-term matrix.

Uses the vocabulary and document frequencies (df) learned by fit (or fit_transform).

Parameters:

raw_documents: iterable: An iterable which generates either str, unicode or file objects.

Returns:

X: sparse matrix of shape (n_samples, n_features): Tf-idf-weighted document-term matrix.
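Because transform reuses the learned vocabulary, terms that were not seen during fit contribute nothing to the output. The sketch below, assuming default parameters and made-up new documents, illustrates this.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vectorizer = TfidfVectorizer().fit(corpus)

# "brand" and "new" are not in the fitted vocabulary, so only
# "this", "is" and "document" produce nonzero entries.
X_new = vectorizer.transform(["This is a brand new document."])

# A document made only of unseen words maps to an all-zero row.
X_unknown = vectorizer.transform(["completely unrelated words"])

# With the default norm='l2', a nonzero row has unit Euclidean length.
row_norm = np.linalg.norm(X_new.toarray())
print(X_new.shape, X_new.nnz, X_unknown.nnz)
```

The number of columns stays fixed at the size of the fitted vocabulary, whatever the new documents contain.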

Gallery examples

Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation

Biclustering documents with the Spectral Co-clustering algorithm

Column Transformer with Heterogeneous Data Sources

Sample pipeline for text feature extraction and evaluation

Classification of text documents using sparse features

Clustering text documents using k-means

FeatureHasher and DictVectorizer Comparison