Pipeline ANOVA SVM

This example shows how feature selection can be easily integrated into a machine learning pipeline.

We also show how to inspect individual steps of the pipeline.

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

We start by generating a binary classification dataset, then split it into a training set and a test set.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_features=20,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    n_clusters_per_class=2,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

A common mistake with feature selection is to search for a subset of discriminative features on the full dataset instead of using only the training set. Using a scikit-learn Pipeline prevents this mistake, because the selector is fitted only on the data passed to fit.
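
To make the leakage concrete, here is a minimal sketch on pure noise, where no feature is truly predictive. The array names (X_noise, y_noise), the sample and feature counts, and k=20 are illustrative assumptions and are not part of this example. Selecting features on all samples before cross-validation tends to give an optimistic score, while the same selection inside a pipeline is refit on each training fold.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Pure noise (illustrative data): no feature carries information about y.
rng = np.random.RandomState(0)
X_noise = rng.normal(size=(50, 5000))
y_noise = np.array([0, 1] * 25)

# Leaky protocol: select features using every sample, then cross-validate.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_scores = cross_val_score(LinearSVC(), X_leaky, y_noise, cv=5)

# Sound protocol: the selector is refit on the training part of each fold.
honest_pipeline = make_pipeline(SelectKBest(f_classif, k=20), LinearSVC())
honest_scores = cross_val_score(honest_pipeline, X_noise, y_noise, cv=5)

print(f"selection on all samples:  {leaky_scores.mean():.2f}")
print(f"selection inside pipeline: {honest_scores.mean():.2f}")
```

Since the labels are unrelated to the features, an accuracy well above chance in the first line would be an artifact of the selection step having seen the held-out samples.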

Here, we build a pipeline whose first step is the feature selection.

When fit is called on the training data, a subset of features is selected and the indices of the selected features are stored. The feature selector then reduces the number of features and passes this subset to the classifier, which is trained on it.

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

anova_filter = SelectKBest(f_classif, k=3)
clf = LinearSVC()
anova_svm = make_pipeline(anova_filter, clf)
anova_svm.fit(X_train, y_train)

Pipeline(steps=[('selectkbest', SelectKBest(k=3)), ('linearsvc', LinearSVC())])
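
As a rough illustration of what fitting the pipeline does, the two steps can be written out by hand. This is only a sketch of the equivalent sequence, not how Pipeline is implemented internally; manual_filter, manual_clf and X_train_reduced are hypothetical names introduced here.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Same data-generation settings as in the example.
X, y = make_classification(
    n_features=20,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    n_clusters_per_class=2,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Step 1: score the features on the training data only and keep the best three.
manual_filter = SelectKBest(f_classif, k=3).fit(X_train, y_train)
X_train_reduced = manual_filter.transform(X_train)

# Step 2: train the classifier on the reduced training data.
manual_clf = LinearSVC().fit(X_train_reduced, y_train)

print(X_train.shape, "->", X_train_reduced.shape)
# Indices of the selected columns, as stored by the fitted selector.
print(manual_filter.get_support(indices=True))
```

The pipeline saves you from carrying the intermediate array around and guarantees that the same selector is reused at prediction time.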

Once training is complete, we can predict on new, unseen samples. In this case, the feature selector keeps only the features it selected during training, using the information stored at fit time. The reduced data is then passed to the classifier, which makes the prediction.

Here, we show the final metrics via a classification report.

from sklearn.metrics import classification_report

y_pred = anova_svm.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.80      0.86        15
           1       0.75      0.90      0.82        10

    accuracy                           0.84        25
   macro avg       0.84      0.85      0.84        25
weighted avg       0.85      0.84      0.84        25
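
The prediction path described above can be checked by hand. The sketch below refits the same pipeline, applies the selection step and the classifier separately, and compares the result with the pipeline's own prediction; X_test_reduced and y_pred_manual are hypothetical names used for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(
    n_features=20,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    n_clusters_per_class=2,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

anova_svm = make_pipeline(SelectKBest(f_classif, k=3), LinearSVC())
anova_svm.fit(X_train, y_train)

# anova_svm[:-1] is a sub-pipeline holding every step except the classifier.
X_test_reduced = anova_svm[:-1].transform(X_test)
y_pred_manual = anova_svm[-1].predict(X_test_reduced)

# The pipeline chains the same two operations in a single call.
y_pred = anova_svm.predict(X_test)
same_predictions = np.array_equal(y_pred, y_pred_manual)
print(same_predictions)
```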

Note that you can inspect any step in the pipeline. For instance, we might be interested in the parameters of the classifier. Since we selected three features, we expect three coefficients.

anova_svm[-1].coef_

array([[0.75788833, 0.27161955, 0.26113448]])
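
Indexing by position is one of several ways to reach a step. As a small sketch, the same fitted classifier can also be retrieved by the step name that make_pipeline generated ('linearsvc' in the representation printed earlier); the variable names below are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(
    n_features=20,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    n_clusters_per_class=2,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

anova_svm = make_pipeline(SelectKBest(f_classif, k=3), LinearSVC())
anova_svm.fit(X_train, y_train)

# Three ways to access the last step; all return the same fitted object.
clf_by_position = anova_svm[-1]
clf_by_name = anova_svm.named_steps["linearsvc"]
clf_by_key = anova_svm["linearsvc"]
print(clf_by_position is clf_by_name, clf_by_name is clf_by_key)

# The first step can be inspected the same way, e.g. its per-feature scores.
selector = anova_svm["selectkbest"]
print(selector.scores_.shape)
```

Access by name can be more robust than access by position when steps are later added to or removed from the pipeline.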

However, we do not know which features were selected from the original dataset. There are several ways to find out. Here, we invert the transformation of these coefficients to map them back to the original feature space.

anova_svm[:-1].inverse_transform(anova_svm[-1].coef_)

array([[0.        , 0.        , 0.75788833, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.27161955,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.26113448]])

We can see that the features with non-zero coefficients are the ones selected by the first step.
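
Another possible way to recover the selected features, sketched here as an alternative to the inverse transform, is to ask the selector directly for the indices it kept and pair them with the coefficients. The names selected_idx, coefs and coef_full are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(
    n_features=20,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    n_clusters_per_class=2,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

anova_svm = make_pipeline(SelectKBest(f_classif, k=3), LinearSVC())
anova_svm.fit(X_train, y_train)

# Column indices (in the original 20-feature space) kept by the selector.
selected_idx = anova_svm[0].get_support(indices=True)
coefs = anova_svm[-1].coef_.ravel()
for idx, coef in zip(selected_idx, coefs):
    print(f"feature {idx}: coefficient {coef:.3f}")

# Cross-check: the inverse transform places coefficients at the same columns.
coef_full = anova_svm[:-1].inverse_transform(anova_svm[-1].coef_)
print(np.flatnonzero(coef_full))
```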
