chi2#

sklearn.feature_selection.chi2(X, y)[source]#

Compute chi-squared stats between each non-negative feature and class.

This score can be used to select the n_features features with the highest values for the test chi-squared statistic from X, which must contain only non-negative integer feature values such as booleans or frequencies (e.g., term counts in document classification), relative to the classes.

If some of your features are continuous, you need to bin them, for example by using KBinsDiscretizer.

Recall that the chi-square test measures dependence between stochastic variables, so using this function “weeds out” the features that are the most likely to be independent of class and therefore irrelevant for classification.

See also

f_classif: ANOVA F-value between label/feature for classification tasks.
f_regression: F-value between label/feature for regression tasks.

Notes

Complexity of this algorithm is O(n_classes * n_features).

Examples

>>> import numpy as np
>>> from sklearn.feature_selection import chi2
>>> X = np.array([[1, 1, 3],
...               [0, 1, 5],
...               [5, 4, 1],
...               [6, 6, 2],
...               [1, 4, 0],
...               [0, 0, 0]])
>>> y = np.array([1, 1, 0, 0, 2, 2])
>>> chi2_stats, p_values = chi2(X, y)
>>> chi2_stats
array([15.3,  6.5       ,  8.9])
>>> p_values
array([0.000456, 0.0387, 0.0116 ])

Gallery examples#

Column Transformer with Mixed Types