sklearn.feature_selection.f_regression
- sklearn.feature_selection.f_regression(X, y, *, center=True)
Univariate linear regression tests returning F-statistic and p-values.
Quick linear model for testing the effect of a single regressor, sequentially for many regressors.
This is done in 2 steps:
1. The cross correlation between each regressor and the target is computed, that is, mean((X[:, i] - mean(X[:, i])) * (y - mean(y))) / (std(X[:, i]) * std(y)), using the r_regression function.
2. It is converted to an F score and then to a p-value.
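The two steps above can be reproduced by hand as a rough sketch. The data below is synthetic and purely illustrative, and the conversion used here, F = r² / (1 - r²) * (n_samples - 2) with the p-value taken from the survival function of an F(1, n_samples - 2) distribution, is the standard relation between a Pearson correlation and a simple-regression F-test. It is assumed here to match what f_regression does with the default center=True.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_regression

# Synthetic, illustrative data: two informative features and one noise feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=50)

# Step 1: Pearson correlation between each column of X and y.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
r = (Xc * yc[:, None]).mean(axis=0) / (X.std(axis=0) * y.std())

# Step 2: convert the correlation to an F score, then to a p-value.
# Assumption: n_samples - 2 degrees of freedom (centered case).
dof = X.shape[0] - 2
f_manual = r**2 / (1.0 - r**2) * dof
p_manual = stats.f.sf(f_manual, 1, dof)

# Compare against the library function.
f_ref, p_ref = f_regression(X, y)
print(np.allclose(f_manual, f_ref), np.allclose(p_manual, p_ref))
```
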
f_regression is derived from r_regression and will rank features in the same order if all the features are positively correlated with the target.
Note however that, contrary to f_regression, r_regression values lie in [-1, 1] and can thus be negative. f_regression is therefore recommended as a feature selection criterion to identify potentially predictive features for a downstream classifier, irrespective of the sign of the association with the target variable.
Furthermore, f_regression returns p-values while r_regression does not.
Read more in the User Guide.
- Parameters
- X : {array-like, sparse matrix} of shape (n_samples, n_features)
The data matrix.
- y : array-like of shape (n_samples,)
The target vector.
- center : bool, default=True
Whether or not to center the data matrix X and the target vector y. By default, X and y will be centered.
- Returns
- f_statistic : ndarray of shape (n_features,)
F-statistic for each feature.
- p_values : ndarray of shape (n_features,)
P-values associated with the F-statistic.
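A short usage sketch of the two returned arrays. The dataset comes from make_regression, and the 0.05 threshold is an arbitrary illustrative choice, not a recommendation from this page.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression

# Synthetic regression problem with 5 features, 2 of them informative.
X, y = make_regression(
    n_samples=100, n_features=5, n_informative=2, noise=1.0, random_state=0
)

f_statistic, p_values = f_regression(X, y)

# One score and one p-value per feature.
print(f_statistic.shape, p_values.shape)

# Indices of features whose p-value falls below an illustrative threshold.
alpha = 0.05
significant = np.flatnonzero(p_values < alpha)
print("features with p <", alpha, ":", significant)
```
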
See also
r_regression
Pearson’s R between label/feature for regression tasks.
f_classif
ANOVA F-value between label/feature for classification tasks.
chi2
Chi-squared stats of non-negative features for classification tasks.
SelectKBest
Select features based on the k highest scores.
SelectFpr
Select features based on a false positive rate test.
SelectFdr
Select features based on an estimated false discovery rate.
SelectFwe
Select features based on family-wise error rate.
SelectPercentile
Select features based on percentile of the highest scores.
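As a sketch of how f_regression is typically combined with the selectors listed above, it can be passed as the score_func of SelectKBest. The dataset and the choice of k=3 are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression problem with 10 features, 3 of them informative.
X, y = make_regression(
    n_samples=100, n_features=10, n_informative=3, random_state=0
)

# Keep the 3 features with the highest F scores.
selector = SelectKBest(score_func=f_regression, k=3)
X_new = selector.fit_transform(X, y)

# Indices of the retained columns, plus the per-feature scores and p-values.
kept = selector.get_support(indices=True)
print("kept feature indices:", kept)
print("reduced shape:", X_new.shape)
print("scores:", selector.scores_)
print("p-values:", selector.pvalues_)
```
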