.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/decomposition/plot_pca_vs_fa_model_selection.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. or to run this example in your browser via JupyterLite or Binder .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_decomposition_plot_pca_vs_fa_model_selection.py: =============================================================== Model selection with Probabilistic PCA and Factor Analysis (FA) =============================================================== Probabilistic PCA and Factor Analysis are probabilistic models. The consequence is that the likelihood of new data can be used for model selection and covariance estimation. Here we compare PCA and FA with cross-validation on low rank data corrupted with homoscedastic noise (noise variance is the same for each feature) or heteroscedastic noise (noise variance is the different for each feature). In a second step we compare the model likelihood to the likelihoods obtained from shrinkage covariance estimators. One can observe that with homoscedastic noise both FA and PCA succeed in recovering the size of the low rank subspace. The likelihood with PCA is higher than FA in this case. However PCA fails and overestimates the rank when heteroscedastic noise is present. Under appropriate circumstances (choice of the number of components), the held-out data is more likely for low rank models than for shrinkage models. The automatic estimation from Automatic Choice of Dimensionality for PCA. NIPS 2000: 598-604 by Thomas P. Minka is also compared. .. GENERATED FROM PYTHON SOURCE LINES 27-31 .. code-block:: Python # Authors: The scikit-learn developers # SPDX-License-Identifier: BSD-3-Clause .. GENERATED FROM PYTHON SOURCE LINES 32-34 Create the data --------------- .. GENERATED FROM PYTHON SOURCE LINES 34-51 .. code-block:: Python import numpy as np from scipy import linalg n_samples, n_features, rank = 500, 25, 5 sigma = 1.0 rng = np.random.RandomState(42) U, _, _ = linalg.svd(rng.randn(n_features, n_features)) X = np.dot(rng.randn(n_samples, rank), U[:, :rank].T) # Adding homoscedastic noise X_homo = X + sigma * rng.randn(n_samples, n_features) # Adding heteroscedastic noise sigmas = sigma * rng.rand(n_features) + sigma / 2.0 X_hetero = X + rng.randn(n_samples, n_features) * sigmas .. GENERATED FROM PYTHON SOURCE LINES 52-54 Fit the models -------------- .. GENERATED FROM PYTHON SOURCE LINES 54-144 .. code-block:: Python import matplotlib.pyplot as plt from sklearn.covariance import LedoitWolf, ShrunkCovariance from sklearn.decomposition import PCA, FactorAnalysis from sklearn.model_selection import GridSearchCV, cross_val_score n_components = np.arange(0, n_features, 5) # options for n_components def compute_scores(X): pca = PCA(svd_solver="full") fa = FactorAnalysis() pca_scores, fa_scores = [], [] for n in n_components: pca.n_components = n fa.n_components = n pca_scores.append(np.mean(cross_val_score(pca, X))) fa_scores.append(np.mean(cross_val_score(fa, X))) return pca_scores, fa_scores def shrunk_cov_score(X): shrinkages = np.logspace(-2, 0, 30) cv = GridSearchCV(ShrunkCovariance(), {"shrinkage": shrinkages}) return np.mean(cross_val_score(cv.fit(X).best_estimator_, X)) def lw_score(X): return np.mean(cross_val_score(LedoitWolf(), X)) for X, title in [(X_homo, "Homoscedastic Noise"), (X_hetero, "Heteroscedastic Noise")]: pca_scores, fa_scores = compute_scores(X) n_components_pca = n_components[np.argmax(pca_scores)] n_components_fa = n_components[np.argmax(fa_scores)] pca = PCA(svd_solver="full", n_components="mle") pca.fit(X) n_components_pca_mle = pca.n_components_ print("best n_components by PCA CV = %d" % n_components_pca) print("best n_components by FactorAnalysis CV = %d" % n_components_fa) print("best n_components by PCA MLE = %d" % n_components_pca_mle) plt.figure() plt.plot(n_components, pca_scores, "b", label="PCA scores") plt.plot(n_components, fa_scores, "r", label="FA scores") plt.axvline(rank, color="g", label="TRUTH: %d" % rank, linestyle="-") plt.axvline( n_components_pca, color="b", label="PCA CV: %d" % n_components_pca, linestyle="--", ) plt.axvline( n_components_fa, color="r", label="FactorAnalysis CV: %d" % n_components_fa, linestyle="--", ) plt.axvline( n_components_pca_mle, color="k", label="PCA MLE: %d" % n_components_pca_mle, linestyle="--", ) # compare with other covariance estimators plt.axhline( shrunk_cov_score(X), color="violet", label="Shrunk Covariance MLE", linestyle="-.", ) plt.axhline( lw_score(X), color="orange", label="LedoitWolf MLE" % n_components_pca_mle, linestyle="-.", ) plt.xlabel("nb of components") plt.ylabel("CV scores") plt.legend(loc="lower right") plt.title(title) plt.show() .. rst-class:: sphx-glr-horizontal * .. image-sg:: /auto_examples/decomposition/images/sphx_glr_plot_pca_vs_fa_model_selection_001.png :alt: Homoscedastic Noise :srcset: /auto_examples/decomposition/images/sphx_glr_plot_pca_vs_fa_model_selection_001.png :class: sphx-glr-multi-img * .. image-sg:: /auto_examples/decomposition/images/sphx_glr_plot_pca_vs_fa_model_selection_002.png :alt: Heteroscedastic Noise :srcset: /auto_examples/decomposition/images/sphx_glr_plot_pca_vs_fa_model_selection_002.png :class: sphx-glr-multi-img .. rst-class:: sphx-glr-script-out .. code-block:: none best n_components by PCA CV = 5 best n_components by FactorAnalysis CV = 5 best n_components by PCA MLE = 5 best n_components by PCA CV = 20 best n_components by FactorAnalysis CV = 5 best n_components by PCA MLE = 18 .. rst-class:: sphx-glr-timing **Total running time of the script:** (0 minutes 3.231 seconds) .. _sphx_glr_download_auto_examples_decomposition_plot_pca_vs_fa_model_selection.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: binder-badge .. image:: images/binder_badge_logo.svg :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/1.6.X?urlpath=lab/tree/notebooks/auto_examples/decomposition/plot_pca_vs_fa_model_selection.ipynb :alt: Launch binder :width: 150 px .. container:: lite-badge .. image:: images/jupyterlite_badge_logo.svg :target: ../../lite/lab/index.html?path=auto_examples/decomposition/plot_pca_vs_fa_model_selection.ipynb :alt: Launch JupyterLite :width: 150 px .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: plot_pca_vs_fa_model_selection.ipynb ` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: plot_pca_vs_fa_model_selection.py ` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: plot_pca_vs_fa_model_selection.zip ` .. include:: plot_pca_vs_fa_model_selection.recommendations .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_