.. _sphx_glr_auto_examples_cross_decomposition_plot_compare_cross_decomposition.py: =================================== Compare cross decomposition methods =================================== Simple usage of various cross decomposition algorithms: - PLSCanonical - PLSRegression, with multivariate response, a.k.a. PLS2 - PLSRegression, with univariate response, a.k.a. PLS1 - CCA Given 2 multivariate covarying two-dimensional datasets, X, and Y, PLS extracts the 'directions of covariance', i.e. the components of each datasets that explain the most shared variance between both datasets. This is apparent on the **scatterplot matrix** display: components 1 in dataset X and dataset Y are maximally correlated (points lie around the first diagonal). This is also true for components 2 in both dataset, however, the correlation across datasets for different components is weak: the point cloud is very spherical. .. code-block:: python print(__doc__) import numpy as np import matplotlib.pyplot as plt from sklearn.cross_decomposition import PLSCanonical, PLSRegression, CCA Dataset based latent variables model .. code-block:: python n = 500 # 2 latents vars: l1 = np.random.normal(size=n) l2 = np.random.normal(size=n) latents = np.array([l1, l1, l2, l2]).T X = latents + np.random.normal(size=4 * n).reshape((n, 4)) Y = latents + np.random.normal(size=4 * n).reshape((n, 4)) X_train = X[:n // 2] Y_train = Y[:n // 2] X_test = X[n // 2:] Y_test = Y[n // 2:] print("Corr(X)") print(np.round(np.corrcoef(X.T), 2)) print("Corr(Y)") print(np.round(np.corrcoef(Y.T), 2)) .. rst-class:: sphx-glr-script-out Out:: Corr(X) [[ 1. 0.52 -0.03 0. ] [ 0.52 1. 0.02 -0.01] [-0.03 0.02 1. 0.45] [ 0. -0.01 0.45 1. ]] Corr(Y) [[ 1. 0.52 0.01 -0.01] [ 0.52 1. 0. 0.06] [ 0.01 0. 1. 0.52] [-0.01 0.06 0.52 1. ]] Canonical (symmetric) PLS .. code-block:: python # Transform data # ~~~~~~~~~~~~~~ plsca = PLSCanonical(n_components=2) plsca.fit(X_train, Y_train) X_train_r, Y_train_r = plsca.transform(X_train, Y_train) X_test_r, Y_test_r = plsca.transform(X_test, Y_test) # Scatter plot of scores # ~~~~~~~~~~~~~~~~~~~~~~ # 1) On diagonal plot X vs Y scores on each components plt.figure(figsize=(12, 8)) plt.subplot(221) plt.plot(X_train_r[:, 0], Y_train_r[:, 0], "ob", label="train") plt.plot(X_test_r[:, 0], Y_test_r[:, 0], "or", label="test") plt.xlabel("x scores") plt.ylabel("y scores") plt.title('Comp. 1: X vs Y (test corr = %.2f)' % np.corrcoef(X_test_r[:, 0], Y_test_r[:, 0])[0, 1]) plt.xticks(()) plt.yticks(()) plt.legend(loc="best") plt.subplot(224) plt.plot(X_train_r[:, 1], Y_train_r[:, 1], "ob", label="train") plt.plot(X_test_r[:, 1], Y_test_r[:, 1], "or", label="test") plt.xlabel("x scores") plt.ylabel("y scores") plt.title('Comp. 2: X vs Y (test corr = %.2f)' % np.corrcoef(X_test_r[:, 1], Y_test_r[:, 1])[0, 1]) plt.xticks(()) plt.yticks(()) plt.legend(loc="best") # 2) Off diagonal plot components 1 vs 2 for X and Y plt.subplot(222) plt.plot(X_train_r[:, 0], X_train_r[:, 1], "*b", label="train") plt.plot(X_test_r[:, 0], X_test_r[:, 1], "*r", label="test") plt.xlabel("X comp. 1") plt.ylabel("X comp. 2") plt.title('X comp. 1 vs X comp. 2 (test corr = %.2f)' % np.corrcoef(X_test_r[:, 0], X_test_r[:, 1])[0, 1]) plt.legend(loc="best") plt.xticks(()) plt.yticks(()) plt.subplot(223) plt.plot(Y_train_r[:, 0], Y_train_r[:, 1], "*b", label="train") plt.plot(Y_test_r[:, 0], Y_test_r[:, 1], "*r", label="test") plt.xlabel("Y comp. 1") plt.ylabel("Y comp. 2") plt.title('Y comp. 1 vs Y comp. 2 , (test corr = %.2f)' % np.corrcoef(Y_test_r[:, 0], Y_test_r[:, 1])[0, 1]) plt.legend(loc="best") plt.xticks(()) plt.yticks(()) plt.show() .. image:: /auto_examples/cross_decomposition/images/sphx_glr_plot_compare_cross_decomposition_001.png :align: center PLS regression, with multivariate response, a.k.a. PLS2 .. code-block:: python n = 1000 q = 3 p = 10 X = np.random.normal(size=n * p).reshape((n, p)) B = np.array([[1, 2] + [0] * (p - 2)] * q).T # each Yj = 1*X1 + 2*X2 + noize Y = np.dot(X, B) + np.random.normal(size=n * q).reshape((n, q)) + 5 pls2 = PLSRegression(n_components=3) pls2.fit(X, Y) print("True B (such that: Y = XB + Err)") print(B) # compare pls2.coef_ with B print("Estimated B") print(np.round(pls2.coef_, 1)) pls2.predict(X) .. rst-class:: sphx-glr-script-out Out:: True B (such that: Y = XB + Err) [[1 1 1] [2 2 2] [0 0 0] [0 0 0] [0 0 0] [0 0 0] [0 0 0] [0 0 0] [0 0 0] [0 0 0]] Estimated B [[ 1. 1. 1. ] [ 2. 2. 2. ] [-0. 0. -0. ] [-0. -0. -0. ] [-0. -0. -0. ] [ 0. -0.1 -0. ] [-0. -0.1 0. ] [-0. 0. -0. ] [ 0. 0. 0. ] [ 0. 0. 0. ]] PLS regression, with univariate response, a.k.a. PLS1 .. code-block:: python n = 1000 p = 10 X = np.random.normal(size=n * p).reshape((n, p)) y = X[:, 0] + 2 * X[:, 1] + np.random.normal(size=n * 1) + 5 pls1 = PLSRegression(n_components=3) pls1.fit(X, y) # note that the number of components exceeds 1 (the dimension of y) print("Estimated betas") print(np.round(pls1.coef_, 1)) .. rst-class:: sphx-glr-script-out Out:: Estimated betas [[ 1. ] [ 2. ] [-0. ] [ 0. ] [-0. ] [ 0. ] [ 0.1] [-0.1] [-0. ] [ 0. ]] CCA (PLS mode B with symmetric deflation) .. code-block:: python cca = CCA(n_components=2) cca.fit(X_train, Y_train) X_train_r, Y_train_r = plsca.transform(X_train, Y_train) X_test_r, Y_test_r = plsca.transform(X_test, Y_test) **Total running time of the script:** (0 minutes 0.322 seconds) .. container:: sphx-glr-download **Download Python source code:** :download:`plot_compare_cross_decomposition.py ` .. container:: sphx-glr-download **Download IPython notebook:** :download:`plot_compare_cross_decomposition.ipynb `