.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_gaussian_process_plot_gpr_on_structured_data.py>`     to download the full example code or to run this example in your browser via Binder
    .. rst-class:: sphx-glr-example-title

    .. _sphx_glr_auto_examples_gaussian_process_plot_gpr_on_structured_data.py:


==========================================================================
Gaussian processes on discrete data structures
==========================================================================

This example illustrates the use of Gaussian processes for regression and
classification tasks on data that are not in fixed-length feature vector form.
This is achieved through the use of kernel functions that operates directly
on discrete structures such as variable-length sequences, trees, and graphs.

Specifically, here the input variables are some gene sequences stored as
variable-length strings consisting of letters 'A', 'T', 'C', and 'G',
while the output variables are floating point numbers and True/False labels
in the regression and classification tasks, respectively.

A kernel between the gene sequences is defined using R-convolution [1]_ by
integrating a binary letter-wise kernel over all pairs of letters among a pair
of strings.

This example will generate three figures.

In the first figure, we visualize the value of the kernel, i.e. the similarity
of the sequences, using a colormap. Brighter color here indicates higher
similarity.

In the second figure, we show some regression result on a dataset of 6
sequences. Here we use the 1st, 2nd, 4th, and 5th sequences as the training set
to make predictions on the 3rd and 6th sequences.

In the third figure, we demonstrate a classification model by training on 6
sequences and make predictions on another 5 sequences. The ground truth here is
simply  whether there is at least one 'A' in the sequence. Here the model makes
four correct classifications and fails on one.

.. [1] Haussler, D. (1999). Convolution kernels on discrete structures
       (Vol. 646). Technical report, Department of Computer Science, University
       of California at Santa Cruz.


.. rst-class:: sphx-glr-horizontal


    *

      .. image:: /auto_examples/gaussian_process/images/sphx_glr_plot_gpr_on_structured_data_001.png
          :alt: Sequence similarity under the kernel
          :class: sphx-glr-multi-img

    *

      .. image:: /auto_examples/gaussian_process/images/sphx_glr_plot_gpr_on_structured_data_002.png
          :alt: Regression on sequences
          :class: sphx-glr-multi-img

    *

      .. image:: /auto_examples/gaussian_process/images/sphx_glr_plot_gpr_on_structured_data_003.png
          :alt: Classification on sequences
          :class: sphx-glr-multi-img


.. code-block:: default

    print(__doc__)

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.gaussian_process.kernels import Kernel, Hyperparameter
    from sklearn.gaussian_process.kernels import GenericKernelMixin
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process import GaussianProcessClassifier
    from sklearn.base import clone


    class SequenceKernel(GenericKernelMixin, Kernel):
        '''
        A minimal (but valid) convolutional kernel for sequences of variable
        lengths.'''
        def __init__(self,
                     baseline_similarity=0.5,
                     baseline_similarity_bounds=(1e-5, 1)):
            self.baseline_similarity = baseline_similarity
            self.baseline_similarity_bounds = baseline_similarity_bounds

        @property
        def hyperparameter_baseline_similarity(self):
            return Hyperparameter("baseline_similarity",
                                  "numeric",
                                  self.baseline_similarity_bounds)

        def _f(self, s1, s2):
            '''
            kernel value between a pair of sequences
            '''
            return sum([1.0 if c1 == c2 else self.baseline_similarity
                       for c1 in s1
                       for c2 in s2])

        def _g(self, s1, s2):
            '''
            kernel derivative between a pair of sequences
            '''
            return sum([0.0 if c1 == c2 else 1.0
                        for c1 in s1
                        for c2 in s2])

        def __call__(self, X, Y=None, eval_gradient=False):
            if Y is None:
                Y = X

            if eval_gradient:
                return (np.array([[self._f(x, y) for y in Y] for x in X]),
                        np.array([[[self._g(x, y)] for y in Y] for x in X]))
            else:
                return np.array([[self._f(x, y) for y in Y] for x in X])

        def diag(self, X):
            return np.array([self._f(x, x) for x in X])

        def is_stationary(self):
            return False

        def clone_with_theta(self, theta):
            cloned = clone(self)
            cloned.theta = theta
            return cloned


    kernel = SequenceKernel()

    '''
    Sequence similarity matrix under the kernel
    ===========================================
    '''

    X = np.array(['AGCT', 'AGC', 'AACT', 'TAA', 'AAA', 'GAACA'])

    K = kernel(X)
    D = kernel.diag(X)

    plt.figure(figsize=(8, 5))
    plt.imshow(np.diag(D**-0.5).dot(K).dot(np.diag(D**-0.5)))
    plt.xticks(np.arange(len(X)), X)
    plt.yticks(np.arange(len(X)), X)
    plt.title('Sequence similarity under the kernel')

    '''
    Regression
    ==========
    '''

    X = np.array(['AGCT', 'AGC', 'AACT', 'TAA', 'AAA', 'GAACA'])
    Y = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])

    training_idx = [0, 1, 3, 4]
    gp = GaussianProcessRegressor(kernel=kernel)
    gp.fit(X[training_idx], Y[training_idx])

    plt.figure(figsize=(8, 5))
    plt.bar(np.arange(len(X)), gp.predict(X), color='b', label='prediction')
    plt.bar(training_idx, Y[training_idx], width=0.2, color='r',
            alpha=1, label='training')
    plt.xticks(np.arange(len(X)), X)
    plt.title('Regression on sequences')
    plt.legend()

    '''
    Classification
    ==============
    '''

    X_train = np.array(['AGCT', 'CGA', 'TAAC', 'TCG', 'CTTT', 'TGCT'])
    # whether there are 'A's in the sequence
    Y_train = np.array([True, True, True, False, False, False])

    gp = GaussianProcessClassifier(kernel)
    gp.fit(X_train, Y_train)

    X_test = ['AAA', 'ATAG', 'CTC', 'CT', 'C']
    Y_test = [True, True, False, False, False]

    plt.figure(figsize=(8, 5))
    plt.scatter(np.arange(len(X_train)), [1.0 if c else -1.0 for c in Y_train],
                s=100, marker='o', edgecolor='none', facecolor=(1, 0.75, 0),
                label='training')
    plt.scatter(len(X_train) + np.arange(len(X_test)),
                [1.0 if c else -1.0 for c in Y_test],
                s=100, marker='o', edgecolor='none', facecolor='r', label='truth')
    plt.scatter(len(X_train) + np.arange(len(X_test)),
                [1.0 if c else -1.0 for c in gp.predict(X_test)],
                s=100, marker='x', edgecolor=(0, 1.0, 0.3), linewidth=2,
                label='prediction')
    plt.xticks(np.arange(len(X_train) + len(X_test)),
               np.concatenate((X_train, X_test)))
    plt.yticks([-1, 1], [False, True])
    plt.title('Classification on sequences')
    plt.legend()

    plt.show()


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  0.279 seconds)


.. _sphx_glr_download_auto_examples_gaussian_process_plot_gpr_on_structured_data.py:


.. only :: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example


  .. container:: binder-badge

    .. image:: https://mybinder.org/badge_logo.svg
      :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/0.23.X?urlpath=lab/tree/notebooks/auto_examples/gaussian_process/plot_gpr_on_structured_data.ipynb
      :width: 150 px


  .. container:: sphx-glr-download sphx-glr-download-python

     :download:`Download Python source code: plot_gpr_on_structured_data.py <plot_gpr_on_structured_data.py>`


  .. container:: sphx-glr-download sphx-glr-download-jupyter

     :download:`Download Jupyter notebook: plot_gpr_on_structured_data.ipynb <plot_gpr_on_structured_data.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_