pairwise_distances_chunked#

sklearn.metrics.pairwise_distances_chunked(X, Y=None, *, reduce_func=None, metric='euclidean', n_jobs=None, working_memory=None, **kwds)[source]#

Generate a distance matrix chunk by chunk with optional reduction.

In cases where not all of a pairwise distance matrix needs to be stored at once, this is used to calculate pairwise distances in working_memory-sized chunks. If reduce_func is given, it is run on each chunk and its return values are concatenated into lists, arrays or sparse matrices.

Parameters:

X{array-like, sparse matrix} of shape (n_samples_X, n_samples_X) or (n_samples_X, n_features)

Array of pairwise distances between samples, or a feature array. The shape the array should be (n_samples_X, n_samples_X) if metric=’precomputed’ and (n_samples_X, n_features) otherwise.

Y{array-like, sparse matrix} of shape (n_samples_Y, n_features), default=None

An optional second feature array. Only allowed if metric != “precomputed”.

reduce_funccallable, default=None

The function which is applied on each chunk of the distance matrix, reducing it to needed values. reduce_func(D_chunk, start) is called repeatedly, where D_chunk is a contiguous vertical slice of the pairwise distance matrix, starting at row start. It should return one of: None; an array, a list, or a sparse matrix of length D_chunk.shape[0]; or a tuple of such objects. Returning None is useful for in-place operations, rather than reductions.

If None, pairwise_distances_chunked returns a generator of vertical chunks of the distance matrix.

metricstr or callable, default=’euclidean’

The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by scipy.spatial.distance.pdist for its metric parameter, or a metric listed in pairwise.PAIRWISE_DISTANCE_FUNCTIONS. If metric is “precomputed”, X is assumed to be a distance matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays from X as input and return a value indicating the distance between them.

n_jobsint, default=None

The number of jobs to use for the computation. This works by breaking down the pairwise matrix into n_jobs even slices and computing them in parallel.

None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

working_memoryfloat, default=None

The sought maximum memory for temporary distance matrix chunks. When None (default), the value of sklearn.get_config()['working_memory'] is used.

**kwdsoptional keyword parameters

Any further parameters are passed directly to the distance function. If using a scipy.spatial.distance metric, the parameters are still metric dependent. See the scipy docs for usage examples.

Yields:

D_chunk{ndarray, sparse matrix}: A contiguous slice of distance matrix, optionally processed by reduce_func.

Examples

Without reduce_func:

>>> import numpy as np
>>> from sklearn.metrics import pairwise_distances_chunked
>>> X = np.random.RandomState(0).rand(5, 3)
>>> D_chunk = next(pairwise_distances_chunked(X))
>>> D_chunk
array([[0.   , 0.295, 0.417, 0.197, 0.572],
       [0.295, 0.   , 0.576, 0.419, 0.764],
       [0.417, 0.576, 0.   , 0.449, 0.903],
       [0.197, 0.419, 0.449, 0.   , 0.512],
       [0.572, 0.764, 0.903, 0.512, 0.   ]])

Retrieve all neighbors and average distance within radius r:

>>> r = .2
>>> def reduce_func(D_chunk, start):
...     neigh = [np.flatnonzero(d < r) for d in D_chunk]
...     avg_dist = (D_chunk * (D_chunk < r)).mean(axis=1)
...     return neigh, avg_dist
>>> gen = pairwise_distances_chunked(X, reduce_func=reduce_func)
>>> neigh, avg_dist = next(gen)
>>> neigh
[array([0, 3]), array([1]), array([2]), array([0, 3]), array([4])]
>>> avg_dist
array([0.039, 0.        , 0.        , 0.039, 0.        ])

Where r is defined per sample, we need to make use of start:

>>> r = [.2, .4, .4, .3, .1]
>>> def reduce_func(D_chunk, start):
...     neigh = [np.flatnonzero(d < r[i])
...              for i, d in enumerate(D_chunk, start)]
...     return neigh
>>> neigh = next(pairwise_distances_chunked(X, reduce_func=reduce_func))
>>> neigh
[array([0, 3]), array([0, 1]), array([2]), array([0, 3]), array([4])]

Force row-by-row generation by reducing working_memory:

>>> gen = pairwise_distances_chunked(X, reduce_func=reduce_func,
...                                  working_memory=0)
>>> next(gen)
[array([0, 3])]
>>> next(gen)
[array([0, 1])]