sklearn.metrics.pairwise_distances_chunked
- sklearn.metrics.pairwise_distances_chunked(X, Y=None, *, reduce_func=None, metric='euclidean', n_jobs=None, working_memory=None, **kwds)
Generate a distance matrix chunk by chunk with optional reduction.
In cases where not all of a pairwise distance matrix needs to be stored at once, this is used to calculate pairwise distances in working_memory-sized chunks. If reduce_func is given, it is run on each chunk and its return values are concatenated into lists, arrays or sparse matrices.
- Parameters:
- X{array-like, sparse matrix} of shape (n_samples_X, n_samples_X) or (n_samples_X, n_features)
Array of pairwise distances between samples, or a feature array. The shape of the array should be (n_samples_X, n_samples_X) if metric='precomputed' and (n_samples_X, n_features) otherwise.
- Y{array-like, sparse matrix} of shape (n_samples_Y, n_features), default=None
An optional second feature array. Only allowed if metric != “precomputed”.
- reduce_funccallable, default=None
The function which is applied on each chunk of the distance matrix, reducing it to needed values.
reduce_func(D_chunk, start) is called repeatedly, where D_chunk is a contiguous vertical slice of the pairwise distance matrix, starting at row start. It should return one of: None; an array, a list, or a sparse matrix of length D_chunk.shape[0]; or a tuple of such objects. Returning None is useful for in-place operations, rather than reductions.
If None, pairwise_distances_chunked returns a generator of vertical chunks of the distance matrix.
- metricstr or callable, default=’euclidean’
The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by scipy.spatial.distance.pdist for its metric parameter, or a metric listed in pairwise.PAIRWISE_DISTANCE_FUNCTIONS. If metric is “precomputed”, X is assumed to be a distance matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays from X as input and return a value indicating the distance between them.
- n_jobsint, default=None
The number of jobs to use for the computation. This works by breaking down the pairwise matrix into n_jobs even slices and computing them in parallel.
None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
- working_memoryfloat, default=None
The sought maximum memory for temporary distance matrix chunks. When None (default), the value of sklearn.get_config()['working_memory'] is used; a configuration sketch is shown below, before the Examples.
- **kwdsoptional keyword parameters
Any further parameters are passed directly to the distance function. If using a scipy.spatial.distance metric, the parameters are still metric dependent. See the scipy docs for usage examples.
- Yields:
- D_chunk{ndarray, sparse matrix}
A contiguous slice of the distance matrix, optionally processed by reduce_func.
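Because working_memory falls back to the global scikit-learn configuration when left as None, the chunk budget can also be set once for a whole session. Below is a minimal sketch, with the 64 MiB value chosen purely for illustration; sklearn.config_context offers the same control as a context manager when a temporary setting is preferred.
>>> import sklearn
>>> # Cap temporary distance-matrix chunks at roughly 64 MiB (illustrative value).
>>> sklearn.set_config(working_memory=64)
>>> sklearn.get_config()['working_memory']
64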
Examples
Without reduce_func:
>>> import numpy as np
>>> from sklearn.metrics import pairwise_distances_chunked
>>> X = np.random.RandomState(0).rand(5, 3)
>>> D_chunk = next(pairwise_distances_chunked(X))
>>> D_chunk
array([[0.  ..., 0.29..., 0.41..., 0.19..., 0.57...],
       [0.29..., 0.  ..., 0.57..., 0.41..., 0.76...],
       [0.41..., 0.57..., 0.  ..., 0.44..., 0.90...],
       [0.19..., 0.41..., 0.44..., 0.  ..., 0.51...],
       [0.57..., 0.76..., 0.90..., 0.51..., 0.  ...]])
Retrieve all neighbors and average distance within radius r:
>>> r = .2
>>> def reduce_func(D_chunk, start):
...     neigh = [np.flatnonzero(d < r) for d in D_chunk]
...     avg_dist = (D_chunk * (D_chunk < r)).mean(axis=1)
...     return neigh, avg_dist
>>> gen = pairwise_distances_chunked(X, reduce_func=reduce_func)
>>> neigh, avg_dist = next(gen)
>>> neigh
[array([0, 3]), array([1]), array([2]), array([0, 3]), array([4])]
>>> avg_dist
array([0.039..., 0.        , 0.        , 0.039..., 0.        ])
Where r is defined per sample, we need to make use of start:
>>> r = [.2, .4, .4, .3, .1]
>>> def reduce_func(D_chunk, start):
...     neigh = [np.flatnonzero(d < r[i])
...              for i, d in enumerate(D_chunk, start)]
...     return neigh
>>> neigh = next(pairwise_distances_chunked(X, reduce_func=reduce_func))
>>> neigh
[array([0, 3]), array([0, 1]), array([2]), array([0, 3]), array([4])]
Force row-by-row generation by reducing working_memory:
>>> gen = pairwise_distances_chunked(X, reduce_func=reduce_func,
...                                  working_memory=0)
>>> next(gen)
[array([0, 3])]
>>> next(gen)
[array([0, 1])]
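The per-chunk return values of reduce_func can be stitched together by the caller once the generator is exhausted. The following is a minimal sketch that finds each sample's nearest neighbor (excluding itself) without ever holding the full distance matrix; the helper name nn_reduce is our own, not part of the API:
>>> def nn_reduce(D_chunk, start):
...     # Mask each row's zero self-distance so argmin picks a real neighbor.
...     rows = np.arange(D_chunk.shape[0])
...     D_chunk = D_chunk.copy()
...     D_chunk[rows, rows + start] = np.inf
...     return D_chunk.argmin(axis=1)
>>> gen = pairwise_distances_chunked(X, reduce_func=nn_reduce)
>>> nearest = np.concatenate(list(gen))
>>> nearest
array([3, 0, 0, 0, 3])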