`sklearn.neighbors`.BallTree¶

class `sklearn.neighbors.``BallTree`

BallTree for fast generalized N-point problems

BallTree(X, leaf_size=40, metric=’minkowski’, **kwargs)

Parameters: X : array-like, shape = [n_samples, n_features] n_samples is the number of points in the data set, and n_features is the dimension of the parameter space. Note: if X is a C-contiguous array of doubles then data will not be copied. Otherwise, an internal copy will be made. leaf_size : positive integer (default = 40) Number of points at which to switch to brute-force. Changing leaf_size will not affect the results of a query, but can significantly impact the speed of a query and the memory required to store the constructed tree. The amount of memory needed to store the tree scales as approximately n_samples / leaf_size. For a specified `leaf_size`, a leaf node is guaranteed to satisfy `leaf_size <= n_points <= 2 * leaf_size`, except in the case that `n_samples < leaf_size`. metric : string or DistanceMetric object the distance metric to use for the tree. Default=’minkowski’ with p=2 (that is, a euclidean metric). See the documentation of the DistanceMetric class for a list of available metrics. ball_tree.valid_metrics gives a list of the metrics which are valid for BallTree. Additional keywords are passed to the distance metric class. : data : np.ndarray The training data

Examples

Query for k-nearest neighbors

```>>> import numpy as np
>>> np.random.seed(0)
>>> X = np.random.random((10, 3))  # 10 points in 3 dimensions
>>> tree = BallTree(X, leaf_size=2)
>>> dist, ind = tree.query([X[0]], k=3)
>>> print(ind)  # indices of 3 closest neighbors
[0 3 1]
>>> print(dist)  # distances to 3 closest neighbors
[ 0.          0.19662693  0.29473397]
```

Pickle and Unpickle a tree. Note that the state of the tree is saved in the pickle operation: the tree needs not be rebuilt upon unpickling.

```>>> import numpy as np
>>> import pickle
>>> np.random.seed(0)
>>> X = np.random.random((10, 3))  # 10 points in 3 dimensions
>>> tree = BallTree(X, leaf_size=2)
>>> s = pickle.dumps(tree)
>>> tree_copy = pickle.loads(s)
>>> dist, ind = tree_copy.query(X[0], k=3)
>>> print(ind)  # indices of 3 closest neighbors
[0 3 1]
>>> print(dist)  # distances to 3 closest neighbors
[ 0.          0.19662693  0.29473397]
```

Query for neighbors within a given radius

```>>> import numpy as np
>>> np.random.seed(0)
>>> X = np.random.random((10, 3))  # 10 points in 3 dimensions
>>> tree = BallTree(X, leaf_size=2)
>>> print(tree.query_radius(X[0], r=0.3, count_only=True))
3
>>> ind = tree.query_radius(X[0], r=0.3)
>>> print(ind)  # indices of neighbors within distance 0.3
[3 0 1]
```

Compute a gaussian kernel density estimate:

```>>> import numpy as np
>>> np.random.seed(1)
>>> X = np.random.random((100, 3))
>>> tree = BallTree(X)
>>> tree.kernel_density(X[:3], h=0.1, kernel='gaussian')
array([ 6.94114649,  7.83281226,  7.2071716 ])
```

Compute a two-point auto-correlation function

```>>> import numpy as np
>>> np.random.seed(0)
>>> X = np.random.random((30, 3))
>>> r = np.linspace(0, 1, 5)
>>> tree = BallTree(X)
>>> tree.two_point_correlation(X, r)
array([ 30,  62, 278, 580, 820])
```

Methods

 `get_arrays` `get_n_calls` `get_tree_stats` `kernel_density`(self, X, h[, kernel, atol, …]) Compute the kernel density estimate at points X with the given kernel, using the distance metric specified at tree creation. `query`(X[, k, return_distance, dualtree, …]) query the tree for the k nearest neighbors `query_radius` query_radius(self, X, r, count_only = False): `reset_n_calls` `two_point_correlation` Compute the two-point correlation function
`__init__`()

Initialize self. See help(type(self)) for accurate signature.

`kernel_density`(self, X, h, kernel=’gaussian’, atol=0, rtol=1E-8, breadth_first=True, return_log=False)

Compute the kernel density estimate at points X with the given kernel, using the distance metric specified at tree creation.

Parameters: X : array_like An array of points to query. Last dimension should match dimension of training data. h : float the bandwidth of the kernel kernel : string specify the kernel to use. Options are - ‘gaussian’ - ‘tophat’ - ‘epanechnikov’ - ‘exponential’ - ‘linear’ - ‘cosine’ Default is kernel = ‘gaussian’ atol, rtol : float (default = 0) Specify the desired relative and absolute tolerance of the result. If the true result is K_true, then the returned result K_ret satisfies `abs(K_true - K_ret) < atol + rtol * K_ret` The default is zero (i.e. machine precision) for both. breadth_first : boolean (default = False) if True, use a breadth-first search. If False (default) use a depth-first search. Breadth-first is generally faster for compact kernels and/or high tolerances. return_log : boolean (default = False) return the logarithm of the result. This can be more accurate than returning the result itself for narrow kernels. density : ndarray The array of (log)-density evaluations, shape = X.shape[:-1]

Examples

Compute a gaussian kernel density estimate:

```>>> import numpy as np
>>> np.random.seed(1)
>>> X = np.random.random((100, 3))
>>> tree = BinaryTree(X)
>>> tree.kernel_density(X[:3], h=0.1, kernel='gaussian')
array([ 6.94114649,  7.83281226,  7.2071716 ])
```
`query`(X, k=1, return_distance=True, dualtree=False, breadth_first=False)

query the tree for the k nearest neighbors

Parameters: X : array-like, last dimension self.dim An array of points to query k : integer (default = 1) The number of nearest neighbors to return return_distance : boolean (default = True) if True, return a tuple (d, i) of distances and indices if False, return array i dualtree : boolean (default = False) if True, use the dual tree formalism for the query: a tree is built for the query points, and the pair of trees is used to efficiently search this space. This can lead to better performance as the number of points grows large. breadth_first : boolean (default = False) if True, then query the nodes in a breadth-first manner. Otherwise, query the nodes in a depth-first manner. sort_results : boolean (default = True) if True, then distances and indices of each point are sorted on return, so that the first column contains the closest points. Otherwise, neighbors are returned in an arbitrary order. i : if return_distance == False (d,i) : if return_distance == True d : array of doubles - shape: x.shape[:-1] + (k,) each entry gives the list of distances to the neighbors of the corresponding point i : array of integers - shape: x.shape[:-1] + (k,) each entry gives the list of indices of neighbors of the corresponding point

Examples

Query for k-nearest neighbors

```>>> import numpy as np
>>> np.random.seed(0)
>>> X = np.random.random((10, 3))  # 10 points in 3 dimensions
>>> tree = BinaryTree(X, leaf_size=2)
>>> dist, ind = tree.query(X[0], k=3)
>>> print(ind)  # indices of 3 closest neighbors
[0 3 1]
>>> print(dist)  # distances to 3 closest neighbors
[ 0.          0.19662693  0.29473397]
```
`query_radius`()

query_radius(self, X, r, count_only = False):

query the tree for neighbors within a radius r

Parameters: X : array-like, last dimension self.dim An array of points to query r : distance within which neighbors are returned r can be a single value, or an array of values of shape x.shape[:-1] if different radii are desired for each point. return_distance : boolean (default = False) if True, return distances to neighbors of each point if False, return only neighbors Note that unlike the query() method, setting return_distance=True here adds to the computation time. Not all distances need to be calculated explicitly for return_distance=False. Results are not sorted by default: see `sort_results` keyword. count_only : boolean (default = False) if True, return only the count of points within distance r if False, return the indices of all points within distance r If return_distance==True, setting count_only=True will result in an error. sort_results : boolean (default = False) if True, the distances and indices will be sorted before being returned. If False, the results will not be sorted. If return_distance == False, setting sort_results = True will result in an error. count : if count_only == True ind : if count_only == False and return_distance == False (ind, dist) : if count_only == False and return_distance == True count : array of integers, shape = X.shape[:-1] each entry gives the number of neighbors within a distance r of the corresponding point. ind : array of objects, shape = X.shape[:-1] each element is a numpy integer array listing the indices of neighbors of the corresponding point. Note that unlike the results of a k-neighbors query, the returned neighbors are not sorted by distance by default. dist : array of objects, shape = X.shape[:-1] each element is a numpy double array listing the distances corresponding to indices in i.

Examples

Query for neighbors in a given radius

```>>> import numpy as np
>>> np.random.seed(0)
>>> X = np.random.random((10, 3))  # 10 points in 3 dimensions
>>> tree = BinaryTree(X, leaf_size=2)
>>> print(tree.query_radius(X[0], r=0.3, count_only=True))
3
>>> ind = tree.query_radius(X[0], r=0.3)
>>> print(ind)  # indices of neighbors within distance 0.3
[3 0 1]
```
`two_point_correlation`()

Compute the two-point correlation function

Parameters: X : array_like An array of points to query. Last dimension should match dimension of training data. r : array_like A one-dimensional array of distances dualtree : boolean (default = False) If true, use a dualtree algorithm. Otherwise, use a single-tree algorithm. Dual tree algorithms can have better scaling for large N. counts : ndarray counts[i] contains the number of pairs of points with distance less than or equal to r[i]

Examples

Compute the two-point autocorrelation function of X:

```>>> import numpy as np
>>> np.random.seed(0)
>>> X = np.random.random((30, 3))
>>> r = np.linspace(0, 1, 5)
>>> tree = BinaryTree(X)
>>> tree.two_point_correlation(X, r)
array([ 30,  62, 278, 580, 820])
```