Optimise Euclidean distance matrix algorithm if only interested in closest points


The following Euclidean distance algorithm creates an MxM matrix of distances between the rows of an MxN input matrix (representing points in some N-dimensional space). The runtime of this algorithm scales as O(M^2). Can this be improved upon if I am only interested in the rows (i.e. points) that are closest to each other? (My downstream task consists of performing k-NN, amongst other things.)

import numpy as np

vectors = np.random.randn(100, 20)  # 100 points in 20-dimensional space
m = vectors.shape[0]

# Pairwise Euclidean distances between all rows, computed one row at a time.
distances = np.zeros([m, m])
for i in range(m):
    vec = vectors[i]
    distances[i] = [np.linalg.norm(vec - vectors[j]) for j in range(m)]

CodePudding user response:

I would suggest leveraging SciPy's condensed distance matrix instead of the for-loop of pairwise comparisons. In particular,

from scipy.spatial.distance import pdist, squareform
distances = squareform(pdist(vectors))

provides a ~85x speedup. The documentation can be found here.

Fundamentally, the complexity remains quadratic (you still need to compare every element of vectors with every other). However, the implementation leverages symmetry and the fact that the distance of every element to itself is 0, so it only computes the upper-triangular entries and then mirrors them along the diagonal to obtain the full square distance matrix.
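This is easy to check directly: pdist returns only the m(m-1)/2 upper-triangular entries, and squareform expands them into the symmetric square matrix. A minimal sketch:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Small example: 5 points in 3-D space.
vectors = np.random.randn(5, 3)
m = vectors.shape[0]

condensed = pdist(vectors)          # only the m*(m-1)/2 upper-triangular entries
assert condensed.shape == (m * (m - 1) // 2,)

square = squareform(condensed)      # mirrored into the full m x m matrix
assert square.shape == (m, m)
assert np.allclose(square, square.T)        # symmetric
assert np.allclose(np.diag(square), 0.0)    # zero self-distances
```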

Your code ran in 71 ms while the SciPy version ran in 0.83 ms: the aforementioned ~85x speed-up.
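The exact numbers are machine-dependent, but the comparison is straightforward to reproduce with timeit. The following sketch benchmarks both versions on the same 100x20 input and checks that they agree:

```python
import timeit
import numpy as np
from scipy.spatial.distance import pdist, squareform

vectors = np.random.randn(100, 20)
m = vectors.shape[0]

def loop_version():
    # Original row-by-row loop.
    distances = np.zeros([m, m])
    for i in range(m):
        distances[i] = [np.linalg.norm(vectors[i] - vectors[j]) for j in range(m)]
    return distances

def scipy_version():
    return squareform(pdist(vectors))

# Both compute the same matrix.
assert np.allclose(loop_version(), scipy_version())

t_loop = timeit.timeit(loop_version, number=10)
t_scipy = timeit.timeit(scipy_version, number=10)
```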

Regardless, if you want to run k-NN you might want to consider scikit-learn, where you can simply provide the vectors as X, as shown here.
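A minimal sketch of that route, assuming scikit-learn is installed: NearestNeighbors builds a ball tree or KD-tree where the dimensionality allows it, so the k nearest rows can be queried without materialising the full MxM distance matrix.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

vectors = np.random.randn(100, 20)

# Each point's nearest hit is itself (distance 0), so request k+1
# neighbours and drop the first column to get the 5 true neighbours.
nn = NearestNeighbors(n_neighbors=6).fit(vectors)
dist, idx = nn.kneighbors(vectors)
neighbour_dist, neighbour_idx = dist[:, 1:], idx[:, 1:]
```

With 20 dimensions the tree-based indexes degrade towards brute force, but the interface stays the same and scikit-learn picks an algorithm automatically.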
