Numpy calculate difference of matrices against all rows in matrix

Time:09-17

Given two matrices, I want to create a new array holding the sum of squared differences between every row of the first matrix and every row of the second, but I cannot find a way to do it without loops.

To make this concrete, here is an example. I would like to replace the following for-loop with vectorized numpy operations:

import numpy as np

a = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
b = np.array([[1.2, 2.3, 3.4], [4.5, 5.6, 7.8], [9.10, 10.11, 11.12]])
# One output row per row of a, one output column per row of b
summed = np.empty((len(a), len(b)))
for i, aSample in enumerate(a):
    for j, bSample in enumerate(b):
        summed[i, j] = np.sum(np.power(aSample - bSample, 2))

>>> summed
array([[ 18.29  , 112.45  , 308.6765],
       [  7.49  ,  79.65  , 251.0165]])

These are just example matrices. In my use case both matrices have tens of thousands of rows, so their shapes are more like (20000, 1000). Is there a way to do this efficiently with numpy?

EDIT: @Blorgon's answer gives correct results, but in my case I could not allocate the larger array that np.newaxis creates. The solution by @MadPhysicist computed the distances between the vectors within my memory limits.

CodePudding user response:

You can use np.newaxis to achieve your desired result:

>>> np.sum(np.power(b - a[:, np.newaxis], 2), axis=2)
array([[ 18.29  , 112.45  , 308.6765],
       [  7.49  ,  79.65  , 251.0165]])

Edit: while my solution is faster, if memory is important then the scipy solution is better.
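
If the full broadcast does not fit in memory, one possible middle ground is to apply the same np.newaxis expression to a few rows of a at a time, so the temporary array stays small. This is only a sketch, not something from the original answers: the function name pairwise_sq_dists_chunked and the chunk_rows parameter are made up for illustration, and the right chunk size depends on your available memory.

```python
import numpy as np

def pairwise_sq_dists_chunked(a, b, chunk_rows=256):
    """Sum of squared differences between every row of a and every row of b.

    Hypothetical helper: processes `chunk_rows` rows of `a` per step so the
    temporary array has shape (chunk_rows, len(b), n_columns) at most.
    """
    out = np.empty((a.shape[0], b.shape[0]), dtype=np.result_type(a, b))
    for start in range(0, a.shape[0], chunk_rows):
        stop = start + chunk_rows
        # Broadcast only this block of rows of a against all rows of b
        diff = a[start:stop, np.newaxis, :] - b[np.newaxis, :, :]
        np.square(diff, out=diff)  # square in place, no second temporary
        out[start:stop] = diff.sum(axis=2)
    return out

a = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
b = np.array([[1.2, 2.3, 3.4], [4.5, 5.6, 7.8], [9.10, 10.11, 11.12]])
summed = pairwise_sq_dists_chunked(a, b, chunk_rows=1)
```

With chunk_rows equal to len(a) this reduces to the one-line broadcast above; smaller values trade some speed for a lower peak memory footprint.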

CodePudding user response:

The numpy solution suggested by blorgon is likely faster, but you can also use scipy.spatial.distance.cdist:

>>> from scipy.spatial.distance import cdist
>>> cdist(a, b)**2
array([[ 18.29  , 112.45  , 308.6765],
       [  7.49  ,  79.65  , 251.0165]])

The drawback of this approach is that cdist takes a square root, which the squaring then undoes. The advantage is that it does not allocate a large intermediate array. In numpy you can avoid one of the intermediates by squaring in place, although diff itself is still a full three-dimensional array:

>>> diff = b - a[:, np.newaxis]
>>> np.power(diff, 2, out=diff).sum(axis=2)
array([[ 18.29  , 112.45  , 308.6765],
       [  7.49  ,  79.65  , 251.0165]])
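
On the square-root point: cdist accepts a metric argument, and the 'sqeuclidean' metric returns squared Euclidean distances directly, so the result does not need to be squared afterwards. A minimal sketch on the example arrays from the question, assuming scipy is installed:

```python
import numpy as np
from scipy.spatial.distance import cdist

a = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
b = np.array([[1.2, 2.3, 3.4], [4.5, 5.6, 7.8], [9.10, 10.11, 11.12]])

# Squared Euclidean distance between every row of a and every row of b
summed = cdist(a, b, metric="sqeuclidean")
```

The output has shape (len(a), len(b)), the same as the loop in the question.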