What is the most efficient way to run a cosine similarity test comparing all rows to all rows of two dataframes?


I'm working to speed up this chunk of code. It currently takes multiple days to compare two dataframes, [192184 rows x 256 columns] by [7739 rows x 256 columns]. This is a relatively small use-case and it's already way too slow. I'd like to compare all rows of dataframe 1 to all rows of dataframe 2, in order to rank the rows in DF2 by similarity to each row of DF1.

There is no need to keep this in pandas format, my end goal is just to write the output to a csv.

For simplicity, I'll write short example inputs, and the expected example output.

Input:

DF1:
A    1    10    7    8    3
B    1    0    1    0    1
C    6    1    0    0    0
DF2:
D    4    0    0    0    1
E    1    0    0    0    2
F    2    8    8    5    2

Output:

          D         E         F
A  0.113690  0.209633  0.971074
B  0.700140  0.774597  0.546019
C  0.956943  0.441129  0.259129

My current code:

import pandas as pd
from scipy import spatial

#set up test case
a = [1,10,7,8,3]
b = [1,0,1,0,1]
c = [6,1,0,0,0]
d = [4,0,0,0,1]
e = [1,0,0,0,2]
f = [2,8,8,5,2]

DF1 = pd.DataFrame([a,b,c])
DF1.index = ["A","B","C"]

DF2 = pd.DataFrame([d,e,f])
DF2.index = ["D","E","F"]


#declare variables
a_dict = {}
rows_list = list(DF2.index.values)

#something is wrong with this meat
for line in DF1.iterrows():
    temp = []
    for row in DF2.iterrows():
        temp.append(1- spatial.distance.cosine(line[1],row[1]))
    a_dict[line[0]] = temp

#write final output
final_output = pd.DataFrame.from_dict(a_dict, orient="index", columns = rows_list)


For speeding this up, I've read that itertuples should be faster, but I don't think that addresses the main issue, which is either the nested for loop or the spatial.distance.cosine call itself. I've read about different functions for running this test, but honestly don't know how to apply them in this scenario.

Also, this is my first question on here - feel free to give me tips for next time :)

CodePudding user response:

Let's start by seeing how we could have accelerated your existing code, just as an exercise in avoiding some for loops. Though, spoiler alert, we won't use that algorithm in the end anyway, so it is just an exercise.

Vectorization of your code

What you are doing here is calling spatial.distance.cosine for every combination of rows, hence the two nested for loops. Let's remove the inner for loop by using np.vectorize, whose purpose is to apply the same function to all items of a vector and return the vector of results. A bit like map, but for numpy.

Example:

def f(a, b):
    return a * b

fv=np.vectorize(f)
fv(np.array([1,2,3]), np.array([4,5,6]))
#array([ 4, 10, 18])
fv(np.array([1,2,3]), 4)
#array([ 4,  8, 12])
fv(np.array([[1,2],[3,4]]), np.array([5,6]))
#array([[ 5, 12],
#       [15, 24]])

So, fv is the vectorized version of f. You recognize the behavior of * on numpy arrays; that is not a coincidence.

np.vectorize goes down any dimension needed to find scalars in each parameter, and applies the original function to them. But in your case, we don't want it to do that, because we don't want it to iterate through all 256 columns and then apply spatial.distance.cosine to individual scalars.

But there is a remedy for that. We can specify, during vectorization, what the "atoms" are, that is, the signature of f.

sdcv=np.vectorize(spatial.distance.cosine, signature="(n),(n)->()")

creates a vectorized version of cosine expecting to find, in both arguments, arrays of the same size, or arrays of such arrays, or arrays of arrays of such arrays, etc.

So, now, if you call

sdcv(line[1], DF2.values)

Since line[1] is an array of scalars (5 in your example, 256 in your real use case), and DF2.values is an array of arrays of scalars, this calls cosine(line[1], x) for x being each line of DF2.values, and returns the len(DF2) results as an array.
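As a quick sanity check, here is a minimal sketch using the toy rows from the question (row A against all of DF2):

```python
import numpy as np
from scipy import spatial

# Vectorized cosine distance: both arguments are treated as rows of length n
sdcv = np.vectorize(spatial.distance.cosine, signature="(n),(n)->()")

row = np.array([1, 10, 7, 8, 3])            # row A of DF1
mat = np.array([[4, 0, 0, 0, 1],            # rows D, E, F of DF2
                [1, 0, 0, 0, 2],
                [2, 8, 8, 5, 2]])

sims = 1 - sdcv(row, mat)                   # one similarity per row of mat
print(sims)  # ≈ [0.113690, 0.209633, 0.971074], the first line of the expected output
```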

Hence an optimization

sdcv=np.vectorize(spatial.distance.cosine, signature="(n),(n)->()")

def mvec():
    #declare variables
    a_dict = {}
    rows_list = list(DF2.index.values)

    for line in DF1.iterrows():
        temp = 1-sdcv(line[1], DF2.values)
        a_dict[line[0]] = temp

    #write final output
    return pd.DataFrame.from_dict(a_dict, orient="index", columns = rows_list)

We still have a loop. But the inner loop is now implicit, and done by vectorization.

Second layer

We can do the same with the outer loop

sdcv2=np.vectorize(sdcv, signature="(n),(m,n)->(m)")
def mvec2():
    return pd.DataFrame(1-sdcv2(DF1.values, DF2.values))

This time, we tell vectorize that we want the 1st argument to be a row, the 2nd argument to be a whole dataframe, and the result to be a list of cosines: that is, what the previous vectorized call was computing.

So, if we call sdcv2 with DF1 and DF2, since DF1 is an array of rows, it will call sdcv for each row of DF1, with the second argument being DF2 (it doesn't iterate on DF2, because DF2 is already the expected 2D array). And since a call to sdcv with DF2 as second argument returns an array for each row, we get an array of arrays as a result.

Linalg solution

Now, that was just an exercise in vectorization. Because there is a detail in your code, and therefore in both of mine: the 1- before spatial.distance.cosine. spatial.distance.cosine(u,v) is by definition 1 - u.v/(||u||.||v||)

So what you really want is u.v/(||u||.||v||)

For the u.v part, it is very easy. There is a well known operation in linear algebra that computes the scalar products of all possible combinations of a set of rows: matrix multiplication. Or almost so: you need to transpose the 2nd matrix, since matrix multiplication results in the array of all combinations of scalar products between a row of the 1st matrix and a column of the 2nd matrix.

So, in your case,

DF1.values @ DF2.values.T

is the matrix of all scalar products between rows of DF1 and rows of DF2: the u.v for u each row of DF1 and v each row of DF2.
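For instance, with the toy rows above (just a sketch):

```python
import numpy as np

A = np.array([[1, 10, 7, 8, 3],   # rows A, B, C of DF1
              [1, 0, 1, 0, 1],
              [6, 1, 0, 0, 0]])
B = np.array([[4, 0, 0, 0, 1],    # rows D, E, F of DF2
              [1, 0, 0, 0, 2],
              [2, 8, 8, 5, 2]])

dots = A @ B.T                    # dots[i, j] == A[i] . B[j]
print(dots[0, 0])  # 7, i.e. 1*4 + 10*0 + 7*0 + 8*0 + 3*1
```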

We need to divide that by norms of u and v in each cell.

To be more accurate, in all cells of the first line, u was the first row of DF1. In all cells of 2nd line, u was the 2nd row of DF1. Etc.

So we need to divide the kth line of DF1.values @ DF2.values.T by the norm of the kth line of DF1.

We can compute the norms of all lines of DF1 easily

np.linalg.norm(DF1.values, axis=1)

is a 1D array of the norms of all the lines of DF1. And

np.linalg.norm(DF1.values, axis=1).reshape(-1,1)

is the same result, but in the form of a column. So

(DF1.values @ DF2.values.T) / np.linalg.norm(DF1.values, axis=1).reshape(-1,1)

is a matrix of u.v/||u|| for u each row of DF1 and v each row of DF2.
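The reshape matters because of numpy broadcasting: dividing a matrix by a column vector divides each row by the matching entry. A small illustration, not from the original code:

```python
import numpy as np

M = np.array([[0.0, 1.0, 2.0],
              [3.0, 4.0, 5.0]])
col = np.array([1.0, 2.0]).reshape(-1, 1)   # shape (2, 1): one divisor per row

print(M / col)
# [[0.  1.  2. ]
#  [1.5 2.  2.5]]
```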

Likewise, the line of norms of the rows of DF2 is np.linalg.norm(DF2.values, axis=1). So

(DF1.values @ DF2.values.T) / np.linalg.norm(DF1.values, axis=1).reshape(-1,1) / np.linalg.norm(DF2.values, axis=1)

is a matrix made of u.v/||u||/||v|| for u each row of DF1 and v each row of DF2.

Exactly what you wanted.

Hence my linalg method

def mnp():
    d = (DF1.values @ DF2.values.T) / np.linalg.norm(DF1.values, axis=1).reshape(-1, 1) / np.linalg.norm(DF2.values, axis=1)
    return pd.DataFrame(d, columns=DF2.index, index=DF1.index)
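As a check, running this on the question's example data reproduces the expected output (a sketch, with DF1 and DF2 rebuilt from the question):

```python
import numpy as np
import pandas as pd

DF1 = pd.DataFrame([[1, 10, 7, 8, 3],
                    [1, 0, 1, 0, 1],
                    [6, 1, 0, 0, 0]], index=["A", "B", "C"])
DF2 = pd.DataFrame([[4, 0, 0, 0, 1],
                    [1, 0, 0, 0, 2],
                    [2, 8, 8, 5, 2]], index=["D", "E", "F"])

d = (DF1.values @ DF2.values.T) \
    / np.linalg.norm(DF1.values, axis=1).reshape(-1, 1) \
    / np.linalg.norm(DF2.values, axis=1)
out = pd.DataFrame(d, index=DF1.index, columns=DF2.index)
print(out.round(6))
#           D         E         F
# A  0.113690  0.209633  0.971074
# B  0.700140  0.774597  0.546019
# C  0.956943  0.441129  0.259129
```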

Timings

With dataframes of 500 columns and 3 rows (at first I only increased the length of the rows, so that vectorization has a chance to be useful, while still keeping a 3×3 result matrix that I can visually check):

Method   Timing (ms)
Yours    2.9259588550194167
Vec      2.3712358399643563
Vec2     1.5134614360285923
NP       0.19741979095852003

So, 20% gain for vectorization. ×2 for double vectorization. And ×15 gain for linalg.

But 3 rows is not realistic. Now that we know it works, let's increase the number of rows.

With 100 rows (so 30× more, and it is an O(N²) algorithm, even with a constant column count):

Method   Timing (ms)
Yours    1968.870209006127
Vec      904.9228489748202
Vec2     807.5719091983046
NP       5.865840002661571

This time, the advantage of vectorization, and even more of using global numpy operations, is more visible: ×2 for vectorization (the 2nd layer is less impressive this time, because the outer loop is never where most of the work happens), and ×335 for NP.

On my machine, for 256 columns, 1000 rows in DF1 and 7739 in DF2, it takes 4 seconds (and that is a very slow machine; I'm pretty sure my Mac would do it well under 1 second, but I don't have it at hand).

You have 192184 rows in DF1, you say, not 1000. But since you have "only" 7739 rows in DF2, at this point it scales linearly: processing all 192184 rows takes only about 192 times longer than 1000. So a few minutes.

What might be more of a problem with that much data is memory, because of the storage of the result matrix itself.

But that is not a problem. You can slice DF1 into batches of 1000 or 10000 rows, and you'll get a 1000×7739 matrix each time. The real result is just the concatenation of those matrices.
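A minimal batching sketch along those lines (the function name, batch size, and output path are placeholders, not from the original answer):

```python
import numpy as np
import pandas as pd

def write_similarities(df1, df2, path, batch=1000):
    # DF2 norms are the same for every batch, so compute them once
    b = df2.values
    b_norm = np.linalg.norm(b, axis=1)
    for start in range(0, len(df1), batch):
        chunk = df1.iloc[start:start + batch]
        a = chunk.values
        d = (a @ b.T) / np.linalg.norm(a, axis=1).reshape(-1, 1) / b_norm
        # Append each slice to the CSV; only the first batch writes the header
        pd.DataFrame(d, index=chunk.index, columns=df2.index).to_csv(
            path, mode="a", header=(start == 0))
```

This way the peak memory use is one batch-sized slice of the result rather than the full 192184×7739 matrix.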

Edit

A variant I've tried, expecting the improvement to be negligible, is using einsum. Not a game changer, but not as negligible as I expected.

einsum is a way to specify many kinds of iterated sum operations over arrays. For example, matrix multiplication A@B is the equivalent of np.einsum('ij,jk', A, B). So, the scalar products of every pair of rows would be np.einsum('ij,kj', A, B), which is, again, the same as matrix multiplication with a transposition of the second argument.

And just avoiding this .T is apparently not that negligible.
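A quick equivalence check (a sketch with random data, not from the original answer):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((3, 5))
B = rng.random((4, 5))

# 'ij,kj' sums over the shared last index j: pairwise dot products of rows,
# exactly what A @ B.T computes, but without materializing the transpose
assert np.allclose(np.einsum('ij,kj', A, B), A @ B.T)
```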

So, the code for this method:

def mein():
    d=np.einsum('ij,kj', DF1.values, DF2.values) / np.linalg.norm(DF1.values,axis=1).reshape(-1,1) / np.linalg.norm(DF2.values, axis=1)
    return pd.DataFrame(d, columns=DF2.index, index=DF1.index)

It is exactly the same as before, but DF1.values @ DF2.values.T has been changed into np.einsum('ij,kj', DF1.values, DF2.values).

And update of timings with this

Method   Timing (ms)
Yours    1968.87
Vec      904.92
Vec2     807.57
NP       5.87
einsum   5.41

So, not a revolution, but still an optimization.
