Efficient Cosine Similarity between Dataframe Rows

Time:09-26

I'm handling a large collection of data (~150k rows) in the form of a pandas DataFrame structured as follows:

Id       Features
27834   [-21.722315, 11.017685, -23.340332, -4.7431817...]
27848   [-24.009565, 2.7699656, -10.014014, 9.293142, ...]
27895   [-10.533865, 18.835657, -10.094039, -7.5399566...]
27900   [-18.124702, 2.6769984, -11.778319, -7.14392, ...]
27957   [-10.765628, -11.368319, -19.968557, 2.1252406... ]

Each entry in Features is a 50-element NumPy array. My goal is to take a random feature vector from the set and compute its cosine similarity with all the other vectors.

The code below works for the intended purpose, but it is very slow:

import numpy as np
import pandas as pd
from scipy.spatial.distance import cosine

# Extract a random row to use as input data
sample = copy.sample()
# Scale sample to match original size
scaled = pd.DataFrame(np.repeat(sample.values, len(copy),axis=0), columns=['ID','Features'])
# Copy scaled dataframe columns to original
copy['Features_compare'] = scaled['Features']
# Apply cosine similarity for Features
copy['cosine'] = copy.apply(lambda row: 1 - cosine(row['Features'],row['Features_compare']), axis=1)
copy

Output:

Id      Features                    Features_compare                cosine
27834   [-21.722315, 11.017685...]  [-25.151268, 1.0155457...]      0.452093
27848   [-24.009565, 2.7699656...]  [-25.151268, 1.0155457...]      0.528901
27895   [-10.533865, 18.835657...]  [-25.151268, 1.0155457...]      0.266685
27900   [-18.124702, 2.6769984...]  [-25.151268, 1.0155457...]      0.381307
27957   [-10.765628, -11.368319...] [-25.151268, 1.0155457...]      0.220016
Elapsed Time: 14.1623s

Is there a more efficient way of computing a large number of similarity values in dataframes?

Answer:

Let's vectorize your code so the similarity is computed for all rows in one call instead of row by row with apply:

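import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Draw one random feature vector; tolist() yields a one-element list (a 1 x 50 input)
s = df['Features'].sample(n=1).tolist()

# Repeat the sampled vector on every row so it is shown next to each comparison
df['Features_compare'] = np.tile(s, (len(df), 1)).tolist()
# cosine_similarity returns an (n, 1) array; ravel() flattens it to one value per row
df['cosine'] = cosine_similarity(df['Features'].tolist(), s).ravel()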

      Id                                         Features                               Features_compare    cosine
0  27834  [-21.722315, 11.017685, -23.340332, -4.7431817]  [-18.124702, 2.6769984, -11.778319, -7.14392]  0.937401
1  27848    [-24.009565, 2.7699656, -10.014014, 9.293142]  [-18.124702, 2.6769984, -11.778319, -7.14392]  0.776474
2  27895  [-10.533865, 18.835657, -10.094039, -7.5399566]  [-18.124702, 2.6769984, -11.778319, -7.14392]  0.722914
3  27900    [-18.124702, 2.6769984, -11.778319, -7.14392]  [-18.124702, 2.6769984, -11.778319, -7.14392]  1.000000
4  27957  [-10.765628, -11.368319, -19.968557, 2.1252406]  [-18.124702, 2.6769984, -11.778319, -7.14392]  0.659093
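
If you would rather not depend on scikit-learn, the same idea can be sketched with plain NumPy: stack the Features column into one 2-D matrix, then take a single matrix-vector product and divide by the norms. This is only a sketch under assumptions. The DataFrame below is random stand-in data rather than the data from the question, the names X and query are made up for illustration, and it assumes no feature vector is all zeros (which would cause a division by zero).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 1,000 rows, each holding a 50-element float vector.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Id": np.arange(1000),
    "Features": list(rng.normal(size=(1000, 50))),
})

# Stack the per-row arrays into one (n_rows, 50) matrix.
X = np.vstack(df["Features"].tolist())

# Pick one row at random to act as the query vector.
query = X[rng.integers(len(X))]

# Cosine similarity = dot product divided by the product of the two norms.
# Assumes no row of X (and not the query) is the zero vector.
norms = np.linalg.norm(X, axis=1) * np.linalg.norm(query)
df["cosine"] = (X @ query) / norms
```

The row that was sampled as the query scores 1.0 against itself, which is a quick sanity check. Building X once also means that repeated queries only cost one matrix-vector product each.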