I'm handling a large collection of data (~150k rows) in the form of a pandas DataFrame structured as follows:
Id Features
27834 [-21.722315, 11.017685, -23.340332, -4.7431817...]
27848 [-24.009565, 2.7699656, -10.014014, 9.293142, ...]
27895 [-10.533865, 18.835657, -10.094039, -7.5399566...]
27900 [-18.124702, 2.6769984, -11.778319, -7.14392, ...]
27957 [-10.765628, -11.368319, -19.968557, 2.1252406...]
Each row in Features is a 50-element NumPy array. My goal is to take a random feature vector from the set and compute its cosine similarity with all the other vectors.
The code below works for the intended purpose, but it is very slow:
import numpy as np
import pandas as pd
# Assumption: cosine is SciPy's cosine distance, hence the "1 - ..." below
from scipy.spatial.distance import cosine
# Extract a random row to use as input data
sample = copy.sample()
# Scale sample to match original size
scaled = pd.DataFrame(np.repeat(sample.values, len(copy), axis=0), columns=['ID', 'Features'])
# Copy scaled dataframe columns to original
copy['Features_compare'] = scaled['Features']
# Apply cosine similarity for Features
copy['cosine'] = copy.apply(lambda row: 1 - cosine(row['Features'], row['Features_compare']), axis=1)
copy
Output:
Id Features Features_compare cosine
27834 [-21.722315, 11.017685...] [-25.151268, 1.0155457...] 0.452093
27848 [-24.009565, 2.7699656...] [-25.151268, 1.0155457...] 0.528901
27895 [-10.533865, 18.835657...] [-25.151268, 1.0155457...] 0.266685
27900 [-18.124702, 2.6769984...] [-25.151268, 1.0155457...] 0.381307
27957 [-10.765628, -11.368319...] [-25.151268, 1.0155457...] 0.220016
Elapsed Time: 14.1623s
Is there a more efficient way of computing a large number of similarity values in dataframes?
Answer:
Let's vectorize your code:
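import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Pick one random feature vector; s is a one-element list, i.e. a (1, 50) input
s = df['Features'].sample(n=1).tolist()
# Repeat the sampled vector once per row (only needed for display)
df['Features_compare'] = np.tile(s, (len(df), 1)).tolist()
# A single vectorized call compares every row against the sample
df['cosine'] = cosine_similarity(df['Features'].tolist(), s).ravel()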
Output:
Id Features Features_compare cosine
0 27834 [-21.722315, 11.017685, -23.340332, -4.7431817] [-18.124702, 2.6769984, -11.778319, -7.14392] 0.937401
1 27848 [-24.009565, 2.7699656, -10.014014, 9.293142] [-18.124702, 2.6769984, -11.778319, -7.14392] 0.776474
2 27895 [-10.533865, 18.835657, -10.094039, -7.5399566] [-18.124702, 2.6769984, -11.778319, -7.14392] 0.722914
3 27900 [-18.124702, 2.6769984, -11.778319, -7.14392] [-18.124702, 2.6769984, -11.778319, -7.14392] 1.000000
4 27957 [-10.765628, -11.368319, -19.968557, 2.1252406] [-18.124702, 2.6769984, -11.778319, -7.14392] 0.659093
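If you would rather not depend on scikit-learn, the same idea can be sketched in plain NumPy: stack the per-row arrays into one matrix, then divide the dot products by the product of the norms. This is only a rough sketch. The helper name cosine_to_sample and the synthetic DataFrame are made up for illustration, and it assumes every vector has the same length and a non-zero norm.

```python
import numpy as np
import pandas as pd


def cosine_to_sample(df, sample_pos, column="Features"):
    """Cosine similarity between the vector at sample_pos and every row.

    Hypothetical helper; assumes equal-length, non-zero vectors.
    """
    # Stack the per-row arrays into one (n_rows, n_features) matrix
    X = np.stack(df[column].tolist())
    q = X[sample_pos]
    # Product of each row's norm with the norm of the sampled vector
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(q)
    # Dot product of every row with q, scaled by the norms
    return (X @ q) / norms


# Synthetic stand-in for the real data (assumption: 50-dimensional vectors)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Id": np.arange(1000),
    "Features": list(rng.normal(size=(1000, 50))),
})

# Choose a random row position and compare every row against it
pos = int(rng.integers(len(df)))
df["cosine"] = cosine_to_sample(df, pos)
```

This also skips the Features_compare column, which is not needed for the computation and mostly costs memory on a 150k-row frame.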