I am trying to compute the cosine similarity between 2D arrays.
Let's say I have a dataframe whose shape is (5, 4):
df = pd.DataFrame(np.random.randn(20).reshape(5,4), columns=["ref_x", "ref_y", "alt_x", "alt_y"])
df
ref_x ref_y alt_x alt_y
0 2.523641 1.270625 0.127030 0.680601
1 -0.992681 -0.021022 0.461249 0.183311
2 -0.865873 -0.117191 -1.521882 -0.388608
3 -0.081354 -1.852463 -0.086464 0.249440
4 -0.057760 0.023642 0.002147 -1.009961
I know how to compute the cosine similarity with scipy, one row at a time:
df['sim'] = df.apply(lambda row: 1 - spatial.distance.cosine(row[['ref_x', 'ref_y']], row[['alt_x', 'alt_y']]), axis=1)
But it is slow, and my actual dataframe is large.
I want to do something like the following, but it raises "ValueError: Input vector should be 1-D.":
df['sim'] = 1 - spatial.distance.cosine(df[['ref_x', 'ref_y']], df[['alt_x', 'alt_y']])
Does anyone have any suggestions or comments?
Answer:
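Use cosine_similarity from sklearn. Note that when it is called on the whole frame, each row is treated as a single 4-D vector and the result is the pairwise similarity between every pair of rows (a 5x5 matrix here), not one ref-vs-alt value per row: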
from sklearn.metrics.pairwise import cosine_similarity
df = pd.DataFrame(np.random.randn(20).reshape(5,4), columns=["ref_x", "ref_y", "alt_x", "alt_y"])
co_sim = cosine_similarity(df.to_numpy())
pd.DataFrame(co_sim)
output:
0 1 2 3 4
0 1.000000 0.085483 -0.126060 -0.137558 -0.411323
1 0.085483 1.000000 -0.447271 -0.277837 0.440389
2 -0.126060 -0.447271 1.000000 0.309562 -0.306372
3 -0.137558 -0.277837 0.309562 1.000000 -0.811515
4 -0.411323 0.440389 -0.306372 -0.811515 1.000000
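If what you need is the per-row similarity from the question (the (ref_x, ref_y) vector against the (alt_x, alt_y) vector within each row), one possible vectorized approach is to compute the dot products and norms directly with NumPy. This is only a minimal sketch: the helper name rowwise_cosine_similarity is made up for illustration and is not a library function, and it assumes no row contains an all-zero vector.

```python
import numpy as np
import pandas as pd


def rowwise_cosine_similarity(a, b):
    """Cosine similarity between corresponding rows of two (n, d) arrays.

    Hypothetical helper for illustration; not part of scipy or sklearn.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Row-wise dot product: sum over columns of a[i, j] * b[i, j].
    dot = np.einsum("ij,ij->i", a, b)
    # Product of the Euclidean norms of each pair of rows.
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # Assumption: no zero-length vectors; those rows would divide by zero.
    return dot / norms


rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.standard_normal((5, 4)),
    columns=["ref_x", "ref_y", "alt_x", "alt_y"],
)

# One similarity value per row, computed without DataFrame.apply.
df["sim"] = rowwise_cosine_similarity(
    df[["ref_x", "ref_y"]].to_numpy(),
    df[["alt_x", "alt_y"]].to_numpy(),
)
print(df)
```

This should give the same values as the apply-based scipy version in the question, but it works on whole columns at once, so it avoids the per-row Python overhead.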