I have a dataframe:
import pandas as pd
data = [['apple', 'one', 0.0, [0.047668457, -0.04888916]], ['banana', 'two', 0.0 , [0.0287323, -0.037841797] ], ['qiwi', 'three', 0.0, [0.031051636, -0.05227661]],
['orange', 'one', 1.0, [0.0020618439, -0.055389404]], ['mango', 'two', 1.0, [0.0030326843, -0.036193848]], ['strawberry', 'three', 1.0, [0.008613586, -0.06561279]]]
df = pd.DataFrame(data, columns=['word', 'group', 'count', 'vec'])
---------- ----- ----- -------------------- ----------
| word|group|count| vec| word2|
---------- ----- ----- -------------------- ----------
| apple| one| 0.0|[0.047668457, -0....| apple|
| banana| two| 0.0|[0.0287323, -0.03...| banana|
| qiwi|three| 0.0|[0.031051636, -0....| qiwi|
| orange| one| 1.0|[0.0020618439, -0...| orange|
| mango| two| 1.0|[0.0030326843, -0...| mango|
|strawberry|three| 1.0|[0.008613586, -0....|strawberry|
---------- ----- ----- -------------------- ----------
I want to create a 6x6 dataframe that holds the cosine similarity between every pair of rows. The result should look like this (only the first 2 rows are shown):
------ ---------- ---------- ------------------ ------------------ ------------------ ------------------
| word| apple| banana| qiwi| orange| mango| strawberry|
------ ---------- ---------- ------------------ ------------------ ------------------ ------------------
| apple| 1.0|0.99240247|0.9721006775103194|0.7414623055821596| 0.771779|0.8007656107780402|
|banana|0.99240247| 1.0| 0.99357443| 0.81838407| 0.84415172| 0.868376|
------ ---------- ---------- ------------------ ------------------ ------------------ ------------------
...........................
I tried this, but I don't know how to fill in all the nulls:
df['word2'] = df['word']
df_piv = df.pivot_table(index=['word'], columns='word2',
values='vec', aggfunc='first').reset_index()
# calc cos sim
# df2 = df_piv .set_index('word')
# v = cosine_similarity(df2.values)
# done = pd.DataFrame(v, columns=df2.index.values, index=df2.index).reset_index()
---------- -------------------- -------------------- -------------------- -------------------- -------------------- --------------------
| word| apple| banana| mango| orange| qiwi| strawberry|
---------- -------------------- -------------------- -------------------- -------------------- -------------------- --------------------
| apple|[0.047668457, -0....| null| null| null| null| null|
| banana| null|[0.0287323, -0.03...| null| null| null| null|
| mango| null| null|[0.0030326843, -0...| null| null| null|
| orange| null| null| null|[0.0020618439, -0...| null| null|
| qiwi| null| null| null| null|[0.031051636, -0....| null|
|strawberry| null| null| null| null| null|[0.008613586, -0....|
---------- -------------------- -------------------- -------------------- -------------------- -------------------- --------------------
CodePudding user response:
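You can use cdist from scipy.spatial.distance. With metric='cosine' it returns the cosine distance, so subtract the result from 1 to get the similarity: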
from scipy.spatial.distance import cdist
vecs = df['vec'].to_list()
pd.DataFrame(1 - cdist(vecs, vecs, metric='cosine'),
index=df['word'], columns=df['word'])
Output:
word apple banana qiwi orange mango strawberry
word
apple 1.000000 0.992402 0.972101 0.741462 0.771779 0.800766
banana 0.992402 1.000000 0.993574 0.818384 0.844152 0.868376
qiwi 0.972101 0.993574 1.000000 0.878167 0.899404 0.918923
orange 0.741462 0.818384 0.878167 1.000000 0.998924 0.995648
mango 0.771779 0.844152 0.899404 0.998924 1.000000 0.998899
strawberry 0.800766 0.868376 0.918923 0.995648 0.998899 1.000000
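As a cross-check on what the cdist call computes, here is a minimal NumPy-only sketch: cosine similarity is the dot product of the vectors after each one is scaled to unit length. The names `unit` and `sim` are just illustrative, and the sketch assumes no vector in `vec` is all zeros (that would divide by zero).

```python
import numpy as np
import pandas as pd

data = [['apple', [0.047668457, -0.04888916]],
        ['banana', [0.0287323, -0.037841797]],
        ['qiwi', [0.031051636, -0.05227661]],
        ['orange', [0.0020618439, -0.055389404]],
        ['mango', [0.0030326843, -0.036193848]],
        ['strawberry', [0.008613586, -0.06561279]]]
df = pd.DataFrame(data, columns=['word', 'vec'])

# Stack the list cells into a 2-D array: one row per word.
vecs = np.array(df['vec'].to_list())

# Scale every row to unit length (assumes no all-zero vector).
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# Dot products of unit vectors are the pairwise cosine similarities.
sim = pd.DataFrame(unit @ unit.T, index=df['word'], columns=df['word'])
print(sim.round(6))
```

The values should agree with the cdist output above up to floating-point rounding.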
CodePudding user response:
You can also use the cosine_similarity function from sklearn.metrics.pairwise:
from sklearn.metrics.pairwise import cosine_similarity
vectors = df['vec'].to_list()
pd.DataFrame(cosine_similarity(vectors, vectors),
index=df['word'], columns=df['word'])
Output would be:
word apple banana qiwi orange mango strawberry
word
apple 1.000000 0.992402 0.972101 0.741462 0.771779 0.800766
banana 0.992402 1.000000 0.993574 0.818384 0.844152 0.868376
qiwi 0.972101 0.993574 1.000000 0.878167 0.899404 0.918923
orange 0.741462 0.818384 0.878167 1.000000 0.998924 0.995648
mango 0.771779 0.844152 0.899404 0.998924 1.000000 0.998899
strawberry 0.800766 0.868376 0.918923 0.995648 0.998899 1.000000
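If you want the layout from the question, with word as an ordinary column instead of the index, something along these lines may work. It is only a sketch on a three-row subset of the data; `sim` and `flat` are illustrative names. Calling cosine_similarity with a single argument compares the rows with each other.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

data = [['apple', [0.047668457, -0.04888916]],
        ['banana', [0.0287323, -0.037841797]],
        ['qiwi', [0.031051636, -0.05227661]]]
df = pd.DataFrame(data, columns=['word', 'vec'])

vectors = df['vec'].to_list()
sim = pd.DataFrame(cosine_similarity(vectors),
                   index=df['word'], columns=df['word'])

# Drop the name of the column axis so the header has no stray 'word' label,
# then move the index into a regular 'word' column.
flat = sim.rename_axis(None, axis=1).reset_index()
print(flat)
```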