Here's my first dataframe df1
Id Text
1 dFn
2 fiqe
3 raUw
Here's my second dataframe df2
Id Text
1 yuw
2 dnag
Similarity matrix (columns are Id from df1, rows are Id from df2)
   1    2    3
1  0    0    0.66
2  0.5  0    0.25
Note (cells are given as (column, row), i.e. (df1 Id, df2 Id)):
0 in (1,1), (2,1) and (2,2) because the two texts share no letters
0.25 in (3,2) because only 1 letter from raUw is available in the 4-letter dnag (1/4 equals 0.25)
0.5 in (1,2) because 2 of the 4 letters in dnag also appear in dFn
0.66 in (3,1) because 2 of the 3 letters in yuw also appear in raUw, ignoring case
CodePudding user response:
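IIUC, one option is to use set intersection (the & operator) in a nested list comprehension, lowercasing both texts and dividing the number of shared letters by the length of the df2 text: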
import pandas as pd
out = pd.DataFrame([[len(set(x.lower()) & set(y.lower())) / len(x) for y in df1['Text'].tolist()] for x in df2['Text'].tolist()])
Output:
0 1 2
0 0.0 0.0 0.666667
1 0.5 0.0 0.250000
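To get the Id labels from the question instead of the default 0-based positions, the same computation can be sketched as a self-contained script. This is only a minimal sketch: the helper name letter_overlap and the variable similarity are made up for illustration, the sample frames are rebuilt from the question, and every Text value is assumed to be a non-empty string (an empty string would raise a ZeroDivisionError).

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({"Id": [1, 2, 3], "Text": ["dFn", "fiqe", "raUw"]})
df2 = pd.DataFrame({"Id": [1, 2], "Text": ["yuw", "dnag"]})


def letter_overlap(row_text: str, col_text: str) -> float:
    """Hypothetical helper: distinct shared letters divided by len(row_text)."""
    # Compare case-insensitively, as the expected 0.66 for raUw/yuw implies.
    shared = set(row_text.lower()) & set(col_text.lower())
    # Assumes row_text is non-empty; otherwise this divides by zero.
    return len(shared) / len(row_text)


# Rows follow df2, columns follow df1, both labelled with their Id values.
similarity = pd.DataFrame(
    [[letter_overlap(x, y) for y in df1["Text"]] for x in df2["Text"]],
    index=df2["Id"].tolist(),
    columns=df1["Id"].tolist(),
)
print(similarity)
```

Note that the numerator counts distinct letters while the denominator counts every character, so a df2 text with repeated letters scores lower than a per-character count would give. That matches the sample data, where no text repeats a letter, but may not be what you want in general.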