How to calculate the cosine similarity of two string lists with sklearn?

I have two lists of strings like this:

a_file = ['a', 'b', 'c']
b_file = ['b', 'x', 'y', 'z']

I want to calculate the cosine similarity of these two lists. I know how to do it by hand:

from collections import Counter

# count word occurrences
a_vals = Counter(a_file)
b_vals = Counter(b_file)

# convert to word-vectors
words  = list(a_vals.keys() | b_vals.keys())
a_vect = [a_vals.get(word, 0) for word in words]       
b_vect = [b_vals.get(word, 0) for word in words]        

# find cosine
len_a  = sum(av*av for av in a_vect) ** 0.5             
len_b  = sum(bv*bv for bv in b_vect) ** 0.5             
dot    = sum(av*bv for av,bv in zip(a_vect, b_vect))   
cosine = dot / (len_a * len_b) 

print(cosine)

However, when I try to use cosine_similarity from sklearn, it raises the error: could not convert string to float: 'a'. How can I fix it?

from sklearn.metrics.pairwise import cosine_similarity

a_file = ['a', 'b', 'c']
b_file = ['b', 'x', 'y', 'z']
print(cosine_similarity(a_file, b_file))

CodePudding user response:

It seems cosine_similarity needs

numeric word-vectors instead of raw strings,
two-dimensional data (a list of word-vectors, one per row),

so wrap each vector in a list:

print(cosine_similarity( [a_vect], [b_vect] ))

Full working code:

from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity

a_file = ['a', 'b', 'c']
b_file = ['b', 'x', 'y', 'z']

# count word occurrences
a_vals = Counter(a_file)
b_vals = Counter(b_file)

# convert to word-vectors
words  = list(a_vals.keys() | b_vals.keys())
a_vect = [a_vals.get(word, 0) for word in words]       
b_vect = [b_vals.get(word, 0) for word in words]        

# find cosine
len_a  = sum(av*av for av in a_vect) ** 0.5             
len_b  = sum(bv*bv for bv in b_vect) ** 0.5             
dot    = sum(av*bv for av,bv in zip(a_vect, b_vect))   
cosine = dot / (len_a * len_b) 

print(cosine)
print(cosine_similarity([a_vect], [b_vect]))

Result:

0.2886751345948129
[[0.28867513]]
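
If you would rather let sklearn build the word-vectors too, here is a minimal sketch using CountVectorizer. It assumes the lists are already tokenized, so the analyzer is just a function that returns each list unchanged; the names vectorizer, counts and similarity are only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

a_file = ['a', 'b', 'c']
b_file = ['b', 'x', 'y', 'z']

# The lists are already tokenized, so the analyzer returns each list as-is
# (assumption: no lowercasing or further splitting is wanted).
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)

# One row per list, one column per distinct token (sparse count matrix).
counts = vectorizer.fit_transform([a_file, b_file])

# Pairwise similarities between all rows; entry [0, 1] compares a_file with b_file.
similarity = cosine_similarity(counts)
print(similarity[0, 1])
```

This should print roughly 0.2887, the same value as the manual calculation, because the count vectors are the same up to the order of the columns.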

EDIT:

You can also pass a single list containing all the vectors (the second argument then defaults to None),
and it will compare all pairs: (a,a), (a,b), (b,a), (b,b).

print(cosine_similarity( [a_vect, b_vect] ))

Result:

[[1.         0.28867513]
 [0.28867513 1.        ]]

You can pass a longer list [a, b, c, ...] and it will compare all possible pairs.
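
For more than two lists, all vectors have to be built over one shared vocabulary first. A rough sketch of that step is below; c_file is a made-up third list, and the pairwise matrix is computed in plain Python only to show what is being compared. Passing vectors to cosine_similarity(vectors) should give the same matrix.

```python
from collections import Counter
from math import sqrt

a_file = ['a', 'b', 'c']
b_file = ['b', 'x', 'y', 'z']
c_file = ['a', 'b', 'z']  # hypothetical third list, not from the question

files = [a_file, b_file, c_file]
counters = [Counter(f) for f in files]

# Shared vocabulary in a fixed order, so every vector uses the same columns.
words = sorted(set().union(*counters))
vectors = [[c.get(word, 0) for word in words] for c in counters]

def cosine(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

# matrix[i][j] compares list i with list j, for every possible pair.
matrix = [[cosine(u, v) for v in vectors] for u in vectors]

for row in matrix:
    print(row)
```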

Documentation: cosine_similarity