How to calculate a value based on the similarity of two strings

Time:04-18

I have two strings, each given as a list of words, and I want to calculate a value based on the words that occur in both.

actual = ['I', 'am', 'a', 'student', 'from', 'computer', 'science', 'department']
predicted = ['computer', 'and', 'science', 'department']

Above are the two sample strings I want to compare.

Example: 3 words in the predicted string also occur in the actual string.

The expected output is the length of the actual and predicted strings, plus the number of words they share, which is 3 in this case:

length of actual = 8
length of predicted = 4
similar word count = 3

CodePudding user response:

My idea is to convert actual and predicted into sets and then take their intersection. Note, however, that this does not count multiple occurrences: sets do not contain duplicates, so the similar word count for actual = ['computer', 'computer', 'computer'] and predicted = ['computer', 'computer', 'computer'] is 1.

actual = ['I', 'am', 'a', 'student', 'from', 'computer', 'science', 'department']
predicted = ['computer', 'and', 'science', 'department']

# print() already puts a space between its arguments, so the labels need no trailing space
print("length of actual =", len(actual))
print("length of predicted =", len(predicted))
print("similar word count =", len(set(actual).intersection(predicted)))
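
To make the caveat about duplicates concrete, here is a minimal sketch using the repeated-word lists mentioned above (illustrative data, not the lists from the question). It shows that the set intersection holds a single distinct word, no matter how often that word is repeated.

```python
# Illustrative data: every word is repeated three times.
actual = ['computer', 'computer', 'computer']
predicted = ['computer', 'computer', 'computer']

# Converting to sets collapses the repeats, so one distinct word remains.
common = set(actual) & set(predicted)

print(common)       # {'computer'}
print(len(common))  # 1
```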

CodePudding user response:

linuskmr's answer is pretty good if you only want to count the same word once in the similarity calculation.

However, if you want to count the same word multiple times, you can use collections.Counter rather than sets. Applying & to two counters keeps only the words that appear in both lists, each with the minimum of its counts in the two lists. The counts are stored in the values of the resulting counter, and we want the total number of common occurrences, so we use .values() and sum():

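from collections import Counter

actual = ['I', 'am', 'a', 'student', 'from', 'computer', 'science', 'department']
predicted = ['computer', 'and', 'science', 'department']

# & keeps each common word with the smaller of its two counts
result = sum((Counter(actual) & Counter(predicted)).values())

print(result)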

This outputs:

3
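
If you want all three values from the question in one place, the Counter approach could be wrapped in a small helper. This is only a sketch: the function name overlap_counts and the second test input are made up for illustration and do not come from the question.

```python
from collections import Counter


def overlap_counts(actual, predicted):
    """Return (len(actual), len(predicted), shared occurrences) for two word lists."""
    # & keeps each shared word with the smaller of its two counts.
    shared = Counter(actual) & Counter(predicted)
    return len(actual), len(predicted), sum(shared.values())


actual = ['I', 'am', 'a', 'student', 'from', 'computer', 'science', 'department']
predicted = ['computer', 'and', 'science', 'department']
print(overlap_counts(actual, predicted))  # (8, 4, 3)

# Hypothetical input with repeats: 'computer' occurs 3 times and 2 times,
# so it contributes min(3, 2) = 2 shared occurrences.
print(overlap_counts(['computer'] * 3, ['computer'] * 2 + ['science']))  # (3, 3, 2)
```

Unlike the set-based version, repeated words count once per matching occurrence here, so the two approaches only agree when neither list contains duplicates.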