Intersection / union for each set of words in any two given lists, in a Python for loop

I'm trying to define a score as intersection / union (the size of the intersection divided by the size of the union) for the sets of words in any two given lists. I understand that union and intersection are only available for set containers, and I've struggled to get this working. Could someone please help?

corpus = [
    ["i","did","not","like","the","service"],
    ["the","service","was","ok"],
    ["i","was","ignored","when","i","asked","for","service"]
]
tags = ["a","b","c"]
dct_keys = {
    "a":1,
    "b":2,
    "c":3
}
corpus_tags = list(zip(corpus,tags))

from itertools import combinations
my_keys = list(combinations(tags, 2))

goal_dct = {}
for i in range(len(my_keys)):
    goal_dct[(my_keys[i])] = {"id_alpha":(dct_keys[my_keys[i][0]]),
                             "id_beta"  :(dct_keys[my_keys[i][1]]),
                             "score" : (len(set1&set3))/(len(set1|set3))} # THIS IS WHAT I'M TRYING TO ACHIEVE HERE
print(goal_dct)

This is what I'm trying to define as the score. For example:

set1 = {"i","did","not","like","the","service"}
set2 = {"the","service","was","ok"}
set3 = {"i","was","ignored","when","i","asked","for","service"}
(len(set1&set3))/(len(set1|set3))

CodePudding user response：

This does not do what you think it does:

(len(set1)&len(set3))/(len(set1)|len(set3))

len returns an int. You can use the & and | operators on ints, but there they perform bitwise operations, which is not what you're looking for. Instead, apply those operators to the sets themselves, and then take the len of the resulting sets:

len(set1 & set3)/len(set1 | set3)
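
To make the difference concrete, here is a small check using set1 and set3 from the question (the duplicate "i" is dropped, since a set ignores it anyway). The variable names `bitwise` and `set_based` are only for illustration:

```python
set1 = {"i", "did", "not", "like", "the", "service"}
set3 = {"i", "was", "ignored", "when", "asked", "for", "service"}

# Bitwise operators applied to the two lengths (6 and 7), not to the sets:
# 6 & 7 == 6 and 6 | 7 == 7, so this is 6 / 7.
bitwise = (len(set1) & len(set3)) / (len(set1) | len(set3))

# Set operators applied first, then len() of the resulting sets:
# 2 shared words out of 11 distinct words, so this is 2 / 11.
set_based = len(set1 & set3) / len(set1 | set3)

print(bitwise)
print(set_based)
```

The first expression runs without error, which is why the mistake is easy to miss: it just returns an unrelated number.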

So a function that produces the score for any two lists of strings (sentences) would look like:

def score(s1: list[str], s2: list[str]) -> float:
    set1, set2 = set(s1), set(s2)
    return len(set1 & set2) / len(set1 | set2)
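
One edge case: if both lists are empty, the union is empty too and the division raises ZeroDivisionError. A minimal sketch of a guarded variant follows. The name `score_or_zero` and the choice to return 0.0 for two empty inputs are assumptions for illustration, not part of the answer above; like the original, it assumes Python 3.9+ for the `list[str]` annotation.

```python
def score_or_zero(s1: list[str], s2: list[str]) -> float:
    """Intersection size over union size, or 0.0 when both inputs are empty."""
    set1, set2 = set(s1), set(s2)
    union = set1 | set2
    if not union:
        # Assumption: two empty sentences are treated as having a score of 0.0.
        return 0.0
    return len(set1 & set2) / len(union)


print(score_or_zero([], []))
print(score_or_zero(["the", "service"], ["the", "service", "was", "ok"]))
```

Whether 0.0, 1.0, or an exception is the right answer for two empty inputs depends on how the score is used downstream.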

and you can use this to build scores for all the combinations in corpus:

from itertools import combinations
from string import ascii_lowercase

corpus = [
    ["i","did","not","like","the","service"],
    ["the","service","was","ok"],
    ["i","was","ignored","when","i","asked","for","service"]
]
tagged_corpus = dict(zip(ascii_lowercase, corpus))

def score(s1: list[str], s2: list[str]) -> float:
    set1, set2 = set(s1), set(s2)
    return len(set1 & set2) / len(set1 | set2)

goal = {
    (a, b): score(tagged_corpus[a], tagged_corpus[b])
    for a, b in combinations(tagged_corpus, 2)
}

print(goal)
# {('a', 'b'): 0.25,
#  ('a', 'c'): 0.18181818181818182,
#  ('b', 'c'): 0.2222222222222222}
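
If you need the nested shape from the question (an id for each side of the pair plus the score), the same idea can be sketched as below. It reuses `corpus`, `tags` and `dct_keys` from the question; `word_sets` is a helper name introduced here for illustration, and the third field is spelled "score".

```python
from itertools import combinations

corpus = [
    ["i", "did", "not", "like", "the", "service"],
    ["the", "service", "was", "ok"],
    ["i", "was", "ignored", "when", "i", "asked", "for", "service"],
]
tags = ["a", "b", "c"]
dct_keys = {"a": 1, "b": 2, "c": 3}

# Convert each sentence to a set once, keyed by its tag.
word_sets = {tag: set(words) for tag, words in zip(tags, corpus)}

goal_dct = {
    (a, b): {
        "id_alpha": dct_keys[a],
        "id_beta": dct_keys[b],
        "score": len(word_sets[a] & word_sets[b]) / len(word_sets[a] | word_sets[b]),
    }
    for a, b in combinations(tags, 2)
}

print(goal_dct[("a", "b")])
```

Building the sets up front avoids converting the same list again for every pair it appears in.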

CodePudding user response：

Make sets from your lists, then take their intersection:

set1 = set(some_list)
set2 = set(other_list)
common_items = set1.intersection(set2)
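
The same approach can be carried through to the full score with the `union` method. The sketch below fills in `some_list` and `other_list` with two sentences from the question purely as sample data; `all_items` and `score` are illustrative names.

```python
some_list = ["the", "service", "was", "ok"]
other_list = ["i", "was", "ignored", "when", "i", "asked", "for", "service"]

set1 = set(some_list)

# Unlike the & and | operators, the method forms accept any iterable,
# so other_list can be passed directly without converting it first.
common_items = set1.intersection(other_list)
all_items = set1.union(other_list)

# 2 shared words out of 9 distinct words.
score = len(common_items) / len(all_items)

print(sorted(common_items))
print(score)
```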