I have a df that looks like this:
id textcol1 textcol2 ... coln
1 blue bowl green bowl ... xxx
2 purple sheet green grass ... xxx
3 ground black pepper ground black pepper ... xxx
and so on...
I want to get the percentage of common words between textcol1 and textcol2:
id textcol1 textcol2 ... coln intersection
1 blue bowl green bowl ... xxx 50
2 purple sheet green grass ... xxx 0
3 ground black pepper ground black pepper ... xxx 100
After an embarrassingly long time, I've come up with the following solution:
df['intersection'] = [(len(set(a) & set(b)) / float(len(set(a) | set(b))) * 100) for a, b in zip(df.textcol1, df.textcol2)]
But the results are not what I would expect, for example passing "ground black pepper" twice yields 93.33333333333330.
I've gone through all the usual cleaning steps - removing whitespace, etc. - but can't figure out what the issue is here.
What am I missing?
CodePudding user response:
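First, the likely root cause of the odd number in the question: calling set() on a string iterates its characters, not its words, so the list comprehension compares character sets. Any stray character then skews the ratio; a minimal sketch (the trailing period below is only a hypothetical example of a stray character that would yield exactly 14/15 ≈ 93.33):
>>> sorted(set("ground black pepper"))  # characters, not words
[' ', 'a', 'b', 'c', 'd', 'e', 'g', 'k', 'l', 'n', 'o', 'p', 'r', 'u']
>>> a, b = "ground black pepper", "ground black pepper."
>>> len(set(a) & set(b)) / len(set(a) | set(b)) * 100
93.33333333333333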
Consider writing a generic text comparison function first, something like text_diff below, which computes a simple token overlap between two texts (i.e., sets of tokens):
def text_diff(text1, text2):
    return 100 * len(text1.intersection(text2)) / min(map(len, (text1, text2)))
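A quick sanity check on hand-built token sets (note the denominator is the smaller set, i.e. an overlap coefficient rather than a Jaccard index, which is what makes row 1 come out at 50 instead of 33):
>>> text_diff({"blue", "bowl"}, {"green", "bowl"})
50.0
>>> text_diff({"ground", "black", "pepper"}, {"ground", "black", "pepper"})
100.0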
Then you can get the two columns you want to compare and turn them into sets of tokens, e.g.,
df2 = df.filter(like="textcol").applymap(str.split).applymap(set)
Result:
textcol1 textcol2
id
1 {bowl, blue} {bowl, green}
2 {sheet, purple} {grass, green}
3 {pepper, black, ground} {pepper, black, ground}
So you can easily apply the function by doing
>>> df2.apply(lambda row: text_diff(*row), axis=1)
id
1 50.0
2 0.0
3 100.0
dtype: float64
That way you can easily tweak and/or replace your text_diff function. Do some research on text similarity measures, too, and use existing tools if applicable; fuzzywuzzy could be worth a shot.
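For instance, a minimal sketch with fuzzywuzzy's token_set_ratio, which tokenizes for you and returns a 0-100 score (assuming the package is installed; the fuzzy_score column name is just illustrative):
from fuzzywuzzy import fuzz

df["fuzzy_score"] = [fuzz.token_set_ratio(a, b) for a, b in zip(df.textcol1, df.textcol2)]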
CodePudding user response:
Here's a quick and dirty way, but it might need to be adjusted based on the text and on how you define an intersection, per Not a robot's points.
def intersections(x):
    # combine the word tokens from both columns into one list
    combined = x['textcol1'].split(' ') + x['textcol2'].split(' ')
    # count how many times each token appears in the combined list
    total = {i: combined.count(i) for i in combined}
    # tokens with a count above 1 are treated as shared; score them against the total token count
    return sum([v for v in total.values() if v != 1]) / len(combined) * 100
df['intersections'] = df.apply(intersections, axis=1)
print(df)
textcol1 textcol2 intersections
0 blue bowl green bowl 50.0
1 purple sheet green grass 0.0
2 ground black pepper ground black pepper 100.0
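One thing to watch for (a hypothetical edge case, not from the question's data): since the score is driven by duplicate counts in the combined list, a word repeated within a single column also registers as common:
>>> intersections({'textcol1': 'blue blue', 'textcol2': 'green'})
66.66666666666666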
CodePudding user response:
I think the other answers are good, but you want to get the percentage of common words between a row of textcol1 and textcol2. To obtain this, we have to retrieve all tokens from the row and count the occurrences shared between the word tokens of textcol1 and textcol2.
The percentage of common words in the first row must be 0.33, because we compare against the set words = {bowl, blue, green}. textcol1 and textcol2 have only one word in common, common_words: {bowl}. As a result we get: #common_words / #all_words = 1 / 3 = 0.33.
An example:
from functools import reduce
from operator import add

def fun(text1, text2):
    text1_tokens = text1.split(' ')
    text2_tokens = text2.split(' ')
    text1_set = set(text1_tokens)
    text2_set = set(text2_tokens)
    # tokens that appear in both texts
    text_intersect = list(set.intersection(text1_set, text2_set))
    # every distinct token across both texts
    all_tokens = list(set.union(text1_set, text2_set))
    # each shared token occurs exactly once in the union list, so this counts the intersection
    common_token_count = list(map(lambda x: all_tokens.count(x), text_intersect))
    if len(common_token_count) > 0:
        common_token_count = reduce(add, common_token_count)
        return f"{common_token_count / len(all_tokens):.2f}"
    else:
        return "0.00"
df["intersection"] = df.apply(lambda x: fun(x["text1"], x["text2"]), axis=1)
The output:
0 blue bowl green bowl 0.33
1 purple sheet green grass 0.00
2 ground black pepper ground black pepper 1.00
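Since all_tokens is built from a set union, each shared token occurs exactly once in it, so the computation reduces to a plain intersection-over-union. A more compact equivalent sketch (fun_compact is just an illustrative name):
def fun_compact(text1, text2):
    s1, s2 = set(text1.split(' ')), set(text2.split(' '))
    return f"{len(s1 & s2) / len(s1 | s2):.2f}"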