I currently have a data frame that contains values such as:
Bigram Frequency
0 (ice, cream) 23
1 (cream, sandwich) 21
2 (google, android) 19
3 (galaxy, nexus) 14
4 (android, google) 12
There are values in there that I want to merge (like google, android and android, google). There are others like "ice, cream" and "cream, sandwich", but that's a different problem.
In order to sum up the duplicates I tried to do this:
def remove_duplicates(ngrams):
    return {" ".join(sorted(key.split(" "))):ngrams[key] for key in ngrams}
freq_all_tw_pos_bg['Word'] = freq_all_tw_pos_bg['Word'].apply(remove_duplicates)
I looked around and found similar exercises which are marked as right answers but when I try to do it I get:
TypeError: tuple indices must be integers or slices, not str
Which makes sense but then I tried converting it to a string and it shuffled the bigrams in a weird way so I wonder, am I missing something that should be easier?
EDIT: The input is the first set of values I show: a list of bigrams, some of which are repeated because the words in them are reversed (i.e. google, android vs android, google).
I want to have this same output (that is, a dataframe with the bigrams) but with the frequencies of the reversed bigrams summed up. If I take the list from above and process it, then it should output:
Bigram Frequency
0 (ice, cream) 23
1 (cream, sandwich) 21
2 (google, android) 31
3 (galaxy, nexus) 14
4 (apple, iPhone) 6
Notice how it "merged" (google, android) and (android, google) and also summed up the frequencies.
CodePudding user response:
If the values are tuples, use sorted and convert the result back to a tuple:
freq_all_tw_pos_bg['Bigram'] = freq_all_tw_pos_bg['Bigram'].apply(lambda x:tuple(sorted(x)))
print (freq_all_tw_pos_bg)
Bigram Frequency
0 (cream, ice) 23
1 (cream, sandwich) 21
2 (android, google) 31
3 (galaxy, nexus) 14
4 (apple, iPhone) 6
And then aggregate with sum:
df = freq_all_tw_pos_bg.groupby('Bigram', as_index=False)['Frequency'].sum()
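Putting the two steps together, here is a minimal end-to-end sketch. It assumes the Bigram column holds Python tuples and rebuilds the five sample rows from the question by hand. The helper column name `key` and the variable names `df` and `merged` are made up for illustration. As a variation on the approach above, it keeps the first-seen spelling of each bigram (so the result shows (google, android) rather than the sorted form) and passes sort=False to preserve the original row order:

```python
import pandas as pd

# Sample data copied from the first table in the question
# (assumption: each bigram is stored as a tuple of two strings).
df = pd.DataFrame({
    "Bigram": [
        ("ice", "cream"),
        ("cream", "sandwich"),
        ("google", "android"),
        ("galaxy", "nexus"),
        ("android", "google"),
    ],
    "Frequency": [23, 21, 19, 14, 12],
})

# Hypothetical helper column: an order-independent key, built by sorting
# the words inside each tuple so reversed bigrams share the same key.
df["key"] = df["Bigram"].apply(lambda bigram: tuple(sorted(bigram)))

# Group on the key, keep the first spelling seen for each group,
# and sum the frequencies. sort=False keeps first-appearance order.
merged = (
    df.groupby("key", sort=False, as_index=False)
      .agg(Bigram=("Bigram", "first"), Frequency=("Frequency", "sum"))
      .drop(columns="key")
)

print(merged)
```

If the sorted form of the bigram is acceptable in the output, the helper column is unnecessary and the two-line version above is enough.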