Sum duplicate bigrams in dataframe

Time:11-12

I currently have a data frame that contains values such as:

        Bigram      Frequency
0   (ice, cream)        23
1   (cream, sandwich)   21
2   (google, android)   19
3   (galaxy, nexus)     14
4   (android, google)   12

Some of these rows should be merged, such as (google, android) and (android, google). There are other overlaps, like (ice, cream) and (cream, sandwich), but that is a different problem.

In order to sum up the duplicates I tried to do this:

def remove_duplicates(ngrams):
    return {" ".join(sorted(key.split(" "))):ngrams[key] for key in ngrams}

freq_all_tw_pos_bg['Word'] = freq_all_tw_pos_bg['Word'].apply(remove_duplicates)

I looked around and found similar questions where this approach is marked as the right answer, but when I try it I get:

TypeError: tuple indices must be integers or slices, not str

That makes sense, but when I tried converting the tuples to strings first, the bigrams got shuffled in a strange way. Am I missing something that should be easier?

EDIT: The input is the first table shown above: a list of bigrams, some of which are repeated because their words are reversed, e.g. (google, android) vs (android, google).

I want the output to be the same kind of dataframe of bigrams, but with the frequencies of reversed pairs summed. Processing the list above should produce:

        Bigram      Frequency
0   (ice, cream)        23
1   (cream, sandwich)   21
2   (google, android)   31
3   (galaxy, nexus)     14
4   (apple, iPhone)     6

Notice how it "merged" (google, android) and (android, google) and also summed up the frequencies.

CodePudding user response:

If the values are tuples, sort each one and convert the result back to a tuple:

freq_all_tw_pos_bg['Bigram'] = freq_all_tw_pos_bg['Bigram'].apply(lambda x: tuple(sorted(x)))
print(freq_all_tw_pos_bg)
              Bigram  Frequency
0       (cream, ice)         23
1  (cream, sandwich)         21
2  (android, google)         31
3    (galaxy, nexus)         14
4    (apple, iPhone)          6
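
To see why the conversion back to a tuple matters, here is a minimal sketch in plain Python. The variable names are illustrative and not from the original post. sorted() returns a list, which is not hashable and so generally cannot be used as a grouping key, while tuple(sorted(...)) gives both word orders the same hashable key.

```python
# Two orderings of the same bigram (values taken from the question).
a = ("google", "android")
b = ("android", "google")

# sorted() normalizes the order but returns a list, which is not hashable.
as_list = sorted(a)

# Wrapping the result in tuple() makes it hashable, so it can act as a key.
key_a = tuple(sorted(a))
key_b = tuple(sorted(b))

print(as_list)         # ['android', 'google']
print(key_a == key_b)  # True
```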

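Then group by the normalized bigram and sum the frequencies, so that the two (android, google) rows (19 and 12) collapse into a single row with a total of 31: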

df = freq_all_tw_pos_bg.groupby('Bigram', as_index=False)['Frequency'].sum()
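
For reference, the two steps can be put together as a self-contained sketch using the five rows from the question's first table. The names freq, merged and totals are illustrative, and sort=False is an addition here (not part of the answer above) that keeps the groups in first-seen order instead of sorting them by key.

```python
import pandas as pd

# Sample data copied from the question's first table.
freq = pd.DataFrame({
    "Bigram": [("ice", "cream"), ("cream", "sandwich"), ("google", "android"),
               ("galaxy", "nexus"), ("android", "google")],
    "Frequency": [23, 21, 19, 14, 12],
})

# Normalize each bigram so that word order no longer matters.
freq["Bigram"] = freq["Bigram"].apply(lambda x: tuple(sorted(x)))

# Sum the frequencies per normalized bigram, keeping first-seen order.
merged = freq.groupby("Bigram", as_index=False, sort=False)["Frequency"].sum()

# Look up the merged total for the android/google pair.
totals = dict(zip(merged["Bigram"], merged["Frequency"]))
print(totals[("android", "google")])  # 31
```

Note that the normalized bigrams are alphabetically ordered, so (ice, cream) is stored as (cream, ice). If the original word order matters for display, it would have to be kept in a separate column before normalizing.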