My dataframe looks like this:
ID  topics  text
1   1       twitter is my favorite social media
2   1       favorite social media
3   2       rt twitter tomorrow
4   3       rt facebook today
5   3       rt twitter
6   4       vote for the best twitter
7   2       twitter tomorrow
8   4       best twitter
I want to group by topics and use CountVectorizer (I really prefer CountVectorizer because it lets me remove stop words in multiple languages and set an n-gram range, e.g. (3, 4)) to compute the most frequent bigrams. After I get the most frequent bigram per topic, I want to create a new column called "bigram" and assign each topic's most frequent bigram to every row of that topic.
I want my output to look like this.
ID  topics  text                                 bigram
1   1       twitter is my favorite social media  favorite social
2   1       favorite social media                favorite social
3   2       rt twitter tomorrow                  twitter tomorrow
7   2       twitter tomorrow                     twitter tomorrow
4   3       rt facebook today                    rt twitter
5   3       rt twitter                           rt twitter
6   4       vote for the best twitter            best twitter
8   4       best twitter                         best twitter
Please note that the column 'topics' does NOT need to be in order; I sorted by topic here only for ease of visualization when creating this post.
This code will be run on 6M rows of data, so it needs to be fast.
What is the best way to do it using pandas? I apologize if it seems too complicated.
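A minimal snippet to rebuild the sample frame (values copied from the table above):
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'topics': [1, 1, 2, 3, 3, 4, 2, 4],
    'text': ['twitter is my favorite social media',
             'favorite social media',
             'rt twitter tomorrow',
             'rt facebook today',
             'rt twitter',
             'vote for the best twitter',
             'twitter tomorrow',
             'best twitter'],
})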
CodePudding user response:
You can use nltk.bigrams:
from nltk import bigrams
from collections import Counter

# split each text into words, build its bigrams, flatten to one "word word" string per bigram
_bigrams = df['text'].str.split().apply(bigrams).apply(list).explode().apply(' '.join)
cnt_bigram = Counter(_bigrams)

# for each row, keep the corpus-wide most frequent bigram that occurs in its text
df['bigram'] = df['text'].apply(lambda x: max([(k, v) for k, v in cnt_bigram.items() if k in x],
                                              key=lambda x: x[1])[0])
print(df)
Output:
   topics  text                                 bigram
0  1       twitter is my favorite social media  favorite social
1  1       favorite social media                favorite social
2  2       rt twitter tomorrow                  rt twitter
3  3       rt facebook today                    rt facebook
4  3       rt twitter                           rt twitter
5  4       vote for the best twitter            best twitter
6  2       twitter tomorrow                     twitter tomorrow
7  4       best twitter                         best twitter
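Note that this snippet counts bigrams over the whole corpus and then matches per row, so two rows of the same topic can receive different bigrams (see rows 2 and 6 above). If the bigram must be the most frequent one within each topic, a hedged per-group variant (same imports as above; the top_bigram helper name is my own) would be:
# count bigrams inside each topic and keep that topic's most frequent one
# (assumes every text has at least two words)
def top_bigram(texts):
    cnt = Counter(' '.join(b) for t in texts for b in bigrams(t.split()))
    return cnt.most_common(1)[0][0]

# transform broadcasts the per-group scalar back to every row of the group
df['bigram'] = df.groupby('topics')['text'].transform(top_bigram)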
CodePudding user response:
Update
You can use sklearn:
from sklearn.feature_extraction.text import CountVectorizer

# bigrams only (ngram_range=(2, 2)), English stop words removed
vect = CountVectorizer(analyzer='word', ngram_range=(2, 2), stop_words='english')
data = vect.fit_transform(df['text'])

# sum the bigram counts per topic and take each topic's most frequent bigram
bigram = (pd.DataFrame(data=data.toarray(),
                       index=df['topics'],
                       columns=vect.get_feature_names_out())
          .groupby('topics').sum().idxmax(axis=1))

# broadcast each topic's winning bigram back onto its rows
df['bigram'] = df['topics'].map(bigram)
print(df)
# Output
   ID  topics  text                                 bigram
0  1   1       twitter is my favorite social media  favorite social
1  2   1       favorite social media                favorite social
2  3   2       rt twitter tomorrow                  twitter tomorrow
3  4   3       rt facebook today                    facebook today
4  5   3       rt twitter                           facebook today
5  6   4       vote for the best twitter            best twitter
6  7   2       twitter tomorrow                     twitter tomorrow
7  8   4       best twitter                         best twitter
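One caveat for the 6M-row requirement: data.toarray() densifies the full document-term matrix, which can exhaust memory at that scale. A sketch that keeps the counts sparse, reusing vect and data from above (the factorize/indicator-matrix trick is my assumption, not part of the original answer):
import numpy as np
from scipy.sparse import csr_matrix

# integer code per topic, plus the list of unique topics
codes, uniques = pd.factorize(df['topics'])

# sparse indicator matrix (n_topics x n_docs) that sums document rows per topic
ind = csr_matrix((np.ones(len(codes)), (codes, np.arange(len(codes)))))

per_topic = ind @ data                              # n_topics x n_bigrams, still sparse
top = np.asarray(per_topic.argmax(axis=1)).ravel()  # top bigram column per topic

bigram = pd.Series(vect.get_feature_names_out()[top], index=uniques)
df['bigram'] = df['topics'].map(bigram)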
Old answer
You can use nltk:
import nltk

# tuples (unlike lists) are hashable, which Series.mode below requires
to_bigram = lambda x: tuple(nltk.bigrams(x.split()))

# most common bigram sequence per topic; [0][0] takes the first bigram
# of the (lexicographically first) winning sequence
most_common = (df.set_index('topics')['text'].map(to_bigram)
               .groupby(level=0).apply(lambda x: x.mode()[0][0]))
df['bigram'] = df['topics'].map(most_common)
print(df)
# Output
   ID  topics  text                                 bigram
0  1   1       twitter is my favorite social media  (favorite, social)
1  2   1       favorite social media                (favorite, social)
2  3   2       rt twitter tomorrow                  (rt, twitter)
3  4   3       rt facebook today                    (rt, facebook)
4  5   3       rt twitter                           (rt, facebook)
5  6   4       vote for the best twitter            (best, twitter)
6  7   2       twitter tomorrow                     (rt, twitter)
7  8   4       best twitter                         (best, twitter)
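Note that mode here compares whole bigram lists, so each topic's winner is the most common sequence of bigrams rather than the most common individual bigram (topic 2 gets (rt, twitter) even though (twitter, tomorrow) occurs twice). A hedged alternative that counts individual bigrams, assuming the same to_bigram helper:
# explode to one bigram per row, then keep the most frequent bigram per topic
most_common = (df.set_index('topics')['text'].map(to_bigram).explode()
               .groupby(level=0)
               .apply(lambda s: s.value_counts().idxmax()))
df['bigram'] = df['topics'].map(most_common)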