My dataframe looks like this:
ID  topics  text
1   1       twitter is my favorite social media
2   1       favorite social media
3   2       rt twitter tomorrow
4   3       rt facebook today
5   3       rt twitter
6   4       vote for the best twitter
7   2       twitter tomorrow
8   4       best twitter
I want to group by topics and use CountVectorizer (I really prefer CountVectorizer because it lets me remove stop words in multiple languages and set an n-gram range, e.g. (3, 4)) to compute the most frequent bigrams. After I get the most frequent bigram per topic, I want to create a new column called "bigram" and assign each topic's most frequent bigram to every row of that topic.
I want my output to look like this.
ID  topics  text                                 bigram
1   1       twitter is my favorite social media  favorite social
2   1       favorite social media                favorite social
3   2       rt twitter tomorrow                  twitter tomorrow
7   2       twitter tomorrow                     twitter tomorrow
4   3       rt facebook today                    rt twitter
5   3       rt twitter                           rt twitter
6   4       vote for the best twitter            best twitter
8   4       best twitter                         best twitter
Please note that the column 'topics' does NOT need to be in order; I sorted by topic here only for ease of visualization when creating this post.
This code will be run on 6M rows of data, so it needs to be fast.
What is the best way to do it using pandas? I apologize if it seems too complicated.
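A minimal snippet to rebuild the sample frame (values copied from the table above):
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'topics': [1, 1, 2, 3, 3, 4, 2, 4],
    'text': ['twitter is my favorite social media',
             'favorite social media',
             'rt twitter tomorrow',
             'rt facebook today',
             'rt twitter',
             'vote for the best twitter',
             'twitter tomorrow',
             'best twitter'],
})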
CodePudding user response:
You can use nltk.bigrams:
from nltk import bigrams
from collections import Counter

# split each text into words, build its bigrams, flatten to one "word word" string per bigram
_bigrams = df['text'].str.split().apply(bigrams).apply(list).explode().apply(' '.join)
cnt_bigram = Counter(_bigrams)

# for each row, keep the corpus-wide most frequent bigram that occurs in its text
df['bigram'] = df['text'].apply(lambda x: max([(k, v) for k, v in cnt_bigram.items() if k in x],
                                              key=lambda x: x[1])[0])
print(df)
Output:
   topics  text                                 bigram
0  1       twitter is my favorite social media  favorite social
1  1       favorite social media                favorite social
2  2       rt twitter tomorrow                  rt twitter
3  3       rt facebook today                    rt facebook
4  3       rt twitter                           rt twitter
5  4       vote for the best twitter            best twitter
6  2       twitter tomorrow                     twitter tomorrow
7  4       best twitter                         best twitter
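Note that this snippet counts bigrams over the whole corpus and then matches per row, so two rows of the same topic can receive different bigrams (see rows 2 and 6 above). If the bigram must be the most frequent one within each topic, a hedged per-group variant (same imports as above; the top_bigram helper name is my own) would be:
# count bigrams inside each topic and keep that topic's most frequent one
# (assumes every text has at least two words)
def top_bigram(texts):
    cnt = Counter(' '.join(b) for t in texts for b in bigrams(t.split()))
    return cnt.most_common(1)[0][0]

# transform broadcasts the per-group scalar back to every row of the group
df['bigram'] = df.groupby('topics')['text'].transform(top_bigram)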
CodePudding user response:
Update
You can use sklearn:
from sklearn.feature_extraction.text import CountVectorizer

# bigrams only (ngram_range=(2, 2)), English stop words removed
vect = CountVectorizer(analyzer='word', ngram_range=(2, 2), stop_words='english')
data = vect.fit_transform(df['text'])

# sum the bigram counts per topic and take each topic's most frequent bigram
bigram = (pd.DataFrame(data=data.toarray(),
                       index=df['topics'],
                       columns=vect.get_feature_names_out())
          .groupby('topics').sum().idxmax(axis=1))

# broadcast each topic's winning bigram back onto its rows
df['bigram'] = df['topics'].map(bigram)
print(df)
# Output
   ID  topics  text                                 bigram
0  1   1       twitter is my favorite social media  favorite social
1  2   1       favorite social media                favorite social
2  3   2       rt twitter tomorrow                  twitter tomorrow
3  4   3       rt facebook today                    facebook today
4  5   3       rt twitter                           facebook today
5  6   4       vote for the best twitter            best twitter
6  7   2       twitter tomorrow                     twitter tomorrow
7  8   4       best twitter                         best twitter
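One caveat for the 6M-row requirement: data.toarray() densifies the full document-term matrix, which can exhaust memory at that scale. A sketch that keeps the counts sparse, reusing vect and data from above (the factorize/indicator-matrix trick is my assumption, not part of the original answer):
import numpy as np
from scipy.sparse import csr_matrix

# integer code per topic, plus the list of unique topics
codes, uniques = pd.factorize(df['topics'])

# sparse indicator matrix (n_topics x n_docs) that sums document rows per topic
ind = csr_matrix((np.ones(len(codes)), (codes, np.arange(len(codes)))))

per_topic = ind @ data                              # n_topics x n_bigrams, still sparse
top = np.asarray(per_topic.argmax(axis=1)).ravel()  # top bigram column per topic

bigram = pd.Series(vect.get_feature_names_out()[top], index=uniques)
df['bigram'] = df['topics'].map(bigram)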
Old answer
You can use nltk:
import nltk

# tuples (unlike lists) are hashable, which Series.mode below requires
to_bigram = lambda x: tuple(nltk.bigrams(x.split()))

# most common bigram sequence per topic; [0][0] takes the first bigram
# of the (lexicographically first) winning sequence
most_common = (df.set_index('topics')['text'].map(to_bigram)
               .groupby(level=0).apply(lambda x: x.mode()[0][0]))
df['bigram'] = df['topics'].map(most_common)
print(df)
# Output
   ID  topics  text                                 bigram
0  1   1       twitter is my favorite social media  (favorite, social)
1  2   1       favorite social media                (favorite, social)
2  3   2       rt twitter tomorrow                  (rt, twitter)
3  4   3       rt facebook today                    (rt, facebook)
4  5   3       rt twitter                           (rt, facebook)
5  6   4       vote for the best twitter            (best, twitter)
6  7   2       twitter tomorrow                     (rt, twitter)
7  8   4       best twitter                         (best, twitter)
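Note that mode here compares whole bigram lists, so each topic's winner is the most common sequence of bigrams rather than the most common individual bigram (topic 2 gets (rt, twitter) even though (twitter, tomorrow) occurs twice). A hedged alternative that counts individual bigrams, assuming the same to_bigram helper:
# explode to one bigram per row, then keep the most frequent bigram per topic
most_common = (df.set_index('topics')['text'].map(to_bigram).explode()
               .groupby(level=0)
               .apply(lambda s: s.value_counts().idxmax()))
df['bigram'] = df['topics'].map(most_common)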