I have a dataset of tweets and the year they were posted. I want to get a count of the most frequently occurring words each year. My dataset looks like this:
year tweet
2015 my car is blue
2015 mom is making dinner
2016 my hair is red
2016 i love my mom
I only know how to get the most frequently occurring words for the entire dataset:
pd.Series(' '.join(df['tweets']).split()).value_counts()
Which would give me this:
my 3
is 3
mom 2
car 1
blue 1
making 1
dinner 1
hair 1
red 1
i 1
love 1
So how would I get something like this?
2015
is 2
my 1
car 1
blue 1
mom 1
making 1
dinner 1
2016
my 2
hair 1
is 1
red 1
i 1
love 1
mom 1
CodePudding user response:
I'd do something like this:
counts = df.set_index('year')['tweet'].str.split().explode().groupby(level=0).apply(pd.value_counts)
Output:
>>> counts
year
2015 is 2
my 1
car 1
blue 1
mom 1
making 1
dinner 1
2016 my 2
hair 1
is 1
red 1
i 1
love 1
mom 1
Name: tweet, dtype: int6
To get the top, say, 5 items per year:
df.set_index('year')['tweet'].str.split().explode().groupby(level=0).apply(lambda x: x.value_counts().head(5))