Home > Back-end >  How do I get the most frequent words in a column of text based on the value of another column?
How do I get the most frequent words in a column of text based on the value of another column?

Time:03-23

I have a dataset of tweets and the year they were posted. I want to get a count of the most frequently occurring words each year. My dataset looks like this:

year     tweet
2015     my car is blue
2015     mom is making dinner
2016     my hair is red
2016     i love my mom

I only know how to get the most frequently occurring words for the entire dataset:

pd.Series(' '.join(df['tweets']).split()).value_counts()

Which would give me this:

my      3
is      3
mom     2
car     1
blue    1
making  1
dinner  1
hair    1
red     1
i       1
love    1

So how would I get something like this?

2015

is      2
my      1
car     1
blue    1
mom     1
making  1
dinner  1

2016

my      2
hair    1
is      1
red     1
i       1
love    1
mom     1

CodePudding user response:

I'd do something like this:

counts = df.set_index('year')['tweet'].str.split().explode().groupby(level=0).apply(pd.value_counts)

Output:

>>> counts
year        
2015  is        2
      my        1
      car       1
      blue      1
      mom       1
      making    1
      dinner    1
2016  my        2
      hair      1
      is        1
      red       1
      i         1
      love      1
      mom       1
Name: tweet, dtype: int6

To get the top, say, 5 items per year:

df.set_index('year')['tweet'].str.split().explode().groupby(level=0).apply(lambda x: x.value_counts().head(5))
  • Related