Home > Software engineering >  Counting the frequency of words in a pandas column and counting another column
Counting the frequency of words in a pandas column and counting another column

Time:12-15

I have a data frame with comments and their labels.

Comments Label
I love my Teammates Positive
We need higher pay Suggestions
I hate my boss Negative

I would like to get an output like

Word count Positive Negative Suggestions
I 2 1 1 0
my 2 1 1 0
Teammates 1 1 0 0
love 1 1 0 0
We 1 0 0 1
need 1 0 0 1
higher 1 0 0 1
pay 1 0 0 1
hate 1 0 1 0
boss 1 0 1 0

I was able to get the word count by using

df.Comments.str.split(expand=True).stack().value_counts() 

But I am not able to get the label counts. Any help will be appreciated!

CodePudding user response:

You can use:

out = (
    df['Comments'].str.split().explode().to_frame('Word').join(df['Label']).assign(value=1) \
                  .pivot_table('value', 'Word', 'Label', aggfunc='count', fill_value=0) \
                  .assign(Count=lambda x: x.sum(axis=1))
)

Output:

>>> out
Label      Negative  Positive  Suggestions  Count
Word                                             
I                 1         1            0      2
Teammates         0         1            0      1
We                0         0            1      1
boss              1         0            0      1
hate              1         0            0      1
higher            0         0            1      1
love              0         1            0      1
my                1         1            0      2
need              0         0            1      1
pay               0         0            1      1

Details:

Step 1. explode each comment into word and assign a dummy value.

out = df['Comments'].str.split().explode().to_frame('Word').join(df['Label']).assign(value=1)
print(out)

# Output:
        Word        Label  value
0          I     Positive      1
0       love     Positive      1
0         my     Positive      1
0  Teammates     Positive      1
1         We  Suggestions      1
1       need  Suggestions      1
1     higher  Suggestions      1
1        pay  Suggestions      1
2          I     Negative      1
2       hate     Negative      1
2         my     Negative      1
2       boss     Negative      1

Step 2. pivot your dataframe.

out = out.pivot_table('value', 'Word', 'Label', aggfunc='count', fill_value=0)
print(out)

# Output:
Label      Negative  Positive  Suggestions
Word                                      
I                 1         1            0
Teammates         0         1            0
We                0         0            1
boss              1         0            0
hate              1         0            0
higher            0         0            1
love              0         1            0
my                1         1            0
need              0         0            1
pay               0         0            1

Step 3.: create the count column.

out = out.assign(Count=lambda x: x.sum(axis=1))
print(out)

# Output:
Label      Negative  Positive  Suggestions  Count
Word                                             
I                 1         1            0      2
Teammates         0         1            0      1
We                0         0            1      1
boss              1         0            0      1
hate              1         0            0      1
higher            0         0            1      1
love              0         1            0      1
my                1         1            0      2
need              0         0            1      1
pay               0         0            1      1

CodePudding user response:

You can do the following

out = (
    df.assign(Word=lambda df: df.Comments.str.split())   # Create a column 'World' with the list of words 
      .explode('Word')  # explode the list of words into new rows 
      .pipe(lambda df: pd.crosstab(df.Word, df.Label)) # cross table/ pivot table between 'Word' and 'Label' columns
      .assign(Count=lambda df: df.sum(axis=1))   # Count the column's total
      .reset_index()  # 'Word' index to column
      .rename_axis(columns=None) # remove the name ('Label') of the columns axis
)

Output:

>>> out 

        Word  Negative  Positive  Suggestions  Count
0          I         1         1            0      2
1  Teammates         0         1            0      1
2         We         0         0            1      1
3       boss         1         0            0      1
4       hate         1         0            0      1
5     higher         0         0            1      1
6       love         0         1            0      1
7         my         1         1            0      2
8       need         0         0            1      1
9        pay         0         0            1      1
  • Related