I have a data frame with comments and their labels.
Comments | Label |
---|---|
I love my Teammates | Positive |
We need higher pay | Suggestions |
I hate my boss | Negative |
I would like to get an output like
Word | count | Positive | Negative | Suggestions |
---|---|---|---|---|
I | 2 | 1 | 1 | 0 |
my | 2 | 1 | 1 | 0 |
Teammates | 1 | 1 | 0 | 0 |
love | 1 | 1 | 0 | 0 |
We | 1 | 0 | 0 | 1 |
need | 1 | 0 | 0 | 1 |
higher | 1 | 0 | 0 | 1 |
pay | 1 | 0 | 0 | 1 |
hate | 1 | 0 | 1 | 0 |
boss | 1 | 0 | 1 | 0 |
I was able to get the word count by using
df.Comments.str.split(expand=True).stack().value_counts()
But I am not able to get the label counts. Any help will be appreciated!
CodePudding user response:
You can use:
out = (
df['Comments'].str.split().explode().to_frame('Word').join(df['Label']).assign(value=1) \
.pivot_table('value', 'Word', 'Label', aggfunc='count', fill_value=0) \
.assign(Count=lambda x: x.sum(axis=1))
)
Output:
>>> out
Label Negative Positive Suggestions Count
Word
I 1 1 0 2
Teammates 0 1 0 1
We 0 0 1 1
boss 1 0 0 1
hate 1 0 0 1
higher 0 0 1 1
love 0 1 0 1
my 1 1 0 2
need 0 0 1 1
pay 0 0 1 1
Details:
Step 1. explode each comment into word and assign a dummy value.
out = df['Comments'].str.split().explode().to_frame('Word').join(df['Label']).assign(value=1)
print(out)
# Output:
Word Label value
0 I Positive 1
0 love Positive 1
0 my Positive 1
0 Teammates Positive 1
1 We Suggestions 1
1 need Suggestions 1
1 higher Suggestions 1
1 pay Suggestions 1
2 I Negative 1
2 hate Negative 1
2 my Negative 1
2 boss Negative 1
Step 2. pivot your dataframe.
out = out.pivot_table('value', 'Word', 'Label', aggfunc='count', fill_value=0)
print(out)
# Output:
Label Negative Positive Suggestions
Word
I 1 1 0
Teammates 0 1 0
We 0 0 1
boss 1 0 0
hate 1 0 0
higher 0 0 1
love 0 1 0
my 1 1 0
need 0 0 1
pay 0 0 1
Step 3.: create the count
column.
out = out.assign(Count=lambda x: x.sum(axis=1))
print(out)
# Output:
Label Negative Positive Suggestions Count
Word
I 1 1 0 2
Teammates 0 1 0 1
We 0 0 1 1
boss 1 0 0 1
hate 1 0 0 1
higher 0 0 1 1
love 0 1 0 1
my 1 1 0 2
need 0 0 1 1
pay 0 0 1 1
CodePudding user response:
You can do the following
out = (
df.assign(Word=lambda df: df.Comments.str.split()) # Create a column 'World' with the list of words
.explode('Word') # explode the list of words into new rows
.pipe(lambda df: pd.crosstab(df.Word, df.Label)) # cross table/ pivot table between 'Word' and 'Label' columns
.assign(Count=lambda df: df.sum(axis=1)) # Count the column's total
.reset_index() # 'Word' index to column
.rename_axis(columns=None) # remove the name ('Label') of the columns axis
)
Output:
>>> out
Word Negative Positive Suggestions Count
0 I 1 1 0 2
1 Teammates 0 1 0 1
2 We 0 0 1 1
3 boss 1 0 0 1
4 hate 1 0 0 1
5 higher 0 0 1 1
6 love 0 1 0 1
7 my 1 1 0 2
8 need 0 0 1 1
9 pay 0 0 1 1