Home > front end >  pandas groupby and count same column value
pandas groupby and count same column value

Time:01-13

I have a DataFrame, you can have it by running:

import pandas as pd
from io import StringIO

df = """  

               case_id    scheduled_date        status_code
               1213       2021-08               success
               3444       2021-06               fail
               4566       2021-07               unknown
               12213      2021-08               unknown
               34344      2021-06               fail
               44566      2021-07               unknown
               1213       2021-08               fail
              
        """
df= pd.read_csv(StringIO(df.strip()), sep='\s\s ', engine='python')

This outputs:

  case_id   scheduled_date  status_code
0   1213    2021-08         success
1   3444    2021-06         fail
2   4566    2021-07         unknown
3   12213   2021-08         unknown
4   34344   2021-06         fail
5   44566   2021-07         unknown
6   1213    2021-08         fail

How can I count the number of success, fail, and unknown of each month?

Output should look like:

scheduled_date  num of success  num of fail  num of unknown

2021-08           1               1           1
2021-06           0               2           0
2021-07           0               0           2

CodePudding user response:

Here is a proposition with pandas.crosstab :

out = (
        pd.crosstab(df["scheduled_date"], df["status_code"])
            .rename_axis(None, axis=1)
            .add_prefix("num of ")
            .sort_index(ascending=False)
            .reset_index()
        )

# Output :

print(out)

  scheduled_date  num of fail  num of succuss  num of unknown
0        2021-08            1               1               1
1        2021-07            0               0               2
2        2021-06            2               0               0

CodePudding user response:

You can use .pivot_table() to create counts for each (month, status code) pair, and then use .fillna to replace NaNs with zero counts:

df.pivot_table(index="scheduled_date", columns="status_code", aggfunc=len).fillna(0)

This outputs:

               case_id
status_code       fail success unknown
scheduled_date
2021-06            2.0     0.0     0.0
2021-07            0.0     0.0     2.0
2021-08            1.0     1.0     1.0
  • Related