Home > OS >  pandas how to count boolean column value and the distinct count of other columns at the same time
pandas how to count boolean column value and the distinct count of other columns at the same time

Time:01-19

I have a Dataframe df, you can have it by running the following code:

import pandas as pd
from io import StringIO
    
df = """  
    month       status_review             supply_review  case_id
2023-01-01           False                    False      12             
2023-01-01           True                     True       33           
2022-12-01           False                    True       45           
2022-12-01           True                     True       45         
2022-12-01           False                    False      44
    """
df= pd.read_csv(StringIO(df.strip()), sep='\s\s ', engine='python')

How can I count how many status_reviews and supply_review are True in each month and also the number of case in each month?

The output should looks like the following:

    month       # of true status_review      # of true supply_review  # of case
2023-01-01           1                         1                        2
2022-12-01           1                         2                        2

I have tried both:

df.groupby("month").sum()
df.groupby('month').agg('sum')

But the output is:

           status_review    supply_review   case_id
month           
2022-12-01             1    2               134
2023-01-01             1    1               45

The case_id is not what I want. I want the distinct count of case_id. How can I achieve the desired output?

CodePudding user response:

You can use .groupby() and .agg():

df.groupby("month").agg({
    "status_review": "sum",
    "supply_review": "sum",
    "case_id": pd.Series.nunique
})

This outputs:

            status_review  supply_review  case_id
month
2022-12-01              1              2        2
2023-01-01              1              1        2

CodePudding user response:

You can use pd.NamedAgg to rename your columns like this:

df.groupby('month', as_index=False).agg(num_status_review=('status_review', 'sum'),
                                        num_supply_review=('supply_review', 'sum'),
                                        num_case_id=('case_id', 'nunique'))

Output:

        month  num_status_review  num_supply_review  num_case_id
0  2022-12-01                  1                  2            2
1  2023-01-01                  1                  1            2
  • Related