I have a DataFrame df, which you can create by running the following code:
import pandas as pd
from io import StringIO
df = """
month status_review supply_review case_id
2023-01-01 False False 12
2023-01-01 True True 33
2022-12-01 False True 45
2022-12-01 True True 45
2022-12-01 False False 44
"""
df = pd.read_csv(StringIO(df.strip()), sep=r'\s+', engine='python')
How can I count how many status_review and supply_review values are True in each month, and also the number of distinct cases in each month?
The output should look like the following:
month # of true status_review # of true supply_review # of case
2023-01-01 1 1 2
2022-12-01 1 2 2
I have tried both:
df.groupby("month").sum()
df.groupby('month').agg('sum')
But the output is:
status_review supply_review case_id
month
2022-12-01 1 2 134
2023-01-01 1 1 45
The case_id is not what I want. I want the distinct count of case_id. How can I achieve the desired output?
CodePudding user response:
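You can use .groupby() and .agg() with a dictionary that maps each column to its own aggregation, so the boolean columns are summed while case_id is counted by distinct values: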
df.groupby("month").agg({
    "status_review": "sum",
    "supply_review": "sum",
    "case_id": pd.Series.nunique
})
This outputs:
status_review supply_review case_id
month
2022-12-01 1 2 2
2023-01-01 1 1 2
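For anyone who wants to check this end to end, here is a minimal self-contained sketch. It builds the same data with pd.DataFrame directly (an assumption made here to sidestep the whitespace parsing in the question) and uses the string alias "nunique", which should behave the same as pd.Series.nunique in this case:

```python
import pandas as pd

# Same rows as in the question, constructed directly instead of parsed from text.
df = pd.DataFrame({
    "month": ["2023-01-01", "2023-01-01", "2022-12-01", "2022-12-01", "2022-12-01"],
    "status_review": [False, True, False, True, False],
    "supply_review": [False, True, True, True, False],
    "case_id": [12, 33, 45, 45, 44],
})

result = df.groupby("month").agg({
    "status_review": "sum",  # True is treated as 1, so the sum counts True values
    "supply_review": "sum",
    "case_id": "nunique",    # distinct case_ids, so the repeated 45 counts once
})
print(result)
```

Summing works for the two review columns because booleans are treated as 0 and 1, whereas summing case_id just adds up the ids, which is why the original attempt gave 134 and 45.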
CodePudding user response:
You can use named aggregation (pd.NamedAgg, or the equivalent (column, aggfunc) tuples shown here) to rename your columns like this:
df.groupby('month', as_index=False).agg(
    num_status_review=('status_review', 'sum'),
    num_supply_review=('supply_review', 'sum'),
    num_case_id=('case_id', 'nunique'),
)
Output:
month num_status_review num_supply_review num_case_id
0 2022-12-01 1 2 2
1 2023-01-01 1 1 2
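If you want the exact headers and row order from the question, one possible variation is sketched below. It spells out pd.NamedAgg, unpacks a dictionary so the output names can contain spaces and "#", and passes sort=False so that months stay in order of first appearance. The DataFrame is rebuilt by hand here, which is an assumption made only to keep the example self-contained:

```python
import pandas as pd

# Same rows as in the question, constructed directly instead of parsed from text.
df = pd.DataFrame({
    "month": ["2023-01-01", "2023-01-01", "2022-12-01", "2022-12-01", "2022-12-01"],
    "status_review": [False, True, False, True, False],
    "supply_review": [False, True, True, True, False],
    "case_id": [12, 33, 45, 45, 44],
})

# Output names that are not valid Python identifiers must be passed via ** unpacking.
aggregations = {
    "# of true status_review": pd.NamedAgg(column="status_review", aggfunc="sum"),
    "# of true supply_review": pd.NamedAgg(column="supply_review", aggfunc="sum"),
    "# of case": pd.NamedAgg(column="case_id", aggfunc="nunique"),
}

# sort=False keeps groups in order of first appearance instead of sorting the keys.
result = df.groupby("month", as_index=False, sort=False).agg(**aggregations)
print(result)
```

Without sort=False, groupby sorts the group keys, which is why 2022-12-01 comes first in the output above.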