Home > database >  Groupby two columns and take average categorical count in Python
Groupby two columns and take average categorical count in Python

Time:10-17

I would like to Groupby two columns and take average categorical count in Python

(We see that NY holds 3 apples, WA holds 4 cherries and TX holds 3 bbq for the date of 2/1/22 - now we take the average for those 3 containers.)

Data

CONTAINER   CLOUDS   DATE
NY          apple    2/1/2022
NY          apple    2/1/2022
NY          apple    2/1/2022
WA          cherries 2/1/2022
WA          cherries 2/1/2022
WA          cherries 2/1/2022
WA          cherries 2/1/2022
TX          bbq      2/1/2022
TX          bbq      2/1/2022
TX          bbq      2/1/2022
TX          bbq      3/1/2022
TX          bbq      3/1/2022

Desired

AVG CLOUDS PER CONTAINER:  ~3.333

Doing

df = df.groupby('CONTAINER', 'DATE') \
.agg({'CLOUDS':'size', 'CLOUDS':'mean'}) \
.reset_index()

Any suggestion is appreciated

CodePudding user response:

I'm a little unclear about what you want. If you are only interested in container on 2/1/2022:

# Measure container size on 2/1/2022 only
cond = df["DATE"] == "2/1/2022"
output = df[cond].groupby("CONTAINER").size()

# Take the average of all containers' sizes
output["AVG"] = output.mean()

CodePudding user response:

I think you can perform your first groupby on 'CONTAINER','DATE' and get the size of each category, then perform another groupby on the 'DATE' alone and calculate the mean:

df.groupby(['CONTAINER','DATE']).agg({'CLOUDS':'size'}).reset_index().groupby('DATE').agg({'CLOUDS':'mean'}).reset_index()

Result:

       DATE    CLOUDS
0  2/1/2022  3.333333
1  3/1/2022  2.000000
  • Related