I would like to Groupby two columns and take average categorical count in Python
(We see that NY holds 3 apples, WA holds 4 cherries and TX holds 3 bbq for the date of 2/1/22 - now we take the average for those 3 containers.)
Data
CONTAINER CLOUDS DATE
NY apple 2/1/2022
NY apple 2/1/2022
NY apple 2/1/2022
WA cherries 2/1/2022
WA cherries 2/1/2022
WA cherries 2/1/2022
WA cherries 2/1/2022
TX bbq 2/1/2022
TX bbq 2/1/2022
TX bbq 2/1/2022
TX bbq 3/1/2022
TX bbq 3/1/2022
Desired
AVG CLOUDS PER CONTAINER: ~3.333
Doing
df = df.groupby('CONTAINER', 'DATE') \
.agg({'CLOUDS':'size', 'CLOUDS':'mean'}) \
.reset_index()
Any suggestion is appreciated
CodePudding user response:
I'm a little unclear about what you want. If you are only interested in container on 2/1/2022:
# Measure container size on 2/1/2022 only
cond = df["DATE"] == "2/1/2022"
output = df[cond].groupby("CONTAINER").size()
# Take the average of all containers' sizes
output["AVG"] = output.mean()
CodePudding user response:
I think you can perform your first groupby on 'CONTAINER','DATE'
and get the size
of each category, then perform another groupby on the 'DATE'
alone and calculate the mean:
df.groupby(['CONTAINER','DATE']).agg({'CLOUDS':'size'}).reset_index().groupby('DATE').agg({'CLOUDS':'mean'}).reset_index()
Result:
DATE CLOUDS
0 2/1/2022 3.333333
1 3/1/2022 2.000000