I am working with a COVID-19 dataset that looks as follows:
Date | City | City ID | State | Estimated Population | Estimated Population_2019 | Confirmed Rate | Death Rate | New Confirmed | New Deaths |
---|---|---|---|---|---|---|---|---|---|
2020-03-17 | Rio Branco | 10002 | AC | 413418 | 407319 | 0.72566 | 0.01 | 3 | 0 |
2020-03-17 | Manaus | 12330 | AM | 555550 | 555964 | 0.65433 | 0.005 | 5 | 3 |
The date is my index. I have multiple cities with equal dates as seen.
Given that I have daily datapoints, I am trying to resample my data such that I have monthly points. I have tried using the resample command but I am having trouble because I want some of my columns to be the same and sum and some to be the mean. More specifically:
- City,City ID, State: Will remain the same as they are IDs
- Estimated Population and Estimated_population: I would like to take the mean for each of these columns, and these will be the new monthly values
- Confirmed Rate and Death: I would like to take the monthly mean of these and have these values to be my monthly values and I would like to create new columns giving the monthly standard deviation for my confirmed rate and death rate.
- For New Confirmed and New Deaths: I would like to add these values and have my monthly point to be the sum of new cases and deaths, on two separate columns.
How can I go about making a code that is able to differentiate which columns to add, which to take the mean, and how can I create two new columns for the standard deviations of Confirmed and Death Rates?
CodePudding user response:
You should explore a combination of groupby with .agg. Something like this should work
df_grouped=df.groupby([df.index.month,'City ID']).agg({'Estimated Population':'mean','Estimated Population_2019':'mean','Confirmed Rate':['mean','std'],'Death Rate':['mean','std'],'New Confirmed':'sum','New Deaths':'sum'})
df_grouped.index.rename(['Month','City ID'],inplace=True)