My first time posting so bear with me. I have a COVID dataset that looks like this:
date | county | confirmed
2021-05-01 Bexar 1200
2021-05-01 Travis 1500
2021-05-01 Harris 1300
2021-05-02 Bexar 1250
2021-05-02 Travis 1550
2021-05-02 Harris 1350
Where the 'confirmed' column is cumulative.
In reality it's a much bigger dataset (several dates and over 200 counties). I want to add a column to the dataset which gives the difference (new cases) each day, by the county. So that it ends up like:
date | county | confirmed | new_cases
2021-05-01 Bexar 1200 N/A
2021-05-01 Travis 1500 N/A
2021-05-01 Harris 1300 N/A
2021-05-02 Bexar 1250 50
2021-05-02 Travis 1530 30
2021-05-02 Harris 1340 40
I've tried figuring out how to loop the df.diff() command over county, and adding the result to the df each time. But I'm so new to Python that I can't figure it out.
CodePudding user response:
Assuming you are working with pandas
:
df = df.sort_values(by=['date'])
df['diff'] = df.groupby(['county'])['confirmed'].diff().fillna(0)