How do I add a column which 'differences out' a cumulative variable by category?-CodePudding

My first time posting so bear with me. I have a COVID dataset that looks like this:

date        | county | confirmed
2021-05-01    Bexar     1200
2021-05-01    Travis    1500
2021-05-01    Harris    1300
2021-05-02    Bexar     1250
2021-05-02    Travis    1550
2021-05-02    Harris    1350

Where the 'confirmed' column is cumulative.

In reality it's a much bigger dataset (several dates and over 200 counties). I want to add a column to the dataset which gives the difference (new cases) each day, by the county. So that it ends up like:

date        | county | confirmed  | new_cases 
2021-05-01    Bexar     1200         N/A
2021-05-01    Travis    1500         N/A
2021-05-01    Harris    1300         N/A
2021-05-02    Bexar     1250         50
2021-05-02    Travis    1530         30
2021-05-02    Harris    1340         40

I've tried figuring out how to loop the df.diff() command over county, and adding the result to the df each time. But I'm so new to Python that I can't figure it out.

CodePudding user response：

Assuming you are working with pandas:

df = df.sort_values(by=['date'])

df['diff'] = df.groupby(['county'])['confirmed'].diff().fillna(0)