I have a dataset which is a time series. It has several regions at once, here is a small example:
date confirmed deaths recovered region_code
0 2020-03-27 3.0 0.0 0.0 ARK
1 2020-03-27 4.0 0.0 0.0 BA
2 2020-03-27 1.0 0.0 0.0 BEL
..........................................................
71540 2022-07-19 164194.0 2830.0 160758.0 YAR
71541 2022-07-19 19170.0 555.0 18484.0 YEV
71542 2022-07-19 169603.0 2349.0 167075.0 ZAB
I have three columns for which I want to display information about how many new cases have been added in separate three columns:
date confirmed deaths recovered region_code daily_confirmed daily_deaths daily_recovered
0 2020-03-27 3.0 0.0 0.0 ARK 3.0 0.0 0.0
1 2020-03-27 4.0 0.0 0.0 BA 4.0 0.0 0.0
2 2020-03-27 1.0 0.0 0.0 BEL 1.0 0.0 0.0
..........................................................
71540 2022-07-19 164194.0 2830.0 160758.0 YAR 32.0 16.0 8.0
71541 2022-07-19 19170.0 555.0 18484.0 YEV 6.0 1.0 1.0
71542 2022-07-19 169603.0 2349.0 167075.0 ZAB 1.0 8.0 9.0
That is, for each region, you need to get the difference between the current date and the last day in order to understand how many new cases have occurred.
The problem is that I don't know how to do this process correctly. Since there are no missing dates in the data, you can use something like this: df['daily_cases'] = df['confirmed'] - df['confirmed'].shift(fill_value=0)
. But there are many different regions here, that is, first you need to filter everything correctly somehow ... Any ideas how to do this?
CodePudding user response:
Use DataFrameGroupBy.diff
with replace first missing values by original columns add prefix to columns and cast to inetegers if necessary:
print (df)
date confirmed deaths recovered region_code
0 2020-03-27 3.0 0.0 0.0 ARK
1 2020-03-27 4.0 0.0 0.0 BA
2 2020-03-27 1.0 0.0 0.0 BEL
3 2020-03-28 4.0 0.0 4.0 ARK
4 2020-03-28 6.0 0.0 0.0 BA
5 2020-03-28 1.0 0.0 0.0 BEL
6 2020-03-29 6.0 0.0 10.0 ARK
7 2020-03-29 8.0 0.0 0.0 BA
8 2020-03-29 5.0 0.0 0.0 BEL
cols = ['confirmed','deaths','recovered']
df1 = (df.groupby(['region_code'])[cols]
.diff()
.fillna(df[cols])
.add_prefix('daily_')
.astype(int))
print (df1)
daily_confirmed daily_deaths daily_recovered
0 3 0 0
1 4 0 0
2 1 0 0
3 1 0 4
4 2 0 0
5 0 0 0
6 2 0 6
7 2 0 0
8 4 0 0
Last append to original:
df = df.join(df1)
print (df)