Convert the data from summary to daily time series data (pandas)

I have a time series dataset that covers several regions at once. Here is a small example:

        date        confirmed   deaths  recovered  region_code  
0       2020-03-27  3.0         0.0     0.0        ARK  
1       2020-03-27  4.0         0.0     0.0        BA   
2       2020-03-27  1.0         0.0     0.0        BEL
..........................................................
71540   2022-07-19  164194.0    2830.0  160758.0   YAR  
71541   2022-07-19  19170.0     555.0   18484.0    YEV  
71542   2022-07-19  169603.0    2349.0  167075.0   ZAB

For the three columns confirmed, deaths and recovered, I want to add three new columns showing how many new cases were added each day:

        date        confirmed   deaths  recovered  region_code  daily_confirmed daily_deaths daily_recovered
0       2020-03-27  3.0         0.0     0.0        ARK          3.0             0.0          0.0 
1       2020-03-27  4.0         0.0     0.0        BA           4.0             0.0          0.0
2       2020-03-27  1.0         0.0     0.0        BEL          1.0             0.0          0.0
..........................................................
71540   2022-07-19  164194.0    2830.0  160758.0   YAR          32.0            16.0         8.0 
71541   2022-07-19  19170.0     555.0   18484.0    YEV          6.0             1.0          1.0
71542   2022-07-19  169603.0    2349.0  167075.0   ZAB          1.0             8.0          9.0

That is, for each region, I need the difference between the current day's value and the previous day's value to see how many new cases occurred.

The problem is that I don't know how to do this correctly. Since there are no missing dates in the data, for a single region I could use something like df['daily_cases'] = df['confirmed'] - df['confirmed'].shift(fill_value=0). But there are many different regions here, so the rows first have to be separated by region somehow. Any ideas how to do this?

CodePudding user response:

Use DataFrameGroupBy.diff, replace the missing values in each region's first row with the original column values, add a prefix to the column names, and cast to integers if necessary:

print (df)
         date  confirmed  deaths  recovered region_code
0  2020-03-27        3.0     0.0        0.0         ARK
1  2020-03-27        4.0     0.0        0.0          BA
2  2020-03-27        1.0     0.0        0.0         BEL
3  2020-03-28        4.0     0.0        4.0         ARK
4  2020-03-28        6.0     0.0        0.0          BA
5  2020-03-28        1.0     0.0        0.0         BEL
6  2020-03-29        6.0     0.0       10.0         ARK
7  2020-03-29        8.0     0.0        0.0          BA
8  2020-03-29        5.0     0.0        0.0         BEL

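cols = ['confirmed','deaths','recovered']
# Per-region difference from the previous row (rows must be in date order).
df1 = (df.groupby(['region_code'])[cols]
         .diff()
         # The first row of each region has no previous row: keep its own value.
         .fillna(df[cols])
         .add_prefix('daily_')
         .astype(int))
print (df1)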
   daily_confirmed  daily_deaths  daily_recovered
0                3             0                0
1                4             0                0
2                1             0                0
3                1             0                4
4                2             0                0
5                0             0                0
6                2             0                6
7                2             0                0
8                4             0                0
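
One way to sanity-check the result is to sum the daily values back up per region and compare them with the original cumulative columns. The sketch below rebuilds the sample frame from the answer and assumes, as the question states, that the data has no missing values; the names rebuilt and totals_match are only illustrative.

```python
import pandas as pd

# Sample data copied from the answer (cumulative totals per region).
df = pd.DataFrame({
    "date": ["2020-03-27"] * 3 + ["2020-03-28"] * 3 + ["2020-03-29"] * 3,
    "confirmed": [3.0, 4.0, 1.0, 4.0, 6.0, 1.0, 6.0, 8.0, 5.0],
    "deaths": [0.0] * 9,
    "recovered": [0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 10.0, 0.0, 0.0],
    "region_code": ["ARK", "BA", "BEL"] * 3,
})

cols = ["confirmed", "deaths", "recovered"]
df1 = (df.groupby("region_code")[cols]
         .diff()
         .fillna(df[cols])
         .add_prefix("daily_")
         .astype(int))

# Rebuild the running totals from the daily values, region by region.
rebuilt = df1.groupby(df["region_code"]).cumsum()
rebuilt.columns = cols

# True when every rebuilt total matches the original cumulative column.
totals_match = bool(rebuilt.eq(df[cols].astype(int)).all().all())
print(totals_match)
```

If the check fails on real data, the rows are probably not in date order within each region, or the cumulative columns contain gaps.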

Finally, join the result to the original DataFrame:

df = df.join(df1)
print (df)
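
Because diff() works on row order, the approach above relies on each region's rows already being sorted by date. If that is not guaranteed, a possible variation is to sort before taking the difference and let join realign the result on the index. This is only a sketch: add_daily_columns is a hypothetical helper name, the dates are assumed to be ISO-formatted strings (or datetimes) so that sorting them is chronological, and the data is assumed to contain no missing values.

```python
import pandas as pd

def add_daily_columns(df, cols, group_col="region_code", date_col="date"):
    """Return df with daily_* columns derived from the cumulative cols."""
    # Sort so each region's rows are in date order; diff() uses row order.
    ordered = df.sort_values([group_col, date_col])
    daily = ordered.groupby(group_col)[cols].diff()
    # The first row of each region has no previous day: keep its own total.
    daily = daily.fillna(ordered[cols]).add_prefix("daily_").astype(int)
    # join aligns on the index, so df keeps its original row order.
    return df.join(daily)

# Sample rows from the answer, deliberately shuffled.
sample = pd.DataFrame({
    "date": ["2020-03-27"] * 3 + ["2020-03-28"] * 3 + ["2020-03-29"] * 3,
    "confirmed": [3.0, 4.0, 1.0, 4.0, 6.0, 1.0, 6.0, 8.0, 5.0],
    "deaths": [0.0] * 9,
    "recovered": [0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 10.0, 0.0, 0.0],
    "region_code": ["ARK", "BA", "BEL"] * 3,
})
shuffled = sample.sample(frac=1, random_state=0)

result = add_daily_columns(shuffled, ["confirmed", "deaths", "recovered"])
print(result.sort_index())
```

Sorting a copy and joining on the index keeps the caller's row order intact, at the cost of one extra sort.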