Vectorized way to group a time series by hour of day, on a rolling window, and assign the rolling mean


We have a time series with a values column in addition to the date column, and for every day there are 24 rows (one per hour of the day). The goal is to create an additional column containing the mean of the values for a specific hour of the day over a rolling window. For example, with a rolling window of 365*24, the new column would convey: "over the last year, the values column at this hour of the day had a mean of X".

The input and output should look something like this for a rolling window of 24:

index  date (as datetime64)       values  new_column  reasoning

0      2021-02-21 00:00:00+00:00    100   -
1      2021-02-21 01:00:00+00:00    200   -
2      2021-02-21 02:00:00+00:00    300   -
3      2021-02-21 03:00:00+00:00    400   -
4      2021-02-21 04:00:00+00:00    500   -
5      2021-02-21 05:00:00+00:00    600   -
6      2021-02-21 06:00:00+00:00    700   -
7      2021-02-21 07:00:00+00:00    800   -
8      2021-02-21 08:00:00+00:00    900   -
9      2021-02-21 09:00:00+00:00   1000   -
10     2021-02-21 10:00:00+00:00   1100   -
11     2021-02-21 11:00:00+00:00   1200   -
12     2021-02-21 12:00:00+00:00   1300   -
13     2021-02-21 13:00:00+00:00   1400   -
14     2021-02-21 14:00:00+00:00   1500   -
15     2021-02-21 15:00:00+00:00   1600   -
16     2021-02-21 16:00:00+00:00   1700   -
17     2021-02-21 17:00:00+00:00   1800   -
18     2021-02-21 18:00:00+00:00   1900   -
19     2021-02-21 19:00:00+00:00   2000   -
20     2021-02-21 20:00:00+00:00   2100   -
21     2021-02-21 21:00:00+00:00   2200   -
22     2021-02-21 22:00:00+00:00   2300   -
23     2021-02-21 23:00:00+00:00   2400   -
24     2021-02-22 00:00:00+00:00    200   150         (i.e. the mean of 100 and 200, the values at 00:00 for the 2 days)
25     2021-02-22 01:00:00+00:00    400   300         (i.e. the mean of 200 and 400)
26     2021-02-22 02:00:00+00:00    600   450         (i.e. the mean of 300 and 600)
27     2021-02-22 03:00:00+00:00    800   etc.        etc.
28     2021-02-22 04:00:00+00:00   1000
29     2021-02-22 05:00:00+00:00   1200
30     2021-02-22 06:00:00+00:00   1400
31     2021-02-22 07:00:00+00:00   1600
32     2021-02-22 08:00:00+00:00   1800
33     2021-02-22 09:00:00+00:00   2000
34     2021-02-22 10:00:00+00:00   2200
35     2021-02-22 11:00:00+00:00   2400
36     2021-02-22 12:00:00+00:00   2600
37     2021-02-22 13:00:00+00:00   2800
38     2021-02-22 14:00:00+00:00   3000
39     2021-02-22 15:00:00+00:00   3200
40     2021-02-22 16:00:00+00:00   3400
41     2021-02-22 17:00:00+00:00   3600
42     2021-02-22 18:00:00+00:00   3800
43     2021-02-22 19:00:00+00:00   4000
44     2021-02-22 20:00:00+00:00   4200
45     2021-02-22 21:00:00+00:00   4400
46     2021-02-22 22:00:00+00:00   4600
47     2021-02-22 23:00:00+00:00   4800
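
For reference, a minimal sketch that reconstructs the input above (the values are inferred from the table: steps of 100 on the first day, steps of 200 on the second; the UTC timezone is an assumption):

import numpy as np
import pandas as pd

# Rebuild the 48-row example: 24 hourly rows per day for two days.
dates = pd.date_range('2021-02-21', periods=48, freq='h', tz='UTC')
values = np.concatenate([np.arange(100, 2401, 100),   # day 1: 100..2400
                         np.arange(200, 4801, 200)])  # day 2: 200..4800
df = pd.DataFrame({'date': dates, 'values': values})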

Before even getting to creating a new column, I thought of trying a groupby:

means = df.groupby(df['date'].dt.hour)['values'].mean()

However, that has the issue of returning the mean per hour for the entire dataframe rather than over a rolling window, while a groupby(df['date'].dt.hour)['values'].rolling(X).mean() only considers a window of the final grouped results, not a window of the initial data before grouping. I also don't know how to assign the result to a new column of the initial dataframe, since the groupby() result does not have the same index.
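
To illustrate the index mismatch (a minimal sketch, assuming the example dataframe above): the grouped rolling result comes back with a two-level index of (hour, original row index) rather than the plain row index of df:

# The grouper level is named 'date' because it was derived from df['date'].
rolled = df.groupby(df['date'].dt.hour)['values'].rolling(2).mean()
print(rolled.index.nlevels)  # 2: the hour level plus the original row index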

CodePudding user response:

Use droplevel after groupby / rolling to drop the level created by groupby:

# Or .droplevel('date')
df['mean'] = df.groupby(df['date'].dt.hour).rolling(2)['values'].mean().droplevel(0)
print(df)

# Output
                        date  values    mean
0  2021-02-21 00:00:00+00:00     100     NaN
1  2021-02-21 01:00:00+00:00     200     NaN
2  2021-02-21 02:00:00+00:00     300     NaN
3  2021-02-21 03:00:00+00:00     400     NaN
4  2021-02-21 04:00:00+00:00     500     NaN
5  2021-02-21 05:00:00+00:00     600     NaN
6  2021-02-21 06:00:00+00:00     700     NaN
7  2021-02-21 07:00:00+00:00     800     NaN
8  2021-02-21 08:00:00+00:00     900     NaN
9  2021-02-21 09:00:00+00:00    1000     NaN
10 2021-02-21 10:00:00+00:00    1100     NaN
11 2021-02-21 11:00:00+00:00    1200     NaN
12 2021-02-21 12:00:00+00:00    1300     NaN
13 2021-02-21 13:00:00+00:00    1400     NaN
14 2021-02-21 14:00:00+00:00    1500     NaN
15 2021-02-21 15:00:00+00:00    1600     NaN
16 2021-02-21 16:00:00+00:00    1700     NaN
17 2021-02-21 17:00:00+00:00    1800     NaN
18 2021-02-21 18:00:00+00:00    1900     NaN
19 2021-02-21 19:00:00+00:00    2000     NaN
20 2021-02-21 20:00:00+00:00    2100     NaN
21 2021-02-21 21:00:00+00:00    2200     NaN
22 2021-02-21 22:00:00+00:00    2300     NaN
23 2021-02-21 23:00:00+00:00    2400     NaN
24 2021-02-22 00:00:00+00:00     200   150.0
25 2021-02-22 01:00:00+00:00     400   300.0
26 2021-02-22 02:00:00+00:00     600   450.0
27 2021-02-22 03:00:00+00:00     800   600.0
28 2021-02-22 04:00:00+00:00    1000   750.0
29 2021-02-22 05:00:00+00:00    1200   900.0
30 2021-02-22 06:00:00+00:00    1400  1050.0
31 2021-02-22 07:00:00+00:00    1600  1200.0
32 2021-02-22 08:00:00+00:00    1800  1350.0
33 2021-02-22 09:00:00+00:00    2000  1500.0
34 2021-02-22 10:00:00+00:00    2200  1650.0
35 2021-02-22 11:00:00+00:00    2400  1800.0
36 2021-02-22 12:00:00+00:00    2600  1950.0
37 2021-02-22 13:00:00+00:00    2800  2100.0
38 2021-02-22 14:00:00+00:00    3000  2250.0
39 2021-02-22 15:00:00+00:00    3200  2400.0
40 2021-02-22 16:00:00+00:00    3400  2550.0
41 2021-02-22 17:00:00+00:00    3600  2700.0
42 2021-02-22 18:00:00+00:00    3800  2850.0
43 2021-02-22 19:00:00+00:00    4000  3000.0
44 2021-02-22 20:00:00+00:00    4200  3150.0
45 2021-02-22 21:00:00+00:00    4400  3300.0
46 2021-02-22 22:00:00+00:00    4600  3450.0
47 2021-02-22 23:00:00+00:00    4800  3600.0
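
The same pattern extends to the 365*24 window from the question: each hour-of-day group holds one row per day, so a 365-row window inside each group covers the last year. A sketch (the column name mean_365d and the min_periods choice are illustrative assumptions, not from the original post):

# One-year rolling mean per hour of day; min_periods=1 yields partial
# means over the first year instead of NaN.
df['mean_365d'] = (
    df.groupby(df['date'].dt.hour)['values']
      .rolling(365, min_periods=1)
      .mean()
      .droplevel(0)
)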