Home > database >  Column-wise rolling mean in pandas
Column-wise rolling mean in pandas

Time:05-27

I am working on a large dataset (in wide format) with millions of rows and 15 columns (time-series features).

I want to calculate the moving average for each observation. I can do that with the code block below. However, it requires converting the data frame into a long format. However, I think that this (converting wide to long and vice versa) will be inefficient as the dataset grows. Also, I need to deal with .groupby().

Can I calculate rolling mean in a column-wise manner?

Please find the minimum reproducible example of the current methodology attached below.

import pandas as pd
df = pd.DataFrame({'id': range(3), 
                   'Date_1': range(3, 6), 
                   'Date_2': range(4, 7), 
                   'Date_3': range(5, 8),
                   'Date_4': range(6, 9),
                   'Date_5': range(11, 14)})
df
   id  Date_1  Date_2  Date_3  Date_4  Date_5
0   0       3       4       5       6      11
1   1       4       5       6       7      12
2   2       5       6       7       8      13
df = pd.melt(df, id_vars= "id")
df = df.sort_values(["id", "variable"])
df
    id variable  value
0    0   Date_1      3
3    0   Date_2      4
6    0   Date_3      5
9    0   Date_4      6
12   0   Date_5     11
1    1   Date_1      4
4    1   Date_2      5
7    1   Date_3      6
10   1   Date_4      7
13   1   Date_5     12
2    2   Date_1      5
5    2   Date_2      6
8    2   Date_3      7
11   2   Date_4      8
14   2   Date_5     13
df["ma"] = df.groupby("id")["value"].rolling(3).mean().values
    id variable  value        ma
0    0   Date_1      3       NaN
3    0   Date_2      4       NaN
6    0   Date_3      5  4.000000
9    0   Date_4      6  5.000000
12   0   Date_5     11  7.333333
1    1   Date_1      4       NaN
4    1   Date_2      5       NaN
7    1   Date_3      6  5.000000
10   1   Date_4      7  6.000000
13   1   Date_5     12  8.333333
2    2   Date_1      5       NaN
5    2   Date_2      6       NaN
8    2   Date_3      7  6.000000
11   2   Date_4      8  7.000000
14   2   Date_5     13  9.333333

The desired output is below. The NaN columns can be dropped but that is not significant for now.

df[["id", "variable", "ma"]].pivot("id", columns = "variable", 
values = "ma")
variable  Date_1  Date_2  Date_3  Date_4    Date_5
id                                                
0            NaN     NaN     4.0     5.0  7.333333
1            NaN     NaN     5.0     6.0  8.333333
2            NaN     NaN     6.0     7.0  9.333333

Edit: Thank you @Mortz and @PanagiotisKanavo for prompts Thank you

CodePudding user response:

The rolling method takes in an axis parameter, which you can set to 1 -

import pandas as pd
df = pd.DataFrame({'id': range(3), 
                   'Date_1': range(3, 6), 
                   'Date_2': range(4, 7), 
                   'Date_3': range(5, 8),
                   'Date_4': range(6, 9),
                   'Date_5': range(11, 14)})

df = df.set_index('id')
df.rolling(3, axis=1).mean()
    Date_1  Date_2  Date_3  Date_4    Date_5
id                                          
0      NaN     NaN     4.0     5.0  7.333333
1      NaN     NaN     5.0     6.0  8.333333
2      NaN     NaN     6.0     7.0  9.333333

  • Related