I am working on a large dataset (in wide format) with millions of rows and 15 columns (time-series features).
I want to calculate the moving average for each observation. I can do that with the code block below. However, it requires converting the data frame into a long format. However, I think that this (converting wide to long and vice versa) will be inefficient as the dataset grows. Also, I need to deal with .groupby()
.
Can I calculate rolling mean in a column-wise manner?
Please find the minimum reproducible example of the current methodology attached below.
import pandas as pd
df = pd.DataFrame({'id': range(3),
'Date_1': range(3, 6),
'Date_2': range(4, 7),
'Date_3': range(5, 8),
'Date_4': range(6, 9),
'Date_5': range(11, 14)})
df
id Date_1 Date_2 Date_3 Date_4 Date_5
0 0 3 4 5 6 11
1 1 4 5 6 7 12
2 2 5 6 7 8 13
df = pd.melt(df, id_vars= "id")
df = df.sort_values(["id", "variable"])
df
id variable value
0 0 Date_1 3
3 0 Date_2 4
6 0 Date_3 5
9 0 Date_4 6
12 0 Date_5 11
1 1 Date_1 4
4 1 Date_2 5
7 1 Date_3 6
10 1 Date_4 7
13 1 Date_5 12
2 2 Date_1 5
5 2 Date_2 6
8 2 Date_3 7
11 2 Date_4 8
14 2 Date_5 13
df["ma"] = df.groupby("id")["value"].rolling(3).mean().values
id variable value ma
0 0 Date_1 3 NaN
3 0 Date_2 4 NaN
6 0 Date_3 5 4.000000
9 0 Date_4 6 5.000000
12 0 Date_5 11 7.333333
1 1 Date_1 4 NaN
4 1 Date_2 5 NaN
7 1 Date_3 6 5.000000
10 1 Date_4 7 6.000000
13 1 Date_5 12 8.333333
2 2 Date_1 5 NaN
5 2 Date_2 6 NaN
8 2 Date_3 7 6.000000
11 2 Date_4 8 7.000000
14 2 Date_5 13 9.333333
The desired output is below. The NaN columns can be dropped but that is not significant for now.
df[["id", "variable", "ma"]].pivot("id", columns = "variable",
values = "ma")
variable Date_1 Date_2 Date_3 Date_4 Date_5
id
0 NaN NaN 4.0 5.0 7.333333
1 NaN NaN 5.0 6.0 8.333333
2 NaN NaN 6.0 7.0 9.333333
Edit: Thank you @Mortz and @PanagiotisKanavo for prompts Thank you
CodePudding user response:
The rolling
method takes in an axis
parameter, which you can set to 1 -
import pandas as pd
df = pd.DataFrame({'id': range(3),
'Date_1': range(3, 6),
'Date_2': range(4, 7),
'Date_3': range(5, 8),
'Date_4': range(6, 9),
'Date_5': range(11, 14)})
df = df.set_index('id')
df.rolling(3, axis=1).mean()
Date_1 Date_2 Date_3 Date_4 Date_5
id
0 NaN NaN 4.0 5.0 7.333333
1 NaN NaN 5.0 6.0 8.333333
2 NaN NaN 6.0 7.0 9.333333