I have a dataset with a column DAY
, and a column VALUE
. I would like to create a new column, VALUE_SUB
which is made by subtracting a rolling window mean with width N of VALUE
which rolls over DAY
from VALUE
. So, I have a dataframe like this:
DAY, VALUE
1 8
2 6
4 7
5 6
8 10
10 8
And I want to subtract with a rolling mean of width 3 to get:
DAY, VALUE, VALUE_SUB
1 8 1
2 6 -1
4 7 0.5
5 6 -0.5
8 10 1
9 8 0.33
10 5 -1.5
I can do this with the following code:
out = []
for i in range(len(list(df['DAY']))):
df1 = df[(df['DAY'] >= df['DAY'].iloc[i] - 1) & (df['DAY'] <= df['DAY'].iloc[i] 1)]
out = [df['VALUE'].iloc[i] - df1['VALUE'].mean()]
df['VALUE_SUB'] = out
But my dataframe is very large so this takes forever. I would like to do it without using a for loop at all, if possible. I know pandas has a rolling mean function that works for datetime objects or for rolling over a set number of rows, but my data does not have datetime objects nor is the data evenly sampled.
CodePudding user response:
You can change the DAY column to a datetime object with pd.to_datetim
e and set as the index to be able to use the rolling.mean
with a centered window. keep only the values (to avoid mismatch on the index) and substract from the VALUE column.
# input
df = pd.DataFrame({'DAY': [1, 2, 4, 5, 8, 9, 10], 'VALUE': [8, 6, 7, 6, 10, 8, 5]})
n_day = 3
df['VALUE_SUB'] = (
df['VALUE']
- df.set_index(pd.to_datetime(df['DAY'], unit='d'))
['VALUE'].rolling(f'{n_day}d', center=True).mean().to_numpy()
)
print(df)
# DAY VALUE VALUE_SUB
# 0 1 8 1.000000
# 1 2 6 -1.000000
# 2 4 7 0.500000
# 3 5 6 -0.500000
# 4 8 10 1.000000
# 5 9 8 0.333333
# 6 10 5 -1.500000