I am working with lags for a time series model. I want to automate the creation of lags, which I already did for the training set.

    for i in range(1, n + 1):
        column_name = 'lag_q{}'.format(i)
        df_train[column_name] = df_train.groupby(by=['strain', 'sex', 'genotype'],
                                                 dropna=False)['quantity'].shift(i)

However, for the validation set, I want only the first row to use the actual quantities; the later rows should use predictions. So I need to fill the validation df with the known values and leave blank cells that will later be filled with the forecasts.
These are the quantity values I have for the rows before the ones I want to fill.
quantity |
---|
26450 |
24707 |
25369 |
25193 |
27250 |
And this is the df I want back:
lag_q1 | lag_q2 | lag_q3 | lag_q4 | lag_q5 |
---|---|---|---|---|
27250 | 25193 | 25369 | 24707 | 26450 |
27250 | 25193 | 25369 | 24707 | |
27250 | 25193 | 25369 | | |
27250 | 25193 | | | |
27250 | | | | |
I tried some for loops, but I only managed to fill the first row:

    for i in range(1, n + 1):
        column_name = 'lag_q{}'.format(i)
        lags_cols.append(column_name)
        df_val[column_name] = ''
        df_val.loc[0, column_name] = df_train.iloc[-i]['quantity']

CodePudding user response:
You could use numpy:
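
    import numpy as np
    import pandas as pd

    N = len(df)
    q = df['quantity'].to_numpy()
    a = np.arange(N)
    # a[:, None] - a - 1 is negative on and above the diagonal, so q is indexed
    # from its end there; the outer np.triu zeroes everything below the diagonal
    out = pd.DataFrame(np.triu(q[np.triu(a[:, None] - a - 1)]),
                       columns=[f'lag_q{i+1}' for i in range(N)])
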
output:

       lag_q1  lag_q2  lag_q3  lag_q4  lag_q5
    0   27250   25193   25369   24707   26450
    1       0   27250   25193   25369   24707
    2       0       0   27250   25193   25369
    3       0       0       0   27250   25193
    4       0       0       0       0   27250
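
If the cells that still await a forecast should be truly blank rather than `0` (so that a genuine quantity of zero cannot be mistaken for an unfilled cell), a variant along these lines might work. This is only a sketch: the sample `df` is rebuilt from the five quantities in the question, and the names `idx`, `known` and `lags` are introduced here for illustration. It keeps the same layout as the output above, with `NaN` in place of the zeros.

```python
import numpy as np
import pandas as pd

# Assumed sample data: the five known quantities listed in the question.
df = pd.DataFrame({'quantity': [26450, 24707, 25369, 25193, 27250]})

N = len(df)
q = df['quantity'].to_numpy(dtype=float)  # float dtype so NaN can be stored
a = np.arange(N)

# idx[i, j] = j - i: position, counted back from the most recent known value,
# that lag j+1 of validation row i refers to. Negative means "not known yet".
idx = a - a[:, None]
known = idx >= 0

# Read the known values most-recent-first and leave NaN where a forecast is needed.
lags = np.where(known, q[::-1][np.clip(idx, 0, N - 1)], np.nan)
out = pd.DataFrame(lags, columns=[f'lag_q{i+1}' for i in range(N)])
print(out)
```

The `NaN` cells can then be located with `out.isna()` when the forecasts become available.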