I'm trying to build a model for timeseries analysis and I want to create some features which contain values of previous rows.
I have a dataset as follows:
item_id | shop_id | items_sold | date_block_id |
---|---|---|---|
1 | 1 | 5 | 0 |
1 | 2 | 10 | 0 |
1 | 1 | 7 | 1 |
2 | 2 | 8 | 1 |
2 | 1 | 5 | 1 |
And I want to add a column containing the amount of items sold for each item/shop combination, from the previous date block. So something like:
item_id | shop_id | items_sold | date_block_id | lag1_items_sold |
---|---|---|---|---|
1 | 1 | 5 | 0 | 0 |
1 | 2 | 10 | 0 | 0 |
1 | 1 | 7 | 1 | 5 |
2 | 2 | 8 | 1 | 0 |
2 | 1 | 5 | 1 | 0 |
I think what makes this hard is the fact that if no sale is made within a date block, no entry of that item/shop combination is present in the table. I have tried al sorts of combinations of grouping and shifting, but I can't seem to get the result I want.
CodePudding user response:
Look at df.shift
combined with group by. Docs for shift here.
Older answer with code here. Try searching through SO next time and you might find the answer!
CodePudding user response:
Try:
df["lag1_items_sold"] = (
df.groupby(["item_id", "shop_id"])["items_sold"].shift().fillna(0)
)
print(df)
Prints:
item_id shop_id items_sold date_block_id lag1_items_sold
0 1 1 5 0 0.0
1 1 2 10 0 0.0
2 1 1 7 1 5.0
3 2 2 8 1 0.0
4 2 1 5 1 0.0