My reproducible data is ;
mydf <- structure(list(product = c("4689", "4695", "513377", "604018",
"4693", "513376", "4706", "4691", "4691", "1212", "601606", "4755",
"502659", "4679", "9934"), year = c(2018, 2018, 2018, 2018, 2019,
2019, 2019, 2019, 2019, 2019, 2021, 2021, 2021, 2021, 2021),
weeks = c(1, 2, 4, 5, 6, 7, 8, 9, 10, 8, 11, 12, 13, 14,
15), sales = c(18L, 13L, 16L, 10L, 11L, 16L, 20L, 11L, 20L,
12L, 10L, 14L, 14L, 19L, 15L)), row.names = c(NA, -15L), class = c("data.table",
"data.frame"))
I want to calculate mean of sales for three weeks before, and followed this way ;
mydf[,lookup_week := weeks - 3]
lags <- mydf[,.(lag = mean(sales)),by = weeks]
joint_table <- merge(mydf,lags,by.x = 'lookup_week',by.y = 'weeks',all.x = T)
and it returns ;
lookup_week product year weeks sales lag
1: -2 4689 2018 1 18 NA
2: -1 4695 2018 2 13 NA
3: 1 513377 2018 4 16 18
4: 2 604018 2018 5 10 13
5: 3 4693 2019 6 11 NA
6: 4 513376 2019 7 16 16
7: 5 4706 2019 8 20 10
8: 5 1212 2019 8 12 10
9: 6 4691 2019 9 11 11
10: 7 4691 2019 10 20 16
11: 8 601606 2021 11 10 16
12: 9 4755 2021 12 14 11
13: 10 502659 2021 13 14 20
14: 11 4679 2021 14 19 10
15: 12 9934 2021 15 15 14
Here are the issues:
For row 5, I need to check where weeks
equals to 3. But it doesn't exist, I need to go closest week before 3. it should be 2 in this example. if it wouldn't exist as well, I had to go to where weeks
equals to 1.
The other issue is that, there should be at most 1 year between the year of the observation that the lag came from and the year of the observation from which I calculated the lag. So if I want to restrict going back as much as possible, it should be able to go back at most one year to calculate lag.
How can I do this ?
dplyr
solutions are also welcome.
Thanks in advance.
CodePudding user response:
I want to calculate mean of sales for three weeks before
But it doesn't exist, I need to go closest week before 3. it should be 2 in this example. if it wouldn't exist as well, I had to go to where weeks equals to 1.
So if I want to restrict going back as much as possible, it should be able to go back at most one year to calculate lag.
You can do:
mydf[, lag3 := .SD[.(weeks-3), on=.(weeks), roll=52-3, mean(x.sales), by=.EACHI]$V1]
which gives
> mydf[order(weeks)]
product year weeks sales lag3
1: 4689 2018 1 18 NA
2: 4695 2018 2 13 NA
3: 513377 2018 4 16 18
4: 604018 2018 5 10 13
5: 4693 2019 6 11 13
6: 513376 2019 7 16 16
7: 4706 2019 8 20 10
8: 1212 2019 8 12 10
9: 4691 2019 9 11 11
10: 4691 2019 10 20 16
11: 601606 2021 11 10 16
12: 4755 2021 12 14 11
13: 502659 2021 13 14 20
14: 4679 2021 14 19 10
15: 9934 2021 15 15 14
How it works:
- This is a join
x[i, ...]
of the columni=.(weeks-3)
withx=mydf
. - The join condition is: use the weeks column up to 52 weeks prior.
- The aggregation for the mean uses
by=.EACHI
to group by each row ofi
. x.sales
are values ofx
found in the join.mean(x.sales)
is calculated as a column with nameV1
Try running mydf[.(weeks-3), on=.(weeks), roll=52-3, mean(x.sales), by=.EACHI]
CodePudding user response:
If I understand your issue correct a dplyr
solution could look like this:
mean_sales <- function(year_weeks_, data){
mean(data %>%
filter(year_week < year_weeks_,
key >= year_weeks_ ) %>%
tail(3) %>%
pull(sales), na.rm = T)
}
mydf <- mydf %>%
mutate(year_week = year * 100 weeks, key = (year 1) * 100 weeks) %>%
arrange(year_week)
mydf_n <- mydf %>%
mutate(mean_sales = map(year_week, mean_sales, data = mydf))
result would be than this:
product year weeks sales year_week key mean_sales
1 4689 2018 1 18 201801 201901 NaN
2 4695 2018 2 13 201802 201902 18
3 513377 2018 4 16 201804 201904 15.5
4 604018 2018 5 10 201805 201905 15.66667
5 4693 2019 6 11 201906 202006 NaN
6 513376 2019 7 16 201907 202007 11
7 4706 2019 8 20 201908 202008 13.5
8 4691 2019 9 11 201909 202009 16
9 4691 2019 10 20 201910 202010 14.33333
10 1212 2019 8 12 201908 202008 13.5
11 601606 2021 11 10 202111 202211 NaN
12 4755 2021 12 14 202112 202212 10
13 502659 2021 13 14 202113 202213 12
14 4679 2021 14 19 202114 202214 12.66667
15 9934 2021 15 15 202115 202215 15.66667
Row 5 is NaN as there are no sales within 1 year before 2019 Week 06