How to create a lag column with a constraint in data.table?-CodePudding

My reproducible data is ;

mydf <- structure(list(product = c("4689", "4695", "513377", "604018", 
"4693", "513376", "4706", "4691", "4691", "1212", "601606", "4755", 
"502659", "4679", "9934"), year = c(2018, 2018, 2018, 2018, 2019, 
2019, 2019, 2019, 2019, 2019, 2021, 2021, 2021, 2021, 2021), 
    weeks = c(1, 2, 4, 5, 6, 7, 8, 9, 10, 8, 11, 12, 13, 14, 
    15), sales = c(18L, 13L, 16L, 10L, 11L, 16L, 20L, 11L, 20L, 
    12L, 10L, 14L, 14L, 19L, 15L)), row.names = c(NA, -15L), class = c("data.table", 
"data.frame"))

I want to calculate mean of sales for three weeks before, and followed this way ;

mydf[,lookup_week := weeks - 3]

lags <- mydf[,.(lag = mean(sales)),by = weeks]

joint_table <- merge(mydf,lags,by.x = 'lookup_week',by.y = 'weeks',all.x = T)

and it returns ;

    lookup_week product year weeks sales lag
 1:          -2    4689 2018     1    18  NA
 2:          -1    4695 2018     2    13  NA
 3:           1  513377 2018     4    16  18
 4:           2  604018 2018     5    10  13
 5:           3    4693 2019     6    11  NA
 6:           4  513376 2019     7    16  16
 7:           5    4706 2019     8    20  10
 8:           5    1212 2019     8    12  10
 9:           6    4691 2019     9    11  11
10:           7    4691 2019    10    20  16
11:           8  601606 2021    11    10  16
12:           9    4755 2021    12    14  11
13:          10  502659 2021    13    14  20
14:          11    4679 2021    14    19  10
15:          12    9934 2021    15    15  14

Here are the issues:

For row 5, I need to check where weeks equals to 3. But it doesn't exist, I need to go closest week before 3. it should be 2 in this example. if it wouldn't exist as well, I had to go to where weeks equals to 1.

The other issue is that, there should be at most 1 year between the year of the observation that the lag came from and the year of the observation from which I calculated the lag. So if I want to restrict going back as much as possible, it should be able to go back at most one year to calculate lag.

How can I do this ?

dplyr solutions are also welcome.

Thanks in advance.

CodePudding user response：

I want to calculate mean of sales for three weeks before

But it doesn't exist, I need to go closest week before 3. it should be 2 in this example. if it wouldn't exist as well, I had to go to where weeks equals to 1.

So if I want to restrict going back as much as possible, it should be able to go back at most one year to calculate lag.

You can do:

mydf[, lag3 := .SD[.(weeks-3), on=.(weeks), roll=52-3, mean(x.sales), by=.EACHI]$V1]

which gives

> mydf[order(weeks)]
    product year weeks sales lag3
 1:    4689 2018     1    18   NA
 2:    4695 2018     2    13   NA
 3:  513377 2018     4    16   18
 4:  604018 2018     5    10   13
 5:    4693 2019     6    11   13
 6:  513376 2019     7    16   16
 7:    4706 2019     8    20   10
 8:    1212 2019     8    12   10
 9:    4691 2019     9    11   11
10:    4691 2019    10    20   16
11:  601606 2021    11    10   16
12:    4755 2021    12    14   11
13:  502659 2021    13    14   20
14:    4679 2021    14    19   10
15:    9934 2021    15    15   14

How it works:

This is a join x[i, ...] of the column i=.(weeks-3) with x=mydf.
The join condition is: use the weeks column up to 52 weeks prior.
The aggregation for the mean uses by=.EACHI to group by each row of i.
x.sales are values of x found in the join.
mean(x.sales) is calculated as a column with name V1

Try running mydf[.(weeks-3), on=.(weeks), roll=52-3, mean(x.sales), by=.EACHI]

CodePudding user response：

If I understand your issue correct a dplyr solution could look like this:

mean_sales <- function(year_weeks_, data){

 mean(data %>% 
    filter(year_week < year_weeks_, 
           key >= year_weeks_ ) %>% 
    tail(3) %>% 
    pull(sales), na.rm = T)
  
}

mydf <- mydf %>% 
  mutate(year_week = year * 100   weeks, key = (year   1) * 100   weeks) %>%
arrange(year_week)

mydf_n <- mydf %>%  
  mutate(mean_sales = map(year_week, mean_sales, data = mydf))

result would be than this:

   product year weeks sales year_week    key mean_sales
1     4689 2018     1    18    201801 201901        NaN
2     4695 2018     2    13    201802 201902         18
3   513377 2018     4    16    201804 201904       15.5
4   604018 2018     5    10    201805 201905   15.66667
5     4693 2019     6    11    201906 202006        NaN
6   513376 2019     7    16    201907 202007         11
7     4706 2019     8    20    201908 202008       13.5
8     4691 2019     9    11    201909 202009         16
9     4691 2019    10    20    201910 202010   14.33333
10    1212 2019     8    12    201908 202008       13.5
11  601606 2021    11    10    202111 202211        NaN
12    4755 2021    12    14    202112 202212         10
13  502659 2021    13    14    202113 202213         12
14    4679 2021    14    19    202114 202214   12.66667
15    9934 2021    15    15    202115 202215   15.66667

Row 5 is NaN as there are no sales within 1 year before 2019 Week 06