Lag without previous value in R-CodePudding

I have a df with 6 columns. I want to find the delta based on the date and also group by first few columns.

df <- data.frame (col1  = c('A1','A1','A1','A2','A2','A2','A1','A1','A1','A2','A2','A2','A2','A2','A2'),
                  col2 = c('B1','B2','B3','B1','B2','B3','B1','B2','B3','B1','B2','B3','B1','B2','B3'),
                  col3 = c('C1','C2','C3','C1','C2','C3','C1','C2','C3','C1','C2','C3','C1','C2','C3'),
                  col4 = c('D1','D2','D22','D4','D5','D6','D1','D2','D3','D4','D5','D6','D7','D8','D9'),
                  col5 = c('1/01/2021','1/01/2021','1/01/2021','1/01/2021','1/01/2021','1/01/2021',
                           '1/01/2022','1/01/2022','1/01/2022','1/01/2022','1/01/2022','1/01/2022',
                           '1/01/2022','1/01/2022','1/01/2022'),
                  col6 = c(10,20,30,40,50,60,100, 200, 300,400,500,600,60,60, 60)
                  
)
diff_na<-df%>%
  group_by(col1,col2,col3,col4) %>%
  mutate(diff = col6 - lag(col6, default = first(col6,default = 0), order_by = col5))

Expected output is :

df11 <- data.frame (col1  = c('A1','A1','A1','A2','A2','A2','A1','A1','A1','A2','A2','A2','A2','A2','A2'),
                  col2 = c('B1','B2','B3','B1','B2','B3','B1','B2','B3','B1','B2','B3','B1','B2','B3'),
                  col3 = c('C1','C2','C3','C1','C2','C3','C1','C2','C3','C1','C2','C3','C1','C2','C3'),
                  col4 = c('D1','D2','D22','D4','D5','D6','D1','D2','D3','D4','D5','D6','D7','D8','D9'),
                  col5 = c('1/01/2021','1/01/2021','1/01/2021','1/01/2021','1/01/2021','1/01/2021',
                           '1/01/2022','1/01/2022','1/01/2022','1/01/2022','1/01/2022','1/01/2022',
                           '1/01/2022','1/01/2022','1/01/2022'),
                  col6 = c(10,20,30,40,50,60,100, 200, 300,400,500,600,60,60, 60),
                  dfiff =c(0,0,30,0,0,0,90,180,300,360,450,540,60,60,60)
                  
)

I am facing an issue if the value in the previous col4 is not there , then it does not subtract the value. I mean it should treat the missing value as 0. I tried giving first default as 0. But somehow the last three diff values are 0 instead of 60. Please guide where I am going wrong.

Row 3 has value in Col4 as D22 which is not there for date 01/01/2022 so 30 should be there. similary row 13,14,15 does not have corresponding value for date 01/01/2021. So diff col should have 60 .

Thanks & Regards, R

CodePudding user response：

From the extra details in the comments section, it seems that we need to use some conditional logic because of the special status of the initial date in the data frame. One way to handle this is to use group_split and map

library(tidyverse)

df %>%
  mutate(rn = row_number()) %>%
  group_split(col1, col2, col3, col4) %>%
  map(~ if(nrow(.x) > 1) {
    mutate(.x, diff = c(0, diff(col6))) 
    } else {
     mutate(.x, diff = if(col5 == '1/01/2021') 0 else col6) 
    }) %>%
  bind_rows() %>%
  arrange(rn) %>%
  select(-rn)
#> # A tibble: 15 x 7
#>    col1  col2  col3  col4  col5       col6  diff
#>    <chr> <chr> <chr> <chr> <chr>     <dbl> <dbl>
#>  1 A1    B1    C1    D1    1/01/2021    10     0
#>  2 A1    B2    C2    D2    1/01/2021    20     0
#>  3 A1    B3    C3    D22   1/01/2021    30     0
#>  4 A2    B1    C1    D4    1/01/2021    40     0
#>  5 A2    B2    C2    D5    1/01/2021    50     0
#>  6 A2    B3    C3    D6    1/01/2021    60     0
#>  7 A1    B1    C1    D1    1/01/2022   100    90
#>  8 A1    B2    C2    D2    1/01/2022   200   180
#>  9 A1    B3    C3    D3    1/01/2022   300   300
#> 10 A2    B1    C1    D4    1/01/2022   400   360
#> 11 A2    B2    C2    D5    1/01/2022   500   450
#> 12 A2    B3    C3    D6    1/01/2022   600   540
#> 13 A2    B1    C1    D7    1/01/2022    60    60
#> 14 A2    B2    C2    D8    1/01/2022    60    60
#> 15 A2    B3    C3    D9    1/01/2022    60    60

^{Created on 2022-11-21 with reprex v2.0.2}

CodePudding user response：

Here's a solution without group_map. I hope this does what you want, at least it reproduces the output for your example data.

I assume you want to group only by the first two letters of col4 so I split the column into col4a and col4b grouped by col4a and arranged by col4b and col5. It seems you want to use 0 as a default value in the lag only for the values in the second year and input 0 in the first year (except when you have multiple values in the first year)

Also the way your final data is ordered is not very intuitive.

df12 <- df %>% 
  mutate(
    col4a = str_sub(col4, 1, 2),
    col4b = str_sub(col4, 3, 3)
  ) %>% 
  group_by(col1, col2, col3, col4a) %>%
  arrange(col5, col4b) %>% 
  mutate(
    dfiff = col6 - lag(col6, default=NA), 
    dfiff = coalesce(dfiff, if_else(col5=="1/01/2021" & col4b=="", 0, col6))
  ) %>% 
  ungroup() %>%
  arrange(col5, col4a, col1, col3, col2) %>% 
  select(-col4a, -col4b)

I hope this helps, some other remarks:

I highly recommend not using strings as dates, especially when the date format is one in which alphabetical sorting does not correspond to the order of the dates (here it works accidentally because you always use "1/01" of each year, but you might get some other data and then your code would produce wrong output), you could use strptime with the format "%m/%d/%Y" to parse the dates from your column (assuming the one digit number is the month and the two digit number is the day).

Please try to describe better what you wish to accomplish, just a short paragraph is enough in most cases. The example input and output helps, thanks for posting it. But it takes way less time to answer your question if one has more context and does not need to reverse engineer the whole logic from input and output and more people will answer more quickly.