Home > Net >  R Lag/Lead on Date Column Identification
R Lag/Lead on Date Column Identification

Time:09-23

I want to create a new identifier column in a dataset I have.

ex <- structure(list(id = c("8210109300002", "8210109300002", "8210109300002", 
 "8210109300002", "8210109300002", "8210109300002", "8210109300002", 
 "8210109300002", "8210109300002"), serv_from_dt = structure(c(18262, 
 18263, 18267, 18267, 18268, 18269, 18269, 18275, 18276), class = "Date"), 
 serv_to_dt = structure(c(18262, 18263, 18267, 18267, 18268, 
 18269, 18269, 18275, 18276), class = "Date"), date_plus1 = structure(c(18263, 
 18264, 18268, 18268, 18269, 18270, 18270, 18276, 18277), class = "Date")), 
 row.names = c(NA, -9L), class = c("data.table", "data.frame"))

This identifier would be based on the serv_to_date, serv_from_date, and date_plus1 columns. The data is ordered by serv_from_date; if the next row has a ser_to_date that is equal to the previous row's serv_from_date OR a serv_to_date that is equal to the previous row's serv_from_date 1 (which is the date_plus1 column) then label those rows with 1 identifier.

The final output I'd like is:

want <- structure(list(id = c("8210109300002", "8210109300002", "8210109300002", 
 "8210109300002", "8210109300002", "8210109300002", "8210109300002", 
 "8210109300002", "8210109300002"), serv_from_dt = structure(c(18262, 
 18263, 18267, 18267, 18268, 18269, 18269, 18275, 18276), class = "Date"), 
 serv_to_dt = structure(c(18262, 18263, 18267, 18267, 18268, 
 18269, 18269, 18275, 18276), class = "Date"), date_plus1 = structure(c(18263, 
 18264, 18268, 18268, 18269, 18270, 18270, 18276, 18277), class = "Date"),
 identifier = c("1", "1", "2", 
 "2", "2", "2", "2", 
 "3", "3")), row.names = c(NA, -9L), class = c("data.table", "data.frame"))

My first step was to create a column that identifies lag dates with the previous row's dates:

ex %>% 
  mutate(NewCol = ifelse((lag(serv_from_dt) == date_plus1 | lag(serv_from_dt) == serv_to_dt), "yes", "no"))

however, this code does not correctly say "yes" to serv_from_date's that match previous row's date_plus1.

Thanks in advance for any help you can provide!

CodePudding user response:

Your logic is good, you're just missing the final step: we need to do a cumulative count of the "yes" values, with cumsum.

And actually we can simplify if we skip the ifelse and leave the result as TRUE/FALSE instead of "yes"/"no", and use a nice default value to make sure the first row is TRUE.

want %>% 
  mutate(NewCol = cumsum(
    lag(serv_from_dt, default = first(date_plus1)) == date_plus1 |
      lag(serv_from_dt) == serv_to_dt)
  )
#              id serv_from_dt serv_to_dt date_plus1 identifier NewCol
# 1 8210109300002   2020-01-01 2020-01-01 2020-01-02          1      1
# 2 8210109300002   2020-01-02 2020-01-02 2020-01-03          1      1
# 3 8210109300002   2020-01-06 2020-01-06 2020-01-07          2      1
# 4 8210109300002   2020-01-06 2020-01-06 2020-01-07          2      2
# 5 8210109300002   2020-01-07 2020-01-07 2020-01-08          2      2
# 6 8210109300002   2020-01-08 2020-01-08 2020-01-09          2      2
# 7 8210109300002   2020-01-08 2020-01-08 2020-01-09          2      3
# 8 8210109300002   2020-01-14 2020-01-14 2020-01-15          3      3
# 9 8210109300002   2020-01-15 2020-01-15 2020-01-16          3      3

CodePudding user response:

With data.table:

library(data.table)

setDT(ex)

ex[,identifier:=cumsum(!(serv_to_dt == shift(serv_from_dt,1,fill = FALSE)|serv_to_dt == shift(serv_from_dt,1,fill=FALSE) 1))][]

              id serv_from_dt serv_to_dt date_plus1 identifier
1: 8210109300002   2020-01-01 2020-01-01 2020-01-02          1
2: 8210109300002   2020-01-02 2020-01-02 2020-01-03          1
3: 8210109300002   2020-01-06 2020-01-06 2020-01-07          2
4: 8210109300002   2020-01-06 2020-01-06 2020-01-07          2
5: 8210109300002   2020-01-07 2020-01-07 2020-01-08          2
6: 8210109300002   2020-01-08 2020-01-08 2020-01-09          2
7: 8210109300002   2020-01-08 2020-01-08 2020-01-09          2
8: 8210109300002   2020-01-14 2020-01-14 2020-01-15          3
9: 8210109300002   2020-01-15 2020-01-15 2020-01-16          3

CodePudding user response:

The following logic using cumsum will only increment when serv_to_dt is not equal to either the lag value of serv_from_dt and date_plus1. row_number() == 1 starts the cumulative sum at 1.

library(dplyr)

ex %>% 
  mutate(identifier = cumsum((serv_to_dt != lag(serv_from_dt) & serv_to_dt != lag(date_plus1)) | row_number() == 1))

Output

             id serv_from_dt serv_to_dt date_plus1 identifier
1 8210109300002   2020-01-01 2020-01-01 2020-01-02          1
2 8210109300002   2020-01-02 2020-01-02 2020-01-03          1
3 8210109300002   2020-01-06 2020-01-06 2020-01-07          2
4 8210109300002   2020-01-06 2020-01-06 2020-01-07          2
5 8210109300002   2020-01-07 2020-01-07 2020-01-08          2
6 8210109300002   2020-01-08 2020-01-08 2020-01-09          2
7 8210109300002   2020-01-08 2020-01-08 2020-01-09          2
8 8210109300002   2020-01-14 2020-01-14 2020-01-15          3
9 8210109300002   2020-01-15 2020-01-15 2020-01-16          3
  • Related