Conditional lag with dates in R - how to code skip to next date if dates are the same?-CodePudding

I'm stuck on how to conditionally lag a date variable within the same ID group, but skip the next date value and lag the one after if it is the same as the original date. To illustrate,

data have
ID Value Date
a 50 1987-01-01
a 52 1987-01-01
a 33 1989-05-03
b 10 1979-08-28
b 52 1988-02-23
b 45 1988-02-23
b 33 1988-02-23
b 40 1990-05-20

data want
ID Value Date Lead_date
a 50 1987-01-01 1989-05-03
a 52 1987-01-01 1989-05-03
a 33 1989-05-03 NA
b 10 1979-08-28 1988-02-23
b 52 1988-02-23 1990-05-20
b 45 1988-02-23 1990-05-20
b 33 1988-02-23 1990-05-20
b 40 1990-05-20 NA

I'm not sure if there is an elegant way of doing this, or if I need a for loop?

Any advice is much appreciated!

CodePudding user response：

1) findInterval will find the position of the last Date matching the current one so just add 1 to get the position of the next Date.

library(data.table)

setDT(DF)
DF[, Lead_date := Date[findInterval(Date, Date)   1], by = ID]

DF
##    ID Value       Date  Lead_date
## 1:  a    50 1987-01-01 1989-05-03
## 2:  a    52 1987-01-01 1989-05-03
## 3:  a    33 1989-05-03       <NA>
## 4:  b    10 1979-08-28 1988-02-23
## 5:  b    52 1988-02-23 1990-05-20
## 6:  b    45 1988-02-23 1990-05-20
## 7:  b    33 1988-02-23 1990-05-20
## 8:  b    40 1990-05-20       <NA>

2) An alternative is to get the position of the match of Date to the unique dates and then take the next unique date:

library(data.table)
setDT(DF)
DF[, Lead_date := { u <- unique(Date); u[match(Date, u)   1] }, by = ID]

3) Another way is to replace duplicated dates with NA and then use na.locf0 to fill them in taking the next date.

library(data.table)
library(zoo)

setDF(DF)
DF[, Lead_Date := replace(Date, duplicated(Date), NA) |>
                    na.locf0(fromLast = TRUE) |>
                    shift(-1), by = ID]

Note

Lines <- "ID Value Date
a 50 1987-01-01
a 52 1987-01-01
a 33 1989-05-03
b 10 1979-08-28
b 52 1988-02-23
b 45 1988-02-23
b 33 1988-02-23
b 40 1990-05-20"
DF <- read.table(text = Lines, header = TRUE)
DF$Date <- as.Date(DF$Date)

CodePudding user response：

do.call(rbind, lapply(split(d, d["ID"]),
                      function(x) {
                        ld <- rle(x$Date)
                        ld$values <- c(ld$values[-1], NA)
                        x$Lead_date <- inverse.rle(ld)
                        x}
                      
))

The data

d <- read.table(text="ID Value Date
a 50 1987-01-01
a 52 1987-01-01
a 33 1989-05-03
b 10 1979-08-28
b 52 1988-02-23
b 45 1988-02-23
b 33 1988-02-23
b 40 1990-05-20", header=TRUE)

CodePudding user response：

You can implement this using a for loop, but there is a more elegant way to do this in R using the dplyr and data.table packages. Here's one way to do it:

library(dplyr)
library(data.table)

df <- data.frame(ID = c("a", "a", "a", "b", "b", "b", "b", "b"),
                 Value = c(50, 52, 33, 10, 52, 45, 33, 40),
                 Date = as.Date(c("1987-01-01", "1987-01-01", "1989-05-03", "1979-08-28", "1988-02-23", "1988-02-23", "1988-02-23", "1990-05-20")))

df_lead_dates <- df %>% 
  group_by(ID, Value) %>% 
  slice_tail(n = 1) %>% 
  ungroup() %>% 
  rename(Lead_date = Date)

df %>% 
  left_join(df_lead_dates, by = c("ID", "Value", "Date")) %>% 
  mutate(Lead_date = ifelse(is.na(Lead_date), NA_real_, Lead_date))

It first creates a new data frame df_lead_dates which contains only the last date for each ID and Value group. Then it joins this data frame with the original data frame using the left_join function. Finally, it replaces the NA values in the Lead_date column with NA_real_, which is a special type of missing value in R that indicates the absence of a real value.

CodePudding user response：

We can do a non-equi join:

data[data, Lead_Date := i.Date, on = .(ID, Date < Date)]
#        ID Value       Date  Lead_Date
#    <char> <int>     <Date>     <Date>
# 1:      a    50 1987-01-01 1989-05-03
# 2:      a    52 1987-01-01 1989-05-03
# 3:      a    33 1989-05-03       <NA>
# 4:      b    10 1979-08-28 1990-05-20
# 5:      b    52 1988-02-23 1990-05-20
# 6:      b    45 1988-02-23 1990-05-20
# 7:      b    33 1988-02-23 1990-05-20
# 8:      b    40 1990-05-20       <NA>

Data

data <- data.table::as.data.table(structure(list(ID = c("a", "a", "a", "b", "b", "b", "b", "b"), Value = c(50L, 52L, 33L, 10L, 52L, 45L, 33L, 40L), Date = structure(c(6209, 6209, 7062, 3526, 6627, 6627, 6627, 7444), class = "Date")), row.names = c(NA, -8L), class = c("data.table", "data.frame")))