I'm stuck on how to conditionally lag a date variable within the same ID group, but skip the next date value and lag the one after if it is the same as the original date. To illustrate,
data have
ID Value Date
a 50 1987-01-01
a 52 1987-01-01
a 33 1989-05-03
b 10 1979-08-28
b 52 1988-02-23
b 45 1988-02-23
b 33 1988-02-23
b 40 1990-05-20
data want
ID Value Date Lead_date
a 50 1987-01-01 1989-05-03
a 52 1987-01-01 1989-05-03
a 33 1989-05-03 NA
b 10 1979-08-28 1988-02-23
b 52 1988-02-23 1990-05-20
b 45 1988-02-23 1990-05-20
b 33 1988-02-23 1990-05-20
b 40 1990-05-20 NA
I'm not sure if there is an elegant way of doing this, or if I need a for loop?
Any advice is much appreciated!
CodePudding user response:
1) findInterval
will find the position of the last Date matching the current one so just add 1 to get the position of the next Date.
library(data.table)
setDT(DF)
DF[, Lead_date := Date[findInterval(Date, Date) 1], by = ID]
DF
## ID Value Date Lead_date
## 1: a 50 1987-01-01 1989-05-03
## 2: a 52 1987-01-01 1989-05-03
## 3: a 33 1989-05-03 <NA>
## 4: b 10 1979-08-28 1988-02-23
## 5: b 52 1988-02-23 1990-05-20
## 6: b 45 1988-02-23 1990-05-20
## 7: b 33 1988-02-23 1990-05-20
## 8: b 40 1990-05-20 <NA>
2) An alternative is to get the position of the match of Date to the unique dates and then take the next unique date:
library(data.table)
setDT(DF)
DF[, Lead_date := { u <- unique(Date); u[match(Date, u) 1] }, by = ID]
3) Another way is to replace duplicated dates with NA and then use na.locf0 to fill them in taking the next date.
library(data.table)
library(zoo)
setDF(DF)
DF[, Lead_Date := replace(Date, duplicated(Date), NA) |>
na.locf0(fromLast = TRUE) |>
shift(-1), by = ID]
Note
Lines <- "ID Value Date
a 50 1987-01-01
a 52 1987-01-01
a 33 1989-05-03
b 10 1979-08-28
b 52 1988-02-23
b 45 1988-02-23
b 33 1988-02-23
b 40 1990-05-20"
DF <- read.table(text = Lines, header = TRUE)
DF$Date <- as.Date(DF$Date)
CodePudding user response:
do.call(rbind, lapply(split(d, d["ID"]),
function(x) {
ld <- rle(x$Date)
ld$values <- c(ld$values[-1], NA)
x$Lead_date <- inverse.rle(ld)
x}
))
The data
d <- read.table(text="ID Value Date
a 50 1987-01-01
a 52 1987-01-01
a 33 1989-05-03
b 10 1979-08-28
b 52 1988-02-23
b 45 1988-02-23
b 33 1988-02-23
b 40 1990-05-20", header=TRUE)
CodePudding user response:
You can implement this using a for
loop, but there is a more elegant way to do this in R
using the dplyr
and data.table
packages. Here's one way to do it:
library(dplyr)
library(data.table)
df <- data.frame(ID = c("a", "a", "a", "b", "b", "b", "b", "b"),
Value = c(50, 52, 33, 10, 52, 45, 33, 40),
Date = as.Date(c("1987-01-01", "1987-01-01", "1989-05-03", "1979-08-28", "1988-02-23", "1988-02-23", "1988-02-23", "1990-05-20")))
df_lead_dates <- df %>%
group_by(ID, Value) %>%
slice_tail(n = 1) %>%
ungroup() %>%
rename(Lead_date = Date)
df %>%
left_join(df_lead_dates, by = c("ID", "Value", "Date")) %>%
mutate(Lead_date = ifelse(is.na(Lead_date), NA_real_, Lead_date))
It first creates a new data frame df_lead_dates
which contains only the last date for each ID
and Value
group. Then it joins this data frame with the original data frame using the left_join
function. Finally, it replaces the NA
values in the Lead_date
column with NA_real_
, which is a special type of missing value in R
that indicates the absence of a real value.
CodePudding user response:
We can do a non-equi join:
data[data, Lead_Date := i.Date, on = .(ID, Date < Date)]
# ID Value Date Lead_Date
# <char> <int> <Date> <Date>
# 1: a 50 1987-01-01 1989-05-03
# 2: a 52 1987-01-01 1989-05-03
# 3: a 33 1989-05-03 <NA>
# 4: b 10 1979-08-28 1990-05-20
# 5: b 52 1988-02-23 1990-05-20
# 6: b 45 1988-02-23 1990-05-20
# 7: b 33 1988-02-23 1990-05-20
# 8: b 40 1990-05-20 <NA>
Data
data <- data.table::as.data.table(structure(list(ID = c("a", "a", "a", "b", "b", "b", "b", "b"), Value = c(50L, 52L, 33L, 10L, 52L, 45L, 33L, 40L), Date = structure(c(6209, 6209, 7062, 3526, 6627, 6627, 6627, 7444), class = "Date")), row.names = c(NA, -8L), class = c("data.table", "data.frame")))