I am joining these two data tables on id=ID and purchase_date between start_date and endate.
here is the code:
df1 <- data.table(ID=c(1,2,3),
start_date=c("2022-01-01","2022-10-15","2022-02-03"),
end_date=c("2022-01-03","2022-01-02","2022-01-01","2022-01-01"))
df1$start_date <- as.Date(df1$start_date)
df1$end_date <- as.Date(df1$end_date )
df2 <- data.table(id=c(1,1,7,8),
date_purchase=c("2022-01-05","2022-01-02","2022-01-01","2022-01-01"))
df2$date_purchase<-as.Date(df2$date_purchase)
The code to do this is the following:
df2[df1,.(x.date_purchase,id,ID,start_date,end_date), on=.(id=ID, date_purchase>=start_date, date_purchase<=end_date), nomatch=0]
I want to modify this code in a way that I only have to specify x.date_purchase, and add something that tells data.table I also want all remaining columns, and maybe specifiy all remaining from df1 or from df2. Is there a way to do this?
Expected output is this:
ID | start_date | end_date | date_purchase |
---|---|---|---|
1 | 2022-01-01 | 2022-01-03 | 2022-01-02 |
CodePudding user response:
Fixed data:
df1 <- read.table(text="ID start_date end_date
1 2022-01-01 2022-01-03
2 2022-10-15 2022-10-18
3 2022-02-03 2022-02-05",header=T) |> as.data.table()
df1[, c("start_date", "end_date") := lapply(.SD, as.Date), .SDcols = c("start_date", "end_date")]
df2 <- read.table(text="id date_purchase
1 2022-01-05
1 2022-01-02
7 2022-01-01
8 2022-01-01",header=T) |> as.data.table()
df2[, date_purchase := as.Date(date_purchase)]
The issue is that on non-equi joins such as this, data.table
is reassigning the value of the singular value (RHS) to the names of the non-equi (LHS) variables. Also, the RHS singular column can be omitted (as here). Frustrating, I don't know why.
df1[df2, on = .(ID == id, start_date <= date_purchase, end_date >= date_purchase), nomatch = 0]
# ID start_date end_date
# <int> <Date> <Date>
# 1: 1 2022-01-02 2022-01-02
Note (1) both start/end values are that of the desired date_purchases
, and (2) we have no real date_purchase
.
One workaround is to copy the start/end columns (LHS) to new names.
df1[, c("start", "end") := .(start_date, end_date)]
out <- df1[df2, on = .(ID == id, start <= date_purchase, end >= date_purchase), nomatch = 0]
out
# ID start_date end_date start end
# <int> <Date> <Date> <Date> <Date>
# 1: 1 2022-01-01 2022-01-03 2022-01-02 2022-01-02
While we don't have date_purchase
by-name, the start
and end
columns both contain its value, so we can rename one and drop the other.
setnames(out, 3, "date_purchase")[, end := NULL]
out
# ID start_date date_purchase start
# <int> <Date> <Date> <Date>
# 1: 1 2022-01-01 2022-01-03 2022-01-02
Raw data:
df1 <- data.table::as.data.table(structure(list(ID = 1:3, start_date = structure(c(18993, 19280, 19026), class = "Date"), end_date = structure(c(18995, 19283, 19028), class = "Date")), row.names = c(NA, -3L), class = c("data.table", "data.frame")))
df2 <- data.table::as.data.table(structure(list(id = c(1L, 1L, 7L, 8L), date_purchase = structure(c(18997, 18994, 18993, 18993), class = "Date")), row.names = c(NA, -4L), class = c("data.table", "data.frame")))
CodePudding user response:
A bit ugly, but you could use foverlaps
which sadly needs two ranges (a "start" and "end") in both tables.
You could simply create a duplicate column by reference.
df2[, dummy := date_purchase]
setkey(df1, ID, start_date, end_date)
setkey(df2, id, date_purchase, dummy)
foverlaps(df2, df1, nomatch = 0)[, dummy := NULL][]
# id start_date end_date date_purchase
# 1: 1 2022-01-01 2022-01-03 2022-01-02