Home > Back-end >  Show selected/all columns when joining in data.table
Show selected/all columns when joining in data.table

Time:12-16

I am joining these two data tables on id=ID and purchase_date between start_date and endate.

here is the code:

df1 <- data.table(ID=c(1,2,3),
                 start_date=c("2022-01-01","2022-10-15","2022-02-03"),
                 end_date=c("2022-01-03","2022-01-02","2022-01-01","2022-01-01"))

df1$start_date <- as.Date(df1$start_date)
df1$end_date <- as.Date(df1$end_date )

df2 <- data.table(id=c(1,1,7,8),
                 date_purchase=c("2022-01-05","2022-01-02","2022-01-01","2022-01-01"))

df2$date_purchase<-as.Date(df2$date_purchase)

The code to do this is the following:


df2[df1,.(x.date_purchase,id,ID,start_date,end_date), on=.(id=ID, date_purchase>=start_date, date_purchase<=end_date), nomatch=0]

I want to modify this code in a way that I only have to specify x.date_purchase, and add something that tells data.table I also want all remaining columns, and maybe specifiy all remaining from df1 or from df2. Is there a way to do this?

Expected output is this:

ID start_date end_date date_purchase
1 2022-01-01 2022-01-03 2022-01-02

CodePudding user response:

Fixed data:

df1 <- read.table(text="ID   start_date  end_date
1   2022-01-01  2022-01-03
2   2022-10-15  2022-10-18
3   2022-02-03  2022-02-05",header=T) |> as.data.table()
df1[, c("start_date", "end_date") := lapply(.SD, as.Date), .SDcols = c("start_date", "end_date")]

df2 <- read.table(text="id   date_purchase
1   2022-01-05
1   2022-01-02
7   2022-01-01
8   2022-01-01",header=T) |> as.data.table()
df2[, date_purchase := as.Date(date_purchase)]

The issue is that on non-equi joins such as this, data.table is reassigning the value of the singular value (RHS) to the names of the non-equi (LHS) variables. Also, the RHS singular column can be omitted (as here). Frustrating, I don't know why.

df1[df2, on = .(ID == id, start_date <= date_purchase, end_date >= date_purchase), nomatch = 0]
#       ID start_date   end_date
#    <int>     <Date>     <Date>
# 1:     1 2022-01-02 2022-01-02

Note (1) both start/end values are that of the desired date_purchases, and (2) we have no real date_purchase.

One workaround is to copy the start/end columns (LHS) to new names.

df1[, c("start", "end") := .(start_date, end_date)]
out <- df1[df2, on = .(ID == id, start <= date_purchase, end >= date_purchase), nomatch = 0]
out
#       ID start_date   end_date      start        end
#    <int>     <Date>     <Date>     <Date>     <Date>
# 1:     1 2022-01-01 2022-01-03 2022-01-02 2022-01-02

While we don't have date_purchase by-name, the start and end columns both contain its value, so we can rename one and drop the other.

setnames(out, 3, "date_purchase")[, end := NULL]
out
#       ID start_date date_purchase      start
#    <int>     <Date>        <Date>     <Date>
# 1:     1 2022-01-01    2022-01-03 2022-01-02

Raw data:

df1 <- data.table::as.data.table(structure(list(ID = 1:3, start_date = structure(c(18993, 19280, 19026), class = "Date"), end_date = structure(c(18995, 19283, 19028), class = "Date")), row.names = c(NA, -3L), class = c("data.table", "data.frame")))
df2 <- data.table::as.data.table(structure(list(id = c(1L, 1L, 7L, 8L), date_purchase = structure(c(18997, 18994, 18993, 18993), class = "Date")), row.names = c(NA, -4L), class = c("data.table", "data.frame")))

CodePudding user response:

A bit ugly, but you could use foverlaps which sadly needs two ranges (a "start" and "end") in both tables.

You could simply create a duplicate column by reference.

df2[, dummy := date_purchase]

setkey(df1, ID, start_date, end_date)
setkey(df2, id, date_purchase, dummy)

foverlaps(df2, df1, nomatch = 0)[, dummy := NULL][]

# id start_date   end_date date_purchase
# 1:  1 2022-01-01 2022-01-03    2022-01-02
  • Related