I have these two tables
library(data.table)
d = seq(0.1,1,by = 0.1)
n = length(d)
dtBig = data.table(id = rep(letters[1:2],each = n/2),
d1 = d,
d2 = d 0.2,
i = 1:n)
dtSmall = data.table(id = rep(letters[1:2],each = 2),
d_start = c(0.2,0.65,0.15,1.1),
d_end = c(0.65,0.85,0.8,1.5))
I would like to do an efficient merge on id
with two inequality conditions d1 >= d_start
and d2 <= d_end
.
The first method is time consuming when tables have many rows:
dtAll = merge(dtSmall, dtBig, by = "id", allow.cartesian = T)[d1 >= d_start & d2 <= d_end]
So I use 'on' operator:
dtAll2 = dtBig[dtSmall, on = .(id, d1 >= d_start, d2 <= d_end),nomatch = 0]
However, d1 takes values of d_start and d2 of d_end and I loose the values of d1 and d2.
So I made these commands:
dtAll2 = dtBig[dtSmall, on = .(id, d1 >= d_start, d2 <= d_end),nomatch = 0]
dtAll2[,`:=`(d_start = d1, d_end = d2)]
dtAll2[,`:=`(d1 = NULL, d2 = NULL)]
dtAll2 = dtAll2[dtBig[,.(i,d1,d2)],on = .(i == i),nomatch = 0]
verify that dtAll and dtAll2 are the same:
setcolorder(dtAll, names(dtAll2))
setkey(dtAll,i)
setkey(dtAll2,i)
all.equal(dtAll,dtAll2)
But I am sure there is a better way, any ideas ?
CodePudding user response:
You can use foverlaps
that is available data.table
for this, and from your d1 >= d_start & d2 <= d_end
we can tell that you are interested in those records in dtBig that are within the range of start/end in dtSmall, which can be provided in the type
argument. You must use setkey
on the y
(second table, dtSmall). You do not have to use by.y
as it defaults to your keys in y.
setkey(dtSmall, id, d_start, d_end)
dtAllF <- foverlaps(dtBig, dtSmall, by.x = c("id", "d1", "d2"), type = "within", nomatch = 0)
Results
dtAllF
# id d_start d_end d1 d2 i
# 1: a 0.20 0.65 0.2 0.4 2
# 2: a 0.20 0.65 0.3 0.5 3
# 3: a 0.20 0.65 0.4 0.6 4
# 4: b 0.15 0.80 0.6 0.8 6
Check for equality
setcolorder(dtAllF, c("id", "i"))
identical(dtAll2, dtAllF)
# [1] TRUE