I would like to do an interval join with an additional key. The simplest way in dplyr is quite slow
intervalDf <- tibble(id = rep(seq(1, 100000, 1), 10),
k1 = rep(seq(1, 1000, 1), 1000),
startTime = sample(seq(as.Date('1995/01/01'), as.Date('1999/06/01'), by="day"), 1000000, replace = TRUE),
endTime = startTime sample.int(180, 1000000, replace = TRUE))
eventDf <- tibble(k1 = rep(seq(1, 1000, 1), 200),
points = sample.int(10, 200000, replace = TRUE),
date = sample(seq(as.Date('1995/01/01'), as.Date('2000/01/01'), by="day"), 200000, replace = TRUE))
testDf <- inner_join(intervalDf, eventDf, by = "k1") %>%
filter(date >= startTime,
date <= endTime) %>%
count(id, startTime, endTime, wt = points, name = "points")
I was hoping to use interval_join in the fuzzyjoin package, but that function does not allow for an additional join key besides the interval https://rdrr.io/cran/fuzzyjoin/man/interval_join.html.
I'm guessing there's some slick way with rolling join in data.table or something, but I can't figure it out.
CodePudding user response:
we should use a non-equi join here, with slight modification to the approach outlined in the comments:
library(data.table)
setDT(eventDf)
setDT(intervalDf)
eventDf[intervalDf, on=.(k1, date>=startTime, date<=endTime)][, sum(points), .(id, startTime=date, endTime=date.1)]