The dev version of dplyr allows for rolling joins.
# devtools::install_github("/tidyverse/dplyr")
library(dplyr)
a <- tibble(
person = rep("a", 10),
time = 1:10
)
b <- tibble(
person = rep("a", 3),
time = c(1, 8, 5),
pref = c("good", "terrible", "not so good")
)
left_join(a, b, by = join_by(person, closest(time >= time)))
#> # A tibble: 10 × 4
#> person time.x time.y pref
#> <chr> <int> <dbl> <chr>
#> 1 a 1 1 good
#> 2 a 2 1 good
#> 3 a 3 1 good
#> 4 a 4 1 good
#> 5 a 5 5 not so good
#> 6 a 6 5 not so good
#> 7 a 7 5 not so good
#> 8 a 8 8 terrible
#> 9 a 9 8 terrible
#> 10 a 10 8 terrible
However, the same approach does not (at least not yet) work with dbplyr.
library(dbplyr)
library(dplyr)
a <- memdb_frame(
person = rep("a", 10),
time = 1:10
)
b <- memdb_frame(
person = rep("a", 3),
time = c(1, 8, 5),
pref = c("good", "terrible", "not so good")
)
left_join(a, b, by = join_by(person, closest(time >= time)))
#> Error in `dplyr::common_by()`:
#> ! `by` must be a (named) character vector, list, or NULL for natural
#> joins (not recommended in production code), not a <dplyr_join_by> object.`
How would you do this type of rolling join with database tables in R?
CodePudding user response:
One option is to make use of sql_on
argument in the left_join
library(dbplyr)
library(dplyr)
left_join(a, b, sql_on = "LHS.person = RHS.person and LHS.time >= RHS.time") %>%
select(-person.y) %>%
group_by(person.x, time.x) %>%
slice_max(time.y) %>%
ungroup
-output
# Source: SQL [10 x 4]
# Database: sqlite 3.39.2 [:memory:]
person.x time.x time.y pref
<chr> <int> <dbl> <chr>
1 a 1 1 good
2 a 2 1 good
3 a 3 1 good
4 a 4 1 good
5 a 5 5 not so good
6 a 6 5 not so good
7 a 7 5 not so good
8 a 8 8 terrible
9 a 9 8 terrible
10 a 10 8 terrible