I have two data frames with different data for the same people. Data frame 1 (dfx) has unique ids and dates people had appointments and data frame 2 has unique ids and a start and end date.
It looks something like below:
c1 <- c("1", "1", "1", "1", "1", "2", "2", "2", "2", "2")
d1 <- c("2017", "2018", "2019", "2020", "2021", "2019", "2019", "2019", "2020", "2021")
dfx <- data.frame(c1,d1)
c2 <- c("1", "1", "2")
ds <- c("2017", "2020", "2017")
de <- c("2018", "2021", "2018")
dfy <- data.frame(c2,ds,de)
I'm working with data frame 2 and I want to know if dates in data frame 1 is within the start and end dates in data frame 2. I am trying to get an output in dfy saying TRUE or FALSE for overlap.
For this example, the output should return TRUE, TRUE, FALSE.
I've tried working with this on dplyr and not getting the result I'm looking. I'll appreciate any help. Thanks.
dplyr code:
overlap <- dfy %>%
group_by(c2) %>%
mutate (on_hold = any(mapply(function(id, start, end) any(id == dfx$c1 & dfx$d1 > start & dfx$d1 < end), c2, ds, de))) %>%
arrange(c2, ds, de, on_hold)
CodePudding user response:
solution
ranges <- dfx %>%
group_by(c1) %>%
summarise(range = list(unique(d1)))
left_join(dfy, ranges, by = c("c2" = "c1")) %>%
rowwise() %>%
mutate(in_range = ds %in% range & de %in% range)
output
# A tibble: 3 x 5
# Rowwise:
c2 ds de range in_range
<fct> <fct> <fct> <list> <lgl>
1 1 2017 2018 <fct [5]> TRUE
2 1 2020 2021 <fct [5]> TRUE
3 2 2017 2018 <fct [3]> FALSE
data as provided by OP
c1 <- c("1", "1", "1", "1", "1", "2", "2", "2", "2", "2")
d1 <- c("2017", "2018", "2019", "2020", "2021", "2019", "2019", "2019", "2020", "2021")
dfx <- data.frame(c1,d1)
c2 <- c("1", "1", "2")
ds <- c("2017", "2020", "2017")
de <- c("2018", "2021", "2018")
dfy <- data.frame(c2,ds,de)
CodePudding user response:
I think what you might want is fuzzyjoin
. It would also be much easier if your data was numeric (or a date), but I think this would still work with the character values as the package probably coerces them. I’m assuming you just have the year so numeric is fine, but this would also work the same way dates. fuzzy_left_join
allows you to join within a range and we can then check if the expected values are missing after the join.
If you just want the original dfy
with TRUE
or FALSE
, this would be a way with fuzzyjoin
:
library(fuzzyjoin)
library(dplyr)
dfx <- dfx %>% mutate_all(as.numeric)
dfy <- dfy %>% mutate_all(as.numeric)
df <-
fuzzy_left_join(dfy, dfx,
by = c("c2" = "c1",
"ds" = "d1",
"de" = "d1"),
match_fun = list(`==`, `<=`, `>=`)) %>%
mutate(appt_in_range = ifelse(!is.na(d1), TRUE, FALSE)) %>%
select(-d1, -c1) %>%
unique() %>%
inner_join(., dfy)
#> Joining, by = c("c2", "ds", "de")
df
#> c2 ds de appt_in_range
#> 1 1 2017 2018 TRUE
#> 2 1 2020 2021 TRUE
#> 3 2 2017 2018 FALSE
To check if any date range contains the id:
df <-
fuzzy_left_join(dfx, dfy,
by = c("c1" = "c2",
"d1" = "ds",
"d1" = "de"),
match_fun = list(`==`, `>=`, `<=`)) %>%
mutate(appt_in_range = ifelse(!is.na(ds), TRUE, FALSE)) %>%
select(-c2)
df
#> c1 d1 ds de appt_in_range
#> 1 1 2017 2017 2018 TRUE
#> 2 1 2018 2017 2018 TRUE
#> 3 1 2019 NA NA FALSE
#> 4 1 2020 2020 2021 TRUE
#> 5 1 2021 2020 2021 TRUE
#> 6 2 2019 NA NA FALSE
#> 7 2 2019 NA NA FALSE
#> 8 2 2019 NA NA FALSE
#> 9 2 2020 NA NA FALSE
#> 10 2 2021 NA NA FALSE
Created on 2022-02-08 by the reprex package (v2.0.1)