Home > Blockchain >  Comparing dates across different data frames and return output
Comparing dates across different data frames and return output

Time:02-11

I have two data frames with different data for the same people. Data frame 1 (dfx) has unique ids and dates people had appointments and data frame 2 has unique ids and a start and end date.

It looks something like below:

c1 <- c("1", "1", "1", "1", "1", "2", "2", "2", "2", "2")
d1 <- c("2017", "2018", "2019", "2020", "2021", "2019", "2019", "2019", "2020", "2021")

dfx <- data.frame(c1,d1)
c2 <- c("1", "1", "2")
ds <- c("2017", "2020", "2017")
de <- c("2018", "2021", "2018")

dfy <- data.frame(c2,ds,de) 

I'm working with data frame 2 and I want to know if dates in data frame 1 is within the start and end dates in data frame 2. I am trying to get an output in dfy saying TRUE or FALSE for overlap.

For this example, the output should return TRUE, TRUE, FALSE.

I've tried working with this on dplyr and not getting the result I'm looking. I'll appreciate any help. Thanks.

dplyr code:

overlap <- dfy %>%
  group_by(c2) %>%
  mutate (on_hold = any(mapply(function(id, start, end) any(id == dfx$c1 & dfx$d1 > start & dfx$d1 < end), c2, ds, de))) %>%
  arrange(c2, ds, de, on_hold)

CodePudding user response:

solution

ranges <- dfx %>%
  group_by(c1) %>%
  summarise(range = list(unique(d1)))

left_join(dfy, ranges, by = c("c2" = "c1")) %>%
  rowwise() %>%
  mutate(in_range = ds %in% range & de %in% range)

output

# A tibble: 3 x 5
# Rowwise: 
  c2    ds    de    range     in_range
  <fct> <fct> <fct> <list>    <lgl>   
1 1     2017  2018  <fct [5]> TRUE    
2 1     2020  2021  <fct [5]> TRUE    
3 2     2017  2018  <fct [3]> FALSE 

data as provided by OP

c1 <- c("1", "1", "1", "1", "1", "2", "2", "2", "2", "2")
d1 <- c("2017", "2018", "2019", "2020", "2021", "2019", "2019", "2019", "2020", "2021")

dfx <- data.frame(c1,d1)

c2 <- c("1", "1", "2")
ds <- c("2017", "2020", "2017")
de <- c("2018", "2021", "2018")

dfy <- data.frame(c2,ds,de)

CodePudding user response:

I think what you might want is fuzzyjoin. It would also be much easier if your data was numeric (or a date), but I think this would still work with the character values as the package probably coerces them. I’m assuming you just have the year so numeric is fine, but this would also work the same way dates. fuzzy_left_join allows you to join within a range and we can then check if the expected values are missing after the join.

If you just want the original dfy with TRUE or FALSE, this would be a way with fuzzyjoin:

library(fuzzyjoin)
library(dplyr)

dfx <- dfx %>% mutate_all(as.numeric)
dfy <- dfy %>% mutate_all(as.numeric)


df <- 
  fuzzy_left_join(dfy, dfx,
                  by = c("c2" = "c1",
                         "ds" = "d1",
                         "de" = "d1"),
                  match_fun = list(`==`, `<=`, `>=`)) %>%
  mutate(appt_in_range = ifelse(!is.na(d1), TRUE, FALSE)) %>%
  select(-d1, -c1) %>%
  unique() %>%
  inner_join(., dfy)
#> Joining, by = c("c2", "ds", "de")

df
#>   c2   ds   de appt_in_range
#> 1  1 2017 2018          TRUE
#> 2  1 2020 2021          TRUE
#> 3  2 2017 2018         FALSE

To check if any date range contains the id:

df <- 
  fuzzy_left_join(dfx, dfy,
                  by = c("c1" = "c2",
                         "d1" = "ds",
                         "d1" = "de"),
                  match_fun = list(`==`, `>=`, `<=`)) %>%
  mutate(appt_in_range = ifelse(!is.na(ds), TRUE, FALSE)) %>%
  select(-c2)

df
#>    c1   d1   ds   de appt_in_range
#> 1   1 2017 2017 2018          TRUE
#> 2   1 2018 2017 2018          TRUE
#> 3   1 2019   NA   NA         FALSE
#> 4   1 2020 2020 2021          TRUE
#> 5   1 2021 2020 2021          TRUE
#> 6   2 2019   NA   NA         FALSE
#> 7   2 2019   NA   NA         FALSE
#> 8   2 2019   NA   NA         FALSE
#> 9   2 2020   NA   NA         FALSE
#> 10  2 2021   NA   NA         FALSE

Created on 2022-02-08 by the reprex package (v2.0.1)

  • Related