Home > Enterprise >  How to left join two data frames conditionally - by rows that fall within a date range - and by two
How to left join two data frames conditionally - by rows that fall within a date range - and by two

Time:05-14

I have two simulated data frames:

  1. d, created below, which has all the rows of longitudinal data for two different people. Each row has a start and end date. Some rows have a visit date associated with them because the person had a visit during that time frame.
  2. d2, created below (printed in second block of code), which has only one row per visit date per person (this is the 'backbone' or 'finder file' data frame) onto which I want to merge new rows.

Problem: I want to merge ONLY the rows from d that contain time that falls one month before each visit date for each person to the rows in d2. To get the data to merge properly, I need to merge by both the person and the visit date. In the end, I want a data set that contains only the rows whose start and end dates contain the one month before each visit date, and I want to keep all columns. I need a solution using dplyr.

Below, I've tried this 4 different ways, and annotated the error message associated with each.

#Load packages
pacman::p_load(dplyr, tidyr, lubridate, sqldf)

#Create data frame 1
#Create variables for data frame 1 
person <- c(1, 1, 1, 1, 1, 2, 2, 2)
start <- c('2016-06-17', '2016-10-01', '2017-01-01', '2017-01-15', '2017-06-05', '2014-12-14', '2015-01-01', '2015-01-19')
end <- c('2016-09-30', '2016-12-31', '2017-01-14', '2017-06-04', '2017-09-03', '2014-12-31', '2015-01-18', '2015-07-03')
visit <- c(NA, NA, NA, '2017-01-15', '2017-08-01', NA, NA, '2015-02-22')
row <- c(1, 2, 3, 4, 5, 1, 2, 3)

#Populate data frame 1 with variables
d <- cbind(person, row)
d <- as.data.frame(d)

#Format dates and add to data frame 1
d$start <- as.Date(start, format = '%Y-%m-%d')
d$end <- as.Date(end, format = '%Y-%m-%d')
d$visit <- as.Date(visit, format = '%Y-%m-%d')

######

#Create data frame 2
person_2 <- c(1, 1, 2)
visit_2 <- c('2017-01-15', '2017-08-01', '2015-02-22')

#Populate data frame 2
d2 <- cbind(person_2, visit_2)
d2 <- as.data.frame(d2)

#Format dates and add to data frame 2
d2$visit_2 <- as.Date(visit_2, format = '%Y-%m-%d')

#Need to merge conditionally such that only rows from d that contain time one month before 'visit' are selected, and then merged by both 'person' and 'visit' to d2 (the 'backbone' data set)
#Attempt 1 yields this error message:
#Error in if (((d$visit %m-% months(1)) >= d$start) & ((d$visit %m-% months(1)) <=  : 
 # the condition has length > 1
if (((d$visit %m-% months(1)) >= d$start) & ((d$visit %m-% months(1)) <= d$end)) {
  left_join(x = d, y = d2, by = c('person' = 'person_2', 'visit' = 'visit_2'))
}
#> Error in if (((d$visit %m-% months(1)) >= d$start) & ((d$visit %m-% months(1)) <= : the condition has length > 1

#Attempt 2 - Error: tinyformat: Too many conversion specifiers in format string
result = sqldf('
  select *
  from back left join d on 
  (((d$visit %m-% months(1)) >= d$start) & ((d$visit %m-% months(1)) <= d$EndDate))
  AND d2.person_2 = d.person
  And d2.visit_2 = d.visit
  ')
#> Error: tinyformat: Too many conversion specifiers in format string

#Attempt 3 - Error: tinyformat: Too many conversion specifiers in format string
sqldf('SELECT *
      FROM d2
      LEFT JOIN d ON (visit %m-% months(1)) BETWEEN start and end')
#> Error: tinyformat: Too many conversion specifiers in format string

#Attempt 4 - Error: tinyformat: Too many conversion specifiers in format string
sqldf('SELECT *
  FROM d2 
  LEFT JOIN
  d on 
      d2.person_2 = d.person and
      d2.visit_2 = d.visit and 
      (d$visit %m-% months(1) >= d$start) and 
      (d$visit %m-% months(1) <= d$end)')
#> Error: tinyformat: Too many conversion specifiers in format string
      
d
#>   person row      start        end      visit
#> 1      1   1 2016-06-17 2016-09-30       <NA>
#> 2      1   2 2016-10-01 2016-12-31       <NA>
#> 3      1   3 2017-01-01 2017-01-14       <NA>
#> 4      1   4 2017-01-15 2017-06-04 2017-01-15
#> 5      1   5 2017-06-05 2017-09-03 2017-08-01
#> 6      2   1 2014-12-14 2014-12-31       <NA>
#> 7      2   2 2015-01-01 2015-01-18       <NA>
#> 8      2   3 2015-01-19 2015-07-03 2015-02-22

Created on 2022-05-13 by the reprex package (v2.0.1)

#Create data frame 2
person_2 <- c(1, 1, 2)
visit_2 <- c('2017-01-15', '2017-08-01', '2015-02-22')

#Populate data frame 2
d2 <- cbind(person_2, visit_2)
d2 <- as.data.frame(d2)

#Format dates and add to data frame 2
d2$visit_2 <- as.Date(visit_2, format = '%Y-%m-%d')
d2
#>   person_2    visit_2
#> 1        1 2017-01-15
#> 2        1 2017-08-01
#> 3        2 2015-02-22

Created on 2022-05-13 by the reprex package (v2.0.1)

CodePudding user response:

d2 %>%
  mutate(visit_2_m1 = visit_2 %m-% months(1)) %>%
  fuzzyjoin::fuzzy_left_join(
    d, ., by = c("start" = "visit_2_m1", "end" = "visit_2_m1"),
    match_fun = list(`<=`, `>=`))
#   person row      start        end      visit person_2    visit_2 visit_2_m1
# 1      1   1 2016-06-17 2016-09-30       <NA>       NA       <NA>       <NA>
# 2      1   2 2016-10-01 2016-12-31       <NA>        1 2017-01-15 2016-12-15
# 3      1   3 2017-01-01 2017-01-14       <NA>       NA       <NA>       <NA>
# 4      1   4 2017-01-15 2017-06-04 2017-01-15       NA       <NA>       <NA>
# 5      1   5 2017-06-05 2017-09-03 2017-08-01        1 2017-08-01 2017-07-01
# 6      2   1 2014-12-14 2014-12-31       <NA>       NA       <NA>       <NA>
# 7      2   2 2015-01-01 2015-01-18       <NA>       NA       <NA>       <NA>
# 8      2   3 2015-01-19 2015-07-03 2015-02-22        2 2015-02-22 2015-01-22

CodePudding user response:

The description of your problem is unclear and I'm not sure I've understood exactly what you're trying to achieve. How many rows should be returned in the example above? Do you just want to join a subset of the rows in d to d2?

Here is a solution using dplyr only which may be what you are looking for:

  d %>% 
  mutate(
    date_interval = start %--% end,
    visit_within = visit %m-% months(1) %within% date_interval
  ) %>% 
  filter(visit_within == TRUE) %>% 
  left_join(d2, ., by = c("person_2" = "person"))
  • Related