I'm fairly new to R, and have been sifting through other questions all morning trying to figure this out, but can't find anything related enough or my knowledge of R is not good enough to understand some of the suggested solutions to my problem.
I have two data frames, table A with a list of non-unique identifiers and a date, and table B with the same identifier field, and a start and end date outlining a 3 month date range. In my real data, I have 1.7m records in table A, and 1.6k records in table A (as well many other fields that i'll use for the final analysis). I am expecting the vast majority of records in table A to be unnecessary.
What I want to achieve is to join the two tables together, joining on the identifier, and then only joining if the date in table A falls inside the date range in table B. I want this as an inner join so I lose the unnecessary data.
Example tables:
a <- data.frame(numbera=c('1','2','3','1'),date1=as.Date(c('10/04/2021','21/06/2021','02/10/2021','17/02/2022'), format="%d/%m/%Y"),date2=as.Date(c('10/07/2021','21/09/2021','02/01/2022','17/05/2022'), format="%d/%m/%Y"))
b <- data.frame(numberb=c('1','2','2','3','1'),datex=as.Date(c('16/05/2021','01/08/2021','03/08/2021','02/09/2021','17/03/2022'), format="%d/%m/%Y"))
Expected result (we lose person 3 because the date falls outside the range, person 2 has two records because they had two entries in table b with corresponding dates):
c <- data.frame(numberb=c('1','2','2','1'),datex=as.Date(c('16/05/2021','01/08/2021','03/08/2021','17/03/2022'), format="%d/%m/%Y"),numbera=c('1','2','2','1'),date1=as.Date(c('10/04/2021','21/06/2021','21/06/2021','17/02/2022'), format="%d/%m/%Y"),date2=as.Date(c('10/07/2021','21/09/2021','21/09/2021','17/05/2022'), format="%d/%m/%Y"))
I've been basing my solution on this web page https://exploratory.io/note/exploratory/How-to-join-two-data-frames-with-date-ranges-moq8hEQ6, however my example adds in the requirement to also join the identifier.
My solution that logically makes sense to me based on my (limited) ability in R:
Joined <- fuzzy_inner_join(b, a, by = c("numberb"="numbera", "datex"="date1", "datex"="date2"),
match_fun = list("=", ">=", "<="))
However I get this error message:
Error in which(m) : argument to 'which' is not logical
Thank you in advance for any help here :)
CodePudding user response:
With backticks for the match_fun
and ==
for the number:
library(tidyverse)
library(fuzzyjoin)
a <- data.frame(
numbera = c("1", "2", "3", "1"),
date1 = as.Date(c("10/04/2021", "21/06/2021", "02/10/2021", "17/02/2022"),
format = "%d/%m/%Y"
),
date2 = as.Date(c("10/07/2021", "21/09/2021", "02/01/2022", "17/05/2022"),
format = "%d/%m/%Y"
)
)
b <- data.frame(
numberb = c("1", "2", "2", "3", "1"),
datex = as.Date(c("16/05/2021", "01/08/2021", "03/08/2021", "02/09/2021", "17/03/2022"),
format = "%d/%m/%Y"
)
)
fuzzy_inner_join(b, a,
by = c("numberb" = "numbera", "datex" = "date1", "datex" = "date2"),
match_fun = list(`==`, `>=`, `<=`)
)
#> numberb datex numbera date1 date2
#> 1 1 2021-05-16 1 2021-04-10 2021-07-10
#> 2 2 2021-08-01 2 2021-06-21 2021-09-21
#> 3 2 2021-08-03 2 2021-06-21 2021-09-21
#> 4 1 2022-03-17 1 2022-02-17 2022-05-17
Created on 2022-07-08 by the reprex package (v2.0.1)
CodePudding user response:
Alternate solution using data.table
:
library(data.table)
dt <- merge(a,b, by.x = "numbera", by.y = "numberb")
setDT(dt)
dt[date1 <= datex & date2 >= datex]
numbera date1 date2 datex
1: 1 2021-04-10 2021-07-10 2021-05-16
2: 1 2022-02-17 2022-05-17 2022-03-17
3: 2 2021-06-21 2021-09-21 2021-08-01
4: 2 2021-06-21 2021-09-21 2021-08-03
CodePudding user response:
data.table supports non-equi joins that are the fastest I've found. The syntax is a bit different from the tidyverse but I think it's worth it for these more complicated joins.
a <- data.frame(numbera=c('1','2','3','1'),date1=as.Date(c('10/04/2021','21/06/2021','02/10/2021','17/02/2022'), format="%d/%m/%Y"),date2=as.Date(c('10/07/2021','21/09/2021','02/01/2022','17/05/2022'), format="%d/%m/%Y"))
b <- data.frame(numberb=c('1','2','2','3','1'),datex=as.Date(c('16/05/2021','01/08/2021','03/08/2021','02/09/2021','17/03/2022'), format="%d/%m/%Y"))
c <- data.frame(numberb=c('1','2','2','1'),datex=as.Date(c('16/05/2021','01/08/2021','03/08/2021','17/03/2022'), format="%d/%m/%Y"),numbera=c('1','2','2','1'),date1=as.Date(c('10/04/2021','21/06/2021','21/06/2021','17/02/2022'), format="%d/%m/%Y"),date2=as.Date(c('10/07/2021','21/09/2021','21/09/2021','17/05/2022'), format="%d/%m/%Y"))
library(data.table)
dt_a <- setDT(a) # change the data frame to a data table
dt_b <- setDT(b) # change the data frame to a data table
Joined <- dt_b[ # take table b
dt_a, # join table a
.(numberb, datex = x.datex, numbera, date1, date2), # selecting these columns
on = .(numberb == numbera, datex>=date1, datex<=date2), # joining on these columns
nomatch = NULL # remove non-matches for an inner join
] |>
setDF() # change it back to a data frame for comparison
identical(c, Joined) # TRUE