I have been working on this and looking for a solution but can't find anything that gets me exactly what I want. I have two data frames. The first data frame is structured like this
Name <- c("Doe, John", "Doe, John", "Smith, John")
ID <- c("123456", "123456", "345678")
Collection <- c("2021-01-03", "2022-05-01", "2022-06-14")
df1<-data.frame(Name, ID, Collection)
My second dataframe is structed like this
Number<- c("M123", "M456", "M367")
ID <- c("123456", "123456", "345678")
Complete_Date <- c("2021-01-05", "2022-06-01", "2022-06-12")
I would like to remove observations in df1 that based on "ID" and "Collection" do not occur within 7 days ( /-) from any observation matching the same "ID" in df2
So ideally my output from the two examples would be in a new dataframe (df3) and look like this since my second observation (ID: 123456) in df1 is not within 7 days of of either Complete_Date with that ID in df2
Name ID Collection
Doe, John 123456 2021-01-03
Smith, John 345678 2022-06-14
CodePudding user response:
The typical way to do this would be using a "non-equi join," which is not currently supported in dplyr but is in the dev version. We can get around that (with some loss of efficiency) using a cartesian join that first connects each ID with all its matches in the 2nd table, then filters to just the ones within 7 days.
Prepare fake data
Name <- c("Doe, John", "Doe, John", "Smith, John")
ID <- c("123456", "123456", "345678")
Collection <- c("2021-01-03", "2022-05-01", "2022-06-14")
df1<-data.frame(Name, ID, Collection)
Number<- c("M123", "M456", "M367")
ID <- c("123456", "123456", "345678")
Complete_Date <- c("2021-01-05", "2022-06-01", "2022-06-12")
df2<-data.frame(Number, ID, Complete_Date)
df1$Collection = as.Date(df1$Collection)
df2$Complete_Date = as.Date(df2$Complete_Date)
Cartesian join and filter using current CRAN dplyr 1.0.10:
df1 %>%
left_join(df2, by = "ID") %>%
group_by(Name, ID) %>%
filter(abs(Collection - Complete_Date) <= 7) %>%
ungroup() %>%
distinct(Name, ID, Collection)
In the current development version of dplyr (I'm using 1.0.99.9000), there's support for non-equi joins like this, which I expect will be more efficient if each ID has many Complete_Date matches.
# devtools::install_github("tidyverse/dplyr")
df1 %>%
mutate(Collection_early = Collection - 7,
Collection_late = Collection 7) %>%
left_join(df2, join_by(ID,
Collection_early >= Complete_Date,
Collection_late <= Complete_Date)) %>%
distinct(Name, ID, Collection)
CodePudding user response:
So, in base R you can use by
:
You should be more specific in wether the - 7 days is inclusive or exclusive.
#DF1
Name <- c("Doe, John", "Doe, John", "Smith, John")
ID <- c("123456", "123456", "345678")
Collection <- c("2021-01-03", "2022-05-01", "2022-06-14")
df1<-data.frame(Name, ID, Collection)
#DF2
Number<- c("M123", "M456", "M367")
ID <- c("123456", "123456", "345678")
Complete_Date <- c("2021-01-05", "2022-06-01", "2022-06-12")
df2 <- data.frame(Number, ID, Complete_Date)
df3 <- df1[
by(df1, 1:nrow(df1) , function(d)
any(d$ID == df2$ID
& (as.Date(d$Collection) < as.Date(df2$Complete_Date) 7
& as.Date(d$Collection) > as.Date(df2$Complete_Date) - 7 ))),]
df3
#> Name ID Collection
#> 1 Doe, John 123456 2021-01-03
#> 3 Smith, John 345678 2022-06-14