How to subset one dataframe based on the number of days from a second dataframe in R-CodePudding

I have been working on this and looking for a solution but can't find anything that gets me exactly what I want. I have two data frames. The first data frame is structured like this

Name <- c("Doe, John", "Doe, John", "Smith, John")
ID <- c("123456", "123456", "345678")
Collection <- c("2021-01-03", "2022-05-01", "2022-06-14")
df1<-data.frame(Name, ID, Collection)

My second dataframe is structed like this

Number<- c("M123", "M456", "M367")
ID <- c("123456", "123456", "345678")
Complete_Date <- c("2021-01-05", "2022-06-01", "2022-06-12")

I would like to remove observations in df1 that based on "ID" and "Collection" do not occur within 7 days ( /-) from any observation matching the same "ID" in df2

So ideally my output from the two examples would be in a new dataframe (df3) and look like this since my second observation (ID: 123456) in df1 is not within 7 days of of either Complete_Date with that ID in df2

Name              ID        Collection     
Doe, John         123456    2021-01-03
Smith, John       345678    2022-06-14

CodePudding user response：

The typical way to do this would be using a "non-equi join," which is not currently supported in dplyr but is in the dev version. We can get around that (with some loss of efficiency) using a cartesian join that first connects each ID with all its matches in the 2nd table, then filters to just the ones within 7 days.

Prepare fake data

Name <- c("Doe, John", "Doe, John", "Smith, John")
ID <- c("123456", "123456", "345678")
Collection <- c("2021-01-03", "2022-05-01", "2022-06-14")
df1<-data.frame(Name, ID, Collection)

Number<- c("M123", "M456", "M367")
ID <- c("123456", "123456", "345678")
Complete_Date <- c("2021-01-05", "2022-06-01", "2022-06-12")
df2<-data.frame(Number, ID, Complete_Date)

df1$Collection = as.Date(df1$Collection)
df2$Complete_Date = as.Date(df2$Complete_Date)

Cartesian join and filter using current CRAN dplyr 1.0.10:

df1 %>%
  left_join(df2, by = "ID") %>%
  group_by(Name, ID) %>%
  filter(abs(Collection - Complete_Date) <= 7) %>%
  ungroup() %>%
  distinct(Name, ID, Collection)

In the current development version of dplyr (I'm using 1.0.99.9000), there's support for non-equi joins like this, which I expect will be more efficient if each ID has many Complete_Date matches.

# devtools::install_github("tidyverse/dplyr")
df1 %>%
  mutate(Collection_early = Collection - 7, 
         Collection_late  = Collection   7) %>%
  left_join(df2, join_by(ID,
                         Collection_early >= Complete_Date,
                         Collection_late  <= Complete_Date)) %>%
  distinct(Name, ID, Collection)

CodePudding user response：

So, in base R you can use by:

You should be more specific in wether the - 7 days is inclusive or exclusive.

#DF1
Name <- c("Doe, John", "Doe, John", "Smith, John")
ID <- c("123456", "123456", "345678")
Collection <- c("2021-01-03", "2022-05-01", "2022-06-14")
df1<-data.frame(Name, ID, Collection)

#DF2
Number<- c("M123", "M456", "M367")
ID <- c("123456", "123456", "345678")
Complete_Date <- c("2021-01-05", "2022-06-01", "2022-06-12")
df2 <- data.frame(Number, ID, Complete_Date)

df3 <- df1[
  by(df1, 1:nrow(df1) , function(d) 
    any(d$ID == df2$ID  
        & (as.Date(d$Collection) < as.Date(df2$Complete_Date)   7  
           & as.Date(d$Collection) > as.Date(df2$Complete_Date) - 7 ))),]

df3
#>          Name     ID Collection
#> 1   Doe, John 123456 2021-01-03
#> 3 Smith, John 345678 2022-06-14