Home > OS >  How to subset dataframes based on date range in another dataframe that has multiple matching ID'
How to subset dataframes based on date range in another dataframe that has multiple matching ID'

Time:01-28

I have looked at a few solutions but have not been able to find anything that includes observations in which the ID occurs more than once. My data looks like this in data frame 1 (df1)

Name <- c("Doe, John","Doe, John","Doe, John", "Doe, Jane", "Doe, Jane","Doe, Jane","Parker, Peter","Parker, Peter","Parker, Peter", "Stark, Tony","Stark, Tony","Stark, Tony")
Accession <- c(123, 234, 345, 456, 567, 678, 789, 8910, 1023, 1134, 1567, 1769)
MRN <-c(55555, 55555, 55555, 66666, 66666, 66666, 77777, 77777, 77777, 88888, 88888, 88888)
Collected <-c("2022-02-05", "2022-02-06", "2022-01-07", "2022-01-08", "2022-01-09", "2022-01-10", "2022-01-11", "2022-02-12", "2022-01-13", "2022-01-04", "2022-01-15", "2022-01-16")
Result <-c("Detected", "Detected", "Detected", "Detected", "Detected","Detected", "Detected", "Detected", "Detected", "Detected", "Detected", "Detected")


df1 <- data.frame(Name, Accession, MRN, Collected, Result)

The data for dataframe 2 (df2) looks likes this

Name <- c("Doe, John","Doe, John","Doe, John", "Doe, Jane", "Doe, Jane","Doe, Jane","Parker, Peter","Parker, Peter","Parker, Peter", "Stark, Tony","Stark, Tony","Stark, Tony")
Accession <- c(123, 234, 345, 456, 567, 678, 789, 8910, 1023, 1134, 1567, 1769)
MRN <-c(55555, 55555, 55555, 66666, 66666, 66666, 77777, 77777, 77777, 88888, 88888, 88888)
Collected <-c("2022-01-22", "2022-01-20", "2022-01-07", "2022-01-28", "2022-01-12", "2022-01-15", "2022-01-17", "2022-01-31", "2022-01-16", "2022-01-20", "2022-01-25", "2022-01-26")
Result <-c("Presumptive", "Presumptive", "Presumptive", "Presumptive", "Presumptive","Presumptive", "Presumptive", "Presumptive", "Presumptive", "Presumptive", "Presumptive", "Presumptive")


df2 <- data.frame(Name, Accession, MRN, Collected, Result)

I would like to subset df1 by df2 and keep all observations in df1 that have a Collected date /- 7 days from the Collected date in df2. My issue is that I want all observations even if the MRN is duplicated. I would like it to look like this

Name            Accession             MRN               Collected        Result
Doe, John       345                  55555              2022-01-07       Detected
Doe, Jane       456                  66666              2022-01-08       Detected
Doe, Jane       567                  66666              2022-01-09       Detected
Doe, Jane       678                  66666              2022-01-10       Detected
Parker, Peter   789                  77777              2022-01-11       Detected
Parker, Peter   1023                 77777              2022-01-13       Detected
Stark, Tony     1567                 88888              2022-01-15       Detected
Stark, Tony     1769                 88888              2022-01-16       Detected

In this final data based on the /- 7 days for any observation the observations for Doe, John (123 and 234), Parker, Peter (8910) and Stark, Tony (1134) would be excluded because they do not occur in the timeframe.

CodePudding user response:

An approach using left_join and filter.

library(dplyr)

left_join(df1, df2 %>% select(-Result), c("Name", "Accession", "MRN")) %>% 
  group_by(MRN) %>% 
  filter(sapply(as.Date(Collected.x), function(x) 
    any(abs(x - as.Date(Collected.y)) <= 7))) %>% 
  ungroup() %>% 
  select(-ends_with(".y")) %>% 
  rename(Collected = Collected.x)
# A tibble: 8 × 5
  Name          Accession   MRN Collected  Result  
  <chr>             <dbl> <dbl> <chr>      <chr>   
1 Doe, John           345 55555 2022-01-07 Detected
2 Doe, Jane           456 66666 2022-01-08 Detected
3 Doe, Jane           567 66666 2022-01-09 Detected
4 Doe, Jane           678 66666 2022-01-10 Detected
5 Parker, Peter       789 77777 2022-01-11 Detected
6 Parker, Peter      1023 77777 2022-01-13 Detected
7 Stark, Tony        1567 88888 2022-01-15 Detected
8 Stark, Tony        1769 88888 2022-01-16 Detected
  • Related