I have looked at a few solutions but have not been able to find anything that includes observations in which the ID occurs more than once. My data looks like this in data frame 1 (df1)
Name <- c("Doe, John","Doe, John","Doe, John", "Doe, Jane", "Doe, Jane","Doe, Jane","Parker, Peter","Parker, Peter","Parker, Peter", "Stark, Tony","Stark, Tony","Stark, Tony")
Accession <- c(123, 234, 345, 456, 567, 678, 789, 8910, 1023, 1134, 1567, 1769)
MRN <-c(55555, 55555, 55555, 66666, 66666, 66666, 77777, 77777, 77777, 88888, 88888, 88888)
Collected <-c("2022-02-05", "2022-02-06", "2022-01-07", "2022-01-08", "2022-01-09", "2022-01-10", "2022-01-11", "2022-02-12", "2022-01-13", "2022-01-04", "2022-01-15", "2022-01-16")
Result <-c("Detected", "Detected", "Detected", "Detected", "Detected","Detected", "Detected", "Detected", "Detected", "Detected", "Detected", "Detected")
df1 <- data.frame(Name, Accession, MRN, Collected, Result)
The data for dataframe 2 (df2) looks likes this
Name <- c("Doe, John","Doe, John","Doe, John", "Doe, Jane", "Doe, Jane","Doe, Jane","Parker, Peter","Parker, Peter","Parker, Peter", "Stark, Tony","Stark, Tony","Stark, Tony")
Accession <- c(123, 234, 345, 456, 567, 678, 789, 8910, 1023, 1134, 1567, 1769)
MRN <-c(55555, 55555, 55555, 66666, 66666, 66666, 77777, 77777, 77777, 88888, 88888, 88888)
Collected <-c("2022-01-22", "2022-01-20", "2022-01-07", "2022-01-28", "2022-01-12", "2022-01-15", "2022-01-17", "2022-01-31", "2022-01-16", "2022-01-20", "2022-01-25", "2022-01-26")
Result <-c("Presumptive", "Presumptive", "Presumptive", "Presumptive", "Presumptive","Presumptive", "Presumptive", "Presumptive", "Presumptive", "Presumptive", "Presumptive", "Presumptive")
df2 <- data.frame(Name, Accession, MRN, Collected, Result)
I would like to subset df1 by df2 and keep all observations in df1 that have a Collected date /- 7 days from the Collected date in df2. My issue is that I want all observations even if the MRN is duplicated. I would like it to look like this
Name Accession MRN Collected Result
Doe, John 345 55555 2022-01-07 Detected
Doe, Jane 456 66666 2022-01-08 Detected
Doe, Jane 567 66666 2022-01-09 Detected
Doe, Jane 678 66666 2022-01-10 Detected
Parker, Peter 789 77777 2022-01-11 Detected
Parker, Peter 1023 77777 2022-01-13 Detected
Stark, Tony 1567 88888 2022-01-15 Detected
Stark, Tony 1769 88888 2022-01-16 Detected
In this final data based on the /- 7 days for any observation the observations for Doe, John (123 and 234), Parker, Peter (8910) and Stark, Tony (1134) would be excluded because they do not occur in the timeframe.
CodePudding user response:
An approach using left_join
and filter
.
library(dplyr)
left_join(df1, df2 %>% select(-Result), c("Name", "Accession", "MRN")) %>%
group_by(MRN) %>%
filter(sapply(as.Date(Collected.x), function(x)
any(abs(x - as.Date(Collected.y)) <= 7))) %>%
ungroup() %>%
select(-ends_with(".y")) %>%
rename(Collected = Collected.x)
# A tibble: 8 × 5
Name Accession MRN Collected Result
<chr> <dbl> <dbl> <chr> <chr>
1 Doe, John 345 55555 2022-01-07 Detected
2 Doe, Jane 456 66666 2022-01-08 Detected
3 Doe, Jane 567 66666 2022-01-09 Detected
4 Doe, Jane 678 66666 2022-01-10 Detected
5 Parker, Peter 789 77777 2022-01-11 Detected
6 Parker, Peter 1023 77777 2022-01-13 Detected
7 Stark, Tony 1567 88888 2022-01-15 Detected
8 Stark, Tony 1769 88888 2022-01-16 Detected