I have a large dataframes with patient data (14000 cases; df.A). We have a separate data file (df.B) including long term outcome and want to match the cases based on admission date, as a first step.
The first dataframe (df.A) includes estimates of admission date for which we have included a margin of error (2 days before and 7 days after). The second data frame is the large dataframe of hospital data (400k entries; df.B). The data frames could look like this.
IDA <- c(1:40)
AdmissionDtA <- seq(as.Date("2014-01-01"), as.Date("2014-02-09"), by="days")
FirstDtA <- AdmissionDt -2
LastDtA <- AdmissionDt 7
df.A <- data.frame(IDA, AdmissionDtA, FirstDtA, LastDtA)
IDB <- c(41:80)
AdmissionDtB <- seq(as.Date("2014-01-16"), as.Date("2014-02-24"), by="days")
df.B <- data.frame(IDB, AdmissionDtB)
We would like to extract potential matches when the AdmissionDtB lies between FirstDtA and LastDtA. The goal is to add a column to df.A, including a list of potential matching IDB's per case (IDA).
Result would look something like this:
IDA | AdmissionDtA | FirstDtA | LastDtA | MatchingIDB |
---|---|---|---|---|
8 | 2014-01-08 | 2014-01-06 | 2014-01-15 | no match |
9 | 2014-01-09 | 2014-01-07 | 2014-01-16 | 41 |
10 | 2014-01-10 | 2014-01-08 | 2014-01-17 | 41;42 |
11 | 2014-01-11 | 2014-01-09 | 2014-01-18 | 41;42;43 |
We have tried this with excel with INDEX and AGGREGATE, which worked but the data files were to big to handle for excel. There are around 2k matching IDBs per case (IDA).
We are relatively new to R and we could not make it work with "merge" and "join" etc.
Any help would be much appreciated. Thanks in advance!
CodePudding user response:
You can use between
inside a map
call:
library(purrr)
library(dplyr)
df.A %>%
mutate(MatchingIDB = map2(df.A$FirstDtA, df.A$LastDtA, ~ c(df.B$IDB[between(df.B$AdmissionDtB, .x, .y)])))
output
> df.A[8:11, ]
IDA AdmissionDtA FirstDtA LastDtA MatchingIDB
8 8 2014-01-08 2014-01-06 2014-01-15
9 9 2014-01-09 2014-01-07 2014-01-16 41
10 10 2014-01-10 2014-01-08 2014-01-17 41, 42
11 11 2014-01-11 2014-01-09 2014-01-18 41, 42, 43
CodePudding user response:
An alternative approach using data.table
and non-equi
joins. Might improve performance with very large datasets.
library(data.table)
library(magrittr)
setDT(df.B)[setDT(df.A),
.(IDA, AdmissionDtA, FirstDtA, LastDtA, IDB),
on = c("AdmissionDtB >= FirstDtA", "AdmissionDtB <= LastDtA")] %>%
.[j = .(matchingIDB = paste0(IDB, collapse = ",")),
by = names(df.A)]
#> IDA AdmissionDtA FirstDtA LastDtA matchingIDB
#> 1: 1 2014-01-01 2013-12-30 2014-01-08 NA
#> 2: 2 2014-01-02 2013-12-31 2014-01-09 NA
#> 3: 3 2014-01-03 2014-01-01 2014-01-10 NA
#> 4: 4 2014-01-04 2014-01-02 2014-01-11 NA
#> 5: 5 2014-01-05 2014-01-03 2014-01-12 NA
#> 6: 6 2014-01-06 2014-01-04 2014-01-13 NA
#> 7: 7 2014-01-07 2014-01-05 2014-01-14 NA
#> 8: 8 2014-01-08 2014-01-06 2014-01-15 NA
#> 9: 9 2014-01-09 2014-01-07 2014-01-16 41
#> 10: 10 2014-01-10 2014-01-08 2014-01-17 41,42
#> 11: 11 2014-01-11 2014-01-09 2014-01-18 41,42,43
#> 12: 12 2014-01-12 2014-01-10 2014-01-19 41,42,43,44
#> 13: 13 2014-01-13 2014-01-11 2014-01-20 41,42,43,44,45
#> 14: 14 2014-01-14 2014-01-12 2014-01-21 41,42,43,44,45,46
#> 15: 15 2014-01-15 2014-01-13 2014-01-22 41,42,43,44,45,46,47
...