Extracting list of IDs matching multiple conditions from seperate data frame in R-CodePudding

I have a large dataframes with patient data (14000 cases; df.A). We have a separate data file (df.B) including long term outcome and want to match the cases based on admission date, as a first step.

The first dataframe (df.A) includes estimates of admission date for which we have included a margin of error (2 days before and 7 days after). The second data frame is the large dataframe of hospital data (400k entries; df.B). The data frames could look like this.

IDA <- c(1:40)
AdmissionDtA <- seq(as.Date("2014-01-01"), as.Date("2014-02-09"), by="days")
FirstDtA <- AdmissionDt -2
LastDtA <- AdmissionDt  7
df.A <- data.frame(IDA, AdmissionDtA, FirstDtA, LastDtA)

IDB <- c(41:80)
AdmissionDtB <- seq(as.Date("2014-01-16"), as.Date("2014-02-24"), by="days")
df.B <- data.frame(IDB, AdmissionDtB)

We would like to extract potential matches when the AdmissionDtB lies between FirstDtA and LastDtA. The goal is to add a column to df.A, including a list of potential matching IDB's per case (IDA).

Result would look something like this:

IDA	AdmissionDtA	FirstDtA	LastDtA	MatchingIDB
8	2014-01-08	2014-01-06	2014-01-15	no match
9	2014-01-09	2014-01-07	2014-01-16	41
10	2014-01-10	2014-01-08	2014-01-17	41;42
11	2014-01-11	2014-01-09	2014-01-18	41;42;43

We have tried this with excel with INDEX and AGGREGATE, which worked but the data files were to big to handle for excel. There are around 2k matching IDBs per case (IDA).

We are relatively new to R and we could not make it work with "merge" and "join" etc.

Any help would be much appreciated. Thanks in advance!

CodePudding user response：

You can use between inside a map call:

library(purrr)
library(dplyr)
df.A %>% 
  mutate(MatchingIDB = map2(df.A$FirstDtA, df.A$LastDtA, ~ c(df.B$IDB[between(df.B$AdmissionDtB, .x, .y)])))

output

> df.A[8:11, ]
   IDA AdmissionDtA   FirstDtA    LastDtA MatchingIDB
8    8   2014-01-08 2014-01-06 2014-01-15            
9    9   2014-01-09 2014-01-07 2014-01-16          41
10  10   2014-01-10 2014-01-08 2014-01-17      41, 42
11  11   2014-01-11 2014-01-09 2014-01-18  41, 42, 43

CodePudding user response：

An alternative approach using data.table and non-equi joins. Might improve performance with very large datasets.

library(data.table)
library(magrittr)

setDT(df.B)[setDT(df.A), 
            .(IDA, AdmissionDtA, FirstDtA, LastDtA, IDB), 
            on = c("AdmissionDtB >= FirstDtA", "AdmissionDtB <= LastDtA")] %>%
  .[j = .(matchingIDB = paste0(IDB, collapse = ",")), 
    by = names(df.A)]

#>     IDA AdmissionDtA   FirstDtA    LastDtA                   matchingIDB
#>  1:   1   2014-01-01 2013-12-30 2014-01-08                            NA
#>  2:   2   2014-01-02 2013-12-31 2014-01-09                            NA
#>  3:   3   2014-01-03 2014-01-01 2014-01-10                            NA
#>  4:   4   2014-01-04 2014-01-02 2014-01-11                            NA
#>  5:   5   2014-01-05 2014-01-03 2014-01-12                            NA
#>  6:   6   2014-01-06 2014-01-04 2014-01-13                            NA
#>  7:   7   2014-01-07 2014-01-05 2014-01-14                            NA
#>  8:   8   2014-01-08 2014-01-06 2014-01-15                            NA
#>  9:   9   2014-01-09 2014-01-07 2014-01-16                            41
#> 10:  10   2014-01-10 2014-01-08 2014-01-17                         41,42
#> 11:  11   2014-01-11 2014-01-09 2014-01-18                      41,42,43
#> 12:  12   2014-01-12 2014-01-10 2014-01-19                   41,42,43,44
#> 13:  13   2014-01-13 2014-01-11 2014-01-20                41,42,43,44,45
#> 14:  14   2014-01-14 2014-01-12 2014-01-21             41,42,43,44,45,46
#> 15:  15   2014-01-15 2014-01-13 2014-01-22          41,42,43,44,45,46,47
...