Home > front end >  How to join two dataframes containing time varying variables in R
How to join two dataframes containing time varying variables in R

Time:02-27

This seems like a simple enough thing but I can't figure it out nor find an answer online - apologies if it something obvious. I have two seperate dataframes containing the same patients with the same unique identifier. Both datasets have time varying variables - one continuous and one categorical and the time to each reading is different in the sets but have a common start point at time 1. I have tried to modify the tmerge function from survival package but without luck as I don't have a dichotomous outcome variable nor a single data set with one row per patient.

Reprex for creating the datasets below (df1 and df2) and an example of my desired combined output table for a single patient (ID 3), output gets very long if done for all 4 patients

Thanks for any possible help

df1 <- structure(list(tstart = c(1, 1, 1, 1426, 1, 560, 567), tstop = c(2049, 
3426, 1426, 1707, 560, 567, 4207), category = structure(c(1L, 
1L, 1L, 2L, 1L, 4L, 2L), .Label = c("none", "high", "low", "moderate"
), class = "factor"), id = c(1L, 2L, 3L, 3L, 4L, 4L, 4L)), row.names = c(NA, 
-7L), class = c("tbl_df", "tbl", "data.frame"))


df2 <- structure(list(tstart = c(1, 365, 730, 1, 365, 730, 1096, 2557, 
1, 365, 730, 1096, 1826, 2557, 3652, 1), tstop = c(365, 730, 
1096, 365, 730, 1096, 2557, 2582, 365, 730, 1096, 1826, 2557, 
3652, 4864, 365), egfr = c(66, 62, 58, 54, 50, 43, 49, 51, 106, 
103, 80, 92, 97, 90, 81, 51), id = c(1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 4L)), row.names = c(NA, -16L), class = c("tbl_df", 
"tbl", "data.frame"))

df_example_patient_3 <- structure(list(id = c(3L, 3L, 3L,
3L, 3L, 3L,3L, 3L, 3L), tstart = c(1, 365, 730, 1096, 1426, 1707, 1826, 2557, 3652), tstop = c(365, 730, 
1096, 1426, 1707, 1826, 2557, 3652, 4864), egfr = c(106, 103, 80, 92, 92, 92, 97, 90, 81), category = c("none", "none", "none", "none", "high", "high", "high", "high", "high")), row.names = c(NA, -9L), class = c("tbl_df", 
"tbl", "data.frame"))
# DF1
  tstart tstop category    id
   <dbl> <dbl> <fct>    <int>
1      1  2049 none         1
2      1  3426 none         2
3      1  1426 none         3
4   1426  1707 high         3
5      1   560 none         4
6    560   567 moderate     4
7    567  4207 high         4

# DF2
   tstart tstop  egfr    id
    <dbl> <dbl> <dbl> <int>
 1      1   365    66     1
 2    365   730    62     1
 3    730  1096    58     1
 4      1   365    54     2
 5    365   730    50     2
 6    730  1096    43     2
 7   1096  2557    49     2
 8   2557  2582    51     2
 9      1   365   106     3
10    365   730   103     3
11    730  1096    80     3
12   1096  1826    92     3
13   1826  2557    97     3
14   2557  3652    90     3
15   3652  4864    81     3
16      1   365    51     4
# Combined set
     id tstart tstop  egfr category
  <int>  <dbl> <dbl> <dbl> <chr>   
1     3      1   365   106 none    
2     3    365   730   103 none    
3     3    730  1096    80 none    
4     3   1096  1426    92 none    
5     3   1426  1707    92 high    
6     3   1707  1826    92 high    
7     3   1826  2557    97 high    
8     3   2557  3652    90 high    
9     3   3652  4864    81 high 


CodePudding user response:

I had to do it this way to really work out the details. First, i construct a full df1 with all the timestamps, including those of df2.

then i proceed with multiple merges. This is not elegant, but it works:

library(data.table)
library(zoo)

# Proper data.tables
setDT(df1, key = c("id", "tstart"))
setDT(df2, key = c("id", "tstart"))

timestamps_by_id <- unique(rbind(
  df1[, .(id, tstart)],
  df1[, .(id, tstop)],
  df2[, .(id, tstart)],
  df2[, .(id, tstop)],
  use.names = F
))
setorder(timestamps_by_id, id, tstart)

# Merge to construct full df1
df1_full <- df1[timestamps_by_id]
df1_full[, category := na.locf(category), by = id]
df1_full[, tstop := shift(tstart, -1), by = id]
setkey(df1_full, id, tstart)

# Merge with df2
result <- na.omit(df2[df1_full, roll = T])
result[, tstop := i.tstop]
print(result[id == 3, .(id, tstart, tstop, egfr, category)])

Or a more data.tabley solution using the more arcane foverlaps:

library(data.table)

# Proper data.tables
setDT(df1, key = c("id", "tstart", "tstop"))
setDT(df2, key = c("id", "tstart", "tstop"))

# We add an infinite upper range
proper_df1 <- rbind(
  df1,
  df1[, .SD[which.max(tstop)], by = .(id)][, .(id, tstart = tstop, tstop = Inf, category), ]
)

setkey(proper_df1, id, tstart, tstop)

overlaps <- foverlaps(df2, proper_df1, type = "any") # Overlap join

overlaps[
tstart            
  • Related