Simplify time-dependent data created with tmerge

Time:12-13

I have a large data.table containing many time-dependent variables (50) for use in coxph models. The dataset was generated using tmerge. Patients are identified by the patid variable, and time intervals are defined by tstart and tstop.

Most of the models I want to fit use only a selection of these time-dependent variables. Unfortunately, the speed of Cox proportional hazards models depends on the number of rows and the number of time points in my data.table, even if all the data in these rows is identical. Is there a good, fast way of combining rows that are identical apart from the time interval, in order to speed up my models? In many cases, once some columns are removed, tstop for one row is equal to tstart for the next and everything else is identical.

For example, I would want to convert the data.table example below into results.

library(data.table)
example=data.table(patid = c(1,1,1,2,2,2), tstart=c(0,1,2,0,1,2), tstop=c(1,2,3,1,2,3), x=c(0,0,1,1,2,2), y=c(0,0,1,2,3,3))
results=data.table(patid = c(1,1,2,2), tstart=c(0,2,0,1), tstop=c(2,3,1,3), x=c(0,1,1,2), y=c(0,1,2,3))

This example is extremely simplified. My current dataset has ~600k patients, >20M rows and 3.65k time points. Removing variables should substantially reduce the number of rows needed, which in turn should speed up models fitted on a subset of the variables.

The best I can come up with is:

example=data.table(patid = c(1,1,1,2,2,2), tstart=c(0,1,2,0,1,2), tstop=c(1,2,3,1,2,3), x=c(0,0,1,1,2,2), y=c(0,0,1,2,3,3))
example = example[order(patid,tstart),]
example[,matched:=x==shift(x,-1)&y==shift(y,-1),by="patid"]
example[is.na(matched),matched:=FALSE,by="patid"]
example[,tstop:=ifelse(matched,shift(tstop,-1),tstop)]
example[,remove:=tstop==shift(tstop),by="patid"]
example = example[is.na(remove) | remove==FALSE,]
example$matched=NULL
example$remove=NULL

This solves the example, but the code is complex for what it does, and once the dataset has many columns, having to edit x==shift(x,-1) for each variable invites errors. Is there a sane way of doing this? The list of columns will change many times inside loops, so accepting a vector of column names to compare would be ideal. This solution also doesn't cope with more than two consecutive time periods that share the same covariate values (e.g. time periods of (0,1), (1,3), (3,4) with the same covariate values).

CodePudding user response:

This solution creates a temporary group id based on the rleid() of the combination of x and y. The temporary value is used for grouping and then dropped (temp := NULL).

# name x and y explicitly so the result columns match the output below
example[, .(tstart = min(tstart), tstop = max(tstop), x = x[1], y = y[1]), 
        by = .(patid, temp = rleid(paste(x,y, sep = "_")))][, temp := NULL][]
#    patid tstart tstop x y
# 1:     1      0     2 0 0
# 2:     1      2     3 1 1
# 3:     2      0     1 1 2
# 4:     2      1     3 2 3
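
As a rough check of the multi-period case raised in the question, here is a small sketch using the intervals (0,1), (1,3), (3,4) with identical covariates. The long_run table and the run column name are made up for illustration. It passes x and y to rleid() directly instead of pasting them, which rleid() supports because it accepts several vectors.

```r
library(data.table)

# Hypothetical patient: three consecutive intervals share x and y, then both change
long_run <- data.table(patid  = c(1, 1, 1, 1),
                       tstart = c(0, 1, 3, 4),
                       tstop  = c(1, 3, 4, 5),
                       x      = c(0, 0, 0, 1),
                       y      = c(0, 0, 0, 1))

# rleid() starts a new id whenever any of its input vectors changes,
# so a run of any length collapses into a single row per patient
collapsed <- long_run[, .(tstart = min(tstart), tstop = max(tstop),
                          x = x[1], y = y[1]),
                      by = .(patid, run = rleid(x, y))][, run := NULL][]
collapsed
```

The first three rows should end up as one interval from 0 to 4, which is the case the shift()-based code in the question does not cover.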

CodePudding user response:

Here is an option that builds on our conversation/comments above, but allows the flexibility of passing a vector of column names:

cols=c("x","y")

cbind(
  example[, id:=rleidv(.SD), .SDcols  = cols][, .(tstart=min(tstart), tstop=max(tstop)), .(patid,id)],
  example[,.SD[1],.(patid,id),.SDcols =cols][,..cols]
)[,id:=NULL][]

Output:

   patid tstart tstop x y
1:     1      0     2 0 0
2:     1      2     3 1 1
3:     2      0     1 1 2
4:     2      1     3 2 3

CodePudding user response:

Based on Wimpel's answer I have created the following solution which also allows using a vector of column names for input.

example=data.table(patid = c(1,1,1,2,2,2), tstart=c(0,1,2,0,1,2), tstop=c(1,2,3,1,2,3), x=c(0,0,1,1,2,2), y=c(0,0,1,2,3,3))
variables = c("x","y")
example[,key_ := do.call(paste, c(.SD,sep = "_")),.SDcols = variables]
example[, c("tstart", "tstop") := .(min(tstart),max(tstop)), 
        by = .(patid, temp = rleid(key_))][,key_:=NULL]
example = unique(example)

I would imagine this could be simplified, but I think it does what is needed for more complex examples.
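
One possible simplification is to wrap the idea in a function. This is only a sketch: collapse_intervals, gap_id and run are names invented here, and it assumes the columns are called patid, tstart and tstop as in the question. Unlike the code above, it also starts a new run when tstart does not equal the previous tstop, so intervals separated by a gap are not merged.

```r
library(data.table)

# Hypothetical helper: merge adjacent rows per patient that agree on `cols`
collapse_intervals <- function(dt, cols) {
  dt <- copy(dt)                 # work on a copy so the caller's table is untouched
  setorder(dt, patid, tstart)
  # gap_id increases whenever an interval does not start where the previous one stopped
  dt[, gap_id := cumsum(tstart != shift(tstop, fill = tstart[1L])), by = patid]
  # run changes whenever the patient, the gap counter or any compared column changes
  dt[, run := rleidv(.SD), .SDcols = c("patid", "gap_id", cols)]
  # keep the first tstart, the last tstop and the first value of each compared column
  out <- dt[, c(.(tstart = tstart[1L], tstop = tstop[.N]), lapply(.SD, first)),
            by = .(patid, run), .SDcols = cols]
  out[, run := NULL]
  out[]
}

example <- data.table(patid = c(1,1,1,2,2,2), tstart = c(0,1,2,0,1,2),
                      tstop = c(1,2,3,1,2,3), x = c(0,0,1,1,2,2), y = c(0,0,1,2,3,3))
collapsed_xy <- collapse_intervals(example, c("x", "y"))
collapsed_x  <- collapse_intervals(example, "x")
```

Only the columns named in cols are returned alongside patid, tstart and tstop. An event indicator for coxph is not handled in this sketch; it would presumably need to be taken from the last row of each run.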
