Home > Back-end >  Adding an aggregated field to a data.table from another table
Adding an aggregated field to a data.table from another table

Time:12-24

I have a small table with subject IDs and a big table that has "events" for those subjects (i.e. each record contains an event date, a subject ID and many other details). Something like this:

# Reproducible example: a small lookup table of subject IDs and a large
# table of events (one row per event, keyed by subject ID).
set.seed(8)

n <- 20L
dt.subjects <- data.table(sbjID = c(sample(LETTERS, n / 2), sample(letters, n / 2)))

N <- 1e7L
dt.events <- data.table(
  sbjID   = sample(LETTERS, N, replace = TRUE),
  relDate = sample(seq_len(1000), N, replace = TRUE)
)
# Fix: the '+' was lost in transcription — each event date is the base
# date plus the random day offset.
dt.events[, evDate := as.Date('1990-01-01') + relDate]

#    sbjID relDate     evDate
# 1:     L     975 1992-09-02
# 2:     G     231 1990-08-20
# 3:     H     379 1991-01-15
# 4:     S     916 1992-07-05
# 5:     F     619 1991-09-12
# 6:     G     200 1990-07-20

Now I want to add some calculated field (say, last event for each subject) to the first table. In the early days when I was not very familiar with data.table syntax, I used merge():

# Aggregate the latest event date per subject, then left-join the result
# onto the subject table. Works, but replaces dt.subjects wholesale and
# errors with a duplicate column if 'lastDate' already exists.
dt.subjects <- merge(
  dt.subjects,
  dt.events[, .(lastDate = max(evDate)), by = sbjID],
  by = 'sbjID',
  all.x = TRUE  # keep subjects with no events (their lastDate becomes NA)
)

It does the job but looks quite ugly, and it also causes trouble if the lastDate column is already there. Nowadays I use the following:

# For each subject (by = sbjID), join that single ID into dt.events and
# take the max event date. NOTE(review): this performs one join into the
# 1e7-row table per group, which is why it is much slower than a single
# keyed join — see the .EACHI answer below.
dt.subjects[, 
        lastDate:=(dt.events[sbjID,max(evDate), on='sbjID']), 
        by=sbjID] 

I think this lastDate:= form reads much more intuitively than merge(). But... it is much slower! (5 times slower with the dummy datasets given here, 20 times slower on my actual dataset.) Is there any solution that would be faster than my second one but less ugly than the first one?

CodePudding user response:

If we want a faster option than the second one, compare the following timings:

# Option 1: aggregate once (max evDate per sbjID), then update-join the
# result into dt.subjects2 in place with := — one join, one aggregation.
system.time({
   dt.subjects2[dt.Events[, .(lastDate = max(evDate)), by = sbjID], 
              lastDate := lastDate, on = .(sbjID)]


})
user  system elapsed 
  0.339   0.016   0.355 

# Option 2: join-then-aggregate with by = .EACHI. Much slower here
# because max() is evaluated per matched group inside the big table.
dt.subjects2$lastDate <- NULL
system.time({
dt.subjects2[dt.Events, lastDate := max(evDate), on = .(sbjID), by = .EACHI]
 })
   user  system elapsed 
 17.683   0.078  17.705 

# Option 3: the OP's original merge() — comparable speed to the
# update-join above, but allocates a new table instead of updating
# dt.subjects in place.
system.time({
 out <- merge(
  dt.subjects,
  dt.Events[,.(lastDate=max(evDate)), by=sbjID],
  by = 'sbjID',
  all.x=TRUE
)


})
 user  system elapsed 
  0.335   0.022   0.357 

The efficiency can be improved further by using fmax() from the collapse package:

library(collapse)
 system.time({
   # collapse::fmax computes the grouped max in C, returning a named
   # vector (names = group values). Rebuild a lookup table from it,
   # converting the numeric day count back to Date, then update-join.
   tmp <- fmax(dt.Events$evDate, g = dt.Events$sbjID)
   tmp <- data.table(sbjID = names(tmp), 
           lastDate = as.Date(tmp, origin = '1970-01-01'))
   
   dt.subjects2[tmp, lastDate := lastDate, on = .(sbjID)]
 
 
 })
   user  system elapsed 
  0.143   0.005   0.147 

data

# Data used for the benchmarks above. Mirrors the OP's setup, but with a
# different seed and the capital-E name 'dt.Events'.
n <- 20L
set.seed(24)
dt.subjects <- data.table(sbjID = c(sample(LETTERS, n / 2), sample(letters, n / 2)))

N <- 1e7L
dt.Events <- data.table(
  sbjID   = sample(LETTERS, N, replace = TRUE),
  relDate = sample(seq_len(1000), N, replace = TRUE)
)
# Fix: restore the '+' lost in transcription (base date plus day offset).
dt.Events[, evDate := as.Date('1990-01-01') + relDate]



# Copy so the in-place := benchmarks don't modify the original table.
dt.subjects2 <- copy(dt.subjects)

CodePudding user response:

The OP's code does one join for each group:

# The OP's approach: by = sbjID on the outer table triggers one separate
# join into dt.events for every subject group.
system.time(
    dt.subjects[, 
        lastDate:=(dt.events[sbjID,max(evDate), on='sbjID']), 
    by=sbjID] 
)
#    user  system elapsed 
#    2.82    0.50    1.31 

# With by = .EACHI the aggregation happens inside a single join: join
# .SD (all subjects at once) into dt.events and compute max(evDate) per
# matched group; $V1 extracts the unnamed result column.
system.time(
    dt.subjects[, 
        lastDate := dt.events[.SD, max(evDate), on='sbjID', by=.EACHI]$V1
    ] 
)
#    user  system elapsed 
#    0.18    0.05    0.08 

I wrote more details on this idiom in another answer.

  • Related