I have a small table with subject IDs and a big table that has "events" for those subjects (i.e. each record contains an event date, a subject ID and many other details). Something like this:
library(data.table)
set.seed(8)
n <- 20L
dt.subjects <- data.table(sbjID=c(sample(LETTERS,n/2),sample(letters,n/2)))
N <- 1e7L
dt.events <- data.table(
sbjID =sample(LETTERS,N,T),
relDate=sample(seq_len(1000),N,replace=T)
)
dt.events[, evDate:=as.Date('1990-01-01') + relDate]
# sbjID relDate evDate
# 1: L 975 1992-09-02
# 2: G 231 1990-08-20
# 3: H 379 1991-01-15
# 4: S 916 1992-07-05
# 5: F 619 1991-09-12
# 6: G 200 1990-07-20
Now I want to add some calculated field (say, last event for each subject) to the first table.
In the early days, when I was not very familiar with data.table syntax, I used merge():
dt.subjects <- merge(
dt.subjects,
dt.events[,.(lastDate=max(evDate)), by=sbjID],
by = 'sbjID',
all.x=T
)
It does the job, but it looks quite ugly, and it also causes trouble if a lastDate column is already there.
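For example (a rough sketch of the problem): if dt.subjects already carries a lastDate column from a previous run, a second merge() does not update it but returns both copies with ".x"/".y" suffixes, which then have to be cleaned up by hand:
dt.tmp <- merge(
dt.subjects, # already has a lastDate column
dt.events[,.(lastDate=max(evDate)), by=sbjID],
by = 'sbjID',
all.x=TRUE
)
names(dt.tmp)
# "sbjID" "lastDate.x" "lastDate.y"
Nowadays I use the following: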
dt.subjects[,
lastDate:=(dt.events[sbjID,max(evDate), on='sbjID']),
by=sbjID]
I think this lastDate:= approach reads much more intuitively than merge().
But... it is much slower! (5 times slower with the dummy dataset given here, and 20 times slower on my actual dataset.)
Is there any solution that would be faster than my second one but less ugly than the first one?
CodePudding user response:
If we want a faster option than the second one, aggregate the events first and then update dt.subjects2 by reference in a join:
system.time({
dt.subjects2[dt.Events[, .(lastDate = max(evDate)), by = sbjID],
lastDate := lastDate, on = .(sbjID)]
})
user system elapsed
0.339 0.016 0.355
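The same update can also be written with the i. prefix, which makes it explicit that the value comes from the aggregated i table (an equivalent variant, shown only for readability):
dt.subjects2[dt.Events[, .(lastDate = max(evDate)), by = sbjID],
lastDate := i.lastDate, on = .(sbjID)]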
dt.subjects2$lastDate <- NULL
With dt.Events itself as the i table, by = .EACHI evaluates j once for every event row, which is far slower:
system.time({
dt.subjects2[dt.Events, lastDate := max(evDate), on = .(sbjID), by = .EACHI]
})
user system elapsed
17.683 0.078 17.705
And the OP's original merge() approach, for comparison:
system.time({
out <- merge(
dt.subjects,
dt.Events[,.(lastDate=max(evDate)), by=sbjID],
by = 'sbjID',
all.x=TRUE
)
})
user system elapsed
0.335 0.022 0.357
The efficiency can be improved if we use fmax() from the collapse package:
library(collapse)
system.time({
tmp <- fmax(dt.Events$evDate, g = dt.Events$sbjID)
tmp <- data.table(sbjID = names(tmp),
lastDate = as.Date(tmp, origin = '1970-01-01'))
dt.subjects2[tmp, lastDate := lastDate, on = .(sbjID)]
})
user system elapsed
0.143 0.005 0.147
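As a quick sanity check (a sketch, assuming the objects created above are still in memory), the collapse-based maxima can be compared against the plain data.table aggregation using fsetequal():
chk <- dt.Events[, .(lastDate = max(evDate)), by = sbjID]
fsetequal(chk, tmp) # should return TRUE if the two approaches agree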
data
library(data.table)
n <- 20L
set.seed(24)
dt.subjects <- data.table(sbjID=c(sample(LETTERS,n/2),sample(letters,n/2)))
N <- 1e7L
dt.Events <- data.table(
sbjID =sample(LETTERS,N,T),
relDate=sample(seq_len(1000),N,replace=T)
)
dt.Events[, evDate:=as.Date('1990-01-01') + relDate]
dt.subjects2 <- copy(dt.subjects)
CodePudding user response:
The OP's code does one join for each group:
system.time(
dt.subjects[,
lastDate:=(dt.events[sbjID,max(evDate), on='sbjID']),
by=sbjID]
)
# user system elapsed
# 2.82 0.50 1.31
Using by=.EACHI, we can drop this to a single join:
system.time(
dt.subjects[,
lastDate := dt.events[.SD, max(evDate), on='sbjID', by=.EACHI]$V1
]
)
# user system elapsed
# 0.18 0.05 0.08
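A small cosmetic variant (a sketch): naming the aggregate inside the join avoids the opaque $V1 at the end:
dt.subjects[,
lastDate := dt.events[.SD, .(lastDate = max(evDate)), on='sbjID', by=.EACHI]$lastDate
]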
I wrote more details on this idiom in another answer.