I'm trying to do a join in data.table where within a set of columns that match another table I want all values in a third column from that table that don't exist in the initial table. Here's an example:
library(foreach)
library(data.table)
tmp <- data.table(mgr=c('1','1','1','1',
'2','2',
'3','3'),
year=rep(2022,8),
stocks=c('A','B','C','D',
'A','B',
'C','D'))
The input table looks like this:
mgr year stocks
1: 1 2022 A
2: 1 2022 B
3: 1 2022 C
4: 1 2022 D
5: 2 2022 A
6: 2 2022 B
7: 3 2022 C
8: 3 2022 D
I then get all unique stock-year combinations. Then for each mgr-year I want the stocks that are not in tmp
for that particular mgr-year pair:
allstocks <- unique(tmp[,.(year,stocks)])
myloop <- unique(tmp[,.(mgr,year)])
out <- foreach(myidx=1:nrow(myloop)) %do% {
tmp1 <- tmp[myloop[myidx],on=.NATURAL]
tmp1 <- allstocks[!tmp1,on=.(year,stocks)]
tmp1 <- merge(myloop[myidx],tmp1,by=c('year'))
tmp1
}
finalout <- rbindlist(out)
finalout
This gives:
year mgr stocks
1: 2022 2 C
2: 2022 2 D
3: 2022 3 A
4: 2022 3 B
Since mgr 1 has all stocks there is no row for that manager. I'm also okay with there being a row with NA in stocks for that particular mgr-year.
Effectively what I want is for each stock-year to obtain a list of the stocks that each mgr does not hold. This example works, but the actual data is ~100mm rows so I'm wondering is there is a way to do this with pure data.table in an efficient manner.
Thanks.
CodePudding user response:
In data.table
there is a !
operator for exclusions.
setkey(tmp, mgr, year, stocks)
all.stocks <- tmp[, .(stocks=unique(tmp$stocks)), by=.(mgr, year)]
setkey(all.stocks, mgr, year, stocks)
missing <- all.stocks[!tmp]
This should be pretty fast.