data.table non-join on one column-CodePudding

I'm trying to do a join in data.table where within a set of columns that match another table I want all values in a third column from that table that don't exist in the initial table. Here's an example:

library(foreach)
library(data.table)

tmp <- data.table(mgr=c('1','1','1','1',
                        '2','2',
                        '3','3'),
                  year=rep(2022,8),
                  stocks=c('A','B','C','D',
                           'A','B',
                           'C','D'))

The input table looks like this:

   mgr year stocks
1:   1 2022      A
2:   1 2022      B
3:   1 2022      C
4:   1 2022      D
5:   2 2022      A
6:   2 2022      B
7:   3 2022      C
8:   3 2022      D

I then get all unique stock-year combinations. Then for each mgr-year I want the stocks that are not in tmp for that particular mgr-year pair:

allstocks <- unique(tmp[,.(year,stocks)])

myloop <- unique(tmp[,.(mgr,year)])

out <- foreach(myidx=1:nrow(myloop)) %do% {
    tmp1 <- tmp[myloop[myidx],on=.NATURAL]
    tmp1 <- allstocks[!tmp1,on=.(year,stocks)]
    tmp1 <- merge(myloop[myidx],tmp1,by=c('year'))
    tmp1
}

finalout <- rbindlist(out)

finalout

This gives:

   year mgr stocks
1: 2022   2      C
2: 2022   2      D
3: 2022   3      A
4: 2022   3      B

Since mgr 1 has all stocks there is no row for that manager. I'm also okay with there being a row with NA in stocks for that particular mgr-year.

Effectively what I want is for each stock-year to obtain a list of the stocks that each mgr does not hold. This example works, but the actual data is ~100mm rows so I'm wondering is there is a way to do this with pure data.table in an efficient manner.

Thanks.

CodePudding user response：

In data.table there is a ! operator for exclusions.

setkey(tmp, mgr, year, stocks)
all.stocks <- tmp[, .(stocks=unique(tmp$stocks)), by=.(mgr, year)]
setkey(all.stocks, mgr, year, stocks)
missing <- all.stocks[!tmp]

This should be pretty fast.