R (data.table) Keep rows in DT1 with values of a given variable not present in DT2 for given matched-CodePudding

I have two data.tables. The first one has information of all colleague ids of a given individual id at a given month. The data is presented in long format (i.e., if an individual has 3 colleagues at a given month, we have 3 rows for that same individual-month).

The second dataset informs the list of all colleagues of a given individual for a given month. There is only a single row for each id-month.

All id-month contained in the first dataset can be found in this second dataset. However, not all colleague ids informed in the first dataset can be found within the list of colleagues for the respective id-month. I would like to keep the rows of the first dataset which inform ids not included in the list of colleagues for a given id-month.

Below is a simple example.

library(data.table)

pairs <- data.table(
  id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
  month = c(1, 1, 1, 1, 1, 2, 2, 1, 1, 2),
  colleague_id = c(2, 3, 4, 5, 10, 2, 4, 1, 11, 12)
)

list_colleagues <- data.table(
  id = c(1, 1, 2, 2),
  month = c(1, 2, 1, 2),
  colleague_ids = list(c(2, 4, 5), c(2, 4), c(1,10), 12)
)

The code below works fine. However, I would like to make it more efficient. Is there a way to code this in a way that I don't have to attach the list of colleagues to the first dataset as a prior step to then check if the colleague ids is in that list.

keep_excluded <- list_colleagues[pairs, on=.(id, month)]

keep_excluded <- keep_excluded[, 
                               keep := !colleague_id %in% colleague_ids[[1]], 1:nrow(keep_excluded)][
  keep==TRUE][, 
              keep:=NULL][, 
                          colleague_ids:=NULL]

Output

   id month colleague_id
1:  1     1            3
2:  1     1           10
3:  2     1           11

CodePudding user response：

We could use .EACHI on the join step itself

library(data.table)
setnames(list_colleagues[pairs, lapply(colleague_ids, \(x) 
   colleague_id[!colleague_id %in% x]), on = .(id, month), 
     by = .EACHI], 'V1', 'colleague_id')[]

-output

     id month colleague_id
   <num> <num>        <num>
1:     1     1            3
2:     1     1           10
3:     2     1           11

Or may also do

list_colleagues[pairs, .(colleague_id = setdiff(colleague_id, 
      unlist(colleague_ids))), on = .(id, month), by = .EACHI]
      id month colleague_id
   <num> <num>        <num>
1:     1     1            3
2:     1     1           10
3:     2     1           11

NOTE: Both solution does in a single join

CodePudding user response：

First convert to list_colleagues to a friendlier format then perform anti-join:

lc2 <- list_colleagues[, .(colleague_id = unlist(colleague_ids)), by = .(id, month)]

pairs[!lc2, on = names(pairs)]

#       id month colleague_id
#    <num> <num>        <num>
# 1:     1     1            3
# 2:     1     1           10
# 3:     2     1           11