I have two data.tables. The first one has information of all colleague ids
of a given individual id
at a given month
. The data is presented in long format (i.e., if an individual has 3 colleagues at a given month, we have 3 rows for that same individual-month).
The second dataset informs the list
of all colleagues of a given individual for a given month. There is only a single row for each id-month.
All id-month contained in the first dataset can be found in this second dataset. However, not all colleague ids informed in the first dataset can be found within the list of colleagues for the respective id-month. I would like to keep the rows of the first dataset which inform ids not included in the list of colleagues for a given id-month.
Below is a simple example.
library(data.table)
pairs <- data.table(
id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
month = c(1, 1, 1, 1, 1, 2, 2, 1, 1, 2),
colleague_id = c(2, 3, 4, 5, 10, 2, 4, 1, 11, 12)
)
list_colleagues <- data.table(
id = c(1, 1, 2, 2),
month = c(1, 2, 1, 2),
colleague_ids = list(c(2, 4, 5), c(2, 4), c(1,10), 12)
)
The code below works fine. However, I would like to make it more efficient. Is there a way to code this in a way that I don't have to attach the list of colleagues to the first dataset as a prior step to then check if the colleague ids is in that list.
keep_excluded <- list_colleagues[pairs, on=.(id, month)]
keep_excluded <- keep_excluded[,
keep := !colleague_id %in% colleague_ids[[1]], 1:nrow(keep_excluded)][
keep==TRUE][,
keep:=NULL][,
colleague_ids:=NULL]
Output
id month colleague_id
1: 1 1 3
2: 1 1 10
3: 2 1 11
CodePudding user response:
We could use .EACHI
on the join step itself
library(data.table)
setnames(list_colleagues[pairs, lapply(colleague_ids, \(x)
colleague_id[!colleague_id %in% x]), on = .(id, month),
by = .EACHI], 'V1', 'colleague_id')[]
-output
id month colleague_id
<num> <num> <num>
1: 1 1 3
2: 1 1 10
3: 2 1 11
Or may also do
list_colleagues[pairs, .(colleague_id = setdiff(colleague_id,
unlist(colleague_ids))), on = .(id, month), by = .EACHI]
id month colleague_id
<num> <num> <num>
1: 1 1 3
2: 1 1 10
3: 2 1 11
NOTE: Both solution does in a single join
CodePudding user response:
First convert to list_colleagues
to a friendlier format then perform anti-join:
lc2 <- list_colleagues[, .(colleague_id = unlist(colleague_ids)), by = .(id, month)]
pairs[!lc2, on = names(pairs)]
# id month colleague_id
# <num> <num> <num>
# 1: 1 1 3
# 2: 1 1 10
# 3: 2 1 11