For each row in a data.table, get random index for matching rows in another data.table-CodePudding

I have a data.table (dt1) with several columns of factors e.g.

ID    Factor1    Factor2    Factor3
01          A          X          J
02          B          X          L
03          C          Y          J

Each row has a unique ID, but there may be multiple rows with the same combinations of Factor variables. I have another table (dt2) which has exactly the same Factor variables, but is much larger, and has no ID column. Every combination of Factors in dt1 will appear multiple times in dt2. e.g.

Factor1    Factor2    Factor3
      A          Z          K
      A          X          J
      A          X          J
      B          J          L
      B          X          L
      C          X          J
      C          Y          J

I want to return, for each row in dt1, the index of a random row in dt2 which has the same combination of Factor variables.

I figure I can use .I[sample(.N,1)] to return a random index, but can't figure out how to group by the relevant factors for each row. Do I need to make a function that takes each row in dt1 in turn, or is there a vectorized way to achieve this?

The desired output would either be a vector of length nrow(dt1), or it could be an additional column in dt1. It would contain the row index from dt2. e.g. something like this (the Index for ID 01 could be 2 or 3):

ID    Factor1    Factor2    Factor3    Index
01          A          X          J        3
02          B          X          L        5
03          C          Y          J        7

Any help much appreciated.

CodePudding user response：

# create index in dt2
dt2[, index := .I]
# set unique key dt1
setkey(dt1, ID)
# get factor columns
cols <- grep("Factor", names(dt1), value = TRUE)
#build code to eval/parse in a string
run.text <- paste0("dt1[dt1, index := sample(dt2[", 
                   paste0(cols, " == i.", cols, collapse = " & "), 
                   ", ]$index, 1), by = .EACHI]")
#eval/parse the string
eval(parse(text = run.text))
#    ID Factor1 Factor2 Factor3 index
# 1:  1       A       X       J     2
# 2:  2       B       X       L     5
# 3:  3       C       Y       J     1
# 4:  4       A       X       J     3

sample data used (with duplicate in dt1)

library(data.table)
dt1 <- fread("ID    Factor1    Factor2    Factor3
01          A          X          J
02          B          X          L
03          C          Y          J
04          A          X          J")

dt2 <- fread("Factor1    Factor2    Factor3
      A          Z          K
      A          X          J
      A          X          J
      B          J          L
      B          X          L
      C          X          J
      C          Y          J")

CodePudding user response：

UPD
I think you want to do something like this:

dt2[, Index := as.character(.I)]
cols = c("Factor1", "Factor2", "Factor3")
dt = dt2[, lapply(.SD, list), keyby = cols]
dt = merge(dt1, dt, by = cols, all.x = T)
dt[, .(Index = sample(Index[[1]], 1)), keyby = c("ID", cols)]

CodePudding user response：

You can also try

add index
merge dt1 and dt2 then sample with sample(.N,1)

cols = names(dt2)
dt2[,index := .I]
dt2[dt1, on = (cols)][,.SD[sample(.N,1)],.(ID)]

or one row answer

dt2[,index := .I][dt1, on = (cols)][,.SD[sample(.N,1)],.(ID)]

CodePudding user response：

library(data.table)

# set up the data
set.seed(94)
dt1 <- setnames(as.data.table(matrix(sample(3, 9, TRUE), 3)), paste0("Factor", 1:3))
dt2 <- dt1[sample(3, 10, TRUE)]
dt1
#>    Factor1 Factor2 Factor3
#> 1:       2       1       2
#> 2:       2       3       3
#> 3:       2       3       2
dt2
#>     Factor1 Factor2 Factor3
#>  1:       2       3       2
#>  2:       2       3       3
#>  3:       2       3       2
#>  4:       2       3       2
#>  5:       2       3       2
#>  6:       2       1       2
#>  7:       2       3       2
#>  8:       2       3       2
#>  9:       2       3       2
#> 10:       2       3       3

# create helper columns and do a rolling join
dt2[, R := seq_len(.N)/.N, by = names(dt2)][, Index := .I]
print(dt2[dt1[, R := runif(.N)], on = names(dt1), roll = -Inf][, R := NULL])
#>    Factor1 Factor2 Factor3 Index
#> 1:       2       1       2     6
#> 2:       2       3       3     2
#> 3:       2       3       2     3