Get indices of matches with a column in a second data.table-CodePudding

I have two data.tables. Each has a column called 'firstName' and another called 'lastName', which contain some values which will match each other and some that won't. Some values in both data sets might be duplicated.

I want to add a new column to the second data.table, in which I will store the indices of matches from the first data set for each element of 'firstName' within the second data set. I will then repeat the whole matching process with the 'lastName' column and get the intersect of index matches for 'firstName' and 'lastName'. I will then use the intersect of the indices to fetch the case ID (cid) from the first data set and append it to the second data set.

Because there might be more than one match per element, I will store them as lists within my data.table. I cannot use base::match function because it will only return the first match for each element, but I do need the answer to be vectorised in just the same way as the match function.

I've tried different combinations of which(d1$x %in% y) but this does not work either because it matches for all of y at once instead of one element at a time. I am using data.table because for my real-world use case, the data set to match on could be hundreds of thousands of records, so speed is important.

I have found a related question here, but I can't quite figure out how to efficiently convert this to data.table syntax.

Here is some example data:

# Load library
library(data.table)

# First data set (lookup table):
dt1 <- data.table(cid = c("c1", "c2", "c3", "c4", "c5"), 
                  firstName = c("Jim", "Joe", "Anne", "Jim", "Anne"), 
                  lastName = c("Gracea", "Ali", "Mcfee", "Dutto", "Crest"))

# Second data set (data to match with case IDs from lookup table):
dt2 <- data.table(lid = c(1, 2, 3, 4), 
                  firstName = c("Maria", "Jim", "Jack", "Anne"), 
                  lastName = c("Antonis", "Dutto", "Blogs", "Mcfee"),
                  result = c("pos", "neg", "neg", "pos"))

My desired output would look like this:

# Output:
> dt2
   lid firstName lastName result fn_match ln_match casematch caseid
1:   1     Maria  Antonis    pos       NA       NA        NA   <NA>
2:   2       Jim    Dutto    neg      1,4        4         4     c4
3:   3      Jack    Blogs    neg       NA       NA        NA   <NA>
4:   4      Anne    Mcfee    pos      3,5        3         3     c3

CodePudding user response：

A possible solution:

dt1[,id:=seq_along(cid)]
dt1[dt2,.(lid,id,firstName = i.firstName),on=.(firstName)][
       ,.(casematch =.( id)),by=.(lid,firstName)]

     lid firstName casematch
   <num>    <char>    <list>
1:     1     Maria        NA
2:     2       Jim       1,4
3:     3      Jack        NA
4:     4      Anne       3,5

CodePudding user response：

We could use

library(data.table)
dt1[dt2, .(casematch = toString(cid), lid),on = .(firstName), by = .EACHI]

-output

  firstName casematch   lid
      <char>    <char> <num>
1:     Maria        NA     1
2:       Jim    c1, c4     2
3:      Jack        NA     3
4:      Anne    c3, c5     4

Or with row index

dt1[dt2, .(casematch = na_if(toString(.I), 0), lid),on = .(firstName), by = .EACHI]
   firstName casematch   lid
      <char>    <char> <num>
1:     Maria      <NA>     1
2:       Jim      1, 4     2
3:      Jack      <NA>     3
4:      Anne      3, 5     4

CodePudding user response：

Using .EACHI and adding the resulting list column by reference. The actual join takes place in j.

dt2[ , res := dt1[ , i := .I][.SD, on = .(firstName), .(.(i)), by = .EACHI]$V1]
#    lid firstName res
# 1:   1     Maria  NA
# 2:   2       Jim 1,4
# 3:   3      Jack  NA
# 4:   4      Anne 3,5

CodePudding user response：

Another data.table option

> dt1[, .(cid = toString(cid)), firstName][dt2, on = .(firstName)]
   firstName    cid lid
1:     Maria   <NA>   1
2:       Jim c1, c4   2
3:      Jack   <NA>   3
4:      Anne c3, c5   4

CodePudding user response：

I found another way to perform exact matching on multiple columns at once, which also works very well and can be executed in one single line of code:

# Get case IDs for matches of both firstName and lastName in one step:
dt2[dt1, caseid := i.cid[.I], on = .(firstName, lastName)]

# Output:
> dt2
   lid firstName lastName result caseid
1:   1     Maria  Antonis    pos   <NA>
2:   2       Jim    Dutto    neg     c4
3:   3      Jack    Blogs    neg   <NA>
4:   4      Anne    Mcfee    pos     c3