Home > Software design >  How to check for string matches in data.table in R
How to check for string matches in data.table in R

Time:11-14

library(data.table)
tab1 <- data.table(id1 = c(1, 2, 100, 100),
                   class = c(2, 2, 1, 5),
                   number = c(96, 100, 55, 55),
                   code1 = c("123", "125", "999", "999"))
tab2 <- data.table(id2 = c(100, 200, 200, 205, 205, 205),
                   class = c(2, 2, 1, 5, 2, 2),
                   max_number = c(100, 80, 80, 95, 95, 95),
                   min_number = c(0, 50, 50, 10, 10, 10),
                   code2 = c("123,999", "100,105", "100,105", "555,999", "234,999", "123,999,555"))

> tab1
   id1 class number code1
1:   1     2     96   123
2:   2     2    100   125
3: 100     1     55   999
4: 100     5     55   999
> tab2
   id2 class max_number min_number       code2
1: 100     2        100          0     123,999
2: 200     2         80         50     100,105
3: 200     1         80         50     100,105
4: 205     5         95         10     555,999
5: 205     2         95         10     234,999
6: 205     2         95         10 123,999,555

I have two tables. For each id1, I would like to match on class, min_number <= number <= max_number and see if code1 matches anything (partially) in code2. That is, "123" would be a match for "123, 999, 555".

My attempt is below. The column V1 contains the id2s that matched.

> tab2[tab1, on = c("class", "max_number >= number", "min_number <= number"), .(list(id2)), by = .EACHI]
   class max_number min_number  V1
1:     2         96         96 100
2:     2        100        100 100
3:     1         55         55 200
4:     5         55         55 205

Just matching on class and min_number <= number <= max_number works fine. But trying to do the partial string match with grepl gives the following error:

> tab2[tab1, on = c("class", "max_number >= number", "min_number <= number", grepl("code2", "code1")), .(list(id2)), by = .EACHI]
Error in colnamesInt(x, names(on), check_dups = FALSE) : 
  argument specifying columns specify non existing column(s): cols[4]='FALSE'

The desired output is something like this (where V1 contains the id2s that matches each row in tab1):

   id1 class max_number min_number      V1
1:   1     2         96         96     100
2:   2     2        100        100    <NA>
3: 100     1         55         55    <NA>
4: 100     5         55         55 205,205

CodePudding user response:

You can't use grepl in the join condition, but you could calculate it using mapply:

tab2[tab1, on = c("class", "max_number >= number", "min_number <= number"), .(code1,code2,id1,id2,select = mapply(grepl,code1,code2)), by = .EACHI][
     ,.(list(fifelse(select,id2,NA))),by=.(id1,class,max_number,min_number)]

     id1 class max_number min_number     V1
   <num> <num>      <num>      <num> <list>
1:     1     2         96         96    100
2:     2     2        100        100     NA
3:   100     1         55         55     NA
4:   100     5         55         55    205

CodePudding user response:

Probably I have to use strsplit instead of grepl here

tab2[,
  .(code2 = unlist(strsplit(code2, ","))),
  by = id2:min_number
][
  tab1, 
  .(id1, class, min_number, max_number, id2),
  on = c("class", "min_number <= number", "max_number >= number", "code2 == code1")
]

which gives

   id1 class min_number max_number id2
1:   1     2         96         96 100
2:   2     2        100        100  NA
3: 100     1         55         55  NA
4: 100     5         55         55 205
  • Related