Home > Software engineering >  Using data.table to match multiple patterns against multiple strings in R
Using data.table to match multiple patterns against multiple strings in R

Time:09-07

library(data.table)
dat1 <- data.table(id1 = c(1, 1, 2),
          pattern = c("apple", "applejack", "bananas, sweet"))
dat2 <- data.table(id2 = c(1174, 1231),
          description = c("apple is sweet", "apple is a computer"),
          memo = c("bananas, sweet yes", "bananas, sweetyes"))
> dat1
   id1        pattern
1:   1          apple
2:   1      applejack
3:   2 bananas, sweet
> dat2
    id2         description               memo
1: 1174      apple is sweet bananas, sweet yes
2: 1231 apple is a computer  bananas, sweetyes

I have two data.tables, dat1 and dat2. I want to search for each pattern in dat2 against the description and memo columns in dat2 and store the corresponding id2s.

The final output table should look something like this:

   id1        pattern description_match memo_match
1:   1          apple         1174,1231       <NA>
2:   1      applejack              <NA>       <NA>
3:   2 bananas, sweet              <NA>       1174

The regular expression I want to use is \\b[pattern]\\b. Below is my attempt:

dat1[, description_match := dat2[grepl(paste0("\\b", dat1$pattern, "\\b"), dat2$description), .(id2 = paste(id2, collapse = ","))]]
dat1[, memo_match := dat2[grepl(paste0("\\b", dat1$pattern, "\\b"), dat2$memo), .(id2 = paste(id2, collapse = ","))]]

However, both give me the error that grepl can only use the first pattern.

CodePudding user response:

We group by row sequence, create the match columns from 'dat2', by pasteing the 'id2' extracted from the logical output from grepl

library(data.table)
dat1[, c("description_match", "memo_match") := {
    pat <- sprintf('\\b(%s)\\b', paste(pattern, collapse = "|"))
    .(toString(dat2$id2[grepl(pat, dat2$description)]), 
    toString(dat2$id2[grepl(pat, dat2$memo)]))
   }, seq_along(id1)]
dplyr::na_if(dat1, "")
    id1        pattern description_match memo_match
   <num>         <char>            <char>     <char>
1:     1          apple        1174, 1231       <NA>
2:     1      applejack              <NA>       <NA>
3:     2 bananas, sweet              <NA>       1174

According to ?sprintf

The string fmt contains normal characters, which are passed through to the output string, and also conversion specifications which operate on the arguments provided through .... The allowed conversion specifications start with a % and end with one of the letters in the set aAdifeEgGosxX%

s - Character string. Character NAs are converted to "NA".


Or use a for loop

for(nm in names(dat2)[-1]) 
   dat1[, paste0(nm, "_match") :=  
     toString(dat2$id2[grepl(paste0("\\b", pattern, "\\b"), 
       dat2[[nm]])]), seq_along(id1)][]
  • Related