Count Occurrences of Substrings in One Column in Second Column R data.table

How do I split column a (by the space character) and count how many times any of the substrings in column a exist in column b?

library(data.table)
library(stringr)

dt = data.table(
    a = c('one', 'one two', 'one two three'),
    b = c('zero', 'none_or One?' , 'onetwothree')
)

My attempt fails with an error:

dt[ , .(
    str_count( b,
        pattern = str_split( a , pattern = ' ' )
    )
) ]

Error in UseMethod("type") : 
  no applicable method for 'type' applied to an object of class "list"

I expect one count per row: 0, 1 and 3.

CodePudding user response：

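# Build one alternation pattern from every unique word in column a (all rows pooled)
dict <- paste0(unique(unlist(strsplit(dt$a, " "))), collapse = "|")
# Count the non-overlapping matches of dict in a single string s
f <- function(s, dict) {
  res <- gregexpr(dict, s)[[1]]
  # gregexpr() returns -1 when there is no match
  if (res[1] == -1) 0L else length(res)
}
# Apply f to b one row at a time
dt[, f(b, dict), by = .(1:nrow(dt))][, .(V1)]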


   V1
1:  0
2:  1
3:  3

CodePudding user response：

First build a regex for each row from the words in a, then count that regex's matches in the corresponding value of b.

Try this:

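# Split a into a list of words, one list element per row
dt[, parsed := str_split(a, " ")]
# Join each row's words into an alternation pattern such as "one|two"
dt[, regex := lapply(parsed, function(x) paste0(x, collapse = "|"))]
# Count the matches of each row's pattern in that row's b
dt[, V1 := mapply(function(x, y) length(str_extract_all(x, y)[[1]]), b, regex)]
dt[, .(V1)]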
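
The two steps in this answer (build something to search for from each row of a, then count its hits in b) can also be folded into one helper without adding columns to dt. The following is only a sketch under one assumption: the words should be matched literally, so instead of an alternation regex it counts each word with stringr's fixed() and sums the results. That avoids having to escape regex metacharacters, but overlapping words such as "on" and "one" would each be counted. The name count_hits is hypothetical and not part of the original answer.

```r
library(data.table)
library(stringr)

dt <- data.table(
  a = c("one", "one two", "one two three"),
  b = c("zero", "none_or One?", "onetwothree")
)

# count_hits is a hypothetical helper name, not part of the original answer.
# It splits one value of `a` into words and sums the literal (non-regex)
# occurrences of each word in the matching value of `b`.
count_hits <- function(words, txt) {
  parts <- strsplit(words, " ", fixed = TRUE)[[1]]
  sum(str_count(txt, fixed(parts)))
}

# Pair each a with its b; USE.NAMES = FALSE keeps the result unnamed
counts <- dt[, mapply(count_hits, a, b, USE.NAMES = FALSE)]
counts
```

On the sample data this gives the same three counts as the regex version. Matching stays case-sensitive, so "One" in the second row is not counted.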

CodePudding user response：

Another couple of variations:

With just data.table:

dt[, mapply(\(sp,txt) sum(sapply(sp, \(x) grepl(x, txt))), strsplit(a, " "), b) ]
##[1] 0 1 3

With data.table and stringr:

dt[, mapply(\(sp,txt) sum(sapply(sp, \(x) str_detect(txt, x))), strsplit(a, " "), b) ]
##[1] 0 1 3
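
One caveat on these two variations: summing grepl() or str_detect() counts how many of the substrings are present in b, not how many times they occur. On the sample data both readings agree because no substring repeats. A small sketch of the difference, using a made-up input that is not part of the question (the names sp, txt, presence and occurrences are illustrative only):

```r
library(stringr)

sp  <- c("one", "two")   # words, as strsplit(a, " ") would produce for one row
txt <- "one one two"     # made-up text in which "one" appears twice

# Presence count: how many of the words are found at least once
presence <- sum(sapply(sp, \(x) grepl(x, txt, fixed = TRUE)))

# Occurrence count: total number of literal matches over all words
occurrences <- sum(str_count(txt, fixed(sp)))

presence     # 2
occurrences  # 3
```

If repeated matches should count, the str_count() form (or the regex-based answers) is the one to use.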