How do I split column a on the space character and count how many times any of the resulting substrings occur in column b?
library(data.table)
library(stringr)
dt = data.table(
  a = c('one', 'one two', 'one two three'),
  b = c('zero', 'none_or One?', 'onetwothree')
)
I failed when I tried:
dt[, .(
  str_count(b,
    pattern = str_split(a, pattern = ' ')
  )
)]
Error in UseMethod("type") :
no applicable method for 'type' applied to an object of class "list"
I expect:
   V1
1:  0
2:  1
3:  3
CodePudding user response:
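# Build one alternation regex from every token in column a (all rows pooled)
dict = paste0(unique(unlist(strsplit(dt$a, " "))), collapse = "|")
# Count the matches of dict in a single string s
f <- function(s, dict) {
  res = gregexpr(dict, s)[[1]]
  # gregexpr returns -1 when there is no match
  return(ifelse(res[1] == -1, as.integer(0), length(res)))
}
# Apply f row by row
dt[, f(b, dict), by = .(1:nrow(dt))][, .(V1)]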
   V1
1:  0
2:  1
3:  3
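Note that dict pools the tokens from every row of a, which happens to give the expected result for this data. If each b should only be matched against the tokens of its own row, a per-row variant of the same gregexpr idea could look like the sketch below. count_row is a hypothetical helper name, and the tokens are assumed to contain no regex metacharacters.

```r
# Sketch: per-row matching with base R only (count_row is a hypothetical helper).
# Assumes the tokens contain no regex metacharacters.
count_row <- function(a, b) {
  # Turn "one two" into the alternation "one|two", one pattern per row
  patterns <- gsub(" ", "|", a, fixed = TRUE)
  mapply(function(p, s) {
    m <- gregexpr(p, s)[[1]]
    # gregexpr signals "no match" with a single -1
    if (m[1] == -1L) 0L else length(m)
  }, patterns, b, USE.NAMES = FALSE)
}

a <- c("one", "one two", "one two three")
b <- c("zero", "none_or One?", "onetwothree")
counts <- count_row(a, b)
print(counts)
```

With the question's table this could be called as dt[, count_row(a, b)].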
CodePudding user response:
First build a regex for each row from the tokens in a, then apply that regex to the corresponding value of b.
Try this:
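# Split each a into its tokens (list column)
dt[, parsed := str_split(a, " ")]
# Collapse each row's tokens into one alternation regex, e.g. "one|two"
dt[, regex := lapply(parsed, function(x) paste0(x, collapse = "|"))]
# Count the matches of each row's regex in the corresponding b
dt[, V1 := mapply(function(x, y) {str_extract_all(x, y)[[1]] |> length()}, b, regex)]
dt[, .(V1)]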
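Since str_count() is vectorised over both string and pattern, the helper columns can likely be skipped. The attempt in the question appears to fail only because str_split() returns a list, which str_count() does not accept as a pattern. A minimal sketch, assuming the tokens contain no regex metacharacters, collapses each a into one alternation and passes the resulting character vector to str_count() directly:

```r
library(stringr)

a <- c("one", "one two", "one two three")
b <- c("zero", "none_or One?", "onetwothree")

# "one two" -> "one|two": one alternation pattern per row
patterns <- str_replace_all(a, fixed(" "), "|")

# str_count() pairs b[i] with patterns[i] and counts the matches
counts <- str_count(b, patterns)
print(counts)
```

Inside the question's table the same idea would read dt[, .(V1 = str_count(b, str_replace_all(a, fixed(" "), "|")))].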
CodePudding user response:
Another couple of variations:
With just data.table:
# For each row, count how many tokens of a occur in b (the \(x) lambda syntax needs R >= 4.1)
dt[, mapply(\(sp, txt) sum(sapply(sp, \(x) grepl(x, txt))), strsplit(a, " "), b)]
##[1] 0 1 3
With data.table and stringr:
dt[, mapply(\(sp, txt) sum(sapply(sp, \(x) str_detect(txt, x))), strsplit(a, " "), b)]
##[1] 0 1 3
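These two variations count how many tokens of a are present in b, whereas the regex-based answers count every match. The results agree on the sample data but differ as soon as a token occurs more than once, as this small base R sketch with made-up inputs suggests:

```r
# Made-up inputs for illustration: "one" occurs twice in b
a <- "one two"
b <- "one one two"
tokens <- strsplit(a, " ", fixed = TRUE)[[1]]

# Presence per token (at most 1 each), as in the grepl/str_detect variations
n_present <- sum(sapply(tokens, function(x) grepl(x, b)))

# Every occurrence of any token, as with a collapsed "one|two" regex
m <- gregexpr(paste(tokens, collapse = "|"), b)[[1]]
n_matches <- if (m[1] == -1L) 0L else length(m)

print(c(present = n_present, matches = n_matches))
```

Here the presence count is 2 and the match count is 3, so which approach fits depends on whether repeated occurrences should be counted.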