How do I run a matrix regex or grep on the outer 'product' of two string vectors in R with

Time:08-02

Let's say I have a vector of strings, and a second vector of standard words that I'm interested in finding inside those strings. For example:

 a = c("aspirin 20mg", "ibuprofen 200mg", "diclofenac 50mg x 2", "phenobarbital 100mg")
 b = c("aspirin", "acetaminophen", "morphine", "ibuprofen", "warfarin")

I want to get back a TRUE/FALSE matrix showing which strings in the a vector contain which of the standard substrings in the b vector, ideally with dimensions length(a) x length(b). What I naively thought would work is:

 outer(a, b, grepl)

I know that I could create a function that does a nested sapply, e.g.

 sapply(a, function(x) sapply(b, function(y) grepl(y,x)))

...but I feel like R should have something simpler that is related to the outer command. mapply feels stupid because I'd have to rep and wrap the outputs back into a matrix.

CodePudding user response:

I am not sure you need to nest your sapply() statements. Without nesting you can do:

sapply(b, \(x) grepl(x, a))
#      aspirin acetaminophen morphine ibuprofen warfarin
# [1,]    TRUE         FALSE    FALSE     FALSE    FALSE
# [2,]   FALSE         FALSE    FALSE      TRUE    FALSE
# [3,]   FALSE         FALSE    FALSE     FALSE    FALSE
# [4,]   FALSE         FALSE    FALSE     FALSE    FALSE

Admittedly, it is then a little cumbersome to show which string each row corresponds to:

sapply(b, \(x) grepl(x, a))  |>
    data.frame()  |>
    cbind(a)
#   aspirin acetaminophen morphine ibuprofen warfarin                   a
# 1    TRUE         FALSE    FALSE     FALSE    FALSE        aspirin 20mg
# 2   FALSE         FALSE    FALSE      TRUE    FALSE     ibuprofen 200mg
# 3   FALSE         FALSE    FALSE     FALSE    FALSE diclofenac 50mg x 2
# 4   FALSE         FALSE    FALSE     FALSE    FALSE phenobarbital 100mg
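As an alternative sketch in base R, the result can stay a logical matrix and be labelled through its row names instead of being converted to a data frame. Here `m` is just an illustrative name, and `fixed = TRUE` is an assumption that the entries of b are literal substrings rather than regular expressions:

```r
a <- c("aspirin 20mg", "ibuprofen 200mg", "diclofenac 50mg x 2", "phenobarbital 100mg")
b <- c("aspirin", "acetaminophen", "morphine", "ibuprofen", "warfarin")

# Each element of b is passed to grepl() as `pattern`; all of a is searched via x = a.
# sapply() names the columns after b because b is a character vector.
m <- sapply(b, grepl, x = a, fixed = TRUE)

# Label the rows with the strings that were searched.
rownames(m) <- a
m
```

Rows can then be looked up by string, e.g. m["ibuprofen 200mg", ].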

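However, I like the idea of using outer(). You could combine that with stringi::stri_count_fixed and setNames(). Note that this matches literal (fixed) substrings and returns match counts rather than TRUE/FALSE; comparing the result with > 0 gives a logical matrix: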

outer(
    setNames(a, a),
    setNames(b, b),
    stringi::stri_count_fixed
)
#                     aspirin acetaminophen morphine ibuprofen warfarin
# aspirin 20mg              1             0        0         0        0
# ibuprofen 200mg           0             0        0         1        0
# diclofenac 50mg x 2       0             0        0         0        0
# phenobarbital 100mg       0             0        0         0        0

CodePudding user response:

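To use outer(), you need a function that is vectorized over both the input strings and the patterns. grepl() is vectorized over its input x only, not over pattern. stringr::str_detect(), however, is vectorized over both, and its argument order (string, pattern) matches the order in which outer() passes a and b: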

outer(a, b, stringr::str_detect)
#>       [,1]  [,2]  [,3]  [,4]  [,5]
#> [1,]  TRUE FALSE FALSE FALSE FALSE
#> [2,] FALSE FALSE FALSE  TRUE FALSE
#> [3,] FALSE FALSE FALSE FALSE FALSE
#> [4,] FALSE FALSE FALSE FALSE FALSE
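If adding a package is not an option, a similar effect can be sketched in base R by wrapping grepl() with Vectorize() so that it loops over pattern and x in parallel. The names `grepl_vec` and `m` are hypothetical, and `fixed = TRUE` again assumes that b holds literal substrings; this is one possible workaround, not necessarily the fastest:

```r
a <- c("aspirin 20mg", "ibuprofen 200mg", "diclofenac 50mg x 2", "phenobarbital 100mg")
b <- c("aspirin", "acetaminophen", "morphine", "ibuprofen", "warfarin")

# Vectorize() builds an mapply()-based wrapper that pairs up pattern[i] with x[i].
grepl_vec <- Vectorize(grepl, vectorize.args = c("pattern", "x"), USE.NAMES = FALSE)

# outer() hands the expanded a as the first argument and the expanded b as the second,
# so name the arguments explicitly to match grepl(pattern, x).
m <- outer(a, b, function(s, p) grepl_vec(pattern = p, x = s, fixed = TRUE))

# Optional: label rows and columns.
dimnames(m) <- list(a, b)
m
```

Because the wrapper calls grepl() once per (string, pattern) pair, it is convenient for small inputs like these but does more work than a single vectorized call.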