Let's say I have a vector of strings, and a second vector of standard words that I'm interested in finding inside those strings. For example:
a = c("aspirin 20mg", "ibuprofen 200mg", "diclofenac 50mg x 2", "phenobarbital 100mg")
b = c("aspirin", "acetaminophen", "morphine", "ibuprofen", "warfarin")
I want to get back a TRUE/FALSE matrix from regex-matching each string in the a vector against each standard substring in the b vector. Ideally this would be a matrix of length(a) x length(b). What I naively thought would work is:
outer(a, b, grepl)
I know that I could create a function that does a nested sapply, e.g.
sapply(a, function(x) sapply(b, function(y) grepl(y,x)))
...but I feel like R should have something simpler that is related to the outer command. mapply feels clumsy because I'd have to rep the inputs and wrap the outputs back into a matrix.
CodePudding user response:
I am not sure you need to nest your sapply() statements. Without nesting you can do:
sapply(b, \(x) grepl(x, a))
#      aspirin acetaminophen morphine ibuprofen warfarin
# [1,]    TRUE         FALSE    FALSE     FALSE    FALSE
# [2,]   FALSE         FALSE    FALSE      TRUE    FALSE
# [3,]   FALSE         FALSE    FALSE     FALSE    FALSE
# [4,]   FALSE         FALSE    FALSE     FALSE    FALSE
Admittedly it is then a little cumbersome to add which string they match:
sapply(b, \(x) grepl(x, a)) |>
data.frame() |>
cbind(a)
#   aspirin acetaminophen morphine ibuprofen warfarin                   a
# 1    TRUE         FALSE    FALSE     FALSE    FALSE        aspirin 20mg
# 2   FALSE         FALSE    FALSE      TRUE    FALSE     ibuprofen 200mg
# 3   FALSE         FALSE    FALSE     FALSE    FALSE diclofenac 50mg x 2
# 4   FALSE         FALSE    FALSE     FALSE    FALSE phenobarbital 100mg
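If the goal is only to see which string each row belongs to, a lighter option might be to keep the logical matrix and label its rows. This is a minimal sketch in base R; res is just a name picked for illustration:

```r
a <- c("aspirin 20mg", "ibuprofen 200mg", "diclofenac 50mg x 2", "phenobarbital 100mg")
b <- c("aspirin", "acetaminophen", "morphine", "ibuprofen", "warfarin")

# One grepl() call per pattern; sapply() binds the resulting logical
# vectors as columns and names the columns after the patterns in b.
res <- sapply(b, function(p) grepl(p, a))

# Label the rows with the strings that were searched.
rownames(res) <- a

# Look up a single string/pattern pair by name.
res["ibuprofen 200mg", "ibuprofen"]
```

This keeps the result a logical matrix rather than a data frame, so it can still be indexed by name or passed to functions like which() and rowSums().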
However, I like the idea of using outer(). You could combine that with stringi::stri_count_fixed and setNames():
outer(
  setNames(a, a),
  setNames(b, b),
  stringi::stri_count_fixed
)
#                     aspirin acetaminophen morphine ibuprofen warfarin
# aspirin 20mg              1             0        0         0        0
# ibuprofen 200mg           0             0        0         1        0
# diclofenac 50mg x 2       0             0        0         0        0
# phenobarbital 100mg       0             0        0         0        0
CodePudding user response:
To use outer, you need a function that is vectorized over both the input strings and the match patterns. grepl is vectorized over its input only, not over its pattern. However, stringr::str_detect is vectorized over both:
outer(a, b, stringr::str_detect)
#>       [,1]  [,2]  [,3]  [,4]  [,5]
#> [1,]  TRUE FALSE FALSE FALSE FALSE
#> [2,] FALSE FALSE FALSE  TRUE FALSE
#> [3,] FALSE FALSE FALSE FALSE FALSE
#> [4,] FALSE FALSE FALSE FALSE FALSE
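If you would rather avoid an extra package, the same idea can probably be done in base R by wrapping grepl so that it is vectorized over both arguments. This is only a sketch: detect is a hypothetical helper name, and fixed = TRUE is my assumption that the standard words are literal substrings rather than regular expressions.

```r
a <- c("aspirin 20mg", "ibuprofen 200mg", "diclofenac 50mg x 2", "phenobarbital 100mg")
b <- c("aspirin", "acetaminophen", "morphine", "ibuprofen", "warfarin")

# Hypothetical helper: outer() passes strings first and patterns second,
# while grepl() expects the pattern first, so the wrapper swaps them.
# Vectorize() makes it run once per (string, pattern) pair.
detect <- Vectorize(function(string, pattern) grepl(pattern, string, fixed = TRUE))

# Named inputs give the result row and column names.
m <- outer(setNames(a, a), setNames(b, b), detect)
m
```

Note that Vectorize() is a wrapper around mapply(), so this calls grepl once per cell; for large inputs the sapply() version or a natively vectorized function is likely faster.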