I have a list of strings and a list of regex motifs that I want to match in R. If there is a match, I would like to see what each of the characters matched exactly.
e.g. the string TAPQQATD
and motif "P.Q.{2}D"
can be matched with str_match but it only produces this as an output:
> str_match('TAPQQATD', "P.Q.{2}D")
     [,1]
[1,] "PQQATD"
Now, I know that I can edit each motif to contain capture groups around each element (like "(P)(.)(Q)(.{2})(D)"), but I would prefer not to, because there are so many of them. So can I produce something like this in R (maybe with another function), but using the expression "P.Q.{2}D"?
> str_match('TAPQQATD', "(P)(.)(Q)(.{2})(D)")
     [,1]     [,2] [,3] [,4] [,5] [,6]
[1,] "PQQATD" "P"  "Q"  "Q"  "AT" "D"
Thank you!
CodePudding user response:
We can use str_match_all from the stringr library:
library(stringr)
x <- "TAPQQATD"
str_match_all(x, "(P)(.)(Q)(.{2})(D)")
[[1]]
     [,1]     [,2] [,3] [,4] [,5] [,6]
[1,] "PQQATD" "P"  "Q"  "Q"  "AT" "D"
Or in base R, regmatches with regexec (unlike gregexpr, regexec returns the capture groups as well as the full match):
regmatches(x, regexec("(P)(.)(Q)(.{2})(D)", x))
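For several strings at once, the base R route can be shaped into the same kind of matrix that str_match returns. This is only a sketch: it assumes regexec() is used so that capture groups are reported, and `seqs` and `motif` are made-up example inputs, not names from the question.

```r
# Base R only; `seqs` and `motif` are hypothetical example inputs.
seqs  <- c("TAPQQATD", "GGPAQTTD")
motif <- "(P)(.)(Q)(.{2})(D)"

# regexec() records the full match plus every capture group,
# and regmatches() extracts them as one character vector per string.
hits <- regmatches(seqs, regexec(motif, seqs))

# Stack the vectors: one row per input string, one column per group.
# This assumes every string matched; a non-matching string yields
# character(0) and would need separate handling.
res <- do.call(rbind, hits)
res
```

The first column holds the full match and the remaining columns hold the groups in order.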
CodePudding user response:
You can try adding the brackets with gsub.
stringr::str_match('TAPQQATD',
  gsub("(.\\{\\d+\\}|.)", "(\\1)", "P.Q.{2}D", perl=TRUE))
#     [,1]     [,2] [,3] [,4] [,5] [,6]
#[1,] "PQQATD" "P"  "Q"  "Q"  "AT" "D"
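Since the question mentions many motifs, the gsub idea could be wrapped in a small helper and applied to all of them. The following is a minimal sketch, not a general regex parser: `add_groups`, `motifs`, and `seqs` are hypothetical names, and it assumes each motif consists only of single characters optionally followed by a {n} quantifier (no character classes, escapes, or existing groups).

```r
library(stringr)

# Hypothetical helper: wrap each token of a simple motif in a capture group.
# A token is one character, optionally followed by a {n} quantifier.
add_groups <- function(motif) {
  gsub("(.\\{\\d+\\}|.)", "(\\1)", motif, perl = TRUE)
}

motifs <- c("P.Q.{2}D", "Q.{2}D")    # example motifs (assumed)
seqs   <- c("TAPQQATD", "GGPAQTTD")  # example strings (assumed)

# One str_match() matrix per motif, with one row per string.
results <- lapply(motifs, function(m) str_match(seqs, add_groups(m)))
names(results) <- motifs
results
```

Each element of `results` is named after the original, unmodified motif, so the motifs themselves never need to be edited by hand.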