R regex - How to list every part of a string to what it matched exactly?

Time:08-16

I have a list of strings and a list of regex motifs that I want to match in R. When there is a match, I would like to see exactly which characters each part of the motif matched. For example, the string TAPQQATD and the motif "P.Q.{2}D" can be matched with str_match, but it only returns the full match:

> str_match('TAPQQATD', "P.Q.{2}D")
     [,1]    
[1,] "PQQATD"

I know that I can edit each motif to put a capture group around every element (like "(P)(.)(Q)(.{2})(D)"), but I would prefer not to, because there are so many of them. Can I produce something like the output below in R (maybe with another function) while still using the expression "P.Q.{2}D"?

> str_match('TAPQQATD', "(P)(.)(Q)(.{2})(D)")  
     [,1]     [,2] [,3] [,4] [,5] [,6]  
[1,] "PQQATD" "P"  "Q"  "Q"  "AT" "D"  

Thank you!

CodePudding user response:

We can use str_match_all from the stringr package, which returns one matrix per input string with one row per match:

library(stringr)
x <- "TAPQQATD"
str_match_all(x, "(P)(.)(Q)(.{2})(D)")

[[1]]
     [,1]     [,2] [,3] [,4] [,5] [,6]
[1,] "PQQATD" "P"  "Q"  "Q"  "AT" "D" 

Or in base R, regmatches with regexec (gregexpr would return only the full match, without the capture groups):

regmatches(x, regexec("(P)(.)(Q)(.{2})(D)", x))
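To make the base R behaviour concrete, here is a small sketch comparing the two match functions on the string from the question. The variable names full and groups are only illustrative.

```r
x <- "TAPQQATD"
pattern <- "(P)(.)(Q)(.{2})(D)"

# gregexpr() records only the position of each overall match,
# so regmatches() gives back the full match and nothing else.
full <- regmatches(x, gregexpr(pattern, x))[[1]]

# regexec() also records the position of every capture group,
# so regmatches() gives the full match followed by each group.
groups <- regmatches(x, regexec(pattern, x))[[1]]

print(full)
print(groups)
```

Note that regexec only reports the first match in each string. If a motif can occur several times in one string, str_match_all is probably the simpler choice.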

CodePudding user response:

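You can add the capture-group parentheses automatically with gsub, wrapping each element of the motif (a single character, optionally followed by a {n} quantifier) in its own group:

stringr::str_match('TAPQQATD',
                   gsub("(.\\{\\d+\\}|.)", "(\\1)", "P.Q.{2}D", perl=TRUE))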
#     [,1]     [,2] [,3] [,4] [,5] [,6]
#[1,] "PQQATD" "P"  "Q"  "Q"  "AT" "D"
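Building on the same idea, the following is a minimal base R sketch that also labels each matched piece with the motif element it came from. The helper name match_parts and the variable element are hypothetical, not part of any package. It assumes the motifs consist only of single characters, ".", and {n} or {n,m} quantifiers. Character classes such as [ST], alternation, or existing groups would need a more careful tokenizer.

```r
# One motif element: any single character, optionally followed by {n} or {n,m}.
# Assumption: motifs contain nothing more complex than this.
element <- "(.(?:\\{\\d+(?:,\\d*)?\\})?)"

# Hypothetical helper: returns a named character vector, or NULL if there is no match.
match_parts <- function(string, motif) {
  # Split the motif into its elements, e.g. "P", ".", "Q", ".{2}", "D"
  parts <- regmatches(motif, gregexpr(element, motif, perl = TRUE))[[1]]
  # Wrap every element in a capture group, e.g. "(P)(.)(Q)(.{2})(D)"
  grouped <- gsub(element, "(\\1)", motif, perl = TRUE)
  # Full match followed by one entry per capture group
  hit <- regmatches(string, regexec(grouped, string, perl = TRUE))[[1]]
  if (length(hit) == 0) return(NULL)
  setNames(hit, c("match", parts))
}

res <- match_parts("TAPQQATD", "P.Q.{2}D")
print(res)
```

For a whole list of strings, something like lapply(strings, match_parts, motif = "P.Q.{2}D") should apply it to each one, with NULL marking the strings that do not match.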