How can I edit my regex so that it captures only the substring between (and not including) quotation

Time:12-15

I'm a total novice to regex, and have a hard time wrapping my head around it. Right now I have a column filled with strings, but the only relevant text to my analysis is between quotation marks. I've tried this:

response$text <- stri_extract_all_regex(response$text, '"\\S+"')

but when I view response$text, the output comes out like this:

"\"caring\""

How do I change my regex so that the output instead reads:

caring

CodePudding user response:

Have a look at this cheat sheet. https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf

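You're currently matching any non-whitespace characters with \S. You can swap it for a narrower class such as \w (word characters); note that \d would match digits only, so it would not match a word like caring.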

CodePudding user response:

You can use

library(stringi)
response$text <- stri_extract_all_regex(response$text, '(?<=")[^\\s"]+(?=")')

Or, with stringr:

library(stringr)
response$text <- str_extract_all(response$text, '(?<=")[^\\s"]+(?=")')
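
As a quick check of the lookaround pattern, here is a minimal sketch on made-up data (the sample strings and the names x, pattern, all_matches and first_match are assumptions, not the asker's data). It also tries str_extract, which returns only the first match per string as a plain character vector; that may be more convenient for a data frame column if each string holds at most one quoted word.

```r
library(stringr)

# Hypothetical sample data standing in for response$text
x <- c('She was "caring" today', 'no quotes here')

# Lookbehind (?<=") and lookahead (?=") require the quotes
# but do not include them in the match
pattern <- '(?<=")[^\\s"]+(?=")'

all_matches <- str_extract_all(x, pattern)  # list, one element per input string
first_match <- str_extract(x, pattern)      # character vector, NA where nothing matched

print(all_matches)
print(first_match)
```

Note that the *_all variants return a list, so assigning their result back to response$text produces a list column rather than a character column.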

However, when a string contains several quoted substrings, I'd rather use stringr::str_match_all:

library(stringr)
matches <- str_match_all(response$text, '"([^\\s"]+)"')
response$text <- lapply(matches, function(x) x[,2])

See this regex demo.

With the capturing group approach used in "([^\\s"]+)", the quotation marks are consumed as part of each match, which avoids overlapping matches between quoted substrings. str_match_all is handy here because the matches it returns include the captured substrings as well (unlike the *extract* functions).
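
To illustrate the overlap issue, here is a small sketch comparing the two patterns on a contrived string in which two quoted words are separated by a slash and no whitespace. The input and the variable names are made up for the illustration.

```r
library(stringr)

# Hypothetical input: two quoted words with no whitespace between them
x <- '"caring"/"kind"'

# Lookarounds do not consume the quotes, so the closing quote of "caring"
# can also act as the opening quote for the "/" in between
lookaround <- str_extract_all(x, '(?<=")[^\\s"]+(?=")')[[1]]

# Here each match consumes its own pair of quotes, so they cannot be reused;
# column 2 of the result matrix holds capture group 1
captured <- str_match_all(x, '"([^\\s"]+)"')[[1]][, 2]

print(lookaround)  # "caring" "/" "kind"
print(captured)    # "caring" "kind"
```

The lookaround version picks up the stray "/" because it sits between two quotation marks, while the capturing-group version returns only the two quoted words.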
