I'm a total novice to regex, and have a hard time wrapping my head around it. Right now I have a column filled with strings, but the only relevant text to my analysis is between quotation marks. I've tried this:
response$text <- stri_extract_all_regex(response$text, '"\\S+"')
but when I view response$text, the output comes out like this:
"\"caring\""
How do I change my regex expression so that instead the output reads:
caring
CodePudding user response:
Have a look at this cheat sheet. https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf
You're currently matching non-whitespace characters with \S, so change it to the class you actually need: for example \w for word characters, or \d if you only want digits.
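To see what each character class matches, here is a minimal sketch in base R. The sample vector `x` is made up for illustration and is not from the question.

```r
# Hypothetical sample input; not taken from the original question
x <- c('"caring"', '"42"')

# \S+ matches a run of non-whitespace characters between the quotes
non_space <- grepl('"\\S+"', x, perl = TRUE)

# \d+ matches a run of digits only, so it does not match "caring"
digits <- grepl('"\\d+"', x, perl = TRUE)

print(non_space)
print(digits)
```

Note that swapping the class changes which strings match, but the quotes remain part of the match as long as they are part of the pattern.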
CodePudding user response:
You can use
library(stringi)
response$text <- stri_extract_all_regex(response$text, '(?<=")[^\\s"]+(?=")')
Or, with stringr:
library(stringr)
response$text <- str_extract_all(response$text, '(?<=")[^\\s"]+(?=")')
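As a rough illustration of the lookaround pattern, the sketch below runs it on a small made-up data frame (the real `response` data is not shown in the question, so the sample rows are an assumption). It also uses `stri_extract_first_regex`, which may be more convenient when each row holds at most one quoted word, since it returns a plain character vector instead of a list.

```r
library(stringi)

# Hypothetical sample data; the real `response` data frame is not shown
response <- data.frame(
  text = c('She is "caring" overall', 'no quotes here'),
  stringsAsFactors = FALSE
)

# The lookbehind (?<=") and lookahead (?=") require the quotes to be there
# but keep them out of the matched text
pattern <- '(?<=")[^\\s"]+(?=")'

# One result per row as a character vector; NA where nothing matches
first_match <- stri_extract_first_regex(response$text, pattern)

# One list element per row, each holding all matches found in that row
all_matches <- stri_extract_all_regex(response$text, pattern)

print(first_match)
print(all_matches)
```

Because the `*_all_*` variant returns a list, assigning it to a data frame column produces a list column, which is worth keeping in mind for later analysis steps.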
However, when a string contains several quoted words, I'd rather use stringr::str_match_all:
library(stringr)
matches <- str_match_all(response$text, '"([^\\s"]+)"')
response$text <- lapply(matches, function(x) x[,2])
See this regex demo.
With the capturing group approach used in "([^\\s"]+)", it becomes possible to avoid overlapping matches between quoted substrings, and str_match_all comes in handy since the matches it returns contain the captured substrings as well (unlike the *extract* functions).
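The difference between the two approaches can be sketched on a made-up string in which unquoted text sits directly between two quoted words (the input `x` is hypothetical):

```r
library(stringr)

# Hypothetical input: unquoted "and" sits directly between two quoted words
x <- '"caring"and"kind"'

# Lookarounds do not consume the quotes, so the closing quote of "caring"
# can also act as the opening quote for the unquoted word and
lookaround_hits <- str_extract_all(x, '(?<=")[^\\s"]+(?=")')[[1]]

# The capturing-group pattern consumes both quotes of every match,
# so a quote cannot be reused by a neighbouring match
matches <- str_match_all(x, '"([^\\s"]+)"')
group_hits <- matches[[1]][, 2]  # column 1 is the full match, column 2 the group

print(lookaround_hits)
print(group_hits)
```

Here the lookaround version also picks up the unquoted `and`, while the capturing-group version returns only the two quoted words.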