I have been wondering how to extract a substring in R, using stringr or another package, between the exact phrase "to the" (which is always lowercase) and the second comma in a sentence.
For instance:
String: "This not what I want to the THIS IS WHAT I WANT, DO YOU SEE IT?, this is not what I want"
Desired output: "THIS IS WHAT I WANT, DO YOU SEE IT?"
I have this vector:
x<-c("This not what I want to the THIS IS WHAT I WANT, DO YOU SEE IT?, this is not what I want",
"HYU_IO TO TO to the I WANT, THIS, this i dont, want", "uiui uiu to the xxxx,,this is not, what I want")
and I am trying to use this code
str_extract(string = x, pattern = "(?<=to the ).*(?=\\,)")
but I can't seem to get it to return this:
"THIS IS WHAT I WANT, DO YOU SEE IT?"
"I WANT, THIS"
"xxxx,"
Thank you guys so much for your time and help
CodePudding user response:
You were close!
str_extract(string = x, pattern = "(?<=to the )[^,]*,[^,]*")
# [1] "THIS IS WHAT I WANT, DO YOU SEE IT?"
# [2] "I WANT, THIS"
# [3] "xxxx,"
The look-behind stays the same, [^,]* matches anything but a comma, then , matches exactly one comma, then [^,]* again matches anything but a comma.
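To make the difference concrete, here is a small sketch that runs the pattern from the question next to the corrected one on the same vector. The variable names greedy and bounded are only illustrative. It assumes stringr is installed; the point is that .* is greedy and runs to the last comma, while [^,]* cannot cross a comma at all.

```r
library(stringr)

x <- c("This not what I want to the THIS IS WHAT I WANT, DO YOU SEE IT?, this is not what I want",
       "HYU_IO TO TO to the I WANT, THIS, this i dont, want",
       "uiui uiu to the xxxx,,this is not, what I want")

# Pattern from the question: .* is greedy, so the match extends
# up to (but not including) the LAST comma in the string.
greedy <- str_extract(x, "(?<=to the ).*(?=,)")

# Corrected pattern: [^,]* cannot contain a comma, so the match is
# "text, comma, text" and stops right before the second comma.
bounded <- str_extract(x, "(?<=to the )[^,]*,[^,]*")

greedy
# [1] "THIS IS WHAT I WANT, DO YOU SEE IT?"
# [2] "I WANT, THIS, this i dont"
# [3] "xxxx,,this is not"

bounded
# [1] "THIS IS WHAT I WANT, DO YOU SEE IT?"
# [2] "I WANT, THIS"
# [3] "xxxx,"
```

The first element happens to come out right with the greedy pattern only because that string contains exactly two commas.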
CodePudding user response:
An alternative approach, not nearly as concise as Gregor Thomas's, but it gives the same result:
- vector to tibble
- separate twice: first by "to the ", then by ","
- paste together
- pull for vector output.
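library(tidyverse)
as_tibble(x) %>%
  # split at "to the "; the part after it lands in column b
  separate(value, c("a", "b"), sep = "to the ") %>%
  # split b at commas; extra = "drop" discards pieces past the second without a warning
  separate(b, c("first", "second"), sep = ",", extra = "drop") %>%
  # glue the two pieces back together around a single comma
  mutate(x = paste0(first, ",", second), .keep = "unused") %>%
  pull(x)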
[1] "THIS IS WHAT I WANT, DO YOU SEE IT?"
[2] "I WANT, THIS"
[3] "xxxx,"
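Since the question also allows solutions outside stringr, here is a rough base R sketch using the same look-behind pattern with regexpr() and regmatches(). The names pattern, hit and out are only illustrative. One assumption worth flagging: regmatches() silently drops elements that do not match, so the sketch fills a vector of NA first to keep the result aligned with x.

```r
x <- c("This not what I want to the THIS IS WHAT I WANT, DO YOU SEE IT?, this is not what I want",
       "HYU_IO TO TO to the I WANT, THIS, this i dont, want",
       "uiui uiu to the xxxx,,this is not, what I want")

# Same pattern as the stringr answer; perl = TRUE is needed for the look-behind.
pattern <- "(?<=to the )[^,]*,[^,]*"
m <- regexpr(pattern, x, perl = TRUE)

# regexpr() returns -1 for elements without a match.
hit <- as.integer(m) != -1L

# Start from NA so non-matching elements keep their position.
out <- rep(NA_character_, length(x))
out[hit] <- regmatches(x, m)

out
# [1] "THIS IS WHAT I WANT, DO YOU SEE IT?"
# [2] "I WANT, THIS"
# [3] "xxxx,"
```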