Extract string between exact word and pattern using stringr

Time:02-10

I have been wondering how to extract a string in R, using stringr or another package, between the exact phrase "to the" (which is always lowercase) and the second comma in a sentence.

For instance:

String: "This not what I want to the THIS IS WHAT I WANT, DO YOU SEE IT?, this is not what I want"

Desired output: "THIS IS WHAT I WANT, DO YOU SEE IT?"

I have this vector:

x<-c("This not what I want to the THIS IS WHAT I WANT, DO YOU SEE IT?, this is not what I want",
     "HYU_IO TO TO to the I WANT, THIS, this i dont, want", "uiui uiu to the xxxx,,this is not, what I want")

and I am trying to use this code:

str_extract(string = x, pattern = "(?<=to the ).*(?=\\,)")

but I can't seem to get it to give me this:

"THIS IS WHAT I WANT, DO YOU SEE IT?" 
"I WANT, THIS"           
"xxxx," 

Thank you guys so much for your time and help

CodePudding user response:

You were close!

str_extract(string = x, pattern = "(?<=to the )[^,]*,[^,]*")
# [1] "THIS IS WHAT I WANT, DO YOU SEE IT?"
# [2] "I WANT, THIS"                       
# [3] "xxxx,"      

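The look-behind stays the same. [^,]* matches any run of characters other than a comma, then , matches exactly one comma, then a second [^,]* matches everything up to (but not including) the second comma. Your version went too far because .* is greedy: it runs to the end of the string and backtracks only as far as the last comma.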
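
To make the difference concrete, here is a minimal sketch that runs the greedy pattern and the negated-class pattern side by side. It swaps in base R's regexpr() and regmatches() with perl = TRUE in place of stringr, purely so the snippet has no package dependency. The names greedy and bounded are illustrative, not from the question.

```r
x <- c("This not what I want to the THIS IS WHAT I WANT, DO YOU SEE IT?, this is not what I want",
       "HYU_IO TO TO to the I WANT, THIS, this i dont, want",
       "uiui uiu to the xxxx,,this is not, what I want")

# Helper: first match of a PCRE pattern in each string.
# Note: regmatches() drops strings with no match; all three match here.
first_match <- function(pattern, s) regmatches(s, regexpr(pattern, s, perl = TRUE))

# Greedy: .* runs to the end, then backtracks to the LAST comma.
greedy <- first_match("(?<=to the ).*(?=,)", x)

# Bounded: [^,]* cannot cross a comma, so the match ends before the second one.
bounded <- first_match("(?<=to the )[^,]*,[^,]*", x)

print(greedy)
# [1] "THIS IS WHAT I WANT, DO YOU SEE IT?"
# [2] "I WANT, THIS, this i dont"
# [3] "xxxx,,this is not"

print(bounded)
# [1] "THIS IS WHAT I WANT, DO YOU SEE IT?"
# [2] "I WANT, THIS"
# [3] "xxxx,"
```

The greedy pattern only looks right on the first string because that string happens to contain exactly two commas.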

CodePudding user response:

An alternative approach, not really comparable with Gregor Thomas's, but an alternative nonetheless:

  1. Convert the vector to a tibble.
  2. Separate twice: first by "to the ", then by ",".
  3. Paste the pieces back together.
  4. Pull the column to get vector output.
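library(tidyverse)

as_tibble(x) %>% 
  # split at "to the ": a = text before, b = text after
  separate(value, c("a", "b"), sep = "to the ") %>% 
  # split b at commas and keep the first two pieces as a and c;
  # extra = "drop" discards any later pieces without a warning
  separate(b, c("a", "c"), sep = ",", extra = "drop") %>% 
  # rejoin the two pieces with the comma that the split removed
  mutate(x = paste0(a, ",", c), .keep = "unused") %>% 
  pull(x)
# [1] "THIS IS WHAT I WANT, DO YOU SEE IT?"
# [2] "I WANT, THIS"                       
# [3] "xxxx,"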
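
The same split-and-paste idea can also be sketched in base R, without the tidyverse. This is only an illustration under two assumptions: every string contains "to the ", and at least one comma follows it. The names after, pieces and first_two are made up for the example.

```r
x <- c("This not what I want to the THIS IS WHAT I WANT, DO YOU SEE IT?, this is not what I want",
       "HYU_IO TO TO to the I WANT, THIS, this i dont, want",
       "uiui uiu to the xxxx,,this is not, what I want")

# Step 1: drop everything up to and including the first "to the "
# (the lazy .*? stops at the first occurrence).
after <- sub("^.*?to the ", "", x, perl = TRUE)

# Step 2: split what remains on literal commas.
pieces <- strsplit(after, ",", fixed = TRUE)

# Step 3: rejoin the first two pieces with the comma that the split removed.
# Caveat: a string with no comma after "to the " would yield "text,NA" here.
first_two <- vapply(pieces, function(p) paste(p[1:2], collapse = ","), character(1))

print(first_two)
# [1] "THIS IS WHAT I WANT, DO YOU SEE IT?"
# [2] "I WANT, THIS"
# [3] "xxxx,"
```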