split strings by pattern without deleting pattern strings

Time:07-13

My pattern is "p" followed by one or more "r", e.g., pr, prr, pr...r. I would like to split a string into its non-pattern pieces and ALL pattern matches, without deleting the matches. strsplit() does the splitting but drops every pr..r. stringr::str_extract_all() does the opposite: it extracts the pattern matches, but the non-pattern pieces are gone.

Is there a way to simply keep all strings but single out patterned strings?

x <- c("zprzzzprrrrrzpzr")

"z" "pr" "zzz" "prrrrr" "zpzr" # desired output; keep original character order

CodePudding user response:

This is a bit hacky but you can do one replacement to separate out the values you want with some separator character and then split on that separator character. For example

unlist(strsplit(gsub("(pr+)","~\\1~", x), "~"))
# [1] "z"      "pr"     "zzz"    "prrrrr" "zpzr" 

which will work fine if you don't have "~" in your string.
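
If a safe separator character cannot be guaranteed, one possible alternative is to extract both kinds of piece in a single pass rather than splitting. This is only a sketch and not part of the answer above: it assumes base R's regmatches() and gregexpr() with perl = TRUE, and a regex that matches either a run of the pattern or a run of characters that does not begin the pattern.

```r
x <- "zprzzzprrrrrzpzr"

# Match either the pattern ("p" followed by one or more "r"),
# or a run of characters none of which starts a "pr" match.
# The negative lookahead (?!pr) requires perl = TRUE.
tokens <- regmatches(x, gregexpr("pr+|(?:(?!pr).)+", x, perl = TRUE))[[1]]
tokens
# [1] "z"      "pr"     "zzz"    "prrrrr" "zpzr"
```

Because the two alternatives together cover every character, nothing is dropped and the original order is kept.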

CodePudding user response:

Here is a way using stringr. I would hope there is a way to make this a bit more concise.

  • Locate the pattern with str_locate_all()
  • Add one to all the end positions, so that we have split locations
  • Add the start and end positions to the vector to split correctly
  • Use the vectorized str_sub() to extract them all
library(stringr)

x <- c("zprzzzprrrrrzpzr")

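# Locate every match of "p" followed by one or more "r"
locs <- str_locate_all(x, "(pr+)")[[1]]
# Shift each end position by one so it marks where the next piece starts
locs[,2] <- locs[,2] + 1

# All piece boundaries: string start, match starts, positions after matches, one past the end
locs_all <- sort(c(1, locs, nchar(x) + 1))

# Each piece runs from one boundary up to the character just before the next boundary
str_sub(x, head(locs_all, -1), tail(locs_all, -1) - 1)
# [1] "z"      "pr"     "zzz"    "prrrrr" "zpzr"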
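
The locate-and-extract steps can also be wrapped in a small function so that matches at the very start or end of the string, or no matches at all, do not produce empty pieces. This is a rough sketch: split_keep_pattern() is a hypothetical name, not a stringr function, and it assumes a single non-empty input string.

```r
library(stringr)

# Hypothetical helper (not part of stringr): split one non-empty string
# into pieces, keeping every match of `pattern` as its own piece.
split_keep_pattern <- function(s, pattern) {
  locs <- str_locate_all(s, pattern)[[1]]
  # A piece starts at the string start, at each match start,
  # and at the position right after each match end.
  starts <- sort(unique(c(1, locs[, "start"], locs[, "end"] + 1)))
  # A match at the very end yields a start past the string; drop it.
  starts <- starts[starts <= nchar(s)]
  # Each piece ends just before the next start; the last ends with the string.
  ends <- c(starts[-1] - 1, nchar(s))
  str_sub(s, starts, ends)
}

split_keep_pattern("zprzzzprrrrrzpzr", "pr+")
# [1] "z"      "pr"     "zzz"    "prrrrr" "zpzr"
split_keep_pattern("przz", "pr+")
# [1] "pr" "zz"
split_keep_pattern("zzz", "pr+")
# [1] "zzz"
```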