I have a pattern that starts with "pr" and may be followed by more "r" characters, e.g., pr, prr, pr...r.
I would like to split a string into its non-pattern parts and ALL pattern matches, without deleting the matches. strsplit() does the split but deletes every pr...r match, while stringr::str_extract_all() extracts the matches but drops the non-pattern parts.
Is there a simple way to keep all the pieces while singling out the pattern matches?
x <- c("zprzzzprrrrrzpzr")
# desired output, keeping the original character order:
# "z" "pr" "zzz" "prrrrr" "zpzr"
CodePudding user response:
This is a bit hacky, but you can do one replacement that wraps each match in a separator character and then split on that separator. For example:
unlist(strsplit(gsub("(pr+)","~\\1~", x), "~"))
# [1] "z" "pr" "zzz" "prrrrr" "zpzr"
which will work fine if you don't have "~" in your string.
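To make that caveat explicit, the same idea can be wrapped in a small function that refuses to run when the separator already occurs in the input. This is only a minimal sketch: split_keep is a hypothetical helper name, and the default pattern and separator are assumptions taken from the example above.

```r
# Hypothetical helper: split `x` around every match of `pattern`, keeping the matches.
# `pattern` must contain one capture group; `sep` must not occur in `x`.
split_keep <- function(x, pattern = "(pr+)", sep = "~") {
  # Guard against the separator already being present in the input
  stopifnot(!grepl(sep, x, fixed = TRUE))
  # Wrap every match in separators, then split on the separator
  marked <- gsub(pattern, paste0(sep, "\\1", sep), x)
  parts <- unlist(strsplit(marked, sep, fixed = TRUE))
  # Drop the empty strings left by leading or adjacent matches
  parts[nzchar(parts)]
}

split_keep("zprzzzprrrrrzpzr")
# [1] "z" "pr" "zzz" "prrrrr" "zpzr"
```

Dropping empty strings matters when the input starts with a match or has two matches back to back, since both cases leave nothing between two separators.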
CodePudding user response:
Here is a way using stringr. I would hope there is a way to make this a bit more concise.
- Locate the pattern with str_locate_all()
- Add one to all the end positions, so that we have split locations
- Add the start and end positions of the string to the vector to split correctly
- Use the vectorized str_sub() to extract them all
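library(stringr)
x <- c("zprzzzprrrrrzpzr")
# Start and end positions of every "pr+" match
locs <- str_locate_all(x, "(pr+)")[[1]]
# Shift each end by one so it marks where the next piece starts
locs[,2] <- locs[,2] + 1
# Add the first position and one past the last, then sort into split points
locs_all <- sort(c(1, locs, nchar(x) + 1))
# Each piece runs from one split point to just before the next
str_sub(x, head(locs_all, -1), tail(locs_all, -1) - 1)
# [1] "z" "pr" "zzz" "prrrrr" "zpzr"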
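A more concise route might be to extract both kinds of pieces with a single alternation instead of computing positions. The sketch below swaps in base R (regmatches() and gregexpr() with perl = TRUE) rather than stringr, and assumes a negative lookahead is acceptable: the pattern matches either a "pr+" run or the longest stretch of characters in which no position starts a "pr".

```r
x <- "zprzzzprrrrrzpzr"

# First alternative: a run of "p" followed by one or more "r".
# Second alternative: one or more characters, none of which begins a "pr".
pieces <- regmatches(x, gregexpr("pr+|(?:(?!pr).)+", x, perl = TRUE))[[1]]
pieces
# [1] "z" "pr" "zzz" "prrrrr" "zpzr"
```

Because every character belongs to exactly one of the two alternatives, the pieces come back in their original order and no separator character is needed.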