I am trying to apply the following rule: chop the string at ; so that each piece has a maximum length of n.
For example,
n <- 4
string <- c("a;a;aabbbb;ccddee;ff")
output <- c("a;a;", "aabb", "bb;", "ccdd", "ee;", "ff")
For "aabb": the chunk "aabbbb" is longer than n = 4, so we chop by length, 4.
For "bb;": the chunk "bb;" is shorter than 4, so we next consider "bb;ccddee". However, that chunk would exceed 4, and the string already contains a ;, so we chop at the ;.
Currently, I can chop by either n or ; by using a regex:
num <- 4
splitvar <- ";"
## splits pattern
pattern <- paste0("(?<=.{", num, "}|", splitvar, ")")
> pattern
[1] "(?<=.{4}|;)"
string <- c("a;a;aabbbb;ccddee;ff")
strsplit(string, pattern, perl = TRUE)
[[1]]
[1] "a;" "a;" "aabb" "bb;" "ccdd" "ee;" "ff"
As you can see, we don't actually need to separate "a;" and "a;", since their combined length doesn't exceed n (2 + 2 = 4).
Does anyone have a solution for this? Thank you.
CodePudding user response:
Your regex matches any location that is preceded either by splitvar or by any num chars.
The pattern you need is one that matches either one to num - 1 chars (one, two, or three here) followed by splitvar or the end of the string, or exactly num chars other than the splitvar char.
So, you can use
num <- 4
splitvar <- ";"
pattern <- paste0(".{1,", num-1, "}(?:",splitvar,"|$)|[^",splitvar,"]{",num,"}")
pattern ## => .{1,3}(?:;|$)|[^;]{4}
string <- c("a;a;aabbbb;ccddee;ff")
unlist(regmatches(string, gregexpr(pattern, string)))
## => "a;a;" "aabb" "bb;" "ccdd" "ee;" "ff"
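The same steps can be wrapped in a small helper and checked against the rule. This is only a sketch: the name chop_string is hypothetical (it is not part of base R or stringr), and it assumes that splitvar is a single character with no special meaning in a regex (such as ;) and that num is at least 2. perl = TRUE is used here so that the PCRE engine does the matching.

```r
# Hypothetical helper: wraps the pattern construction shown above.
# Assumes `splitvar` is one literal character that needs no regex escaping
# and that `num` is at least 2.
chop_string <- function(string, num, splitvar = ";") {
  pattern <- paste0(".{1,", num - 1, "}(?:", splitvar, "|$)|[^", splitvar, "]{", num, "}")
  unlist(regmatches(string, gregexpr(pattern, string, perl = TRUE)))
}

string <- "a;a;aabbbb;ccddee;ff"
pieces <- chop_string(string, num = 4)

# No piece should be longer than num
all(nchar(pieces) <= 4)

# Pasting the pieces back together should reproduce the input
identical(paste(pieces, collapse = ""), string)
```

The round-trip check is worth keeping when trying other inputs. A ; that directly follows a run whose length is an exact multiple of num, with num or more non-; chars after it, is matched by neither alternative, so it is silently dropped: chop_string("aabb;ccddee", 4) gives "aabb" "ccdd" "ee".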
With stringr:
library(stringr)
unlist(str_extract_all(string, pattern))
Details:
.{1,3}(?:;|$) - one, two, or three chars (other than line break chars if you use stringr), as many as possible, and then a ; char or the end of the string
| - or
[^;]{4} - any four chars other than a ; char.
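To see which of the two alternatives produced each piece, the pieces can be labelled after extraction. This is a minimal sketch for the num = 4 case. It relies on the fact that a piece from the first alternative either contains a ; or is at most three chars long, so only the second alternative can yield exactly four non-; chars. The variable name by_length is just for illustration.

```r
string <- "a;a;aabbbb;ccddee;ff"
pattern <- ".{1,3}(?:;|$)|[^;]{4}"
pieces <- unlist(regmatches(string, gregexpr(pattern, string, perl = TRUE)))

# TRUE when the piece is exactly four non-";" chars, i.e. it was chopped by length
by_length <- grepl("^[^;]{4}$", pieces)

# One row per piece, with the alternative that matched it
data.frame(
  piece = pieces,
  alternative = ifelse(by_length, "[^;]{4}", ".{1,3}(?:;|$)")
)
```

Here "aabb" and "ccdd" come from the chop-by-length alternative, and the other four pieces end at a ; or at the end of the string.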