Apply strsplit by conditional

Time:05-23

I am trying to apply the following rule:

Chop the string at ; so that each piece reaches a maximum length of n.

For example,

n <- 4
string <- c("a;a;aabbbb;ccddee;ff")
output <- c("a;a;", "aabb", "bb;", "ccdd", "ee;", "ff")

For "aabb": the chunk "aabbbb" is longer than n = 4, so we chop by length, after 4 characters.

For "bb;": the chunk "bb;" is shorter than 4, so we next consider "bb;ccddee". However, that would exceed 4, and the piece already contains a ;, so we chop at the ;.

Currently, I can only achieve an "or" of the two conditions (split by length or split at ;) using a regex:

num <- 4
splitvar <- ";"

## splits pattern
pattern <- paste0("(?<=.{", num, "}|", splitvar, ")")

> pattern
[1] "(?<=.{4}|;)"

string <- c("a;a;aabbbb;ccddee;ff")
strsplit(string, pattern, perl = TRUE)
[[1]]
[1] "a;"   "a;"   "aabb" "bb;"  "ccdd" "ee;"  "ff"  

As you can see, we don't actually need to split "a;" and "a;", since their combined length doesn't exceed n (2 + 2 = 4).

Does anyone have a solution for this? Thank you.

CodePudding user response:

Your regex matches any location that is preceded by either a splitvar or any num chars, so the string is also split after every splitvar.

The pattern you seek matches either one to num - 1 chars followed by a splitvar or the end of the string, or exactly num chars other than the splitvar char.

So, you can use

num <- 4
splitvar <- ";"
pattern <- paste0(".{1,", num-1, "}(?:",splitvar,"|$)|[^",splitvar,"]{",num,"}")
pattern ## => .{1,3}(?:;|$)|[^;]{4}
string <- c("a;a;aabbbb;ccddee;ff")
unlist(regmatches(string, gregexpr(pattern, string)))
## => "a;a;" "aabb" "bb;"  "ccdd" "ee;"  "ff" 

With stringr:

library(stringr)
unlist(str_extract_all(string, pattern))
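
If you need this for different values of num, the pattern construction can be wrapped in a small function. This is only a minimal sketch: the name chop_string is hypothetical, and it assumes that the separator is a single character with no special meaning in a regex (such as ";") and that n is at least 2, so that the {1,n-1} quantifier stays valid.

```r
# Hypothetical helper wrapping the pattern above.
# Assumptions: `x` is a single string, `sep` is one ordinary character
# (no regex escaping is done), and `n` is at least 2.
chop_string <- function(x, n, sep = ";") {
  stopifnot(length(x) == 1, n >= 2, nchar(sep) == 1)
  pattern <- paste0(".{1,", n - 1, "}(?:", sep, "|$)|[^", sep, "]{", n, "}")
  unlist(regmatches(x, gregexpr(pattern, x)))
}

pieces <- chop_string("a;a;aabbbb;ccddee;ff", 4)
pieces
## => "a;a;" "aabb" "bb;"  "ccdd" "ee;"  "ff"

all(nchar(pieces) <= 4)  # no piece is longer than n
## => TRUE
```

A separator such as "." or "|" would need escaping before being pasted into the pattern, which this sketch deliberately leaves out.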


Details:

  • .{1,3}(?:;|$) - one, two, or three chars (other than line break chars if you use stringr) as many as possible, and then a ; char or end of string
  • | - or
  • [^;]{4} - any four chars other than a ; char.
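
Since regmatches() returns only the matched substrings, any character that neither alternative can consume is left out of the result. One way to check for that is to paste the pieces back together and compare with the input. The sketch below assumes n = 4 and ";" as in the question; round_trips is a hypothetical helper name. As far as I can tell, a ; that comes right after a full run of four non-; characters, and is followed by four or more non-; characters, is skipped this way.

```r
# Same pattern as above, written out for n = 4 and separator ";".
pattern <- ".{1,3}(?:;|$)|[^;]{4}"

# Hypothetical helper: TRUE when the matched pieces rebuild the input exactly.
round_trips <- function(x) {
  pieces <- unlist(regmatches(x, gregexpr(pattern, x)))
  identical(paste(pieces, collapse = ""), x)
}

round_trips("a;a;aabbbb;ccddee;ff")  # the example from the question
## => TRUE
round_trips("abcd;efgh")             # ";" right after a full 4-char run
## => FALSE
```

If your data can contain such runs, you may want to run a check like this before relying on the output.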