R splitting string on predefined location

I have a string that should be split into parts at "random" locations. The split always occurs at the next comma after a colon.

My idea was to find colons with

stringr::str_locate_all(test, ":") %>% 
  unlist()

then find commas

stringr::str_locate_all(test, ",") %>% 
  unlist()

and from there work out the positions where the string should be split, but I could not find a suitable way to do it. It looks like there are always 6 characters between the colon and the comma, but I can't be sure that holds for the whole data set.

Here is example string:

dput(test)
"AA,KK,QQ,JJ,TT,99,88:0.5083,66,55:0.8303,AK,AQ,AJs,AJo:0.9037,ATs:0.0024,ATo:0.5678"

Here is what result should be

dput(result)
c("AA,KK,QQ,JJ,TT,99,88:0.5083", "66,55:0.8303", "AK,AQ,AJs,AJo:0.9037", 
"ATs:0.0024", "ATo:0.5678")

CodePudding user response：

Perhaps we can use regmatches, as below

> regmatches(test, gregexpr("(\\w+,?)+:[0-9.]+", test))[[1]]
[1] "AA,KK,QQ,JJ,TT,99,88:0.5083" "66,55:0.8303"
[3] "AK,AQ,AJs,AJo:0.9037"        "ATs:0.0024"
[5] "ATo:0.5678"
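
To unpack the pattern: (\\w+,?)+ consumes a run of comma-separated tokens, : matches the colon, and [0-9.]+ takes the number that follows, so each match ends just before the comma where the split should occur. Below is a minimal sketch that re-runs this on the example from the question and compares it with the expected result; the names pattern and pieces are introduced here for illustration only.

```r
test <- "AA,KK,QQ,JJ,TT,99,88:0.5083,66,55:0.8303,AK,AQ,AJs,AJo:0.9037,ATs:0.0024,ATo:0.5678"
result <- c("AA,KK,QQ,JJ,TT,99,88:0.5083", "66,55:0.8303", "AK,AQ,AJs,AJo:0.9037",
            "ATs:0.0024", "ATo:0.5678")

# One or more tokens (each optionally followed by a comma),
# then a colon, then a run of digits and dots
pattern <- "(\\w+,?)+:[0-9.]+"
pieces <- regmatches(test, gregexpr(pattern, test))[[1]]

identical(pieces, result)
```

This relies on every token consisting of word characters and every value consisting only of digits and dots; negative numbers or scientific notation would need a wider character class.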

CodePudding user response：

Here is one option with strsplit: use gsub to replace the , that follows a digit, a . and one or more digits ([0-9]+) with a new delimiter, then split on that delimiter with strsplit in base R

result1 <- strsplit(gsub("([0-9]\\.[0-9]+),", "\\1;", test), ";")[[1]]

Checking:

> identical(result, result1)
[1] TRUE
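
If the values after the colon are not guaranteed to look like 0.1234, the same replace-then-split idea can be written so that it depends only on the colon and the comma. Below is a sketch that assumes perl = TRUE is acceptable (the \\K escape discards what has been matched so far, so only the comma itself gets replaced) and that the temporary delimiter, here ;, never occurs in the data. The names marked and result2 are introduced here for illustration.

```r
test <- "AA,KK,QQ,JJ,TT,99,88:0.5083,66,55:0.8303,AK,AQ,AJs,AJo:0.9037,ATs:0.0024,ATo:0.5678"

# ":[^,]*" walks from a colon up to the next comma; \\K drops that part
# from the match, so gsub replaces only the comma itself
marked <- gsub(":[^,]*\\K,", ";", test, perl = TRUE)

# Split on the temporary delimiter (assumed not to appear in the data)
result2 <- strsplit(marked, ";", fixed = TRUE)[[1]]
result2
```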

If the number of characters between the colon and the comma is fixed (6 in the example), use a regex lookbehind

result1 <-  strsplit(test, "(?<=:.{6}),", perl = TRUE)[[1]]
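
The position-based idea from the question can also be completed without assuming a fixed width. Below is a minimal base R sketch: locate the colons and commas, take the first comma after each colon as a cut point, and slice with substring. It uses gregexpr rather than stringr to stay dependency-free, and the names colons, commas, cut_at, starts, ends and pieces are introduced here for illustration.

```r
test <- "AA,KK,QQ,JJ,TT,99,88:0.5083,66,55:0.8303,AK,AQ,AJs,AJo:0.9037,ATs:0.0024,ATo:0.5678"

# Positions of every colon and every comma
colons <- as.integer(gregexpr(":", test, fixed = TRUE)[[1]])
commas <- as.integer(gregexpr(",", test, fixed = TRUE)[[1]])

# For each colon, the first comma that comes after it (NA if there is none)
cut_at <- vapply(colons, function(p) {
  nxt <- commas[commas > p]
  if (length(nxt) > 0) nxt[1] else NA_integer_
}, integer(1))
cut_at <- unique(cut_at[!is.na(cut_at)])

# Each piece runs from just after one cut to just before the next
starts <- c(1L, cut_at + 1L)
ends <- c(cut_at - 1L, nchar(test))
pieces <- substring(test, starts, ends)
pieces
```

The last colon has no comma after it, so it contributes no cut point and the final piece runs to the end of the string.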