I have some amino acid modifications, something like:

example <- c('_(Acetyl (Protein N-term))DDDIAAM(Oxidation (M))CK_')

I would like to split such a sequence into something like the following:

example2 <- c('_','(Acetyl (Protein N-term))','D','D','D','I','A','A','M','(Oxidation (M))','C','K','_')

But I don't know how to split such a string while keeping the content inside the brackets together. Is there any function or code that can help me do this?

Thanks, LeeLee

CodePudding user response：

Update

Borrowing the idea from @benson23 of inserting a special character, e.g., @, we can try the code below, which uses strsplit on top of nested (g)sub calls

unlist(
  lapply(
    unlist(
      strsplit(
        sub(
          "(.*)\\)", "\\1)@",
          sub(
            "\\(", "@(",
            gsub("(\\))([^()]+)(\\()", "\\1@\\2@\\3", example)
          )
        ), "@"
      )
    ),
    function(s) {
      if (startsWith(s, "(")) {
        s
      } else {
        strsplit(s, "")
      }
    }
  )
)

Here is a bulkier implementation that finds the paired brackets and then does the split

# split string by characters
v <- unlist(strsplit(example, ""))

# positions of "(" and ")"
a <- which(v == "(")
b <- which(v == ")")

# split as per the position of ")"
lst1 <- split(v, cumsum(replace(rep(0, length(v)), 1 + by(b, findInterval(b, a), max), 1)))

# split as per the position of "("
lst2 <- unlist(lapply(lst1, function(x) split(x, cumsum(x == "(") > 0)), recursive = FALSE)

# output
res <- unlist(
  lapply(
    lst2,
    function(s) {
      if (s[1] == "(") {
        paste0(s, collapse = "")
      } else {
        s
      }
    }
  ),
  use.names = FALSE
)

Test

Let's try a slightly trickier example, example <- c("_(Acetyl (Protein (N-term)) XXX) DDDIAAM(Oxidation (M))CK_"), and we will see res as

 [1] "_"                               "(Acetyl (Protein (N-term)) XXX)"
 [3] " "                               "D"
 [5] "D"                               "D"
 [7] "I"                               "A"
 [9] "A"                               "M"
[11] "(Oxidation (M))"                 "C"
[13] "K"                               "_"
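
The same "find the paired brackets" idea can also be sketched with a running depth counter instead of position bookkeeping. This is only a minimal sketch, assuming the brackets in the input are balanced; the function name split_keep_brackets is made up for illustration and is not part of any package.

```r
# Hypothetical helper (not from any package): split a string into single
# characters, but keep each top-level "(...)" group together as one token.
# Assumes the brackets in `x` are balanced.
split_keep_brackets <- function(x) {
  chars <- strsplit(x, "")[[1]]
  # nesting depth after each character: +1 for "(", -1 for ")"
  depth <- cumsum((chars == "(") - (chars == ")"))
  # a character belongs to a bracketed group if the depth is still positive,
  # or if it is the ")" that has just closed the group
  inside <- depth > 0 | chars == ")"
  # a new token starts at every character outside brackets
  # and at every top-level "("
  starts <- !inside | (chars == "(" & depth == 1)
  # glue the characters of each token back together
  unname(vapply(split(chars, cumsum(starts)), paste0, character(1), collapse = ""))
}

split_keep_brackets("_(Acetyl (Protein N-term))DDDIAAM(Oxidation (M))CK_")
```

Because only the depth matters, nested brackets such as (Acetyl (Protein (N-term)) XXX) stay in one piece without any special handling.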

CodePudding user response：

First insert a special character (here I choose "@") before and after each bracketed group that should be kept together. Then strsplit on the special character. This gives an intermediate vector, example_tmp.

library(magrittr)  # provides the %>% pipe

example_tmp <- gsub("(?<=\\w)(?=\\()", "@", example, perl = TRUE) %>%
  gsub("(?<=\\))(?=\\w)", "@", ., perl = TRUE) %>%
  strsplit("@") %>%
  unlist()

example_tmp
[1] "_"                         "(Acetyl (Protein N-term))"
[3] "DDDIAAM"                   "(Oxidation (M))"          
[5] "CK_"

Then use sapply to loop through the vector, and strsplit on strings that do not contain any brackets.

example2 <- unname(unlist(sapply(example_tmp, \(x) if (!grepl("\\(", x)) strsplit(x, "") else x)))

example2
 [1] "_"                         "(Acetyl (Protein N-term))"
 [3] "D"                         "D"                        
 [5] "D"                         "I"                        
 [7] "A"                         "A"                        
 [9] "M"                         "(Oxidation (M))"          
[11] "C"                         "K"                        
[13] "_"
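
If a special character such as "@" might already occur in the data, the marker step can be avoided. The following is only a sketch of an alternative, not the approach above: it extracts the tokens directly with one PCRE pattern that recurses into nested brackets, and it assumes the brackets are balanced. The variable names pattern and tokens are illustrative.

```r
example <- "_(Acetyl (Protein N-term))DDDIAAM(Oxidation (M))CK_"

# Group 1 matches a balanced "(...)" group: "(?1)" recurses into group 1
# for nested brackets, and "[^()]++" consumes everything else inside.
# The alternative "." matches any single character outside brackets.
pattern <- "(\\((?:[^()]++|(?1))*\\))|."

# perl = TRUE is required, since recursion is a PCRE feature
tokens <- regmatches(example, gregexpr(pattern, example, perl = TRUE))[[1]]
tokens
```

Since the pattern tries the bracketed alternative first, a whole modification is taken as one token, and every remaining character becomes its own token in the original order.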

CodePudding user response：

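Here's a solution with tidyverse. Note that it lists the bracketed modifications first and the single letters afterwards, so the original order of the sequence is not preserved: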

library(tidyverse)
data.frame(example) %>%
  mutate(
         # extract the strings with multiple uppercase letters:
         XX = paste0(unlist(str_extract_all(example, "[A-Z]{2,}")), collapse = "|"),
         # remove these strings from `example`:
         example = str_remove_all(example, XX),
         # split the multiple uppercase letter strings into single letters:
         XX = paste0(unlist(str_split(sub("\\|", "", XX), "(?<!^)(?!$)")), collapse = ","),
         # split `example` as appropriate:
         example = paste0(unlist(str_split(example, "(?<=\\)\\)|_)")), collapse = ","),
         # put everything together:
         res = paste0(example, XX, collapse = ",")
         ) %>%
  # remove obsolete columns:
  select(-c(example, XX))
                                                              res
1 _,(Acetyl (Protein N-term)),(Oxidation (M)),_,D,D,D,I,A,A,M,C,K