I have some amino acid modifications, something like:
example <- c('_(Acetyl (Protein N-term))DDDIAAM(Oxidation (M))CK_')
I would like to split such a sequence into a state similar to the following:
example2 <- c('_','(Acetyl (Protein N-term))','D','D','D','I','A','A','M','(Oxidation (M))','C','K','_')
But I don't know how to split such a string while keeping the content inside the brackets together. Is there any function or code that can help me do this?
Thanks, LeeLee
CodePudding user response:
Update
Borrowing the idea from @benson23 of inserting a special character (e.g., @), we can try the code below, which uses strsplit with nested (g)sub calls:
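unlist(
  lapply(
    unlist(
      strsplit(
        sub(
          # mark the end of the last bracket group
          "(.*)\\)", "\\1)@",
          sub(
            # mark the start of the first bracket group
            "\\(", "@(",
            # mark both sides of any text lying between two bracket groups
            gsub("(\\))([^()]+)(\\()", "\\1@\\2@\\3", example)
          )
        ), "@"
      )
    ),
    function(s) {
      # keep bracketed pieces whole, split everything else into characters
      if (startsWith(s, "(")) {
        s
      } else {
        strsplit(s, "")
      }
    }
  )
)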
Here is a bulky implementation to find the paired brackets and do the split
# split string by characters
v <- unlist(strsplit(example, ""))
# positions of "(" and ")"
a <- which(v == "(")
b <- which(v == ")")
# split as per the position of ")"
lst1 <- split(v, cumsum(replace(rep(0, length(v)), 1 + by(b, findInterval(b, a), max), 1)))
# split as per the position of "("
lst2 <- unlist(lapply(lst1, function(x) split(x, cumsum(x == "(") > 0)), recursive = FALSE)
# output
res <- unlist(
  lapply(
    lst2,
    function(s) {
      if (s[1] == "(") {
        paste0(s, collapse = "")
      } else {
        s
      }
    }
  ),
  use.names = FALSE
)
Test
Let's try a slightly trickier example,
example <- c("_(Acetyl (Protein (N-term)) XXX) DDDIAAM(Oxidation (M))CK_")
and we will see res as
[1] "_" "(Acetyl (Protein (N-term)) XXX)"
[3] " " "D"
[5] "D" "D"
[7] "I" "A"
[9] "A" "M"
[11] "(Oxidation (M))" "C"
[13] "K" "_"
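The paired-bracket idea above can also be sketched with a running depth counter. This is only an illustration and not part of the original answer: split_keep_brackets is a hypothetical helper name, and the sketch assumes the brackets in the input are balanced.

```r
# Hypothetical helper (not from the answers in this thread): split a string
# into single characters, keeping each outermost "(...)" group as one token.
# Assumes the brackets in `x` are balanced.
split_keep_brackets <- function(x) {
  chars <- strsplit(x, "")[[1]]
  # nesting depth after reading each character
  depth <- cumsum((chars == "(") - (chars == ")"))
  # a character is inside a group if the depth is still positive after it,
  # or if it is a ")" (including the one that closes the outermost group)
  inside <- depth > 0 | chars == ")"
  # a new token starts at every character outside brackets and at every "("
  # that opens an outermost group (depth goes from 0 to 1)
  starts <- !inside | (chars == "(" & depth == 1)
  pieces <- split(chars, cumsum(starts))
  unname(vapply(pieces, paste0, character(1), collapse = ""))
}

example <- c("_(Acetyl (Protein N-term))DDDIAAM(Oxidation (M))CK_")
tokens <- split_keep_brackets(example)
```

For the example from the question, tokens should match the example2 vector the question asks for. Because the function counts depth rather than matching text, it does not depend on what sits next to the brackets.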
CodePudding user response:
First insert a special character (here I choose "@") before and after the brackets that should be kept together. Then strsplit on the special character. This gives an intermediate example_tmp vector.
library(magrittr)  # provides the %>% pipe used below
example_tmp <- gsub("(?<=\\w)(?=\\()", "@", example, perl = TRUE) %>%
  gsub("(?<=\\))(?=\\w)", "@", ., perl = TRUE) %>%
  strsplit("@") %>%
  unlist()
example_tmp
[1] "_" "(Acetyl (Protein N-term))"
[3] "DDDIAAM" "(Oxidation (M))"
[5] "CK_"
Then use sapply to loop through the vector, and strsplit only the strings that do not contain any brackets.
example2 <- unname(unlist(sapply(example_tmp, \(x) if (!grepl("\\(", x)) strsplit(x, "") else x)))
example2
[1] "_" "(Acetyl (Protein N-term))"
[3] "D" "D"
[5] "D" "I"
[7] "A" "A"
[9] "M" "(Oxidation (M))"
[11] "C" "K"
[13] "_"
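The two steps above rely on each bracket group touching a word character on both sides. As an alternative sketch (a different technique, not part of this answer), a recursive PCRE pattern can match either a whole balanced bracket group or a single character in one pass. The pattern and variable names below are illustrative assumptions, and the sketch assumes balanced brackets.

```r
example <- c("_(Acetyl (Protein N-term))DDDIAAM(Oxidation (M))CK_")

# Group 1 matches "(" ... ")" and refers to itself through (?1) to allow
# nested brackets; the second alternative "." matches any other single character.
pattern <- "(\\((?:[^()]++|(?1))*\\))|."

# perl = TRUE is required: recursion and possessive quantifiers are PCRE features
tokens <- regmatches(example, gregexpr(pattern, example, perl = TRUE))[[1]]
```

Here tokens should hold the bracket groups as single elements and every other character on its own, in the original order.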
CodePudding user response:
Here's a solution with tidyverse:
library(tidyverse)
data.frame(example) %>%
  mutate(
    # extract the strings with multiple uppercase letters:
    XX = paste0(unlist(str_extract_all(example, "[A-Z]{2,}")), collapse = "|"),
    # remove these strings from `example`:
    example = str_remove_all(example, XX),
    # split the multiple uppercase letter strings into single letters:
    XX = paste0(unlist(str_split(sub("\\|", "", XX), "(?<!^)(?!$)")), collapse = ","),
    # split `example` as appropriate:
    example = paste0(unlist(str_split(example, "(?<=\\)\\)|_)")), collapse = ","),
    # put everything together:
    res = paste0(example, XX, collapse = ",")
  ) %>%
  # remove obsolete columns:
  select(-c(example, XX))
res
1 _,(Acetyl (Protein N-term)),(Oxidation (M)),_,D,D,D,I,A,A,M,C,K