I have this character vector:
protein = "ARTKQTARKSTGGKAPRKQLATKAARKSAPATGGVKKPHRYRPGTVALREIRRYQKSTELLIRKLPFQRLVREIAQDFKTDLRFQSSAVMALQEACEAYLVGLFEDTNLCAIHAKRVTIMPKDIQLARRIRGERA"
I want to split it into fragments after each occurrence of the letter R:
library(stringr)
peptide_fragments <- str_split(protein, "(?<=[R])")
From the resulting fragments, I want to drop the substrings that:
- don't contain the letter K
Then, from the remaining substrings, I want to drop:
- those shorter than 6 characters.
CodePudding user response:
Using a pure base R regex approach we can try:
protein <- "ARTKQTARKSTGGKAPRKQLATKAARKSAPATGGVKKPHRYRPGTVALREIRRYQKSTELLIRKLPFQRLVREIAQDFKTDLRFQSSAVMALQEACEAYLVGLFEDTNLCAIHAKRVTIMPKDIQLARRIRGERA"
parts <- strsplit(protein, "(?<=R)", perl=TRUE)[[1]]
output <- grep("^(?=.*K).{6,}$", parts, value=TRUE, perl=TRUE)
output
[1] "TKQTAR" "KSTGGKAPR"
[3] "KQLATKAAR" "KSAPATGGVKKPHR"
[5] "YQKSTELLIR" "KLPFQR"
[7] "EIAQDFKTDLR" "FQSSAVMALQEACEAYLVGLFEDTNLCAIHAKR"
[9] "VTIMPKDIQLAR"
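To see why this works, here is a minimal sketch on made-up toy input (`toy`, `parts_toy` and `fragments` are illustrative names, not part of the question's data). It suggests that the lookbehind split keeps each R at the end of its fragment, and that the single pattern `^(?=.*K).{6,}$` selects the same elements as two separate checks, one for "contains K" and one for "at least 6 characters".

```r
# Toy input (hypothetical, not the protein from the question).
toy <- "AKRRBKCDEFR"

# (?<=R) is a zero-width lookbehind: split right after every R,
# so the R stays attached to the fragment that precedes the cut.
parts_toy <- strsplit(toy, "(?<=R)", perl = TRUE)[[1]]
print(parts_toy)

# Hand-written fragments to illustrate the filter step.
fragments <- c("AR", "TKQTAR", "PGTVALR", "KAR", "KLPFQR")

# One regex: (?=.*K) requires a K somewhere, .{6,} requires length >= 6.
by_regex <- grep("^(?=.*K).{6,}$", fragments, value = TRUE, perl = TRUE)

# The same filter written as two explicit conditions.
by_conditions <- fragments[grepl("K", fragments, fixed = TRUE) &
                           nchar(fragments) >= 6]

print(by_regex)
print(identical(by_regex, by_conditions))
```

On this toy vector only "TKQTAR" and "KLPFQR" pass both conditions: "PGTVALR" is long enough but has no K, and "KAR" has a K but is too short.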
CodePudding user response:
If you want to split after "R":
library(stringr)
temp <- unlist(str_split(protein, "(?<=R)"))
# keep fragments that contain a K and are at least 6 characters long
res <- temp[grepl("K", temp) & nchar(temp) >= 6]
Result:
res
[1] "TKQTAR" "KSTGGKAPR"
[3] "KQLATKAAR" "KSAPATGGVKKPHR"
[5] "YQKSTELLIR" "KLPFQR"
[7] "EIAQDFKTDLR" "FQSSAVMALQEACEAYLVGLFEDTNLCAIHAKR"
[9] "VTIMPKDIQLAR"
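If the same filtering is needed for several sequences, the steps could be wrapped in a small helper. The sketch below uses base R so it runs without extra packages; the function name `filter_fragments` and its arguments are hypothetical, and it assumes `cut_after` is a single literal letter with no regex metacharacters.

```r
# Hypothetical helper; name and arguments are illustrative only.
filter_fragments <- function(sequence, cut_after = "R",
                             must_contain = "K", min_length = 6) {
  # Build a zero-width lookbehind so the split happens after `cut_after`.
  pattern <- paste0("(?<=", cut_after, ")")
  fragments <- strsplit(sequence, pattern, perl = TRUE)[[1]]

  # Keep fragments that contain `must_contain` and are long enough.
  keep <- grepl(must_contain, fragments, fixed = TRUE) &
    nchar(fragments) >= min_length
  fragments[keep]
}

# Short made-up sequence: splits into "AR", "TKQTAR", "KSTGGKAPR", "KAR".
short_result <- filter_fragments("ARTKQTARKSTGGKAPRKAR")
print(short_result)
```

Calling `filter_fragments(protein)` with the sequence from the question should give the same nine fragments as `res` above.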