Splitting sequence of letters, whilst retaining original sequence position

Time:05-05

I need to split the following sequence of letters into distinct chunks

SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC

I have used the following code, provided by a previous user, to achieve what I initially wanted, which was to split the sequence after every C.

library(dplyr)

TestSequence <- "SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC"

Test <- strsplit(TestSequence, "(?<=[C])", perl = TRUE) %>% unlist 

df <- data.frame(Fragment = Test) %>%
  mutate("position" = cumsum(nchar(Test)))

This allowed me to split the sequence after every C and retain its position in the sequence, for example C at positions 2, 11, etc.

Now I need to split the same sequence at different locations, which I can do using the following to split after P,A,G or S:

Test2 <- strsplit(TestSequence, "(?<=[P,A,G,S])", perl = TRUE) %>% unlist

This is fine if I want to split after a given character, but if I try to split before a character, for example D, I cannot seem to retain the D in the fragment. It is only retained if the split comes after the D.

I have tried every combination of lookbehind and lookahead I can think of. The following cuts both before and after every D, which isn't that useful.

Test3 <- strsplit(TestSequence, "(?=[D])", perl = TRUE) %>% unlist

Also, is there a way to retain the exact position of every C in the original sequence?

So if I were to split the test sequence after the initial K, I'd have the fragment SCDK. Could I have a separate column that tells me where the C was in the original sequence? As a second example, the next fragment would be SFNRGECSCDK, and that separate column would say the Cs were originally at positions 11 and 13.

CodePudding user response:

strsplit does not properly handle the zero-length matches produced by lookahead-only patterns: after splitting before a D, it also splits off the D itself as a separate one-character fragment.
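
To make the behaviour concrete, here is a minimal sketch on a shortened, made-up sequence (`short_seq` is hypothetical, not from the question). As far as I understand strsplit, it restarts the search on the remaining text after each split, and a zero-length match at the very start of that remainder splits off a single character:

```r
# Hypothetical short input, used only for illustration
short_seq <- "SCDKSCDK"

# Lookahead-only pattern: every D ends up isolated in its own fragment
naive <- strsplit(short_seq, "(?=D)", perl = TRUE)[[1]]
print(naive)
# [1] "SC"  "D"   "KSC" "D"   "K"
```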

In this case, you need to "anchor" the matches on the left, too. Either use a non-word boundary, or a lookbehind that disallows the match at the start of string:

TestSequence <- "SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC"
strsplit(TestSequence, "\\B(?=D)", perl = TRUE)
# => [[1]]
# => [1] "SC"          "DKSFNRGECSC" "DKSFNRGECSC" "DKSFNRGEC"  
 
strsplit(TestSequence, "(?<!^)(?=D)", perl = TRUE)
# => [[1]]
# => [1] "SC"          "DKSFNRGECSC" "DKSFNRGECSC" "DKSFNRGEC"  


The \B(?=D) pattern matches a location that is immediately preceded by a word character and immediately followed by D (\B is a non-word boundary, and D itself is a word character).

The (?<!^)(?=D) pattern matches a location that is not at the start of the string and is immediately followed by D.
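
For the second part of the question (where each C sat in the original sequence), one possible approach is to compute each fragment's start and end from the cumulative fragment lengths and then look up which C positions fall inside that range. This is only a sketch in base R; the column names `start`, `end` and `C_position` are my own choice, and the split pattern can be swapped for any other one that keeps all characters.

```r
TestSequence <- "SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC"

# Split before every D, keeping the D at the start of its fragment
fragments <- strsplit(TestSequence, "(?<!^)(?=D)", perl = TRUE)[[1]]

# End and start of each fragment in the original sequence (1-based)
end_pos   <- cumsum(nchar(fragments))
start_pos <- end_pos - nchar(fragments) + 1

# Positions of every C in the original sequence
c_pos <- as.integer(gregexpr("C", TestSequence, fixed = TRUE)[[1]])

# For each fragment, collect the original positions of the Cs it contains
c_in_fragment <- mapply(
  function(s, e) paste(c_pos[c_pos >= s & c_pos <= e], collapse = ","),
  start_pos, end_pos
)

df <- data.frame(Fragment = fragments, start = start_pos, end = end_pos,
                 C_position = c_in_fragment, stringsAsFactors = FALSE)
print(df)
```

Because the positions are taken from the original string rather than from the fragments, the same lookup works for a split after K, after C, or at any other location.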

Also, note that [P,A,G,S] matches P, A, G, S and a comma. You should use [PAGS] to match one of the letters.
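
The difference only shows up when the input actually contains a comma, so here is a small sketch with a made-up string (`x` is hypothetical; the test sequence from the question has no commas and splits the same way with both patterns):

```r
# Hypothetical input containing a comma, used only to expose the difference
x <- "SA,CD"

# The commas inside the class are literal, so the string is also split after ","
with_commas <- strsplit(x, "(?<=[P,A,G,S])", perl = TRUE)[[1]]
print(with_commas)
# [1] "S"  "A"  ","  "CD"

# Only the four letters are listed, so the comma stays inside its fragment
letters_only <- strsplit(x, "(?<=[PAGS])", perl = TRUE)[[1]]
print(letters_only)
# [1] "S"   "A"   ",CD"
```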
