Tracing where the original character was

I am trying to write a function that splits a sequence of letters such as the one shown below.

SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC

I want to be able to split the sequence after every C and can do that using the following code:

TestSequence <- "SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC"

test <- strsplit(TestSequence, "(?<=[C])", perl = TRUE )

Printing the result after unlisting gives: "SC" "DKSFNRGEC" "SC" "DKSFNRGEC" "SC" "DKSFNRGEC"

However, I would like to be able to trace each C in the output back to its location in the original sequence. It would be useful, for example, if every letter had a number I could relate back to. For the initial SC, I'd be able to say that its C is the first C in the whole sequence, the next SC has the C that is third in the sequence, and so on.

Can anyone think of a way of being able to trace back where the split characters were in the original sequence? I'm sure there is a better way than I have suggested above.

CodePudding user response:

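Something along these lines? Each fragment ends in a C, so the cumulative sum of the fragment lengths gives the position of that C in the original sequence.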

library(dplyr)

TestSequence <- "SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC"

fragments <- strsplit(TestSequence, "(?<=[C])", perl = TRUE) %>% unlist

data.frame(fragment = fragments) %>%
  mutate(position = cumsum(nchar(fragment)))

output:

##    fragment position
## 1        SC        2
## 2 DKSFNRGEC       11
## 3        SC       13
## 4 DKSFNRGEC       22
## 5        SC       24
## 6 DKSFNRGEC       33
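
If you also want the ordinal of each C (first, second, third, ...) and would rather not load dplyr, a base R sketch along the same lines could look like this. The column names start, end and c_index and the helper has_c are my own choices for illustration, not anything from the question. A trailing fragment without a C is given NA as its ordinal, which is an assumption about how you would want that case handled.

```r
TestSequence <- "SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC"

# Split after every C, as in the question
fragments <- unlist(strsplit(TestSequence, "(?<=C)", perl = TRUE))

# TRUE for fragments that actually end in a C
# (a trailing fragment would not, if the sequence does not end in C)
has_c <- endsWith(fragments, "C")

# End of each fragment in the original sequence (1-based)
end <- cumsum(nchar(fragments))

traced <- data.frame(
  fragment = fragments,
  start    = end - nchar(fragments) + 1L,  # first letter of the fragment
  end      = end,                          # last letter (the C, if present)
  c_index  = ifelse(has_c, cumsum(has_c), NA_integer_)  # 1st C, 2nd C, ...
)

traced
```

For the test sequence the end column comes out as 2, 11, 13, 22, 24, 33, matching the dplyr answer, and c_index runs from 1 to 6. You can check the mapping with substring(TestSequence, traced$end, traced$end), which should return "C" for every row that has a c_index.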