I am trying to write a function that splits a sequence of letters such as the one shown below.
SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC
I want to be able to split the sequence after every C and can do that using the following code:
TestSequence <- "SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC"
test <- strsplit(TestSequence, "(?<=[C])", perl = TRUE )
Printing the result after unlisting gives the following:
"SC" "DKSFNRGEC" "SC" "DKSFNRGEC" "SC" "DKSFNRGEC"
However, I would like to be able to trace each C in the output back to its location in the original sequence. It would be useful, for example, if every letter had a number I could relate back to. For the initial SC, I'd be able to say that its C is the first C in the whole sequence, the next SC has a C that is the third in the sequence, and so on.
Can anyone think of a way to trace back where the split characters were in the original sequence? I'm sure there is a better way than the one I have suggested above.
CodePudding user response:
Something along these lines?
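library(dplyr)
TestSequence <- "SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC"
# Split after every C (the zero-width lookbehind keeps the C in each fragment)
fragments <- strsplit(TestSequence, "(?<=[C])", perl = TRUE) %>% unlist()
# The cumulative fragment length is the position of each fragment's last letter
data.frame(fragment = fragments) %>%
  mutate(position = cumsum(nchar(fragment)))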
Output:
##    fragment position
## 1        SC        2
## 2 DKSFNRGEC       11
## 3        SC       13
## 4 DKSFNRGEC       22
## 5        SC       24
## 6 DKSFNRGEC       33
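The question also asks which C each fragment ends with (first, second, third, ...), not just where it sits. The following is a minimal base-R sketch of one way to get both. It assumes the same split as above. The column names `start`, `end` and `c_number` and the helper variables are made up for illustration, and a trailing fragment that does not end in C (not the case in this example) is given `NA`.

```r
TestSequence <- "SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC"

# Split after every C, as in the question
fragments <- unlist(strsplit(TestSequence, "(?<=C)", perl = TRUE))

# Start and end position of each fragment in the original sequence (1-based)
end_pos   <- cumsum(nchar(fragments))
start_pos <- end_pos - nchar(fragments) + 1

# A trailing fragment may not end in C, so check instead of assuming
ends_in_c <- endsWith(fragments, "C")

result <- data.frame(
  fragment = fragments,
  start    = start_pos,
  end      = end_pos,
  # Running count of C's: 1 for the first C, 2 for the second, and so on
  c_number = ifelse(ends_in_c, cumsum(ends_in_c), NA_integer_)
)

# Cross-check against the C positions found directly in the original string
c_positions <- as.integer(gregexpr("C", TestSequence, fixed = TRUE)[[1]])
stopifnot(identical(result$end[ends_in_c], c_positions))

print(result)
```

Here the `end` column matches the `position` column from the dplyr version, and `substr(TestSequence, result$end[i], result$end[i])` should return "C" for any row `i` where `c_number` is not `NA`.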