My question might not be clear, so I'll explain the problem with a simple example.
Suppose there is a character string x = "AAATTTGGAA".
I want to split x into runs of consecutive identical letters: "AAA", "TTT", "GG", "AA".
The unique letter of each chunk is then "A", "T", "G", "A", so the expected output is "ATGA".
How can I get this?
I apologize if this is a duplicate, but I could not find an existing question about this problem.
CodePudding user response:
Here is an approach that uses a regex trick:
x <- "AAATTTGGAA"
out <- strsplit(x, "(?<=(.))(?!\\1)", perl=TRUE)[[1]]
out
[1] "AAA" "TTT" "GG" "AA"
The regex pattern used here says to split at any boundary where the preceding and following characters are different.
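(?<=(.)) is a lookbehind that also captures the preceding character in group \1
(?!\\1) is a negative lookahead asserting that the following character differs from \1 (the backslash is doubled because the pattern is written as an R string)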
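The split above yields the chunks, while the question asks for the string "ATGA". A minimal sketch of one way to finish, assuming base R only: since every chunk consists of a single repeated letter, taking the first character of each chunk and pasting the results together should give the expected output. The variable names chunks, result, and result_rle are illustrative and not from the original answer; the rle() variant is an alternative technique, not part of the regex approach.

```r
x <- "AAATTTGGAA"

# Split into runs of identical characters (same pattern as in the answer).
chunks <- strsplit(x, "(?<=(.))(?!\\1)", perl = TRUE)[[1]]

# Each chunk holds one repeated letter, so its first character
# is that chunk's unique letter.
result <- paste(substr(chunks, 1, 1), collapse = "")
result
# [1] "ATGA"

# Alternative without a regex: split into single characters and
# run-length encode them; the run values are the unique letters.
result_rle <- paste(rle(strsplit(x, "")[[1]])$values, collapse = "")
result_rle
# [1] "ATGA"
```

If only the collapsed string is needed, the rle() version skips the intermediate chunks; the regex version is handy when the chunks themselves are also of interest.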