How to extract unique letters among word of consecutive letters?

Time:03-23

My question may not be clear, so I'll explain the problem with a simple example.

Suppose there is a character string x = "AAATTTGGAA".

I want to split x into runs of consecutive identical letters: "AAA", "TTT", "GG", "AA".

The unique letter of each chunk is then "A", "T", "G", "A", so the expected output is ATGA.

How can I get this?

I apologize if this is a duplicate, but I could not find an existing question about this problem.

CodePudding user response:

Here is a regex-based approach that splits the string into its runs of identical letters:

x <- "AAATTTGGAA"
out <- strsplit(x, "(?<=(.))(?!\\1)", perl=TRUE)[[1]]
out

[1] "AAA" "TTT" "GG"  "AA"

The regex pattern used here says to split at any boundary where the preceding and following characters are different.

(?<=(.))  lookbehind and also capture preceding character in \1
(?!\\1)   then lookahead and assert that following character is different
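
To get from these chunks to the ATGA output asked for in the question, one option is to take the first character of each chunk and paste the results together. This is only a minimal sketch and not part of the original answer; the names chunks, result, and result_rle are illustrative, and the rle() line is an alternative base R cross-check that avoids the regex altogether.

```r
x <- "AAATTTGGAA"

# Split into runs of identical characters (same pattern as above).
chunks <- strsplit(x, "(?<=(.))(?!\\1)", perl = TRUE)[[1]]

# Every chunk holds a single repeated letter, so its first character
# is the unique letter of that chunk.
result <- paste(substr(chunks, 1, 1), collapse = "")
result
# [1] "ATGA"

# Cross-check: run-length encode the individual characters and keep
# only the value of each run.
result_rle <- paste(rle(strsplit(x, "")[[1]])$values, collapse = "")
```

Both approaches keep repeated letters that occur in separate runs (the trailing "A" here), which is what distinguishes this from simply calling unique() on the characters.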
Tags: r