I am trying to find a regular expression that tells me whether a dinucleotide (two letters) appears twice in a row in my sequence. Here is an example:
Let's suppose I have this sequence (the character ; is there only to make clear that I am talking about dinucleotides):
"AT;GC;TA;CC;AG;AG;CC;CA;TA;TA"
The result I expect is that it matches the patterns AGAG and TATA.
I have already tried this, but it fails because it matches any two consecutive dinucleotides, not the same dinucleotide twice:
([ATGC]{2}){2}
CodePudding user response:
You will need to use backreferences.
Start with matching one pair:
[ATGC]{2}
This will match any two of the four letters.
You need to put that in capturing parentheses and refer to the contents of the parentheses with \1, like so:
([ATGC]{2});\1
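As a rough illustration, this is how the pattern might be applied with Python's `re` module. The choice of Python and the helper name `repeated_pairs` are assumptions made for the example, and the sketch assumes the `;` separators really are present in the string.

```python
import re

# Assumption: dinucleotides are separated by ';' as in the question's example.
PAIR_REPEAT = re.compile(r"([ATGC]{2});\1")

def repeated_pairs(seq):
    """Return each dinucleotide that is immediately followed by itself."""
    # group(1) holds the two letters captured by the parentheses.
    return [m.group(1) for m in PAIR_REPEAT.finditer(seq)]

print(repeated_pairs("AT;GC;TA;CC;AG;AG;CC;CA;TA;TA"))  # ['AG', 'TA']
```

If the real sequence has no separators, dropping the `;` gives `([ATGC]{2})\1`, but that version can also match out of frame: in `GATATC` it finds `ATAT` starting at offset 1, which straddles the pairs `GA`, `TA`, `TC`. In that case the match offset would probably need to be checked for evenness as well.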
CodePudding user response:
Suppose the string were
"TA;TA;GC;TA;CC;AG;AG;CC;CA;TA;TA"
 ^^ ^^          ^^ ^^       ^^ ^^
If you wish to match "TA" twice (and "AG" once), you could apply @Andy's solution.
If you wish to match "TA" just once, no matter how many instances of "TA;TA" are in the string, you could match
([ATGC]{2});\1(?!.*\1;\1)
and retrieve the contents of capture group 1.
The expression can be broken down as follows.
([ATGC]{2})  # match two characters, each from the character class,
             # and save to capture group 1
;\1          # match ';' followed by the content of capture group 1
(?!          # begin a negative lookahead
  .*         # match zero or more characters
  \1;\1      # match the content of capture group 1 followed by ';'
             # followed by the content of capture group 1
)            # end negative lookahead
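For anyone who wants to try this, here is a minimal sketch in Python (an assumed language choice; the names `LAST_REPEAT` and `distinct_repeated_pairs` are made up for the example). It writes the same expression with `re.VERBOSE` so the comments can sit next to each piece.

```python
import re

# Same expression as in the breakdown; whitespace and '#' comments
# inside the pattern are ignored because of re.VERBOSE.
LAST_REPEAT = re.compile(r"""
    ([ATGC]{2})   # two letters from the class, saved as group 1
    ;\1           # ';' followed by the same two letters
    (?!           # negative lookahead: reject this position if, later on,
      .*\1;\1     #   the same doubled pair occurs again
    )
""", re.VERBOSE)

def distinct_repeated_pairs(seq):
    """Return each doubled dinucleotide once (hypothetical helper)."""
    # With a single capture group, findall returns the group 1 strings.
    return LAST_REPEAT.findall(seq)

seq = "TA;TA;GC;TA;CC;AG;AG;CC;CA;TA;TA"
print(distinct_repeated_pairs(seq))  # ['AG', 'TA']
```

Because the lookahead rejects every occurrence except the last one, each pair is reported at its final position in the string, so the results come out in order of last occurrence rather than first.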