Regular Expression Nucleotide Search

Time:01-25

I am trying to find a regular expression that tells me whether a dinucleotide (two letters) appears twice in a row in my sequence. Here is an example:

Suppose I have this sequence (the ; characters are only there to mark the dinucleotide boundaries):

"AT;GC;TA;CC;AG;AG;CC;CA;TA;TA"

I expect the pattern to match AGAG and TATA.

I have already tried the following, but it fails because it matches any two consecutive dinucleotides, not the same dinucleotide repeated:

([ATGC]{2}){2}

CodePudding user response:

You will need to use backreferences.

Start with matching one pair:

[ATGC]{2}

will match any pair of two of the four letters.

You need to put that in capturing parentheses and refer to the contents of the parentheses with \1, like so:

([ATGC]{2});\1
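As a quick check, the pattern can be tried with Python's re module on the sequence from the question. This is only a sketch: the helper name find_adjacent_repeats is made up for illustration, and it assumes the dinucleotides really are separated by ';' as in the example string.

```python
import re

# ([ATGC]{2}) captures one dinucleotide; ;\1 requires the very same
# two letters again right after the ';' separator.
ADJACENT_REPEAT = re.compile(r"([ATGC]{2});\1")


def find_adjacent_repeats(seq):
    """Return every 'XY;XY' occurrence found in seq (hypothetical helper)."""
    return [m.group(0) for m in ADJACENT_REPEAT.finditer(seq)]


print(find_adjacent_repeats("AT;GC;TA;CC;AG;AG;CC;CA;TA;TA"))
# prints ['AG;AG', 'TA;TA']
```

finditer is used instead of findall because, with a capturing group in the pattern, findall would return only the group contents ("AG", "TA") rather than the full doubled match.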

CodePudding user response:

Suppose the string were

"TA;TA;GC;TA;CC;AG;AG;CC;CA;TA;TA"
 ^^ ^^          ^^ ^^       ^^ ^^

If you wish to match "TA" twice (and "AG" once) you could apply @Andy's solution.

If you wish to match "TA" just once, no matter the number of instances of "TA;TA" in the string, you could match

([ATGC]{2});\1(?!.*\1;\1)

and retrieve the contents of capture group 1.


The expression can be broken down as follows.

([ATGC]{2}) # match two characters, each from the character class,
            # and save to capture group 1
;\1         # match ';' followed by the content of capture group 1 
(?!         # begin a negative lookahead
  .*        # match zero or more characters other than line terminators
  \1;\1     # match the content of capture group 1 followed by ';'
            # followed by the content of capture group 1
)           # end negative lookahead
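The breakdown above maps directly onto Python's verbose regex mode. The following is a minimal sketch, assuming Python's re module and the ';'-separated format from the example; the function name unique_repeats is hypothetical. Because of the negative lookahead, only the last occurrence of each doubled dinucleotide is matched, so each one is reported once.

```python
import re

UNIQUE_REPEAT = re.compile(
    r"""
    ([ATGC]{2})   # capture one dinucleotide in group 1
    ;\1           # ';' followed by the same dinucleotide
    (?!           # negative lookahead: reject this match if ...
      .*\1;\1     # ... the same doubled dinucleotide occurs again later
    )
    """,
    re.VERBOSE,
)


def unique_repeats(seq):
    """Return each doubled dinucleotide once (hypothetical helper)."""
    return [m.group(1) for m in UNIQUE_REPEAT.finditer(seq)]


print(unique_repeats("TA;TA;GC;TA;CC;AG;AG;CC;CA;TA;TA"))
# prints ['AG', 'TA']
```

Note that the results come out in the order of each dinucleotide's last doubled occurrence, which is why "AG" is listed before "TA" here even though "TA;TA" also appears at the start of the string.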