Repeating patterns without any specific subpatterns in them

Time:12-06

I'm working with DNA sequences, and I would like to find sequences that code for a protein. Such sequences begin with "TAC" and end with "ATT", "ATC" or "ACT". I would also like at least 10 triplets between the first triplet and the last one, and none of these 10 or more triplets may be "TAC", "ATT", "ATC" or "ACT".

I created this regex: ".*(TAC(...){10,}(ATT|ATC|ACT)).*" but it's obviously not enough. For example, "TACTTCATCGATAGGAGAGGGCCCATTTAACCCATC" matches, and I don't want it to. It matches because there are 10 triplets between "TAC" and the second "ATC", but I don't want that extra "ATC" in between.
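The behaviour described above can be reproduced with a minimal sketch. It assumes Python's `re` module (the question does not name a regex engine), and `ORIGINAL` and `seq` are illustrative names only:

```python
import re

# The regex from the question: "..." accepts ANY three characters as a triplet,
# so stop triplets such as "ATC" are allowed in the middle.
ORIGINAL = re.compile(r".*(TAC(...){10,}(ATT|ATC|ACT)).*")

seq = "TACTTCATCGATAGGAGAGGGCCCATTTAACCCATC"
m = ORIGINAL.fullmatch(seq)

# The match succeeds even though the third triplet is "ATC".
print(m is not None)
print(m.group(1) if m else None)
```

The inner "ATC" is consumed as one of the ten arbitrary triplets, which is exactly the unwanted case.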

CodePudding user response:

Try:

TAC(?:(?!TAC|ATT|ATC|ACT)...){10,}(?:ATT|ATC|ACT)



TAC - match TAC

(?:(?!TAC|ATT|ATC|ACT)...){10,} - match a group of 3 characters at least 10 times. The negative lookahead (?!...) rejects any group that is TAC, ATT, ATC or ACT

(?:ATT|ATC|ACT) - match ATT or ATC or ACT at the end.
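The breakdown above can be checked with a short sketch. It assumes Python's `re` module, since no engine is named in the thread, and `find_coding_regions` is a hypothetical helper name introduced only for illustration:

```python
import re

# TAC, then at least 10 triplets that are none of TAC/ATT/ATC/ACT,
# then one closing triplet (ATT, ATC or ACT).
CODING = re.compile(r"TAC(?:(?!TAC|ATT|ATC|ACT)...){10,}(?:ATT|ATC|ACT)")

def find_coding_regions(seq):
    """Return every non-overlapping match of CODING in seq, left to right."""
    return [m.group(0) for m in CODING.finditer(seq)]

# The sequence from the question is rejected: its third triplet is "ATC".
print(find_coding_regions("TACTTCATCGATAGGAGAGGGCCCATTTAACCCATC"))

# Ten harmless triplets between "TAC" and "ATT" are accepted.
print(find_coding_regions("TAC" + "GGG" * 10 + "ATT"))
```

Note that the lookahead is only tested at triplet boundaries counted from the opening "TAC", so a forbidden triplet that straddles two triplets (out of frame) does not block a match.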

CodePudding user response:

Try: TAC((?!TAC|ATT|ATC|ACT)...){10,}(ATT|ATC|ACT) This regex matches sequences that begin with "TAC", end with "ATT", "ATC" or "ACT", and have at least 10 triplets in between, none of which is "TAC", "ATT", "ATC" or "ACT". A negative lookahead assertion ((?!...)) in front of each triplet ensures that it is not one of the specified triplets. Note that the group must consume three characters (...) and use the open-ended quantifier {10,}; with a single . and {10} it would match exactly 10 single characters rather than 10 or more triplets.
