I'm working on DNA sequences, and I would like to find sequences that code for a protein. Such sequences begin with "TAC" and end with "ATT", "ATC" or "ACT". I would also like at least 10 triplets between the first triplet and the last one, and none of these triplets may be "TAC", "ATT", "ATC" or "ACT".
I created this regex: ".*(TAC(...){10,}(ATT|ATC|ACT)).*"
But it's obviously not enough.
For example, "TACTTCATCGATAGGAGAGGGCCCATTTAACCCATC" matches, and I don't want it to. It matches because there are 10 triplets between "TAC" and the second "ATC", but I don't want this extra "ATC" in between.
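To make the problem concrete, here is a small Python sketch (Python and its re module are my assumption; the question does not say which regex engine is used). It splits the example into triplets and shows that the original pattern accepts it even though forbidden triplets sit in the middle. The names naive, inner and bad are only for illustration.

```python
import re

# Pattern from the question, without the surrounding .* so that
# re.search can locate the candidate directly.
naive = re.compile(r"TAC(...){10,}(ATT|ATC|ACT)")

seq = "TACTTCATCGATAGGAGAGGGCCCATTTAACCCATC"

# Split the sequence into triplets, reading from the first character.
triplets = [seq[i:i + 3] for i in range(0, len(seq), 3)]

# Everything between the first triplet (TAC) and the last one (ATC).
inner = triplets[1:-1]

# Triplets that are not allowed in between.
forbidden = {"TAC", "ATT", "ATC", "ACT"}
bad = [t for t in inner if t in forbidden]

match = naive.search(seq)
print("matched:", match is not None)
print("forbidden triplets inside:", bad)
```

The (...) group accepts any three characters, so nothing stops a forbidden triplet from being counted as one of the 10.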
CodePudding user response:
Try:
TAC(?:(?!TAC|ATT|ATC|ACT)...){10,}(?:ATT|ATC|ACT)
TAC - match TAC
(?:(?!TAC|ATT|ATC|ACT)...){10,} - match 3 characters at least 10 times. These 3 characters cannot be any of TAC, ATT, ATC or ACT
(?:ATT|ATC|ACT) - match ATT, ATC or ACT at the end.
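A quick way to check the pattern, sketched in Python (the language is an assumption, since the question does not name one). The accepted sequence below is made up for the test: TAC, ten copies of GGG, then ATT.

```python
import re

pattern = re.compile(r"TAC(?:(?!TAC|ATT|ATC|ACT)...){10,}(?:ATT|ATC|ACT)")

# Example from the question: has ATC and ATT among the inner triplets.
rejected = "TACTTCATCGATAGGAGAGGGCCCATTTAACCCATC"

# Hypothetical sequence built for this test: start, 10 neutral triplets, stop.
accepted = "TAC" + "GGG" * 10 + "ATT"

# The same sequence embedded in a longer string, to show that search()
# finds it without needing .* on either side.
embedded = "CC" + accepted + "GG"

print(pattern.search(rejected))           # no match expected
print(pattern.search(accepted).group(0))  # the whole accepted sequence
print(pattern.search(embedded).group(0))  # only the coding part
```

With search() (or finditer() to list every hit) the leading and trailing .* from the question are not needed.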
CodePudding user response:
Try:
TAC((?!TAC|ATT|ATC|ACT)...){10,}(ATT|ATC|ACT)
This regex matches sequences that begin with "TAC", continue with at least 10 triplets that are not "TAC", "ATT", "ATC" or "ACT", and end with "ATT", "ATC" or "ACT". The negative lookahead assertion ((?!...)) is checked before each intermediate triplet, so none of them can be one of the specified triplets. The ... then consumes one full triplet at a time, and {10,} means at least 10 of them.
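To see what the negative lookahead does on its own, here is a minimal Python sketch (Python is assumed; unit and results are illustrative names). It tests a single "allowed triplet" unit against a few triplets.

```python
import re

# One "allowed triplet" unit: the lookahead is tested first at the current
# position, and only if it does not fail are three characters consumed.
unit = re.compile(r"(?!TAC|ATT|ATC|ACT)...")

# True means the triplet is allowed between the start and the end.
results = {t: bool(unit.fullmatch(t)) for t in ["GAT", "ATC", "TAC", "CAT"]}
print(results)

# The lookahead only inspects the text starting at the current position.
# In "CATC" the letters ATC start at offset 1, not at a triplet boundary,
# so the unit still accepts "CAT".
first = unit.match("CATC").group(0)
print(first)
```

Because the check happens once per triplet, a forbidden triplet that straddles two triplets (out of frame relative to the starting TAC) does not block the match.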