Simplify and Improve a Multi-If-Statement

I am trying to randomly generate multiple short 5 base-pair DNA sequences. Among them, I want to pick the sequences that meet the following conditions:

If the first letter is A then the last letter cannot be T
If the first letter is T then the last letter cannot be A
If the first letter is C then the last letter cannot be G
If the first letter is G then the last letter cannot be C

The same requirements are repeated for the second and the second to the last letters.

I am currently using a very long if-statement to handle the first and last letters. Is there a simpler way to get the same result, so that I don't have to repeat the long statement for the second and second-to-last letters? If so, how should I change the code? Thank you.

import itertools

a = "ATCG"

for output in itertools.product(a, repeat=5):
    if ((output[0] == 'A') and (output[4] != 'T')) or ((output[0] == 'T') and (output[4] != 'A')) or ((output[0] == 'C') and (output[4] != 'G')) or ((output[0] == 'G') and (output[4] != 'C')):
        sequence = "".join(output)
        print(sequence)

CodePudding user response：

Here is a dictionary containing the forbidden opposite:

forbidden = {
    'A': 'T',
    'T': 'A',
    'C': 'G',
    'G': 'C',
}

Now you can check that the character at index -1 - i is not the forbidden opposite of the one at i by doing a simple lookup. The trick is to loop only over the first half of the string:

def check(s):
    for i in range(len(s) // 2):
        if s[-1 - i] == forbidden[s[i]]:
            return False
    return True

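Incidentally, this works correctly for both even and odd string lengths, because the middle character of an odd-length string has no partner and is never examined. Applied to every 5-letter sequence: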

for sequence in map(''.join, itertools.product(forbidden.keys(), repeat=5)):
    if check(sequence):
        print(sequence)
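
As a quick sanity check on the filter above: the first three positions are unconstrained and each of the last two loses exactly one option, so 4 * 4 * 4 * 3 * 3 = 576 of the 1024 candidates should survive. A minimal sketch that counts them, restating forbidden and check (in a condensed form using all) so that it runs on its own:

```python
import itertools

# Complement pairs, restated here so the block is self-contained.
forbidden = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

def check(s):
    """Return True if no mirrored pair of positions holds complementary bases."""
    return all(s[-1 - i] != forbidden[s[i]] for i in range(len(s) // 2))

valid = [s for s in map(''.join, itertools.product('ATCG', repeat=5)) if check(s)]
print(len(valid))  # 576 out of 4**5 = 1024 candidates
```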

All that being as it may, it's a bit inefficient to generate a bunch of extra sequences when you only want the ones matching a specific pattern. The pattern is that each character in the first half (and the middle character, if there is one) has 4 options, while each character in the second half has only 3. You can therefore generate only matching sequences with something like this:

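import random

def generate(n=5):
    # First half (plus the middle character when n is odd): any of the four bases.
    first = random.choices('ATCG', k=(n + 1) // 2)
    second = random.choices('ATC', k=n // 2)
    # Swap any forbidden draw for 'G', which was held back from the draw above.
    second = ['G' if s == forbidden[f] else s for f, s in zip(first, second)]
    # Reverse so that second[i] lands at index -1 - i, opposite first[i].
    return ''.join(first + second[::-1])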

Given that only one character is forbidden at each position, you can draw the second half from any three characters and replace a draw that turns out to be forbidden with the one that was left out ('G' here). The second half then gets reversed because the character at index i is compared against the one at index -1 - i.
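
If you want every valid sequence rather than random samples, the same "4 options, then 3 options" idea can be applied with itertools.product. This is only a sketch: the name all_valid is made up for illustration and is not part of the answer above.

```python
import itertools

forbidden = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

def all_valid(n=5):
    """Yield every length-n sequence with no complementary mirrored pair."""
    half = n // 2
    # The first n - half characters (including any middle one) are unconstrained.
    for first in itertools.product('ATCG', repeat=n - half):
        # For each mirrored position, keep only the three bases still allowed.
        options = ['ATCG'.replace(forbidden[f], '') for f in first[:half]]
        for second in itertools.product(*options):
            # second[i] is the partner of first[i], so it goes at index -1 - i.
            yield ''.join(first) + ''.join(reversed(second))

print(list(itertools.islice(all_valid(5), 3)))  # ['AAAAA', 'AAACA', 'AAAGA']
```

Every yielded string is valid by construction, so no filtering pass is needed afterwards.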

CodePudding user response：

Are you looking for something like this? You can define regular expressions to filter your output.

To learn more about regular expressions: https://docs.python.org/3/library/re.html

import itertools
import re

a = "ATCG"

case1 = ["(^[A].{3}[^T]$)",
         "(^[T].{3}[^A]$)",
         "(^[C].{3}[^G]$)",
         "(^[G].{3}[^C]$)"]

case2 = ["(^.[A].[^T].$)",
         "(^.[T].[^A].$)",
         "(^.[C].[^G].$)",
         "(^.[G].[^C].$)"]

case1_filter = '|'.join(case1)
case2_filter = '|'.join(case2)

for output in itertools.product(a, repeat=5):
    sequence = ''.join(output)
    if re.match(case1_filter, sequence) and re.match(case2_filter, sequence):
        print(sequence)
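
The eight patterns above differ only in the base and its complement, so they could also be built from a lookup table and compiled once. A rough sketch, assuming sequences of exactly 5 letters; the names complement, outer, inner, and passes are made up for illustration:

```python
import re

complement = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

# First/last rule: base b at index 0, anything but its complement at index 4.
outer = re.compile('|'.join(f'{b}.{{3}}[^{c}]' for b, c in complement.items()))
# Second/fourth rule: the same idea, shifted one position inward.
inner = re.compile('|'.join(f'.{b}.[^{c}].' for b, c in complement.items()))

def passes(sequence):
    """True if the 5-letter sequence satisfies both rules."""
    # fullmatch anchors the whole pattern, so no explicit ^ and $ are needed.
    return bool(outer.fullmatch(sequence) and inner.fullmatch(sequence))

for s in ('AACAA', 'AACAT', 'ACAGA'):
    print(s, passes(s))
```

Here 'AACAA' passes, 'AACAT' fails the first/last rule, and 'ACAGA' fails the second/fourth rule.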

CodePudding user response：

I'd use sets:

import itertools

a = "ATCG"

# Each set is an unordered pair of bases that may not face each other.
disallowed = [{'A', 'T'},
              {'C', 'G'}]

for output in itertools.product(a, repeat=5):
    first_last = {output[0], output[4]}
    second_fourth = {output[1], output[3]}
    pairs = (first_last, second_fourth)
    if all(pair not in disallowed for pair in pairs):
        sequence = "".join(output)
        print(sequence)
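
The set comparison above can be extended to sequences of any length by pairing each position in the first half with its mirror image. A minimal sketch; the helper name is_allowed is hypothetical:

```python
import itertools

disallowed = [{'A', 'T'}, {'C', 'G'}]

def is_allowed(seq):
    """True if no mirrored pair of positions forms a disallowed set."""
    half = len(seq) // 2
    # zip pairs index i with index -1 - i; a middle base (if any) is skipped.
    return all({x, y} not in disallowed
               for x, y in zip(seq[:half], reversed(seq)))

kept = [''.join(output)
        for output in itertools.product('ATCG', repeat=5)
        if is_allowed(output)]
print(kept[:3])  # ['AAAAA', 'AAAAC', 'AAAAG']
```

Two identical bases collapse into a one-element set such as {'A'}, which never equals a disallowed pair, so sequences like 'AAAAA' are kept.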