I have a DNA sequence which is variable only at specific locations and need to find all possible scenarios:
DNA_seq='ANGK' #N can be T or C and K can be A or G
N=['T','C']
K=['A','G']
Results:
['ATGA','ATGG','ACGA','ACGG']
The solution offered by @vladimir works perfectly for simple cases like the example above, but for more complicated scenarios like the one below it quickly runs out of memory. For the example below, even a run with 120 GB of memory ended with an out-of-memory error. This is surprising because I estimated the total number of combinations at around 500K strings of 33 bp each, which I assume should not consume more than 100 GB of RAM. Are my assumptions wrong? Any suggestions?
from itertools import product
N=['A','T','C','G']
K=['G','T']
dev_seq=[f'{N1}{N2}{K1}{N3}{N4}{K2}{N5}{N6}{K3}TCC{N7}{N8}{K4}CTG{N9}{N10}{K5}CTG{N11}{N12}{K6}{N13}{N14}{K7}{N15}{N16}{K8}' for \
N1,N2,K1,N3,N4,K2,N5,N6,K3,N7,N8,K4,N9,N10,K5,N11,N12,K6,N13,N14,K7,N15,N16,K8 in \
product(N,N,K,N,N,K,N,N,K,N,N,K,N,N,K,N,N,K,N,N,K,N,N,K)]
CodePudding user response:
Use itertools.product:
from itertools import product
result = [f'A{n}G{k}' for n, k in product(N, K)]
Result:
['ATGA', 'ATGG', 'ACGA', 'ACGG']
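For longer templates, writing one loop variable per position gets unwieldy. Here is a minimal sketch of a more general version, assuming variable positions are marked with single-letter codes and a dict maps each code to its allowed bases. The names `expand` and `codes` are made up for illustration and are not part of any library:

```python
from itertools import product

def expand(template, codes):
    """Yield every sequence obtained by filling in each coded position."""
    # A fixed base acts as a one-character pool; a coded position uses its options.
    choices = [codes.get(base, base) for base in template]
    for combo in product(*choices):
        yield ''.join(combo)

# Hypothetical mapping matching the small example in the question.
codes = {'N': ['T', 'C'], 'K': ['A', 'G']}
result = list(expand('ANGK', codes))
```

Because `expand` is a generator, wrapping it in `list(...)` is only sensible when the number of combinations is small.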
EDIT
If you don't want to store the whole list in memory at one time, and would rather process the strings sequentially as they come, you can use a generator:
g = (f'A{n}G{k}' for n, k in product(N, K))
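A generator only helps if the strings are consumed one at a time, so it is worth checking the number of combinations before generating anything. Below is a rough sketch for the larger template in the question, which has 16 `N` positions with 4 options each and 8 `K` positions with 2 options each. The helper name `sequences` is an assumption for illustration, and `islice` is used only to peek at the first few results:

```python
from itertools import islice, product
from math import prod

N = ['A', 'T', 'C', 'G']
K = ['G', 'T']

# The question's template is eight N-N-K triplets with TCC and CTG
# inserted as fixed bases between some of them.
pools = [N, N, K] * 8
total = prod(len(p) for p in pools)  # 4**16 * 2**8 == 2**40

def sequences():
    """Lazily yield each 33 bp sequence; only the current one is held in memory."""
    for c in product(*pools):
        yield (''.join(c[:9]) + 'TCC' + ''.join(c[9:12]) + 'CTG'
               + ''.join(c[12:15]) + 'CTG' + ''.join(c[15:]))

first_three = list(islice(sequences(), 3))
```

Here `total` is 2**40, about 1.1 × 10^12 combinations rather than roughly 500K. At 33 characters each, the raw characters alone would need on the order of 36 TB before any per-object overhead, so a list of all of them cannot fit in 120 GB. Iterating lazily avoids the memory problem, although visiting every combination would still take a long time.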