I have a code that takes a DNA string, in which only 4 characters are found: A, C, T and G, for example "ATACAAG", and for each character if finds the 3 other possible characters. The code includes a loop for the string and another loop for a list of the possible characters. The issue is that the strings are very long: up to hundreds of thousands of characters, so it's not fast and the computer get heated (it's fans start working fast).
I'm looking for a faster way to do it. I tried list comprehension but it's still rather slow. I also tried calling the code as a function from a pandas lambda and it still takes about a minute for each string. Is that the best I can get?
For each character, the code records 3 alternatives in separate lines in a file.
The code:
bases = set(list('ACGT'))
alts = {base: list(bases.difference(base)) for base in bases}
def get_variants(data, output_path): # pb: position base, b: base
[open(output_path f'/{data.symbol}_variants.txt', 'a').writelines(
[f'{data.chromosome}\t{data.end index}\t{data.end index}\t{pb}/{b}\t{data.strand}\n' for b in alts[pb]])
for index, pb in enumerate(data.sequence)]
Calling the function for "ATACAAG":
get_variants(pandas.Series({'symbol': 'XYZ', 'sequence': 'ATACAAG', 'chromosome': 12, 'start': 9067664, 'end': 9067671, 'strand': '-'}),
'write_an_existing_output_directory_path_here')
The output is arranged in a file in the following columns:
chromosome number, start position, end position, original character/alternative character, strand (can or -)
It yields the following lines in the file XYZ_variants.txt:
12 9067664 9067664 A/T -
12 9067664 9067664 A/G -
12 9067664 9067664 A/C -
12 9067665 9067665 T/A -
12 9067665 9067665 T/G -
12 9067665 9067665 T/C -
12 9067666 9067666 A/T -
12 9067666 9067666 A/G -
12 9067666 9067666 A/C -
12 9067667 9067667 C/T -
12 9067667 9067667 C/A -
12 9067667 9067667 C/G -
12 9067668 9067668 A/T -
12 9067668 9067668 A/G -
12 9067668 9067668 A/C -
12 9067669 9067669 A/T -
12 9067669 9067669 A/G -
12 9067669 9067669 A/C -
12 9067670 9067670 G/T -
12 9067670 9067670 G/A -
12 9067670 9067670 G/C -
Thanks.
CodePudding user response:
Here is how I would do it.
Starting from a dataframe:
symbol sequence chromosome start end strand
0 XYZ ATACAAG 12 9067664 9067671 -
I would explode
the sequence, reindex
to have all combinations of A/C/G/T and keep only that where the initial base is different
import numpy as np
df2 = df.assign(base=df['sequence'].apply(list)).explode('base').reset_index()
df2 = (df2.reindex(df2.index.repeat(4))
.assign(variant=np.tile(list('ACGT'), len(df2)))
.loc[lambda d: d['base'].ne(d['variant'])]
.assign(var=lambda d:d['base'] '/' d['variant'])
)
Intermediate output:
>>> df2.head()
index symbol sequence chromosome start end strand base variant var
0 0 XYZ ATACAAG 12 9067664 9067671 - A C A/C
0 0 XYZ ATACAAG 12 9067664 9067671 - A G A/G
0 0 XYZ ATACAAG 12 9067664 9067671 - A T A/T
1 0 XYZ ATACAAG 12 9067664 9067671 - T A T/A
1 0 XYZ ATACAAG 12 9067664 9067671 - T C T/C
Then export:
df2[['start', 'end', 'var', 'strand']].to_csv('variants.txt', sep='\t', index=False, header=None)
example output (first lines):
9067664 9067671 A/C -
9067664 9067671 A/G -
9067664 9067671 A/T -
9067664 9067671 T/A -
9067664 9067671 T/C -
9067664 9067671 T/G -
9067664 9067671 A/C -
9067664 9067671 A/G -
9067664 9067671 A/T -
9067664 9067671 C/A -
optimization
Now we remove everything that is not needed to keep the size minimal:
df2 = (df.drop(columns=['symbol', 'chromosome'])
.assign(sequence=df['sequence'].apply(list))
.explode('sequence').reset_index(drop=True)
)
df2 = (df2.reindex(df2.index.repeat(4))
.assign(var=np.tile(list('ACGT'), len(df2)))
.loc[lambda d: d['sequence'].ne(d['var'])]
.assign(var=lambda d:d['sequence'] '/' d['var'])
)
df2[['start', 'end', 'var', 'strand']].to_csv('variants.txt', sep='\t', index=False, header=None)