Speed up iterating over characters in long strings-CodePudding

I have a code that takes a DNA string, in which only 4 characters are found: A, C, T and G, for example "ATACAAG", and for each character if finds the 3 other possible characters. The code includes a loop for the string and another loop for a list of the possible characters. The issue is that the strings are very long: up to hundreds of thousands of characters, so it's not fast and the computer get heated (it's fans start working fast).

I'm looking for a faster way to do it. I tried list comprehension but it's still rather slow. I also tried calling the code as a function from a pandas lambda and it still takes about a minute for each string. Is that the best I can get?

For each character, the code records 3 alternatives in separate lines in a file.

The code:

bases = set(list('ACGT'))
alts = {base: list(bases.difference(base)) for base in bases}

def get_variants(data, output_path):    # pb: position base, b: base
    [open(output_path   f'/{data.symbol}_variants.txt', 'a').writelines(
        [f'{data.chromosome}\t{data.end   index}\t{data.end   index}\t{pb}/{b}\t{data.strand}\n' for b in alts[pb]])
        for index, pb in enumerate(data.sequence)]

Calling the function for "ATACAAG":

get_variants(pandas.Series({'symbol': 'XYZ', 'sequence': 'ATACAAG', 'chromosome': 12, 'start': 9067664, 'end': 9067671, 'strand': '-'}),
             'write_an_existing_output_directory_path_here')

The output is arranged in a file in the following columns:

chromosome number, start position, end position, original character/alternative character, strand (can   or -)

It yields the following lines in the file XYZ_variants.txt:

12  9067664 9067664 A/T -
12  9067664 9067664 A/G -
12  9067664 9067664 A/C -
12  9067665 9067665 T/A -
12  9067665 9067665 T/G -
12  9067665 9067665 T/C -
12  9067666 9067666 A/T -
12  9067666 9067666 A/G -
12  9067666 9067666 A/C -
12  9067667 9067667 C/T -
12  9067667 9067667 C/A -
12  9067667 9067667 C/G -
12  9067668 9067668 A/T -
12  9067668 9067668 A/G -
12  9067668 9067668 A/C -
12  9067669 9067669 A/T -
12  9067669 9067669 A/G -
12  9067669 9067669 A/C -
12  9067670 9067670 G/T -
12  9067670 9067670 G/A -
12  9067670 9067670 G/C -

Thanks.

CodePudding user response：

Here is how I would do it.

Starting from a dataframe:

  symbol sequence chromosome    start      end strand
0    XYZ  ATACAAG         12  9067664  9067671      -

I would explode the sequence, reindex to have all combinations of A/C/G/T and keep only that where the initial base is different

import numpy as np

df2 = df.assign(base=df['sequence'].apply(list)).explode('base').reset_index()
df2 = (df2.reindex(df2.index.repeat(4))
          .assign(variant=np.tile(list('ACGT'), len(df2)))
          .loc[lambda d: d['base'].ne(d['variant'])]
          .assign(var=lambda d:d['base'] '/' d['variant'])
       )

Intermediate output:

>>> df2.head()
   index symbol sequence chromosome    start      end strand base variant  var
0      0    XYZ  ATACAAG         12  9067664  9067671      -    A       C  A/C
0      0    XYZ  ATACAAG         12  9067664  9067671      -    A       G  A/G
0      0    XYZ  ATACAAG         12  9067664  9067671      -    A       T  A/T
1      0    XYZ  ATACAAG         12  9067664  9067671      -    T       A  T/A
1      0    XYZ  ATACAAG         12  9067664  9067671      -    T       C  T/C

Then export:

df2[['start', 'end', 'var', 'strand']].to_csv('variants.txt', sep='\t', index=False, header=None)

example output (first lines):

9067664 9067671 A/C -
9067664 9067671 A/G -
9067664 9067671 A/T -
9067664 9067671 T/A -
9067664 9067671 T/C -
9067664 9067671 T/G -
9067664 9067671 A/C -
9067664 9067671 A/G -
9067664 9067671 A/T -
9067664 9067671 C/A -

optimization

Now we remove everything that is not needed to keep the size minimal:

df2 = (df.drop(columns=['symbol', 'chromosome'])
         .assign(sequence=df['sequence'].apply(list))
         .explode('sequence').reset_index(drop=True)
      )
df2 = (df2.reindex(df2.index.repeat(4))
          .assign(var=np.tile(list('ACGT'), len(df2)))
          .loc[lambda d: d['sequence'].ne(d['var'])]
          .assign(var=lambda d:d['sequence'] '/' d['var'])
       )
df2[['start', 'end', 'var', 'strand']].to_csv('variants.txt', sep='\t', index=False, header=None)