Efficient way to transform a dictionary into a dataframe in pandas

Time:07-16

I have a dictionary such as:

  mydict = {
      'scaffold1': SeqRecord(seq=Seq('AGAGGTAGAGGCAGAAAACATAGTGAGCACGCTGTGTTTAAT'), id='scaffold1', name='scaffold1', description='scaffold1 0.0', dbxrefs=[]),
      'scaffold2': SeqRecord(seq=Seq('GCAAAAGCAAAGCCAGATCAGAGTCCAGACAGTGAAGGCAAGACTAGTAAAGT'), id='scaffold2', name='scaffold2', description='scaffold2 0.0', dbxrefs=[]),
  }

Is there an efficient way to process this dictionary and build a DataFrame from it with three columns:

  • Scaffolds: the keys of the dictionary
  • Seq_length: the length of the Seq string
  • GC%: the number of G and C letters in Seq divided by Seq_length (for example, len(Seq) of scaffold1 is 42 and it contains 18 G and C letters, so GC% = 18/42)
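To make the GC% definition concrete, here is a minimal sketch on a plain Python string. The helper name gc_fraction_plain is hypothetical, and the sketch assumes that str(record.seq) on a SeqRecord would supply the same string.

```python
def gc_fraction_plain(sequence: str) -> float:
    """Return (number of G + number of C) / length, or 0.0 for an empty string."""
    if not sequence:
        return 0.0
    upper = sequence.upper()  # count lowercase g/c as well
    return (upper.count("G") + upper.count("C")) / len(upper)


# Sequence of scaffold1, copied from the dictionary above.
scaffold1 = "AGAGGTAGAGGCAGAAAACATAGTGAGCACGCTGTGTTTAAT"

print(len(scaffold1))                # 42
print(gc_fraction_plain(scaffold1))  # 18/42, about 0.4286
```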

I should then get:

Scaffolds  Seq_length  GC%
scaffold1  42          0.429
scaffold2  53          0.453

I'm looking for an efficient way to do this because my real dict is very large (1,046,544 keys).

Thanks a lot for your help

CodePudding user response:

You can iterate over the dictionary items once, build one row per scaffold, and pass the list of rows to pd.DataFrame:

import pandas as pd
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
# GC returns a percentage (0-100), not a 0-1 fraction.
# Recent Biopython releases replace it with gc_fraction, which returns a 0-1 fraction.
from Bio.SeqUtils import GC

mydict = {'scaffold1': SeqRecord(seq=Seq('AGAGGTAGAGGCAGAAAACATAGTGAGCACGCTGTGTTTAAT'), id='scaffold1', name='scaffold1', description='scaffold1 0.0', dbxrefs=[]), 'scaffold2': SeqRecord(seq=Seq('GCAAAAGCAAAGCCAGATCAGAGTCCAGACAGTGAAGGCAAGACTAGTAAAGT'), id='scaffold2', name='scaffold2', description='scaffold2 0.0', dbxrefs=[])}

# One row (a dict) per scaffold; the dictionary key becomes the Scaffolds column.
df = pd.DataFrame([{'Scaffolds': k,
                    'Seq_length': len(s.seq),
                    'GC%': GC(s.seq)}
                   for k, s in mydict.items()])

output:

   Scaffolds  Seq_length        GC%
0  scaffold1          42  42.857143
1  scaffold2          53  45.283019
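The GC% values above are percentages, while the question asks for a fraction (18/42). If the fraction is wanted without depending on a particular Biopython version, one possible sketch is to collect lengths and G/C counts in a single pass and do the division on whole columns. The plain-string dictionary `sequences` and the function name `build_gc_table` are hypothetical stand-ins. The sketch assumes uppercase sequences and that str(record.seq) would give the same strings for SeqRecord values.

```python
import pandas as pd

# Hypothetical stand-in for the real data: plain strings keyed by scaffold name.
sequences = {
    "scaffold1": "AGAGGTAGAGGCAGAAAACATAGTGAGCACGCTGTGTTTAAT",
    "scaffold2": "GCAAAAGCAAAGCCAGATCAGAGTCCAGACAGTGAAGGCAAGACTAGTAAAGT",
}


def build_gc_table(seqs):
    """Build a Scaffolds / Seq_length / GC% table, with GC% as a 0-1 fraction."""
    names, lengths, gc_counts = [], [], []
    # Single pass over the dictionary; str.count runs in C, so this stays
    # cheap even for a very large number of keys.
    for name, seq in seqs.items():
        names.append(name)
        lengths.append(len(seq))
        gc_counts.append(seq.count("G") + seq.count("C"))

    table = pd.DataFrame(
        {"Scaffolds": names, "Seq_length": lengths, "GC_count": gc_counts}
    )
    # Column-wise division; an empty sequence (0/0) yields NaN rather than an error.
    table["GC%"] = table["GC_count"] / table["Seq_length"]
    return table[["Scaffolds", "Seq_length", "GC%"]]


df_fraction = build_gc_table(sequences)
print(df_fraction)
```

Building three flat lists and constructing the DataFrame from columns avoids creating one small dict per row, which may matter with about a million keys. Whether it is measurably faster on the real data would need to be timed.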