I have a dictionary such as:
mydict= {'scaffold1': SeqRecord(seq=Seq('AGAGGTAGAGGCAGAAAACATAGTGAGCACGCTGTGTTTAAT'), id='scaffold1', name='scaffold1', description='scaffold1 0.0', dbxrefs=[]), 'scaffold2': SeqRecord(seq=Seq('GCAAAAGCAAAGCCAGATCAGAGTCCAGACAGTGAAGGCAAGACTAGTAAAGT'), id='scaffold2', name='scaffold2', description='scaffold2 0.0', dbxrefs=[])}
Does anyone know an efficient way to process this dictionary and build a dataframe from it with three columns:
- Scaffolds: the keys of the dictionary
- Seq_length: the length of the Seq string
- GC%: the number of G and C letters within Seq divided by Seq_length (for example, len(Seq) of scaffold1 is 42 and it contains 18 G and C letters, so GC% = 18/42)
I should then get:
Scaffolds  Seq_length  GC%
scaffold1  42          0.428
scaffold2  53          0.453
I'm looking for an efficient way to do this because my real dict is huge (1,046,544 keys).
Thanks a lot for your help.
CodePudding user response:
You can rework the dictionary into a list of rows, one per key, and pass that to the DataFrame constructor:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
mydict = {'scaffold1': SeqRecord(seq=Seq('AGAGGTAGAGGCAGAAAACATAGTGAGCACGCTGTGTTTAAT'), id='scaffold1', name='scaffold1', description='scaffold1 0.0', dbxrefs=[]), 'scaffold2': SeqRecord(seq=Seq('GCAAAAGCAAAGCCAGATCAGAGTCCAGACAGTGAAGGCAAGACTAGTAAAGT'), id='scaffold2', name='scaffold2', description='scaffold2 0.0', dbxrefs=[])}
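import pandas as pd
from Bio.SeqUtils import GC  # GC() returns the GC content as a percentage (0-100)
# one row per scaffold: key, sequence length, GC percentage
df = pd.DataFrame([{'Scaffolds': k,
                    'Seq_length': len(s.seq),
                    'GC%': GC(s.seq)}
                   for k, s in mydict.items()])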
output:
   Scaffolds  Seq_length        GC%
0  scaffold1          42  42.857143
1  scaffold2          53  45.283019
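Note that the output above gives GC% as a percentage, while the question asked for a fraction (18/42). If you want the fraction, or if `GC` is not available in your Biopython version (as far as I know, recent releases replace it with `gc_fraction`), here is a minimal sketch that counts the letters directly. `FakeRecord` and `gc_rows` are made-up names for illustration; `FakeRecord` only stands in for `SeqRecord` so the snippet runs without Biopython, and it assumes `str(record.seq)` returns the bases. It counts only literal G and C (case-insensitive), so results may differ from Biopython for ambiguity codes.

```python
from collections import namedtuple

# Hypothetical stand-in for Bio.SeqRecord.SeqRecord (assumption: only .seq is needed).
FakeRecord = namedtuple("FakeRecord", ["seq"])


def gc_rows(records):
    """Return one dict per scaffold with its length and GC fraction (0 to 1)."""
    rows = []
    for name, rec in records.items():
        s = str(rec.seq).upper()             # plain string of bases
        n = len(s)
        gc = s.count("G") + s.count("C")     # literal G and C only
        rows.append({"Scaffolds": name,
                     "Seq_length": n,
                     "GC%": gc / n if n else 0.0})  # avoid dividing by zero
    return rows


mydict = {
    "scaffold1": FakeRecord("AGAGGTAGAGGCAGAAAACATAGTGAGCACGCTGTGTTTAAT"),
    "scaffold2": FakeRecord("GCAAAAGCAAAGCCAGATCAGAGTCCAGACAGTGAAGGCAAGACTAGTAAAGT"),
}

rows = gc_rows(mydict)
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` should then give the three columns from the question. For a dict with about a million keys, a single pass like this keeps only one small dict per scaffold in memory besides the sequences themselves.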