Good day,
I am working on a script that writes a file summarizing the Gene Ontology (GO) terms of multiple species. It should contain the following columns: Species Name | Gene ID | GO ID. The code itself is pretty simple and I am mostly done already. The only problem is that the input files list a gene multiple times if it has multiple GO terms, e.g.:
Znev_18624 GO:0009987
Znev_18624 GO:0008150
Znev_18620 GO:0008150
Znev_18620 GO:0009987
Znev_18721 GO:0009987
Znev_18721 GO:0008150
I basically want to merge identical keys so that each key is written once, with all of its values in the next column. I found multiple questions asking how to merge two dictionaries, but never specifically how to solve my problem. The code looks like this. Again, it is pretty simple, as it just rewrites the columns, but I would like to emphasise that I do not declare a dictionary first; I just write out the values:
for line in ref:
    # Skip the header row.
    if 'Protein GO term Score' in line:
        continue
    GeneID = line.split('\t')[0]
    GOID = line.split('\t')[1]
    results.write('Znev' + '\t' + GeneID + '\t' + GOID + '\n')
The output file looks just like the input above. Any help would be much appreciated. Thank you in advance.
CodePudding user response:
With this snippet of code:
import pprint
from collections import defaultdict

with open('file.txt', 'r') as f:
    lines = f.readlines()

# Map each gene ID to the list of its GO IDs.
output_dict = defaultdict(list)
for line in lines:
    # Skip the header row.
    if 'Protein GO term Score' in line:
        continue
    GeneID = line.split()[0]
    GOID = line.split()[1]
    output_dict[GeneID].append(GOID)

pprint.pprint(output_dict, indent=4)

# Write one row per gene, with its GO IDs joined by commas.
with open('output.txt', 'w') as f:
    for key, value in output_dict.items():
        f.write(f"Znev\t{key}\t{','.join(value)}\n")
You will get this output:
defaultdict(<class 'list'>,
            {   'Znev_18620': ['GO:0008150', 'GO:0009987'],
                'Znev_18624': ['GO:0009987', 'GO:0008150'],
                'Znev_18721': ['GO:0009987', 'GO:0008150']})
and this output file:
Znev Znev_18624 GO:0009987,GO:0008150
Znev Znev_18620 GO:0008150,GO:0009987
Znev Znev_18721 GO:0009987,GO:0008150
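If the input may contain blank lines or the same gene/GO pair more than once, the same grouping idea can be wrapped in two small functions. This is only a sketch under those assumptions: the names group_go_terms, format_rows and HEADER_MARKER are made up for illustration, and the sample lines are adapted from the question rather than taken from a real file.

```python
from collections import defaultdict

# Header text as it appears in the question; adjust if the real file differs.
HEADER_MARKER = 'Protein GO term Score'


def group_go_terms(lines):
    """Map each gene ID to its GO IDs, in first-seen order, without duplicates."""
    grouped = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Skip the header row and any blank or incomplete line.
        if HEADER_MARKER in line or len(fields) < 2:
            continue
        gene_id, go_id = fields[0], fields[1]
        if go_id not in grouped[gene_id]:
            grouped[gene_id].append(go_id)
    return grouped


def format_rows(grouped, species='Znev'):
    """Yield one tab-separated row per gene: species, gene ID, comma-joined GO IDs."""
    for gene_id, go_ids in grouped.items():
        yield f"{species}\t{gene_id}\t{','.join(go_ids)}\n"


# Hypothetical sample input, including a blank line and a repeated pair.
sample = [
    'Protein GO term Score\n',
    'Znev_18624 GO:0009987\n',
    'Znev_18624 GO:0008150\n',
    'Znev_18620 GO:0008150\n',
    '\n',
    'Znev_18620 GO:0008150\n',
]
rows = list(format_rows(group_go_terms(sample)))
```

With the file handles from the question, the result could be written with results.writelines(format_rows(group_go_terms(ref))). Passing the species name as a parameter should make it easier to reuse the same functions for the other species.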