Combining identical keys in a dictionary

Good day,

I am working on a script that outputs a file summarizing the Gene Ontology terms of multiple species. It should display the following columns: Species Name | Gene ID | GO ID. The code itself is pretty simple and I am mostly done already. The only problem is that the input files list a gene multiple times if it has multiple GO terms, e.g.:

Znev_18624  GO:0009987  
Znev_18624  GO:0008150  
Znev_18620  GO:0008150  
Znev_18620  GO:0009987  
Znev_18721  GO:0009987  
Znev_18721  GO:0008150

I basically just want to merge identical keys so that each key is output once, with all of its values in the next column. I found multiple questions asking how to merge two dictionaries, but never specifically how to solve my problem. The code looks like this; again, it is pretty simple as it just rewrites the columns, but I would like to emphasise that I do not declare a dictionary first and just write out the values:

for line in ref:
    if 'Protein GO term Score' in line:
        continue
    GeneID = line.split('\t')[0]            
    GOID = line.split('\t')[1]

    results.write('Znev' + '\t' + GeneID + '\t' + GOID + '\n')

The output file looks just like the input above. Any help would be much appreciated. Thank you in advance.

CodePudding user response：

With this snippet of code:

import pprint
from collections import defaultdict

with open('file.txt', 'r') as f:
    lines = f.readlines()

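# Collect every GO ID under its gene ID; a missing key starts as an empty list
output_dict = defaultdict(list)
for line in lines:
    # Skip the header line
    if 'Protein GO term Score' in line:
        continue
    # Split once on whitespace: column 0 is the gene, column 1 the GO term
    fields = line.split()
    GeneID = fields[0]
    GOID = fields[1]
    output_dict[GeneID].append(GOID)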
    
pprint.pprint(output_dict, indent=4)

with open('output.txt', 'w') as f:
    for key, value in output_dict.items():
        f.write(f"Znev\t{key}\t{','.join(value)}\n")

You will get this output:

defaultdict(<class 'list'>,
            {   'Znev_18620': ['GO:0008150', 'GO:0009987'],
                'Znev_18624': ['GO:0009987', 'GO:0008150'],
                'Znev_18721': ['GO:0009987', 'GO:0008150']})

and this output file:

Znev    Znev_18624  GO:0009987,GO:0008150
Znev    Znev_18620  GO:0008150,GO:0009987
Znev    Znev_18721  GO:0009987,GO:0008150
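
The same grouping can also be wrapped in small functions that work on any iterable of lines, which makes it easier to reuse for several species. This is only a sketch: the names group_go_terms, format_rows, header_marker and species are made up for illustration, and skipping blank lines and repeated GO IDs is an extra assumption that goes beyond the snippet above.

```python
from collections import defaultdict


def group_go_terms(lines, header_marker="Protein GO term Score"):
    """Map each gene ID to the list of its GO IDs, in first-seen order."""
    grouped = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Skip the header row, blank lines, and rows without a GO ID.
        if header_marker in line or len(fields) < 2:
            continue
        gene_id, go_id = fields[0], fields[1]
        # Assumption: a GO ID repeated for the same gene is kept only once.
        if go_id not in grouped[gene_id]:
            grouped[gene_id].append(go_id)
    return dict(grouped)


def format_rows(grouped, species="Znev"):
    """Build one tab-separated output line per gene."""
    return [
        f"{species}\t{gene}\t{','.join(terms)}"
        for gene, terms in grouped.items()
    ]


# Small in-memory sample instead of a real input file.
sample = [
    "Protein GO term Score",
    "Znev_18624\tGO:0009987",
    "Znev_18624\tGO:0008150",
    "Znev_18620\tGO:0008150",
    "Znev_18620\tGO:0008150",
    "",
]
rows = format_rows(group_go_terms(sample))
print("\n".join(rows))
```

To use it on a real file, pass the open file object as lines and write each returned row followed by a newline to the output file.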