Is there a fast way to convert this type of data to a dictionary in Python?

I have the following DataFrame, where the first column holds a gene pathway ID and the second column holds a list of the genes involved in that pathway:

gene pathway ID	gene ID
ID1	gene1,gene2,gene3
ID2	gene2,gene4,gene6
ID3	gene1,gene3,gene6
...	....

To speed up some processes, I would like to create a dictionary where the keys are the genes and the values are lists of pathway IDs:

gene_to_pathways = {gene1: [ID1, ID3], gene2: [ID1, ID2], gene3: [ID1, ID3], ...}

Is there a fast way to create such a dictionary? The only approach I have tried so far is far too slow.

CodePudding user response：

I am not sure which solution you tried, but I can propose this one. The result is the dict you asked for:

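from collections import defaultdict

# Maps each gene to the list of pathway IDs that contain it
result = defaultdict(list)
for _, row in df.iterrows():
    # 'gene ID' holds a comma-separated string such as "gene1,gene2,gene3"
    for g in row['gene ID'].split(','):
        result[g].append(row['gene pathway ID'])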
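
A variant worth trying if the loop above is still slow: iterrows() builds a Series object for every row, so zipping the two columns directly usually has less overhead. The sketch below is self-contained; the sample DataFrame is an assumption that mirrors the question's table, and gene_to_pathways is just an illustrative name.

```python
from collections import defaultdict

import pandas as pd

# Assumed sample data mirroring the table in the question
df = pd.DataFrame({
    "gene pathway ID": ["ID1", "ID2", "ID3"],
    "gene ID": ["gene1,gene2,gene3", "gene2,gene4,gene6", "gene1,gene3,gene6"],
})

gene_to_pathways = defaultdict(list)
# Iterate over the two columns in parallel instead of row by row
for pathway_id, genes in zip(df["gene pathway ID"], df["gene ID"]):
    for gene in genes.split(","):
        gene_to_pathways[gene].append(pathway_id)

print(dict(gene_to_pathways))
```

Whether this is fast enough depends on the size of the data, so it is worth timing both versions on the real DataFrame.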

CodePudding user response：

If the gene ID column already holds Python lists, as in

df =
  gene pathway ID                gene ID
0             ID1  [gene1, gene2, gene3]
1             ID2  [gene2, gene4, gene6]
2             ID3  [gene1, gene3, gene6]

you could use

from collections import defaultdict

genes_dict = defaultdict(list)
# With index=False each tuple is (pathway ID, list of genes)
for pathway_id, genes in df.itertuples(index=False):
    for gene in genes:
        genes_dict[gene].append(pathway_id)

If the gene ID column holds comma-separated strings instead, as in

df =
  gene pathway ID            gene ID
0             ID1  gene1,gene2,gene3
1             ID2  gene2,gene4,gene6
2             ID3  gene1,gene3,gene6

you could try

from collections import defaultdict

genes_dict = defaultdict(list)
# With index=False each tuple is (pathway ID, comma-separated genes)
for pathway_id, genes in df.itertuples(index=False):
    for gene in genes.split(","):
        genes_dict[gene].append(pathway_id)
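
As an alternative to the explicit loops, the same mapping can be built with pandas operations only: split the strings, explode to one row per gene, then group by gene. This is a sketch rather than a guaranteed speed-up, and it assumes a pandas version that provides DataFrame.explode (0.25 or later). The sample DataFrame and the name exploded are assumptions for illustration.

```python
import pandas as pd

# Assumed sample data mirroring the comma-separated case above
df = pd.DataFrame({
    "gene pathway ID": ["ID1", "ID2", "ID3"],
    "gene ID": ["gene1,gene2,gene3", "gene2,gene4,gene6", "gene1,gene3,gene6"],
})

# Turn each comma-separated string into a list, then into one row per gene
exploded = df.assign(**{"gene ID": df["gene ID"].str.split(",")}).explode("gene ID")

# Collect the pathway IDs of each gene into a list and convert to a plain dict
genes_dict = exploded.groupby("gene ID")["gene pathway ID"].agg(list).to_dict()

print(genes_dict)
```

If the column already holds lists, the str.split step can be dropped and explode applied directly. The result is a regular dict, so a missing gene raises KeyError instead of returning an empty list as a defaultdict would.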