I have the following DataFrame: the first column holds a gene pathway ID, and the second column holds a list of the genes involved in that pathway:
gene pathway ID | gene ID |
---|---|
ID1 | gene1,gene2,gene3 |
ID2 | gene2,gene4,gene6 |
ID3 | gene1,gene3,gene6 |
... | .... |
To speed up some processing, I would like to create a dictionary where the keys are the genes and the values are lists of pathway IDs:
dict = {gene1:[ID1,ID3], gene2:[ID1,ID2], gene3:[ID1,ID3], ...}
Is there a fast way to create such a dictionary? The only approach I have tried so far is far too slow.
CodePudding user response:
I am not sure which solution you tried, but I can propose this one. The result is the dict you asked for:
from collections import defaultdict

# Map each gene to the list of pathway IDs it appears in.
result = defaultdict(list)
for idx, row in df.iterrows():
    # The 'gene ID' cell is a comma-separated string of gene names.
    genes = row['gene ID'].split(',')
    for g in genes:
        result[g].append(row['gene pathway ID'])
CodePudding user response:
In case of
df =
gene pathway ID gene ID
0 ID1 [gene1, gene2, gene3]
1 ID2 [gene2, gene4, gene6]
2 ID3 [gene1, gene3, gene6]
you could use
from collections import defaultdict

genes_dict = defaultdict(list)
# Each row unpacks to (pathway ID, list of genes); this assumes df has exactly these two columns.
for i, genes in df.itertuples(index=False):
    for gene in genes:
        genes_dict[gene].append(i)
In case of
df =
gene pathway ID gene ID
0 ID1 gene1,gene2,gene3
1 ID2 gene2,gene4,gene6
2 ID3 gene1,gene3,gene6
you could try
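from collections import defaultdict

genes_dict = defaultdict(list)
# Each row unpacks to (pathway ID, comma-separated genes); this assumes df has exactly these two columns.
for i, genes in df.itertuples(index=False):
    for gene in genes.split(","):
        genes_dict[gene].append(i)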
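As an alternative to the explicit loops above, the same mapping can be sketched with pandas operations only: split the comma-separated strings, `explode` them into one row per gene, then group by gene. This is a minimal sketch, not something from the question: the sample DataFrame is rebuilt from the table shown there, the helper column name `gene` is made up for illustration, and it assumes a pandas version that provides `DataFrame.explode`. Whether it is actually faster than the `defaultdict` loop depends on the data, so it is worth timing both on the real DataFrame.

```python
import pandas as pd

# Sample data rebuilt from the table in the question (an assumption for illustration).
df = pd.DataFrame({
    "gene pathway ID": ["ID1", "ID2", "ID3"],
    "gene ID": ["gene1,gene2,gene3", "gene2,gene4,gene6", "gene1,gene3,gene6"],
})

genes_dict = (
    # "gene" is a hypothetical helper column holding the split gene lists.
    df.assign(gene=df["gene ID"].str.split(","))
      # One row per (pathway ID, gene) pair.
      .explode("gene")
      # Collect the pathway IDs of each gene into a list.
      .groupby("gene")["gene pathway ID"]
      .agg(list)
      .to_dict()
)

print(genes_dict)
```

Unlike the `defaultdict` versions, the result is a plain `dict`, so looking up a gene that is not present raises `KeyError` instead of returning an empty list.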