I'm trying to extract values from the dictionaries inside a list stored in a DataFrame column. My dataframe looks like this:
id proteinIds
0 ENSG00000006194 [{'id': 'O14978', 'source': 'uniprot_swissprot...
1 ENSG00000007520 [{'id': 'Q9UJK0', 'source': 'uniprot_swissprot...
2 ENSG00000020922 [{'id': 'P49959', 'source': 'uniprot_swissprot...
3 ENSG00000036549 [{'id': 'Q8IYH5', 'source': 'uniprot_swissprot...
4 ENSG00000053524 [{'id': 'Q86YR7', 'source': 'uniprot_swissprot...
Each value in the proteinIds column holds multiple ids, as shown below. I want to extract only the id whose source is uniprot_swissprot, and return None if uniprot_swissprot is not present in the list:
[{'id': 'O60284', 'source': 'uniprot_swissprot'},
{'id': 'E5RFE8', 'source': 'uniprot_trembl'},
{'id': 'E5RHS3', 'source': 'uniprot_trembl'},
{'id': 'E5RHY1', 'source': 'uniprot_trembl'},
{'id': 'E5RID0', 'source': 'uniprot_trembl'},
{'id': 'E5RK88', 'source': 'uniprot_trembl'},
{'id': 'Q17RY1', 'source': 'uniprot_obsolete'}]
Expected output
id proteinIds
0 ENSG00000006194 O14978
1 ENSG00000007520 Q9UJK0
2 ENSG00000020922 P49959
3 ENSG00000036549 Q8IYH5
4 ENSG00000053568 None
I tried the code below, but it does not return only the ids related to uniprot_swissprot (it collects every id in each list). Any help is appreciated, thanks.
df1 = pd.DataFrame([[y['id'] for y in x] if isinstance(x, list) else [None] for x in df['proteinIds']], index=df.index)
Answer:
You can explode the lists in the proteinIds column into one row per dictionary, expand each dictionary into DataFrame columns, and then conditionally select the id column where source is uniprot_swissprot:
# explode keeps the original index, so the result can be assigned back to df
# (this assumes at most one uniprot_swissprot entry per row)
df['Ids'] = (df['proteinIds'].explode()
             .apply(pd.Series)
             .loc[lambda d: d['source'].eq('uniprot_swissprot'), 'id'])
print(df)
id \
0 ENSG00000006194
1 ENSG00000007520
proteinIds \
0 [{'id': 'O60284', 'source': 'uniprot_swissprot'}, {'id': 'E5RFE8', 'source': 'uniprot_trembl'}]
1 [{'id': 'E5RK88', 'source': 'uniprot_trembl'}, {'id': 'Q17RY1', 'source': 'uniprot_obsolete'}]
Ids
0 O60284
1 NaN
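If you would rather avoid exploding the column, a plain Python helper applied row by row is another option. The following is only a minimal sketch: the helper name first_swissprot_id is made up for illustration, the sample frame is the two-row example above, and it assumes that taking the first uniprot_swissprot entry is acceptable when a row contains more than one.

```python
import pandas as pd


def first_swissprot_id(entries):
    """Return the first id whose source is 'uniprot_swissprot', else None.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Rows that do not hold a list (e.g. NaN) have nothing to search.
    if not isinstance(entries, list):
        return None
    # next() stops at the first match and falls back to None otherwise.
    return next(
        (e.get("id") for e in entries
         if isinstance(e, dict) and e.get("source") == "uniprot_swissprot"),
        None,
    )


# Small sample frame mirroring the example output above.
df = pd.DataFrame({
    "id": ["ENSG00000006194", "ENSG00000007520"],
    "proteinIds": [
        [{"id": "O60284", "source": "uniprot_swissprot"},
         {"id": "E5RFE8", "source": "uniprot_trembl"}],
        [{"id": "E5RK88", "source": "uniprot_trembl"},
         {"id": "Q17RY1", "source": "uniprot_obsolete"}],
    ],
})

df["Ids"] = df["proteinIds"].apply(first_swissprot_id)
print(df[["id", "Ids"]])
```

Rows without a uniprot_swissprot entry end up with a missing value in Ids; whether it is displayed as None or NaN can depend on the pandas version. Because the helper picks a single id per row, it also sidesteps the duplicate-index problem that the explode approach runs into when a row has several uniprot_swissprot entries.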