Extract specific value from a dictionary within a list in a column

Time:09-08

I'm trying to extract values from a dictionary within a list in a column. My dataframe looks like this:

             id                                         proteinIds
0  ENSG00000006194  [{'id': 'O14978', 'source': 'uniprot_swissprot...
1  ENSG00000007520  [{'id': 'Q9UJK0', 'source': 'uniprot_swissprot...
2  ENSG00000020922  [{'id': 'P49959', 'source': 'uniprot_swissprot...
3  ENSG00000036549  [{'id': 'Q8IYH5', 'source': 'uniprot_swissprot...
4  ENSG00000053524  [{'id': 'Q86YR7', 'source': 'uniprot_swissprot...

Each value in the proteinIds column holds multiple ids, as shown below. I want to extract only the id whose source is uniprot_swissprot, and return None if uniprot_swissprot is not present in the list.

[{'id': 'O60284', 'source': 'uniprot_swissprot'},
 {'id': 'E5RFE8', 'source': 'uniprot_trembl'},
 {'id': 'E5RHS3', 'source': 'uniprot_trembl'},
 {'id': 'E5RHY1', 'source': 'uniprot_trembl'},
 {'id': 'E5RID0', 'source': 'uniprot_trembl'},
 {'id': 'E5RK88', 'source': 'uniprot_trembl'},
 {'id': 'Q17RY1', 'source': 'uniprot_obsolete'}]

Expected output

                id proteinIds
0  ENSG00000006194     O14978
1  ENSG00000007520     Q9UJK0
2  ENSG00000020922     P49959
3  ENSG00000036549     Q8IYH5
4  ENSG00000053568       None

I tried the code below, but it does not return the ids related to uniprot_swissprot. Any help is appreciated, thanks.

df1 = pd.DataFrame([[y['id'] for y in x] if  isinstance(x, list) else [None] for x in df['proteinIds']], index=df.index)

CodePudding user response:

You can explode the lists in the proteinIds column so that each dictionary gets its own row, expand each dictionary into DataFrame columns, and then select the id column only where source is uniprot_swissprot:

import pandas as pd

# explode() keeps the original index, so the result aligns when assigned back
df['Ids'] = (df['proteinIds'].explode()
             .apply(pd.Series)  # expand each dict into 'id' and 'source' columns
             .loc[lambda d: d['source'].eq('uniprot_swissprot'), 'id'])
print(df)

                id  \
0  ENSG00000006194
1  ENSG00000007520

                                                                                        proteinIds  \
0  [{'id': 'O60284', 'source': 'uniprot_swissprot'}, {'id': 'E5RFE8', 'source': 'uniprot_trembl'}]
1   [{'id': 'E5RK88', 'source': 'uniprot_trembl'}, {'id': 'Q17RY1', 'source': 'uniprot_obsolete'}]

      Ids
0  O60284
1     NaN
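
If you would rather avoid the explode/expand round trip, a plain Python helper applied row by row is another option. The following is only a sketch: the helper name swissprot_id and the two-row sample frame are made up for illustration, and it assumes each cell is either a list of dicts or a missing value. The helper returns None when no uniprot_swissprot entry exists and takes the first match if a row has more than one; depending on your pandas version, the missing value may be displayed as None or NaN in the resulting column.

```python
import pandas as pd

def swissprot_id(entries):
    """Return the first 'id' whose 'source' is 'uniprot_swissprot', else None."""
    if not isinstance(entries, list):
        return None  # missing values (e.g. NaN) have no ids to search
    return next(
        (d.get('id') for d in entries if d.get('source') == 'uniprot_swissprot'),
        None,
    )

# Small made-up frame in the same shape as the question's data
df = pd.DataFrame({
    'id': ['ENSG00000006194', 'ENSG00000007520'],
    'proteinIds': [
        [{'id': 'O60284', 'source': 'uniprot_swissprot'},
         {'id': 'E5RFE8', 'source': 'uniprot_trembl'}],
        [{'id': 'E5RK88', 'source': 'uniprot_trembl'},
         {'id': 'Q17RY1', 'source': 'uniprot_obsolete'}],
    ],
})

df['Ids'] = df['proteinIds'].apply(swissprot_id)
print(df[['id', 'Ids']])
```

Because the generator stops at the first matching dictionary, this avoids building an intermediate frame with one row per dictionary, which may matter when the lists are long.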