This might be very simple question, but here's my dataframe:
id text position labels
0 39088 skin melanoma [58.0, 71.0] indication
1 39088 proteinase [137.0, 147.0] protein
2 39088 plasminogen activator [170.0, 191.0] protein
3 39088 NaN [nan, nan] NaN
4 39088 NaN [nan, nan] NaN
5 39088 proteinase substrates [36.0, 57.0] protein
6 39088 tumors [67.0, 73.0] indication
7 39088 NaN [nan, nan] NaN
8 39088 Melanoma [0.0, 8.0] indication
9 39088 EDTA [172.0, 176.0] protein
{'pmid': [39088,
39088,
39088,
39088,
39088,
39088,
39088,
39088,
39088,
39088],
'text': ['skin melanoma',
'proteinase',
'plasminogen activator',
nan,
nan,
'proteinase substrates',
'tumors',
nan,
'Melanoma',
'EDTA'],
'position': ['[58.0, 71.0]',
'[137.0, 147.0]',
'[170.0, 191.0]',
'[nan, nan]',
'[nan, nan]',
'[36.0, 57.0]',
'[67.0, 73.0]',
'[nan, nan]',
'[0.0, 8.0]',
'[172.0, 176.0]'],
'labels': ['indication',
'protein',
'protein',
nan,
nan,
'protein',
'indication',
nan,
'indication',
'protein']}
And here's the WANTED OUTPUT , where I want to create 2 new columns based on the values of the labels
column, and put as values the corresponding text
and position
depending on whether they are indication or protein, and for the rest NaN
.
id indication indication.position protein protein.position
0 39088 skin melanoma [58.0, 71.0] NaN [nan, nan]
1 39088 NaN [nan, nan] proteinase [137.0, 147.0]
2 39088 NaN [nan, nan] plasminogen activator [170.0, 191.0]
3 39088 NaN [nan, nan] NaN [nan, nan]
4 39088 NaN [nan, nan] NaN [nan, nan]
5 39088 NaN [nan, nan] proteinase substrates [36.0, 57.0]
6 39088 tumors [67.0, 73.0] NaN [nan, nan]
7 39088 NaN [nan, nan] NaN [nan, nan]
8 39088 Melanoma [0.0, 8.0] NaN [nan, nan]
9 39088 NaN [nan, nan] EDTA [172.0, 176.0]
What is the best way to do this? Can someone help?
CodePudding user response:
You can use:
out = (df
.drop(columns=['position', 'result.value.labels'])
.join(
df.reset_index().dropna(subset=['result.value.labels'])
.pivot(index='index', columns='result.value.labels', values='position')
.reindex(df.index)
.fillna('[nan, nan]')
.add_suffix('.position')
)
)
output:
pmid result.value.text indication.position protein.position
0 39088 skin melanoma [58.0, 71.0] [nan, nan]
1 39088 proteinase [nan, nan] [137.0, 147.0]
2 39088 plasminogen activator [nan, nan] [170.0, 191.0]
3 39088 NaN [nan, nan] [nan, nan]
4 39088 NaN [nan, nan] [nan, nan]
5 39088 proteinase substrates [nan, nan] [36.0, 57.0]
6 39088 tumors [67.0, 73.0] [nan, nan]
7 39088 NaN [nan, nan] [nan, nan]
8 39088 Melanoma [0.0, 8.0] [nan, nan]
9 39088 EDTA [nan, nan] [172.0, 176.0]