This is my dataframe:
text from_id to_id position
0 NaN NaN NaN [nan, nan]
1 albumin NaN NaN [0.0, 23.0]
2 hepatoma NaN NaN [93.0, 101.0]
3 NaN hepatoma albumin [nan, nan]
I want to create two new columns named from_position
and to_position
and I want them to be like this:
text from_id to_id position from_position to_position
0 NaN NaN NaN [nan, nan] [nan, nan] [nan, nan]
1 albumin NaN NaN [0.0, 23.0] [nan, nan] [nan, nan]
2 hepatoma NaN NaN [93.0, 101.0] [nan, nan] [nan, nan]
3 NaN hepatoma albumin [nan, nan] [93.0, 101.0] [0.0, 23.0]
Note that it doesn't have to be [nan,nan] it can also be NaN. Moreover, the position values aren't lists of floats, they are strings (I'm ok with converting them).
I tried this
s = full_df_normalized.dropna(subset=['text', 'position']).set_index('text')['position'].to_dict()
cols = ['from_position','to_position']
full_df_normalized[cols] = full_df_normalized[cols].apply(lambda x: x.map(d))
But I got this error: KeyError: "None of [Index(['from_position', 'to_position'], dtype='object')] are in the [columns]"
So, I created 2 empty (NaN) columns named 'from_position','to_position' and ran it again, but they are still NaN.
I also tried this :
for i,row in full_df_normalized.iterrows():
for key,value in sd.items():
if key == row['from_id']:
row['from_position'] == value
But once again, the 'from_position' column remains NaN. Can someone help?
CodePudding user response:
In your solution apply mapping for id
columns from list cols1
, is necessary same ordering like col
, last replace missing values by list [np.nan, np.nan]
:
nan = np.nan
full_df_normalized = pd.DataFrame({'text': [nan, 'albumin', 'hepatoma', nan], 'from_id': [nan, nan, nan, 'hepatoma'], 'to_id': [nan, nan, nan, 'albumin'], 'position': [[nan, nan], [0.0, 23.0], [93.0, 101.0], [nan, nan]]})
print (full_df_normalized)
text from_id to_id position
0 NaN NaN NaN [nan, nan]
1 albumin NaN NaN [0.0, 23.0]
2 hepatoma NaN NaN [93.0, 101.0]
3 NaN hepatoma albumin [nan, nan]
d = full_df_normalized.dropna(subset=['text', 'position']).set_index('text')['position'].to_dict()
cols = ['from_position','to_position']
cols1 = ['from_id','to_id']
f = lambda x:x.map(d).dropna().reindex(full_df_normalized.index,fill_value=[np.nan,np.nan])
full_df_normalized[cols] = full_df_normalized[cols1].apply(f)
print (full_df_normalized)
text from_id to_id position from_position to_position
0 NaN NaN NaN [nan,nan] [nan, nan] [nan, nan]
1 albumin NaN NaN [0.0,23.0] [nan, nan] [nan, nan]
2 hepatoma NaN NaN [93.0,101.0] [nan, nan] [nan, nan]
3 NaN hepatoma albumin [nan,nan] [93.0,101.0] [0.0,23.0]
Or you can processing each coumn separately:
for c in cols:
c1 = c.replace('_position', '_id')
full_df_normalized[c] = full_df_normalized[c1].map(d).dropna().reindex(full_df_normalized.index, fill_value=[np.nan, np.nan])
print (full_df_normalized)
text from_id to_id position from_position to_position
0 NaN NaN NaN [nan,nan] [nan, nan] [nan, nan]
1 albumin NaN NaN [0.0,23.0] [nan, nan] [nan, nan]
2 hepatoma NaN NaN [93.0,101.0] [nan, nan] [nan, nan]
3 NaN hepatoma albumin [nan,nan] [93.0,101.0] [0.0,23.0]
CodePudding user response:
Using a reshaped DataFrame as intermediate:
cols = ['from_id', 'to_id']
df.join(df
.reset_index()
.drop(columns='position')
.melt(id_vars='index', value_vars=cols, value_name='text')
.merge(df.dropna(subset='text'), on='text')
.pivot('index', 'variable', 'position')
.rename(columns=lambda s: s.replace('_id', '_position'))
)
output:
text from_id to_id position from_position to_position
0 NaN NaN NaN [nan, nan] NaN NaN
1 albumin NaN NaN [0.0, 23.0] NaN NaN
2 hepatoma NaN NaN [93.0, 101.0] NaN NaN
3 NaN hepatoma albumin [nan, nan] [93.0, 101.0] [0.0, 23.0]