I have the following dataframe :
ID St St_type st_source st_label q ... cluster
1 ax NaN NaN NaN NaN ... 1
2 ax3 X NaN NaN 90.32 ... 1
3 ax20 X Bi Ci 102.70 ... 1
4 ax10 Y NaN N2 103 ... 2
5 ax18 Y H1 NaN NaN ... 2
6 ax100 NaN NaN NaN 20.60 ... 2
... ... ... ... ... ... ...
I would like, for each cluster, that NaN values be replace by non NaN values of the concerned cluster for specific columns "St_type" "st_source" and "st_label" which should give :
ID St St_type st_source st_label q ... cluster
1 ax X Bi Ci NaN ... 1
2 ax3 X Bi Ci 90.32 ... 1
3 ax20 X Bi Ci 102.70 ... 1
4 ax10 Y H1 N2 103 ... 2
5 ax18 Y H1 N2 NaN ... 2
6 ax100 Y H1 N2 20.60 ... 2
... ... ... ... ... ... ...
How is it possible to obtain a such result ?
CodePudding user response:
Assume that for each column in the same cluster
, there is only one value to fill.
Sample set up:
df = pd.DataFrame({'name': ['a1','a2','a3','a4','b1','b2','b3','b4'],
'cluster':[1]*4 [2]*4,
'val1':[1,None,1,None,None,2,2,None],
'val2':[None,'c',None,'c','d',None, None, 'd']})
Data input:
name cluster val1 val2
0 a1 1 1.0 None
1 a2 1 NaN c
2 a3 1 1.0 None
3 a4 1 NaN c
4 b1 2 NaN d
5 b2 2 2.0 None
6 b3 2 2.0 None
7 b4 2 NaN d
Apply forward fill and backward fill:
df[['val1', 'val2']] = df.groupby('cluster')[['val1', 'val2']].apply(lambda g: g.ffill())
df[['val1', 'val2']] = df.groupby('cluster')[['val1', 'val2']].apply(lambda g: g.bfill())
Data output:
name cluster val1 val2
0 a1 1 1.0 c
1 a2 1 1.0 c
2 a3 1 1.0 c
3 a4 1 1.0 c
4 b1 2 2.0 d
5 b2 2 2.0 d
6 b3 2 2.0 d
7 b4 2 2.0 d