How to replace group NaN value with group value (qualititative)?-CodePudding

I have the following dataframe :

ID  St  St_type  st_source  st_label  q        ... cluster 
1   ax   NaN       NaN       NaN      NaN      ...  1
2   ax3   X        NaN       NaN      90.32    ...  1
3   ax20  X        Bi        Ci       102.70   ...  1
4   ax10  Y        NaN       N2       103      ...  2
5   ax18  Y        H1        NaN      NaN      ...  2
6   ax100 NaN      NaN       NaN      20.60    ...  2
    ...   ...     ...        ...      ...    ... ...

I would like, for each cluster, that NaN values be replace by non NaN values of the concerned cluster for specific columns "St_type" "st_source" and "st_label" which should give :

ID  St  St_type  st_source  st_label  q      ... cluster 
1   ax    X        Bi        Ci      NaN     ...  1
2   ax3   X        Bi        Ci      90.32   ...  1
3   ax20  X        Bi        Ci      102.70  ...  1
4   ax10  Y        H1        N2      103     ...  2
5   ax18  Y        H1        N2      NaN     ...  2
6   ax100 Y        H1        N2      20.60   ...  2
    ...   ...     ...        ...      ...    ... ...

How is it possible to obtain a such result ?

CodePudding user response：

Assume that for each column in the same cluster, there is only one value to fill.

Sample set up:

df = pd.DataFrame({'name': ['a1','a2','a3','a4','b1','b2','b3','b4'],
                   'cluster':[1]*4 [2]*4,
                   'val1':[1,None,1,None,None,2,2,None],
                   'val2':[None,'c',None,'c','d',None, None, 'd']})

Data input:

  name  cluster  val1  val2
0   a1        1   1.0  None
1   a2        1   NaN     c
2   a3        1   1.0  None
3   a4        1   NaN     c
4   b1        2   NaN     d
5   b2        2   2.0  None
6   b3        2   2.0  None
7   b4        2   NaN     d

Apply forward fill and backward fill:

df[['val1', 'val2']] = df.groupby('cluster')[['val1', 'val2']].apply(lambda g: g.ffill())
df[['val1', 'val2']] = df.groupby('cluster')[['val1', 'val2']].apply(lambda g: g.bfill())

Data output:

  name  cluster  val1 val2
0   a1        1   1.0    c
1   a2        1   1.0    c
2   a3        1   1.0    c
3   a4        1   1.0    c
4   b1        2   2.0    d
5   b2        2   2.0    d
6   b3        2   2.0    d
7   b4        2   2.0    d