Scikit learn imputer in a pandas dataframe group by id-CodePudding

I have a pandas dataframe with the following data grouped by id and ordered by seq (note that some amounts are nan). I need to run scikit-learn imputer to impute the average by id.

from sklearn import impute
import pandas as pd
import numpy as np

rows  = [{'id': 1, 'seq': 0, 'amount': 2000 },
          {'id': 1, 'seq': 1, 'amount': 4000 },
          {'id': 1, 'seq': 2, 'amount': np.nan },
          {'id': 2, 'seq': 0, 'amount': 1000 },
          {'id': 2, 'seq': 1, 'amount': 3000 },
          {'id': 2, 'seq': 2, 'amount': np.nan }]

pdf = pd.DataFrame(rows)

imputer = impute.SimpleImputer(strategy='mean')

# If I run this it will ignore the id
pdf[['amount_imputed']] = imputer.fit_transform(pdf[['amount']])

The result of amount_imputed should be 3000 for id = 1 and 2000 for id = 2. Instead, the statement above fills both amounts with the total average, 5000. How to group the imputer by id?

CodePudding user response：

Let's try

pdf['amount_imputed'] = (pdf.groupby('id', group_keys=False)['amount']
                         .transform(lambda col:
                             imputer.fit_transform(col.to_frame()).flatten(),
                         ))
# or
pdf['amount_imputed'] = (pdf.groupby('id', group_keys=False)
                         .apply(lambda g: pd.Series(
                             imputer.fit_transform(g[['amount']]).flatten(),
                             index=g.index
                         )))

print(pdf)

   id  seq  amount  amount_imputed
0   1    0  2000.0          2000.0
1   1    1  4000.0          4000.0
2   1    2     NaN          3000.0
3   2    0  1000.0          1000.0
4   2    1  3000.0          3000.0
5   2    2     NaN          2000.0

CodePudding user response：

I think, using SimpleImputer we can't solve this kind of problem like group based. But there are some alternatives.

you can try this :

# Replace amount NaN with mean amount of same id
pdf['amount_imputed'] = pdf.groupby('id').amount.transform(lambda x: x.fillna(x.mean()))
pdf.amount_imputed

Output :

0    2000.0
1    4000.0
2    3000.0
3    1000.0
4    3000.0
5    2000.0
Name: amount_imputed, dtype: float64