I have a pandas dataframe with the following data grouped by id and ordered by seq (note that some amounts are nan
). I need to run scikit-learn imputer to impute the average by id.
from sklearn import impute
import pandas as pd
import numpy as np
rows = [{'id': 1, 'seq': 0, 'amount': 2000 },
{'id': 1, 'seq': 1, 'amount': 4000 },
{'id': 1, 'seq': 2, 'amount': np.nan },
{'id': 2, 'seq': 0, 'amount': 1000 },
{'id': 2, 'seq': 1, 'amount': 3000 },
{'id': 2, 'seq': 2, 'amount': np.nan }]
pdf = pd.DataFrame(rows)
imputer = impute.SimpleImputer(strategy='mean')
# If I run this it will ignore the id
pdf[['amount_imputed']] = imputer.fit_transform(pdf[['amount']])
The result of amount_imputed should be 3000 for id = 1 and 2000 for id = 2. Instead, the statement above fills both amounts with the total average, 5000. How to group the imputer by id?
CodePudding user response:
Let's try
pdf['amount_imputed'] = (pdf.groupby('id', group_keys=False)['amount']
.transform(lambda col:
imputer.fit_transform(col.to_frame()).flatten(),
))
# or
pdf['amount_imputed'] = (pdf.groupby('id', group_keys=False)
.apply(lambda g: pd.Series(
imputer.fit_transform(g[['amount']]).flatten(),
index=g.index
)))
print(pdf)
id seq amount amount_imputed
0 1 0 2000.0 2000.0
1 1 1 4000.0 4000.0
2 1 2 NaN 3000.0
3 2 0 1000.0 1000.0
4 2 1 3000.0 3000.0
5 2 2 NaN 2000.0
CodePudding user response:
I think, using SimpleImputer
we can't solve this kind of problem like group based. But there are some alternatives.
you can try this :
# Replace amount NaN with mean amount of same id
pdf['amount_imputed'] = pdf.groupby('id').amount.transform(lambda x: x.fillna(x.mean()))
pdf.amount_imputed
Output :
0 2000.0
1 4000.0
2 3000.0
3 1000.0
4 3000.0
5 2000.0
Name: amount_imputed, dtype: float64