I have a data frame with columns of numbers one of which contains some 'tbd' values. I want to replace those 'tbd' values with mean values from the groupby of the other columns. I understand that you can fillna values using the following
df['col_d'] = df.groupby(['col_a', 'col_b'])['col_c'].transform(lambda x: x.fillna(x.round().mean()))
I cant use fillna though as I already have na values and I don't want to drop them and append them as I will need them at a later stage. I understand I could use mapping but Im wondering is there a more elegant replace method?
Input col_b contains NaNs and 'tbd' I want to use a group by of col_a and col_c to fill the 'tbd'
col_a col_b col_c
1 1 1
2 2 2
3 'tbd' 3
3 'tbd' 3
5 5 5
6 NaN 6
output used the mean of col_a and col_c
col_a col_b col_c
1 1 1
2 2 2
6 7 8
5 6 7
5 5 5
6 NaN 6
simple example but I believe it explains the issue it's essentially replacing the fillna method of the example lambda function I listed with replace method instead
CodePudding user response:
You can do with np.where
, and use to_numeric
to change to nan the 'tbd' to be able to perform the mean operation.
# dummy data
df = pd.DataFrame({
'col_a':['a']*8, 'col_b':['b','c']*4,
'col_c':[1,1,np.nan,np.nan,'tbd',3,4,'tbd']
})
df['col_d'] = np.where(
df['col_c'].eq('tbd'),
pd.to_numeric(df['col_c'], errors='coerce')
.groupby([df['col_a'],df['col_b']]).transform('mean'),
df['col_c']
)
print(df)
# col_a col_b col_c col_d
# 0 a b 1 1
# 1 a c 1 1
# 2 a b NaN NaN
# 3 a c NaN NaN
# 4 a b tbd 2.5
# 5 a c 3 3
# 6 a b 4 4
# 7 a c tbd 2.0