I have a dataset of car attributes with missing values in some columns. In the Distance
column for example, there are missing values and I want to replace them with the mean. There is a second column however, Car Type
it shows whether the car is brand new or used. A brand new car would not have that many miles driven compared to a used car. I want to replace the NaN values in Distance
with the mean of Distance
values where the Car Type == 'Brand New'
Minimal setup:
df = pd.DataFrame({'Car type': ['New','Used','New','New','New','Used','New','New'],
'Distance':[20,2222,34,np.nan,np.nan,np.nan,50,10]})
print(df)
Car type Distance
0 New 20.0
1 Used 2222.0
2 New 34.0
3 New NaN
4 New NaN
5 Used NaN
6 New 50.0
7 New 10.0
CodePudding user response:
Compute the mean for each Car Type
and broadcast the values (with transform
) to all rows then use fillna
to replace NaN by the mean value:
df['Distance'] = (df['Distance'].fillna(df.groupby('Car type')['Distance']
.transform('mean')))
print(df)
# Output
Car type Distance
0 New 20.0
1 Used 2222.0
2 New 34.0
3 New 28.5 # mean of New car
4 New 28.5 # mean of New car
5 Used 2222.0 # mean of Used car
6 New 50.0
7 New 10.0
CodePudding user response:
df['distance'].fillna((df['distance'].mean()), inplace=True)
This code above does the naive replacement. Note that the inplace statement makes the original object modified with this change
import numpy as np
df['distance'] = np.where(df['Car type'] == "Brand New", df['distance'].mean(), df['distance'])
This uses numpy so be sure to import numpy as np. Elegant and fast