Is there a way to fill in specific missing rows in a column using rows in another column as a filter-CodePudding

I have a dataset of car attributes with missing values in some columns. In the Distance column for example, there are missing values and I want to replace them with the mean. There is a second column however, Car Type it shows whether the car is brand new or used. A brand new car would not have that many miles driven compared to a used car. I want to replace the NaN values in Distance with the mean of Distance values where the Car Type == 'Brand New'

Minimal setup:

df = pd.DataFrame({'Car type': ['New','Used','New','New','New','Used','New','New'],
                   'Distance':[20,2222,34,np.nan,np.nan,np.nan,50,10]})
print(df)

  Car type  Distance
0      New      20.0
1     Used    2222.0
2      New      34.0
3      New       NaN
4      New       NaN
5     Used       NaN
6      New      50.0
7      New      10.0

CodePudding user response：

Compute the mean for each Car Type and broadcast the values (with transform) to all rows then use fillna to replace NaN by the mean value:

df['Distance'] = (df['Distance'].fillna(df.groupby('Car type')['Distance']
                                .transform('mean')))
print(df)

# Output
  Car type  Distance
0      New      20.0
1     Used    2222.0
2      New      34.0
3      New      28.5  # mean of New car
4      New      28.5  # mean of New car
5     Used    2222.0  # mean of Used car
6      New      50.0
7      New      10.0

CodePudding user response：

df['distance'].fillna((df['distance'].mean()), inplace=True)

This code above does the naive replacement. Note that the inplace statement makes the original object modified with this change

import numpy as np
df['distance'] = np.where(df['Car type'] == "Brand New", df['distance'].mean(), df['distance'])

This uses numpy so be sure to import numpy as np. Elegant and fast