Home > database >  Is there a way to fill in specific missing rows in a column using rows in another column as a filter
Is there a way to fill in specific missing rows in a column using rows in another column as a filter

Time:06-06

I have a dataset of car attributes with missing values in some columns. In the Distance column for example, there are missing values and I want to replace them with the mean. There is a second column however, Car Type it shows whether the car is brand new or used. A brand new car would not have that many miles driven compared to a used car. I want to replace the NaN values in Distance with the mean of Distance values where the Car Type == 'Brand New'

Minimal setup:

df = pd.DataFrame({'Car type': ['New','Used','New','New','New','Used','New','New'],
                   'Distance':[20,2222,34,np.nan,np.nan,np.nan,50,10]})
print(df)

  Car type  Distance
0      New      20.0
1     Used    2222.0
2      New      34.0
3      New       NaN
4      New       NaN
5     Used       NaN
6      New      50.0
7      New      10.0

CodePudding user response:

Compute the mean for each Car Type and broadcast the values (with transform) to all rows then use fillna to replace NaN by the mean value:

df['Distance'] = (df['Distance'].fillna(df.groupby('Car type')['Distance']
                                .transform('mean')))
print(df)

# Output
  Car type  Distance
0      New      20.0
1     Used    2222.0
2      New      34.0
3      New      28.5  # mean of New car
4      New      28.5  # mean of New car
5     Used    2222.0  # mean of Used car
6      New      50.0
7      New      10.0

CodePudding user response:

df['distance'].fillna((df['distance'].mean()), inplace=True) 

This code above does the naive replacement. Note that the inplace statement makes the original object modified with this change

import numpy as np
df['distance'] = np.where(df['Car type'] == "Brand New", df['distance'].mean(), df['distance'])

This uses numpy so be sure to import numpy as np. Elegant and fast

  • Related