I'm trying to impute NA values in the engine_capacity column with the median of engine

I want to find every NA value in the nancap dataframe and replace it with the median engine_capacity from the cap dataframe, matching on car_model. I tried the following code, but it didn't work. (Sorry if my question is not clear.)

import pandas as pd

url = 'https://raw.githubusercontent.com/YousefAlotaibi/saudi_used_cars_price_prediciton/main/data/cars_cleaned_data.csv'
df = pd.read_csv(url)
df.head()

cap = df.groupby('car_model')['engine_capacity'].median().reset_index()
nancap = df[['engine_capacity', 'car_model']]
for i, z in nancap.itertuples(index=False):
    if i.is_integer() == False: # if NA 
        for c, ca in cap.itertuples(index=False):
            if c == z: # if car_model in c of cap == car_model of z in cap
                i = ca # assign median engine capacity which is ca to i

Answer:

Try:

nancap = df[['engine_capacity', 'car_model']]
nancap = (
    nancap
    .set_index('car_model')
    .fillna(
        nancap
            .groupby('car_model')
            .agg(pd.Series.median)
            .to_dict()
    )
    .reset_index()
)
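
An alternative worth considering is groupby(...).transform('median'), which returns each row's per-model median aligned to the original rows, so the set_index/reset_index round trip is not needed. The sketch below runs on a small made-up frame; the toy rows and the names toy, model_median and filled are assumptions for illustration, not values from the CSV:

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration; not taken from the real CSV).
toy = pd.DataFrame({
    "car_model": ["Camry", "Camry", "Camry", "Accent", "Accent", "Patrol"],
    "engine_capacity": [2.5, np.nan, 3.5, 1.6, np.nan, np.nan],
})

# Median engine_capacity of each row's car_model, aligned to the original rows.
model_median = toy.groupby("car_model")["engine_capacity"].transform("median")

# Fill only the missing entries; existing values are left untouched.
filled = toy["engine_capacity"].fillna(model_median)
print(filled)
```

On the real data the same two steps should apply to df directly, for example by assigning the result back to df['engine_capacity'].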

Note, however, that many models have only NaN engine_capacity values, so their median is also NaN. To fill those remaining NaN values, you can add a .fillna('No data available') after .reset_index().

For example:

nancap = df[['engine_capacity', 'car_model']]
nancap = (
    nancap
    .set_index('car_model')
    .fillna(
        nancap
            .groupby('car_model')
            .agg(pd.Series.median)
            .to_dict()
    )
    .reset_index()
    .fillna('No data available')
)
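
One caveat with a string placeholder is that the column then mixes numbers and text, which gets in the way of later numeric work. If the column has to stay numeric, a possible variation is to fall back to the overall median and record which rows were imputed. This is only a sketch on made-up data: the toy rows, the column names engine_capacity_filled and was_imputed, and the choice of the overall median as fallback are all assumptions, and whether that fallback is sensible depends on the use case:

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration); "Patrol" has no known engine_capacity.
toy = pd.DataFrame({
    "car_model": ["Camry", "Camry", "Accent", "Accent", "Patrol"],
    "engine_capacity": [2.0, np.nan, 1.6, 1.4, np.nan],
})

# Per-model median, aligned to the original rows (NaN for all-NaN models).
per_model = toy.groupby("car_model")["engine_capacity"].transform("median")

# Overall median across all known values, used as a last-resort fallback.
overall = toy["engine_capacity"].median()

# Remember which rows were originally missing before filling them.
toy["was_imputed"] = toy["engine_capacity"].isna()

# First try the per-model median, then the overall median for what remains.
toy["engine_capacity_filled"] = (
    toy["engine_capacity"].fillna(per_model).fillna(overall)
)
print(toy)
```

The was_imputed flag keeps the filled rows distinguishable from measured ones, which the 'No data available' string would otherwise have signalled.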