Filling NaN using groupby dict-CodePudding

I am attempting to replace the NaN in the 'cylinders' column using a dictionary with the median model cylinder number. I think it should work easily but I have been getting stuck each way I try.

cylinders_model_med = df.groupby('model').agg({'cylinders': 'median'})

cylinders_model_med=cylinders_model_med.to_dict()

#output:

  'bmw x5': 6.0,
  'buick enclave': 6.0,
  'cadillac escalade': 8.0,
  'chevrolet camaro': 6.0,
  'chevrolet camaro lt coupe 2d': 6.0,
  'chevrolet colorado': 5.0,
  'chevrolet corvette': 8.0,
  'chevrolet cruze': 4.0,
  'chevrolet equinox': 4.0,
  'chevrolet impala': 6.0,
  'chevrolet malibu': 4.0,
  'chevrolet silverado': 8.0,
  'chevrolet silverado 1500': 8.0,
  'chevrolet silverado 1500 crew': 8.0,
  'chevrolet silverado 2500hd': 8.0,
  'chevrolet silverado 3500hd': 8.0,
  'chevrolet suburban': 8.0,
  'chevrolet tahoe': 8.0,
  'chevrolet trailblazer': 6.0,
  'chevrolet traverse': 6.0,
  'chrysler 200': 4.0,
  'chrysler 300': 6.0,
  'chrysler town & country': 6.0,
  'dodge charger': 6.0,
  'dodge dakota': 6.0,
  'dodge grand caravan': 6.0,
  'ford econoline': 8.0,
  'ford edge': 6.0,
  'ford escape': 4.0,
  'ford expedition': 8.0,
  'ford explorer': 6.0,
  'ford f-150': 8.0,
  'ford f-250': 8.0,
  'ford f-250 sd': 8.0,
  'ford f-250 super duty': 8.0,
  'ford f-350 sd': 8.0,
  'ford f150': 8.0,
  'ford f150 supercrew cab xlt': 6.0,
  'ford f250': 8.0,
  'ford f250 super duty': 8.0,
  'ford f350': 8.0,
  'ford f350 super duty': 8.0,
  'ford focus': 4.0,
  'ford focus se': 4.0,
  'ford fusion': 4.0,
  'ford fusion se': 4.0,
  'ford mustang': 6.0,
  'ford mustang gt coupe 2d': 8.0,
  'ford ranger': 6.0,
  'ford taurus': 6.0,
  'gmc acadia': 6.0,
  'gmc sierra': 8.0,
  'gmc sierra 1500': 8.0,
  'gmc sierra 2500hd': 8.0,
  'gmc yukon': 8.0,
  'honda accord': 4.0,
  'honda civic': 4.0,
  'honda civic lx': 4.0,
  'honda cr-v': 4.0,
  'honda odyssey': 6.0,
  'honda pilot': 6.0,
  'hyundai elantra': 4.0,
  'hyundai santa fe': 6.0,
  'hyundai sonata': 4.0,
  'jeep cherokee': 6.0,
  'jeep grand cherokee': 6.0,
  'jeep grand cherokee laredo': 6.0,
  'jeep liberty': 6.0,
  'jeep wrangler': 6.0,
  'jeep wrangler unlimited': 6.0,
  'kia sorento': 4.0,
  'kia soul': 4.0,
  'mercedes-benz benze sprinter 2500': 6.0,
  'nissan altima': 4.0,
  'nissan frontier': 6.0,
  'nissan frontier crew cab sv': 6.0,
  'nissan maxima': 6.0,
  'nissan murano': 6.0,
  'nissan rogue': 4.0,
  'nissan sentra': 4.0,
  'nissan versa': 4.0,
  'ram 1500': 8.0,
  'ram 2500': 6.0,
  'ram 3500': 6.0,
  'subaru forester': 4.0,
  'subaru impreza': 4.0,
  'subaru outback': 4.0,
  'toyota 4runner': 6.0,
  'toyota camry': 4.0,
  'toyota camry le': 4.0,
  'toyota corolla': 4.0,
  'toyota highlander': 6.0,
  'toyota prius': 4.0,
  'toyota rav4': 4.0,
  'toyota sienna': 6.0,
  'toyota tacoma': 6.0,
  'toyota tundra': 8.0,
  'volkswagen jetta': 4.0,
  'volkswagen passat': 4.0}} 
#input:
df['cylinders']=df['cylinders'].fillna(cylinders_model_med)
df['cylinders'].isna().sum()
#output
5260

This is the same number of NaN I started with. I am new here so let me know if you need more (or less) information.

Thank you for your time!

CodePudding user response：

Alignment in pandas is based on index, so you need to create defaults which either you explicitly align to your dataframe or which will be aligned automatically. The easiest way to do this would be to create a default series with the same index as df using replace:

defaults = df['model'].replace(cylinders_model_med)
df['cylinders'] = df['cylinders'].fillna(defaults)

See the docs for more info: vectorized operations and label alignment

CodePudding user response：

Filling in NaN values from corresponding values is what combine_first does for a living. You could calculate the median cylinder number by model then fill in the original dataframe NaN cylinder numbers by model.

Assume this starting dataframe

            model  cylinders
0   nissan maxima        6.0
1   nissan maxima        6.0
2   nissan maxima        4.0
3   nissan murano        6.0
4   nissan murano        NaN
5   nissan murano        4.0
6   nissan murano        6.0
7    nissan rogue        4.0
8    nissan rogue        4.0
9    nissan rogue        NaN
10   nissan rogue        6.0
11  nissan sentra        6.0
12  nissan sentra        4.0
13  nissan sentra        4.0
14   nissan versa        4.0
15   nissan versa        4.0
16   nissan versa        NaN
17   nissan versa        4.0

Calculate the median cylinders by model and fill in the NaNs

df.assign(cylinders=df['cylinders'].combine_first(df[['model','cylinders']].groupby('model').transform('median').squeeze()))

Result

            model  cylinders
0   nissan maxima        6.0
1   nissan maxima        6.0
2   nissan maxima        4.0
3   nissan murano        6.0
4   nissan murano        6.0
5   nissan murano        4.0
6   nissan murano        6.0
7    nissan rogue        4.0
8    nissan rogue        4.0
9    nissan rogue        4.0
10   nissan rogue        6.0
11  nissan sentra        6.0
12  nissan sentra        4.0
13  nissan sentra        4.0
14   nissan versa        4.0
15   nissan versa        4.0
16   nissan versa        4.0
17   nissan versa        4.0