I am trying to fill each missing value with the most frequent value in its group. Code:
f = lambda x: x.mode().iat[0] if x.notna().any() else np.nan
s = df.groupby('VehicleType')['FuelType'].transform(f)
df['FuelType']=df['FuelType'].fillna(s)
Error: ValueError: Length mismatch: Expected axis has 316879 elements, new values have 354369 elements
Possible cause: I think the VehicleType column has missing values, and that is what triggers the error, because the same code works when I group by another column that has no missing values. But I have to use VehicleType for this task.
CodePudding user response:
This problem appears to have been fixed in newer versions of pandas (it works without issue on 1.4.0). For older versions of pandas, read on.
The issue is caused by NaN values in your grouping column together with .transform. By default, groupby drops rows whose key is NaN, so the transformed result has fewer rows than the original index. To get around this, instead of grouping by the column name, group by the Series after you first .fillna() it with some value that doesn't occur in that column. This will succeed in assigning the NaN 'VehicleType' rows the modal value for 'FuelType' among those NaN rows.
I'll assign the result as a separate column below for illustration.
Sample data to reproduce problem
import pandas as pd
import numpy as np
# np.nan rather than np.NaN: the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame({'VehicleType': ['a', 'b', 'c', 'a', np.nan, np.nan, np.nan, 'a'],
                   'FuelType': ['Y', np.nan, 'Y', 'X', 'Z', 'Z', 'Y', 'X']})
f = lambda x: x.mode().iat[0] if x.notna().any() else np.nan
df.groupby('VehicleType')['FuelType'].transform(f)
# ValueError: Length mismatch: Expected axis has 5 elements, new values have 8 elements
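A quick way to confirm that the NaN keys are the cause, sketched on the sample data: the gap between the two lengths in the error message should equal the number of NaN values in the grouping column. The variable names here are only for illustration. For the numbers in the question, that would imply 354369 - 316879 = 37490 rows with a missing VehicleType.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'VehicleType': ['a', 'b', 'c', 'a', np.nan, np.nan, np.nan, 'a'],
                   'FuelType': ['Y', np.nan, 'Y', 'X', 'Z', 'Z', 'Y', 'X']})

# Rows whose grouping key is NaN are dropped by groupby by default
n_missing_keys = df['VehicleType'].isna().sum()
n_grouped = df['VehicleType'].notna().sum()

print(len(df), n_grouped, n_missing_keys)  # 8 5 3
```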
Solution
df['FuelType_mode'] = (df.groupby(df['VehicleType'].fillna('SPECIAL_MISSING'))
                       ['FuelType'].transform(f))
print(df)
  VehicleType FuelType FuelType_mode
0           a        Y             X
1           b      NaN           NaN
2           c        Y             Y
3           a        X             X
4         NaN        Z             Z
5         NaN        Z             Z
6         NaN        Y             Z
7           a        X             X
With newer versions of pandas, the dropna argument of groupby can be used to specify whether you want to ignore NaN rows entirely when you group, or consider them their own unique group. Depending upon your desired behavior you would do:
# NaN VehicleType rows still get the modal FuelType of the NaN group.
# Same logic as above.
df['FT3'] = df.groupby('VehicleType', dropna=False)['FuelType'].transform(f)
# NaN VehicleType rows get NaN FuelType (dropna=True is the default).
df['FT4'] = df.groupby('VehicleType')['FuelType'].transform(f)
  VehicleType FuelType FuelType_mode  FT3  FT4
0           a        Y             X    X    X
1           b      NaN           NaN  NaN  NaN
2           c        Y             Y    Y    Y
3           a        X             X    X    X
4         NaN        Z             Z    Z  NaN
5         NaN        Z             Z    Z  NaN
6         NaN        Y             Z    Z  NaN
7           a        X             X    X    X
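To finish the original task, the per-group mode can then be used to fill only the missing FuelType values. Here is a minimal sketch, using a small hypothetical frame (not the sample data, where the only missing FuelType sits in a group with no observed values) and a named function group_mode in place of the lambda. It groups by the sentinel-filled key so that it should behave the same on old and new pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: missing FuelType values inside known and NaN VehicleType groups
df = pd.DataFrame({
    'VehicleType': ['a', 'a', 'a', np.nan, np.nan, np.nan, 'b'],
    'FuelType':    ['X', 'X', np.nan, 'Z', np.nan, 'Z', np.nan],
})

def group_mode(x):
    # Most frequent non-missing value, or NaN if the whole group is missing
    return x.mode().iat[0] if x.notna().any() else np.nan

# Sentinel key so NaN VehicleType rows form their own group;
# on newer pandas, groupby('VehicleType', dropna=False) should be equivalent
key = df['VehicleType'].fillna('SPECIAL_MISSING')
modes = df.groupby(key)['FuelType'].transform(group_mode)

# Only the missing FuelType entries are replaced; existing values are kept
df['FuelType'] = df['FuelType'].fillna(modes)
print(df)
```

The last row stays NaN because group 'b' has no observed FuelType to take a mode from, so a global fallback would be needed if every row must be filled.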