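from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def preprocessing(X_train):
    cat_cols = []
    num_cols = []
    for cols in X_train.columns:
        # Low-cardinality text columns are treated as categorical
        if X_train[cols].nunique() < 10 and X_train[cols].dtype == "object":
            cat_cols.append(cols)
        # Integer and float columns are treated as numeric
        elif X_train[cols].dtype in ["int64", "float64"]:
            num_cols.append(cols)
    full_cols = cat_cols + num_cols
    # Fill missing numeric values with a constant (0 by default)
    num_transformer = SimpleImputer(strategy="constant")
    # Fill missing categories with the most frequent value, then one-hot encode
    cat_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ])
    preprocessor = ColumnTransformer(
        transformers=[
            ("num", num_transformer, num_cols),
            ("cat", cat_transformer, cat_cols)
        ])
    # Fit the preprocessor on the passed-in data and return the transformed features
    return preprocessor.fit_transform(X_train)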
The code defines a preprocessing function for transforming the dataset. When I trained the model, I got 226 features. But when I transformed the test dataset for prediction, I only got 217.
The error message: Feature shape mismatch, expected: 226, got 217
The dataset I am using: https://www.kaggle.com/competitions/home-data-for-ml-course
I want to know why this happens and how to solve it.
Answer:
You should fit the preprocessor on the training data only, and just transform the test data with it. If you refit it on the test data, it will learn a completely different mapping. The shape mismatch is actually a lucky error: otherwise the code would run silently while the model receives a completely scrambled representation. This is especially important with one-hot encoding. Imagine that in your training data feature 1 takes the values ["cat", "dog", "duck"], so cat=>(1,0,0), dog=>(0,1,0), duck=>(0,0,1). If the test data only contains ["cat", "duck"], refitting gives cat=>(1,0), duck=>(0,1), so you get a shape mismatch and a duck becomes sort of a dog!
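As a minimal sketch of that fix, the function can return the unfitted preprocessor so that it is fitted once on the training data and then reused on the test data. The helper name build_preprocessor, the explicit column lists, and the toy "animal"/"age" data are assumptions made up for illustration; they are not part of the original question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def build_preprocessor(num_cols, cat_cols):
    """Return an UNFITTED ColumnTransformer (hypothetical helper)."""
    cat_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer(transformers=[
        ("num", SimpleImputer(strategy="constant"), num_cols),
        ("cat", cat_transformer, cat_cols),
    ])


# Toy data (made up): the test set never contains "dog".
X_train = pd.DataFrame({
    "animal": pd.Series(["cat", "dog", "duck"], dtype="object"),
    "age": [1.0, 2.0, 3.0],
})
X_test = pd.DataFrame({
    "animal": pd.Series(["cat", "duck"], dtype="object"),
    "age": [4.0, 5.0],
})

preprocessor = build_preprocessor(num_cols=["age"], cat_cols=["animal"])
train_features = preprocessor.fit_transform(X_train)  # learn the mapping once
test_features = preprocessor.transform(X_test)        # reuse it, no refitting

# Both have 4 columns: age + one-hot(cat, dog, duck)
print(train_features.shape, test_features.shape)  # (3, 4) (2, 4)
```

With the Kaggle data, the same pattern would apply: select the columns from the training frame, call fit_transform on it, and call only transform on the test frame before predicting. Because the encoder keeps the categories it learned from the training set, a duck in the test data still lands in the duck column.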