AI preprocessor: Transformed data results in a different feature shape

Time:08-07

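from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def preprocessing(X_train):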
    cat_cols = []
    num_cols = []

    for cols in X_train.columns:
        if X_train[cols].nunique()<10 and X_train[cols].dtype =="object":
            cat_cols.append(cols)
        elif X_train[cols].dtype in ["int64","float64"]:
            num_cols.append(cols)

    full_cols = cat_cols + num_cols


    num_transformer = SimpleImputer(strategy = "constant")

    cat_transformer = Pipeline(steps =[
        ("imputer", SimpleImputer(strategy = "most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ])


    preprocessor = ColumnTransformer(
        transformers = [
            ("num", num_transformer, num_cols),
            ("cat", cat_transformer, cat_cols)
    ])
    
    

    return preprocessor.fit_transform(X_train)

The code creates a preprocessing function for transforming the dataset. When I trained the model I got 226 features, but when I tried to transform the test dataset for prediction I only got 217.

The error message: Feature shape mismatch, expected: 226, got 217

The dataset I am using: https://www.kaggle.com/competitions/home-data-for-ml-course

I want to know why this happens and how to solve it.

CodePudding user response:

You should fit the preprocessor on the training data and only transform the test data. Your function calls fit_transform every time, so when you pass it the test set it refits and learns a completely different mapping. The shape mismatch is actually a lucky error: if the shapes happened to match, the code would run silently while the model received a scrambled representation. This matters most for steps like one-hot encoding. Imagine that in your training data feature 1 takes the values ["cat", "dog", "duck"], so cat=>(1,0,0), dog=>(0,1,0), duck=>(0,0,1). If the test data only contains ["cat", "duck"], refitting gives cat=>(1,0), duck=>(0,1): the shape no longer matches, and a duck now sits in the column that used to mean dog!
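Here is a minimal sketch of the fit-once, transform-many pattern, reusing the column rules from the question. The helper name build_preprocessor and the toy "pet"/"weight" columns are made up for illustration; they are not part of the Kaggle dataset or of scikit-learn.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def build_preprocessor(X_train):
    """Hypothetical helper: returns an UNFITTED ColumnTransformer."""
    # Same column selection rules as in the question.
    cat_cols = [c for c in X_train.columns
                if X_train[c].dtype == "object" and X_train[c].nunique() < 10]
    num_cols = [c for c in X_train.columns
                if X_train[c].dtype in ["int64", "float64"]]

    cat_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer(transformers=[
        ("num", SimpleImputer(strategy="constant"), num_cols),
        ("cat", cat_transformer, cat_cols),
    ])


# Toy data (made up): the test set is missing the "dog" category.
train = pd.DataFrame({
    "pet": pd.Series(["cat", "dog", "duck"], dtype=object),
    "weight": [4.0, 20.0, 1.5],
})
test = pd.DataFrame({
    "pet": pd.Series(["cat", "duck"], dtype=object),
    "weight": [5.0, 1.0],
})

# Correct: fit on the training data only, then reuse the fitted object.
preprocessor = build_preprocessor(train)
train_out = preprocessor.fit_transform(train)
test_out = preprocessor.transform(test)

# Wrong: refitting on the test data learns a different one-hot mapping.
refit_out = build_preprocessor(test).fit_transform(test)

print(train_out.shape)  # (3, 4): 1 numeric column + 3 one-hot columns
print(test_out.shape)   # (2, 4): same columns as training
print(refit_out.shape)  # (2, 3): "dog" column is gone
```

In the original code this would mean returning the fitted preprocessor (or keeping it alongside the model, for example inside a Pipeline) so that prediction calls transform rather than fit_transform.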
