How do I ensure make_column_transformer correctly labels object?

I built an XGBoost model, made predictions, and evaluated the model's accuracy; however, I'm running into issues with using the model on a new DataFrame.

New DataFrame code:

new_data = [['Academic', 'A', 'Male', 'Less Interested', 'Urban', 56, 6950000, 83.0, 84.09, False]]

new = pd.DataFrame(data=new_data, columns=['type_school', 'school_accreditation', 'gender',
                                           'interest', 'residence', 'parent_age', 'parent_salary',
                                           'house_area', 'average_grades', 'parent_was_in_college'])

column_trans = make_column_transformer(
    (OneHotEncoder(), ['type_school', 'school_accreditation',
                       'gender', 'interest', 'residence', 'parent_was_in_college']),
    remainder='passthrough')

X_new = column_trans.fit_transform(new)

preds = optimal_params.predict(X_new)

After running the above code, I get the following error:

"ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 
'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18'] ['f0', 'f1', 'f2', 'f3', 
'f4', 'f5', 'f6', 'f7', 'f8', 'f9']
expected f17, f13, f18, f15, f10, f12, f16, f14, f11 in input data"

However, column_trans is exactly the same as the one used on the training DataFrame, so I'm not sure what's going on. Is there something off about my column_trans?

CodePudding user response：

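When predicting, the new data should only be transformed with .transform (not .fit_transform). Calling .fit_transform on the new DataFrame refits the OneHotEncoder on a single row, so each categorical column produces just one encoded column and the output has 10 features instead of the 19 the model was trained on. Here's pseudocode: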

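transformer = ...  # some specification, e.g. column_trans
transformer.fit(old_data)  # learns the parameters (categories) from the training data
transformed_new_data = transformer.transform(new_data)  # reuses them, no refitting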
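A minimal runnable sketch of the same idea, assuming scikit-learn and pandas are installed. The small `train` frame and the `refit` transformer are made up for illustration and use only a subset of the question's columns; `handle_unknown='ignore'` is an optional choice, not something the original code used.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

cat_cols = ['type_school', 'gender']

# Hypothetical training data (invented for this sketch)
train = pd.DataFrame({
    'type_school': ['Academic', 'Vocational', 'Academic'],
    'gender': ['Male', 'Female', 'Female'],
    'parent_age': [56, 49, 51],
})

# A single new row, as in the question
new = pd.DataFrame({
    'type_school': ['Academic'],
    'gender': ['Male'],
    'parent_age': [56],
})

column_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), cat_cols),
    remainder='passthrough',
)

# Fit once, on the training data only
X_train = column_trans.fit_transform(train)

# Reuse the fitted transformer on new data: same columns as training
X_new = column_trans.transform(new)

# For contrast: refitting on the single new row yields fewer columns
refit = make_column_transformer(
    (OneHotEncoder(), cat_cols),
    remainder='passthrough',
)
X_bad = refit.fit_transform(new)

print(X_train.shape[1], X_new.shape[1], X_bad.shape[1])
```

The first two counts match because the encoder remembers every category it saw during training; the third is smaller because the refit encoder only ever saw one category per column, which is the kind of mismatch XGBoost reports.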

CodePudding user response：

As I understand it, you don't save the column_trans that was fit on your training data.

The mechanism here is:

1. Fit the preprocessor on the training dataset.
2. Save your fitted preprocessor (here, column_trans).
3. At inference time (predicting on new data), load the preprocessor and call transform.
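The three steps above could be sketched as follows, assuming joblib (installed alongside scikit-learn) is used for persistence. The training frame, the temporary directory, and the file name `column_trans.joblib` are hypothetical and chosen only for this example.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data (invented for this sketch)
train = pd.DataFrame({
    'residence': ['Urban', 'Rural', 'Urban'],
    'gender': ['Male', 'Female', 'Female'],
    'house_area': [83.0, 70.5, 91.2],
})

# 1. Fit the preprocessor on the training dataset
column_trans = make_column_transformer(
    (OneHotEncoder(), ['residence', 'gender']),
    remainder='passthrough',
)
column_trans.fit(train)

# 2. Save the fitted preprocessor to disk
path = os.path.join(tempfile.mkdtemp(), 'column_trans.joblib')
joblib.dump(column_trans, path)

# 3. At inference time, load it and call transform (never fit again)
loaded = joblib.load(path)
new = pd.DataFrame({
    'residence': ['Urban'],
    'gender': ['Male'],
    'house_area': [83.0],
})
X_new = loaded.transform(new)
print(X_new.shape)
```

The loaded transformer produces the same column layout as at training time, so the result can be passed straight to the trained model's predict method.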

You can find more information about these things on this link