I built an XGBoost model, made predictions, and evaluated the model's accuracy; however, I'm running into issues with using the model on a new DataFrame.
New DataFrame code:
new_data = [['Academic', 'A', 'Male', 'Less Interested', 'Urban',
             56, 6950000, 83.0, 84.09, False]]
new = pd.DataFrame(data=new_data,
                   columns=['type_school', 'school_accreditation', 'gender',
                            'interest', 'residence', 'parent_age',
                            'parent_salary', 'house_area', 'average_grades',
                            'parent_was_in_college'])

column_trans = make_column_transformer(
    (OneHotEncoder(), ['type_school', 'school_accreditation', 'gender',
                       'interest', 'residence', 'parent_was_in_college']),
    remainder='passthrough')

X_new = column_trans.fit_transform(new)
preds = optimal_params.predict(X_new)
After running the above code, I get the following error:
"ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8',
'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18'] ['f0', 'f1', 'f2', 'f3',
'f4', 'f5', 'f6', 'f7', 'f8', 'f9']
expected f17, f13, f18, f15, f10, f12, f16, f14, f11 in input data"
However, column_trans is exactly the same transformer I used on the training DataFrame, so I'm not sure what's going on. Is there something off about my column_trans?
CodePudding user response:
At prediction time, new data should only be transformed with .transform (not .fit_transform). Calling .fit_transform on the single new row refits the encoder on that row's categories alone, producing far fewer one-hot columns than the model was trained on, which is what the feature_names mismatch error is telling you. Here's pseudocode:
preprocessor = ...                    # e.g. your column_trans
preprocessor.fit(training_data)       # learns the categories from the training data
X_new = preprocessor.transform(new_data)  # reuses those learned categories
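A minimal runnable sketch of this fit-once / transform-many pattern. The column names and values here are made up for illustration; in your case the transformer is column_trans and the model is optimal_params:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'gender': ['Male', 'Female', 'Male'],
                      'grade': [80.0, 90.0, 85.0]})
new = pd.DataFrame({'gender': ['Female'],
                    'grade': [88.0]})

column_trans = make_column_transformer(
    (OneHotEncoder(), ['gender']),
    remainder='passthrough')

X_train = column_trans.fit_transform(train)  # fit ONCE, on the training data
X_new = column_trans.transform(new)          # reuse the fitted encoder on new data

# Both matrices now have the same number of columns, so a model trained
# on X_train can score X_new without a feature mismatch.
print(X_train.shape[1] == X_new.shape[1])
```

If you had called fit_transform on new instead, the encoder would only see the single category 'Female' and emit one dummy column instead of two, recreating the error from the question.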
CodePudding user response:
As I understand it, you don't save your column_trans, which was fit on your training data.
The mechanism here is:
- Fit the preprocessor on the training dataset
- Save your preprocessor (here, column_trans)
- When you make inference (predict on new data), load your preprocessor and call transform
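The steps above can be sketched with joblib (which ships alongside scikit-learn); the file name and column names here are illustrative:

```python
import joblib
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# --- at training time ---
train = pd.DataFrame({'residence': ['Urban', 'Rural', 'Urban'],
                      'parent_age': [50, 45, 56]})

column_trans = make_column_transformer(
    (OneHotEncoder(), ['residence']),
    remainder='passthrough')
column_trans.fit(train)                            # fit on the training data

joblib.dump(column_trans, 'column_trans.joblib')   # save the fitted preprocessor

# --- later, in the inference script ---
loaded = joblib.load('column_trans.joblib')        # load the fitted preprocessor
new = pd.DataFrame({'residence': ['Rural'], 'parent_age': [60]})
X_new = loaded.transform(new)                      # transform only, no refitting
print(X_new.shape)
```

Saving the fitted transformer (or, better, a single Pipeline that bundles the preprocessor and the XGBoost model) guarantees that inference uses exactly the feature layout the model was trained on.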
You can find more information about these things on this link