xgboost model prediction error: Input numpy.ndarray must be 2 dimensional

I have a model that is trained locally and deployed to an engine so that I can run inference by invoking an endpoint. When I try to make predictions, I get the following exception:

raise ValueError('Input numpy.ndarray must be 2 dimensional')
ValueError: Input numpy.ndarray must be 2 dimensional

My model is an XGBoost model with some pre-processing (variable encoding) and hyper-parameter tuning. Code to train the model:

import pandas as pd
import pickle
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder 


# split df into train and test
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,0:21], df.iloc[:,-1], test_size=0.1)

X_train.shape
(1000,21)

# Encode categorical variables  
cat_vars = ['cat1','cat2','cat3']
cat_transform = ColumnTransformer([('cat', OneHotEncoder(handle_unknown='ignore'), cat_vars)], remainder='passthrough')

encoder = cat_transform.fit(X_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)

X_train.shape
(1000,420)

# Define a xgboost regression model
model = XGBRegressor()

# Do hyper-parameter tuning
.....

# Fit model
model.fit(X_train, y_train)

Here's what the model object looks like:

XGBRegressor(colsample_bytree=xxx, gamma=xxx,
             learning_rate=xxx, max_depth=x, n_estimators=xxx,
             subsample=xxx)

My test data arrives as a string of float values, which is converted into a NumPy array because the data must be passed as a NumPy array.

testdata = [........., 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 2000, 200, 85, 412412, 123, 41, 552, 50000, 512, 0.1, 10.0, 2.0, 0.05]

I have tried to reshape the NumPy array from 1-D to 2-D; however, that doesn't work because the number of features in the test data and in the trained model do not match.

My question is: how do I pass a NumPy array whose length matches the number of features in the trained model? Any workaround ideas? Locally, I am able to make predictions by passing the test data as a list.

More info on inference script here: https://github.com/aws-samples/amazon-sagemaker-local-mode/blob/main/xgboost_script_mode_local_training_and_serving/code/inference.py

Traceback (most recent call last):
File "/miniconda3/lib/python3.6/site-packages/sagemaker_containers/_functions.py", line 93, in wrapper
return fn(*args, **kwargs)
File "/opt/ml/code/inference.py", line 75, in predict_fn
prediction = model.predict(input_data)
File "/miniconda3/lib/python3.6/site-packages/xgboost/sklearn.py", line 448, in predict
test_dmatrix = DMatrix(data, missing=self.missing, nthread=self.n_jobs)
File "/miniconda3/lib/python3.6/site-packages/xgboost/core.py", line 404, in __init__
self._init_from_npy2d(data, missing, nthread)
File "/miniconda3/lib/python3.6/site-packages/xgboost/core.py", line 474, in _init_from_npy2d
raise ValueError('Input numpy.ndarray must be 2 dimensional')
ValueError: Input numpy.ndarray must be 2 dimensional

When I attempt to reshape the test data to a 2-D NumPy array using testdata.reshape(-1,1), I run into a feature_names mismatch exception.

File "/opt/ml/code/inference.py", line 75, in predict_fn
prediction = model.predict(input_data)
File "/miniconda3/lib/python3.6/site-packages/xgboost/sklearn.py", line 456, in predict
validate_features=validate_features)
File "/miniconda3/lib/python3.6/site-packages/xgboost/core.py", line 1284, in predict
self._validate_features(data)
File "/miniconda3/lib/python3.6/site-packages/xgboost/core.py", line 1690, in _validate_features
data.feature_names))
ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15',
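
A side note on the reshape attempt, as a minimal NumPy sketch. It assumes the encoded observation is a flat vector of 420 values (the count comes from X_train.shape above; the zero-filled array is only a hypothetical stand-in for the real data). reshape(-1, 1) produces 420 rows with one feature each, which would be consistent with the feature_names mismatch, whereas reshape(1, -1) produces a single row with 420 features.

```python
import numpy as np

# Hypothetical stand-in for one encoded observation with 420 features.
n_features = 420
testdata = np.zeros(n_features)

as_column = testdata.reshape(-1, 1)  # 420 rows, 1 feature each
as_row = testdata.reshape(1, -1)     # 1 row, 420 features

print(testdata.shape)   # (420,)
print(as_column.shape)  # (420, 1)
print(as_row.shape)     # (1, 420)
```

If the model was fitted on 420 columns, the (1, 420) layout is the one whose column count lines up with it.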

Update: I can retrieve the feature names for the model by running model.get_booster().feature_names. Is there a way I can use these names and assign them to the test data point so that they are consistent?

['f0', 'f1', 'f2', 'f3', 'f4', 'f5',......'f417','f418','f419']

User response:

My process looks like this:

Train encoders on labeled train data.
Transform labeled train/test/validation data, passing in an environment variable LABELED=True.
With the same code, for prediction, do not pass in the environment variable.

My predict_fn looks like this:

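import os

import numpy as np
import pandas as pd


def predict_fn(input_object, model):
    # LABELED is set only when transforming labeled train/test/validation data.
    labeled = os.environ.get("LABELED", False)

    if not labeled:
        # Add a fake label column so the input matches the labeled layout.
        input_object = np.insert(input_object, 0, np.nan, axis=1)

    result = model.transform(pd.DataFrame(input_object)).to_numpy()

    if not labeled:
        # Remove the fake label column again.
        result = np.delete(result, 0, axis=1)

    return result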
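
The fake-label step can be illustrated in isolation. This is a minimal sketch with hypothetical helper names (add_placeholder_label, drop_placeholder_label) and toy data, assuming the label sits in column 0 both before and after the transform. It is not the actual pipeline from this answer.

```python
import numpy as np


def add_placeholder_label(features):
    """Prepend a NaN column so unlabeled rows match the labeled layout."""
    features = np.asarray(features, dtype=float)
    return np.insert(features, 0, np.nan, axis=1)


def drop_placeholder_label(transformed):
    """Remove the first column again after the transform."""
    return np.delete(transformed, 0, axis=1)


rows = np.array([[1.0, 2.0], [3.0, 4.0]])  # two unlabeled rows, two features
padded = add_placeholder_label(rows)        # shape (2, 3), first column is NaN
restored = drop_placeholder_label(padded)   # shape (2, 2) again
```

The input is cast to float first because NaN cannot be stored in an integer array.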

User response:

I think the solution is to provide the test data as the same data type as the training data. Currently you provide the training data as a pandas DataFrame and the test data as a NumPy array. My understanding is that you use the sklearn API of XGBoost.

So you could use numpy arrays throughout by using:

model.fit(X_train.values, y_train.values)

Or use pandas DataFrames throughout:

# Wrap testdata in a list so it becomes one row with one column per feature.
X_new = pd.DataFrame([testdata])
X_new.columns = model.get_booster().feature_names
model.predict(X_new)

I think both should work.