XGBoost XGBRegressor predict with different dimensions than fit


I am using the xgboost XGBRegressor to train on a dataset with 20 dimensions:

    import xgboost as xgb

    model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=20)
    model.fit(trainX, trainy, verbose=False)

trainX is 2000 x 19, and trainy is 2000 x 1.

In other words, during training I am using the 19 dimensions of trainX to predict the 20th dimension (the single dimension of trainy).

When I am making a prediction:

    yhat = model.predict(x_input)

x_input has to have 19 dimensions. I am wondering if there is a way to keep training with the 19 dimensions to predict the 20th, but, during prediction, have x_input carry only 4 of those dimensions while still predicting the 20th. It is kind of like transfer learning to a different input dimension.

Does xgboost support such a feature? I tried filling x_input's other dimensions with None, but that yields terrible prediction results.

CodePudding user response:

If I understand your question correctly, you are trying to train a model with 19 features, but then feed it only 4 of those features to make a prediction.

That's not going to be possible. When you train a model, you are assuming that your data points are drawn from a probability distribution P(X, Y), where Y is your label and X is your features. If you change the dimensionality of X, the points no longer belong to that distribution (at least intuitively; I am not a mathematician, so I cannot offer a proof).

For instance, suppose your data lies on a 3D cube. You then need three coordinate axes to represent a point on it; you cannot place a point using only 2 coordinates without assuming a value for the remaining dimension.

You can assume values for the features you want to drop, but those assumed values may not represent the data you originally trained on.

CodePudding user response:

Fundamentally, you're training your model with a dense dataset (19/19 feature values), and are now wondering if you're allowed to make predictions with a sparse dataset (4/19 feature values).

Does xgboost support such a feature?

Yes, it is technically possible with XGBoost, because XGBoost will treat the absent 15/19 feature values as missing. It would not be possible with some other ML frameworks (such as Scikit-Learn) that do not work with sparse input by default.
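For instance, a minimal sketch of this (with made-up training data, and assuming the 4 known values happen to sit in the first 4 columns):

    import numpy as np
    import xgboost as xgb

    # Made-up stand-ins for the real 2000 x 19 trainX and 2000 x 1 trainy.
    rng = np.random.default_rng(0)
    trainX = rng.normal(size=(2000, 19))
    trainy = rng.normal(size=2000)

    model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=20)
    model.fit(trainX, trainy, verbose=False)

    # At prediction time only 4 of the 19 features are known; the other
    # 15 are marked as missing with NaN, which XGBoost routes down the
    # default branch it learned for each split.
    x_input = np.full((1, 19), np.nan)
    x_input[0, :4] = [0.1, -0.3, 0.7, 1.2]  # hypothetical known values

    yhat = model.predict(x_input)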

Alternatively, you can make your XGBoost model explicitly "missing-value-proof" by assembling a pipeline that contains feature imputation step(s).
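A minimal sketch of that idea, assuming scikit-learn's SimpleImputer (the mean strategy is a placeholder choice, not a recommendation):

    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    import xgboost as xgb

    # Impute missing values (here: column means learned from trainX)
    # before the booster ever sees them.
    pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('regressor', xgb.XGBRegressor(objective='reg:squarederror',
                                       n_estimators=20)),
    ])
    pipeline.fit(trainX, trainy)

    yhat = pipeline.predict(x_input)  # NaNs are imputed, then scored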

I tried filling x_input's other dimensions with None, but that yields terrible prediction results.

You should represent missing values as float("NaN") (not as None).
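For instance (assuming a NumPy input array):

    import numpy as np

    # Wrong: None forces an object-dtype array, which XGBoost cannot
    # interpret as numeric data with missing entries.
    bad = np.array([[0.1, -0.3, 0.7, 1.2] + [None] * 15])     # dtype=object

    # Right: NaN keeps the array float-typed, and XGBoost treats each
    # NaN as a missing value.
    good = np.array([[0.1, -0.3, 0.7, 1.2] + [np.nan] * 15])  # dtype=float64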
