Using ColumnTransformer for predicting values

I am currently running a logistic regression model using Keras.

I have 1 numeric variable and around 6 categorical variables.

I am currently using a column transformer for training and testing the model, and it works perfectly (code shown below):

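from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_variables = ["var1"]
cat_variables = ["var2", "var3", "var4", "var5", "var6", "var7"]

# Scale the numeric column and one-hot encode the categorical columns
pipeline = ColumnTransformer([
    ("num", StandardScaler(), numeric_variables),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_variables),
], remainder="passthrough")

pipeline.fit(X_Train)
pipeline.fit_transform(X_Train)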

This works as expected on both the train and test datasets.

However, when I deploy the model to get the probability of a customer renewing, I am sending the data as a dataframe with one row.

While fit_transform on X_Train and X_Test returns an n x 17 array (because of the one-hot encoding of the 7 factors), the transform of the prediction data only returns n x 7.

My theory is that the pipeline is dropping one-hot encoded fields. For instance, if var2 can take 3 values (say "M", "F" and "O"), X_Train produces 3 columns for it (isM, isF and isO), while the transform for the prediction only produces "isM" if the value of var2 is "M".

How do I address this issue?

I get this error when I run the model.predict on the single customer example:

Input 0 of layer "sequential" is incompatible with the layer: expected shape=(None, 19), found shape=(None, 7)

CodePudding user response：

After the discussion in the comments:

It appears that you are using pipeline.fit_transform(X_test). This means you are fitting your pipeline with X_test before transforming it. This is a problem in your case for two reasons:

1. You are re-fitting the StandardScaler, which means you will scale your features differently from how you scaled them for the train set.
2. You are re-fitting the OneHotEncoder. Hence, you could miss some categories in cat_variables that were present only in the train set. Consequently, your output shape is smaller.
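
To make the second point concrete, here is a minimal sketch on toy data. The values and the build_transformer helper are made up for illustration, and only one categorical column is used. Re-fitting on a single row leaves the encoder knowing a single category, so the output is narrower than what the model was trained on.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data (hypothetical values): one numeric, one categorical column.
X_train = pd.DataFrame({
    "var1": [1.0, 2.0, 3.0, 4.0],
    "var2": ["M", "F", "O", "M"],
})
# A single customer, as sent at prediction time.
X_new = pd.DataFrame({"var1": [2.5], "var2": ["M"]})


def build_transformer():
    """Hypothetical helper: returns a fresh, unfitted ColumnTransformer."""
    return ColumnTransformer(
        [
            ("num", StandardScaler(), ["var1"]),
            ("cat", OneHotEncoder(handle_unknown="ignore"), ["var2"]),
        ],
        remainder="passthrough",
    )


# Fitted on the train set: 1 scaled column + 3 one-hot columns (F, M, O).
train_shape = build_transformer().fit_transform(X_train).shape
# Re-fitted on the single row: the encoder only ever sees "M".
refit_shape = build_transformer().fit_transform(X_new).shape

print(train_shape)  # (4, 4)
print(refit_shape)  # (1, 2)
```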

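Simply fit the pipeline once on X_train, then use pipeline.transform(X_test) instead of fit_transform. The same applies to the one-row dataframe you send at prediction time: call pipeline.transform on it with the already-fitted pipeline.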
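
A minimal sketch of that pattern, again on made-up toy data with a single categorical column (the values and the X_new name are assumptions for illustration): the pipeline is fitted only on the train set, and the single-row dataframe is passed through transform, so it comes out with the same number of columns as the training features.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical values) standing in for the real train set.
X_train = pd.DataFrame({
    "var1": [1.0, 2.0, 3.0, 4.0],
    "var2": ["M", "F", "O", "M"],
})
# A single customer, as sent at prediction time.
X_new = pd.DataFrame({"var1": [2.5], "var2": ["M"]})

pipeline = ColumnTransformer(
    [
        ("num", StandardScaler(), ["var1"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["var2"]),
    ],
    remainder="passthrough",
)

# Fit exactly once, on the training data.
train_features = pipeline.fit_transform(X_train)

# Reuse the fitted scaler and encoder; do NOT call fit or fit_transform here.
new_features = pipeline.transform(X_new)

print(train_features.shape)  # (4, 4)
print(new_features.shape)    # (1, 4)
```

In a deployed service the transformer has to be the same fitted object that was used during training. One common way to do that, assuming the service runs in a separate process, is to save the fitted pipeline (for example with joblib) alongside the Keras model and load both at startup.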