I am working on a regression problem where I have to predict Sales for an e-mart. The training data has 11 columns: 'Store_id', 'Store_Type', 'Location_Type', 'Region_Code', 'Holiday', 'Discount', '#Order', 'Sales', 'date', 'month', 'year'
In the test data, the #Order column is missing.
The #Order column is the number of orders a particular store had in a day.
If I don't drop the #Order column from the training data, I will get a dimension mismatch error when predicting on the test data.
Should I just drop the #Order column, or is there another way?
CodePudding user response:
A mismatch error is expected here, since your model was trained with the #Order column.
You can either try to recover the #Order column for the test data, in case it was lost at some step in the process (the client not providing the full data, data cleaning, etc.), or drop it from the training data.
In the end, the issue is that if the underlying mechanism of your regression problem depends on the #Order column, your model will be much less accurate without it. On the other hand, if you know that what you are trying to predict is completely independent of #Order, you can simply drop the column.
CodePudding user response:
As far as I understand the question, you need to predict the sales for a store on a given day. We know that #Order and Sales must be correlated with each other. Since we have to predict Sales, we will surely not have #Order for that day. In my opinion, you will have to rely only on the other columns for the prediction and drop the #Order column while training.
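A minimal sketch of dropping the column so that train and test features line up, assuming the data is loaded into pandas DataFrames. The toy values below are made up and only a few of the question's columns are included; the names `train`, `test`, and `feature_cols` are illustrative.

```python
import pandas as pd

# Toy stand-ins for the real train/test data (values are invented).
train = pd.DataFrame({
    "Store_id": [1, 2, 3],
    "Holiday": [0, 1, 0],
    "Discount": [1, 0, 1],
    "#Order": [9, 60, 42],
    "Sales": [7011.84, 51789.12, 36868.20],
})
test = pd.DataFrame({
    "Store_id": [4, 5],
    "Holiday": [0, 0],
    "Discount": [0, 1],
})

target = "Sales"

# Keep only features that will also exist at prediction time.
# This excludes '#Order' because it is absent from the test data.
feature_cols = [c for c in train.columns if c != target and c in test.columns]

X_train = train[feature_cols]
y_train = train[target]
X_test = test[feature_cols]  # same columns, in the same order, as X_train
```

Selecting the test columns through the same `feature_cols` list, rather than dropping by name in two places, keeps the column order identical, which is what avoids the dimension mismatch.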
Even if you can estimate the #Order column from the other columns and use it in the test dataset, you are indirectly just determining #Order from the other columns, which means something like column_#order = f(other_columns) and sales = g(column_#order, all_other_columns), so sales is ultimately a function of the other columns. So you can just drop that column.
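The composition argument can be checked numerically in the simplest case. The sketch below assumes both f and g are ordinary least-squares linear models and uses synthetic data (all coefficients and feature meanings are invented for illustration). Under that assumption, predicting #Order first and feeding it into the sales model should give the same test predictions as regressing sales directly on the other columns, up to floating-point error. For nonlinear models the two routes need not match exactly, but the estimated #Order still carries no information beyond the other columns.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test = 200, 50

def make_features(n):
    # Intercept plus two synthetic binary features (think Holiday, Discount).
    return np.column_stack([
        np.ones(n),
        rng.integers(0, 2, n),
        rng.integers(0, 2, n),
    ])

X_train = make_features(n_train)
X_test = make_features(n_test)

# Synthetic orders and sales; the coefficients are arbitrary.
orders = X_train @ np.array([50.0, -10.0, 15.0]) + rng.normal(0, 5, n_train)
sales = (700.0 * orders
         + X_train @ np.array([1000.0, 200.0, -300.0])
         + rng.normal(0, 500, n_train))

# Two-stage route: orders = f(other_columns), sales = g(orders, other_columns).
a, *_ = np.linalg.lstsq(X_train, orders, rcond=None)
b, *_ = np.linalg.lstsq(np.column_stack([orders, X_train]), sales, rcond=None)
orders_hat = X_test @ a  # estimated #Order for the test rows
pred_two_stage = np.column_stack([orders_hat, X_test]) @ b

# Direct route: sales = h(other_columns), with #Order dropped.
c, *_ = np.linalg.lstsq(X_train, sales, rcond=None)
pred_direct = X_test @ c

print(np.allclose(pred_two_stage, pred_direct))
```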