I am loading a saved linear SVM model and using it to predict on new data. During training I used TF-IDF, like this:
vector = TfidfVectorizer(ngram_range=(1, 3)).fit(data['text'])
**When I apply the model to new data, I get an error at prediction time.**
/usr/local/lib/python3.8/dist-packages/sklearn/base.py in _check_n_features(self, X, reset)
    398
    399         if n_features != self.n_features_in_:
--> 400             raise ValueError(
    401                 f"X has {n_features} features, but {self.__class__.__name__} "
    402                 f"is expecting {self.n_features_in_} features as input."
ValueError: X has 2 features, but SVC is expecting 472082 features as input.
Prediction of new data
Linear_SVC_classifier = joblib.load("/content/drive/MyDrive/dataset/Classifers/Linear_SVC_classifier.sav")
test_data = input("Enter Data for Testing: ")
newly_testing_data = vector.transform(test_data)
SVM_Prediction_NewData = Linear_SVC_classifier.predict(newly_testing_data)
I want to predict on new data with the stored SVM model, without applying TF-IDF to the training data again each time I give the model data for prediction. When I use new data, the prediction line raises the error above. Is there any way to fix this?
Answer:
The problem comes from creating a new TfidfVectorizer by fitting it on the test dataset. Because the classifier was trained on a matrix generated by the TfidfVectorizer fitted on the training dataset, it expects the test data to have exactly the same number of features (columns).
To get matching dimensions, transform your test dataset with the same vectorizer that was used during training, rather than initializing a new one on the test set.
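A minimal sketch of why the feature counts differ, assuming a made-up toy corpus in place of data['text'] (the names train_texts, new_text and refit are illustrative, not from the question). It also wraps the single input string in a list, since transform expects an iterable of documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for data['text'].
train_texts = ["the cat sat on the mat", "dogs chase cats", "birds sing at dawn"]
new_text = "the cat sings"

# Fitted once on the training text, as in the question.
vector = TfidfVectorizer(ngram_range=(1, 3)).fit(train_texts)

# Wrong: a fresh vectorizer fitted on the new text learns a different vocabulary.
refit = TfidfVectorizer(ngram_range=(1, 3)).fit([new_text])

# transform() expects an iterable of documents, so wrap the string in a list.
X_same = vector.transform([new_text])
X_refit = refit.transform([new_text])

# The training-time vectorizer always yields the training vocabulary width;
# the refitted one yields a width the classifier has never seen.
print("training vocabulary size:", len(vector.vocabulary_))
print("columns from the training vectorizer:", X_same.shape[1])
print("columns from the refitted vectorizer:", X_refit.shape[1])
```

Only the matrix produced by the training-time vectorizer has the column count the classifier was fitted with.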
The vectorizer fitted on the train set can be pickled and stored for later use to avoid any re-fitting at inference time.
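One way this could look, as a sketch under assumptions: the toy texts and labels, the temporary directory, and the file name tfidf_vectorizer.sav are placeholders, and LinearSVC stands in for whatever SVM class was actually trained. joblib is used because the question already loads the classifier with it.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for data['text'] and its labels.
texts = ["good movie", "great film", "bad movie", "terrible film"]
labels = [1, 1, 0, 0]

# Training time: fit the vectorizer once, train, then save BOTH objects.
vector = TfidfVectorizer(ngram_range=(1, 3)).fit(texts)
clf = LinearSVC().fit(vector.transform(texts), labels)

out_dir = tempfile.mkdtemp()  # placeholder location for this sketch
vec_path = os.path.join(out_dir, "tfidf_vectorizer.sav")
clf_path = os.path.join(out_dir, "Linear_SVC_classifier.sav")
joblib.dump(vector, vec_path)
joblib.dump(clf, clf_path)

# Inference time: load both objects and only call transform(), never fit().
loaded_vector = joblib.load(vec_path)
loaded_clf = joblib.load(clf_path)

test_data = "great movie"  # stands in for input("Enter Data for Testing: ")
features = loaded_vector.transform([test_data])  # a list holding one document
prediction = loaded_clf.predict(features)
print(prediction)
```

Because the loaded vectorizer carries the training vocabulary, the transformed input has the same number of columns the classifier expects, and no training data is needed at prediction time.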