How to use a multiclass classification model to make predictions on an entire dataframe

I have trained multiclass classification models on my training and test sets and have achieved good results with SVC. Now I want to use the model to make predictions on my entire dataframe, but when I do, I get the following error: ValueError: X has 36976 features, but SVC is expecting 8989 features as input.

My dataframe has two columns: one with the categories (which I manually labeled for around 1/5 of the dataframe) and a text column with all the texts (including those that have not been labeled).

import numpy as np
import pandas as pd
data = {'categories': ['1', np.nan, '3', np.nan], 'documents': ['Paragraph 1.\nParagraph 2.\nParagraph 3.', 'Paragraph 1.\nParagraph 2.', 'Paragraph 1.\nParagraph 2.\nParagraph 3.\nParagraph 4.', 'Paragraph 1.\nParagraph 2.']}
df = pd.DataFrame(data)

First, I drop the rows with NaN values in the 'categories' column. I then create the document-term matrix, define 'y', and split into training and test sets.

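from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

tf = CountVectorizer(tokenizer=word_tokenize)
X = tf.fit_transform(df['documents'])

y = df['categories']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)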

Second, I run the SVC model getting good results:

from sklearn import metrics
from sklearn.svm import SVC
svm = SVC(C=0.1, class_weight='balanced', kernel='linear', probability=True)
model = svm.fit(X_train, y_train)
print('accuracy:', model.score(X_test, y_test))

y_pred = model.predict(X_test)

print(metrics.classification_report(y_test, y_pred))

Finally, I try to apply the SVC model to predict the categories of the entire 'documents' column of my dataframe. To do so, I create the document-term matrix of the entire 'documents' column and then apply the model:

tf_entire_df = CountVectorizer(tokenizer=word_tokenize)
X_entire_df = tf_entire_df.fit_transform(df['documents'])

y_pred_entire_df = model.predict(X_entire_df)

But then I get the error that my X_entire_df has more features than the SVC model is expecting as input. I imagine that this is because I am now trying to apply the model to the whole 'documents' column, but I do not know how to fix this.

I would appreciate your help!

CodePudding user response：

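This kind of error usually comes from feeding the model data whose features differ from those it was trained on (more or fewer columns than during training). Here, the second CountVectorizer is fit on the whole 'documents' column, so it learns a larger vocabulary (36976 terms) than the one the SVC was trained on (8989 terms).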
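
A minimal sketch of the mismatch and one way around it: fit the vectorizer once on the training documents, then call only transform() on anything you want to predict. The toy documents and labels below are made up for illustration, and the default CountVectorizer tokenizer is used instead of word_tokenize so the snippet runs without NLTK data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Hypothetical toy data standing in for the labeled and unlabeled documents.
labeled_docs = ["cats purr softly", "dogs bark loudly",
                "cats nap often", "dogs fetch sticks"]
labels = ["1", "3", "1", "3"]
all_docs = labeled_docs + ["birds sing at dawn", "cats chase dogs"]

# Fit the vectorizer once, on the documents used for training.
tf = CountVectorizer()
X_train = tf.fit_transform(labeled_docs)
model = SVC(C=0.1, class_weight="balanced", kernel="linear").fit(X_train, labels)

# A freshly fitted vectorizer learns a different vocabulary, so its
# column count no longer matches what the model was trained on.
X_refit = CountVectorizer().fit_transform(all_docs)

# Reusing the SAME fitted vectorizer keeps the columns aligned:
# transform() only, never fit_transform(), at prediction time.
X_all = tf.transform(all_docs)
y_pred_all = model.predict(X_all)

print("train features:", X_train.shape[1])
print("refit features:", X_refit.shape[1])
print("reused features:", X_all.shape[1])
```

Words that were never seen during fitting are simply ignored by transform(), which is why the feature count stays fixed.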

I would strongly suggest using sklearn.pipeline to build a pipeline that bundles the preprocessing (CountVectorizer) and your machine learning model (SVC) into a single object.

From experience, this helps a lot in avoiding tedious and complex problems with fitting preprocessing steps.
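
Such a pipeline could look roughly like the sketch below. The dataframe contents are invented for illustration, the default tokenizer replaces word_tokenize so the example needs no NLTK download, and the 'predicted' column name is an assumption, not something from the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical dataframe: some rows labeled, some not (None = unlabeled).
df = pd.DataFrame({
    "categories": ["1", None, "3", "1", "3", None, "1", "3"],
    "documents": [
        "cats purr softly", "cats sleep all day", "dogs bark loudly",
        "cats nap often", "dogs fetch sticks", "dogs chase balls",
        "cats climb trees", "dogs dig holes",
    ],
})

# Train and evaluate on the labeled rows only, passing raw text.
labeled = df.dropna(subset=["categories"])
X_train, X_test, y_train, y_test = train_test_split(
    labeled["documents"], labeled["categories"],
    test_size=0.33, random_state=42,
)

# The pipeline fits the vectorizer and the classifier together,
# so the vocabulary learned in fit() is reused in every predict().
pipe = make_pipeline(
    CountVectorizer(),
    SVC(C=0.1, class_weight="balanced", kernel="linear"),
)
pipe.fit(X_train, y_train)
print("accuracy:", pipe.score(X_test, y_test))

# Predict for the entire column, labeled and unlabeled rows alike.
df["predicted"] = pipe.predict(df["documents"])
print(df)
```

Because the pipeline receives raw text, there is no separate document-term matrix to keep in sync with the model.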