Machine Learning Model Only Predicting Mode in Data Set

I am trying to do sentiment analysis for text. I have 909 phrases commonly used in emails, and I scored them out of ten for how angry they are, when isolated.

Now, I upload this .csv file to a Jupyter Notebook, where I import the following modules:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

Now, I name the two columns 'Phrase' and 'Anger':

df=pd.read_csv('Book14.csv', names=['Phrase', 'Anger'])
df_x = df['Phrase']
df_y = df['Anger']

Subsequently, I split this data such that 20% is used for testing and 80% is used for training:

x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2, random_state=4)

Now, I convert the words in x_train to numerical data using TfidfVectorizer:

tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english')
x_traincv = tfidfvectorizer.fit_transform(x_train.astype('U'))

Now, I convert x_traincv to an array:

a = x_traincv.toarray()

I also convert x_test to a numerical array:

x_testcv=tfidfvectorizer.fit_transform(x_test)
x_testcv = x_testcv.toarray()

Now, I have

mnb = MultinomialNB()
b=np.array(y_test)
error_score = 0
for i in range(len(x_test)):
    mnb.fit(x_testcv,y_test)
    testmessage=x_test.iloc[i]
    predictions = mnb.predict(x_testcv[i].reshape(1,-1))
    error_score = error_score + (predictions-int(b[i]))**2
    print(testmessage)
    print(predictions)
print(error_score/len(x_test))

However, an example of the results I get is:

Bring it back [0]
It is greatly appreciatd when [0]
Apologies in advance [0]
Can you please [0]
See you then [0]
I hope this email finds you well. [0]
Thanks in advance [0]
I am sorry to inform [0]
You’re absolutely right [0]
I am deeply regretful [0]
Shoot me through [0]
I’m looking forward to [0]
As I already stated [0]
Hello [0]
We expect all students [0]
If it’s not too late [0]

and this repeats on a large scale, even for phrases that are obviously very angry. When I removed all data containing a '0' from the .csv file, the now modal value (a 10) is the only prediction for my sentences.

Why is this happening? Is it some weird way to minimise error? Are there any inherent flaws in my code? Should I take a different approach?

Many thanks.

CodePudding user response：

Two things. First, you are fitting the MultinomialNB on the test set: in your loop you have mnb.fit(x_testcv,y_test), but you should call mnb.fit(x_traincv,y_train). The fit also only needs to happen once, before the loop, not on every iteration.
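As a rough sketch of that first point, the classifier can be fitted a single time on the training features and then asked to predict the whole test set in one call. The phrases and scores below are made up for illustration (they are not taken from Book14.csv), and the variable names train_phrases, test_phrases, train_scores and test_scores are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the real phrases and anger scores.
train_phrases = ["thanks in advance", "hope this finds you well",
                 "as already stated before", "this is completely unacceptable"]
train_scores = [0, 0, 7, 10]
test_phrases = ["thanks, hope you are well", "as stated, this is unacceptable"]
test_scores = np.array([0, 10])

# Turn the phrases into TF-IDF features (vocabulary learned from training data).
vectorizer = TfidfVectorizer(analyzer="word")
x_traincv = vectorizer.fit_transform(train_phrases)
x_testcv = vectorizer.transform(test_phrases)

# Fit once, on the training data only, before any prediction happens.
mnb = MultinomialNB()
mnb.fit(x_traincv, train_scores)

# Predict every test phrase in one call and compute the mean squared error.
predictions = mnb.predict(x_testcv)
mse = np.mean((predictions - test_scores) ** 2)
print(predictions, mse)
```

Because the model never sees y_test during fitting, the error it reports reflects how well it generalises and not how well it memorised the test labels.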

Second, when pre-processing, you should call fit_transform only on the training data; on the test data you should call only the transform method, so that the test phrases are encoded with the vocabulary and IDF weights learned from the training set.
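To illustrate the second point, here is a small sketch with invented phrases (not the asker's data) that compares transform against a second fit_transform on the test phrases. The names refit and x_test_refit are hypothetical and exist only for the comparison.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy phrases for illustration only.
train_phrases = ["thanks in advance", "as already stated", "apologies in advance"]
test_phrases = ["thanks again", "already stated twice"]

# fit_transform on the training phrases learns the vocabulary and IDF weights.
vectorizer = TfidfVectorizer(analyzer="word")
x_traincv = vectorizer.fit_transform(train_phrases)
train_vocab = dict(vectorizer.vocabulary_)

# transform reuses that vocabulary, so test columns line up with training columns.
x_testcv = vectorizer.transform(test_phrases)

# For contrast: fitting a vectorizer on the test phrases learns a different
# vocabulary, so its columns no longer mean the same thing as the training ones.
refit = TfidfVectorizer(analyzer="word")
x_test_refit = refit.fit_transform(test_phrases)

print(x_traincv.shape, x_testcv.shape, x_test_refit.shape)
print(sorted(train_vocab), sorted(refit.vocabulary_))
```

Words that appear only in the test phrases are simply ignored by transform, which is the intended behaviour: the model can only use features it saw during training.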