I am trying to run this script: enter link description here
(The only difference is that instead of this TEST_SENTENCES I need to read my dataset's text column, and I need to apply stop-word removal to that column before passing it to the rest of the code.)
df = pd.DataFrame({
    'text': ['the "superstar breakfast" is shrink wrapped muffins that can be bought at a convenience store.',
             'The wireless internet was unreliable. ',
             'i am still her . :). ',
             'I appreciate your help ',
             'I appreciate your help '],
    'sentiment': ['positive', 'negative', 'neutral', 'positive', 'neutral']})
The error is not raised when I build the data frame this way, but it is raised when I read the exact same data from a CSV file.
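To illustrate one way the CSV round trip differs from the literal data frame: a blank field comes back from `read_csv` as NaN (a float, not a string). This is a minimal in-memory simulation (`io.StringIO` stands in for my real file), not my actual dataset:

```python
import io
import pandas as pd

# Simulate a CSV file with one blank text field.
csv_data = io.StringIO('text,sentiment\n,positive\nhello world,negative\n')
df = pd.read_csv(csv_data)

# The blank field is NaN (float), so the column is no longer all strings.
# Filling and casting restores a uniformly string-typed column.
df['text'] = df['text'].fillna('').astype(str)
```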
But when I add this line of code to remove stop words:
df['text_without_stopwords'] = df['text'].apply(
    lambda x: ' '.join(word.encode('latin1', 'ignore').decode('latin1')
                       for word in x.split() if word not in stop))
TEST_SENTENCES = df['text_without_stopwords']
it keeps raising this error:
ValueError: All sentences should be Unicode-encoded!
The error is raised in the tokenization step:
tokenized, _, _ = st.tokenize_sentences(TEST_SENTENCES)
I want to know what is causing this error, and the correct way to fix the code.
(I have tried different encodings like utf-8, etc., but none worked.)
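For reference, the stop-word removal step in isolation works as expected on the in-memory frame. This sketch drops the latin1 encode/decode round trip (a no-op for already-decoded text) and uses a small hand-made `stop` set as a stand-in for the real stop-word list (presumably NLTK's):

```python
import pandas as pd

# Stand-in stop-word set; the real script would use something like
# nltk.corpus.stopwords.words('english').
stop = {'the', 'is', 'was', 'a', 'i'}

df = pd.DataFrame({'text': ['the wireless internet was unreliable',
                            'i appreciate your help']})

# Keep only the words not in the stop set, rejoined with spaces.
df['text_without_stopwords'] = df['text'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in stop))
```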
CodePudding user response:
I don't know the reason yet, but when I did
df['text_without_stopwords'] = df['text_without_stopwords'].astype('unicode')
it worked.
I'm still very curious to know why this happens only when I do stop-word removal.
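A minimal Python 3 sketch of the same fix, with `astype(str)` playing the role of `astype('unicode')`. The assumption is that the tokenizer checks that every element of the sequence is a real string, and that some element in the column is not (e.g. a NaN float left over from the CSV round trip):

```python
import numpy as np
import pandas as pd

# A column with a non-string element mixed in, as can happen after
# reading a CSV with blank fields (NaN is a float, not a str).
col = pd.Series(['wireless internet unreliable', np.nan, ''])

# Before the cast, not every element is a string, so a strict
# "all sentences must be Unicode" check would fail.
all_str_before = all(isinstance(s, str) for s in col)

# Casting the column (the Python 3 counterpart of astype('unicode'))
# forces every element to str, which satisfies such a check.
col = col.astype(str)
all_str_after = all(isinstance(s, str) for s in col)
```

Note that `astype(str)` turns NaN into the literal string `'nan'`, so filtering or `fillna('')` beforehand may be preferable.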