Unicode error in the tokenization step only when doing stop words removal in python 2


I am trying to run this script: enter link description here. The only difference is that instead of the hard-coded TEST_SENTENCES I read my own dataset (its text column), and I need to apply stop-word removal to that column before passing it to the rest of the code.

import pandas as pd

df = pd.DataFrame({'text': ['the "superstar breakfast" is shrink wrapped muffins that can be bought at a convenience store.',
                            'The wireless internet was unreliable. ', 'i am still her . :). ',
                            'I appreciate your help ', 'I appreciate your help '],
                   'sentiment': ['positive', 'negative', 'neutral', 'positive', 'neutral']})

The error is not raised when I build the data frame inline like this; it is raised only when I read the exact same data from a CSV file (along the lines of the sketch below).
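A minimal sketch of the CSV path, assuming a hypothetical file name (data.csv) holding the same two columns. Note that in Python 2 pd.read_csv returns text as byte strings (str), whereas the inline literals above are unicode if the script keeps the linked example's from __future__ import unicode_literals:

import pandas as pd

# Hypothetical file name; the file holds the same 'text' and 'sentiment' columns.
# In Python 2 the 'text' column comes back as byte strings (str), not unicode.
df = pd.read_csv('data.csv')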

But when I add this line to remove stop words:

# `stop` is assumed to be a set of stop words (e.g. NLTK's English list)
df['text_without_stopwords'] = df['text'].apply(lambda x: ' '.join(
    [word.encode('latin1', 'ignore').decode('latin1') for word in x.split() if word not in stop]))
TEST_SENTENCES = df['text_without_stopwords']

it keeps raising this error: ValueError: All sentences should be Unicode-encoded!

The error is raised in the tokenization step:

tokenized, _, _ = st.tokenize_sentences(TEST_SENTENCES)
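The message suggests the tokenizer validates input types before tokenizing. A sketch of the kind of check involved (assuming Python 2 semantics; this is not the library's exact source):

# Every sentence must be a `unicode` object in Python 2; a plain byte
# string (str) trips the check even if its content is pure ASCII.
for sentence in TEST_SENTENCES:
    if not isinstance(sentence, unicode):
        raise ValueError('All sentences should be Unicode-encoded!')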

I want to know what is happening here that causes this error, and the correct way to fix the code.

(I have tried different encodings like utf-8, etc., but none worked.)

CodePudding user response:

I don't know the reason yet, but when I did

df['text_without_stopwords'] = df['text_without_stopwords'].astype('unicode')

it worked.

I am still very curious to know why this happens only when I do stop-word removal.
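One plausible mechanism, assuming Python 2 string semantics (an editor's sketch, not confirmed by the poster): pd.read_csv returns the text column as byte strings (str). The list comprehension decodes each surviving word to unicode, so ' '.join(...) usually yields unicode; but for a row whose words are all stop words (or which is empty), ' '.join([]) joins nothing and keeps the separator's byte-string type. The column then holds a mix of str and unicode entries, and the tokenizer rejects the first non-unicode one. .astype('unicode') converts every entry to unicode, which is why it fixes the error.

# Python 2 demonstration with hypothetical values:
stop = {'the', 'a', 'an'}

def clean(x):
    return ' '.join([w.encode('latin1', 'ignore').decode('latin1')
                     for w in x.split() if w not in stop])

print(type(clean('I appreciate your help')))  # <type 'unicode'>: joined from unicode words
print(type(clean('the a an')))                # <type 'str'>: ' '.join([]) keeps the
                                              # byte-string separator's type

If that is the cause, using a unicode separator, u' '.join(...), would also avoid the mixed types, since joining an empty list then yields u''.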
