I am trying to solve a tokenization problem in my dataset of social-media comments. I want to tokenize, lemmatize, and remove punctuation and stop words from a pandas column, but I am struggling with how to do it for each comment. I get the following error when trying to get tokens:
import pandas as pd
import nltk
...
merged['message_tokens'] = merged.apply(lambda x: nltk.tokenize.word_tokenize(x['Clean_message']), axis=1)
TypeError: expected string or bytes-like object
When I try to tell pandas that I am passing it a string object, it gives me a different error:
merged['message_tokens'] = merged.apply(lambda x: nltk.tokenize.word_tokenize(x['Clean_message'].str), axis=1)
AttributeError: 'str' object has no attribute 'str'
What am I doing wrong?
CodePudding user response:
You can use astype to force the column type to string:
merged['Clean_message'] = merged['Clean_message'].astype(str)
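A stray NaN or numeric value in the column is enough to trigger that TypeError, since word_tokenize only accepts strings. A minimal sketch of the fix (the column contents here are illustrative):

```python
import numpy as np
import pandas as pd

# a column with mixed types: strings, a float, and a missing value
merged = pd.DataFrame({'Clean_message': ['great post', 3.14, np.nan]})

# force everything to string; note that NaN becomes the literal string 'nan'
merged['Clean_message'] = merged['Clean_message'].astype(str)

print(merged['Clean_message'].tolist())  # → ['great post', '3.14', 'nan']
```

If the 'nan' strings are unwanted, you may prefer to drop or fill the missing values (e.g. with fillna('')) before the cast.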
If you want to look at what is wrong in the original column, you can use:
m = merged['Clean_message'].apply(type).ne(str)
out = merged[m]
The out dataframe contains the rows where the type of the Clean_message column is not string.
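Once the column is all strings, the original goal (tokenize, drop punctuation and stop words) can be done with a single apply. The sketch below uses a simple regex tokenizer and a tiny hand-picked stop-word list in place of the NLTK resources, since nltk.word_tokenize and nltk.corpus.stopwords require downloading data (punkt, stopwords) first; the structure is the same either way:

```python
import re
import pandas as pd

# tiny illustrative stop-word list; in practice use nltk.corpus.stopwords.words('english')
STOPWORDS = {'the', 'a', 'is', 'and', 'to'}

def clean_tokens(text):
    # \w+ keeps only runs of word characters, which drops punctuation as a side effect
    tokens = re.findall(r'\w+', text.lower())
    return [t for t in tokens if t not in STOPWORDS]

merged = pd.DataFrame({'Clean_message': ['This is a GREAT post!', 'Thanks, all.']})
merged['Clean_message'] = merged['Clean_message'].astype(str)
merged['message_tokens'] = merged['Clean_message'].apply(clean_tokens)

print(merged['message_tokens'].tolist())
# → [['this', 'great', 'post'], ['thanks', 'all']]
```

Lemmatization (e.g. with nltk's WordNetLemmatizer) would slot into clean_tokens as one more mapping step over the filtered tokens.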