def preprocess_text(text):
    tokenized_document = nltk.tokenize.RegexpTokenizer('[a-zA-Z0-9\']+')
    cleaned_tokens = [word.lower() for word in tokenized_document if word.lower() not in stop_words]
    stemmed_text = [nltk.stem.PorterStemmer().stem(word) for word in cleaned_tokens]
    return stemmed_text
data["Text"] = data["Text"].apply(preprocess_text)
data.head()
Error message:
TypeError: 'RegexpTokenizer' object is not iterable
CodePudding user response:
Your tokenized_document object is an instance of nltk.tokenize.RegexpTokenizer. You are trying to iterate over the values of tokenized_document (in the for word in tokenized_document expression), but nltk.tokenize.RegexpTokenizer doesn't support that usage. (That's what the 'RegexpTokenizer' object is not iterable message is telling you.)
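To illustrate: constructing a RegexpTokenizer only builds the tokenizer object; calling its tokenize method on a string is what returns an iterable list of tokens. A minimal sketch (the sample sentence is just an example):

import nltk

tokenizer = nltk.tokenize.RegexpTokenizer('[a-zA-Z0-9\']+')
# the tokenizer object itself is not iterable; it only holds the pattern
tokens = tokenizer.tokenize("the cat's hat")
print(tokens)  # ['the', "cat's", 'hat']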
CodePudding user response:
The source of the problem is that you have not called the tokenize method, and haven't used the text parameter at all. Fix: call .tokenize(text):

tokenized_document = nltk.tokenize.RegexpTokenizer('[a-zA-Z0-9\']+').tokenize(text)
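Putting it together, a corrected version of the whole function could look like the sketch below. It assumes nltk is imported and data is your DataFrame, as in the original code; stop_words is shown here as NLTK's English stopword list, which is an assumption, so substitute your own definition if it differs. Creating the stemmer once up front also avoids rebuilding it for every token.

import nltk

# assumption: stop_words is NLTK's English list (requires nltk.download('stopwords'))
stop_words = set(nltk.corpus.stopwords.words('english'))
stemmer = nltk.stem.PorterStemmer()  # build once instead of once per token

def preprocess_text(text):
    # the missing step: actually call .tokenize(text) to get a list of tokens
    tokens = nltk.tokenize.RegexpTokenizer('[a-zA-Z0-9\']+').tokenize(text)
    # lowercase and drop stop words
    cleaned_tokens = [word.lower() for word in tokens if word.lower() not in stop_words]
    # stem each remaining token
    return [stemmer.stem(word) for word in cleaned_tokens]

data["Text"] = data["Text"].apply(preprocess_text)
data.head()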