I defined a function to tokenize my text, but calling it raises the error shown below.


def preprocess_text(text):
    tokenized_document = nltk.tokenize.RegexpTokenizer('[a-zA-Z0-9\'] ')
    cleaned_tokens = [word.lower() for word in tokenized_document if word.lower() not in stop_words]
    stemmed_text = [nltk.stem.PorterStemmer().stem(word) for word in cleaned_tokens]
    return stemmed_text

data["Text"] = data["Text"].apply(preprocess_text)


Error message:

TypeError: 'RegexpTokenizer' object is not iterable

CodePudding user response:

Your tokenized_document object is an instance of nltk.tokenize.RegexpTokenizer, that is, the tokenizer itself rather than a list of tokens. You are trying to iterate over it (in the for word in tokenized_document expression), but nltk.tokenize.RegexpTokenizer doesn't support iteration. (That's what the 'RegexpTokenizer' object is not iterable message is telling you.)
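
To see the difference between the tokenizer and its output, here is a minimal sketch. The pattern `\w+` and the sample string are only illustrative assumptions, not taken from the question:

```python
from nltk.tokenize import RegexpTokenizer

# Illustrative pattern (an assumption, not the asker's): runs of word characters.
tokenizer = RegexpTokenizer(r"\w+")

# The tokenizer object itself cannot be looped over.
try:
    iter(tokenizer)
    is_iterable = True
except TypeError:
    is_iterable = False

# Its tokenize() method returns a plain list, which can be looped over.
tokens = tokenizer.tokenize("Hello world")

print(is_iterable)
print(tokens)
```

A list comprehension needs the list returned by tokenize(), not the tokenizer object.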

CodePudding user response:

The source of the problem is that you have not called the tokenize method, and haven't used the text parameter at all.

Fix: call .tokenize(text):

    tokenized_document = nltk.tokenize.RegexpTokenizer('[a-zA-Z0-9\'] ').tokenize(text)
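
Putting it together, the whole function might look like the sketch below. Two things here are assumptions rather than part of the original post: the small hand-written stop_words set (the question doesn't show how stop_words was defined) and the changed pattern. As written, '[a-zA-Z0-9\'] ' matches a single character followed by a space, so it would probably return fragments such as 'e ' instead of whole words; '[a-zA-Z0-9\']+' is most likely what was intended.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Assumption: a tiny stand-in for the asker's stop_words, which the post doesn't show.
stop_words = {"the", "are", "a", "is"}

# Build these once instead of on every call or for every word.
# Assumption: "+" added so the pattern matches whole words, not one character plus a space.
tokenizer = RegexpTokenizer(r"[a-zA-Z0-9']+")
stemmer = PorterStemmer()

def preprocess_text(text):
    # Call tokenize() on the input text to get a list of tokens.
    tokens = tokenizer.tokenize(text)
    # Lowercase each token and drop stop words.
    cleaned_tokens = [word.lower() for word in tokens if word.lower() not in stop_words]
    # Reduce each remaining token to its stem.
    return [stemmer.stem(word) for word in cleaned_tokens]

result = preprocess_text("The cats are running")
print(result)
```

The function can then be applied as in the question, with data["Text"].apply(preprocess_text).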