Modify Stopword-Removal-Code to remove numbers as well

Time:01-18

I have tokenized text in a DataFrame column. The code that removes stopwords from it works, but I would also like to remove punctuation, numbers, and special characters without listing each one explicitly. In particular, I want to be sure it also removes larger numbers that were tokenized as a single token.

My current code is:

from nltk.corpus import stopwords

eng_stopwords = stopwords.words('english')
punctuation = ['.', ',', ';', ':', '!']  # and so on
complete_stopwords = punctuation + eng_stopwords
df['removed'] = df['tokenized_text'].apply(lambda words: [word for word in words if word not in complete_stopwords])

CodePudding user response:

You can get the punctuation characters from the string module:

import string
print(string.punctuation)

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

eng_stopwords = stopwords.words('english')

punctuation = list(string.punctuation) 

complete_stopwords = punctuation + eng_stopwords

df['removed'] = df['tokenized_text'].apply(lambda words: [word for word in words if word not in complete_stopwords])
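Note that the list above only catches single punctuation characters; a token such as '2023', '1,250.50' or '...' would still get through. One possible way to cover those without spelling them out is to also drop every token that contains no letters at all. The following is only a sketch under assumptions: the function name remove_noise_tokens and the regex are my own, and the small eng_stopwords set is a hypothetical stand-in for stopwords.words('english') so the snippet runs without the NLTK data.

```python
import re

# Hypothetical stand-in for stopwords.words('english'), so the sketch runs without NLTK data
eng_stopwords = {"the", "is", "a", "of", "and"}

# Matches tokens made up entirely of digits, underscores, punctuation, or other non-letter characters
NON_WORD_TOKEN = re.compile(r"[\W\d_]+")

def remove_noise_tokens(words):
    """Drop stopwords and any token that contains no letters (numbers, punctuation, symbols)."""
    return [
        word for word in words
        if word not in eng_stopwords and not NON_WORD_TOKEN.fullmatch(word)
    ]

tokens = ["the", "price", "is", "1,250.50", "!", "...", "2023", "covid-19"]
print(remove_noise_tokens(tokens))  # ['price', 'covid-19']
```

With the DataFrame from the question, this would be applied as df['removed'] = df['tokenized_text'].apply(remove_noise_tokens). Tokens that mix letters and digits, such as 'covid-19', are kept on purpose; if those should go too, filtering on word.isalpha() instead would be stricter, at the cost of also dropping hyphenated words and contractions.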