I am trying to remove stopwords in my data with nltk, but after several attempts I am unable to remove them. The tokenization part of my code works, but I cannot understand why the stopword removal does not work.
def pre_process(text):
    # remove special characters and digits
    text = re.sub(r"(\d|\W|_)+", " ", text)
    text = re.split(r"\W+", text)
    return text
text = dat['text'].apply(lambda x:pre_process(x))
nltk.download('stopwords')
def remove_stopwords(text):
    for word in text:
        if word in stopwords.words('english'):
            text.remove(word)
        return text
text_stopword = text.apply(lambda x:remove_stopwords(x))
The code should remove words such as 'the', but after running my csv through it, words such as 'the' are still present.
Current results:
text
returns:
[tv, future, in, the, hands, of, viewers, with...
text_stopword
returns:
[tv, future, in, the, hands, of, viewers, with...
CodePudding user response:
Your return statement in the remove_stopwords function is wrongly indented. Because it sits inside the for loop, the function returns text right after the first iteration.
Please go with:
def remove_stopwords(text):
    # iterate over a copy: removing items from the list you are
    # iterating over shifts the indices and skips elements
    for word in text[:]:
        if word in stopwords.words('english'):
            text.remove(word)
    return text
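As a side note, building a new list with a comprehension sidesteps the remove-while-iterating pitfall entirely. A minimal sketch, using a small hardcoded stopword set purely for illustration (in your code you would use stopwords.words('english')):

```python
# Stand-in for stopwords.words('english'); illustration only.
STOPWORDS = {"in", "the", "of", "with"}

def remove_stopwords(tokens):
    # Build a new list instead of mutating the one being iterated.
    return [word for word in tokens if word not in STOPWORDS]

tokens = ["tv", "future", "in", "the", "hands", "of", "viewers"]
print(remove_stopwords(tokens))  # ['tv', 'future', 'hands', 'viewers']
```

Converting the stopword list to a set once (e.g. stop = set(stopwords.words('english'))) also makes the membership check much faster, since stopwords.words('english') is otherwise re-read on every call.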