I'm analyzing a long list of survey responses. I can remove the stopwords in the standard NLTK list without any problem. However, I've created a modified list and can't figure out how to incorporate it into my code. The original code I used for the standard list was:
# Create a column with stopwords removed. The 'no_punc' column holds responses that have already been tokenized, lowercased, and stripped of punctuation.
stop_words = set(stopwords.words('english'))
df['stopwords_removed'] = df['no_punc'].apply(lambda x: [word for word in x if word not in stop_words])
df.head()
I've added to the standard list using the following code:
stop_words = set(stopwords.words('english'))
new_stopwords = ['satisfying', 'satisfy', 'satisfied', 'clemson', 'university', 'institution', 'disappointing', 'disappoint', 'disappointed', 'experience', 'would', 'should']
new_stopwords_list = stop_words.union(new_stopwords)
My question is: how would I modify my original code to use new_stopwords_list instead of the standard list?
CodePudding user response:
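I am not sure I understand completely, but you should be able to use the same line of code and check for membership in the new set instead. Note that the set to check against is new_stopwords_list (the union of the standard and added words), not new_stopwords, which holds only the words you added. So:
df['stopwords_removed'] = df['no_punc'].apply(lambda x: [word for word in x if word not in new_stopwords_list])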
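To see the difference on a single tokenized response, here is a minimal sketch in plain Python. It makes a few assumptions: the small stop_words set is only a stand-in for set(stopwords.words('english')) so the snippet runs without the NLTK corpus, the token list is made up, and remove_stopwords is a hypothetical helper name rather than anything from NLTK or pandas.

```python
# Stand-in for set(stopwords.words('english')): a tiny hypothetical subset,
# used here so the sketch runs without downloading the NLTK corpus.
stop_words = {'i', 'the', 'was', 'with', 'my', 'at', 'a', 'very'}

# A few of the custom words from the question.
new_stopwords = ['satisfied', 'clemson', 'university', 'experience', 'would', 'should']

# union() accepts any iterable and returns a new set; stop_words is unchanged.
new_stopwords_list = stop_words.union(new_stopwords)


def remove_stopwords(tokens, stop_set):
    """Return the tokens that are not in stop_set, preserving their order."""
    return [word for word in tokens if word not in stop_set]


# Made-up example of one tokenized, lowercased, punctuation-free response.
tokens = ['the', 'advising', 'at', 'clemson', 'was', 'very', 'helpful']

cleaned = remove_stopwords(tokens, new_stopwords_list)
print(cleaned)

# Filtering against new_stopwords alone would leave the standard stopwords in.
only_custom = remove_stopwords(tokens, new_stopwords)
print(only_custom)
```

On a DataFrame, the same helper could be applied per row with df['no_punc'].apply(remove_stopwords, args=(new_stopwords_list,)), assuming each cell of 'no_punc' holds a list of tokens as in the question.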