Removing a custom stop words list from a column in a pandas data frame

Time:06-30

I'm analyzing a long list of survey responses. I can remove the stopwords in the standard nltk list just fine. However, I've created a modified list and can't figure out how to incorporate it into the code. The original code I used for the standard list was:

# Create a column with the stopwords removed. The source column, 'no_punc', holds responses that have already been tokenized, lowercased, and stripped of punctuation.

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

df['stopwords_removed'] = df['no_punc'].apply(lambda x: [word for word in x if word not in stop_words])

df.head()

I've added to the standard list using the following code:

stop_words = set(stopwords.words('english'))

new_stopwords = ['satisfying', 'satisfy', 'satisfied', 'clemson', 'university', 'institution', 'disappointing', 'disappoint', 'disappointed', 'experience', 'would', 'should']

new_stopwords_list = stop_words.union(new_stopwords)

My question is how would I modify my original code to include the new_stopwords_list instead of the standard one?

CodePudding user response:

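I am not sure if I understand completely, but why can't you use the same line of code and check for membership in the new set instead? Note that new_stopwords_list is the combined set, while new_stopwords holds only your custom additions, so the combined one is the name to use:

df['stopwords_removed'] = df['no_punc'].apply(lambda x: [word for word in x if word not in new_stopwords_list])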
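To see the difference between the two names, here is a minimal sketch in plain Python. It assumes a tiny hand-written stand-in for the nltk list (so it runs without downloading the corpus) and two made-up tokenized responses; the names standard_stop_words, all_stop_words, remove_stopwords, and responses are illustrative and not from the question.

```python
# Stand-in for set(stopwords.words('english')): a tiny hypothetical subset
# so the sketch runs without the NLTK corpus installed.
standard_stop_words = {"i", "the", "was", "with", "my", "at", "a"}

# Custom additions (a shortened version of the list in the question).
new_stopwords = ["satisfied", "clemson", "university", "experience", "would"]

# union() accepts any iterable and returns a new set;
# standard_stop_words itself is left unchanged.
all_stop_words = standard_stop_words.union(new_stopwords)


def remove_stopwords(tokens, stop_words):
    """Return the tokens that are not in stop_words, preserving order."""
    return [word for word in tokens if word not in stop_words]


# Hypothetical responses, already tokenized, lowercased, and punctuation-free.
responses = [
    ["i", "was", "really", "satisfied", "with", "my", "experience", "at", "clemson"],
    ["the", "advising", "would", "need", "improvement"],
]

# Filter every response against the combined set.
cleaned = [remove_stopwords(tokens, all_stop_words) for tokens in responses]
print(cleaned)
```

With a data frame, the same function should work as df['no_punc'].apply(remove_stopwords, args=(all_stop_words,)). Checking against the combined set matters: filtering against new_stopwords alone would leave standard words such as "the" in the output.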