Remove entire rows from df if a stop word occurs

List of stop words:

stop_w = ["in", "&", "the", "|", "and", "is", "of", "a", "an", "as", "for", "was"]

df:

words	frequency
the company	10
green energy	9
founded in	8
gases for	8
electricity	5

I would like to remove the entire row if it contains ANY of the given stop words. In this example the output should be:

words	frequency
green energy	9
electricity	5

CodePudding user response：

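The | character has a special meaning in regular expressions: it means "or" (alternation). Since str.contains treats its pattern as a regular expression by default, you need to escape | with a backslash \ to match it literally in your stop-word list. Write the escaped entry as a raw string (r"\|") so that Python does not treat the backslash as the start of a string escape.

With that in place you can do:

stop_w = ["in", "&", "the", r"\|", "and", "is", "of", "a", "an", "as", "for", "was"]
df.loc[~df['words'].str.contains('|'.join(stop_w))]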

prints:

          words  frequency
1  green energy          9
4   electricity          5
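
One caveat: str.contains matches substrings, so a stop word such as "in" would also remove a row like "wind power" (a hypothetical row, not part of the question's data). Below is a minimal sketch that only matches whole whitespace-separated words, assuming that is the intended behaviour. It uses re.escape instead of hand-escaping the list; the names alternatives, pattern and result are illustrative.

```python
import re
import pandas as pd

# Data from the question.
stop_w = ["in", "&", "the", "|", "and", "is", "of", "a", "an", "as", "for", "was"]
df = pd.DataFrame({
    "words": ["the company", "green energy", "founded in", "gases for", "electricity"],
    "frequency": [10, 9, 8, 8, 5],
})

# re.escape neutralises regex metacharacters such as "|",
# so the stop-word list needs no manual escaping.
alternatives = "|".join(re.escape(w) for w in stop_w)

# The lookarounds require whitespace (or the string edge) on both sides,
# so "in" matches as a whole word but not inside "wind".
pattern = rf"(?<!\S)(?:{alternatives})(?!\S)"

# Keep only rows in which no stop word appears as a whole word.
result = df.loc[~df["words"].str.contains(pattern, regex=True)]
print(result)
```

On the question's data this keeps the same two rows as above. The difference only shows up when a stop word is embedded in a longer word.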

CodePudding user response：

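You can build a boolean mask and create sub_df like this:

# True for rows where any whitespace-separated word is a stop word
has_stop = df.words.str.split().apply(lambda ws: any(w in stop_w for w in ws))
sub_df = df[~has_stop]

Or get the index labels of the rows you want to remove and drop them (drop returns a new DataFrame, so assign the result):

idx = df[has_stop].index
df = df.drop(idx)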

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
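
A self-contained sketch of the drop-based approach, assuming words are separated by whitespace. It uses Series.explode (available since pandas 0.25) together with isin to find the offending rows. The names tokens, bad_idx and result are illustrative.

```python
import pandas as pd

# Data from the question.
stop_w = ["in", "&", "the", "|", "and", "is", "of", "a", "an", "as", "for", "was"]
df = pd.DataFrame({
    "words": ["the company", "green energy", "founded in", "gases for", "electricity"],
    "frequency": [10, 9, 8, 8, 5],
})

# One row per word; each word keeps the index label of the row it came from.
tokens = df["words"].str.split().explode()

# Index labels of rows that contain at least one stop word.
bad_idx = tokens[tokens.isin(stop_w)].index.unique()

# drop returns a new DataFrame; the original df is left unchanged.
result = df.drop(bad_idx)
print(result)
```

Because this compares whole words rather than regex patterns, "|" and "&" need no escaping here.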