I have the following dataframe:
import pandas as pd

d = {'col1': ['challenging', 'swimming', ',', 'it', 'my', 'there!'], 'col2': [3, 4, 5, 6, 7, 8]}
df = pd.DataFrame(data=d)
          col1  col2
0  challenging     3
1     swimming     4
2            ,     5
3           it     6
4           my     7
5       there!     8
I have defined this function that removes punctuation and stopwords:
from nltk.corpus import stopwords
import string

stop_words = set(stopwords.words('english'))

def remove_stopwords_punc(sent_words):
    # Keep only the tokens that are neither stopwords nor punctuation
    return [ww for ww in sent_words
            if ww.lower() not in stop_words and ww not in string.punctuation]
from nltk.tokenize import word_tokenize
remove_stopwords_punc(word_tokenize('challenging swimming , it my there!'))
['challenging', 'swimming']
I want to run this function on col1 and keep only the rows that are not stopwords or punctuation, which in this example leaves this output:
          col1  col2
0  challenging     3
1     swimming     4
CodePudding user response:
Use isin and str.contains:
import re

# Create a boolean mask per condition
m1 = df['col1'].str.lower().isin(stop_words)
# re.escape stops characters like ] and \ from breaking the character class
m2 = df['col1'].str.contains(fr"[{re.escape(string.punctuation)}]")

# Filter the dataframe, dropping rows that match either condition
df = df[~(m1 | m2)]
print(df)
# Output
          col1  col2
0  challenging     3
1     swimming     4
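A row-wise alternative that reuses the question's remove_stopwords_punc and word_tokenize directly (a minimal sketch; likely slower on large frames since it tokenizes cell by cell):

# Keep a row only if at least one of its tokens survives the filter
mask = df['col1'].apply(lambda s: bool(remove_stopwords_punc(word_tokenize(s))))
print(df[mask])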
CodePudding user response:
Use the list returned by your function above:
l = ['challenging', 'swimming']
out = df.loc[df['col1'].isin(l)]
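If you'd rather not hard-code the list, you can derive it from the column itself; tokenize first, since remove_stopwords_punc only drops exact punctuation tokens and would otherwise keep 'there!' (a sketch assuming the question's imports are in scope):

# word_tokenize splits 'there!' into 'there' and '!', so both get filtered out
l = remove_stopwords_punc(word_tokenize(' '.join(df['col1'])))
out = df.loc[df['col1'].isin(l)]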
CodePudding user response:
If you preprocess the list of acceptable words into a set, it would likely be faster (I haven't tested, though):
valid_words = set(remove_stopwords_punc(word_tokenize(" ".join(df['col1']))))
df = df[df['col1'].isin(valid_words)]
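For the sample data valid_words comes out as {'challenging', 'swimming'}. Building it as a set gives constant-time membership checks, and joining the column first means word_tokenize runs once rather than per row.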
CodePudding user response:
You already have sets, so just build one set of all the words to remove:
remove = stop_words|set(string.punctuation)
df[~(df['col1'].str.lower().isin(remove))]
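One caveat: this matches whole tokens only, so 'there!' (a stopword with punctuation attached) slips through on the sample frame. A sketch that strips punctuation characters before the lookup, assuming import re alongside the question's imports:

import re

# Strip punctuation characters, then drop stopwords and rows left empty
cleaned = df['col1'].str.replace(fr"[{re.escape(string.punctuation)}]", "", regex=True)
df[~cleaned.str.lower().isin(stop_words) & (cleaned != "")]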