I have the following dataframe:
import pandas as pd

d = {'col1': ['challenging', 'swimming', ',', 'it', 'my', 'there!'], 'col2': [3, 4, 5, 6, 7, 8]}
df = pd.DataFrame(data=d)
          col1  col2
0  challenging     3
1     swimming     4
2            ,     5
3           it     6
4           my     7
5       there!     8
I have defined this function that removes punctuation and stopwords:
from nltk.corpus import stopwords
import string

stop_words = set(stopwords.words('english'))

def remove_stopwords_punc(sent_words):
    # Keep only the tokens that are neither stopwords nor punctuation
    return [ww for ww in sent_words
            if ww.lower() not in stop_words and ww not in string.punctuation]
from nltk.tokenize import word_tokenize
remove_stopwords_punc(word_tokenize('challenging swimming , it my there!'))
['challenging', 'swimming']
I want to run this function on col1 and keep only the rows that are not stopwords or punctuation, which in this example leaves this output:
          col1  col2
0  challenging     3
1     swimming     4
CodePudding user response:
Use isin and str.contains:
import re

# Create a boolean mask per condition
m1 = df['col1'].str.lower().isin(stop_words)
# re.escape stops characters like ] and \ from breaking the character class
m2 = df['col1'].str.contains(fr"[{re.escape(string.punctuation)}]")

# Filter the dataframe, dropping rows that match either condition
df = df[~(m1 | m2)]
print(df)
# Output
          col1  col2
0  challenging     3
1     swimming     4
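A row-wise alternative that reuses the question's remove_stopwords_punc and word_tokenize directly (a minimal sketch; likely slower on large frames since it tokenizes cell by cell):

# Keep a row only if at least one of its tokens survives the filter
mask = df['col1'].apply(lambda s: bool(remove_stopwords_punc(word_tokenize(s))))
print(df[mask])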
CodePudding user response:
Use the list returned by your function above:
l = ['challenging', 'swimming']
out = df.loc[df['col1'].isin(l)]
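If you'd rather not hard-code the list, you can derive it from the column itself; tokenize first, since remove_stopwords_punc only drops exact punctuation tokens and would otherwise keep 'there!' (a sketch assuming the question's imports are in scope):

# word_tokenize splits 'there!' into 'there' and '!', so both get filtered out
l = remove_stopwords_punc(word_tokenize(' '.join(df['col1'])))
out = df.loc[df['col1'].isin(l)]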
CodePudding user response:
If you preprocess the list of acceptable words into a set, it would likely be faster (I haven't tested, though):
valid_words = set(remove_stopwords_punc(word_tokenize(" ".join(df['col1']))))
df = df[df['col1'].isin(valid_words)]
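For the sample data valid_words comes out as {'challenging', 'swimming'}. Building it as a set gives constant-time membership checks, and joining the column first means word_tokenize runs once rather than per row.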
CodePudding user response:
You already have sets, so just build one set of all the words to remove:
remove = stop_words|set(string.punctuation)
df[~(df['col1'].str.lower().isin(remove))]
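One caveat: this matches whole tokens only, so 'there!' (a stopword with punctuation attached) slips through on the sample frame. A sketch that strips punctuation characters before the lookup, assuming import re alongside the question's imports:

import re

# Strip punctuation characters, then drop stopwords and rows left empty
cleaned = df['col1'].str.replace(fr"[{re.escape(string.punctuation)}]", "", regex=True)
df[~cleaned.str.lower().isin(stop_words) & (cleaned != "")]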