In Python, I would like to remove all stop words including bad characters in one go from my pandas dataframe.
This is what I have tried:
stop_words_dataset = pd.read_csv(r'./stop.csv')
stop_words = stop_words_dataset['StopWords'].tolist()
dataframe['description'] = dataframe['description'].apply(lambda x: ' '.join([item for item in x.split() if item not in stop_words]))
However, my dataset still contains some characters that are in my stop.csv...
stop_words = ['OF', 'FT', ' ', '*', '-', '/', ')', '(']
For example, * and / are still in my dataset, whereas OF and FT were removed successfully. Why?
I have also done the same with a regex, [^A-Za-z0-9]; however, I prefer the stop word list solution and would like to get it working.
Concrete example:
stop_words_dataset = pd.read_csv(r'./stops.csv')
stop_words = stop_words_dataset['StopWords'].tolist()
# Remove stop words including bad characters.
dataframe['description'] = dataframe['description'].apply(lambda x: ' '.join([item for item in x.split() if item not in stop_words]))
print of stopword list
['JUN', 'JUNE', 'JUL', 'JULY', 'AUG', 'OCT', 'NOV', 'DEC', 'FT', ' ', '*', '-', '/', ')', '(']
Example dataset
Before cleaning:
*CHIMNEY CAKE PARAD LONDON
PUMPKIN CAFE DEC
After cleaning (this still contains * but DEC was removed):
*CHIMNEY CAKE PARAD LONDON
PUMPKIN CAFE
CodePudding user response:
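That's because x.split() only splits on whitespace, so a character such as * or / that is attached to a word stays "inside" that token. The token '*CHIMNEY' is not equal to the stop word '*', so it is kept unchanged. You can additionally check every character of each word against the list.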
Try this:
dataframe['description'] = dataframe['description'].apply(lambda x: ' '.join([''.join([l for l in item if l not in stop_words]) for item in x.split() if item not in stop_words]))
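To make the cause and the fix concrete, here is a minimal sketch on plain strings. The helper name remove_stops and the inlined stop_words list are assumptions for illustration (the question loads the list from a CSV file), and dropping words that end up empty is a small addition to the one-liner above.

```python
# Stop words inlined for illustration; the question reads them from a CSV file.
stop_words = ['JUN', 'JUNE', 'JUL', 'JULY', 'AUG', 'OCT', 'NOV', 'DEC',
              'FT', ' ', '*', '-', '/', ')', '(']

text = '*CHIMNEY CAKE PARAD LONDON'

# str.split() only splits on whitespace, so '*' stays attached to 'CHIMNEY'.
tokens = text.split()
print(tokens)  # ['*CHIMNEY', 'CAKE', 'PARAD', 'LONDON']


def remove_stops(x):
    """Drop whole-word stop words, then drop stop characters inside the rest."""
    words = [item for item in x.split() if item not in stop_words]
    cleaned = [''.join(ch for ch in item if ch not in stop_words)
               for item in words]
    # Skip words that became empty after character removal.
    return ' '.join(word for word in cleaned if word)


print(remove_stops(text))                # CHIMNEY CAKE PARAD LONDON
print(remove_stops('PUMPKIN CAFE DEC'))  # PUMPKIN CAFE
```

With pandas this would be applied as dataframe['description'].apply(remove_stops). Note that any single-letter entry in the stop word list would also be stripped from inside ordinary words by the character-level check.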
CodePudding user response:
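You can also split on the regular expression \W+, which matches runs of non-word characters (anything other than letters, digits and the underscore), so punctuation attached to a word is separated from it. The empty strings that re.split produces at the start or end of a string are filtered out.
import re
dataframe['description'] = dataframe['description'].apply(lambda x: ' '.join([item for item in re.split(r'\W+', x) if item not in stop_words and item != '']))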
Output:
>>> dataframe
                 description
0  CHIMNEY CAKE PARAD LONDON
1               PUMPKIN CAFE
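The same idea can be sketched as a small reusable function, which is easier to test than a long lambda. The names clean and splitter are hypothetical, the stop words are inlined as an assumption, and a set is used only to make the membership test cheaper.

```python
import re

# Stop words inlined for illustration; punctuation entries are no longer
# needed because the regex split already discards non-word characters.
stop_words = {'JUN', 'JUNE', 'JUL', 'JULY', 'AUG', 'OCT', 'NOV', 'DEC', 'FT'}

# One or more non-word characters (anything except letters, digits, '_').
splitter = re.compile(r'\W+')


def clean(x):
    """Split on non-word characters and drop stop words and empty pieces."""
    return ' '.join(item for item in splitter.split(x)
                    if item and item not in stop_words)


print(clean('*CHIMNEY CAKE PARAD LONDON'))  # CHIMNEY CAKE PARAD LONDON
print(clean('PUMPKIN CAFE DEC'))            # PUMPKIN CAFE
print(clean('CO-OP 10 FT/30 FT'))           # CO OP 10 30
```

With pandas this would be applied as dataframe['description'].apply(clean). One trade-off, visible in the last line: splitting on \W+ also breaks hyphenated words apart, whereas the character-level approach would join them into COOP.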