In Python, remove all stop words and bad characters from my pandas DataFrame

Time:12-24

In Python, I would like to remove all stop words including bad characters in one go from my pandas dataframe.

This is what I have tried:

stop_words_dataset = pd.read_csv(r'./stop.csv')
stop_words = stop_words_dataset['StopWords'].tolist()

dataframe['description'] = dataframe['description'].apply(lambda x: ' '.join([item for item in x.split() if item not in stop_words]))

However, my dataset still contains some characters that are in my stop.csv...

stop_words = ['OF','FT', ' ', '*', '-', '/', ')', '(']

For example, * and / are still in my dataset, whereas OF and FT were removed successfully. Why?

I have also done the same with the regex [^A-Za-z0-9]; however, I prefer the stop-word list solution and would like to get it working.

Concrete example:

stop_words_dataset = pd.read_csv(r'./stops.csv')
stop_words = stop_words_dataset['StopWords'].tolist()
# Remove stop words including bad characters.
dataframe['description'] = dataframe['description'].apply(lambda x: ' '.join([item for item in x.split() if item not in stop_words]))

Printed stop-word list:

['JUN', 'JUNE', 'JUL', 'JULY', 'AUG', 'OCT', 'NOV', 'DEC', 'FT', ' ', '*', '-', '/', ')', '(']

Example dataset before and after cleaning

Before cleaning:

*CHIMNEY CAKE PARAD LONDON
PUMPKIN CAFE DEC

After cleaning (the * is still there, but DEC was removed):

*CHIMNEY CAKE PARAD LONDON
PUMPKIN CAFE

CodePudding user response:

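x.split() splits only on whitespace, so a character such as "*" or "/" that is attached to a word (for example "*CHIMNEY") stays inside that token, and the token as a whole does not match any entry in the stop list. To remove it, you can also check every character of each word.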

Try this:

dataframe['description'] = dataframe['description'].apply(lambda x: ' '.join([''.join([l for l in item if l not in stop_words]) for item in x.split() if item not in stop_words]))
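
To make the cause visible, here is a minimal sketch using the sample rows and the stop list from the question. The function name `clean` and the variable `rows` are made up for illustration, and turning the list into a set is my own choice. The sketch also drops a token if it becomes empty or turns into a stop word after its bad characters are stripped, which the one-liner above does not do.

```python
# Stop list copied from the question; a set makes membership tests cheap.
stop_words = {'JUN', 'JUNE', 'JUL', 'JULY', 'AUG', 'OCT', 'NOV', 'DEC',
              'FT', ' ', '*', '-', '/', ')', '('}

# Sample rows copied from the question.
rows = ['*CHIMNEY CAKE PARAD LONDON', 'PUMPKIN CAFE DEC']

# str.split() breaks on whitespace only, so '*CHIMNEY' is a single token.
tokens = rows[0].split()
star_token_is_stop_word = tokens[0] in stop_words  # False: '*CHIMNEY' != '*'


def clean(text):
    """Drop stop-word tokens, then strip stop characters inside the rest."""
    kept = []
    for token in text.split():
        if token in stop_words:
            continue
        # Character-level pass: removes '*', '/', etc. attached to a word.
        stripped = ''.join(ch for ch in token if ch not in stop_words)
        # Skip tokens that ended up empty or became a stop word, e.g. '(DEC)'.
        if stripped and stripped not in stop_words:
            kept.append(stripped)
    return ' '.join(kept)


cleaned = [clean(row) for row in rows]
```

On a DataFrame the same function could be applied with dataframe['description'].apply(clean). Note that a string such as "A/B" becomes "AB" here, because the slash is deleted rather than treated as a separator.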

CodePudding user response:

You can also split with the regular expression \W+, which splits on any run of non-word characters (anything other than letters, digits, and the underscore).

import re

dataframe['description'] = dataframe['description'].apply(lambda x: ' '.join([item for item in re.split(r'\W+', x) if item not in stop_words and item != '']))

Output:

>>> dataframe
                 description
0  CHIMNEY CAKE PARAD LONDON
1               PUMPKIN CAFE
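
The regex approach can be checked without a CSV file. This is only a sketch on the sample rows from the question; `clean_with_regex` and `rows` are hypothetical names, and the stop list is stored as a set for convenience.

```python
import re

# Stop list copied from the question.
stop_words = {'JUN', 'JUNE', 'JUL', 'JULY', 'AUG', 'OCT', 'NOV', 'DEC',
              'FT', ' ', '*', '-', '/', ')', '('}


def clean_with_regex(text):
    """Split on runs of non-word characters, then drop stop words."""
    # A leading '*' produces an empty first token, so empty strings are skipped.
    tokens = re.split(r'\W+', text)
    return ' '.join(t for t in tokens if t and t not in stop_words)


rows = ['*CHIMNEY CAKE PARAD LONDON', 'PUMPKIN CAFE DEC']
cleaned = [clean_with_regex(row) for row in rows]
```

Because punctuation acts as a separator here, "A/B" becomes "A B" instead of being fused into one word. The symbol entries in the stop list are no longer needed with this approach, since \W+ already consumes them.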