Replacing words in a dataframe given a list of common misspellings in Python?

Time:09-22

I have a large dataset with a column of multi-word names that contains frequent spelling errors, and a separate dataframe with a column of common misspellings. I want to replace every misspelling in the large dataset with the one correct spelling.

This is what I have tried so far, on a simplified dataset. It does replace the word, but each time I end up with extra characters at the end of the word. I believe it replaces part of a word whenever that part matches one of the common misspellings. What I want is to replace a word only when the whole word exactly matches a misspelling (the match does not need to cover the whole name). I'm guessing there is something I could do with regex?

import pandas as pd

df = pd.DataFrame({'city': ['City of Cleveland', 'City of Clvland', 'City of Boston', 'City New York', 'City of Clev', 'City of Miami', 'City of Cland', 'Cle Ci']})

df_spelling = pd.DataFrame({'spellings': ['Clvland', 'Clev', 'Cland']})

# Replace each known misspelling wherever it occurs in the column
for item in df_spelling['spellings']:
    df['city'] = df['city'].str.replace(item, 'Cleveland')

This code gives the output in the attached image.
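The symptom can be reproduced with a minimal sketch on a shortened version of the sample data. It assumes the loop performs plain substring replacement; `regex=False` is passed explicitly so the behaviour does not depend on the pandas version. The names `misspellings` and `result` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'city': ['City of Cleveland', 'City of Clvland', 'City of Clev']})
misspellings = ['Clvland', 'Clev', 'Cland']

# Plain substring replacement: 'Clev' also matches the first four
# letters of an already-correct 'Cleveland', producing 'Clevelandeland'.
for item in misspellings:
    df['city'] = df['city'].str.replace(item, 'Cleveland', regex=False)

result = df['city'].tolist()
print(result)
# ['City of Clevelandeland', 'City of Clevelandeland', 'City of Cleveland']
```

The second row is first corrected to 'City of Cleveland' by the 'Clvland' pass and then damaged by the 'Clev' pass, which is why even fixed rows end up with trailing characters.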


CodePudding user response:

It appears the previous answerer did not try their code; had they done so, they would have hit the following error, because df['city'] is a pandas Series, not a string:

TypeError: expected string or bytes-like object

The general idea of using a regex to match only whole words is correct, though. Series.str.replace accepts a regex parameter, so there is no need to use the re module. Also, \b is the most concise way to match a word boundary: it is a zero-width assertion that matches between a word character and a non-word character (such as a space), or at the beginning/end of the string next to a word character.
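To see what \b does in isolation, here is a small sketch on plain strings. It uses the re module only to illustrate the pattern semantics; the input 'Clev, Ohio' is a hypothetical value that is not in the original data.

```python
import re

pattern = re.compile(r"\bClev\b")

# 'Clev' stands alone as a word, so it is replaced.
fixed = pattern.sub("Cleveland", "City of Clev")

# In 'Cleveland', 'Clev' is followed by the word character 'e', so there
# is no boundary after it and the correct spelling is left untouched.
untouched = pattern.sub("Cleveland", "City of Cleveland")

# A comma is a non-word character, so a boundary exists before it.
with_comma = pattern.sub("Cleveland", "Clev, Ohio")

print(fixed)       # City of Cleveland
print(untouched)   # City of Cleveland
print(with_comma)  # Cleveland, Ohio
```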

The following gives the output you're looking for:

for item in df_spelling['spellings']:
    df['city'] = df['city'].str.replace(rf"\b{item}\b", 'Cleveland', regex=True) 
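As a possible variation (not part of the answer above), the misspellings could be combined into one alternation so the column is scanned once instead of once per misspelling. This sketch assumes a misspelling might contain regex metacharacters, which is why it pulls in re.escape; if the list is known to be plain letters, the escape step is unnecessary. The names `alternatives` and `result` are illustrative.

```python
import re

import pandas as pd

df = pd.DataFrame({'city': ['City of Cleveland', 'City of Clvland', 'City of Boston',
                            'City New York', 'City of Clev', 'City of Miami',
                            'City of Cland', 'Cle Ci']})
df_spelling = pd.DataFrame({'spellings': ['Clvland', 'Clev', 'Cland']})

# Escape each misspelling, then join them into a single non-capturing group
# wrapped in word boundaries: \b(?:Clvland|Clev|Cland)\b
alternatives = '|'.join(re.escape(s) for s in df_spelling['spellings'])
pattern = rf"\b(?:{alternatives})\b"

df['city'] = df['city'].str.replace(pattern, 'Cleveland', regex=True)

result = df['city'].tolist()
print(result)
```

Because the boundaries sit outside the group, they apply to every alternative, so 'Clev' still cannot match inside 'Cleveland'.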

CodePudding user response:

We can build a single regex pattern from your spellings DataFrame that matches the whole cell, rather than trying to fix individual pieces:

pattern = '|'.join(f'(.*{i}.*)' for i in df_spelling.spellings)
df.city = df.city.replace(pattern, 'City of Cleveland', regex=True)
print(df)

Output:

                city
0  City of Cleveland
1  City of Cleveland
2     City of Boston
3      City New York
4  City of Cleveland
5      City of Miami
6  City of Cleveland
7             Cle Ci
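One thing to keep in mind with the whole-cell pattern: any cell that merely contains a misspelling is overwritten in full. A rough sketch with hypothetical values ('Town of Clev' and 'Clevedon Road' are not in the original data) shows the effect:

```python
import pandas as pd

spellings = ['Clvland', 'Clev', 'Cland']
pattern = '|'.join(f'(.*{i}.*)' for i in spellings)

# Hypothetical values chosen to show that the whole cell is replaced,
# regardless of the surrounding words.
s = pd.Series(['City of Clev', 'Town of Clev', 'Clevedon Road', 'City of Boston'])
result = s.replace(pattern, 'City of Cleveland', regex=True).tolist()

print(result)
# ['City of Cleveland', 'City of Cleveland', 'City of Cleveland', 'City of Boston']
```

That is fine when every affected name should become 'City of Cleveland', but it is not a whole-word replacement.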


CodePudding user response:

Could something like this work?

import re
...
df['city'] = df['city'].str.replace(r'\b{}\b'.format(item), 'Cleveland', regex=True)

This matches the misspelling only where it appears as a whole word, so longer words that merely contain it are left alone.
