I have a large dataset with one column of multi-word names that contain frequent spelling errors, and a separate dataframe with a column of common misspellings. I want to replace every misspelling in the large dataset with the one correct spelling.
This is what I have tried so far (with a simplified dataset). It does replace the word, but extra characters appear at the end of the word each time. I believe it replaces every occurrence of a common misspelling, even when that occurrence is only part of a word. I want it to replace a word only when the whole word exactly matches the misspelling (the misspelling does not need to match the whole name). I'm guessing there is something I could do with regex?
import pandas as pd

df = pd.DataFrame({'city': ['City of Cleveland', 'City of Clvland', 'City of Boston', 'City New York', 'City of Clev', 'City of Miami', 'City of Cland', 'Cle Ci']})
df_spelling = pd.DataFrame({'spellings': ['Clvland','Clev', 'Cland']})
for item in df_spelling['spellings']:
    df['city'] = df['city'].str.replace(item,'Cleveland')
This code gives the output in the attached image.
CodePudding user response:
It appears that the previous answerer has not tried their code; if they had, they would get the following error, since df['city'] is a pandas Series, not a string:
TypeError: expected string or bytes-like object
The general idea of using a regex to match only whole words is correct, though. Series.str.replace accepts a regex parameter, so there is no need to use the re module. Also, \b is the most concise way to match word boundaries in a regex: it matches the zero-width position between a word character and either a non-word character (such as a space) or the beginning/end of the string.
The following gives the output you're looking for:
for item in df_spelling['spellings']:
    df['city'] = df['city'].str.replace(rf"\b{item}\b", 'Cleveland', regex=True)
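To see why the original loop leaves extra characters, and as a possible variation on the loop above, here is a minimal sketch. It assumes the misspellings might contain regex metacharacters, so it escapes them with re.escape, and it joins them into one alternation so the column is only scanned once. The names naive, alternatives, pattern and fixed are just illustrative.

```python
import re
import pandas as pd

df = pd.DataFrame({'city': ['City of Cleveland', 'City of Clvland', 'City of Boston',
                            'City New York', 'City of Clev', 'City of Miami',
                            'City of Cland', 'Cle Ci']})
df_spelling = pd.DataFrame({'spellings': ['Clvland', 'Clev', 'Cland']})

# Plain substring replacement: 'Clev' also matches inside a correct 'Cleveland',
# which is where the trailing extra characters come from.
naive = df['city'].str.replace('Clev', 'Cleveland', regex=False)
print(naive[0])  # City of Clevelandeland

# One alternation of escaped misspellings, wrapped in word boundaries so that
# only whole words are replaced.
alternatives = '|'.join(re.escape(s) for s in df_spelling['spellings'])
pattern = rf'\b(?:{alternatives})\b'
fixed = df['city'].str.replace(pattern, 'Cleveland', regex=True)
print(fixed.tolist())
```

With the sample data the escaping makes no difference, but it should keep the pattern valid if a misspelling ever contains something like a dot or a parenthesis.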
CodePudding user response:
We can construct a single regex pattern from your spellings DataFrame that matches the whole cell, rather than trying to fix certain pieces:
pattern = '|'.join(f'(.*{i}.*)' for i in df_spelling.spellings)
df.city = df.city.replace(pattern, 'City of Cleveland', regex=True)
print(df)
Output:
                city
0  City of Cleveland
1  City of Cleveland
2     City of Boston
3      City New York
4  City of Cleveland
5      City of Miami
6  City of Cleveland
7             Cle Ci
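One thing to be aware of with the whole-cell approach: because each alternative is wrapped in .* on both sides, any cell that merely contains one of the spellings is overwritten entirely. A small sketch of that behaviour, where 'Cleveland Heights' and 'Clev Park' are hypothetical values that do not appear in the question's data:

```python
import pandas as pd

# Hypothetical values (not from the question) to show the whole-cell behaviour.
s = pd.Series(['City of Clev', 'Cleveland Heights', 'Clev Park'])
spellings = ['Clvland', 'Clev', 'Cland']

# Same pattern construction as above: each alternative spans the whole cell.
pattern = '|'.join(f'(.*{i}.*)' for i in spellings)
whole_cell = s.replace(pattern, 'City of Cleveland', regex=True)
print(whole_cell.tolist())
```

All three cells should come back as 'City of Cleveland', since 'Clev' occurs somewhere in each of them. That is fine if every name in the column really is meant to be "City of Cleveland", but the rest of the name is lost otherwise.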
CodePudding user response:
Could something like this work:
import re
...
df['city'] = df['city'].str.replace(r'\b{}\b'.format(item), 'Cleveland', regex=True)
Basically, the \b on each side makes the pattern match the mistyped string only when it stands alone as a whole word.