I have this dataframe and I am trying to clean the typos where people enter the first and last names incorrectly. How would I clean the dataset? Can I use conditional statements to help?
Date        Last   First  City        Type
2016-01-01  smith  john   Riley Park  Staff
2016-01-02  smit   john   Riley Park  Staff
2016-01-03  smith  john   Riley Park  Staff
2016-01-04  smith  joh    Riley Park  Staff
2016-01-08  smith  john   Riley Park  Contractor
2016-01-04  smith  john   Fairview    Staff
2016-01-02  baker  bob    Strathcona  Staff
2016-01-03  bake   bob    Strathcona  Staff
2016-01-04  baker  bob    Strathcona  Staff
Desired cleaned dataset:
Date        Last   First  City        Type
2016-01-01  smith  john   Riley Park  Staff
2016-01-02  smith  john   Riley Park  Staff
2016-01-03  smith  john   Riley Park  Staff
2016-01-04  smith  john   Riley Park  Staff
2016-01-08  smith  john   Riley Park  Contractor
2016-01-04  smith  john   Fairview    Staff
2016-01-02  baker  bob    Strathcona  Staff
2016-01-03  baker  bob    Strathcona  Staff
2016-01-04  baker  bob    Strathcona  Staff
I am confused about how to clean this. I thought about creating other dataframes and then merging them, but I am hoping an expert can help me with this.
EDIT: I only want a name to be replaced when the City and Type are the same.
CodePudding user response:
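from thefuzz import fuzz

def correct_typo(typo, ref_names, ratio=80):
    # Return the first reference name that is similar enough to the input.
    for name in ref_names:
        if fuzz.ratio(typo, name) > ratio:
            return name
    # No close match found: keep the value unchanged.
    return typo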
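To show how a function like this could be applied only within rows that share the same City and Type, here is a minimal sketch. It makes two assumptions that are not part of the answer above: the standard-library difflib.SequenceMatcher stands in for thefuzz (so the scores are only roughly comparable to fuzz.ratio), and the hypothetical helper fix_names treats the spellings in each group, most frequent first, as the reference names.

```python
from difflib import SequenceMatcher

import pandas as pd


def correct_typo(typo, ref_names, ratio=80):
    # Same logic as the function above, with difflib standing in for thefuzz.
    for name in ref_names:
        if 100 * SequenceMatcher(None, typo, name).ratio() > ratio:
            return name
    return typo


def fix_names(names):
    # Hypothetical helper: spellings ordered by frequency act as references,
    # so a rare variant snaps to the most common close spelling in its group.
    ref_names = names.value_counts().index.tolist()
    return names.map(lambda value: correct_typo(value, ref_names))


df = pd.DataFrame({
    "Date": ["2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04", "2016-01-08",
             "2016-01-04", "2016-01-02", "2016-01-03", "2016-01-04"],
    "Last": ["smith", "smit", "smith", "smith", "smith",
             "smith", "baker", "bake", "baker"],
    "First": ["john", "john", "john", "joh", "john",
              "john", "bob", "bob", "bob"],
    "City": ["Riley Park"] * 5 + ["Fairview"] + ["Strathcona"] * 3,
    "Type": ["Staff"] * 4 + ["Contractor"] + ["Staff"] * 4,
})

# Correct each name column separately inside every (City, Type) group.
for col in ["Last", "First"]:
    df[col] = df.groupby(["City", "Type"])[col].transform(fix_names)

print(df)
```

This relies on the correct spelling being the most common one in its group. If a typo is as frequent as the correct name, the result depends on the order of the tied spellings, so a curated list of reference names would be safer in that case.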
CodePudding user response:
You could select the rows with a condition and overwrite the values that are incomplete:
import pandas as pd
df = pd.DataFrame({"Date": ["2016-01-01", "2016-01-02", "2016-01-03"], "Name": ["smith", "smi", "Fathallah"], "LastName": ["john", "jon", "Mohamed"]})
Date Name LastName
2016-01-01 smith john
2016-01-02 smi jon
2016-01-03 Fathallah Mohamed
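# where() keeps matching rows and overwrites the rest, and x[:2] slices rows, not characters.
# mask() with .str.startswith() overwrites exactly the rows that match the prefix.
df["LastName"] = df["LastName"].mask(df["LastName"].str.startswith("jo"), "john")
df["Name"] = df["Name"].mask(df["Name"].str.startswith("sm"), "smith")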
Date Name LastName
2016-01-01 smith john
2016-01-02 smith john
2016-01-03 Fathallah Mohamed
CodePudding user response:
If you have a list of all the typos, just use replace and assign the result back:
df = df.replace(['smit', 'joh', 'bake'], ['smith', 'john', 'baker'])
If you're sure there's always a correct value in the row above the typo, use replace with the 'ffill' method (note that the method argument of replace is deprecated in recent pandas versions):
df.replace(['joh', 'bake', 'smit'], method='ffill')
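As a minimal sketch of the same forward-fill idea without the method argument, the typos can be blanked out with mask and then filled from the row above with ffill. The small frame and the name cleaned below are made up for illustration.

```python
import pandas as pd

# Illustrative data: each typo has a correct value in the row above it.
df = pd.DataFrame({
    "Last": ["smith", "smit", "smith", "baker", "bake"],
    "First": ["john", "john", "joh", "bob", "bob"],
})

typos = ["smit", "joh", "bake"]

# Turn known typos into missing values, then forward-fill each column.
cleaned = df.mask(df.isin(typos)).ffill()

print(cleaned)
```

Like the replace call above, this only works when the row directly before a typo (or a run of typos) holds the correct value.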
To replace only within rows that share the same City and Type:
df_gby = df.groupby(['City', 'Type'])
pd.concat(
    [
        df_gby.get_group(group).replace(['joh', 'bake', 'smit'], method='ffill')
        for group in df_gby.groups
    ]
)
Above, we group the df by City and Type, iterate over each group, and do the replacement inside it.
This way each replacement only draws on rows with the same City and Type.
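The grouped replacement can also be sketched without concat and without the method argument: blank out the typos, forward-fill inside each (City, Type) group, and keep the original value wherever a group has no earlier correct row. The variable names cols, typos, masked and filled are illustrative, and the data is copied from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04", "2016-01-08",
             "2016-01-04", "2016-01-02", "2016-01-03", "2016-01-04"],
    "Last": ["smith", "smit", "smith", "smith", "smith",
             "smith", "baker", "bake", "baker"],
    "First": ["john", "john", "john", "joh", "john",
              "john", "bob", "bob", "bob"],
    "City": ["Riley Park"] * 5 + ["Fairview"] + ["Strathcona"] * 3,
    "Type": ["Staff"] * 4 + ["Contractor"] + ["Staff"] * 4,
})

cols = ["Last", "First"]
typos = ["smit", "joh", "bake"]

# Replace known typos with missing values in the name columns only.
masked = df[cols].mask(df[cols].isin(typos))

# Forward-fill within each (City, Type) group; the row order is preserved.
filled = masked.groupby([df["City"], df["Type"]]).ffill()

# Where a group had no earlier correct value, fall back to the original text.
df[cols] = filled.fillna(df[cols])

print(df)
```

Because the fill never crosses group boundaries, a typo in the first row of a group is left as it was rather than being filled from a different City or Type.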