How do I clean a Python DataFrame? Name misspellings, pandas

I have this dataframe and I am trying to clean the typos where people enter the first and last names incorrectly. How would I clean the dataset? Can I use conditional statements to help?

Date           Last    First   City         Type
2016-01-01     smith   john    Riley Park   Staff
2016-01-02     smit    john    Riley Park   Staff
2016-01-03     smith   john    Riley Park   Staff
2016-01-04     smith   joh     Riley Park   Staff
2016-01-08     smith   john    Riley Park   Contractor
2016-01-04     smith   john    Fairview     Staff
2016-01-02     baker   bob     Strathcona   Staff
2016-01-03     bake    bob     Strathcona   Staff
2016-01-04     baker   bob     Strathcona   Staff

Desired cleaned dataset

Date           Last    First   City         Type
2016-01-01     smith   john    Riley Park   Staff
2016-01-02     smith   john    Riley Park   Staff
2016-01-03     smith   john    Riley Park   Staff
2016-01-04     smith   john    Riley Park   Staff
2016-01-08     smith   john    Riley Park   Contractor
2016-01-04     smith   john    Fairview     Staff
2016-01-02     baker   bob     Strathcona   Staff
2016-01-03     baker   bob     Strathcona   Staff
2016-01-04     baker   bob     Strathcona   Staff

I am confused about how to clean this. I thought about creating separate dataframes and then merging them, but I am hoping someone with more experience can suggest a better approach.

EDIT: I only want a name to be replaced when the City and Type values of the rows are the same.

CodePudding user response：

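# Fuzzy string matching; requires the thefuzz package.
from thefuzz import fuzz


def correct_typo(typo, ref_names, ratio=80):
    """Return the first reference name whose similarity to typo exceeds ratio (0-100)."""
    for name in ref_names:
        if fuzz.ratio(typo, name) > ratio:
            return name

    # No reference name was close enough, so keep the original value.
    return typo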
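
For illustration, the same idea can be applied to a whole column. The sketch below swaps in difflib.SequenceMatcher from the standard library as a stand-in for thefuzz (its ratio runs from 0 to 1 rather than 0 to 100) and returns the best match instead of the first one. The sample data and the reference list ref_names are assumptions: this only works if you already know the correct spellings.

```python
import difflib

import pandas as pd


def correct_typo(typo, ref_names, ratio=0.8):
    """Return the closest reference name, or the input if none is close enough."""
    best_name, best_score = typo, ratio
    for name in ref_names:
        score = difflib.SequenceMatcher(None, typo, name).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical sample data and reference list, for illustration only.
df = pd.DataFrame({"Last": ["smith", "smit", "baker", "bake", "jones"]})
ref_names = ["smith", "baker"]

# Extra positional arguments are passed to correct_typo through args.
df["Last"] = df["Last"].apply(correct_typo, args=(ref_names,))
print(df["Last"].tolist())
```

Names that are not close to any reference, such as "jones" here, are left unchanged.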

CodePudding user response：

You could use where (or its inverse, mask) with a selection condition, so that only the values matching the condition are changed.

import pandas as pd

df = pd.DataFrame({"Date": ["2016-01-01", "2016-01-02", "2016-01-03"], "Name": ["smith", "smi", "Fathallah"], "LastName": ["john", "jon", "Mohamed"]})


Date        Name        LastName
2016-01-01  smith       john
2016-01-02  smi         jon
2016-01-03  Fathallah   Mohamed


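# mask replaces the values where the condition is True (where would replace the others).
df["LastName"] = df["LastName"].mask(df["LastName"].str.startswith("jo"), "john")
df["Name"] = df["Name"].mask(df["Name"].str.startswith("sm"), "smith")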

Date        Name        LastName
2016-01-01  smith       john
2016-01-02  smith       john
2016-01-03  Fathallah   Mohamed
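
Since Series.where keeps the values where the condition is True and replaces the rest, while Series.mask does the opposite, a small side-by-side comparison may help. This is a sketch on made-up values; testing a prefix with str.startswith is an assumption about what the typos look like.

```python
import pandas as pd

names = pd.Series(["john", "jon", "Mohamed"])
starts_with_jo = names.str.startswith("jo")

# mask: replace where the condition is True, keep everything else.
fixed = names.mask(starts_with_jo, "john")

# where: keep where the condition is True, replace everything else,
# so the condition has to be inverted to get the same result.
same = names.where(~starts_with_jo, "john")

print(fixed.tolist())
```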

CodePudding user response：

If you have a list of all the typos and their corrections, just use replace:

df.replace(['smit', 'joh', 'bake'], ['smith', 'john', 'baker'])
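
If the same string could be a typo in one column but a valid value in another, replace also accepts a nested dictionary keyed by column name. A minimal sketch on a made-up two-row frame:

```python
import pandas as pd

df = pd.DataFrame({"Last": ["smit", "bake"], "First": ["joh", "bob"]})

# Outer keys are column names; inner dicts map typo -> correction.
corrections = {
    "Last": {"smit": "smith", "bake": "baker"},
    "First": {"joh": "john"},
}

# replace returns a new DataFrame; df itself is left unchanged.
cleaned = df.replace(corrections)
print(cleaned)
```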

If you're sure there is always a correct value in the row directly above the typo, use replace with method='ffill' (note that the method argument is deprecated in recent pandas releases):

df.replace(['joh', 'bake', 'smit'], method='ffill')
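
A similar result can be had without the method argument by blanking out the known typos and forward-filling them. This is a sketch on a hypothetical three-row frame; it assumes the columns contain no genuine missing values, since those would be forward-filled as well.

```python
import pandas as pd

df = pd.DataFrame({
    "Last": ["smith", "smit", "smith"],
    "First": ["john", "john", "joh"],
})
typos = ["joh", "bake", "smit"]

# Turn every known typo into NaN, then copy down the value from the row above.
cleaned = df.mask(df.isin(typos)).ffill()
print(cleaned)
```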

To replace only when City and Type are the same:

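# Group the rows that share the same City and Type.
df_gby = df.groupby(['City', 'Type'])

# Forward-fill the typos within each group, then stitch the groups back together.
pd.concat(
    [
        df_gby.get_group(group).replace(['joh', 'bake', 'smit'], method='ffill')
        for group in df_gby.groups
    ]
).sort_index()  # restore the original row order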

Above, we group the df by City and Type, then iterate over each group and do the replacement inside it.

This way a typo is only ever filled from a row with the same City and Type.
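
If no list of typos is available, one option is to infer the correct spelling inside each (City, Type) group. The sketch below is not taken from the answers above: the helper fix_group is hypothetical, it relies on difflib.get_close_matches from the standard library, and it assumes the correct spelling is the most frequent close match within its group.

```python
import difflib

import pandas as pd


def fix_group(names, cutoff=0.8):
    """Map each name to the most frequent close match within the same group."""
    counts = names.value_counts()
    candidates = list(counts.index)

    def fix(name):
        close = difflib.get_close_matches(
            name, candidates, n=len(candidates), cutoff=cutoff
        )
        # The name always matches itself, so close is never empty.
        return max(close, key=lambda c: counts[c])

    return names.map(fix)


# A subset of the question's data.
df = pd.DataFrame({
    "Last": ["smith", "smit", "smith", "smith", "baker", "bake", "baker"],
    "First": ["john", "john", "john", "joh", "bob", "bob", "bob"],
    "City": ["Riley Park"] * 4 + ["Strathcona"] * 3,
    "Type": ["Staff"] * 7,
})

# Correct each name column separately, group by group.
for col in ["Last", "First"]:
    df[col] = df.groupby(["City", "Type"])[col].transform(fix_group)
print(df)
```

The cutoff of 0.8 is an arbitrary choice; a lower value catches more typos but risks merging names that are genuinely different.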