Dataframe text column with spelling mistakes

Sorry for bad title, I wasn't sure how best to describe the issue.

I have a dataframe with a column for occupation, df['occupation']. Users can enter their occupation during signup using whatever terms they like.

I'm attempting to do an EDA on the column; however, I'm not sure how to clean the column to get it from this,

Occupation
a-level student
a level
alavls
university physics student
physics student
6th form student
builder

Into something like this,

Occupation
a-levels
University student
Full time employment

Without writing out hundreds of lines renaming each unique entry.

TYIA

Any help or links to useful modules would be great.

CodePudding user response：

Hi, one approach you could use for this problem is similar to the solution covered here: Using apply map with Regex

The regex approach lets you use wildcards to catch variants that you have not yet come across in your dataset.
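
As a rough sketch of what the regex approach could look like: the patterns, the RULES list, the categorise helper, and the "Other" fallback below are all made up for illustration and are not taken from the linked solution. Each rule pairs a pattern with a target category, and the first rule that matches wins.

```python
import re

# Hypothetical rules: (pattern, category). Order matters, first match wins,
# so the a-level rule is checked before the generic "student" rule.
RULES = [
    (re.compile(r"a[\s-]?l[ea]v[ea]?ls?|6th form", re.IGNORECASE), "a-levels"),
    (re.compile(r"student", re.IGNORECASE), "University student"),
    (re.compile(r"\bbuilder\b", re.IGNORECASE), "Full time employment"),
]

def categorise(text, default="Other"):
    """Return the category of the first rule whose pattern is found in text."""
    for pattern, label in RULES:
        if pattern.search(text):
            return label
    return default  # nothing matched: flag the entry for manual review

raw = ["a-level student", "a level", "alavls", "university physics student",
       "physics student", "6th form student", "builder"]
cleaned = [categorise(entry) for entry in raw]
```

On a dataframe this would presumably be applied with df['occupation'].map(categorise). Entries that fall through to the default are the ones that still need a new rule.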

CodePudding user response：

The simplest way to do this is to apply a function that measures the similarity between two strings. There are plenty of similarity measures that could be used in this context, such as the Hamming distance. However, they are all fairly limited, and at some point (if this goes into production) you might be forced to use a machine learning model for this task.

import pandas as pd

def hamming_distance(chaine1, chaine2):
    """Measure the dissimilarity between two strings (lower means more similar).
    Note: this is very limited, as it only compares the letters position by position,
    and zip() stops at the end of the shorter string.
    """
    return sum(c1 != c2 for c1, c2 in zip(chaine1, chaine2))

OCCUPATIONS = ["Occupation", "a-levels", "University student", "Full time employment"]

def get_most_similar(ocup, occupations):
    """Return the entry of occupations that is most similar to ocup."""
    return min(occupations, key=lambda oc: hamming_distance(ocup.lower(), oc.lower()))

column = ["Occupation", "a-level student", "a level", "alavls", "university physics student", "physics student", "6th form student", "builder"]
df = pd.DataFrame(column, columns=['occupation'])  # this is just a reconstruction of your dataframe; you probably don't need this line.

df['occupation'] = df['occupation'].apply(lambda ocup: get_most_similar(ocup, OCCUPATIONS))
df.head(100)
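
Since the Hamming distance only compares characters at the same position, a measure that tolerates insertions and deletions may work better. The sketch below swaps in difflib.get_close_matches from the standard library, which is a different technique from the code above. The closest_category helper, the 0.6 cutoff, and the "Other" fallback are assumptions chosen for illustration.

```python
from difflib import get_close_matches

CATEGORIES = ["a-levels", "University student", "Full time employment"]
# Compare in lower case, then map back to the canonical spelling.
LOOKUP = {category.lower(): category for category in CATEGORIES}

def closest_category(text, cutoff=0.6, default="Other"):
    """Return the category closest to text, or default if none scores >= cutoff."""
    matches = get_close_matches(text.lower().strip(), list(LOOKUP), n=1, cutoff=cutoff)
    return LOOKUP[matches[0]] if matches else default

exact = closest_category("A-Levels")
padded = closest_category("university student ")
unknown = closest_category("zzz")
```

The cutoff controls how strict the matching is: entries that do not come close to any category fall back to the default instead of being forced into the nearest one, which the min-based version above always does. It could be applied with df['occupation'].map(closest_category).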