Removing misspelled items from a pandas dataset


The following dataset is a small example of the problem I am facing:

import pandas as pd
data = {'Name': ['Tom', 'Joseph', 'Tom', 'Joseph', 'John'], 'id_product': [1, 1, 2, 2, 3],
        'product_name': ["candy", "candy", "cookie", "coockie", "pepsi"], 'qtd_purchase': [1, 2, 4, 2, 2]}
df = pd.DataFrame(data)

As a DataFrame it looks like this:

     Name  id_product product_name  qtd_purchase
0     Tom           1        candy             1
1  Joseph           1        candy             2
2     Tom           2       cookie             4
3  Joseph           2      coockie             2
4    John           3        pepsi             2

The problem is that if we look at the number of unique values in each column, we find the following mismatch:

for col in df.columns:
    # skip id-like/code columns and count distinct values in the rest
    if 'no' not in col and 'code' not in col and 'id' not in col:
        un_n = df[col].nunique()
        print(f'n of unique {col}: {un_n}')

n of unique Name: 3
n of unique product_name: 4
n of unique qtd_purchase: 3

id_product should have the same number of unique values as product_name. But because of typos, there are items with the same id and different names, for example "cookie" and "coockie".
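A quick sanity check along these lines (a minimal sketch, assuming the df built above) is to compare the two counts directly:

# if the numbers differ, at least one id has more than one spelling
print(df['id_product'].nunique(), df['product_name'].nunique())  # prints: 3 4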

I was able to find these items using the following code:

# show all rows so every (id_product, product_name) pair is visible
pd.options.display.max_rows = None
df.groupby(['id_product', 'product_name']).count()['qtd_purchase']

id_product  product_name
1           candy           2
2           coockie         1
            cookie          1
3           pepsi           1
Name: qtd_purchase, dtype: int64

At that scale, this code works just fine. But in my real problem I have 21,000 unique id_product values and 23,000 product_name values, so it becomes impossible to spot the mismatches with the code above. Is there a way to print only the values that are badly written?

For example, the answer would look like this:

(only the conflicting entries, e.g. the two spellings "cookie" and "coockie" under id_product 2)

Thank you very much in advance

CodePudding user response:

Here's a possible approach.

import pandas as pd

data = {'Name': ['Tom', 'Joseph', 'Tom', 'Joseph', 'John'], 'id_product': [1, 1, 2, 2, 3],
        'product_name': ["candy", "candy", "cookie", "coockie", "pepsi"], 'qtd_purchase': [1, 2, 4, 2, 2]}

df = pd.DataFrame(data)
# keep only the groups whose id_product maps to more than one distinct product_name
print(df.loc[:, ['id_product', 'product_name']]
        .groupby('id_product')
        .filter(lambda g: any(g.nunique() > 1)))

Result:

   id_product product_name
2           2       cookie
3           2      coockie
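
If it helps at larger scale, a small variation of the same idea (a sketch with made-up variable names, assuming the same df) first collects the offending ids and then prints only their distinct spellings:

# number of distinct spellings per id_product
counts = df.groupby('id_product')['product_name'].nunique()

# ids that map to more than one spelling
bad_ids = counts[counts > 1].index

# one row per distinct spelling of each conflicting id
print(df[df['id_product'].isin(bad_ids)][['id_product', 'product_name']].drop_duplicates())

Because only the conflicting groups are printed, the output stays readable even with tens of thousands of ids.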

CodePudding user response:

You can try to use groupby on the id_product column and then count the values of each product:

df.groupby("id_product").product_name.value_counts()

For your example, this will return:

id_product  product_name
1           candy           2
2           coockie         1
            cookie          1
3           pepsi           1

This can help you learn more about your data. If we assume that misspellings happen only occasionally, you can get the most common spelling for each id with mode(). Let's add another purchase...

data = {'Name': ['Tom', 'Joseph', 'Tom', 'Joseph', 'John', "Cookie monster"], 
       'id_product': [1, 1, 2, 2, 3, 2],
       'product_name': ["candy", "candy", "cookie", "coockie", "pepsi", "cookie"],  
       'qtd_purchase': [1, 2, 4, 2, 2, 42]}  

df = pd.DataFrame(data)

df.head(6)

gives

             Name  id_product product_name  qtd_purchase
0             Tom           1        candy             1
1          Joseph           1        candy             2
2             Tom           2       cookie             4
3          Joseph           2      coockie             2
4            John           3        pepsi             2
5  Cookie monster           2       cookie            42

and then:

>>> df.groupby("id_product").product_name.apply(pd.Series.mode)
id_product   
1           0     candy
2           0    cookie
3           0     pepsi
Name: product_name, dtype: object
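
If the goal is to actually clean the column rather than just inspect it, one possible follow-up (a sketch that assumes the correct spelling is the most frequent one within each id; the transform call is my own addition, not part of the answer above) is to map the mode back onto the DataFrame:

# replace each product_name with the most common spelling in its id_product group;
# mode() can return several values on a tie, so take the first one
df['product_name'] = df.groupby('id_product')['product_name'].transform(
    lambda s: s.mode().iloc[0])

print(df['product_name'].unique())  # ['candy' 'cookie' 'pepsi']

On an exact tie this simply picks the first mode, so it only makes sense under the assumption that typos are the minority spelling.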