Home > front end >  Remove duplicates taking into account two columns, lower case and accents
Remove duplicates taking into account two columns, lower case and accents

Time:07-13

I have the following DataFrame in pandas:

code town district suburb
02 Benalmádena Málaga Arroyo de la Miel
03 Alicante Jacarilla Jacarilla, Correntias Bajas (Jacarilla)
04 Cabrera d'Anoia Barcelona Cabrera D'Anoia
07 Lanjarón Granada Lanjaron
08 Santa Cruz de Tenerife Santa Cruz de Tenerife Centro-Ifara
09 Córdoba Córdoba Cordoba

For each row in the suburb column, if the value it contains is equal (in lower case and without accents) to district or town columns, it becomes NaN.

# Function to remove accents and shift to lower case.
def rm_accents_lowcase(a):
    return unidecode.unidecode(a).lower()

Example:

code town district suburb
02 Benalmádena Málaga Arroyo de la Miel
03 Alicante Jacarilla Jacarilla, Correntias Bajas (Jacarilla)
04 Cabrera d'Anoia Barcelona NaN
07 Lanjarón Granada NaN
08 Santa Cruz de Tenerife Santa Cruz de Tenerife Centro-Ifara
09 Córdoba Córdoba NaN

CodePudding user response:

You can remove accents and make lower with this code df['suburb'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8').str.lower()

df['check'] = np.where(
    ((df['suburb'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8').str.lower() == df['town'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8').str.lower())
    |  (df['suburb'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8').str.lower() == df['district'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8').str.lower())
    ),
    np.nan, df['suburb']) 
df 

enter image description here

CodePudding user response:

You can write a function and check each row of pandas with a written function and apply, axis=1.

# !pip install unidecode
import numpy as np
import unidecode

def check_unidecode(row):
    lst = [unidecode.unidecode(r).lower() for r in row]
    if lst[2]==lst[0] or lst[2]==lst[1]:
        return np.nan
    return row['suburb']
df['suburb'] = df.apply(check_unidecode, axis=1)
print(df)

                     town                district  \
0             Benalmádena                  Málaga   
1                Alicante               Jacarilla   
2         Cabrera d'Anoia               Barcelona   
3                Lanjarón                 Granada   
4  Santa Cruz de Tenerife  Santa Cruz de Tenerife   
5                 Córdoba                 Córdoba   

                                    suburb  
0                        Arroyo de la Miel  
1  Jacarilla, Correntias Bajas (Jacarilla)  
2                                      NaN  
3                                      NaN  
4                             Centro-Ifara  
5                                      NaN 
  • Related