I have the following DataFrame in pandas:
code | town | district | suburb |
---|---|---|---|
02 | Benalmádena | Málaga | Arroyo de la Miel |
03 | Alicante | Jacarilla | Jacarilla, Correntias Bajas (Jacarilla) |
04 | Cabrera d'Anoia | Barcelona | Cabrera D'Anoia |
07 | Lanjarón | Granada | Lanjaron |
08 | Santa Cruz de Tenerife | Santa Cruz de Tenerife | Centro-Ifara |
09 | Córdoba | Córdoba | Cordoba |
For each row in the suburb
column, if the value it contains is equal (in lower case and without accents) to district
or town
columns, it becomes NaN.
# Function to remove accents and shift to lower case.
def rm_accents_lowcase(a):
return unidecode.unidecode(a).lower()
Example:
code | town | district | suburb |
---|---|---|---|
02 | Benalmádena | Málaga | Arroyo de la Miel |
03 | Alicante | Jacarilla | Jacarilla, Correntias Bajas (Jacarilla) |
04 | Cabrera d'Anoia | Barcelona | NaN |
07 | Lanjarón | Granada | NaN |
08 | Santa Cruz de Tenerife | Santa Cruz de Tenerife | Centro-Ifara |
09 | Córdoba | Córdoba | NaN |
CodePudding user response:
You can remove accents and make lower with this code
df['suburb'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8').str.lower()
df['check'] = np.where(
((df['suburb'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8').str.lower() == df['town'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8').str.lower())
| (df['suburb'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8').str.lower() == df['district'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8').str.lower())
),
np.nan, df['suburb'])
df
CodePudding user response:
You can write a function and check each row of pandas with a written function and apply
, axis=1
.
# !pip install unidecode
import numpy as np
import unidecode
def check_unidecode(row):
lst = [unidecode.unidecode(r).lower() for r in row]
if lst[2]==lst[0] or lst[2]==lst[1]:
return np.nan
return row['suburb']
df['suburb'] = df.apply(check_unidecode, axis=1)
print(df)
town district \
0 Benalmádena Málaga
1 Alicante Jacarilla
2 Cabrera d'Anoia Barcelona
3 Lanjarón Granada
4 Santa Cruz de Tenerife Santa Cruz de Tenerife
5 Córdoba Córdoba
suburb
0 Arroyo de la Miel
1 Jacarilla, Correntias Bajas (Jacarilla)
2 NaN
3 NaN
4 Centro-Ifara
5 NaN