Below are two dataframes. The dataframe df1 is a cleanfile which has to be used as the mapping file to another dataframe df2 which has uncleaned information.
df1=pd.DataFrame({'Make': {0: 'ASHOK LEYLAND', 1: 'ASTON MARTIN', 2: 'ASTON MARTIN'},
'Model': {0: 'STILE', 1: 'DB9', 2: 'RAPIDE'},
'Variant': {0: 'LE 7 STR', 1: 'VOLANTE', 2: 'LUXE'},
'Fuel': {0: 'DIESEL', 1: 'PETROL', 2: 'PETROL'}})
df2=pd.DataFrame({'Make': {0: 'ASHOK LEYLANDSTILELE 7 STR',
1: 'ASTON MARTINDB9VOLANTE',
2: 'ASTON MARTINRAPIDELUXE'},
'Model': {0: 'STILELE 7 STR', 1: 'DB9VOLANTE', 2: 'RAPIDELUXE'},
'Variant': {0: 'LE 7 STRSTILE', 1: 'VOLANTEDB9', 2: 'LUXERAPIDE'},
'Fuel': {0: 'Dieseel', 1: 'Dieseel', 2: 'PETROLjlljlj'}})
I have used the below code to clean the 'Make' column of df2 basis on the 'Make' column of df1. However i dont get any result on my new column 'Make_new'.Below is the code:
df2['Make_new'] = df2['Make'].apply(lambda v: [Make for Make in df1 if Make in ('ASHOK','ASTON')])
CodePudding user response:
You could use a regular expression to match a given set of "allowed" names to the the defective values and extract only the matched part. Note that following solution will only cover the situation in which the defective names have correct prefixes.
import re
def clean_col(correct_names_expression):
def clean_row_func(x):
if match_res := re.match(correct_names_expression, x): # if no match is asserted, then None is returned
return match_res[0]
else:
return None
return clean_row_func
func_to_correct_rows = clean_col('(' '|'.join(df1.Make.tolist()) ')')
df2.Make.apply(func_to_correct_rows )
Out:
0 ASHOK LEYLAND
1 ASTON MARTIN
2 ASTON MARTIN
CodePudding user response:
I don't see why you would need df1
at all. The "cleaning" logic is pretty simple. This will do the trick.
def clean(row):
if 'ASHOK' in row.Make:
return 'ASHOK LEYLAND'
if 'ASTON' in row.Make:
return 'ASTON MARTIN'
return row.Make
df2['cleaned'] = df2.apply(clean, axis=1)