Suppose I have the following dataframe:
ID Country Employee Location
1 AE Jay AAA
2 AE Mary aa
3 AE Peter bbb
3 AE Peter ddd
6 DK Donk ddd
7 CZ Cesar fff
7 CZ Cesar GGg
7 CZ Cesar
8 CZ Carlos #
I need to use the below dataframe to confirm if the Location values are valid (according to their Country) and create an extra column named "Legacy Location Name" with the following:
If values match with lookup dataframe (no matter if uppercase or not), add to the "Legacy Location Name" column "CORRECT VALUE"
If value of Location is incorrect, please add to the "Legacy Location Name" the value previously used in the "Location" column and, in "Location", add the first value of an existing Location of the lookup dataframe
If the value of Location is blank (like in the second to last row), add to "Legacy Location Name" the value "LOCATION NOT PROVIDED" and, in "Location", add the first value of an exising Location of the lookup dataframe
lookup df:
Country Location
AE bbb
AE aaa
AE ccc
DK ddd
DK eee
DK fff
CZ ggg
CZ hhh
Output Expected
ID Country Employee Location Legacy Location
1 AE Jay AAA CORRECT VALUE
2 AE Mary bbb aa
3 AE Peter bbb CORRECT VALUE
3 AE Peter bbb ddd
6 DK Donk ddd CORRECT VALUE
7 CZ Cesar ggg fff
7 CZ Cesar GGg CORRECT VALUE
7 CZ Cesar LOCATION NOT PROVIDED
8 CZ Carlos ggg #
What is the best way to achieve it?
Thank you!
CodePudding user response:
Not complicated, but requires many steps:
s = (lookup_df.drop_duplicates('Country')
.set_index('Country')['Location']
)
out = (df
# handle location independently of case
.assign(Location=df['Location'].str.casefold())
# identify the correct values by merging
.merge(lookup_df.assign(**{'Legacy Location': 'CORRECT VALUE'}),
how='left')
# replace invalid locations
.assign(**{'Location': lambda d: df['Location'].mask(d['Legacy Location'].isna()).fillna(df['Country'].map(s).mask(df['Location'].isna())),
# add previous invalid locations
'Legacy Location': lambda d: d['Legacy Location'].fillna(df['Location'].fillna('LOCATION NOT PROVIDED'))})
)
print(out)
NB. Assuming all empty cells are NaNs for simplicity.
Output:
ID Country Employee Location Legacy Location
0 1 AE Jay AAA CORRECT VALUE
1 2 AE Mary bbb aa
2 3 AE Peter bbb CORRECT VALUE
3 3 AE Peter bbb ddd
4 6 DK Donk ddd CORRECT VALUE
5 7 CZ Cesar ggg fff
6 7 CZ Cesar GGg CORRECT VALUE
7 7 CZ Cesar NaN LOCATION NOT PROVIDED
8 8 CZ Carlos ggg #