I have the following dataframe with employees and the business they are located (with possible incorrect and blank info as in the example):
Country ID Employee Valid Business
AE 1 John ok ADMIN
AE 1 John ok *
CZ 3 Paulo ok
FR 4 Peter ok PATHO
I need to look at this other reference dataframe, that contains the Business per Country, and do the following:
- Create a column named "Legacy Business"
- If Business exist in the reference dataframe (considering a match of both "Country" and "Business") replace the "Business" column in the first dataframe to the value in "Business Name" of reference dataframe and leave the column "Legacy Business" blank
- If Business do not exist in reference dataframe, get the first value of "Business Name" for that country and add that value to "Business" column. In "Levacy Business" column add the previous value (or a "nan" so I can identify that column was blank)
Reference dataframe:
Country Business Business Name
AE ADMIN AE ADMIN
AE COVID 19 AE COVID 19
AE LAB IVD AE LAB IVD
CH ADMIN CH ADMIN
CH COVID 19 CH COVID 19
CH GENETICS CH GENETICS
CZ PATHO CZ PATHO
CZ ADMIN DK ADMIN
DK PHARMA DK PHARMA
ES ADMIN ES ADMIN
ES COVID 19 ES COVID 19
FR LAB IVD FR LAB IVD
FR PATHO FR PATHO
This would be the output:
Country ID Employee Valid Business Legacy Business
AE 1 John ok AE ADMIN
AE 1 John ok AE ADMIN *
CZ 3 Paulo ok CZ PATHO nan
FR 4 Peter ok PATHO
What would be the best way to do it? Thank you!
---------- dataframes ----------
employees = {
"Country": ["AE", "AE", "CZ", "FR"],
"ID": [1, 1, 3, 4],
"Employee": ["John", "John", "Paulo", "Peter"],
"Valid": ["ok", "ok", "ok", "ok"],
"Business": ["ADMIN", "*", "", "PATHO"],
}
business = {
"Country": ["AE", "AE", "AE", "CH", "CH", "CH", "CZ", "CZ", "DK", "ES", "ES", "FR", "FR"],
"Business": ["ADMIN", "COVID 19", "LAB IVD", "ADMIN", "COVID 19", "GENETICS", "PATHO", "ADMIN", "PHARMA", "ADMIN", "COVID 19", "LAB IVD", "PATHO"],
"Business Name": ["AE ADMIN", "AE COVID 19", "AE LAB IVD", "CH ADMIN", "CH COVID 19", "CH GENETICS", "CZ PATHO", "DK ADMIN", "DK PHARMA", "ES ADMIN", "ES COVID 19", "FR LAB IVD", "FR PATHO"],
}
CodePudding user response:
You could first put nan
instead of empty value, next merge both tables, and later clean it - rename column, add first Business Name
if missing, etc.
data1 = '''Country ID Employee Valid Business
AE 1 John ok ADMIN
AE 1 John ok *
CZ 3 Paulo ok
FR 4 Peter ok PATHO'''
data2 = '''Country Business Business Name
AE ADMIN AE ADMIN
AE COVID 19 AE COVID 19
AE LAB IVD AE LAB IVD
CH ADMIN CH ADMIN
CH COVID 19 CH COVID 19
CH GENETICS CH GENETICS
CZ PATHO CZ PATHO
CZ ADMIN DK ADMIN
DK PHARMA DK PHARMA
ES ADMIN ES ADMIN
ES COVID 19 ES COVID 19
FR LAB IVD FR LAB IVD
FR PATHO FR PATHO'''
import pandas as pd
import io
df1 = pd.read_csv(io.StringIO(data1), sep='\s ')
df1 = df1.fillna('')
print(df1)
df2 = pd.read_csv(io.StringIO(data2), sep='\s{2,}')
df2 = df2.fillna('')
print(df2)
# --- replace empty elements ---
df1.loc[ df1['Business'] == '', 'Business'] = 'nan'
print(df1)
# --- merger ---
df = pd.merge(df1, df2, how='left', on=['Country', 'Business'])
print(df)
# --- clean ---
first = df2.groupby('Country').first()
first = first['Business Name'].to_dict()
print(first)
def clean(row):
# remove if similar
if row['Country'] ' ' row['Business'] == row['Business Name']:
row['Business'] = ''
# add first if missing
if not isinstance(row['Business Name'], str):
row['Business Name'] = first[row['Country']]
return row
df = df.apply(clean, axis=1)
# --- rename columns ---
df = df.rename(columns={'Business':'Legacy Business'})
print(df)