Look for if combination of values exist in another dataframe and return value of a 3rd column-CodePudding

I have the following dataframe with employees and the business they are located (with possible incorrect and blank info as in the example):

Country ID  Employee    Valid   Business
AE      1   John          ok    ADMIN
AE      1   John          ok    *
CZ      3   Paulo         ok    
FR      4   Peter         ok    PATHO

I need to look at this other reference dataframe, that contains the Business per Country, and do the following:

Create a column named "Legacy Business"
If Business exist in the reference dataframe (considering a match of both "Country" and "Business") replace the "Business" column in the first dataframe to the value in "Business Name" of reference dataframe and leave the column "Legacy Business" blank
If Business do not exist in reference dataframe, get the first value of "Business Name" for that country and add that value to "Business" column. In "Levacy Business" column add the previous value (or a "nan" so I can identify that column was blank)

Reference dataframe:

Country Business    Business Name
AE       ADMIN        AE ADMIN
AE       COVID 19     AE COVID 19
AE       LAB IVD      AE LAB IVD
CH       ADMIN        CH ADMIN
CH       COVID 19     CH COVID 19
CH       GENETICS     CH GENETICS
CZ       PATHO        CZ PATHO
CZ       ADMIN        DK ADMIN
DK       PHARMA       DK PHARMA
ES       ADMIN        ES ADMIN
ES       COVID 19     ES COVID 19
FR       LAB IVD      FR LAB IVD
FR       PATHO        FR PATHO

This would be the output:

Country ID  Employee    Valid   Business    Legacy Business
AE      1     John       ok     AE ADMIN    
AE      1     John       ok     AE ADMIN         *
CZ      3     Paulo      ok     CZ PATHO        nan
FR      4     Peter      ok     PATHO

What would be the best way to do it? Thank you!

---------- dataframes ----------

employees = {
    "Country": ["AE", "AE", "CZ", "FR"],
    "ID": [1, 1, 3, 4],
    "Employee": ["John", "John", "Paulo", "Peter"],
    "Valid": ["ok", "ok", "ok", "ok"],
    "Business": ["ADMIN", "*", "", "PATHO"],
}

business = {
    "Country": ["AE", "AE", "AE", "CH", "CH", "CH", "CZ", "CZ", "DK", "ES", "ES", "FR", "FR"],
    "Business": ["ADMIN", "COVID 19", "LAB IVD", "ADMIN", "COVID 19", "GENETICS", "PATHO", "ADMIN", "PHARMA", "ADMIN", "COVID 19", "LAB IVD", "PATHO"],
    "Business Name": ["AE ADMIN", "AE COVID 19", "AE LAB IVD", "CH ADMIN", "CH COVID 19", "CH GENETICS", "CZ PATHO", "DK ADMIN", "DK PHARMA", "ES ADMIN", "ES COVID 19", "FR LAB IVD", "FR PATHO"],
}

CodePudding user response：

You could first put nan instead of empty value, next merge both tables, and later clean it - rename column, add first Business Name if missing, etc.

data1 = '''Country ID  Employee    Valid   Business
AE      1   John          ok    ADMIN
AE      1   John          ok    *
CZ      3   Paulo         ok    
FR      4   Peter         ok    PATHO'''

data2 = '''Country  Business    Business Name
AE       ADMIN        AE ADMIN
AE       COVID 19     AE COVID 19
AE       LAB IVD      AE LAB IVD
CH       ADMIN        CH ADMIN
CH       COVID 19     CH COVID 19
CH       GENETICS     CH GENETICS
CZ       PATHO        CZ PATHO
CZ       ADMIN        DK ADMIN
DK       PHARMA       DK PHARMA
ES       ADMIN        ES ADMIN
ES       COVID 19     ES COVID 19
FR       LAB IVD      FR LAB IVD
FR       PATHO        FR PATHO'''

import pandas as pd
import io

df1 = pd.read_csv(io.StringIO(data1), sep='\s ')
df1 = df1.fillna('')
print(df1)

df2 = pd.read_csv(io.StringIO(data2), sep='\s{2,}')
df2 = df2.fillna('')
print(df2)

# --- replace empty elements ---

df1.loc[ df1['Business'] == '', 'Business'] = 'nan'
print(df1)

# --- merger ---

df = pd.merge(df1, df2, how='left', on=['Country', 'Business'])
print(df)

# --- clean ---

first = df2.groupby('Country').first()
first = first['Business Name'].to_dict()
print(first)

def clean(row):
    # remove if similar
    if row['Country']   ' '   row['Business'] == row['Business Name']:
        row['Business'] = ''
    # add first if missing
    if not isinstance(row['Business Name'], str):
        row['Business Name'] = first[row['Country']]
    return row

df = df.apply(clean, axis=1)

# --- rename columns ---

df = df.rename(columns={'Business':'Legacy Business'})

print(df)