Home > front end >  Match partial strings across two data frames and merge
Match partial strings across two data frames and merge

Time:09-13

I have two data frames as given below. The master data:

    df1 = pd.DataFrame(
    data=
    [[['mayor', 'lord'], 25],
     [['prince', 'sheriff'], 33],
     [['king george', 'queen'], 32],
     [['bristol, counsellor'], 43],
     [['exchequer'], 45]],
    columns=['V1', 'V2']
)
df1.update(df1[['V1']].applymap('{}'.format))
print(df1)

and the other one is:

df2 = pd.DataFrame(
    data=
    [['mayor', 55],
     ['sheriff', 54],
     ['counsellor', 53],
     ['exchequer', 50]],
    columns=['Var1', 'Var2']
)
print(df2)

I want to map Var2 from df2 in the master data if V1 contains a corresponding Var1 value. That is my end goal dataframe is:

dfgoal = pd.DataFrame(
    data=
    [[['mayor', 'lord'], 25, 55],
     [['prince', 'sheriff'], 33, 54],
     [['king george', 'queen'], 32, ],
     [['bristol, counsellor'], 43, 53],
     [['exchequer'], 45, 50]],
    columns=['V1', 'V2', 'V3']
)
dfgoal.update(dfgoal[['V1']].applymap('{}'.format))
print(dfgoal)

I have tried quite a few things from Stack Overflow (like this) but as the V1 is a list object and the match can be in any element in the list (like the counsellor obs), I am unsuccessful so far.

CodePudding user response:

I think this solves it:

df1 = pd.DataFrame(
    data=
    [[['mayor', 'lord'], 25],
     [['prince', 'sheriff'], 33],
     [['king george', 'queen'], 32],
     [['bristol', 'counsellor'], 43],
     [['exchequer'], 45]],
    columns=['V1', 'V2']
)
# df1.update(df1[['V1']].applymap('{}'.format))
print(df1)


df2 = pd.DataFrame(
    data=
    [['mayor', 55],
     ['sheriff', 54],
     ['counsellor', 53],
     ['exchequer', 50]],
    columns=['Var1', 'Var2']
)
print(df2)


mapper = df2.set_index('Var1')['Var2'].to_dict()
def get_V3(l):
    for x in l:
        if x in mapper:
            return mapper[x]
        
    return None
df1['V3'] = df1['V1'].apply(get_V3)

I'm assuming if two or more elements appear in df2, then we use the first one, hence the return inside the loop.

  • Related