Merge two dataframe where rows have no exact match-CodePudding

How can I merge two dataframe in pandas where one column contains only starting letters of matching rows?

For example, I would like to merge df1 with df3. ("*" is replacing any characters like in excel

df1 = pd.DataFrame({'lkey': ['B0204040*', 'AZS12*'],
                    'value': [1, 3]})

df2 = pd.DataFrame({'rkey': ['B0204040AL-5W', 'B0204040AL-7.5W', 'AZS12AD4-5S', 'AZS12-AS']})

and get df3

df3 = pd.DataFrame({'rkey': ['B0204040AL-5W', 'B0204040AL-7.5W', 'AZS12AD4-5S', 'AZS12-AS']})
                    'value': [1, 1, 3, 3]})

CodePudding user response：

For this specific case ('*' in the end), you could use a regex to find the matches:

import re
s1 = df1['lkey'].str.rstrip('*')
regex = '|'.join(map(re.escape, s1))
# 'B0204040|AZS12'

s2 = df2['rkey'].str.extract(f'({regex})', expand=False)

df3 = df1.merge(df2, left_on=s1, right_on=s2).drop(columns='key_0')

output:

        lkey  value             rkey
0  B0204040*      1    B0204040AL-5W
1  B0204040*      1  B0204040AL-7.5W
2     AZS12*      3      AZS12AD4-5S
3     AZS12*      3         AZS12-AS

CodePudding user response：

One possible way is to iterate over df1 rows and check if df2 rkey column matches the lkey column.

df2['value'] = pd.NaT
for idx, row in df1.iterrows():
    df2['value'] = df2['value'].mask(df2['rkey'].str.match(row['lkey']), row['value'])

print(df2)

              rkey value
0    B0204040AL-5W     1
1  B0204040AL-7.5W     1
2      AZS12AD4-5S     3
3         AZS12-AS     3