How can I merge two dataframe in pandas where one column contains only starting letters of matching rows?
For example, I would like to merge df1 with df3. ("*" is replacing any characters like in excel
df1 = pd.DataFrame({'lkey': ['B0204040*', 'AZS12*'],
'value': [1, 3]})
df2 = pd.DataFrame({'rkey': ['B0204040AL-5W', 'B0204040AL-7.5W', 'AZS12AD4-5S', 'AZS12-AS']})
and get df3
df3 = pd.DataFrame({'rkey': ['B0204040AL-5W', 'B0204040AL-7.5W', 'AZS12AD4-5S', 'AZS12-AS']})
'value': [1, 1, 3, 3]})
CodePudding user response:
For this specific case ('*' in the end), you could use a regex to find the matches:
import re
s1 = df1['lkey'].str.rstrip('*')
regex = '|'.join(map(re.escape, s1))
# 'B0204040|AZS12'
s2 = df2['rkey'].str.extract(f'({regex})', expand=False)
df3 = df1.merge(df2, left_on=s1, right_on=s2).drop(columns='key_0')
output:
lkey value rkey
0 B0204040* 1 B0204040AL-5W
1 B0204040* 1 B0204040AL-7.5W
2 AZS12* 3 AZS12AD4-5S
3 AZS12* 3 AZS12-AS
CodePudding user response:
One possible way is to iterate over df1
rows and check if df2
rkey
column matches the lkey
column.
df2['value'] = pd.NaT
for idx, row in df1.iterrows():
df2['value'] = df2['value'].mask(df2['rkey'].str.match(row['lkey']), row['value'])
print(df2)
rkey value
0 B0204040AL-5W 1
1 B0204040AL-7.5W 1
2 AZS12AD4-5S 3
3 AZS12-AS 3