Merge two datasets based on a string match

I have two datasets:

DF1 contains a column that has a short word.

DF2 has a column with a series of words, some of which start with the short word from DF1.

I would like to create a new column in DF2 and, every time a DF1 word appears at the beginning of a DF2 word, place that DF1 word in the new column on that row.

The idea is that, once this is done, I can merge the two datasets on the matching word.

How would I do this?

DF1

ref
ABC
DEF
GHI

DF2

word
ABC123
DEF456
GHI789

DF2 - Desired output

word	new column
ABC123	ABC
DEF456	DEF
GHI789	GHI

CodePudding user response:

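You can build a regex pattern from the ref column, anchored at the start of the string, then extract it from the word column:

import re
# "^" restricts the match to the beginning of the word; re.escape protects refs containing regex metacharacters
pattern = fr"^({'|'.join(map(re.escape, df1['ref']))})"
df2['new column'] = df2['word'].str.extract(pattern, expand=False)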
print(df2)

# Output
     word new column
0  ABC123        ABC
1  DEF456        DEF
2  GHI789        GHI
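
To cover the merge step the question is ultimately after, here is a minimal end-to-end sketch. The `info` column in df1 and the `XYZ000` row in df2 are made up for illustration (they are not in the question), and sorting the refs longest-first is an extra precaution for the case where one ref is a prefix of another.

```python
import re

import pandas as pd

# "info" is a hypothetical extra column, added only so the merge brings something over
df1 = pd.DataFrame({"ref": ["ABC", "DEF", "GHI"], "info": ["first", "second", "third"]})
# "XYZ000" is a hypothetical row that matches no ref
df2 = pd.DataFrame({"word": ["ABC123", "DEF456", "GHI789", "XYZ000"]})

# Longest refs first, so a longer ref wins over a shorter ref that is its prefix
refs = sorted(df1["ref"], key=len, reverse=True)
# "^" anchors the match at the beginning of the word
pattern = "^(" + "|".join(re.escape(r) for r in refs) + ")"

# expand=False returns a Series; rows without a match get NaN
df2["new column"] = df2["word"].str.extract(pattern, expand=False)

# A left merge keeps every row of df2, including the unmatched one
merged = df2.merge(df1, left_on="new column", right_on="ref", how="left")
print(merged)
```

With how="left", the unmatched row stays in the result with missing values in the df1 columns; use how="inner" if such rows should be dropped.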

If this approach is too simple for your data, you will need fuzzy matching instead. You can check this Question & Answer
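
As a rough idea of what the fuzzy route could look like, here is a small sketch using only the standard library's difflib. This is one possible approach, not the method from the linked answer: the helper name `best_ref`, the 0.6 threshold, and the `ABD789` and `XYZ000` sample words are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

refs = ["ABC", "DEF", "GHI"]
# "ABD789" (a near miss) and "XYZ000" (no match) are made-up sample words
words = ["ABC123", "DEF456", "ABD789", "XYZ000"]


def best_ref(word, refs, threshold=0.6):
    """Return the ref most similar to the start of word, or None if none is close enough."""
    best, best_score = None, 0.0
    for ref in refs:
        # Compare each ref only against the leading len(ref) characters of the word
        score = SequenceMatcher(None, word[:len(ref)], ref).ratio()
        if score > best_score:
            best, best_score = ref, score
    return best if best_score >= threshold else None


matches = [best_ref(w, refs) for w in words]
print(matches)  # ['ABC', 'DEF', 'ABC', None]
```

In pandas, the same helper could be applied with something like df2['word'].map(lambda w: best_ref(w, list(df1['ref']))), with the threshold tuned to how tolerant the match should be.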