Inner merge two DataFrames on string partial match

Time:11-23

We have the following two DataFrames:

import numpy as np
import pandas as pd

# np.array stores the mixed str/int rows as strings, so sentiment and cutoff hold '1'/'0'
temp = pd.DataFrame(np.array([['I am feeling very well', 1], ['It is hard to believe this happened', 0],
                              ['What is love?', 1], ['No new friends', 0],
                              ['I love this show', 1], ['Amazing day today', 1]]),
                    columns=['message', 'sentiment'])

temp_truncated = pd.DataFrame(np.array([['I am feeling very', 1], ['It is hard to believe', 1],
                                        ['What is', 1], ['Amazing day', 1]]),
                              columns=['message', 'cutoff'])

I want to create a third DataFrame that is the inner join of temp and temp_truncated, keeping each row of temp whose message starts with (or contains) one of the strings in temp_truncated.

Desired Output:

     message                             sentiment   cutoff            
0    I am feeling very well               1          1
1    It is hard to believe this happened  0          1
2    What is love?                        1          1
3    Amazing day today                    1          1

CodePudding user response:

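You can build a regex alternation from the truncated messages, extract the matching substring from each full message, and merge on it: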

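import re

# one alternation of all truncated messages, escaped so they match literally
pattern = '|'.join(map(re.escape, temp_truncated['message']))

# for each full message, the truncated string it contains (NaN if none)
key = temp['message'].str.extract(f'({pattern})', expand=False)

# inner merge on the extracted key; rows without a match are dropped
out = (temp
 .merge(temp_truncated.rename(columns={'message': 'sub'}),
        left_on=key, right_on='sub')
 .drop(columns='sub')
)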

Output:

                               message sentiment cutoff
0               I am feeling very well         1      1
1  It is hard to believe this happened         0      1
2                        What is love?         1      1
3                    Amazing day today         1      1
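
One caveat with the alternation pattern above: the regex engine tries the alternatives from left to right, so if one truncated string is itself a prefix of another, the shorter one can win and the longer key is never extracted. The sample data has no such overlap, so the list below is hypothetical and only illustrates the effect. Sorting the alternatives by length, longest first, is one possible way around it:

```python
import re

# Hypothetical overlapping substrings (not from the question's data)
subs = ["What is", "What is love"]
text = "What is love?"

# Alternatives in their original order: the first one that matches wins
naive = "|".join(map(re.escape, subs))

# Longest alternative first, so the most specific substring is preferred
longest_first = "|".join(map(re.escape, sorted(subs, key=len, reverse=True)))

naive_match = re.search(naive, text).group(0)
ordered_match = re.search(longest_first, text).group(0)

print(naive_match)    # What is
print(ordered_match)  # What is love
```

The same sorted list could be passed to the '|'.join call that builds pattern if overlapping keys are expected.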

CodePudding user response:

Here is an approach using rapidfuzz with pandas.merge. For each truncated message it picks the closest full message in temp and merges on that:

# pip install rapidfuzz
from rapidfuzz import process

out = (
    temp_truncated
    # extractOne returns (best match, score, index); .str[0] keeps the matched message
    .assign(message_adapted=temp_truncated['message']
            .map(lambda x: process.extractOne(x, temp['message'])).str[0])
    .merge(temp, left_on="message_adapted", right_on="message", how="left", suffixes=("_", ""))
    .drop(columns=["message_adapted", "message_"])
)[["message", "sentiment", "cutoff"]]

# Output:

print(out)
                               message sentiment cutoff
0               I am feeling very well         1      1
1  It is hard to believe this happened         0      1
2                        What is love?         1      1
3                    Amazing day today         1      1
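
Fuzzy matching returns the best-scoring candidate for every truncated row, whether or not that candidate really begins with the truncated text. If a strict "starts with" test is needed, one possible alternative is a cross join followed by a filter. This is a minimal sketch, assuming pandas 1.2 or later (for how="cross"); the frames are rebuilt from dicts so the block runs on its own, and the names pairs, mask and out_prefix are illustrative only. A cross join produces len(temp) * len(temp_truncated) rows, so it suits small inputs only.

```python
import pandas as pd

temp = pd.DataFrame({
    "message": ["I am feeling very well", "It is hard to believe this happened",
                "What is love?", "No new friends",
                "I love this show", "Amazing day today"],
    "sentiment": [1, 0, 1, 0, 1, 1],
})
temp_truncated = pd.DataFrame({
    "message": ["I am feeling very", "It is hard to believe", "What is", "Amazing day"],
    "cutoff": [1, 1, 1, 1],
})

# Pair every full message with every truncated message (requires pandas >= 1.2)
pairs = temp.merge(temp_truncated.rename(columns={"message": "prefix"}), how="cross")

# Keep only the pairs where the full message starts with the truncated one
mask = [full.startswith(prefix) for full, prefix in zip(pairs["message"], pairs["prefix"])]
out_prefix = pairs[mask].drop(columns="prefix").reset_index(drop=True)

print(out_prefix)
```

Replacing full.startswith(prefix) with prefix in full would give the "contains" variant of the same join.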