We have the following two DataFrames:
import numpy as np
import pandas as pd
temp = pd.DataFrame(np.array([['I am feeling very well', 1], ['It is hard to believe this happened', 0],
                              ['What is love?', 1], ['No new friends', 0],
                              ['I love this show', 1], ['Amazing day today', 1]]),
                    columns=['message', 'sentiment'])
temp_truncated = pd.DataFrame(np.array([['I am feeling very', 1], ['It is hard to believe', 1],
                                        ['What is', 1], ['Amazing day', 1]]),
                              columns=['message', 'cutoff'])
My idea is to create a third DataFrame that represents the inner join between temp and temp_truncated, by finding the messages in temp that start with / contain the strings in temp_truncated.
Desired Output:
                               message  sentiment  cutoff
0               I am feeling very well          1       1
1  It is hard to believe this happened          0       1
2                        What is love?          1       1
3                    Amazing day today          1       1
CodePudding user response:
You can use:
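import re
# build one alternation of all truncated strings, escaped so they match literally
pattern = '|'.join(map(re.escape, temp_truncated['message']))
# for each row of temp, extract the truncated string it contains (NaN if none)
key = temp['message'].str.extract(f'({pattern})', expand=False)
out = (temp
       .merge(temp_truncated.rename(columns={'message': 'sub'}),
              left_on=key, right_on='sub')
       .drop(columns='sub')
      )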
Output:
                               message  sentiment  cutoff
0               I am feeling very well          1       1
1  It is hard to believe this happened          0       1
2                        What is love?          1       1
3                    Amazing day today          1       1
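One thing to watch with a plain alternation: it matches a truncated string anywhere in the message, and at a given position the regex engine tries the alternatives left to right, so a short string can shadow a longer one that begins with it. If you only want "starts with" behaviour, a minimal sketch could anchor the pattern and try the longest prefixes first. The names prefixes, prefix_out and sub are illustrative, and the frames are rebuilt from dicts (so the numeric columns stay integers) purely to keep the snippet self-contained:

```python
import re
import pandas as pd

temp = pd.DataFrame({
    'message': ['I am feeling very well', 'It is hard to believe this happened',
                'What is love?', 'No new friends',
                'I love this show', 'Amazing day today'],
    'sentiment': [1, 0, 1, 0, 1, 1],
})
temp_truncated = pd.DataFrame({
    'message': ['I am feeling very', 'It is hard to believe', 'What is', 'Amazing day'],
    'cutoff': [1, 1, 1, 1],
})

# Longest strings first, so a longer prefix is tried before a shorter one
prefixes = sorted(temp_truncated['message'], key=len, reverse=True)
# '^' anchors the match to the start of each message
pattern = '^(' + '|'.join(map(re.escape, prefixes)) + ')'

prefix_out = (
    temp
    # 'sub' holds the matched prefix, or NaN when no prefix matches
    .assign(sub=temp['message'].str.extract(pattern, expand=False))
    # inner join: rows without a matching prefix are dropped
    .merge(temp_truncated.rename(columns={'message': 'sub'}), on='sub')
    .drop(columns='sub')
)
print(prefix_out)
```

Storing the extracted key as a regular column and merging with on='sub' is just one way to write it; it behaves like the array-key merge but makes the join key easy to inspect while debugging.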
CodePudding user response:
Here is an approach using rapidfuzz with pandas.merge:
#pip install rapidfuzz
from rapidfuzz import process
out = (
    temp_truncated
    .assign(message_adapted=(temp_truncated['message']
                             .map(lambda x: process.extractOne(x, temp['message']))).str[0])
    .merge(temp, left_on="message_adapted", right_on="message", how="left", suffixes=("_", ""))
    .drop(columns=["message_adapted", "message_"])
)[["message", "sentiment", "cutoff"]]
# Output:
print(out)
                               message  sentiment  cutoff
0               I am feeling very well          1       1
1  It is hard to believe this happened          0       1
2                        What is love?          1       1
3                    Amazing day today          1       1