I have 2 datasets, which contain unique values. The first one, which is the one below has the full name of a few hospitals.
-----------------------
| HOSPITAL_NAME_FULL |
-----------------------
| St. Christine |
| Californian Hospital |
| Holy Mercy Hospital |
| Germanic NW Hospital |
| Trauma Center Hospital|
| Holy Spirit Hospital |
| Mater Hospital |
-----------------------
The other one has the short name of the same hospitals above.
---------------------
| HOSPITAL_NAME_SHORT |
---------------------
| Christine |
| Californian |
| Mercy |
| Germanic |
| Trauma |
| Holy |
| Mater |
---------------------
The thing is, I need to join them, so I can have both full name and short name. Can I join dataframes while using some kind of regex so I can have this result?
----------------------- ---------------------
| HOSPITAL_NAME_FULL | HOSPITAL_NAME_SHORT |
----------------------- ---------------------
| St. Christine | Christine |
| Californian Hospital | Californian |
| Holy Mercy Hospital | Mercy |
| Germanic NW Hospital | Germanic |
| Trauma Center Hospital| Trauma |
| Holy Spirit Hospital | Holy |
| Mater Hospital | Mater |
----------------------- ---------------------
Thanks!
CodePudding user response:
TL;DR, If your words in HOSPITAL_NAME_SHORT
exactly exist in HOSPITAL_NAME_FULL
I recommend the first approach and if you want to consider similarity between words, I recommend second approach:
First Approach: (word exactly exist)
df1['HOSPITAL_NAME_SHORT'] = df1['HOSPITAL_NAME_FULL'].apply(
lambda x: next(st for st in df2['HOSPITAL_NAME_SHORT'] if st in x))
print(df1)
Second Approach (consider Similarty)
You can use difflib.SequenceMatcher
and find similarities between each two words from two dataframes and return a word that have max similarity for each row:
What is difflib.SequenceMatcher
:
>>> SequenceMatcher(None, 'Vitamin_A', 'Vitamin_C').ratio()
0.8888888888888888
Solving this problem with SequenceMatcher
:
from difflib import SequenceMatcher
df1['HOSPITAL_NAME_SHORT'] = df1['HOSPITAL_NAME_FULL'].apply(
lambda x: max([(st, SequenceMatcher(None, x, st).ratio())
for st in df2['HOSPITAL_NAME_SHORT']], key=lambda x: x[1])[0])
print(df1)
Output:
HOSPITAL_NAME_FULL HOSPITAL_NAME_SHORT
0 St. Christine Christine
1 Californian Hospital Californian
2 Holy Mercy Hospital Mercy
3 Germanic NW Hospital Germanic
4 Trauma Center Hospital Trauma
5 Holy Spirit Hospital Holy
6 Mater Hospital Mater
Input two dataframe:
print(df1)
# HOSPITAL_NAME_FULL
# 0 St. Christine
# 1 Californian Hospital
# 2 Holy Mercy Hospital
# 3 Germanic NW Hospital
# 4 Trauma Center Hospital
# 5 Holy Spirit Hospital
# 6 Mater Hospital
print(df2)
# HOSPITAL_NAME_SHORT
# 0 Christine
# 1 Californian
# 2 Mercy
# 3 Germanic
# 4 Trauma
# 5 Holy
# 6 Mater
CodePudding user response:
You can try .str.extract
df1['HOSPITAL_NAME_SHORT'] = df1['HOSPITAL_NAME_FULL'].str.extract('(' '|'.join(df2['HOSPITAL_NAME_SHORT']) ')')
print(df1)
HOSPITAL_NAME_FULL HOSPITAL_NAME_SHORT
0 St. Christine Christine
1 Californian Hospital Californian
2 Holy Mercy Hospital Holy
3 Germanic NW Hospital Germanic
4 Trauma Center Hospital Trauma
5 Holy Spirit Hospital Holy
6 Mater Hospital Mater
You can also consider str.extractall
to show to all possible matches
df1 = (df1
.join(
df1['HOSPITAL_NAME_FULL'].str.extractall('(' '|'.join(df2['HOSPITAL_NAME_SHORT']) ')')
.unstack().droplevel(0, axis=1)
)
)
print(df1)
HOSPITAL_NAME_FULL 0 1
0 St. Christine Christine NaN
1 Californian Hospital Californian NaN
2 Holy Mercy Hospital Holy Mercy
3 Germanic NW Hospital Germanic NaN
4 Trauma Center Hospital Trauma NaN
5 Holy Spirit Hospital Holy NaN
6 Mater Hospital Mater NaN
CodePudding user response:
Here's a way to do it:
df1['HOSPITAL_NAME_SHORT'] = df1.HOSPITAL_NAME_FULL.apply(lambda x: [y for y in df2.HOSPITAL_NAME_SHORT if y in x][0])
Output:
HOSPITAL_NAME_FULL HOSPITAL_NAME_SHORT
0 St. Christine Christine
1 Californian Hospital Californian
2 Holy Mercy Hospital Mercy
3 Germanic NW Hospital Germanic
4 Trauma Center Hospital Trauma
5 Holy Spirit Hospital Holy
6 Mater Hospital Mater