Home > Blockchain >  How to join two dataframes base the words exist in them when they are not equal?
How to join two dataframes base the words exist in them when they are not equal?

Time:06-27

I have 2 datasets, which contain unique values. The first one, which is the one below has the full name of a few hospitals.

 ----------------------- 
| HOSPITAL_NAME_FULL    |
 ----------------------- 
| St. Christine         |
| Californian Hospital  |
| Holy Mercy Hospital   |
| Germanic NW Hospital  |
| Trauma Center Hospital|
| Holy Spirit Hospital  |
| Mater Hospital        |
 ----------------------- 

The other one has the short name of the same hospitals above.

 --------------------- 
| HOSPITAL_NAME_SHORT |
 --------------------- 
| Christine           |
| Californian         |
| Mercy               |
| Germanic            |
| Trauma              |
| Holy                |
| Mater               |
 --------------------- 

The thing is, I need to join them, so I can have both full name and short name. Can I join dataframes while using some kind of regex so I can have this result?

 ----------------------- ---------------------  
| HOSPITAL_NAME_FULL    | HOSPITAL_NAME_SHORT |
 ----------------------- --------------------- 
| St. Christine         | Christine           |
| Californian Hospital  | Californian         |
| Holy Mercy Hospital   | Mercy               |
| Germanic NW Hospital  | Germanic            |
| Trauma Center Hospital| Trauma              |
| Holy Spirit Hospital  | Holy                |
| Mater Hospital        | Mater               |
 ----------------------- --------------------- 

Thanks!

CodePudding user response:

TL;DR, If your words in HOSPITAL_NAME_SHORT exactly exist in HOSPITAL_NAME_FULL I recommend the first approach and if you want to consider similarity between words, I recommend second approach:

First Approach: (word exactly exist)

df1['HOSPITAL_NAME_SHORT'] = df1['HOSPITAL_NAME_FULL'].apply(
    lambda x: next(st for st in df2['HOSPITAL_NAME_SHORT'] if st in x))

print(df1)

Second Approach (consider Similarty)

You can use difflib.SequenceMatcher and find similarities between each two words from two dataframes and return a word that have max similarity for each row:

What is difflib.SequenceMatcher:

>>> SequenceMatcher(None, 'Vitamin_A', 'Vitamin_C').ratio()
0.8888888888888888

Solving this problem with SequenceMatcher:

from difflib import SequenceMatcher
df1['HOSPITAL_NAME_SHORT'] = df1['HOSPITAL_NAME_FULL'].apply(
    lambda x: max([(st, SequenceMatcher(None, x, st).ratio()) 
                   for st in df2['HOSPITAL_NAME_SHORT']], key=lambda x: x[1])[0])

print(df1)

Output:

       HOSPITAL_NAME_FULL HOSPITAL_NAME_SHORT
0           St. Christine           Christine
1    Californian Hospital         Californian
2     Holy Mercy Hospital               Mercy
3    Germanic NW Hospital            Germanic
4  Trauma Center Hospital              Trauma
5    Holy Spirit Hospital                Holy
6          Mater Hospital               Mater

Input two dataframe:

print(df1)
#        HOSPITAL_NAME_FULL
# 0           St. Christine
# 1    Californian Hospital
# 2     Holy Mercy Hospital
# 3    Germanic NW Hospital
# 4  Trauma Center Hospital
# 5    Holy Spirit Hospital
# 6          Mater Hospital

print(df2)
#   HOSPITAL_NAME_SHORT
# 0           Christine
# 1         Californian
# 2               Mercy
# 3            Germanic
# 4              Trauma
# 5                Holy
# 6               Mater

CodePudding user response:

You can try .str.extract

df1['HOSPITAL_NAME_SHORT'] = df1['HOSPITAL_NAME_FULL'].str.extract('(' '|'.join(df2['HOSPITAL_NAME_SHORT']) ')')
print(df1)

       HOSPITAL_NAME_FULL HOSPITAL_NAME_SHORT
0           St. Christine           Christine
1    Californian Hospital         Californian
2     Holy Mercy Hospital                Holy
3    Germanic NW Hospital            Germanic
4  Trauma Center Hospital              Trauma
5    Holy Spirit Hospital                Holy
6          Mater Hospital               Mater

You can also consider str.extractall to show to all possible matches

df1 = (df1
       .join(
           df1['HOSPITAL_NAME_FULL'].str.extractall('(' '|'.join(df2['HOSPITAL_NAME_SHORT']) ')')
           .unstack().droplevel(0, axis=1)
       )
)
print(df1)

       HOSPITAL_NAME_FULL            0      1
0           St. Christine    Christine    NaN
1    Californian Hospital  Californian    NaN
2     Holy Mercy Hospital         Holy  Mercy
3    Germanic NW Hospital     Germanic    NaN
4  Trauma Center Hospital       Trauma    NaN
5    Holy Spirit Hospital         Holy    NaN
6          Mater Hospital        Mater    NaN

CodePudding user response:

Here's a way to do it:

df1['HOSPITAL_NAME_SHORT'] = df1.HOSPITAL_NAME_FULL.apply(lambda x: [y for y in df2.HOSPITAL_NAME_SHORT if y in x][0])

Output:

       HOSPITAL_NAME_FULL HOSPITAL_NAME_SHORT
0           St. Christine           Christine
1    Californian Hospital         Californian
2     Holy Mercy Hospital               Mercy
3    Germanic NW Hospital            Germanic
4  Trauma Center Hospital              Trauma
5    Holy Spirit Hospital                Holy
6          Mater Hospital               Mater
  • Related