Join pandas dataframes using regex on a column using Python 3-CodePudding

I have two pandas dataframe df1 and df2. I would like to join the two dataframes using regex on column 'CODE'.

df1

STR                                 CODE
Nonrheumatic aortic valve disorders I35
Glaucoma suspect                    H40.0

df2

STR                                 CODE
Nonrheumatic aortic valve disorders I35
Nonrheumatic 1                      I35.1
Nonrheumatic 2                      I35.2
Nonrheumatic 3                      I35.3
Glaucoma suspect                    H40.0
Glaucoma 2                          H40.1
Diabetes                            H50
Diabetes 1                          H50.1                         
Diabetes 1                          H50.2

The final output should be like this:

STR                                 CODE
Nonrheumatic aortic valve disorders I35
Nonrheumatic 1                      I35.1
Nonrheumatic 2                      I35.2
Nonrheumatic 3                      I35.3
Glaucoma suspect                    H40.0
Glaucoma 2                          H40.1

Any help is highly appreciated!

CodePudding user response：

You can "align" the codes of 2 dataframes with matching their essential prefixes (obtained by regex substitution) in pandas.Series.where condition:

df1_codes = df1.CODE.str.replace(r'\.. ', '', regex=True)
df2.loc[df2.CODE.str.replace(r'\.. ', '', regex=True)\
    .pipe(lambda s: s.where(s.isin(df1_codes))).dropna().index]

                                  STR   CODE
0  Nonrheumatic aortic valve disorders    I35
1                       Nonrheumatic 1  I35.1
2                       Nonrheumatic 2  I35.2
3                       Nonrheumatic 3  I35.3
4                     Glaucoma suspect  H40.0
5                           Glaucoma 2  H40.1

CodePudding user response：

A possible solution, based on pandas.DataFrame.merge and pandas.Series.str.split:

(df1.assign(
    CODEA=df1['CODE'].str.split('\.', expand=True)[0])
 .merge(df2.assign(
     CODEA=df2['CODE'].str.split('\.', expand=True)[0]), 
        on='CODEA', suffixes=['_x', '']).loc[:, df1.columns])

Output:

                                   STR   CODE
0  Nonrheumatic aortic valve disorders    I35
1                       Nonrheumatic 1  I35.1
2                       Nonrheumatic 2  I35.2
3                       Nonrheumatic 3  I35.3
4                     Glaucoma suspect  H40.0
5                           Glaucoma 2  H40.1