I have two pandas dataframe df1 and df2. I would like to join the two dataframes using regex on column 'CODE'.
df1
STR CODE
Nonrheumatic aortic valve disorders I35
Glaucoma suspect H40.0
df2
STR CODE
Nonrheumatic aortic valve disorders I35
Nonrheumatic 1 I35.1
Nonrheumatic 2 I35.2
Nonrheumatic 3 I35.3
Glaucoma suspect H40.0
Glaucoma 2 H40.1
Diabetes H50
Diabetes 1 H50.1
Diabetes 1 H50.2
The final output should be like this:
STR CODE
Nonrheumatic aortic valve disorders I35
Nonrheumatic 1 I35.1
Nonrheumatic 2 I35.2
Nonrheumatic 3 I35.3
Glaucoma suspect H40.0
Glaucoma 2 H40.1
Any help is highly appreciated!
CodePudding user response:
You can "align" the codes of 2 dataframes with matching their essential prefixes (obtained by regex substitution) in pandas.Series.where
condition:
df1_codes = df1.CODE.str.replace(r'\.. ', '', regex=True)
df2.loc[df2.CODE.str.replace(r'\.. ', '', regex=True)\
.pipe(lambda s: s.where(s.isin(df1_codes))).dropna().index]
STR CODE
0 Nonrheumatic aortic valve disorders I35
1 Nonrheumatic 1 I35.1
2 Nonrheumatic 2 I35.2
3 Nonrheumatic 3 I35.3
4 Glaucoma suspect H40.0
5 Glaucoma 2 H40.1
CodePudding user response:
A possible solution, based on pandas.DataFrame.merge
and pandas.Series.str.split
:
(df1.assign(
CODEA=df1['CODE'].str.split('\.', expand=True)[0])
.merge(df2.assign(
CODEA=df2['CODE'].str.split('\.', expand=True)[0]),
on='CODEA', suffixes=['_x', '']).loc[:, df1.columns])
Output:
STR CODE
0 Nonrheumatic aortic valve disorders I35
1 Nonrheumatic 1 I35.1
2 Nonrheumatic 2 I35.2
3 Nonrheumatic 3 I35.3
4 Glaucoma suspect H40.0
5 Glaucoma 2 H40.1