Home > Back-end >  Outer join to check existence each records of two pandas dataframes like SQL
Outer join to check existence each records of two pandas dataframes like SQL

Time:11-18

I have to tables looks like following:

Table T1

ColumnA ColumnB
A 1
A 3
B 1
C 2

Table T2

ColumnA ColumnB
A 1
A 4
B 1
D 2

in SQL I will do following query to check the existence of each record

select 
COALESCE(T1.ColumnA,T2.ColumnA) as ColumnA
,T1.ColumnB as ExistT1
,T2.ColumnB as ExistT2
from T1
full join T2 on 
   T1.ColumnA=T2.ColumnA
   and T1.ColumnB=T2.ColumnB
where 
 (T1.ColumnA is null or T2.ColumnA is null)

I have tried many way in Pandas like concate, join, merge, etc, but it seems that the two merge keys would be combined into one. I think the problem is that I want to check is not 'data columns' but 'key columns'. Is there any good idea to do this in Python? Thanks!

ColumnA ExistT1 ExistT2
A 3 null
A null 4
C 2 null
D null 2

CodePudding user response:

pd.merge has an indicator parameter that could be helpful here:

(t1
.merge(t2, how = 'outer', indicator=True)
.loc[lambda df: df._merge!="both"]
.assign(ExistT1 = lambda df: df.ColumnB.where(df._merge.eq('left_only')), 
        ExistT2 = lambda df: df.ColumnB.where(df._merge.eq('right_only')) )
.drop(columns=['ColumnB', '_merge'])
)

  ColumnA  ExistT1  ExistT2
1       A      3.0      NaN
3       C      2.0      NaN
4       A      NaN      4.0
5       D      NaN      2.0

CodePudding user response:

First

  • merge 2 dataframes

following code:

(df1.assign(ExistT1=df1['ColumnB'])
 .merge(df2.assign(ExistT2=df2['ColumnB']), how='outer'))

output:

    ColumnA ColumnB ExistT1 ExistT2
0   A       1       1.00    1.00
1   A       3       3.00    NaN
2   B       1       1.00    1.00
3   C       2       2.00    NaN
4   A       4       NaN     4.00
5   D       2       NaN     2.00



Second

  • drop ColumnB and same value rows (like row 0 and row2)

include full code:

 (df1.assign(ExistT1=df1['ColumnB'])
     .merge(df2.assign(ExistT2=df2['ColumnB']), how='outer')
     .drop('ColumnB', axis=1)
     .loc[lambda x: x.isnull().any(axis=1)])

output:

    ColumnA ExistT1 ExistT2
1   A       3.00    NaN
3   C       2.00    NaN
4   A       NaN     4.00
5   D       NaN     2.00



Final

sort_values and reset_index (full code)

(df1.assign(ExistT1=df1['ColumnB'])
 .merge(df2.assign(ExistT2=df2['ColumnB']), how='outer')
 .drop('ColumnB', axis=1)
 .loc[lambda x: x.isnull().any(axis=1)]
 .sort_values(['ColumnA']).reset_index(drop=True))

result:

    ColumnA ExistT1 ExistT2
0   A       3.00    NaN
1   A       NaN     4.00
2   C       2.00    NaN
3   D       NaN     2.00
  • Related