I have to tables looks like following:
Table T1
ColumnA | ColumnB |
---|---|
A | 1 |
A | 3 |
B | 1 |
C | 2 |
Table T2
ColumnA | ColumnB |
---|---|
A | 1 |
A | 4 |
B | 1 |
D | 2 |
in SQL I will do following query to check the existence of each record
select
COALESCE(T1.ColumnA,T2.ColumnA) as ColumnA
,T1.ColumnB as ExistT1
,T2.ColumnB as ExistT2
from T1
full join T2 on
T1.ColumnA=T2.ColumnA
and T1.ColumnB=T2.ColumnB
where
(T1.ColumnA is null or T2.ColumnA is null)
I have tried many way in Pandas like concate, join, merge, etc, but it seems that the two merge keys would be combined into one. I think the problem is that I want to check is not 'data columns' but 'key columns'. Is there any good idea to do this in Python? Thanks!
ColumnA | ExistT1 | ExistT2 |
---|---|---|
A | 3 | null |
A | null | 4 |
C | 2 | null |
D | null | 2 |
CodePudding user response:
pd.merge
has an indicator
parameter that could be helpful here:
(t1
.merge(t2, how = 'outer', indicator=True)
.loc[lambda df: df._merge!="both"]
.assign(ExistT1 = lambda df: df.ColumnB.where(df._merge.eq('left_only')),
ExistT2 = lambda df: df.ColumnB.where(df._merge.eq('right_only')) )
.drop(columns=['ColumnB', '_merge'])
)
ColumnA ExistT1 ExistT2
1 A 3.0 NaN
3 C 2.0 NaN
4 A NaN 4.0
5 D NaN 2.0
CodePudding user response:
First
- merge 2 dataframes
following code:
(df1.assign(ExistT1=df1['ColumnB'])
.merge(df2.assign(ExistT2=df2['ColumnB']), how='outer'))
output:
ColumnA ColumnB ExistT1 ExistT2
0 A 1 1.00 1.00
1 A 3 3.00 NaN
2 B 1 1.00 1.00
3 C 2 2.00 NaN
4 A 4 NaN 4.00
5 D 2 NaN 2.00
Second
- drop
ColumnB
and same value rows (like row 0 and row2)
include full code:
(df1.assign(ExistT1=df1['ColumnB'])
.merge(df2.assign(ExistT2=df2['ColumnB']), how='outer')
.drop('ColumnB', axis=1)
.loc[lambda x: x.isnull().any(axis=1)])
output:
ColumnA ExistT1 ExistT2
1 A 3.00 NaN
3 C 2.00 NaN
4 A NaN 4.00
5 D NaN 2.00
Final
sort_values
and reset_index
(full code)
(df1.assign(ExistT1=df1['ColumnB'])
.merge(df2.assign(ExistT2=df2['ColumnB']), how='outer')
.drop('ColumnB', axis=1)
.loc[lambda x: x.isnull().any(axis=1)]
.sort_values(['ColumnA']).reset_index(drop=True))
result:
ColumnA ExistT1 ExistT2
0 A 3.00 NaN
1 A NaN 4.00
2 C 2.00 NaN
3 D NaN 2.00