i have two dataframes and i want to compare the values of two columns and display those who are different, for exemple: compare this Table 1
A | B | C | D |
---|---|---|---|
O1 | 2 | E1 | 2 |
O1 | 3 | E1 | 1 |
O1 | 2 | E1 | 0 |
O1 | 5 | E2 | 2 |
O1 | 2 | E2 | 3 |
O1 | 2 | E2 | 2 |
O1 | 5 | E2 | 1 |
O2 | 8 | E1 | 2 |
O2 | 8 | E1 | 0 |
O2 | 0 | E1 | 1 |
O2 | 2 | E1 | 4 |
O2 | 9 | E1 | 2 |
O2 | 2 | E2 | 1 |
O2 | 9 | E2 | 4 |
O2 | 2 | E2 | 2 |
with this table 2
A | B | C | D |
---|---|---|---|
O1 | 2 | E1 | 2 |
O1 | 2 | E2 | 3 |
O2 | 2 | E1 | 4 |
O2 | 9 | E2 | 4 |
i tried
cond= [table1.A == table2.A, table1.C == table2.C, table1.D == table2.D]
join = table1.join(table2,cond,"leftsemi")
and since i have a lot of data in it, i don't know how to check if the result i've got is correct
CodePudding user response:
Create a list of tuple from one dataframe
check_list = list(df2.apply(tuple, axis=1))
and use it to compare with other
df1 =df1[~df1.apply(tuple, axis=1).isin(check_list)]
CodePudding user response:
Since your dataframes has the same schema, you can use subtract
df1
df1 = spark.createDataFrame([
(1, 2, 3, 4),
(5, 6, 7, 8),
], ['a', 'b', 'c', 'd'])
--- --- --- ---
| a| b| c| d|
--- --- --- ---
| 1| 2| 3| 4|
| 5| 6| 7| 8|
--- --- --- ---
df2
df2 = spark.createDataFrame([
(5, 6, 7, 8),
], ['a', 'b', 'c', 'd'])
--- --- --- ---
| a| b| c| d|
--- --- --- ---
| 5| 6| 7| 8|
--- --- --- ---
subtract
to get data that exists in df1
but does not exists in df2
df1.subtract(df2).show()
--- --- --- ---
| a| b| c| d|
--- --- --- ---
| 1| 2| 3| 4|
--- --- --- ---