i've got this dataframe with four columns
df1 = spark.createDataFrame([
('c', 'd', 3.0, 4),
('c', 'd', 7.3, 8),
('c', 'd', 7.3, 2),
('c', 'd', 7.3, 8),
('e', 'f', 6.0, 3),
('e', 'f', 6.0, 8),
('e', 'f', 6.0, 3),
('c', 'j', 4.2, 3),
('c', 'j', 4.3, 9),
], ['a', 'b', 'c', 'd'])
df1.show()
--- --- --- ---
| a| b| c| d|
--- --- --- ---
| c| d|3.0| 4|
| c| d|7.3| 8|
| c| d|7.3| 2|
| c| d|7.3| 8|
| e| f|6.0| 3|
| e| f|6.0| 8|
| e| f|6.0| 3|
| c| j|4.2| 3|
| c| j|4.3| 9|
--- --- --- ---
and i also got this other dataframe df2 with the same schema as the dataframe df1
df2 = spark.createDataFrame([
('c', 'd', 3.0, 4),
('c', 'd', 3.3, 5),
('c', 'd', 7.3, 2),
('c', 'd', 7.3, 7),
('e', 'f', 6.0, 3),
('c', 'j', 4.2, 1),
('c', 'j', 4.3, 9),
], ['a', 'b', 'c', 'd'])
df2.show()
--- --- --- ---
| a| b| c| d|
--- --- --- ---
| c| d|3.0| 4|
| c| d|3.3| 5|
| c| d|7.3| 2|
| c| d|7.3| 7|
| e| f|6.0| 3|
| c| j|4.2| 1|
| c| j|4.3| 9|
--- --- --- ---
I want to compare the couple (a, b, d) so that i can obtain the different values that are present in df2 but not in df1 like this
df3
--- --- --- ---
| a| b| c| d|
--- --- --- ---
| c| d|3.3| 5|
| c| d|7.3| 7|
| c| j|4.2| 1|
--- --- --- ---
CodePudding user response:
I think what you want is:
df2.subtract(df1.intersect(df2)).show()
I want what is in df2 that is not in both df1 and df2.
--- --- --- ---
| a| b| c| d|
--- --- --- ---
| c| j|4.2| 1|
| c| d|3.3| 5|
| c| d|7.3| 7|
--- --- --- ---
I also agree with @pltc that call out you might have made a mistake in your output table.