I have two DataFrames that share the same values in columns "a" and "b" but have different values in column "d".
This is df1:
df1 = spark.createDataFrame([
('c', 'd', 8),
('e', 'f', 8),
('c', 'j', 9),
], ['a', 'b', 'd'])
df1.show()
+---+---+---+
|  a|  b|  d|
+---+---+---+
|  c|  d|  8|
|  e|  f|  8|
|  c|  j|  9|
+---+---+---+
And this is df2:
df2 = spark.createDataFrame([
('c', 'd', 7),
('e', 'f', 3),
('c', 'j', 8),
], ['a', 'b', 'd'])
df2.show()
+---+---+---+
|  a|  b|  d|
+---+---+---+
|  c|  d|  7|
|  e|  f|  3|
|  c|  j|  8|
+---+---+---+
I want to compute the difference between the values of column "d" (df1 minus df2) while keeping columns "a" and "b". The expected result, df3, is:
+---+---+---+
|  a|  b|  d|
+---+---+---+
|  c|  d|  1|
|  e|  f|  5|
|  c|  j|  1|
+---+---+---+
I tried calling subtract on the two DataFrames, but it just returned df1 unchanged:
df1.subtract(df2).show()
+---+---+---+
|  a|  b|  d|
+---+---+---+
|  c|  d|  8|
|  e|  f|  8|
|  c|  j|  9|
+---+---+---+
CodePudding user response:
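subtract() is a row-level set difference (like SQL's EXCEPT DISTINCT): it drops the rows of df1 that also appear in df2. No complete row of df1 appears in df2, so df1 comes back unchanged. To subtract the values of "d" instead, join the two DataFrames on "a" and "b" and compute the difference: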
df3 = (df1.join(df2, on=['a', 'b'], how='outer')
          .select('a', 'b', (df1.d - df2.d).alias('d')))
df3.show()