display the difference of two values of two columns from two different dataframes without losing the

Time:04-23

I have two dataframes that share the same values in columns "a" and "b" but have different values in column "d".

This is df1:

df1 = spark.createDataFrame([
    ('c', 'd', 8),
    ('e', 'f', 8),
    ('c', 'j', 9),
], ['a', 'b', 'd'])
df1.show()
+---+---+---+
|  a|  b|  d|
+---+---+---+
|  c|  d|  8|
|  e|  f|  8|
|  c|  j|  9|
+---+---+---+

And this is df2:

df2 = spark.createDataFrame([
    ('c', 'd', 7),
    ('e', 'f', 3),
    ('c', 'j', 8),
], ['a', 'b', 'd'])
df2.show()
+---+---+---+
|  a|  b|  d|
+---+---+---+
|  c|  d|  7|
|  e|  f|  3|
|  c|  j|  8|
+---+---+---+

I want to compute the difference between the values of column "d" (df1 minus df2) while keeping columns "a" and "b". The expected result, df3, is:

df3
+---+---+---+
|  a|  b|  d|
+---+---+---+
|  c|  d|  1|
|  e|  f|  5|
|  c|  j|  1|
+---+---+---+

I tried calling subtract on the two dataframes, but it just returned df1 unchanged:

df1.subtract(df2).show()
+---+---+---+
|  a|  b|  d|
+---+---+---+
|  c|  d|  8|
|  e|  f|  8|
|  c|  j|  9|
+---+---+---+

CodePudding user response:

Here is how you can do it:

# Join on the shared key columns, then subtract the two "d" columns.
df3 = (
    df1.join(df2, on=['b', 'a'], how='outer')
       .select('a', 'b', (df1.d - df2.d).alias('diff'))
)

df3.show()
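
As for why subtract returned df1 unchanged: it is a row-wise set difference (like SQL EXCEPT DISTINCT), so it only drops rows of df1 that appear identically in df2, and no row here matches on all three columns. The join instead pairs rows by ("a", "b") and subtracts the values. The sketch below models both behaviours in plain Python with the data from the question, so it runs without Spark. The names rows1, rows2, set_difference and keyed_difference are made up for illustration and are not part of the PySpark API, and the model assumes each ("a", "b") pair occurs at most once per dataframe.

```python
# Plain-Python model of the two operations; this is not PySpark code.
rows1 = [('c', 'd', 8), ('e', 'f', 8), ('c', 'j', 9)]
rows2 = [('c', 'd', 7), ('e', 'f', 3), ('c', 'j', 8)]


def set_difference(left, right):
    """Keep distinct rows of `left` with no identical row in `right`.

    This mirrors what subtract does: whole rows are compared, not keys.
    """
    right_set = set(right)
    # dict.fromkeys removes duplicates while preserving order.
    return [row for row in dict.fromkeys(left) if row not in right_set]


def keyed_difference(left, right):
    """Pair rows on (a, b) and subtract d, like an inner join plus d1 - d2.

    Assumes (a, b) is unique within `right` (an assumption of this sketch).
    """
    right_by_key = {(a, b): d for a, b, d in right}
    return [
        (a, b, d - right_by_key[(a, b)])
        for a, b, d in left
        if (a, b) in right_by_key
    ]


print(set_difference(rows1, rows2))    # every row of rows1 survives
print(keyed_difference(rows1, rows2))  # per-key differences
```

Note that the model behaves like an inner join. With how='outer', as in the answer, a key present in only one dataframe would still appear in the result, with a null difference; how='inner' would drop such keys instead.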

