Home > Net >  Compare two couple of columns from two different pyspark dataframe to display the data that are diff
Compare two couple of columns from two different pyspark dataframe to display the data that are diff

Time:04-23

i've got this dataframe with four columns

df1 = spark.createDataFrame([
    ('c', 'd', 3.0, 4),
    ('c', 'd', 7.3, 8),
    ('c', 'd', 7.3, 2),
    ('c', 'd', 7.3, 8),
    ('e', 'f', 6.0, 3),
    ('e', 'f', 6.0, 8),
    ('e', 'f', 6.0, 3),
    ('c', 'j', 4.2, 3),
    ('c', 'j', 4.3, 9),
], ['a', 'b', 'c', 'd'])

df1.show()
 --- --- --- --- 
|  a|  b|  c|  d|
 --- --- --- --- 
|  c|  d|3.0|  4|
|  c|  d|7.3|  8|
|  c|  d|7.3|  2|
|  c|  d|7.3|  8|
|  e|  f|6.0|  3|
|  e|  f|6.0|  8|
|  e|  f|6.0|  3|
|  c|  j|4.2|  3|
|  c|  j|4.3|  9|
 --- --- --- --- 

and i also got this other dataframe df2 with the same schema as the dataframe df1


df2 = spark.createDataFrame([
    ('c', 'd', 3.0, 4),
    ('c', 'd', 3.3, 5),
    ('c', 'd', 7.3, 2),
    ('c', 'd', 7.3, 7),
    ('e', 'f', 6.0, 3),
    ('c', 'j', 4.2, 1),
    ('c', 'j', 4.3, 9),
], ['a', 'b', 'c', 'd'])
df2.show()
 --- --- --- --- 
|  a|  b|  c|  d|
 --- --- --- --- 
|  c|  d|3.0|  4|
|  c|  d|3.3|  5|
|  c|  d|7.3|  2|
|  c|  d|7.3|  7|
|  e|  f|6.0|  3|
|  c|  j|4.2|  1|
|  c|  j|4.3|  9|
 --- --- --- --- 

I want to compare the couple (a, b, d) so that i can obtain the different values that are present in df2 but not in df1 like this

df3
 --- --- --- --- 
|  a|  b|  c|  d|
 --- --- --- --- 
|  c|  d|3.3|  5|
|  c|  d|7.3|  7|
|  c|  j|4.2|  1|
 --- --- --- --- 

CodePudding user response:

I think what you want is:

df2.subtract(df1.intersect(df2)).show()

I want what is in df2 that is not in both df1 and df2.

 --- --- --- --- 
|  a|  b|  c|  d|
 --- --- --- --- 
|  c|  j|4.2|  1|
|  c|  d|3.3|  5|
|  c|  d|7.3|  7|
 --- --- --- --- 

I also agree with @pltc that call out you might have made a mistake in your output table.

  • Related