Compare columns from two different dataframes based on id

Time:06-16

I have two DataFrames to compare. The records are in a different order, and the column names may differ. I need to compare several columns, matching rows on a unique key (id).

Example: consider DataFrames df1 and df2

df1:

+---+-------+-----+
| id|student|marks|
+---+-------+-----+
|  1|  Vijay|   23|
|  4| Vithal|   24|
|  2|    Ram|   21|
|  3|  Rahul|   25|
+---+-------+-----+

df2:

+-----+--------+------+
|newId|student1|marks1|
+-----+--------+------+
|    3|   Rahul|    25|
|    2|     Ram|    23|
|    1|   Vijay|    23|
|    4|  Vithal|    24|
+-----+--------+------+

Here, matching rows on id and newId, I need to compare the student name and marks columns and check whether the student with the same id has the same name and marks in both DataFrames.

In this example, the student with id 2 has 21 marks in df1 but 23 marks in df2.

CodePudding user response:

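I think diff will give the result you are looking for. Note that diff is a Scala collections method rather than a DataFrame method, so this assumes df1 and df2 hold the collected rows (for example, from collect().toList).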

scala> df1.diff(df2)
res0: Seq[org.apache.spark.sql.Row] = List([2,Ram,21])
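
To make the behaviour of diff concrete, here is a minimal sketch in plain Scala with no Spark dependency. The rows are modeled as tuples standing in for collected Row values, and the names DiffSketch, rows1, rows2 and onlyInFirst are made up for illustration.

```scala
// Rows modeled as plain tuples (id, student, marks); a stand-in for collected Spark rows.
object DiffSketch {
  val rows1: Seq[(Int, String, Int)] =
    Seq((1, "Vijay", 23), (4, "Vithal", 24), (2, "Ram", 21), (3, "Rahul", 25))

  val rows2: Seq[(Int, String, Int)] =
    Seq((3, "Rahul", 25), (2, "Ram", 23), (1, "Vijay", 23), (4, "Vithal", 24))

  // Seq.diff is a multiset difference: each element of rows2 cancels one equal
  // element of rows1, regardless of the order in which the rows appear.
  val onlyInFirst: Seq[(Int, String, Int)] = rows1.diff(rows2)
}
```

Because whole rows are compared by value, this reports the rows of the first collection that have no exact match in the second. It does not say which column differs.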

CodePudding user response:

// exceptAll matches columns by position, so the differing column names do not matter.
df1.exceptAll(df2).show()
// +---+-------+-----+
// | id|student|marks|
// +---+-------+-----+
// |  2|    Ram|   21|
// +---+-------+-----+
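
If the values from both DataFrames should appear side by side for each mismatching id, one possible alternative is to join on the key and filter on the compared columns. This is a sketch under assumptions, not part of the answers above: the object name CompareById, the method mismatches and the local SparkSession settings are hypothetical, and the sample data is rebuilt from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CompareById {
  // Builds the two sample DataFrames from the question and returns the
  // key-matched rows whose compared columns differ.
  def mismatches(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val df1 = Seq((1, "Vijay", 23), (4, "Vithal", 24), (2, "Ram", 21), (3, "Rahul", 25))
      .toDF("id", "student", "marks")
    val df2 = Seq((3, "Rahul", 25), (2, "Ram", 23), (1, "Vijay", 23), (4, "Vithal", 24))
      .toDF("newId", "student1", "marks1")

    // Pair rows by key, then keep only pairs where a compared column differs.
    // <=> is null-safe equality, so a null on only one side counts as a mismatch.
    df1
      .join(df2, $"id" === $"newId")
      .filter(!($"student" <=> $"student1") || !($"marks" <=> $"marks1"))
      .select($"id", $"student", $"student1", $"marks", $"marks1")
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("compare-by-id").getOrCreate()
    mismatches(spark).show()
    spark.stop()
  }
}
```

For the sample data, only the pair for id 2 should pass the filter. The inner join drops ids that exist in just one DataFrame, so a full outer join would be needed to report those as well.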