I have two dataframes to compare. The order of the records differs, and the column names might differ too. I have to compare several columns based on the unique key (id).
Example: consider dataframes df1 and df2
df1:
+---+-------+-----+
| id|student|marks|
+---+-------+-----+
|  1|  Vijay|   23|
|  4| Vithal|   24|
|  2|    Ram|   21|
|  3|  Rahul|   25|
+---+-------+-----+
df2:
+-----+--------+------+
|newId|student1|marks1|
+-----+--------+------+
|    3|   Rahul|    25|
|    2|     Ram|    23|
|    1|   Vijay|    23|
|    4|  Vithal|    24|
+-----+--------+------+
Here, based on id and newId, I need to compare the student name and marks values and check whether the student with the same id has the same name and marks in both dataframes.
In this example, the student with id 2 has 21 marks in df1 but 23 marks in df2.
CodePudding user response:
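I think diff will give the result you are looking for. Note that diff is a Scala collections method rather than a DataFrame method, so the snippet below assumes df1 and df2 are already sequences of rows (for example, obtained with collect()).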
scala> df1.diff(df2)
res0: Seq[org.apache.spark.sql.Row] = List([2,Ram,21])
CodePudding user response:
df1.exceptAll(df2).show()
// +---+-------+-----+
// | id|student|marks|
// +---+-------+-----+
// |  2|    Ram|   21|
// +---+-------+-----+
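exceptAll compares whole rows by position, so it shows the df1 row that has no match but not the differing value from df2. If you want both sides next to each other per id, one option is to join on the key and filter for mismatches. The following is only a sketch under some assumptions: the object name CompareById and the helper mismatches are made up for illustration, the session runs in local mode, and an inner join is used, so ids present in only one dataframe are dropped (a "full_outer" join would keep them).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper object, not part of Spark.
object CompareById {

  // Joins on id = newId and keeps only rows where name or marks differ.
  // <=> is null-safe equality, so two nulls count as equal.
  def mismatches(left: DataFrame, right: DataFrame): DataFrame =
    left
      .join(right, col("id") === col("newId"), "inner")
      .filter(!(col("student") <=> col("student1")) || !(col("marks") <=> col("marks1")))
      .select("id", "student", "student1", "marks", "marks1")

  def main(args: Array[String]): Unit = {
    // Local session for the example; in spark-shell the existing session is reused.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("compare-by-id")
      .getOrCreate()
    import spark.implicits._

    // Sample data from the question.
    val df1 = Seq((1, "Vijay", 23), (4, "Vithal", 24), (2, "Ram", 21), (3, "Rahul", 25))
      .toDF("id", "student", "marks")
    val df2 = Seq((3, "Rahul", 25), (2, "Ram", 23), (1, "Vijay", 23), (4, "Vithal", 24))
      .toDF("newId", "student1", "marks1")

    // For the sample data only id 2 remains (marks 21 in df1, 23 in df2).
    mismatches(df1, df2).show()

    spark.stop()
  }
}
```

Because the join is driven by the key columns, the order of records does not matter, and the differing column names are handled by naming each pair explicitly in the filter.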