Home > OS >  Subtract columns in two dataframes to get differences in Spark Scala
Subtract columns in two dataframes to get differences in Spark Scala

Time:09-09

Having two dataframes with same columns, I would like to create a resultant dataframe with the difference between the columns, having into account that the dataframes have a lot of columns (and rows).

I guess the approach is first doing an inner join, and then do a "WithColumn" with a subtract inside, but I don't know how to do this in an automated way for a lot of columns.

Example:

first dataframe:

Id col1 col2 col3 ... colXX
1 1.1 1.2 1.6 ... 1.8

second dataframe:

Id col1 col2 col3 ... colXX
1 1.2 1.2 2.1 ... 2.1

Expected dataframe:

Id diff_col1 diff_col2 diff_col3 ... diff_colXX
1 0.1 0.0 0.5 ... 0.3

Thanks beforehand!

CodePudding user response:

First prepare the selection of the differences and then apply it to the resulting dataframe from the join.

val selection = 
    df1.columns.diff(Seq("Id"))
       .map(x => (col(s"df1.$x") - col(s"df2.$x")) as s"diff_$x")

val query = 
    df1.as("df1")
       .join(df2.as("df2"), Seq("Id"), "inner")
       .select((Seq(col("df1.Id"))    selection):_*)

Notice the alias on the dataframes in the join matching the names being used in the difference calculations.

  • Related