Having two dataframes with same columns, I would like to create a resultant dataframe with the difference between the columns, having into account that the dataframes have a lot of columns (and rows).
I guess the approach is first doing an inner join, and then do a "WithColumn" with a subtract inside, but I don't know how to do this in an automated way for a lot of columns.
Example:
first dataframe:
Id | col1 | col2 | col3 | ... | colXX |
---|---|---|---|---|---|
1 | 1.1 | 1.2 | 1.6 | ... | 1.8 |
second dataframe:
Id | col1 | col2 | col3 | ... | colXX |
---|---|---|---|---|---|
1 | 1.2 | 1.2 | 2.1 | ... | 2.1 |
Expected dataframe:
Id | diff_col1 | diff_col2 | diff_col3 | ... | diff_colXX |
---|---|---|---|---|---|
1 | 0.1 | 0.0 | 0.5 | ... | 0.3 |
Thanks beforehand!
CodePudding user response:
First prepare the selection of the differences and then apply it to the resulting dataframe from the join.
val selection =
df1.columns.diff(Seq("Id"))
.map(x => (col(s"df1.$x") - col(s"df2.$x")) as s"diff_$x")
val query =
df1.as("df1")
.join(df2.as("df2"), Seq("Id"), "inner")
.select((Seq(col("df1.Id")) selection):_*)
Notice the alias on the dataframes in the join matching the names being used in the difference calculations.