Most importantly, the number of columns (col1, col2, ...) is not known in advance; there can be hundreds of columns. So the join and the division have to be done dynamically.
How to do this in Spark Scala?
CodePudding user response:
Dataset objects have a columns property that gives you an array of column names. It is easy to filter the columns of df1 down to those that are also present in df2, then use map to derive the needed columns:
// Assumes spark.implicits._ is in scope for toDF (automatic in spark-shell).
val df1 = Seq(("xyz", 10.0, 12.0),
("abc", 42.0, 7.0)).toDF("join_col", "col1", "col2")
val df2 = Seq(("xyz", 7.0, 22.0, 11.0),
("abc", 11.0, 9.0, 42.0)).toDF("join_col", "col1", "col2", "col3")
// Columns common to both datasets (a Set works as a filter predicate)
val cols = df1.columns.filter(df2.columns.toSet)
val join_col = cols(0)
val joined = df1.join(df2, df1(join_col) === df2(join_col))
// Columns from df1
val df1Cols = cols.map(df1(_))
// Division columns, renamed to div_<name>
val divCols = cols.drop(1).map(name => (df1(name) / df2(name)).as(s"div_$name"))
val finalTable = joined.select((df1Cols ++ divCols): _*)
finalTable.show(false)
// +--------+----+----+------------------+------------------+
// |join_col|col1|col2|div_col1          |div_col2          |
// +--------+----+----+------------------+------------------+
// |xyz     |10.0|12.0|1.4285714285714286|0.5454545454545454|
// |abc     |42.0|7.0 |3.8181818181818183|0.7777777777777778|
// +--------+----+----+------------------+------------------+
This one assumes that the join column is the first column in df1.
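If that assumption does not hold, a minimal variant is to name the key explicitly and join on a Seq of column names, which also keeps a single copy of the key column in the result. This is only a sketch; the name join_col below stands in for whatever your real key column is called:
import org.apache.spark.sql.functions.col
// Hypothetical key name; substitute your actual join column.
val joinCol = "join_col"
// Shared columns in df1's order, excluding the key itself.
val shared = df1.columns.filter(c => c != joinCol && df2.columns.contains(c))
// Joining on a Seq of names deduplicates the key column.
val joined2 = df1.join(df2, Seq(joinCol))
// Key first, then each shared column alongside its division column.
val projection = col(joinCol) +: shared.flatMap(name =>
  Seq(df1(name), (df1(name) / df2(name)).as(s"div_$name")))
joined2.select(projection: _*).show(false)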