I am trying to translate a PySpark job that dynamically coalesces the columns from two datasets, with additional filters/conditions.
conditions_ = [when(df1[c]!=df2[c], lit(c)).otherwise("") for c in df1.columns if c not in ['firstname','middlename','lastname']]
Can I do this in Scala?
What I have tried so far is:
df1.join(df2, Seq("col1"), "outer").select(col("col1"), coalesce(df1.col("col2"), df2.col("col2")).as("col2"), coalesce(df1.col("col3"), df2.col("col3")).as("col3"), ..., coalesce(df1.col("col30"), df2.col("col30")).as("col30"))
Is there a better way to add them with a loop instead of writing them all out?
CodePudding user response:
You can try this
import org.apache.spark.sql.functions.coalesce

var columns: Seq[org.apache.spark.sql.Column] = Seq()
for (element <- df1.columns) {
  // coalesce each column from df1 with the matching column from df2, keeping its name
  val c = coalesce(df1(element), df2(element)).alias(element)
  columns = columns :+ c
}
df1.join(df2, Seq("col1"), "outer").select(columns: _*).show
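
A more idiomatic variant builds the column list with map instead of a mutable var. This is a minimal sketch, assuming the same df1/df2 frames and the shared "col1" join key from above:

import org.apache.spark.sql.functions.coalesce

// For every column in df1, prefer its value and fall back to df2's value, keeping the column name
val coalesced = df1.columns.map(c => coalesce(df1(c), df2(c)).alias(c))
df1.join(df2, Seq("col1"), "outer").select(coalesced: _*).show()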
CodePudding user response:
The condition you have in PySpark can be translated to Scala like this:
import org.apache.spark.sql.functions.{when, lit}

df1.columns
  .filter(name => !Array("firstname", "middlename", "lastname").contains(name))
  .map(c => when(df1.col(c) =!= df2.col(c), lit(c)).otherwise(""))
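
To actually select those expressions you still need to give each one a name and run them against the joined data. A minimal sketch, assuming a hypothetical shared key column "id" and an illustrative "_diff" alias suffix:

import org.apache.spark.sql.functions.{when, lit}

// Emit the column name where the values differ, an empty string otherwise;
// "id" as the join key and the "_diff" suffix are assumptions for illustration
val diffCols = df1.columns
  .filter(name => !Array("firstname", "middlename", "lastname").contains(name))
  .map(c => when(df1.col(c) =!= df2.col(c), lit(c)).otherwise("").alias(c + "_diff"))

df1.join(df2, Seq("id"), "outer").select(diffCols: _*).show()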