I need to add a couple of columns to a Spark DataFrame. The value for both columns is conditional, using a when clause, but the condition is the same for both of them:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}

val df: DataFrame = ???
df // f1..f4 are Column expressions defined elsewhere
  .withColumn("colA", when(col("condition").isNull, f1).otherwise(f2))
  .withColumn("colB", when(col("condition").isNull, f3).otherwise(f4))
Since the condition in both when clauses is the same, is there a way I can rewrite this without repeating myself? I don't mean just extracting the condition to a variable, but actually reducing it to a single when clause, to avoid having to run the test multiple times on the DataFrame.
Also, in case I leave it like that, will Spark calculate the condition twice, or will it be able to optimize the work plan and run it only once?
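For clarity, this is the variable-extraction version I'm ruling out; it removes the duplication in the source code, but still leaves two separate when clauses (the name cond is just illustrative):

val cond = col("condition").isNull
df
  .withColumn("colA", when(cond, f1).otherwise(f2))
  .withColumn("colB", when(cond, f3).otherwise(f4))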
Answer:
The corresponding columns f1/f3 and f2/f4 can be packed into an array and then separated into two different columns after evaluating the condition:
df.withColumn("colAB", when(col("condition").isNull, array('f1, 'f3)).otherwise(array('f2, 'f4)))
.withColumn("colA", 'colAB(0))
.withColumn("colB", 'colAB(1))
The physical plans for the code in the question and for my code are the same (ignoring the intermediate column colAB):
Question's code:
== Physical Plan ==
LocalTableScan [f1#16, f2#17, f3#18, f4#19, condition#20, colA#71, colB#78]

Answer's code:
== Physical Plan ==
LocalTableScan [f1#16, f2#17, f3#18, f4#19, condition#20, colAB#47, colA#54, colB#62]
So in both cases the condition is evaluated only once. This holds at least when condition is a regular column.
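The check can be reproduced by calling explain() on the transformed DataFrame; a minimal sketch, assuming a SparkSession named spark and a small test DataFrame with the columns from the question:

import spark.implicits._
import org.apache.spark.sql.functions.{array, col, when}

val test = Seq((1, 2, 3, 4, Option.empty[Int]))
  .toDF("f1", "f2", "f3", "f4", "condition")

test.withColumn("colAB",
    when(col("condition").isNull, array(col("f1"), col("f3")))
      .otherwise(array(col("f2"), col("f4"))))
  .withColumn("colA", col("colAB")(0))
  .withColumn("colB", col("colAB")(1))
  .explain() // prints the physical plan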
A further reason to combine the two when clauses into one could be that the code is more readable, although that judgement depends on the reader.