For example: assume we have the following dataset:
| Student | Grade |
|---|---|
| Bob | 10 |
| Sam | 30 |
| Tom | 30 |
| Vlad | 30 |
When Spark executes the following transformation:
df.withColumn("Grade_minus_average", df("Grade") - lit(average))
will Spark compute "30 - average" 3 times, or will it reuse the computation?
(let's assume there is only one partition)
CodePudding user response:
No, it will not reuse the computation.
An excellent source: https://www.linkedin.com/pulse/catalyst-tungsten-apache-sparks-speeding-engine-deepak-rajak/
Constant folding is the process of recognizing and evaluating constant expressions at compile time rather than computing them at runtime. This is not in any way specific to Catalyst; it is a standard compilation technique, and its benefit is obvious: it is better to compute an expression once than to repeat the computation for each row.
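To illustrate the idea (a toy sketch with made-up classes, not Catalyst's actual implementation): a folding pass can collapse a sub-expression whose operands are all literals, but an expression that references a column cannot be folded, because its value differs per row:

```scala
// Toy expression tree illustrating constant folding (hypothetical classes,
// not Spark's real Expression hierarchy).
sealed trait Expr
case class Lit(value: Double) extends Expr
case class Col(name: String) extends Expr
case class Sub(left: Expr, right: Expr) extends Expr

// Fold any subtraction whose operands reduce to literals.
def fold(e: Expr): Expr = e match {
  case Sub(l, r) =>
    (fold(l), fold(r)) match {
      case (Lit(a), Lit(b)) => Lit(a - b)   // both constant: evaluate at "compile time"
      case (fl, fr)         => Sub(fl, fr)  // a column is involved: leave it for runtime
    }
  case other => other
}

// A literal-only expression folds down to a single literal...
println(fold(Sub(Lit(5), Lit(2))))          // Lit(3.0)
// ...but Col("Grade") - Lit(average) stays as-is: "Grade" varies per row.
println(fold(Sub(Col("Grade"), Lit(25.0))))
```

This is why `df("Grade") - lit(average)` cannot be folded away: only the `lit(average)` side is a constant.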
As the expression combines a constant with a column value, it will be evaluated for each row. There is no mechanism in Spark that, at run time, tracks whether the previous invocation had the same input, nor any cache of previously computed results.
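In other words, for the dataset above the runtime behaves like a plain per-row loop: the constant is computed once up front, but the subtraction runs once per row, even when consecutive rows carry the same grade. A hypothetical simulation of that behavior (plain Scala, not Spark's engine):

```scala
// Hypothetical simulation of per-row evaluation (not Spark code):
// the constant `average` is evaluated once, but the subtraction itself
// runs for every row, with no memoization of repeated inputs.
val grades = Seq(10.0, 30.0, 30.0, 30.0)   // Bob, Sam, Tom, Vlad
val average = grades.sum / grades.size     // 25.0, computed a single time

var evaluations = 0
val gradeMinusAverage = grades.map { g =>
  evaluations += 1                         // one evaluation per row
  g - average
}

println(gradeMinusAverage)                 // List(-15.0, 5.0, 5.0, 5.0)
println(evaluations)                       // 4: "30 - average" was computed 3 separate times
```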