When to cache in pyspark?


I've been reading about PySpark caching and how execution works. It is clear to me how using .cache() helps when multiple actions trigger the same computation:

df = spark.sql("select * from table")
df.count()
df = df.where({something})
df.count()

can be improved by doing:

df = spark.sql("select * from table").cache()
df.count()
df = df.where({something})
df.count()

However, it is not clear to me whether, and why, it would be advantageous when there are no intermediate actions:

df = spark.sql("select * from table")
df2 = spark.sql("select * from table2")
df = df.where({something})
df2 = df2.where({something})
df3 = df.join(df2).where({something})
df3.count()

In this type of code (where we have only one final action), is cache() useful?

CodePudding user response:

To get straight to the point: no, in that case it would not be useful.

Transformations are lazily evaluated in Spark. That is, they are only recorded; execution has to be triggered by an action (such as your count()).

So, when you execute df3.count(), Spark evaluates all the transformations up to that point.
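For illustration, a minimal sketch of this behavior (it uses spark.range so it runs without any table; the names are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1000)              # transformation: only a plan is recorded
filtered = df.where("id > 500")     # still nothing is executed
filtered.explain()                  # prints the plan; no data has been processed yet
filtered.count()                    # action: the whole plan runs now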

If you do not perform another action, then it is certain that adding .cache() anywhere will not provide any performance improvement.

However, even if you do perform more than one action, .cache() [or .checkpoint(), depending on your problem] sometimes does not provide any performance increase. It depends heavily on your problem and on the cost of your transformations - e.g., a join can be very costly.
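For contrast, here is a minimal sketch (sizes and names are only illustrative) of the situation where .cache() tends to pay off: a costly join whose result feeds more than one action.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.range(1_000_000).withColumnRenamed("id", "key")
right = spark.range(1_000_000).withColumnRenamed("id", "key")

joined = left.join(right, "key").cache()   # costly join, reused by two actions below

joined.count()                      # first action: executes the join and populates the cache
joined.where("key > 100").count()   # second action: reads the cached join instead of recomputing it

Without the .cache(), the second count would execute the join again from scratch.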

Also, if you are running Spark in its interactive shell, .checkpoint() can occasionally be better suited than .cache() after costly transformations.
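A minimal sketch of how .checkpoint() is set up (the checkpoint directory path is just an example); it materializes the data and truncates the lineage, so later actions do not replay the whole plan:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")   # example path, adjust to your storage

df = spark.range(1_000_000).where("id % 7 = 0")
df = df.checkpoint()   # materializes the data and truncates the lineage
df.count()             # later actions start from the checkpointed data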
