Spark cache/persist vs shuffle files


I am trying to understand the benefit of Spark's caching/persist mechanism.

AFAIK, Spark always persists the RDD data after a shuffle as an optimization. So what benefit does a cache/persist call provide? I am assuming the cache/persist happens in memory, so the only benefit is that it won't read from disk, correct?

CodePudding user response:

Two differences come to mind:

  1. During a shuffle, intermediate data (the data that needs to be moved across nodes) gets saved so that it does not have to be reshuffled. This shows up in the Spark UI as skipped stages. With cache/persist, by contrast, you are caching the processed data itself (see the sketch after this list).
  2. You are in control of what gets cached, but you have no explicit control over the caching of shuffled data (it is a behind-the-scenes optimization).
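
As a rough illustration of both points, here is a minimal spark-shell sketch (the data and the mapValues multiplier are made up; sc is the shell's predefined SparkContext):

```scala
// Hypothetical key/value data.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// reduceByKey triggers a shuffle; Spark writes shuffle files to local disk.
val reduced = pairs.reduceByKey(_ + _)
reduced.count()   // first action: the shuffle actually runs
reduced.collect() // second action: the map stage shows up as "skipped" in the
                  // UI because the shuffle files from the first run are reused

// cache() stores the processed records themselves, so work done *after*
// the shuffle (mapValues here) is not recomputed either.
val enriched = reduced.mapValues(_ * 10).cache()
enriched.count()   // materializes the cache
enriched.collect() // served from the cache; mapValues does not rerun
```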

CodePudding user response:

what benefit does cache/persist call provide?

One example that comes to mind at once is the reading process: if you read some data from a file system and run two separate sets of transformations (each ending in an action) on it, you will load the source data twice (check the Spark UI and you will see two loads). But if you cache it, the load only happens once.
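
A spark-shell sketch of both situations (the path /data/events.txt and the filter strings are hypothetical):

```scala
// Without caching: each action re-reads the source file.
val lines = sc.textFile("/data/events.txt")
val errorCount = lines.filter(_.contains("ERROR")).count() // load #1
val warnCount  = lines.filter(_.contains("WARN")).count()  // load #2

// With caching: the file is read once, and both actions reuse the cached data.
val cached = sc.textFile("/data/events.txt").cache()
val errorCount2 = cached.filter(_.contains("ERROR")).count() // reads the file, fills the cache
val warnCount2  = cached.filter(_.contains("WARN")).count()  // served from the cache
```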

cache/persist happens in memory so the only benefit is that it won't read from the disk

Nope, cache and persist can happen at different storage levels, and memory is only the default level; check out the RDD persistence section here: https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence
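
For instance, on an RDD, cache() is just shorthand for persist(StorageLevel.MEMORY_ONLY), while persist() lets you pick another level. A quick spark-shell sketch with made-up data:

```scala
import org.apache.spark.storage.StorageLevel

val data = sc.parallelize(1 to 1000000)

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
data.cache()
data.count()     // materializes the in-memory copy
data.unpersist() // a storage level can only be assigned once, so clear it first

// persist() accepts other levels, e.g. spill to disk when memory runs out:
data.persist(StorageLevel.MEMORY_AND_DISK)
// Other options include MEMORY_ONLY_SER (serialized, more compact in memory)
// and DISK_ONLY; see the guide linked above for the full list.
data.count()
```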
